Microsoft Secure

Measure Twice, Cut Once, With RMA Methodology

I’ve been beating our drum for a while now about the inevitability of failure in cloud-based systems. Simply put, the complexities and interdependencies of the cloud make it nearly impossible to avoid service failure, so instead we have to go against our instincts and actually design for this eventuality. Once you accept this basic premise,...

Designing for Failure: The Changing Face of Reliability

I’ve written about reliability and resilience before, but the topic is so important it’s worth revisiting, using a real-world example I think you’ll appreciate. Imagine the pressure the architects and engineers were under when they designed and built the Channel Tunnel connecting England to France via rail. The so-called “Chunnel” would...

Antifragility – the goal for high-performance IT organizations

In a recent post, I shared a short list of my favorite books and articles related to reliability. Each one has influenced my thinking about how to create a high-performing IT organization, even though not all of these publications are IT-centric in subject matter. In this post, I’m...

My “Desert Island Half-Dozen” – recommended reading for resilience

When I speak with customers, they often ask how they can successfully change the culture of their IT organization when deciding to implement a resilience engineering practice. Over the past decade I’ve collected a number of books and articles which I have found to be helpful in this regard, and I often recommend these resources...

Reliability Series #4: Reliability-enhancing techniques (Part 2)

In my previous post in this series, I discussed the Discovery and Authorization/Authentication categories of the “DIAL” acronym and shared mitigations targeting specific failure modes. In this article I’ll discuss the “Limits/Latency” and “Incorrectness” categories, and I’ll share example mitigations targeting specific failure modes for each.

Reliability Series #3: Reliability-enhancing techniques (Part 1)

In my previous post, I discussed “DIAL”, an approach we use to categorize common service component interaction failures when applying Resilience Modeling & Analysis (RMA) to an online service design. In the next two posts, I’ll discuss some mitigation strategies and design patterns intended to reduce the likelihood of the types of failures described by...

Reliability Series #2: Categorizing reliability threats to your service

Online services face ongoing reliability-related threats represented by device failures, latent flaws in software being triggered by environmental change, and mistakes made by human beings. At Microsoft, one of the ways we’re helping to improve the reliability of our services is by investing in resilience modeling and analysis (RMA) as a way for online service...

Reliability Series #1: Reliability vs. resilience

Whenever I speak to customers and partners about reliability, I’m reminded that while objectives and priorities differ between organizations and customers, at the end of the day, everyone wants their service to work. As a customer, you want to be able to do things online, at a time convenient to you. As an organization –...

Want more information on Trustworthy Computing? Check out our other blogs

The Trustworthy Computing blog covers Microsoft’s perspective on security, privacy, online safety, and reliability, especially as they relate to the cloud. For readers who want additional information on those topics, check out our other TwC Blogs, which provide insights from Microsoft experts, plus information on mitigation tools, secure development, security updates, online safety, and more. ...

Suggested Resolutions for Cloud Providers in 2014 #1: Reinforce that security is a shared responsibility

Happy 2014! The arrival of a new year is always a great time to reflect on where you’ve been over the past 12 months, and more importantly, where you are headed. I was recently asked to share some New Year’s Resolutions for cloud providers for an article in Security Week and I thought I’d expand...
