Reliability Series #1: Reliability vs. resilience

By David Bills

March 24, 2014

Whenever I speak to customers and partners about reliability, I’m reminded that while objectives and priorities differ between organizations and customers, at the end of the day, everyone wants their service to work. As a customer, you want to be able to do things online at a time convenient to you. As an organization – or a provider of a service – you want your customers to be able to carry out their tasks whenever they choose.

This article is the first in a four-part series on building a resilient service. In my first two posts, I will discuss the topic as it relates to business strategy, and then we’ll dive deeper into the technical details. The full series of four posts will cover:

1.   Reliability vs. resilience – What is the difference between reliability and resilience and why does it matter?
2.   Common reliability-related threats – DIAL (Discovery, Incorrectness, Authorization/Authentication, Limits/Latency) is a handy mnemonic to help teams brainstorm potential failures of interactions between components for their service in a structured way. Brainstorming about failure modes and failure points is a key phase in resilience modeling and analysis (RMA) and can help teams improve the reliability of their service.
3.   Reliability-enhancing techniques – Taking the “D” and “A” in DIAL, we’ll look at some reliability-enhancing techniques you can incorporate into your design related to discovery and authentication.
4.   Reliability-enhancing techniques – Taking the “I” and “L” in DIAL, we’ll look at some reliability-enhancing techniques you can incorporate into your design related to incorrectness and limits.

My intention is to provide insight into how Microsoft thinks about reliability and the processes and techniques we’re employing to improve the reliability of our services for our customers.

So what is reliability? When I ask customers and partners, the most common responses refer to consistency in performance, speed, availability and, perhaps most significantly, resilience. One thing we all agree on is that for a system or service to be reliable, the user has to believe ‘it just works’.

The Institute of Electrical and Electronics Engineers (IEEE) Reliability Society states reliability [engineering] is “a design engineering discipline which applies scientific knowledge to assure that a system will perform its intended function for the required duration within a given environment, including the ability to test and support the system through its total lifecycle.” For software, it defines reliability as “the probability of failure-free software operation for a specified period of time in a specified environment.”
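To make the software definition a little more concrete, here is a minimal sketch of "the probability of failure-free operation for a specified period of time." It assumes a constant failure rate (an exponential model), which is a common simplification and not something the IEEE definition prescribes. The function name and the numbers are illustrative only.

```python
import math


def failure_free_probability(failures_per_hour: float, hours: float) -> float:
    """Probability of no failure over `hours`, ASSUMING a constant failure rate."""
    if failures_per_hour < 0 or hours < 0:
        raise ValueError("rate and duration must be non-negative")
    # Exponential model: R(t) = exp(-rate * t)
    return math.exp(-failures_per_hour * hours)


# Illustrative numbers: on average one failure per 1,000 hours, over a 24-hour period.
print(round(failure_free_probability(0.001, 24), 4))  # prints 0.9763
```

Note how both the "specified period" and the assumed failure rate are inputs: the same software has a different reliability figure over a different period or in a different environment.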

A reliable cloud service is essentially one that functions as the designer intended it to, when it is expected to, and wherever the customer is connected. That’s not to say every component must operate flawlessly 100 percent of the time. This last point brings us to what I believe is the difference between reliability and resiliency.

Reliability is the outcome cloud service providers strive for – it’s the result. Resiliency is the ability of a cloud-based service to withstand certain types of failure and yet remain functional from the customer perspective. In other words, reliability is the outcome and resilience is the way you achieve the outcome. A service could be characterized as reliable simply because no part of the service has ever failed, and yet the service couldn’t be regarded as resilient, because its ability to withstand failure may never have been tested.
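The distinction can be sketched in a few lines of code. This is a simulation under stated assumptions, not a description of how any Microsoft service is built: the component, the fallback to a cached result, and every name below are hypothetical. It deliberately injects failures into a component and then measures what the customer would see with and without a fallback.

```python
import random


class ComponentFailure(Exception):
    """Raised when the (hypothetical) primary component fails."""


def flaky_primary(fail_rate: float, rng: random.Random) -> str:
    # Simulated component: fails for roughly `fail_rate` of all calls.
    if rng.random() < fail_rate:
        raise ComponentFailure("primary unavailable")
    return "fresh result"


def resilient_call(fail_rate: float, rng: random.Random) -> str:
    # Resilience: withstand the component failure and stay functional
    # from the customer's perspective by serving a degraded answer.
    try:
        return flaky_primary(fail_rate, rng)
    except ComponentFailure:
        return "cached result"


def observed_success_rate(call, fail_rate: float, requests: int = 10_000, seed: int = 0) -> float:
    # Reliability as an outcome: the fraction of requests that worked.
    rng = random.Random(seed)
    succeeded = 0
    for _ in range(requests):
        try:
            call(fail_rate, rng)
            succeeded += 1
        except ComponentFailure:
            pass
    return succeeded / requests


# Inject a 20 percent component failure rate and compare the outcomes.
print(observed_success_rate(flaky_primary, 0.2))   # roughly 0.8
print(observed_success_rate(resilient_call, 0.2))  # 1.0
```

With the failure rate set to zero, both versions look equally reliable. Only injecting failures shows which one is resilient, which is the point about untested capabilities.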

The key takeaway here is the importance of focusing on resilience and designing and building resiliency into your service at every stage of the software development lifecycle. To find out more about the fundamentals of building a reliable online service, read our whitepaper ‘An introduction to designing reliable cloud services’.

Next up: Reliability Series #2: Categorizing reliability threats to your service
