Themes in reliable computing

By Cameron Laird, Technical Writer at Microsoft

01/04/2022

Tags
TechNet UK

Programming applications that truly must work – in medicine, transportation, factory automation, and so on – is different from the software development that most of us do on a daily basis. In general, no one dies or crashes or explodes when our pixels don’t line up or the supermarket’s quarterly sales total is centred rather than right-justified.

Trust in our computations is increasingly a requirement, though, and even those who develop for the relatively forgiving commercial sectors have lessons to learn from those who think about guarantees on a daily basis.

What’s reliable about computing?

Start with a few vocabulary clarifications. “Trust” in computing usually refers to such security considerations as authentication – are the users who they say they are? – and authorisation – are those same users actually permitted to do what they’re asking? “Reliable computing” generally is understood in terms of guarantees: these inputs will yield desired results every time, without exception.

Programmers usually classify a denial-of-service exploit, for instance, as a security problem, or possibly as a merely operational detail outside programmers’ responsibility. Consider this example, though: suppose an account holder should see, in a particular circumstance, “Your current balance is £154.38.” Then something goes wrong, perhaps because an attack on another part of the system has congested the network, or for any of the other usual causes, and the best the application can offer is, “… try again later.” That’s a good message – certainly better than the esoteric tracebacks some retail applications still show on occasion. The application is correct. It meets its specifications.

Notice, though, that for consumers the application has become unreliable. They are left hoping for a different answer, with no guarantee of when it will arrive. The application is correct, yet not reliable.
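The distinction can be made concrete with a small sketch. Everything named here is hypothetical and invented for illustration: the function `balance_message`, the `availability` measure, and the two stand-in back ends. The assumed specification is simply "show the balance, or show the retry message".

```python
from typing import Callable, Sequence

# Retry wording copied from the example above, elision included.
RETRY_MESSAGE = "… try again later."


def balance_message(fetch_balance: Callable[[], float]) -> str:
    """Satisfy the assumed spec: show the balance, or a polite retry message."""
    try:
        return f"Your current balance is £{fetch_balance():.2f}."
    except TimeoutError:
        return RETRY_MESSAGE


def availability(responses: Sequence[str]) -> float:
    """Fraction of responses that actually answered the customer's question."""
    answered = sum(1 for response in responses if response != RETRY_MESSAGE)
    return answered / len(responses)


def healthy() -> float:
    # Stand-in for a back end that responds normally.
    return 154.38


def congested() -> float:
    # Stand-in for a back end starved by network congestion.
    raise TimeoutError("backend did not answer in time")


responses = [balance_message(f) for f in (healthy, congested, congested, healthy)]
print(responses[0])
print(availability(responses))
```

Every response in this run conforms to the assumed specification, so a correctness test passes on all four. Only half of them tell the customer the balance, and that half is what the customer experiences as reliability.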

At an IT level, “reliability” has to do with blowing dust off motherboards, installing antivirus software, defragmenting hard drives, and minimising plugins.

In some circles, “reliable computing” aims at the rather narrow branch of mathematics that analyses arithmetic results and how precision propagates through them. Most of us, most of the time, have the luxury of indifference about whether “1/5” displays as “0.20000000”, “0.20000001”, or even something else. For certain specialists, those differences are exactly the domain of “reliable computing”.
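For readers who have not met this branch of the subject, here is a minimal sketch of what those specialists worry about. It assumes Python’s `float` is an IEEE 754 double, which holds on mainstream platforms; the helper names `stored_value` and `representation_error` are invented for this illustration.

```python
import math
from decimal import Decimal
from fractions import Fraction


def stored_value(x: float) -> Decimal:
    """Exact decimal expansion of the binary value actually held in x."""
    return Decimal(x)


def representation_error(numerator: int, denominator: int) -> Fraction:
    """Exact gap between the float nearest n/d and the true rational n/d."""
    return Fraction(numerator / denominator) - Fraction(numerator, denominator)


# 1/5 has no finite binary expansion, so the stored value is only close to 0.2.
print(stored_value(1 / 5))

# The gap is tiny but not zero; 1/4, by contrast, is stored exactly.
print(representation_error(1, 5) != 0)
print(representation_error(1, 4) == 0)

# Small errors propagate: ten additions of 0.1 do not land exactly on 1.0 ...
naive_total = sum([0.1] * 10)
# ... while math.fsum tracks the lost low-order bits and rounds only once.
compensated_total = math.fsum([0.1] * 10)
print(naive_total, compensated_total)
```

Whether a discrepancy in the sixteenth digit matters depends entirely on the domain, and deciding that in advance is the point of this kind of analysis.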

Some dialects of computing culture even swap the meanings of “trust” and “reliability”. In any team effort, the most important step is to agree lucid definitions of what you’re aiming at. For today, we have a general interest in the software dimensions of computing results and the factors that support or interfere with getting the right answers every time.

Tackling the hazards

The sheer number of categories of interference surprises newcomers to the field. They include at least:

Security considerations: is this computer allowed to make this computation in this circumstance?
Platform health: are power, networking, security certificates, mass storage, and third-party libraries as available as designed?
Memory exhaustion: does the system have enough memory for stack and heap to work as intended? Who scrambles when it does not?
Deployment: how graceful is it to update the system? Is there a defined process for deployment, or is it left to the ingenuity of the operator? What criteria are in place for agreement that the system is back to “normal” after a restart?
Logging: suppose a system is performing perfectly, but the logger to which it reports periodically itself goes off-line. What should happen then?
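
One possible answer to the logging question, among several, is to keep records locally while the logger is unreachable and deliver them in order once it returns. The sketch below illustrates that single policy using Python’s standard `logging` module. The names `BacklogHandler` and `FlakyCollector` are hypothetical, as is the choice to cap the backlog and silently discard the oldest entries when it fills.

```python
import logging
from collections import deque
from typing import Callable


class BacklogHandler(logging.Handler):
    """Forward records to `send`; hold them locally while `send` is failing."""

    def __init__(self, send: Callable[[str], None], capacity: int = 1000) -> None:
        super().__init__()
        self.send = send
        # Bounded queue: once full, the oldest entries are dropped (a policy choice).
        self.backlog = deque(maxlen=capacity)

    def emit(self, record: logging.LogRecord) -> None:
        self.backlog.append(self.format(record))
        try:
            while self.backlog:
                self.send(self.backlog[0])  # raises OSError if unreachable
                self.backlog.popleft()      # remove only after a successful send
        except OSError:
            pass  # keep the backlog; the application itself carries on


class FlakyCollector:
    """Hypothetical stand-in for a remote log collector that can go off-line."""

    def __init__(self) -> None:
        self.online = False
        self.delivered = []

    def send(self, line: str) -> None:
        if not self.online:
            raise ConnectionError("log collector unreachable")
        self.delivered.append(line)


collector = FlakyCollector()
handler = BacklogHandler(collector.send)
log = logging.getLogger("reliability-demo")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info("balance computed")   # collector off-line: the line waits in the backlog
collector.online = True
log.info("balance displayed")  # collector back: both lines go out, in order
print(collector.delivered)
```

Other policies are just as defensible: halting the system, or raising an operator alarm, may be the right choice where an unlogged action is unacceptable. What matters is that the behaviour is decided and specified rather than discovered in production.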

The biggest countermeasures we’ve invented have to do with our tools. Rather than “artisanal C” on a repurposed operating system (OS), we’re more likely nowadays to program in a language which builds in memory management, to execute on a real-time OS, and to have both in-process memory scanners and OS-level dashboards. As a result, memory misuses are far less common today.

At least, I believe so, and I can find multiple other practitioners who echo this. At the same time, software engineering is barely into its first century of practice; it has been filled with intense change throughout, and we’ve collectively had little opportunity to make even such basic measurements as the frequency through time of memory faults. We’re just at the beginning of the research that can lead to any meaningful conclusions.

Explore and adapt best practice

What does make for reliable computing, then? To a large extent, you’ll need to research your own particular domain to answer that question well. Study what has worked with medical devices or space probes, and look for ideas and practices you can test in your own situation rather than for “silver bullets” that work magic. In general:

High-level languages are more expressive.
Testing probably helps.
Doubt everyone’s estimates.
Organisational dynamics trump geography.
Inspection probably helps.
“Good team communication” might be indispensable.

Reliable computing is a worthy goal. Its achievement seems to depend on more than just purchase of the right products or adoption of a particular methodology. Until we know more, assume you’ll need to customise the themes of reliable computing to the specific programming you do.