In Part 1 we explored how a PAYG mindset applied to your capacity planning approach can bring real financial benefits. In Part 2 we continue the discussion into areas more traditionally governed by operations.
Building on the earlier comments around leveraging automation to achieve PAYG, another dimension we can look at is runtime. Previously we discussed capacity, where our goal was to ensure we had just enough available to support the workload. As we moved through the gears, we looked to bring that planning horizon down from perhaps yearly reviews to hourly or real-time. Auto-scale aside, this still assumes that the workload is always running; with runtime optimisation we challenge that assumption, reducing running hours and, with them, the need to purchase Reserved Instances wherever possible.
Good – Snoozing
The concept of snoozing is generally well understood – shut down systems that are not in use overnight and start them up again in the morning. The downtime will be transparent to users, but you are not paying for the compute whilst the systems are not being used.
Where organisations tend to go wrong is in restricting this capability to non-production only. Again, as we drive home the PAYG mindset, why can we not shut down colleague-facing systems when they are not being used? Virtual Desktop farms need not be running overnight, or at least not all of them, and store support systems for locations that close at 8pm don’t need to be on overnight.
When thinking about optimisations such as these, don’t just think about virtual machines. For example, you can, and should, apply snoozing to containers: use policies to scale container replicas down when not in use, and gain further savings by pausing your AKS cluster.
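The decision logic behind a snooze schedule can be sketched against a hypothetical `snooze_window` tag on each resource; the tag name and its "stop-start" time format are assumptions for illustration, not an Azure convention.

```python
from datetime import time

def should_be_stopped(snooze_window: str, now: time) -> bool:
    """Return True if `now` falls inside the snooze window.

    `snooze_window` is a hypothetical tag value of the form
    "HH:MM-HH:MM" (stop time to start time); windows may span midnight.
    """
    stop_s, wake_s = snooze_window.split("-")
    stop = time.fromisoformat(stop_s)   # when the system should go down
    wake = time.fromisoformat(wake_s)   # when it may come back up
    if stop <= wake:                    # window within a single day
        return stop <= now < wake
    return now >= stop or now < wake    # window spans midnight

# e.g. a store-support system for a location that closes at 8pm
print(should_be_stopped("21:00-07:00", time(23, 30)))  # → True
print(should_be_stopped("21:00-07:00", time(12, 0)))   # → False
```

An automation job evaluating this per resource on a schedule would then issue the stop/start calls.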
Better – Manual Start-up
You can use functionality such as the shutdown option on a virtual machine, or automation, to shut resources down automatically overnight. By requiring users to start systems when they need them (rather than auto-starting them) you gain further savings: the user may be on holiday, off sick, working elsewhere or have left the company, and for whatever reason this can represent a good number of compute hours saved over a year across your organisation. One customer, on implementing auto-down/manual-up, saw one team’s costs “fall off a cliff”, saving over $100k per year in one of their products.
This approach opens up another saving opportunity: if resources are only started when needed, we can identify resources that have not been started for the last 30 days, and these are good candidates to be destroyed. Under the auto-start approach we would have no way to identify unrequired resources.
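Identifying those 30-day candidates is a simple comparison of last-start timestamps; the sketch below uses a hypothetical in-memory history, standing in for whatever source (e.g. activity logs) you would actually query.

```python
from datetime import datetime, timedelta

def stale_candidates(last_started: dict, now: datetime, days: int = 30) -> list:
    """Resources whose last manual start is older than `days` are
    candidates for destruction. `last_started` maps resource name to
    the timestamp of its last start (hypothetical data source)."""
    cutoff = now - timedelta(days=days)
    return sorted(name for name, started in last_started.items()
                  if started < cutoff)

now = datetime(2024, 6, 1)
history = {
    "vm-dev-api":   datetime(2024, 5, 28),  # started recently, keep
    "vm-test-old":  datetime(2024, 3, 2),   # untouched for months
    "vm-uat-spare": datetime(2024, 4, 15),  # > 30 days, candidate
}
print(stale_candidates(history, now))  # → ['vm-test-old', 'vm-uat-spare']
```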
Best – Ephemeral and Spot Instances
Ephemeral in this case refers to resources that only exist for the time they are needed – for example, resources provisioned to support testing, which are destroyed as soon as the tests are complete. Ephemeral resources are not stopped and started; they are destroyed and recreated when required. With all the other options, whilst you can reduce the cost of compute, you are still paying for storage whether or not the system is in use (auto-scale aside); the ephemeral approach brings savings in storage too. To fully utilise this approach your teams need to be comfortable with Infrastructure as Code and pipelines.
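A rough back-of-envelope comparison shows why ephemeral beats snoozing on storage; all rates here are invented for illustration, not real Azure pricing.

```python
def monthly_cost(compute_rate, storage_rate, hours_running, storage_hours):
    """Compute plus storage cost; stopped VMs still accrue storage."""
    return compute_rate * hours_running + storage_rate * storage_hours

MONTH_HOURS = 730
rates = dict(compute_rate=0.20, storage_rate=0.01)  # illustrative £/hour

# Snoozed 12h/day: compute halves, but storage is billed around the clock.
snoozed = monthly_cost(**rates, hours_running=MONTH_HOURS / 2,
                       storage_hours=MONTH_HOURS)

# Ephemeral test rig: exists only for ~40 hours, storage included.
ephemeral = monthly_cost(**rates, hours_running=40, storage_hours=40)

print(round(snoozed, 2))    # → 80.3
print(round(ephemeral, 2))  # → 8.4
```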
Spot Instances allow you to achieve significant savings on VM run costs. However, they run the risk that Azure may reclaim the capacity with 30 seconds’ notice, so they do not lend themselves to critical workloads. That said, for batch processes or many non-production use cases they can be a good fit. Coupled with automation they are a key tool for achieving greater savings, and they are supported in scale sets and AKS.
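Azure surfaces an impending Spot eviction as a “Preempt” event on the Scheduled Events metadata endpoint, which automation can watch to checkpoint work before the VM disappears. A minimal sketch of detecting one, with the network call omitted and a sample payload (shaped on the documented response) in its place:

```python
def preempt_targets(events_payload: dict) -> list:
    """Return the resources named in any Preempt (Spot eviction) events."""
    return [resource
            for event in events_payload.get("Events", [])
            if event.get("EventType") == "Preempt"
            for resource in event.get("Resources", [])]

sample = {  # illustrative payload, not a live metadata response
    "Events": [
        {"EventId": "A123",
         "EventType": "Preempt",
         "Resources": ["spot-vm-batch-1"],
         "NotBefore": "Mon, 19 Sep 2022 18:29:47 GMT"},
    ]
}
print(preempt_targets(sample))  # → ['spot-vm-batch-1']
```

A batch worker would poll the endpoint, and on a match drain or checkpoint before the notice period expires.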
DR – Honest RTO / RPO
Traditional approaches to availability and recoverability are fairly simple and well understood:
- Redundancy – reduce the impact of a single failure by having redundant instances, be it striping data across hard disks to survive a disk failure or clustering/scaling out multiple machines to survive a node failure.
- Backup – ensure you have a valid, readily available and usefully current version of your operational data.
The points below look at redundancy options. Backups remain a key component of your DR: you need to ensure that your deployment scripts are version controlled and protected, both in terms of availability and with data backups – a disaster may not be the loss of an Azure region; it could be a corrupted database which needs recovery. Azure Backup is a native capability within Azure which also includes capabilities to secure your backups.
One of the most common DR timings we see is 4 hours to recover (RTO) and no more than 15 minutes of data loss (RPO). In practice these figures are generally driven by what the underlying infrastructure provides for the bulk of the organisation – it is rarely worth the effort to get too fine-grained around individual sets of disks – and they have then been adopted by the consuming services. But in a true DR scenario, could every service really be back up and running in 4 hours, considering the involvement required from the infrastructure teams and the cross-dependencies between services?
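The doubt about cross-dependencies can be made concrete: if a service cannot start until its dependencies are back, individual 4-hour RTOs do not compose. A sketch with hypothetical services and restore times (all figures invented):

```python
def chain_recovery_time(deps: dict, hours: dict, service: str) -> float:
    """Worst-case recovery time for `service`, assuming it cannot start
    until every dependency is restored (recovery runs sequentially down
    the longest dependency chain)."""
    upstream = max((chain_recovery_time(deps, hours, d)
                    for d in deps.get(service, [])), default=0.0)
    return upstream + hours[service]

# Each service quotes a comfortable RTO in isolation, but the order
# service depends on identity, which in turn depends on the database.
hours = {"database": 2.0, "identity": 1.5, "orders": 2.5}
deps = {"orders": ["database", "identity"], "identity": ["database"]}

print(chain_recovery_time(deps, hours, "orders"))  # → 6.0, blowing a 4h RTO
```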
The good news is that the investments you make in automation to improve your deployment, operational and governance processes can be built upon to improve your DR.
Review what you are doing in non-production. You will have “like live” environments where it is appropriate to have high levels of availability. For physical and virtual machines this is well understood – but what about containers? How many of your teams are running replica counts above 1 for non-production, and how much extra capacity is this driving?
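A quick way to surface that question is to count replicas above 1 across non-production container workloads; the workload data below is hypothetical.

```python
def excess_nonprod_replicas(workloads: list) -> int:
    """Count replicas above 1 in non-production workloads – capacity
    you pay for without a genuine resilience requirement."""
    return sum(w["replicas"] - 1
               for w in workloads
               if w["env"] != "prod" and w["replicas"] > 1)

workloads = [
    {"name": "api",  "env": "prod", "replicas": 3},  # fine: production
    {"name": "api",  "env": "dev",  "replicas": 3},  # 2 excess replicas
    {"name": "jobs", "env": "test", "replicas": 2},  # 1 excess replica
]
print(excess_nonprod_replicas(workloads))  # → 3
```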
Ensure all your workloads are tagged to reflect their criticality and review your Advisor Score and recommendations.
Ensure that your backup strategies are appropriate to the environment. For non-production, do we really need daily incremental backups? For critical production machines, determine whether Azure Site Recovery would be appropriate, and implement and test your recovery plans.
Update your processes to ensure the level of protection afforded to your workloads is appropriate to their current impact. For example, when deploying a new service, look at the expected ramp-up of usage and growth in criticality: do you need that warm standby system running today, or could you wait until the service hits a revenue impact of £1m per hour? Set clear metrics to help teams understand what level of protection they need. Too often we see teams investing in too much protection – with the associated opportunity cost of being unable to invest in new features – as well as mission-critical systems having inadequate protection for fear of spending more.
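One way to express those clear metrics is a simple mapping from revenue impact to DR posture; the thresholds and tier names below are purely illustrative and would be set with your application owners.

```python
def protection_tier(hourly_revenue_impact: float) -> str:
    """Map an outage's revenue impact (£/hour) to a DR posture.
    Thresholds are illustrative examples, not a standard."""
    if hourly_revenue_impact >= 1_000_000:
        return "active/active across regions"
    if hourly_revenue_impact >= 100_000:
        return "warm standby"
    if hourly_revenue_impact >= 10_000:
        return "inflate on demand from IaC plus replicated data"
    return "restore from backup"

print(protection_tier(1_500_000))  # → active/active across regions
print(protection_tier(5_000))      # → restore from backup
```

Publishing a table like this gives teams a defensible answer to “how much protection is enough?” instead of each guessing.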
Leverage automation to “inflate” your DR environment when you require it; this way you avoid paying for reservations or disk storage for your standby environment. When going down this route you will need to ensure you have any data (e.g. databases) replicated, and take into account how long the end-to-end process takes – will it meet your Recovery Time Objective?
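That end-to-end question deserves a sanity check: sum the pipeline steps and compare against the RTO. The steps and durations below are invented; in practice you would measure them from rehearsals.

```python
def meets_rto(step_durations_mins: dict, rto_mins: float) -> bool:
    """True if the end-to-end 'inflate' pipeline fits inside the
    Recovery Time Objective (steps assumed to run sequentially)."""
    return sum(step_durations_mins.values()) <= rto_mins

steps = {  # illustrative rehearsal timings, in minutes
    "provision infrastructure (IaC)": 45,
    "deploy application releases":    30,
    "attach replicated databases":    20,
    "smoke test and switch traffic":  25,
}
print(meets_rto(steps, rto_mins=240))  # → True (120 mins vs a 4-hour RTO)
```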
Unless you have recently tested your disaster recovery (DR) procedures and run actual production workload through them, it is difficult (for the author at least) to be confident in your position. Few workloads are a true island; instead they exist amongst an ecosystem of systems they consume and which consume them. It is highly likely that the ecosystem as a whole is going through an ever-increasing amount of change, so your DR test is potentially invalid the day after you switch back to your “primary” location.
Instead you should look to use automation to update your processes, so that you could perhaps leverage your DR plan as part of major releases. This would mean load would regularly switch between regions and your teams would be familiar and comfortable with switching the traffic.
Tag your workloads by criticality and ensure that you are neither over-investing nor under-investing in DR – it is frustratingly common to find a non-mission-critical application with over-engineered availability whilst a critical application has zero chance of hitting its recovery objectives.
Active/Active in this case refers to your workload always serving traffic from multiple regions. This is the gold standard in terms of continuously validating your ability to survive a regional outage.
It is critical, of course, that you remember our first point – ensure that the level of protection is appropriate for each system. It is very easy to get carried away and overspend, while under-investing can lead to audit failures in terms of risk management and, of course, headline-grabbing outages.
In this post we have explored how we can apply a Pay As You Go Mindset to challenge our “run” assumptions, such as production machines that must be available 24/7, or that disaster recovery is an all-or-nothing process.
Whilst we have focused on Virtual Machines for these posts, the concepts outlined can be applied to Platform as a Service (PaaS) capabilities too. For example you can run Active/Active, programmatically change service tiers to reflect current demand or evolving criticality, and of course use automation capabilities to build your DR on demand.
The core messages are, firstly, that it is all too easy to apply your hard-earned knowledge from a traditional data centre and make your cloud deployments look like just another DC. Many of the lessons you learned from running infrastructure are still valid – it is through combining that knowledge with the on-demand possibilities of the cloud that you can transform your organisation’s cost base and the speed at which it can operate. The second message is that to truly own and manage your costs you need to make the application owners accountable and responsible for their spend – it isn’t just the fun bits that get democratised!