Azure Cosmos DB: A Completely Unofficial Primer


Azure Cosmos DB is a service that is growing at a phenomenal rate as developers and data scientists across the world start taking advantage of its remarkable capabilities. As in my previous blogs, please accept the usual caveats: I have over-simplified somewhat, and you should read the official documentation for the full and totally accurate picture.

With that out of the way let’s cut to the chase: Cosmos DB is, as the name suggests, a database and it resides within the Azure cloud. But it’s a database with a difference. In this article I’m going to focus on just two significant benefits of Cosmos DB:

  • Your data is available globally with incredibly low latency. This is no small feat – I’ll explain why shortly
  • Cosmos DB is a “No SQL” database. What does that mean and why should you care? Let’s explore that one first

 

What is a “No SQL” database and why should I care?

Before we get into “No SQL” we should ask: What is “SQL”? It stands for “Structured Query Language” which is a fairly gross misnomer given that arguably it’s not particularly structured, is about more than just queries, and isn’t really a language. But let’s not get bogged down with poor nomenclature. SQL has been used for many years as the way to interact with relational data – I spent much of my youth as a developer building applications that used Microsoft SQL Server and it works very well. You use SQL both to define the schema – a sort of blueprint of the structure of the information you need to store – and to write the commands that access that information. The problem is that while SQL is useful in many situations, it isn’t always the best way to either store the data or access it.

Let me give you an example: If you’re storing documents you may find that each one might contain completely different data and you may need to change its structure on the fly. SQL is fairly poor at doing this – it likes a nice, ordered world where data structures don’t change very often and everything can conform to a template.

For this reason Cosmos DB provides a number of different ways to store and access data; it isn’t restricted to the “SQL way” of doing things. For sure, it has the concept of a key-value pair, which is similar in some ways to looking up a row by its key in SQL, but it goes well beyond this. It has document stores (which address the issue mentioned above) as well as others such as graph stores that focus on relationships (great for social-media type data, for example), column stores that store data in columns rather than rows, which massively increases read speeds for certain types of data, and so on.
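To make the document-store idea a bit more concrete, here is a minimal sketch in plain Python. It is not the Cosmos DB SDK: the `container` dictionary and the `upsert` and `find` helpers are made up for illustration. All it shows is that documents of different shapes can sit side by side, and that a document can gain a field without any schema change.

```python
import json

# A toy in-memory "container": document id -> document (a plain dict).
# This is NOT the Cosmos DB SDK, only an illustration of schema-free storage.
container = {}

def upsert(doc):
    """Store a document under its 'id', replacing any earlier version."""
    container[doc["id"]] = doc

def find(field, value):
    """Return documents whose `field` equals `value`; documents without it are skipped."""
    return [d for d in container.values() if d.get(field) == value]

# Two documents in the same container with completely different shapes.
upsert({"id": "1", "type": "invoice", "total": 120.50, "currency": "GBP"})
upsert({"id": "2", "type": "review", "stars": 4, "tags": ["fast", "global"]})

# Changing a document's structure on the fly: add a field and upsert again.
upsert(dict(container["1"], paid=True))

print(json.dumps(find("type", "invoice"), indent=2))
```

In a relational database the new `paid` field would normally mean altering a table; here it is just another key on one document.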

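The other amazing feature is that, by default, Cosmos DB indexes all of your data automatically, no matter which data model you’re using. Indexes mean the database can jump straight to the matching data instead of scanning everything, which equates to blisteringly fast data access, so this is a big deal.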
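If you’re wondering why indexing everything matters, the sketch below may help. It is a toy illustration in plain Python, not a description of how Cosmos DB builds its indexes internally; the `build_index` function and the sample documents are invented for this example. It records every field and value it sees, so a later lookup is a single dictionary access rather than a scan.

```python
from collections import defaultdict

def build_index(docs):
    """Map every (field, value) pair to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            if field == "id":
                continue
            # Index list items individually so a single tag can be looked up.
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[(field, v)].add(doc["id"])
    return index

docs = [
    {"id": "1", "type": "invoice", "currency": "GBP"},
    {"id": "2", "type": "review", "stars": 4, "tags": ["fast", "global"]},
    {"id": "3", "type": "review", "stars": 5, "tags": ["fast"]},
]
index = build_index(docs)

# A lookup is one dictionary access instead of a scan over every document.
print(sorted(index[("type", "review")]))  # ['2', '3']
print(sorted(index[("tags", "fast")]))    # ['2', '3']
```

Notice that nobody had to say in advance which fields to index. That is the part that mirrors the "automatic" in automatic indexing.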

So while it’s called “No SQL” it’s actually much more than SQL: its flexible data models, lack of a fixed schema, and automatic indexing let it store and serve data very efficiently, taking it well beyond what we think of as “SQL”.

 

Ok, so what about this global availability thing?

In case you hadn’t noticed, the world has changed a bit in the last few years. These days people (and I absolutely include myself in this) expect…well, they expect far too much. By which I mean, and I’ll use myself as an example, I expect to be able to use my phone anywhere in the world and get instant access to everything. I would be furious if I found that, simply because I’m on the other side of the planet, my apps are slow and it takes ages to get at data or update things. Such is the height of the expectation bar these days. And of course the problem is worse for businesses that need to process masses of data.

Now if you’re a developer or data architect this is a problem. The challenge, you see, is the speed of light. No really. The speed of light. As you may know if you remember your school physics, the speed of light is a cosmic speed limit – nothing can travel faster than it. This isn’t an engineering limitation that we haven’t figured out how to crack yet, it’s a part of the actual structure of the universe. And that includes the data sloshing around your network and the internet. Now, you might think that light travels pretty fast, which it does – about 186,000 miles per second. But even those monstrous speeds can cause a problem if your data resides on the other side of the planet, because every request has to make the round trip and a single action in an app often involves several requests. In short, there will be a very noticeable delay compared to the access times you would experience if your data were, say, just a few miles away.
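Here is the back-of-the-envelope arithmetic, as a small Python sketch. The distance (roughly 17,000 km, about London to Sydney) and the assumption that signals in optical fibre travel at about two-thirds of the vacuum speed of light are my own illustrative figures. Real networks add routing and processing time on top, so treat these as best-case lower bounds.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in a vacuum (about 186,000 miles per second)
FIBRE_FACTOR = 2 / 3               # assumption: light in glass fibre is roughly this much slower

def min_round_trip_ms(distance_km, factor=1.0):
    """Best-case round-trip time in milliseconds for a one-way distance in km."""
    return 2 * distance_km / (SPEED_OF_LIGHT_KM_S * factor) * 1000

far_km = 17_000  # assumed: roughly the other side of the planet
near_km = 10     # a data centre a few miles away

print(f"far, vacuum:  {min_round_trip_ms(far_km):.1f} ms")
print(f"far, fibre:   {min_round_trip_ms(far_km, FIBRE_FACTOR):.1f} ms")
print(f"near, fibre:  {min_round_trip_ms(near_km, FIBRE_FACTOR):.3f} ms")
```

Even in the physically perfect case the far round trip costs over a hundred milliseconds, while the nearby one is a small fraction of a millisecond. Multiply that by the number of requests an app makes and the delay becomes very visible.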

So what’s the answer? There’s only one way to address this (unless you’re able to change the structure of the universe): Have copies of your data dotted around the planet so that wherever you are you’re never far from a copy. And hey presto, you have fast access wherever you are.
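The "never far from a copy" idea can be sketched in a few lines of Python. The replica names and coordinates below are hypothetical (approximate city locations I picked for illustration, not official Azure region data), and picking the geographically closest copy is a simplification: real systems generally care about network latency rather than distance on a map.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# Hypothetical replica locations (approximate coordinates, for illustration only).
REPLICAS = {
    "europe": (51.5, -0.1),
    "us-east": (37.5, -77.4),
    "asia": (1.35, 103.8),
    "australia": (-33.9, 151.2),
}

def nearest_replica(user_location):
    """Name of the replica geographically closest to the user."""
    return min(REPLICAS, key=lambda name: haversine_km(user_location, REPLICAS[name]))

print(nearest_replica((35.7, 139.7)))  # a user in Tokyo
print(nearest_replica((48.9, 2.4)))    # a user in Paris
```

With copies in four places, the Tokyo user is served from the Asian copy and the Paris user from the European one, so neither pays for a trip around the planet.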

However, as is so often the case, the answer to one question causes new questions to arise. Or in this case, new problems. The first one is quite obvious – storing your data in locations around the planet is far from trivial and likely to be insanely expensive if you have to build it yourself. Microsoft Azure already has data centres in regions across the planet, which addresses this issue. Then there’s the second problem: how on earth do you keep everything in sync so that everyone sees and uses the same version of the data, even though there are multiple copies of it? Fortunately some extremely brainy data scientists have spent a lot of time thinking about this and many of the fruits of their labour are built into Cosmos DB. Without getting into details (although there’s lots of information here if you want to know more) Cosmos DB is able to balance the need for speed of access with the need for data consistency in a number of clever ways so that you don’t need to worry about it.
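To picture the speed-versus-consistency trade-off, here is a deliberately over-simplified Python sketch. The `ReplicatedItem` class and its methods are invented for this article and are not how Cosmos DB is implemented. It models only the two extremes: an "eventual" read that answers instantly from the local copy (and may be stale), and a "strong" read that waits until the local copy has caught up. Cosmos DB documents several consistency levels between those extremes.

```python
class ReplicatedItem:
    """One item copied to several regions; a write lands in one region first."""

    def __init__(self, regions):
        # Each region holds its own (value, version) copy of the item.
        self.copies = {region: (None, 0) for region in regions}
        self.latest = 0    # version number of the newest write
        self.pending = []  # (region, value, version) updates still in flight

    def write(self, region, value):
        """Apply the write locally and queue it for every other region."""
        self.latest += 1
        self.copies[region] = (value, self.latest)
        for other in self.copies:
            if other != region:
                self.pending.append((other, value, self.latest))

    def replicate(self):
        """Deliver all in-flight updates (in real life this takes time)."""
        for region, value, version in self.pending:
            if version > self.copies[region][1]:
                self.copies[region] = (value, version)
        self.pending.clear()

    def read_eventual(self, region):
        """Fast: answer from the local copy, which may be out of date."""
        return self.copies[region][0]

    def read_strong(self, region):
        """Slower: wait for outstanding updates before answering."""
        self.replicate()
        return self.copies[region][0]

item = ReplicatedItem(["london", "sydney"])
item.write("london", "v1")
item.replicate()
item.write("london", "v2")            # not yet delivered to Sydney

print(item.read_eventual("sydney"))   # the older value, served instantly
print(item.read_strong("sydney"))     # the newest value, after waiting
```

The design choice is exactly the one described above: the eventual read never waits on the other side of the planet, while the strong read trades some speed for the guarantee that everyone sees the same data.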

 

Summary

I’ve only really scratched the surface in this article. There is lots more to Cosmos DB – for example it would be remiss of me not to mention that Cosmos DB offers a money-back guarantee via comprehensive Service Level Agreements. Or the multiple APIs for accessing it. Or the incredible uptime and low-latency guarantees. Plus it’s in the cloud so has almost limitless capacity. And lots more.

All of which adds up to an incredibly powerful, flexible and global-scale way of storing and accessing data. Start learning more here. Or try it for free (for a limited time).

 

Resources