Azure Cosmos DB - Low Latency and High Availability at Planet Scale

Azure Cosmos DB is a fully managed, multi-tenant, distributed, shared-nothing, horizontally scalable database that provides planet-scale capabilities and multi-model APIs for Apache Cassandra, MongoDB, Gremlin, Tables, and Core (SQL). It currently powers many mission-critical services both within Microsoft (such as Microsoft Teams and Active Directory) and at large Fortune 500 organizations (such as Walmart and Adobe).

Database

Architecture

Cloud

This talk covers the internal architecture of Azure Cosmos DB and how it achieves high availability, low latency, and scalability. I will first cover the design of the storage engine, with particular emphasis on how partitioning and replication provide high availability and scalability. Next, I will zoom in on the request routing gateway to see how it has evolved to solve two well-known challenges of multi-tenant cloud infrastructure: containing noisy neighbors and limiting blast radius. Lastly, I will discuss performance as a feature and as a culture, including what we measure and how we think about SLOs to achieve and maintain low latency.

Building planet-scale services necessitates solving complex scalability challenges and making numerous tradeoffs across various components in the product. I look forward to sharing my experiences and lessons learned in building Azure Cosmos DB.

NDC { London }

Kevin Pilch