Friday
Room 2 - Level 3
13:40 - 14:40
(UTC±00)
Talk (60 min)
Azure Cosmos DB - Low Latency and High Availability at Planet Scale
Azure Cosmos DB is a fully managed, multi-tenant, distributed, shared-nothing, horizontally scalable database that provides planet-scale capabilities and multi-model APIs: Core (SQL), Apache Cassandra, MongoDB, Gremlin, and Tables. It currently powers many mission-critical services both within Microsoft (such as Microsoft Teams and Active Directory) and at Fortune 500 organizations (such as Walmart and Adobe).
This talk covers the internal architecture of Azure Cosmos DB and how it achieves high availability, low latency, and scalability. I will first cover the design of the storage engine, with particular emphasis on how partitioning and replication provide high availability and scalability. Next, I will zoom in on the request routing gateway to see how it has evolved to solve two well-known challenges of multi-tenant cloud infrastructure: containing noisy neighbors and limiting blast radius. Lastly, I will discuss performance as a feature and as a culture, including what we measure and how we think about SLOs to achieve and maintain low latency.
Building planet-scale services requires solving complex scalability challenges and making numerous tradeoffs across many components of the product. I look forward to sharing my experiences and lessons learned in building Azure Cosmos DB.