Friday 

Room 2 - Level 3 

13:40 - 14:40 

(UTC±00)

Talk (60 min)

Azure Cosmos DB - Low Latency and High Availability at Planet Scale

Azure Cosmos DB is a fully managed, multi-tenant, distributed, shared-nothing, horizontally scalable database that provides planet-scale capabilities and multi-model APIs: Apache Cassandra, MongoDB, Gremlin, Table, and Core (SQL). It currently powers many mission-critical services both within Microsoft (such as Microsoft Teams and Active Directory) and at large Fortune 500 organizations (such as Walmart and Adobe).

Database
Architecture
Cloud


This talk covers the internal architecture of Azure Cosmos DB and how it achieves high availability, low latency, and scalability. I will first cover the design of the storage engine, with particular emphasis on how partitioning and replication provide high availability and scalability. Next, I will zoom in on the request routing gateway to see how it has evolved to solve two well-known challenges of multi-tenant cloud infrastructure: containing noisy neighbors and limiting blast radius. Lastly, I will discuss performance as a feature and as a culture, covering what we measure and how we think about SLOs to achieve and maintain low latency.

Building planet-scale services requires solving complex scalability challenges and making many tradeoffs across the product's components. I look forward to sharing my experiences and lessons learned in building Azure Cosmos DB.

Kevin Pilch

Kevin has worked at Microsoft since 2002. During that time he has worked on C#/VB/F#, Roslyn, MSBuild, ASP.NET Core, Entity Framework, WinForms, Orleans, and SignalR. He currently manages the Developer Experience team for Azure Cosmos DB. Outside of work he enjoys training for marathons and playing hockey. His weaknesses include beer and chocolate chip cookies.