Distributed Metrics Aggregation System

Systems Design

High Level requirements

Clients send:
- counters
- gauges
- timers / histograms
Ingest up to 20M events/sec
Queries:
- aggregates over last N minutes
- percentiles (p50 / p95 / p99)
Horizontal scale
Eventual consistency acceptable
Retention: 7 days
Cardinality: millions of metric series

Restating: Clients will aggregate defined metrics locally, and we need to keep track of these client defined metrics. These metrics can be counts, gauges (what are gauges?), or time based (not a one size fits all data structure). Ingesting 20M events / second so heavy write throughput on our system. Clients will query specific aggregations and other statistics like percentiles. We should handle counts, percentiles, and time based visualizations (antything else specifically?). Eventual consistency is acceptable, no hard requirements on timing and consistency from our ingest and aggregate to our query layer. Millions of metric series means we need to somehow serve these millions of metrics in an efficient manner

What's the QPS of each user, and how many typical users?
What are the metric types we will handle? i.e. what aggregations, and should these aggregations be specified in the design, or are we focusing only on ingest layer and query layers for now?
How can we identify clients? Are there many users per a firm like client, are clients singular entities that only focus on their metrics, etc?
Do all retained metrics need to have the same efficiency and query time as new metrics, or are there layers to our query layer SLA's?
What tolerance do we have for fault tolerance, partitioning, and consistency? Which are we most concerned on high availability, consistency, accuracy under node failure?

API Surface

Clarifying Questions

Questions around implementations, strict vs approximate, and anything that would kill any choices made in future

System Constraints

Summarize system constraints in your own words

QPS
latency SLO
consistency requirements
failure tolerance

Identify Core Challenges

What makes this problem hard in distributed manner (MOST IMPORTANT PART)

Distributed correctness (no double allow)
Routing / management
Data structure
Scale + latency
Failure behavior
single-writer vs linearizable store
availability vs consistency
data plane vs control plane

Starter Architecture

Distributed compatible, but high level components and algorithms, focus on partitioning and scaling, not frontend or anything. Some data structure information is OK, specifically ones that relate to the core problem.

Components + Flow + network, partition, and some data structures

Real Architecture

Now you make in depth choices on literally everything, and draw the entire thing out

Component Deep Dive

Usually one or two, picked by interviewer or you know to dive into them

Potentially pseucode or pseudo architecture

Systems Design​

High Level requirements​

API Surface​

Clarifying Questions​

System Constraints​

Identify Core Challenges​

Starter Architecture​

Real Architecture​

Component Deep Dive​