Service Dependency Health Checker
Initial Problem Statement
You are given a collection of microservices and their dependencies
When a service depends on another, it cannot operate unless all of its dependencies are currently healthy
API
Must build a class for following API:
class ServiceHealth()
def add_service(service: str) -> None
def add_dependency(service: str, depends_on: str) -> None
def set_unhealthy(service: str) -> None
def set_healthy(service: str) -> None
def is_operational(service: str) -> bool
add_service(service: str) -> Noneregisters a new service with no dependencies initiallyadd_dependency(service: str, depends_on: str) -> Nonedeclares that service requiresdepends_on- Dependencies may form arbitrary directed graphs
- Cycles are allowed in the input, but cycles mean no service in that cycle can ever be healthy
set_unhealthy(service: str) -> Nonesets a serviceshealthflag- A service is considered healthy only if:
- It's own
health_flg = 1 - All services it depends on are operational
- It's own
- A service is considered healthy only if:
set_healthy(service: str) -> Nonesets a serviceshealthflag- A service is considered healthy only if:
- It's own
health_flg = 1 - All services it depends on are operational
- It's own
- A service is considered healthy only if:
is_operational(service: str) -> boolreturns whether the service is operational or not
Example
A depends on B
B depends on C
set_healthy(A), set_healthy(B), set_healthy(C)
is_operational(A) → True
set_unhealthy(C)
is_operational(A) → False
is_operational(B) → False
Cycle: X → Y → Z → X
Even if all are set healthy:
is_operational(X) → False
is_operational(Y) → False
is_operational(Z) → False
Constraints
- Up to ~10,000 services
- Dependency graph may contain cycles
is_operational(service)must run efficiently- Health changes (
set_healthy/set_unhealthy) will be frequent - No external libraries beyond standard Python
Initial Thoughts, Questions
Data Structures
- Each
serviceis a data structure with certain attributes itself- No need to go over the top here, just need a dictionary of
service_dict: service --> Dict[str, X]with attributes around health and other items - Any call to
add_servicesets this up
- No need to go over the top here, just need a dictionary of
- Service list can be stored as edges at first
- Any single call to
add_dependencywould re-calculate ALL of the health attributes around health- In a future update, caching and only traversing the affected percentage could be done with color masking and topological sort at the same time
- Should I implement this in initial try, or keep it simple?
- There should be a direct service health flag
in_cyclestored inservice_dictthat is updates to1if there's a cycle, and then this can be directly referenced in call tois_operational- if it's 1 then it's in a cycle and can just return- Same thing store
is_unhealthyflag that's also checked at this API call set_healthyandset_unhealthywould update theservice_dict[service][is_unhealthy]flag
- Same thing store
- Any single call to
Caching and traversal of "only affected parts" is difficult, if there are 2 disconnected components that are then joined by an add_dependency call, we'd potentially miss out on them unless we stored some sort of component ID. This is usually done via UnionFind data structure, but I've only implemented that on undirected graphs. There's a way to do it I can try and walk through, but before implementing going to check in
Initial Implementation
from typing import Dict
from collections import defaultdict
class ServiceHealth():
def __init__(self):
self.service_dict: Dict[str, Dict] = {}
# The services that each key needs to function
self.service_to_depends_on = defaultdict(list)
# Each service that's needed by key to function
self.service_to_depended_by = defaultdict(list)
def add_service(self, service: str) -> None:
if service in self.service_dict.keys():
raise ValueError(f"{service} already in dictionary")
self.service_dict[service] = {
"is_operational": False,
"is_healthy": True,
}
return
def add_dependency(self, service: str, depends_on: str) -> None:
self.service_to_depends_on[service].append(depends_on)
self.service_to_depended_by[depends_on].append(service)
return None
def set_unhealthy(self, service: str) -> None:
# Sets the flg per service, doesn't compute the flag for all other dependent services
if service not in self.service_dict.keys():
raise ValueError(f"{service} not currently in dictionary")
self.service_dict[service]["is_healthy"] = False
return None
def set_healthy(self, service: str) -> None:
# Sets the flg per service, doesn't compute the flag for all other dependent services
if service not in self.service_dict.keys():
raise ValueError(f"{service} not currently in dictionary")
self.service_dict[service]["is_healthy"] = True
return None
def is_operational(self, service: str) -> bool:
if service not in self.service_dict.keys():
raise ValueError(f"{service} not currently in dictionary")
if self.service_dict[service]["is_healthy"] == False:
return(False)
has_cycle = False
state = defaultdict(int)
def dfs(service: str):
nonlocal has_cycle
nonlocal state
count = 0
if has_cycle:
return(0)
# If this doesn't exist, raise a ValueError
if service not in self.service_dict.keys(): # or self.service_dict[service]["is_healthy"] == False:
raise ValueError(f"{service} not currently in dictionary")
# We will set cycle here
if state[service] == 1:
has_cycle = True
return(0)
# if this service has already been seen, we should still cover it
# elif state[service] == 2:
# return(0)
else:
state[service] = 1
# service_to_depends_on = {
# 'a': [b, c],
# 'b': [d, f],
# 'c': [d],
# 'd': [e]
# 'e': []
# 'f': []
# }
# D -> A
# Because we're using `service_to_depends_on`, we're only touching the relevant
# subgraph, and has_cycle wouldn't be set on a cycle in a separate component
for upstream_service in self.service_to_depends_on[service]:
count += dfs(upstream_service)
state[service] = 2
# if all our dependent services are covered, then we're also covered
return(
1 if count == len(self.service_to_depends_on[service]) else 0
)
return(
dfs(service) and not has_cycle
)
Follow Up Implementations
- In the scenario of an already seen service
state[service] == 2, is there a way to memoize / cache this information for future?- Also, are there any other areas of storing memoized results / cached results?
Tweak dfs()
has_cycle = False
state = defaultdict(int)
cache = defaultdict(int)
def dfs(service: str):
nonlocal has_cycle
nonlocal state
nonlocal cache
count = 0
if has_cycle:
return(0)
# If this doesn't exist, raise a ValueError
if service not in self.service_dict.keys(): # or self.service_dict[service]["is_healthy"] == False:
raise ValueError(f"{service} not currently in dictionary")
# We will set cycle here
if state[service] == 1:
has_cycle = True
return(0)
# if this service has already been seen, we should still cover it
elif state[service] == 2:
return(cache[service])
else:
state[service] = 1
# service_to_depends_on = {
# 'a': [b, c],
# 'b': [d, f],
# 'c': [d],
# 'd': [e]
# 'e': []
# 'f': []
# }
# D -> A
# Because we're using `service_to_depends_on`, we're only touching the relevant
# subgraph, and has_cycle wouldn't be set on a cycle in a separate component
for upstream_service in self.service_to_depends_on[service]:
count += dfs(upstream_service)
state[service] = 2
cache[service] = (1 if count == len(self.service_to_depends_on[service]) else 0)
# if all our dependent services are covered, then we're also covered
return(
1 if count == len(self.service_to_depends_on[service]) else 0
)
Further good follow ups:
- Talk about cache invalidation in
set_healthyandset_unhealthy - Talk about edge cases that don't fail now, but would later
- i.e returning
len(children) == countmay not always be ideal
- i.e returning
- Mention system level insights
- Health mutations are more frequent than dependency mutations, so caching operational results and invalidating on health checks is effective for system load
- Multi-query batching
- SCC caching (this one is a bit out there), talking about cycles with strongly connected components and utilizing Tarjan algorithm to identify strongly connected components and their supernodes, which can help on caching
- similar to UnionFind but works on directed graphs
What To Focus On
- Invariant returns using booleans instead of len count semantics
- Showing how data structures (black/white/gray) correspond to spec details + requirements
- Over optimize in initial pass. Focus on solving exactly as spec asks, and you can mention for future but leave out of initial implementation
- UnionFind + premature caching implementation discussion