Service Dependency Health Checker

Initial Problem Statement

You are given a collection of microservices and their dependencies

When a service depends on another, it cannot operate unless all of its dependencies are currently healthy

API

Must build a class for following API:

class ServiceHealth()

    def add_service(service: str) -> None
    def add_dependency(service: str, depends_on: str) -> None
    def set_unhealthy(service: str) -> None
    def set_healthy(service: str) -> None
    def is_operational(service: str) -> bool

add_service(service: str) -> None registers a new service with no dependencies initially
add_dependency(service: str, depends_on: str) -> None declares that service requires depends_on
- Dependencies may form arbitrary directed graphs
- Cycles are allowed in the input, but cycles mean no service in that cycle can ever be healthy
set_unhealthy(service: str) -> None sets a services health flag
- A service is considered healthy only if:
  - It's own health_flg = 1
  - All services it depends on are operational
set_healthy(service: str) -> None sets a services health flag
- A service is considered healthy only if:
  - It's own health_flg = 1
  - All services it depends on are operational
is_operational(service: str) -> bool returns whether the service is operational or not

Example

A depends on B
B depends on C

set_healthy(A), set_healthy(B), set_healthy(C)
is_operational(A) → True

set_unhealthy(C)
is_operational(A) → False
is_operational(B) → False

Cycle: X → Y → Z → X
Even if all are set healthy:
is_operational(X) → False
is_operational(Y) → False
is_operational(Z) → False

Constraints

Up to ~10,000 services
Dependency graph may contain cycles
is_operational(service) must run efficiently
Health changes (set_healthy/set_unhealthy) will be frequent
No external libraries beyond standard Python

Initial Thoughts, Questions

Data Structures

Each service is a data structure with certain attributes itself
- No need to go over the top here, just need a dictionary of service_dict: service --> Dict[str, X] with attributes around health and other items
- Any call to add_service sets this up
Service list can be stored as edges at first
- Any single call to add_dependency would re-calculate ALL of the health attributes around health
  - In a future update, caching and only traversing the affected percentage could be done with color masking and topological sort at the same time
  - Should I implement this in initial try, or keep it simple?
- There should be a direct service health flag in_cycle stored in service_dict that is updates to 1 if there's a cycle, and then this can be directly referenced in call to is_operational - if it's 1 then it's in a cycle and can just return
  - Same thing store is_unhealthy flag that's also checked at this API call
  - set_healthy and set_unhealthy would update the service_dict[service][is_unhealthy] flag

Caching and traversal of "only affected parts" is difficult, if there are 2 disconnected components that are then joined by an add_dependency call, we'd potentially miss out on them unless we stored some sort of component ID. This is usually done via UnionFind data structure, but I've only implemented that on undirected graphs. There's a way to do it I can try and walk through, but before implementing going to check in

Initial Implementation

from typing import Dict
from collections import defaultdict

class ServiceHealth():
    def __init__(self):
        self.service_dict: Dict[str, Dict] = {}
        # The services that each key needs to function
        self.service_to_depends_on = defaultdict(list)
        
        # Each service that's needed by key to function
        self.service_to_depended_by = defaultdict(list)
    
    def add_service(self, service: str) -> None:
        if service in self.service_dict.keys():
            raise ValueError(f"{service} already in dictionary")
                
        self.service_dict[service] = {
            "is_operational": False,
            "is_healthy": True,
        }
        return
    
    def add_dependency(self, service: str, depends_on: str) -> None:
        self.service_to_depends_on[service].append(depends_on)
        self.service_to_depended_by[depends_on].append(service)
        return None
    
    def set_unhealthy(self, service: str) -> None:
        # Sets the flg per service, doesn't compute the flag for all other dependent services

        if service not in self.service_dict.keys():
            raise ValueError(f"{service} not currently in dictionary")
        
        self.service_dict[service]["is_healthy"] = False
        return None

    def set_healthy(self, service: str) -> None:
        # Sets the flg per service, doesn't compute the flag for all other dependent services

        if service not in self.service_dict.keys():
            raise ValueError(f"{service} not currently in dictionary")
        
        self.service_dict[service]["is_healthy"] = True
        return None

    def is_operational(self, service: str) -> bool:
        if service not in self.service_dict.keys():
            raise ValueError(f"{service} not currently in dictionary")
        if self.service_dict[service]["is_healthy"] == False:
            return(False)
        
        has_cycle = False
        state = defaultdict(int)
        
        
        def dfs(service: str):
            nonlocal has_cycle
            nonlocal state
            count = 0
            
            if has_cycle:
                return(0)
            
            # If this doesn't exist, raise a ValueError
            if service not in self.service_dict.keys(): # or self.service_dict[service]["is_healthy"] == False:
                raise ValueError(f"{service} not currently in dictionary")
            
            # We will set cycle here
            if state[service] == 1:
                has_cycle = True
                return(0)
            # if this service has already been seen, we should still cover it
            # elif state[service] == 2:
            #     return(0)
            else:
                state[service] = 1
            
            
            # service_to_depends_on = {
            #     'a': [b, c],
            #     'b': [d, f],
            #     'c': [d],
            #     'd': [e]
            #     'e': []
            #     'f': []
            # }

            # D -> A
            
            # Because we're using `service_to_depends_on`, we're only touching the relevant
            #   subgraph, and has_cycle wouldn't be set on a cycle in a separate component
            for upstream_service in self.service_to_depends_on[service]:
                count += dfs(upstream_service)
            
            state[service] = 2
            # if all our dependent services are covered, then we're also covered
            return(
                1 if count == len(self.service_to_depends_on[service]) else 0
            )
            
            
        return(
            dfs(service) and not has_cycle
        )

Follow Up Implementations

In the scenario of an already seen service state[service] == 2, is there a way to memoize / cache this information for future?
- Also, are there any other areas of storing memoized results / cached results?

Tweak dfs()

        has_cycle = False
        state = defaultdict(int)
        cache = defaultdict(int)
        
        
        def dfs(service: str):
            nonlocal has_cycle
            nonlocal state
            nonlocal cache

            count = 0
            
            if has_cycle:
                return(0)
            
            # If this doesn't exist, raise a ValueError
            if service not in self.service_dict.keys(): # or self.service_dict[service]["is_healthy"] == False:
                raise ValueError(f"{service} not currently in dictionary")
            
            # We will set cycle here
            if state[service] == 1:
                has_cycle = True
                return(0)
            # if this service has already been seen, we should still cover it
            elif state[service] == 2:
                return(cache[service])
            else:
                state[service] = 1
            
            
            # service_to_depends_on = {
            #     'a': [b, c],
            #     'b': [d, f],
            #     'c': [d],
            #     'd': [e]
            #     'e': []
            #     'f': []
            # }

            # D -> A
            
            # Because we're using `service_to_depends_on`, we're only touching the relevant
            #   subgraph, and has_cycle wouldn't be set on a cycle in a separate component
            for upstream_service in self.service_to_depends_on[service]:
                count += dfs(upstream_service)
            
            state[service] = 2
            cache[service] = (1 if count == len(self.service_to_depends_on[service]) else 0)
            # if all our dependent services are covered, then we're also covered
            return(
                1 if count == len(self.service_to_depends_on[service]) else 0
            )

Further good follow ups:

Talk about cache invalidation in set_healthy and set_unhealthy
Talk about edge cases that don't fail now, but would later
- i.e returning len(children) == count may not always be ideal
Mention system level insights
- Health mutations are more frequent than dependency mutations, so caching operational results and invalidating on health checks is effective for system load
- Multi-query batching
- SCC caching (this one is a bit out there), talking about cycles with strongly connected components and utilizing Tarjan algorithm to identify strongly connected components and their supernodes, which can help on caching
  - similar to UnionFind but works on directed graphs

What To Focus On

Invariant returns using booleans instead of len count semantics
Showing how data structures (black/white/gray) correspond to spec details + requirements
Over optimize in initial pass. Focus on solving exactly as spec asks, and you can mention for future but leave out of initial implementation
- UnionFind + premature caching implementation discussion

Initial Problem Statement​

API​

Example​

Constraints​

Initial Thoughts, Questions​

Data Structures​

Initial Implementation​

Follow Up Implementations​

What To Focus On​