Consistent Hashing Service
Initial Problem Statement
Implement a class that assigns keys to nodes using a consistent-hashing scheme
API
class ConsistentHashing:
def add_node(self, node_id: str) -> None: ...
def remove_node(self, node_id: str) -> None: ...
def get_node(self, key: str) -> str | None: ...
Requirements
- Keys and node identifiers must be mapped onto a hash space
- Each physical node must be represented by multiple virtual nodes (use 100 replicas)
- The system must track all virtual nodes and the node they belong to
- Adding a node inserts all of its replicas into the hash space
- Removing a node removes all of its replicas
get_node(key)returns the node responsible for that key based on its hashed position on the ring- If no nodes exist, return
None
Example
None atm
Constraints
None atm
Initial Thoughts, Questions
- For
nnodes on the ring, it's split up intonsections- 1 node gets everything
- 2 nodes splits it in half
- 3 splits it into a peace sign kinda symbol
- ...
- Each key is mapped to the last node on it's section
- So if we use 1-100 as the circle identifiers, and there are 5 nodes
- 0:19
- 20:39
- 40:59
- 60:79
- 80:99
- 100 == 0
- Getting the closest node can be placing it onto ring can be done via:
hash // (number_line // n_nodes) = indexindex * (number_line // n_nodes) = ring start(65 // (20)) * (20) = ring_start
- So if we use 1-100 as the circle identifiers, and there are 5 nodes
- Adding a new node in means splitting up one of the sections
- Should this enforce all sections to become uniform again? Or just split up one of the sections?
- Same question for removal, should we just bring all above instances down to the new node, or rebalance to be uniform?
- 32 bit space
[-2_147_483_648, 2_147_483_647]- So ring is ~4M instead of 1-100, but the same rules apply
- I can't wrap my head around how to assign a node given the negative space here, I can always just project onto
[0, 4M]and subtract2.1M - Going to add
2^31to each number, and sit in positive space hash_value = hash(x) & 0xffffffff
Turns out everything I asked above is completely wrong about consistent hashing, it's not uniform, rebalances are horrible on system, etc...oops
Implementation
- Storing node hashes in list
- New node addition will be inserted
- insertion as all things have to shift up or down
v_nodes = []
- Each node is 100 nodes
- So
add_node(node)means there's actually 100 things to insert for i in range(99)vhas_pos = hash(f"node_{i}") & 0xffffffffbisect_right(nodes, vhas_pos)+ insert there- Want to have all items less than or equal to
vhas_pos - Can node hashes collide?
- Yes, insert anyways
- Want to have all items less than or equal to
- So
- Keep a dictionary of which virtual hashes apply to which overall node
v_to_node[vhash] = node_id- Could do
v_to_node[(vhash, unique_counter)] = node_idto be absolutely sure no collisions, but collisions are rare
- Could do
node_to_v[node_id].append(vhash)
- The above allows us to interact with both sorted virtual nodes, and then find their leader node
node_id remove_node(node_id)means we need to loop overnode_to_vand delete everything in there fromv_nodesand clear ournode_to_vandv_to_node- Each
del v_nodefromv_nodesis movement - No reason to keep these in a priority queue or anything, we're not constantly looking for min or max, it's just we need a sorted structure that we can insert into random positions - list is still best case here as we don't want a linked list traversal
get_node(key)would runk_idx = hash(key) & 0xffffffff- Need to find which
vhashis greater than it bisect_left(v_nodes, k_idx)will give us that index- All elements to the left are strictly less than
- Therefore index returned is either equal to, or greater than
- If that returned index is equal to
len(v_nodes)we wrap around to 0 (that index wouldn't exist)- Equal to "None" return
- Return
v_to_nodeto get the actualnode_idthat corresponds to it
- Need to find which
class ConsistentHashing:
def __init__(self):
self.v_nodes = []
self.v_to_node = defaultdict(str)
self.node_to_v = defaultdict(list)
self.replica_factor = 100
def add_node(self, node_id: str) -> None:
for i in range(self.replica_factor):
vhas_pos = hash(f"{node_id}_{i}") & 0xffffffff
vhas_item = (vhas_pos, node_id, i)
# want to get the index where everything left of it is less than or equal to
# i.e., everything to right is strictly larger
# [1, 2, 4, 5]
# bisect_right(3) returns 2 - we want to insert there
# bisect_right(2) returns 2 - all ltequal to
vhas_idx = bisect_right(self.v_nodes, vhas_item)
self.v_nodes.insert(vhas_idx, vhas_item)
# map a position hash back to node, if vhas found in v_nodes, we wwant to know
# which node_id it's for
# could do v_to_node[(vhash, unique_counter)] = node_id later on to avoid coll's
self.v_to_node[vhas_item] = node_id
# doesn't need ordering, just need reference
self.node_to_v[node_id].append(vhas_item)
def remove_node(self, node_id: str) -> None:
if node_id not in self.node_to_v.keys():
return(None)
vhas_pos_list = self.node_to_v[node_id]
del self.node_to_v[node_id]
for v_node_item in vhas_pos_list:
# if there are duplicates, bisect_left will return first one
# we are assuming no hash coll's...
self.v_nodes.pop(bisect_left(self.v_nodes, v_node_item))
del self.v_to_node[v_node_item]
def get_node(self, key: str) -> str | None:
if not self.v_nodes:
return(None)
k_idx = hash(key) & 0xffffffff
# bisect_left means all elements to left are strictly less than
# so this index is greater than or equal to
next_greatest_vhas = bisect_left(self.v_nodes, (k_idx, -1))
# loop to 0 if at the end
next_greatest_vhas = next_greatest_vhas % len(self.v_nodes)
return(
self.v_to_node[self.v_nodes[next_greatest_vhas]]
)