11. Sorting
Sorting encompasses a range of DSA topics and uses most of the techniques from the other sections, so it was easier to put everything in here
| Algorithm | Time Complexity (Best) | Time Complexity (Average) | Time Complexity (Worst) | Space Complexity | When to Use |
|---|---|---|---|---|---|
| Bubble Sort | O(n) | O(n^2) | O(n^2) | O(1) | Simple implementation, small datasets, rarely used in practice |
| Selection Sort | O(n^2) | O(n^2) | O(n^2) | O(1) | Small datasets, when memory is limited |
| Insertion Sort | O(n) | O(n^2) | O(n^2) | O(1) | Nearly sorted datasets, small datasets |
| Merge Sort | O(n log n) | O(n log n) | O(n log n) | O(n) | Stable sort, large datasets, external sorting |
| Quick Sort | O(n log n) | O(n log n) | O(n^2) | O(log n) | Fastest in practice for average cases, in-memory sorting |
| Heap Sort | O(n log n) | O(n log n) | O(n log n) | O(1) | Memory-constrained environments, when stability is not required |
| Counting Sort | O(n + k) | O(n + k) | O(n + k) | O(k) | Small range of integers, stable sort, non-comparison-based sorting |
| Radix Sort | O(nk) | O(nk) | O(nk) | O(n + k) | Large datasets with integers or strings, non-comparison-based sorting |
| Bucket Sort | O(n + k) | O(n + k) | O(n^2) | O(n) | Uniformly distributed data, when simplicity and speed are priorities |
Bubble Sort
Selection Sort
Insertion Sort
Merge Sort
A divide-and-conquer, recursive algorithm that breaks a list down into sublists until each sublist contains a single element, then merges the sublists back together so that the resulting final list is sorted
Breaking the entire list down into sublists means we will need O(n) extra space
The general algorithm uses 2 main functions:
- `mergeSort(array, startIndex, endIndex)` to divide and break the problem down
- `merge(array, startIndex, middle, endIndex)` to merge sorted subarrays back together (the Python version below uses slicing and a two-argument `merge(leftArray, rightArray)` instead of explicit indices)
- For an array of size n we split it in half each time, so there will be O(log n) levels of splitting
- At each level the resulting subarrays together hold all n elements, so looping over and merging them takes O(n) time per level, giving O(n log n) overall
Because of this, merge sort is typically used when you want guaranteed O(n log n) time and can afford a good chunk of extra memory. It is also a stable sort, meaning elements that compare equal keep their original relative order; separately, its runtime is the same O(n log n) in every case, whereas other algorithms such as quick sort may collapse to O(n^2) on some inputs
def merge(leftArray, rightArray):
    resp = []
    leftIdx = 0
    rightIdx = 0
    # take the smaller head element until one side is exhausted
    while leftIdx < len(leftArray) and rightIdx < len(rightArray):
        if leftArray[leftIdx] < rightArray[rightIdx]:
            resp.append(leftArray[leftIdx])
            leftIdx += 1
        else:
            resp.append(rightArray[rightIdx])
            rightIdx += 1
    # append whatever remains on either side
    if leftIdx < len(leftArray):
        resp.extend(leftArray[leftIdx:])
    if rightIdx < len(rightArray):
        resp.extend(rightArray[rightIdx:])
    return resp

def mergeSort(array):
    if len(array) <= 1:
        return array
    mid = len(array) // 2
    leftArray = array[0:mid]
    rightArray = array[mid:]
    # sort each half recursively, then merge the sorted halves
    return merge(
        mergeSort(leftArray),
        mergeSort(rightArray)
    )
arr = [1, 5, 3, 8, 2]
mergeSort(arr)
# returns [1, 2, 3, 5, 8]
# mid = 5 // 2 = 2
# left = [1, 5], right = [3, 8, 2]
# ...
# [1], [5], [3], [8], [2]
# [1], [5], [3], [2, 8]
# [1, 5], [3], [2, 8]
# [1, 5], [2, 3, 8]
# [1, 2, 3, 5, 8]
Quick Sort
Quick sort utilizes pivot elements and array swaps to sort an array in place - it only uses O(log n) extra space for the recursion stack, and typically runs in O(n log n) time, but can devolve into O(n^2) for certain input arrays (e.g. already-sorted input with a poor pivot choice)
Since the algorithm runs on the data structure itself, it could hypothetically be implemented on an array of any size that's stored on disk and partitioned across compute nodes
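A minimal in-place sketch of the idea above; it assumes a Lomuto-style partition with a random pivot (the text doesn't specify a partition scheme, so this is one common choice):

```python
import random

def partition(array, low, high):
    # random pivot reduces the chance of hitting the O(n^2) worst case
    randIdx = random.randint(low, high)
    array[randIdx], array[high] = array[high], array[randIdx]
    pivot = array[high]
    i = low - 1
    # swap smaller elements to the left of the eventual pivot position
    for j in range(low, high):
        if array[j] < pivot:
            i += 1
            array[i], array[j] = array[j], array[i]
    # place the pivot at its final sorted position
    array[i + 1], array[high] = array[high], array[i + 1]
    return i + 1

def quickSort(array, low=0, high=None):
    if high is None:
        high = len(array) - 1
    if low < high:
        pivotIdx = partition(array, low, high)
        quickSort(array, low, pivotIdx - 1)
        quickSort(array, pivotIdx + 1, high)
    return array

arr = [1, 5, 3, 8, 2]
quickSort(arr)
# arr is now [1, 2, 3, 5, 8]
```

Note that unlike the merge sort above, no new lists are allocated - all the work happens via swaps inside the original array.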
Heap Sort
Counting Sort
Radix Sort
Bucket Sort
Distributed Sorting Algorithms
The algorithms above run in local memory on a single node, but in terms of system design there are times we want to sort items across nodes
The algorithms are shown in the Lecture 4 Distributed Sorting PDF
One specific example is Merge K Streams, which is essentially the second-half merge portion of merge sort run over a number of incoming streams
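The merge-k-streams idea can be sketched with a min-heap (this helper and its names are illustrative, not taken from the lecture PDF); with n total items across k streams it runs in O(n log k) time using O(k) heap space:

```python
import heapq

def mergeKStreams(streams):
    """Merge k sorted iterables into one sorted list.

    Assumes stream values are comparable and not None
    (None is used as the exhausted-stream sentinel here).
    """
    result = []
    heap = []
    iterators = [iter(s) for s in streams]
    # seed the heap with the first element of each stream
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, idx))
    while heap:
        # pop the globally smallest head, then refill from that stream
        value, idx = heapq.heappop(heap)
        result.append(value)
        nxt = next(iterators[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))
    return result

mergeKStreams([[1, 4, 7], [2, 5], [3, 6, 8]])
# [1, 2, 3, 4, 5, 6, 7, 8]
```

Python's standard library also provides `heapq.merge`, which does the same thing lazily as a generator.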
Both merge sort and quick sort have distributed counterparts. Distributed merge sort typically sorts chunks locally, then merges the sorted chunks across nodes, keeping the merge process itself distributed so that final results can be written to distributed storage systems. Distributed quick sort instead partitions the data on pivot values, so each node keeps its storage local and sorts only the items that fall into its key range
- Distributed sorting algorithms are used by Hadoop-, Spark-, and Flink-type processing engines
- Both distributed merge sort and distributed quick sort are optimized to minimize data movement between nodes (shuffle) and to not bring all data onto a singular node
Most of these algorithms require a coordinator / leader node, worker nodes, a fair amount of network shuffle, and potentially a distributed file system
Distributed Merge Sort
Distributed Quick Sort
The distributed quick sort algorithm works by creating multiple partitions of the data and then sorting each partition in parallel. By doing this, the algorithm can take advantage of multiple processors (or nodes) to sort the data more quickly
The chunk size is an important part of how the algorithm works: chunks that are too small spend most of their time on coordination overhead, while chunks that are too large spread the work unevenly across nodes, so neither extreme performs well
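A single-process illustration of the partition-on-pivots idea (the "nodes", helper names, and pivot-sampling heuristic here are all hypothetical; a real system would ship each partition to a different machine and sort them in parallel):

```python
import bisect

def partitionByPivots(records, pivots):
    """Assign each record to a partition based on sorted pivot values."""
    partitions = [[] for _ in range(len(pivots) + 1)]
    for r in records:
        # partition i holds the key range between pivots[i-1] and pivots[i]
        partitions[bisect.bisect_left(pivots, r)].append(r)
    return partitions

def distributedQuickSort(records, numNodes=3):
    # sample the data to pick numNodes - 1 pivots splitting the key range
    sample = sorted(records[::max(1, len(records) // 10)])
    step = max(1, len(sample) // numNodes)
    pivots = sample[step::step][:numNodes - 1]
    partitions = partitionByPivots(records, pivots)
    # each "node" sorts its own partition locally; since partition i's keys
    # are all <= partition i+1's keys, concatenating gives a sorted result
    return [x for part in partitions for x in sorted(part)]

distributedQuickSort([9, 1, 8, 2, 7, 3, 6, 4, 5])
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The key property is that only the pivots are shared globally - the bulk of the data moves at most once (during partitioning), and no single node ever has to hold the full dataset.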