This was exported from an org-mode notebook documenting various algorithm knowledge, with some example code snippets here and there implementing and executing algorithms inline.
Code is reused where possible, such as algorithm implementations being defined in one section with multiple data tests in others, referencing that function.
This post is a bit different, as I am documenting this for myself only- and I'm sorry if I lead you astray :)
Preface
Since this is an org-mode notebook, it's actually possible to evaluate code using Babel. As a bit of an org-mode novice, I'm not familiar with all of the capabilities, but it is possible to share sessions between source code snippets. Due to this, there are some shared functions I will define that will be reused across the notebook.
Shared code
def swap_indices(array, i, j):
array[i], array[j] = array[j], array[i]
debug_state = {
'enabled': False
}
class Logger:
def debug(self, content):
if (debug_state['enabled']):
print(content)
logger = Logger()
def set_debugging(enabled):
debug_state['enabled'] = enabled
Algorithms Terminology
Cardinality
The size of the data. For set, arrays, or tuples, it's the number of elements stored in it.
Regression
Predicting a response. An example of this is K-Nearest Neighbor.
Classification
Categorization into a group. See features.
Feature
Attributes or tangible data points that can be construed about the data set.
Take for example, fruit, you might compare color and size.
For another example, consider a pizza storefront. You might utilize a number of features, such as:
- The day of the week (Sunday is 0, Monday is 1, ... Saturday is 6)
- Weekend or holiday (Yes is 1, No is 0)
- Is it a game day (Yes is 1, No is 0)
And so on. These could be used to guess what a likely outcome of another datapoint might be.
Heuristic
Unlike an algorithm, which always produces a correct result, a heuristic usually does a godo job but does not provide any guarantees.
To provide a more direct definition:
> proceeding to a solution by trial and error or by rules that are only loosely defined.
More general terms
Arrays
Arrays are continugous data structures. They provide instant access (O(1)) to any element of the array by index. They are limited in removals and insertions though, requiring on average O(n) time to insert or delete elements from the array.
For removals, this is because any element to the right of the element needs to be shifted left.
For insertions, this is because any element to the right of the element needs to be shifted right.
So on average, since (n + 0)/2 or 1/2 * n, this equates to O(n).
Sorting algorithms
There are many sorting algorithms relevant to array data structures.
Insertion sort
This is one of the more simple sorting algorithms, and has a complexity of O(n^2^). The logic for this is simple.
- Start from the beginning of the array
- Step forward
- If the current element is less than the previous element
- If the current element is less than the previous element, walk backwards in the array until either you reach an element that is less than the current element, or the beginning of the list, and place it there
- If not at the end of the array, repeat from step 2
An example implementation for this would be:
def insertion_sort(array, debug=False):
logger.debug("Starting from position 1")
backwards_index = 1
logger.debug("Starting iterations...")
for forwards_index in range(1, len(array)):
backwards_index = forwards_index
logger.debug("Finding the best place for {}".format(array[forwards_index]))
while (backwards_index > 0 and array[backwards_index] < array[backwards_index-1]):
logger.debug("Swapping from position {} to position {}".format(backwards_index, backwards_index-1))
# array[backwards_index], array[backwards_index-1] = array[backwards_index-1], array[backwards_index]
swap_indices(array, backwards_index, backwards_index-1)
backwards_index = backwards_index-1
logger.debug("Found the resting place for {}".format(array[forwards_index]))
-
Examples
import random set_debugging(False) # always generates a list from 1-13 with 4 instances of each integer arr = [x for x in range(1,14) for y in range(0,4)] random.shuffle(arr) print("The randomized array") print(arr) insertion_sort(arr) print("") print("The sorted array") print(arr)
:
The sorted array [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13]
Lists
Graphs
Trees
Summary
Trees are a special type of graph. The properties of a tree are:
- A non-linear data structure
- Always has n-1 edges for n nodes
- Edges are always directed
- There is a root node
- A node can only have up to one parent
- There are no cycles
Terminology
-
Root node
The topmost node in a tree. It has no parent, unlike the rest of the tree
-
Parent
In relation to a node, its parent is the node pointing to it
-
Child
In relation to a node, a child is a node it points to
-
Leaf node
A node that has no children. Also referred to as an external node
-
Non-leaf node
A node that does have children. Also referred to as an internal node
-
Path
A sequence of edges connecting a starting node to an end node.
-
Edge
A link between two node, just like in a graph. This must be directed.
-
Ancestor
In relation to a node, an ancestor is any node that is its parent, or a parent of its parent, and so on.
-
Descendant
In relation to a node, a descendent is any node that is its child, or a child of its child, and so on.
-
Sibling
In relation to a node, a sibling is any node that shares the same parent as it.
-
Degree
The number of children a node has.
-
Depth of a node
In relation to a node, this is the number of edges between the root node and it.
-
Height of a node
In relation to a node, this is the longest path that exists between it and any of its descendant leaf nodes.
-
Level of a node
In relation to a node, the number of edges that exist from the root node to it. Typically, depth of node + 1.
-
Rooted tree
A binary tree that has a root node and every node has at most two children.
-
Full tree
A binary tree in which every node has either 0 or 2 children.
-
Balanced tree
A binary tree where the left and right subtrees of every node differ in height by no more than 1.
-
Degenerate / Pathological tree
A tree where each parent node has only one associated child node.
This is essentially a linked list structure and provides no additional benefits over it.
B-trees
These function by optimizing the number of reads that need to be performed in order to access data from a disk. These are often used in databases.
In contrast to a binary tree which houses up to but not exceeding 2 child nodes per node, this structures the maximum number of children per node in order to reach a full block size from the disk. In this way, each read maximally saturates the disk read operation and requires fewer reads from the disk in total to reach the desired data.
Red-black trees
Red-black trees are a specialized type of binary tree which utilize a set of rule to automatically balance itself during insertions and deletions.
Properties
- Red/Black property: Every node is colored. They can either be red or black.
- Root property: The root is black
- Leaf property: Every leaf (nil) is black
- Red property: If a red node has children, then the children are always black.
- Depth property: Every path from a given node to any of its leaf (nil) nodes has the same number of black nodes.
Rotation rules
AVL trees
AVL trees are named after its inventors, Adelson-Velsky and Landis. It is another specialized type of binary tree which is self-balancing.
More info
Heaps
A specialized tree data structure, which satisfies the heap property. There are two types of heaps, min heap and max heap. In a max heap, for any given node N, if P is a parent node of N, then the value of P is greater than or equal to the value of N.
A heap is an implementation of another abstract data type; the priority queue. Priority queues are sometimes referred to as heaps, regardless of their implementation.
A common implementation is the binary heap.
Comparison to binary search trees
Binary search trees follow a different kind of rule, such that the left and right child nodes of any given node are less than or greater than their parent, respectively. Meanwhile, a binary heap follows no such ordering, and has no implicit ordering for searches.
Splay trees
A binary search tree that provides the additional benefit that recently accessed elements will be fast to access again. Operations similar complete in O(log n) time similar to self-balancing binary search trees. For operations that are performed in a non-random pattern, it can complete in faster than logarithmic time, without requiring knowledge of the pattern.
All operations are combined with one basic operation called splaying. Splaying the tree rearranges the tree so that the element is placed at the root of the tree. This requires tree rotations to move the element to the top. This allows all of the operations performed to move recently accessed elements closer to the root.
It is possible for the structure of the tree to be pathological based on what element was most recently accessed, compared to a self-balancing tree which maintains an logarithmic lookup time.
In short, a splay tree will reorganize based on most-recently used (MRU) elements, while a self-balancing tree will reorganize to optimize for random element searches.
K-Nearest Neighbor
Determining the most similar data point in a dataset based on determined features.
Examples
Distance calculation
from math import sqrt, ceil
def calculate_distance(x, y):
# we can only compare the datapoints if they are equal in cardinality
if (len(x) != len(y)):
return -1
return (ceil(sqrt(sum([(x[i]-y[i]) ** 2 for i in range(len(x))]))))
def map_distances(value, dataset):
return [calculate_distance(value, datapoint) for datapoint in dataset]
def k_nearest(k, value, dataset):
# todo: stuff
return []
Bread
### bread data
dataset = [
(5, 1, 0),
(3, 1, 1),
(1, 1, 0),
(4, 0, 1),
(4, 0, 0),
(2, 0, 0),
]
coordinate = (4,1,0)
print(map_distances(coordinate, dataset))
Netflix User Data
### netflix user data
dataset = [
(3,4,4,1,4),
(4,3,5,1,5),
]
print(calculate_distance(*dataset))
Fourier Transform
MapReduce
Scheduling Jobs
A common algorithmic problem is optimal scheduling in regards to time slots. There are various approaches that can be taken here, such as "shortest job first", or "earliest starting job first", which do not always yield the correct result. In the case of scheduling, there is a known, correct algorithm for this that is optimal:
Earliest Ending Job
This works by finding the job that ends the earliest, as opposed to the job that starts the earliest. This job guarantees that less subsequent jobs will be blocked than any others. The algorithm for this can be defined as follows:
class Job:
def __init__(self, starts, ends, meta={}):
self.starts = starts
self.ends = ends
self.meta = meta
def pop_earliest_ending_job(jobs):
logger.debug("Finding the earliest ending job")
earliest_time = float('inf')
earliest_job = None
for job in jobs:
logger.debug("Checking if {} ends earliest...".format(job.meta['title']))
if (job.ends < earliest_time):
logger.debug("Looks like it could be {}".format(job.meta['title']))
earliest_job = job
earliest_time = job.ends
logger.debug("The earliest ending job was {}".format(job.meta['title']))
jobs.remove(earliest_job)
return earliest_job
def remove_overlapping_jobs(jobs, end_time):
logger.debug("Removing jobs conflicting with end time {}".format(end_time))
for job in [j for j in jobs]:
logger.debug("Checking whether to remove {}".format(job.meta['title']))
if (job.starts < end_time):
logger.debug("Removing the job")
jobs.remove(job)
def optimal_scheduling(jobs):
print("Actual print statement")
logger.debug("Creating empty schedule")
optimal_jobs = set()
while (len(jobs) > 0):
next_best_job = pop_earliest_ending_job(jobs)
logger.debug("Next best job was {}".format(next_best_job.meta['title']))
if (next_best_job is None):
break
logger.debug("Adding to the optimal job schedule...")
optimal_jobs.add(next_best_job)
logger.debug("Removing overlapping jobs")
remove_overlapping_jobs(jobs, next_best_job.ends)
logger.debug("Found {} job(s)!".format(len(optimal_jobs)))
return optimal_jobs
print("Defined optimal scheduler")
Examples
Take for example the following case of spending the day at the movie theatre, and you want to watch as many movies as you can in a day. Here are the showtimes:
set_debugging(False)
logger.debug("Creating set of showtimes...")
showtimes = set([
Job(9, 12, {'title': 'Mission Impossible'}),
Job(10, 14, {'title': 'Fellowship of the Ring'}),
Job(14, 15, {'title': 'Cars'}),
Job(13, 16, {'title': 'Dragon Ball Z'}),
Job(15, 20, {'title': 'Lion King'}),
])
logger.debug("Created a set...")
print("Generating the best showtime schedule...")
optimal_showtime_schedule = optimal_scheduling(showtimes)
print("I was able to schedule {} show(s)!".format(len(optimal_showtime_schedule)))
shows_in_order = sorted(optimal_showtime_schedule, key=lambda s: s.starts, reverse=False)
for show in shows_in_order:
print("{}, starting at {} and ending at {}".format(show.meta['title'], show.starts, show.ends))
print("All done")