Notes on algorithms

12 minute read Published: 2023-07-28


This was exported from an org-mode notebook documenting various algorithm knowledge, with some example code snippets here and there implementing and executing algorithms inline.

Code is reused where possible, such as algorithm implementations being defined in one section with multiple data tests in others, referencing that function.

This post is a bit different, as I am documenting this for myself only- and I'm sorry if I lead you astray :)

Preface

Since this is an org-mode notebook, it's actually possible to evaluate code using Babel. As a bit of an org-mode novice, I'm not familiar with all of the capabilities, but it is possible to share sessions between source code snippets. Due to this, there are some shared functions I will define that will be reused across the notebook.

Shared code

def swap_indices(array, i, j):
    array[i], array[j] = array[j], array[i]

debug_state = {
    'enabled': False
}

class Logger:
    def debug(self, content):
        if (debug_state['enabled']):
            print(content)

logger = Logger()

def set_debugging(enabled):
    debug_state['enabled'] = enabled

Algorithms Terminology

Cardinality

The size of the data. For set, arrays, or tuples, it's the number of elements stored in it.

Regression

Predicting a response. An example of this is K-Nearest Neighbor.

Classification

Categorization into a group. See features.

Feature

Attributes or tangible data points that can be construed about the data set.

Take for example, fruit, you might compare color and size.

For another example, consider a pizza storefront. You might utilize a number of features, such as:

And so on. These could be used to guess what a likely outcome of another datapoint might be.

Heuristic

Unlike an algorithm, which always produces a correct result, a heuristic usually does a godo job but does not provide any guarantees.

To provide a more direct definition:

> proceeding to a solution by trial and error or by rules that are only loosely defined.

More general terms

Arrays

Arrays are continugous data structures. They provide instant access (O(1)) to any element of the array by index. They are limited in removals and insertions though, requiring on average O(n) time to insert or delete elements from the array.

For removals, this is because any element to the right of the element needs to be shifted left.

For insertions, this is because any element to the right of the element needs to be shifted right.

So on average, since (n + 0)/2 or 1/2 * n, this equates to O(n).

Sorting algorithms

There are many sorting algorithms relevant to array data structures.

Insertion sort

This is one of the more simple sorting algorithms, and has a complexity of O(n^2^). The logic for this is simple.

  1. Start from the beginning of the array
  2. Step forward
  3. If the current element is less than the previous element
  4. If the current element is less than the previous element, walk backwards in the array until either you reach an element that is less than the current element, or the beginning of the list, and place it there
  5. If not at the end of the array, repeat from step 2

An example implementation for this would be:

def insertion_sort(array, debug=False):
    logger.debug("Starting from position 1")
    backwards_index = 1

    logger.debug("Starting iterations...")
    for forwards_index in range(1, len(array)):
        backwards_index = forwards_index

        logger.debug("Finding the best place for {}".format(array[forwards_index]))

        while (backwards_index > 0 and array[backwards_index] < array[backwards_index-1]):

            logger.debug("Swapping from position {} to position {}".format(backwards_index, backwards_index-1))
            # array[backwards_index], array[backwards_index-1] = array[backwards_index-1], array[backwards_index]
            swap_indices(array, backwards_index, backwards_index-1)
            backwards_index = backwards_index-1

        logger.debug("Found the resting place for {}".format(array[forwards_index]))
  1. Examples

    import random
    
    set_debugging(False)
    
    # always generates a list from 1-13 with 4 instances of each integer
    arr = [x for x in range(1,14) for y in range(0,4)]
    
    random.shuffle(arr)
    
    print("The randomized array")
    print(arr)
    
    insertion_sort(arr)
    
    print("")
    print("The sorted array")
    print(arr)
    
    

    :

    The sorted array
    [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13]
    

Lists

Graphs

Trees

Summary

Trees are a special type of graph. The properties of a tree are:

Terminology

  1. Root node

    The topmost node in a tree. It has no parent, unlike the rest of the tree

  2. Parent

    In relation to a node, its parent is the node pointing to it

  3. Child

    In relation to a node, a child is a node it points to

  4. Leaf node

    A node that has no children. Also referred to as an external node

  5. Non-leaf node

    A node that does have children. Also referred to as an internal node

  6. Path

    A sequence of edges connecting a starting node to an end node.

  7. Edge

    A link between two node, just like in a graph. This must be directed.

  8. Ancestor

    In relation to a node, an ancestor is any node that is its parent, or a parent of its parent, and so on.

  9. Descendant

    In relation to a node, a descendent is any node that is its child, or a child of its child, and so on.

  10. Sibling

    In relation to a node, a sibling is any node that shares the same parent as it.

  11. Degree

    The number of children a node has.

  12. Depth of a node

    In relation to a node, this is the number of edges between the root node and it.

  13. Height of a node

    In relation to a node, this is the longest path that exists between it and any of its descendant leaf nodes.

  14. Level of a node

    In relation to a node, the number of edges that exist from the root node to it. Typically, depth of node + 1.

  15. Rooted tree

    A binary tree that has a root node and every node has at most two children.

  16. Full tree

    A binary tree in which every node has either 0 or 2 children.

  17. Balanced tree

    A binary tree where the left and right subtrees of every node differ in height by no more than 1.

  18. Degenerate / Pathological tree

    A tree where each parent node has only one associated child node.

    This is essentially a linked list structure and provides no additional benefits over it.

B-trees

These function by optimizing the number of reads that need to be performed in order to access data from a disk. These are often used in databases.

In contrast to a binary tree which houses up to but not exceeding 2 child nodes per node, this structures the maximum number of children per node in order to reach a full block size from the disk. In this way, each read maximally saturates the disk read operation and requires fewer reads from the disk in total to reach the desired data.

Red-black trees

Red-black trees are a specialized type of binary tree which utilize a set of rule to automatically balance itself during insertions and deletions.

Properties

  1. Red/Black property: Every node is colored. They can either be red or black.
  2. Root property: The root is black
  3. Leaf property: Every leaf (nil) is black
  4. Red property: If a red node has children, then the children are always black.
  5. Depth property: Every path from a given node to any of its leaf (nil) nodes has the same number of black nodes.

Rotation rules

AVL trees

AVL trees are named after its inventors, Adelson-Velsky and Landis. It is another specialized type of binary tree which is self-balancing.

More info

Heaps

A specialized tree data structure, which satisfies the heap property. There are two types of heaps, min heap and max heap. In a max heap, for any given node N, if P is a parent node of N, then the value of P is greater than or equal to the value of N.

A heap is an implementation of another abstract data type; the priority queue. Priority queues are sometimes referred to as heaps, regardless of their implementation.

A common implementation is the binary heap.

Comparison to binary search trees

Binary search trees follow a different kind of rule, such that the left and right child nodes of any given node are less than or greater than their parent, respectively. Meanwhile, a binary heap follows no such ordering, and has no implicit ordering for searches.

Splay trees

A binary search tree that provides the additional benefit that recently accessed elements will be fast to access again. Operations similar complete in O(log n) time similar to self-balancing binary search trees. For operations that are performed in a non-random pattern, it can complete in faster than logarithmic time, without requiring knowledge of the pattern.

All operations are combined with one basic operation called splaying. Splaying the tree rearranges the tree so that the element is placed at the root of the tree. This requires tree rotations to move the element to the top. This allows all of the operations performed to move recently accessed elements closer to the root.

It is possible for the structure of the tree to be pathological based on what element was most recently accessed, compared to a self-balancing tree which maintains an logarithmic lookup time.

In short, a splay tree will reorganize based on most-recently used (MRU) elements, while a self-balancing tree will reorganize to optimize for random element searches.

K-Nearest Neighbor

Determining the most similar data point in a dataset based on determined features.

Examples

Distance calculation

from math import sqrt, ceil

def calculate_distance(x, y):
    # we can only compare the datapoints if they are equal in cardinality
    if (len(x) != len(y)):
        return -1

    return (ceil(sqrt(sum([(x[i]-y[i]) ** 2 for i in range(len(x))]))))

def map_distances(value, dataset):
    return [calculate_distance(value, datapoint) for datapoint in dataset]

def k_nearest(k, value, dataset):
    # todo: stuff

    return []

Bread

### bread data
dataset = [
    (5, 1, 0),
    (3, 1, 1),
    (1, 1, 0),
    (4, 0, 1),
    (4, 0, 0),
    (2, 0, 0),
]
coordinate = (4,1,0)

print(map_distances(coordinate, dataset))

Netflix User Data

### netflix user data
dataset = [
    (3,4,4,1,4),
    (4,3,5,1,5),
]

print(calculate_distance(*dataset))

Fourier Transform

MapReduce

Scheduling Jobs

A common algorithmic problem is optimal scheduling in regards to time slots. There are various approaches that can be taken here, such as "shortest job first", or "earliest starting job first", which do not always yield the correct result. In the case of scheduling, there is a known, correct algorithm for this that is optimal:

Earliest Ending Job

This works by finding the job that ends the earliest, as opposed to the job that starts the earliest. This job guarantees that less subsequent jobs will be blocked than any others. The algorithm for this can be defined as follows:

class Job:
    def __init__(self, starts, ends, meta={}):
        self.starts = starts
        self.ends = ends
        self.meta = meta

def pop_earliest_ending_job(jobs):
    logger.debug("Finding the earliest ending job")

    earliest_time = float('inf')
    earliest_job = None

    for job in jobs:
         logger.debug("Checking if {} ends earliest...".format(job.meta['title']))
         if (job.ends < earliest_time):
             logger.debug("Looks like it could be {}".format(job.meta['title']))
             earliest_job = job
             earliest_time = job.ends

    logger.debug("The earliest ending job was {}".format(job.meta['title']))

    jobs.remove(earliest_job)

    return earliest_job

def remove_overlapping_jobs(jobs, end_time):
    logger.debug("Removing jobs conflicting with end time {}".format(end_time))

    for job in [j for j in jobs]:
        logger.debug("Checking whether to remove {}".format(job.meta['title']))

        if (job.starts < end_time):
            logger.debug("Removing the job")
            jobs.remove(job)

def optimal_scheduling(jobs):
    print("Actual print statement")
    logger.debug("Creating empty schedule")

    optimal_jobs = set()

    while (len(jobs) > 0):
        next_best_job = pop_earliest_ending_job(jobs)

        logger.debug("Next best job was {}".format(next_best_job.meta['title']))

        if (next_best_job is None):
            break

        logger.debug("Adding to the optimal job schedule...")
        optimal_jobs.add(next_best_job)

        logger.debug("Removing overlapping jobs")
        remove_overlapping_jobs(jobs, next_best_job.ends)

    logger.debug("Found {} job(s)!".format(len(optimal_jobs)))

    return optimal_jobs

print("Defined optimal scheduler")

Examples

Take for example the following case of spending the day at the movie theatre, and you want to watch as many movies as you can in a day. Here are the showtimes:

set_debugging(False)

logger.debug("Creating set of showtimes...")

showtimes = set([
    Job(9, 12, {'title': 'Mission Impossible'}),
    Job(10, 14, {'title': 'Fellowship of the Ring'}),
    Job(14, 15, {'title': 'Cars'}),
    Job(13, 16, {'title': 'Dragon Ball Z'}),
    Job(15, 20, {'title': 'Lion King'}),
])

logger.debug("Created a set...")

print("Generating the best showtime schedule...")

optimal_showtime_schedule = optimal_scheduling(showtimes)

print("I was able to schedule {} show(s)!".format(len(optimal_showtime_schedule)))

shows_in_order = sorted(optimal_showtime_schedule, key=lambda s: s.starts, reverse=False)

for show in shows_in_order:
    print("{}, starting at {} and ending at {}".format(show.meta['title'], show.starts, show.ends))

print("All done")