Huffman Encoding

If there were ever a data compression method to take the world by storm, it would be Huffman encoding. In fact, this was the method that got me into computational methods to begin with. I distinctly remember sitting in my data compression class and talking about the great information theorist Claude Shannon and Robert Fano, when suddenly my professor introduced a new kid to the mix: David Huffman. He managed to rip the heart out of the methods described by leaders of the field and create a data compression method that was easier to understand and implement, while also providing more robust results, and apparently this was all done for a school project!

It was in that moment, I knew I would never amount to anything. I have since accepted that fact and moved on.

Huffman encoding follows from the problem described in the Data Compression section. We have a string that we want to encode into bits. Huffman encoding ensures that our encoded bitstring is as small as possible without losing any information. Because it is both lossless and guarantees the smallest possible bit length, it outright replaces both Shannon and Shannon-Fano encoding in most cases, which is a little weird because the method was devised while Huffman was taking a course from Fano, himself!

The idea is somewhat straightforward in principle, but a little difficult to code in practice. By creating a binary tree of the input alphabet, every branch can be provided a unique bit representation simply by assigning a binary value to each child and reading to a character in a leaf node if starting from the root node.

So now the question is: how do we create a binary tree? Well, here we build it from the bottom up like so:

Order all characters according to the frequency they appear in the input bitstring, with the most frequent character at the top of the list. Be sure to keep track of the frequencies, too!
Add the smallest two values together to create a new node with a new frequency.
Keep doing step 2 until the tree is complete.
Read the tree backwards from the root node and concatenate the final bitstring codeword. Keep all codewords and put them into your final set of codewords (sometimes called a codebook)
Encode your phrase with the codebook.

And that's it. Here's an image of what this might look like for the phrase bibbity_bobbity:

This will create a codebook that looks like this:

Character	Bit Representation
b	0
i	100
t	101
y	110
o	1110
_	1111

and bibbity_bobbity becomes 01000010010111011110111000100101110. As mentioned this uses the minimum number of bits possible for encoding. The fact that this algorithm is both conceptually simple and provably useful is rather extraordinary to me and is why Huffman encoding will always hold a special place in my heart.

Video Explanation

Here is a quick video explanation for Huffman encoding:

Example Code

In code, this can be a little tricky. It requires a method to continually sort the nodes as you add more and more nodes to the system. The most straightforward way to do this in some languages is with a priority queue, but depending on the language, this might be more or less appropriate. In addition, to read the tree backwards, some sort of Depth First Search needs to be implemented. Whether you use a stack or straight-up recursion also depends on the language, but the recursive method is a little easier to understand in most cases.

using Test

# This is for the PriorityQueue
using DataStructures

struct Leaf
    weight::Int64
    key::Char
end

struct Branch
    right::Union{Leaf, Branch}
    left::Union{Leaf, Branch}
    weight::Int64
end

const Node = Union{Leaf, Branch}

function codebook_recurse!(leaf::Leaf, code::String,
                          dict::Dict{Char,String})
    dict[leaf.key] = code
end

function codebook_recurse!(branch::Branch, code::String,
                          dict::Dict{Char,String})
    codebook_recurse!(branch.left, string(code, "1"), dict)
    codebook_recurse!(branch.right, string(code, "0"), dict)
end

# This will depth-first search through the tree
# to create bitstrings for each character.
# Note: Any depth-first search method will work
# This outputs encoding Dict to be used for encoding
function create_codebook(n::Node)
    codebook = Dict{Char,String}()
    if isa(n, Leaf)
        codebook[n.key]="0"
    else
        codebook_recurse!(n, "", codebook)
    end
    return codebook
end

# This outputs huffman tree to generate dictionary for encoding
function create_tree(phrase::String)

    # creating weights
    weights = PriorityQueue()
    for i in phrase
        temp_string = string(i)
        if (haskey(weights, temp_string))
            weights[temp_string] += 1
        else
            weights[temp_string] = 1
        end
    end

    # Creating all nodes to iterate through
    nodes = PriorityQueue{Node, Int64}()
    while(length(weights) > 0)
        weight = peek(weights)[2]
        key = dequeue!(weights)[1]
        temp_node = Leaf(weight, key)
        enqueue!(nodes, temp_node, weight)
    end

    while(length(nodes) > 1)
        node1 = dequeue!(nodes)
        node2 = dequeue!(nodes)
        temp_node = Branch(node1, node2, node1.weight + node2.weight)
        enqueue!(nodes, temp_node, temp_node.weight)
    end

    huffman_tree = dequeue!(nodes)
    return huffman_tree

end

function encode(codebook::Dict{Char, String}, phrase::String)
    final_bitstring = ""
    for i in phrase
        final_bitstring = final_bitstring * codebook[i]
    end

    return final_bitstring
end

function decode(huffman_tree::Node, bitstring::String)
    current = huffman_tree
    final_string = ""
    for i in bitstring
        if isa(huffman_tree, Branch)
            if (i == '1')
                current = current.left
            else
                current = current.right
            end

            if (!isa(current, Branch))
                final_string *= string(current.key)
                current = huffman_tree
            end
        else
            final_string *= string(huffman_tree.key)
        end
    end

    return final_string
end

function two_pass_huffman(phrase::String)
    huffman_tree = create_tree(phrase)
    codebook = create_codebook(huffman_tree)
    bitstring = encode(codebook, phrase)
    final_string = decode(huffman_tree, bitstring)
    return final_string
end

@testset "b-string tests" begin
    @test two_pass_huffman("b") == "b"
    @test two_pass_huffman("bbbbbbbb") == "bbbbbbbb"
    @test two_pass_huffman("bibbity bobbity") == "bibbity bobbity"
end

extern crate itertools;

use std::cmp::{Ord, Ordering, PartialOrd};
use std::collections::{BinaryHeap, HashMap};

use itertools::Itertools;

#[derive(Debug)]
enum HuffmanTree {
    Branch {
        count: i32,
        left: Box<HuffmanTree>,
        right: Box<HuffmanTree>,
    },
    Leaf {
        count: i32,
        value: char,
    },
}

impl PartialEq for HuffmanTree {
    fn eq(&self, other: &Self) -> bool {
        self.count() == other.count()
    }
}

impl Eq for HuffmanTree {}

impl PartialOrd for HuffmanTree {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        other.count().partial_cmp(&self.count())
    }
}

impl Ord for HuffmanTree {
    fn cmp(&self, other: &Self) -> Ordering {
        other.count().cmp(&self.count())
    }
}

#[derive(Debug)]
struct Codebook {
    codebook: HashMap<char, String>,
    tree: HuffmanTree,
}

impl HuffmanTree {
    pub fn from(input: &str) -> Self {
        let counts = input.chars().fold(HashMap::new(), |mut map, c| {
            *map.entry(c).or_insert(0) += 1;
            map
        });
        let mut queue = counts
            .iter()
            .map(|(&value, &count)| HuffmanTree::Leaf { value, count })
            .collect::<BinaryHeap<HuffmanTree>>();

        while queue.len() > 1 {
            let left = queue.pop().unwrap();
            let right = queue.pop().unwrap();
            queue.push(HuffmanTree::Branch {
                count: left.count() + right.count(),
                left: Box::new(left),
                right: Box::new(right),
            })
        }

        queue.pop().expect("The Huffman tree has to have a root")
    }

    pub fn count(&self) -> i32 {
        match *self {
            HuffmanTree::Branch { count, .. } => count,
            HuffmanTree::Leaf { count, .. } => count,
        }
    }

    pub fn make_codebook(self) -> Codebook {
        let mut codebook = HashMap::new();
        self.dfs(String::from(""), &mut codebook);
        Codebook {
            codebook,
            tree: self,
        }
    }

    pub fn decode(&self, input: &str) -> String {
        let mut result = String::from("");
        let mut start = 0;
        while !input[start..].is_empty() {
            start += self.decode_dfs(&input[start..], &mut result);
        }
        result
    }

    fn decode_dfs(&self, input: &str, result: &mut String) -> usize {
        let current = input.chars().next();
        match *self {
            HuffmanTree::Branch { ref left, .. } if current == Some('0') => {
                1 + left.decode_dfs(&input[1..], result)
            }
            HuffmanTree::Branch { ref right, .. } if current == Some('1') => {
                1 + right.decode_dfs(&input[1..], result)
            }
            HuffmanTree::Leaf { value, .. } => {
                result.push(value);
                0
            }
            _ => panic!("Unexpected end of input"),
        }
    }

    fn dfs(&self, code: String, codebook: &mut HashMap<char, String>) {
        match *self {
            HuffmanTree::Branch {
                ref left,
                ref right,
                ..
            } => {
                left.dfs(code.clone() + "0", codebook);
                right.dfs(code.clone() + "1", codebook);
            }
            HuffmanTree::Leaf { value, .. } => {
                codebook.insert(value, code);
            }
        }
    }
}

impl Codebook {
    fn encode(&self, input: &str) -> String {
        input.chars().map(|c| &self.codebook[&c]).join("")
    }

    fn decode(&self, input: &str) -> String {
        self.tree.decode(input)
    }
}

fn main() {
    let input = "bibbity bobbity";

    let tree = HuffmanTree::from(input);
    let codebook = tree.make_codebook();
    let encoded = codebook.encode(input);
    let decoded = codebook.decode(&encoded);

    // Uncomment this line if you want to see the codebook/tree
    // println!("{:#?}", codebook);
    println!("{}", encoded);
    println!("{}", decoded);
}

// Made by Guston and edited by Gathros
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

struct tree {
    struct tree* left;
    struct tree* right;

    int count;
    char value;
};

struct bitstring_builder {
    char str[257];
    int next_index;
};

struct codebook {
    char* codes[256];
};

struct heap {
    struct tree** data;
    size_t length;
    size_t capacity;
};

bool is_leaf(const struct tree* t) {
    return !t->left && !t->right;
}

void swap(struct tree** lhs, struct tree** rhs) {
    struct tree* tmp = *lhs;
    *lhs = *rhs;
    *rhs = tmp;
}

/* The two concat functions are horribly inefficient */
void concat(char** dst, const char* src) {
    size_t dst_len = strlen(*dst);
    size_t src_len = strlen(src);
    *dst = realloc(*dst, src_len + dst_len + 1);
    strcat(*dst, src);
}

void concat_char(char** dst, char c) {
    size_t len = strlen(*dst);
    *dst = realloc(*dst, len + 2);
    (*dst)[len] = c;
    (*dst)[len + 1] = '\0';
}

char* duplicate(const char* src) {
    size_t length = strlen(src);
    char* dst = malloc(length + 1);
    memcpy(dst, src, length + 1);
    return dst;
}

void heap_push(struct heap* heap, struct tree* value) {
    if (heap->capacity == heap->length) {
        heap->capacity = heap->capacity == 0 ? 4 : heap->capacity * 2;
        heap->data = realloc(heap->data, heap->capacity * sizeof(struct tree*));
    }
    heap->data[heap->length++] = value;

    size_t index = heap->length - 1;
    while (index) {
        size_t parent_index = (index - 1) / 2;
        if (heap->data[parent_index]->count <= heap->data[index]->count) {
            break;
        }

        swap(&heap->data[parent_index], &heap->data[index]);
        index = parent_index;
    }
}

struct tree* heap_pop(struct heap* heap) {
    if (!heap->length) {
        return NULL;
    }

    struct tree* result = heap->data[0];
    swap(&heap->data[0], &heap->data[--heap->length]);

    size_t index = 0;
    for (;;) {
        size_t target = index;
        size_t left = 2 * index + 1;
        size_t right = left + 1;

        if (left < heap->length &&
                heap->data[left]->count < heap->data[target]->count) {
            target = left;
        }

        if (right < heap->length &&
                heap->data[right]->count < heap->data[target]->count) {
            target = right;
        }

        if (target == index) {
            break;
        }

        swap(&heap->data[index], &heap->data[target]);
        index = target;
    }

    return result;
}

void heap_free(struct heap* heap) {
    free(heap->data);
}

struct tree* generate_tree(const char* str) {
    int counts[256] = { 0 };

    for (; *str != '\0'; ++str) {
        counts[(unsigned char)*str] += 1;
    }

    struct heap heap = { 0 };
    for (size_t i = 0; i < sizeof(counts) / sizeof(int); ++i) {
        if (counts[i]) {
            struct tree* tree = calloc(1, sizeof(struct tree));
            tree->value = (char)i;
            tree->count = counts[i];
            heap_push(&heap, tree);
        }
    }

    if (heap.length == 1) {
        struct tree* leaf = heap_pop(&heap);
        struct tree* root = calloc(0, sizeof(struct tree));
        root->left = leaf;
        root->count = leaf->count;
        heap_free(&heap);
        return root;
    }

    while (heap.length > 1) {
        struct tree* left = heap_pop(&heap);
        struct tree* right = heap_pop(&heap);
        struct tree* branch = calloc(1, sizeof(struct tree));
        branch->count = left->count + right->count;
        branch->left = left;
        branch->right = right;
        heap_push(&heap, branch);
    }

    struct tree* root = heap_pop(&heap);
    heap_free(&heap);
    return root;
}

void tree_free(struct tree* tree) {
    if (!tree) return;
    tree_free(tree->left);
    tree_free(tree->right);
    free(tree);
}

void codebook_recurse(const struct tree* tree,
                        struct bitstring_builder* builder,
                        struct codebook* codebook) {
    if (!tree) {
        return;
    }

    if (is_leaf(tree)) {
        builder->str[builder->next_index] = '\0';
        codebook->codes[(unsigned char)tree->value] = duplicate(builder->str);
        return;
    }

    builder->str[builder->next_index++] = '0';
    codebook_recurse(tree->left, builder, codebook);
    builder->next_index -= 1;

    builder->str[builder->next_index++] = '1';
    codebook_recurse(tree->right, builder, codebook);
    builder->next_index -= 1;
}

struct codebook generate_codebook(const struct tree* tree) {
    struct codebook codebook = { .codes = { 0 } };
    struct bitstring_builder builder = { .str = { 0 }, .next_index = 0 };
    codebook_recurse(tree, &builder, &codebook);
    return codebook;
}

void codebook_free(struct codebook* codebook) {
    int size = sizeof(codebook->codes) / sizeof(codebook->codes[0]);
    for (int i = 0; i < size; ++i) {
        free(codebook->codes[i]);
    }
}

const char* get_code(const struct codebook* codebook, char c) {
    return codebook->codes[(unsigned char)c];
}

char* encode(const char* input, struct tree** huffman_tree,
                struct codebook* codebook) {
    *huffman_tree = generate_tree(input);
    *codebook = generate_codebook(*huffman_tree);

    char* result = duplicate(get_code(codebook, *input));

    input += 1;

    for (; *input; ++input) {
        concat(&result, get_code(codebook, *input));
    }

    return result;
}

const char* decode_recurse(const char* input, const struct tree* tree,
                            char** result) {
    if (!tree) {
        return input;
    }

    if (is_leaf(tree)) {
        concat_char(result, tree->value);
        return input;
    }

    if (*input == '0') {
        return decode_recurse(input + 1, tree->left, result);
    } else {
        return decode_recurse(input + 1, tree->right, result);
    }
}

char* decode(const char* input, const struct tree* tree) {
    char* result = calloc(1, 1);
    do {
        input = decode_recurse(input, tree, &result);
    } while (*input);
    return result;
}

int main() {
    struct tree* tree;
    struct codebook codebook;

    char* encoded = encode("bibbity bobbity", &tree, &codebook);
    char* decoded = decode(encoded, tree);

    printf("Codebook:\n");
    for (int i = 0; i < 256; ++i) {
        if (codebook.codes[i]) {
            printf("%c %s\n", (char)i, codebook.codes[i]);
        }
    }

    printf("%s\n", encoded);
    printf("%s\n", decoded);

    tree_free(tree);
    codebook_free(&codebook);
    free(encoded);
    free(decoded);
    return 0;
}

import qualified Data.Map as M
import           Data.List (insert, sort)

data Tree a = Leaf Int a
            | Node Int (Tree a) (Tree a)
                   deriving (Show, Eq)

freq :: Tree a -> Int
freq (Leaf i _)   = i
freq (Node i _ _) = i

instance (Eq a) => Ord (Tree a) where
  compare t1 t2 = compare (freq t1) (freq t2)

getFrequencies :: Ord a => [a] -> [(Int, a)]
getFrequencies = toSortedList . M.fromListWith (+) . flip zip (repeat 1)
  where toSortedList = sort . map swap . M.toList
        swap (a, i) = (i, a)

buildTree :: (Ord a) => [a] -> Maybe (Tree a)
buildTree = build . map (uncurry Leaf) . getFrequencies
  where build []         = Nothing
        build [t]        = Just t
        build (t1:t2:ts) = build $ insert (Node (freq t1 + freq t2) t1 t2) ts

data Bit = Zero | One

instance Show Bit where
  show Zero = "0"
  show One = "1"

encode :: (Ord a) => [a] -> (Maybe (Tree a), [Bit])
encode s = (tree, msg)
  where
  tree = buildTree s
  msg = concatMap (table M.!) s
  table = case tree of
    Nothing -> M.empty
    Just t  -> M.fromList $ mkTable (t, [])
  mkTable (Leaf _ a, p)     = [(a, reverse p)]
  mkTable (Node _ t1 t2, p) = concatMap mkTable [(t1,  Zero:p), (t2, One:p)]

decode :: (Ord a) => Maybe (Tree a) -> [Bit] -> [a]
decode Nothing _ = []
decode (Just t) m = path t m
  where path (Leaf _ a) m            = a : path t m
        path (Node _ t1 _) (Zero: m) = path t1 m
        path (Node _ _ t2) (One: m)  = path t2 m
        path _ _                     = []

main = do
  let msg = "bibbity bobbity"
      (tree, encoded) = encode msg
      decoded = decode tree encoded
  putStrLn $ "Endoding \"" ++ msg ++ "\": " ++ concatMap show encoded
  putStrLn $ "Length: " ++ (show $ length encoded)
  putStrLn $ "Decoding: " ++ decoded

HuffmanCoding.cs

// submitted by Julian Schacher (jspp), thanks to gustorn for the help
using System;
using System.Collections.Generic;
using System.Linq;

namespace HuffmanCoding
{
    public class EncodingResult
    {
        public string BitString { get; set; }
        public Dictionary<char, string> Dictionary { get; set; }
        public HuffmanCoding.Node Tree { get; set; }

        public EncodingResult(string bitString, Dictionary<char, string> dictionary, HuffmanCoding.Node tree)
        {
            this.BitString = bitString;
            this.Dictionary = dictionary;
            this.Tree = tree;
        }
    }

    public class HuffmanCoding
    {
        // The Node class used for the Huffman Tree.
        public class Node : IComparable<Node>
        {
            public Node LeftChild { get; set; }
            public Node RightChild { get; set; }
            public string BitString { get; set; } = "";
            public int Weight { get; set; }
            public string Key { get; set; }

            public bool IsLeaf => LeftChild == null && RightChild == null;

            // Creates a leaf. So just a node is created with the given values.
            public static Node CreateLeaf(char key, int weight) => new Node(key.ToString(), weight, null, null);
            // Creates a branch. Here a node is created by adding the keys and weights of both childs together.
            public static Node CreateBranch(Node leftChild, Node rightChild) => new Node(leftChild.Key + rightChild.Key, leftChild.Weight + rightChild.Weight, leftChild, rightChild);
            private Node(string key, int weight, Node leftChild, Node rightChild)
            {
                this.Key = key;
                this.Weight =  weight;
                this.LeftChild = leftChild;
                this.RightChild = rightChild;
            }

            public int CompareTo(Node other) => this.Weight - other.Weight;
        }

        // Node with biggest value at the top.
        class NodePriorityList
        {
            public int Count => nodes.Count;

            private List<Node> nodes = new List<Node>();

            public NodePriorityList() { }
            public NodePriorityList(List<Node> givenNodes)
            {
                this.nodes = givenNodes.ToList();
                this.nodes.Sort();
            }

            public void Add(Node newNode)
            {
                var index = this.nodes.BinarySearch(newNode);
                if (index < 0)
                    this.nodes.Insert(~index, newNode);
                else
                    this.nodes.Insert(index, newNode);
            }

            public Node Pop()
            {
                var result = this.nodes[0];
                this.nodes.RemoveAt(0);
                return result;
            }
        }

        public EncodingResult Encode(string input)
        {
            var root = CreateTree(input);
            var dictionary = CreateDictionary(root);
            var bitString = CreateBitString(input, dictionary);

            return new EncodingResult(bitString, dictionary, root);
        }

        public string Decode(EncodingResult result)
        {
            var output = "";
            Node currentNode = result.Tree;
            foreach (var bit in result.BitString)
            {
                // Go down the tree.
                if (bit == '0')
                    currentNode = currentNode.LeftChild;
                else
                    currentNode = currentNode.RightChild;

                if (currentNode.IsLeaf)
                {
                    output += currentNode.Key;
                    currentNode = result.Tree;
                }
            }
            return output;
        }

        private Node CreateTree(string input)
        {
            // Create a List of all characters and their count in input by putting them into nodes.
            var nodes = input
                .GroupBy(c => c)
                .Select(n => Node.CreateLeaf(n.Key, n.Count()))
                .ToList();

            // Convert list of nodes to a NodePriorityList.
            var nodePriorityList = new NodePriorityList(nodes);

            // Create Tree.
            while (nodePriorityList.Count > 1)
            {
                // Pop the two nodes with the smallest weights from the nodePriorityList and create a parentNode with the CreateBranch method. (This method adds the keys and weights of the childs together.)
                var leftChild = nodePriorityList.Pop();
                var rightChild = nodePriorityList.Pop();
                var parentNode = Node.CreateBranch(leftChild, rightChild);

                nodePriorityList.Add(parentNode);
            }

            return nodePriorityList.Pop();
        }


        private void CreateDictionary(Node node, string bitString, Dictionary<char, string> localDictionary)
        {
            if (node.IsLeaf)
                localDictionary.Add(node.Key[0], bitString);
            else
            {
                if (node.LeftChild != null)
                    CreateDictionary(node.LeftChild, bitString + '0', localDictionary);
                if (node.RightChild != null)
                    CreateDictionary(node.RightChild, bitString + '1', localDictionary);
            }
        }

        private Dictionary<char, string> CreateDictionary(Node root)
        {
            // We're using a string instead of a actual bits here, since it makes the code somewhat more readable and this is an educational example.
            var dictionary = new Dictionary<char, string>();
            CreateDictionary(root, "", dictionary);
            return dictionary;
        }


        private string CreateBitString(string input, Dictionary<char, string> dictionary)
        {
            // We're using a string right here. While no compression is achieved with a string, it's the easiest way to display what the compressed result looks like. Also this is just an educational example.
            var bitString = "";
            foreach (var character in input)
                bitString += dictionary[character];

            return bitString;
        }
    }
}

Program.cs

// submitted by Julian Schacher (jspp), thanks to gustorn for the help
using System.Collections;
using System.Collections.Generic;

namespace HuffmanCoding
{
    class Program
    {
        static void Main(string[] args)
        {
            var huffmanCoding = new HuffmanCoding();

            var result = huffmanCoding.Encode("bibbity bobbity");
            // The bitStrings are just strings and provide no compression. Look in HuffmanCoding.cs for explanation.
            // Print dictionary.
            foreach (var entry in result.Dictionary)
                System.Console.WriteLine($"{entry.Key} {entry.Value}");
            // Print BitString.
            System.Console.WriteLine($"{result.BitString} count: {result.BitString.Length}");

            var originalString = huffmanCoding.Decode(result);
            System.Console.WriteLine(originalString);
        }
    }
}

local function frequency_array(str)
  -- Collect all frequency values into a dict
  local map = {}
  for c in str:gmatch(".") do -- Iterate over each character in str
    map[c] = (map[c] or 0) + 1 -- Increment map[c] (default 0) by 1
  end

  -- We have a dict of frequencies but we want it in a sorted list
  -- Dump each key value pair into an array
  local arr = {}
  for k, v in pairs(map) do
    arr[#arr + 1] = {k, v}
  end
  table.sort(arr, function(a, b) return a[2] > b[2] end) -- Sort by frequency descending
  return arr
end

local function build_huffman_tree(message)

  if #message == 0 then return end

  local freq = frequency_array(message)

  while #freq > 1 do -- Repeat until we have only 1 node

    -- Take two of the least frequent nodes
    local node1, node2 = table.remove(freq), table.remove(freq)

        -- Group node values in first index, and sum of node frequencies in second
    local node3 = { {node1[1], node2[1] }, node1[2] + node2[2] }

    local i = 1
    while i <= #freq and freq[i][2] <= node3[2] do -- Sorted insertion, faster than inserting then sorting again
      i = i + 1
    end

    table.insert(freq, i, node3)
  end

  return freq[1][1] -- Return value of only element in freq array
end

local function _create_codebook(node, codebook, code)
  if not node then
    return
  elseif type(node) == "string" then
    codebook[node] = code -- if node is a leaf then add it to codebook
  else
    _create_codebook(node[1], codebook, code .. "0") -- Left side
    _create_codebook(node[2], codebook, code .. "1") -- Right side
  end
end

local function create_codebook(tree)
  local codebook = {}
  _create_codebook(tree, codebook, "")
  return codebook
end

local function huffman_encode(codebook, message)
  local encoded_chars = {}
  for c in message:gmatch(".") do -- Iterate over each character in message
    encoded_chars[#encoded_chars + 1] = codebook[c]
  end
  return table.concat(encoded_chars) -- table.concat to avoid slow string bufferin
end

local function _huffman_decode(node, bitstring, i)
  if type(node) == "string" then
    return node, i -- If it's a leaf node then return the value along with the next bit to read
  end
  if bitstring:sub(i, i) == "0" then
    return _huffman_decode(node[1], bitstring, i + 1) -- If it's 0 traverse down the left side
  elseif bitstring:sub(i, i) == "1" then
    return _huffman_decode(node[2], bitstring, i + 1) -- If it's 1 traverse down the right side
  end
end

local function huffman_decode(tree, bitstring)
  -- i is the current position in the bitstring, we can track which bit we are to look at next without using string.sub
  local decoded_chars, i = {}, 1
  while i <= #bitstring do
    decoded_chars[#decoded_chars + 1], i = _huffman_decode(tree, bitstring, i)
  end

  return table.concat(decoded_chars)
end

local message = "bibbity_bobbity"

local tree = build_huffman_tree(message)
local codebook = create_codebook(tree)

local bitstring = huffman_encode(codebook, message)
print("Encoded: " .. bitstring)

print("Decoded: " .. huffman_decode(tree, bitstring))

#include <algorithm>
#include <array>
#include <bitset>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <limits>
#include <memory>
#include <string>
#include <utility>
#include <vector>

#include <iostream>

using std::begin;
using std::end;

namespace huffman {

[[noreturn]] inline void unreachable() {
  std::cerr << "this should never happen\n";
  std::abort();
}

// --- interface ---
class codebook {
  struct node {
    int frequency;

    node(int freq) : frequency(freq) {}
    virtual ~node() = 0;
  };

  // never null
  using node_ptr = std::unique_ptr<node const>;
  using bitstring = std::vector<bool>;

  // this is a flatmap between char and a bitstring
  // there should only ever be one character with a given
  // value at any time.
  using encoder_t = std::vector<std::pair<char, bitstring>>;

  struct leaf final : node {
    char key;

    leaf(int freq, char key) : node(freq), key(key) {}
  };

  struct branch final : node {
    node_ptr lhs;
    node_ptr rhs;

    branch(node_ptr lhs, node_ptr rhs)
        : node(lhs->frequency + rhs->frequency), lhs(std::move(lhs)),
          rhs(std::move(rhs)) {}
  };

  // this allows us to share [codebook]s among encoded strings
  struct data {
    node_ptr decoder;
    encoder_t encoder;
  };
  std::shared_ptr<data const> underlying_;

public:
  template <typename Iter>
  codebook(Iter const first, Iter const last);

  template <typename Iter>
  std::vector<bool> encode(Iter first, Iter last) const;

  template <typename Iter>
  std::string decode(Iter first, Iter last) const;
};

struct encoded_string {
  codebook codes;
  std::vector<bool> string;

  explicit encoded_string(std::string const& s)
      : codes(begin(s), end(s)), string(codes.encode(begin(s), end(s))) {}

  encoded_string(codebook codes, std::string const& s)
      : codes(codes), string(codes.encode(begin(s), end(s))) {}

  std::string decoded() const {
    return codes.decode(begin(string), end(string));
  }
};

// --- implementation ---
inline codebook::node::~node() {}

inline std::vector<bool> with_new_bit(std::vector<bool> bits, bool b) {
  bits.push_back(b);
  return bits;
}

template <typename Iter>
codebook::codebook(Iter const first, Iter const last) {
  struct helper {
    static node_ptr make_decoder(Iter const first, Iter const last) {
      // in this part of the function, we build up a frequency list
      // sorted by frequency, descending
      auto freq = std::vector<leaf>();

      std::for_each(first, last, [&freq](char c) {
        auto const it = std::find_if(
            begin(freq), end(freq), [c](auto const& p) { return p.key == c; });
        if (it != end(freq)) {
          // it's already in the list
          it->frequency += 1;
        } else {
          // it's not already in the list
          freq.emplace_back(1, c);
        }
      });

      if (freq.empty()) {
        throw std::invalid_argument("attempted to codebook an empty range");
      }

      std::sort(begin(freq), end(freq), [](auto const& lhs, auto const& rhs) {
        return lhs.frequency > rhs.frequency;
      });

      auto ret = std::vector<std::unique_ptr<node>>();
      std::transform(
          begin(freq), end(freq), std::back_inserter(ret), [](auto const l) {
            return std::make_unique<leaf>(l);
          });

      while (ret.size() > 1) {
        auto rhs = std::move(ret.back());
        ret.pop_back();
        auto lhs = std::move(ret.back());
        ret.pop_back();

        auto new_node =
            std::make_unique<branch>(std::move(lhs), std::move(rhs));
        auto const new_freq = new_node->frequency;

        // look for the first element with a smaller frequency
        auto const it =
            std::find_if(begin(ret), end(ret), [new_freq](auto const& n) {
              return n->frequency < new_freq;
            });
        // and insert the new_node before that element
        ret.insert(it, std::move(new_node));
        // in this way, we keep the list sorted by frequency
      }

      return std::move(ret.back());
    }

    static void
    encoder_rec(node const* cur, std::vector<bool> bits, encoder_t& out) {
      if (auto l = dynamic_cast<leaf const*>(cur)) {
        out.push_back(std::make_pair(l->key, std::move(bits)));
      } else if (auto b = dynamic_cast<branch const*>(cur)) {
        encoder_rec(b->lhs.get(), with_new_bit(bits, 0), out);
        encoder_rec(b->rhs.get(), with_new_bit(std::move(bits), 1), out);
      } else {
        unreachable();
      }
    }

    static encoder_t make_encoder(node const& decoder) {
      auto ret = encoder_t();

      encoder_rec(&decoder, std::vector<bool>(), ret);

      return ret;
    }
  };

  auto decoder = helper::make_decoder(first, last);
  auto encoder = helper::make_encoder(*decoder);
  underlying_ = std::make_shared<data const>(
      data{std::move(decoder), std::move(encoder)});
}

template <typename Iter>
std::vector<bool> codebook::encode(Iter const first, Iter const last) const {
  std::vector<bool> ret;

  auto& encoder = underlying_->encoder;
  std::for_each(first, last, [&ret, &encoder](char c) {
    auto const it =
        std::find_if(begin(encoder), end(encoder), [c](auto const& p) {
          return p.first == c;
        });
    if (it != end(encoder)) {
      auto const& code = it->second;
      std::copy(begin(code), end(code), std::back_inserter(ret));
    } else {
      throw std::invalid_argument(
          "The range has a character which was not in the huffman set");
    }
  });

  return ret;
}

template <typename Iter>
std::string codebook::decode(Iter const first, Iter const last) const {
  std::string ret;

  node const* const top = underlying_->decoder.get();

  // returns a pair:
  // the second member is the decoded character
  // the first member is the place we've gotten to in the range
  // i.e., if [0] is an 'a', and we have
  // [it, last) = { 0, 1, 1, 0 }
  // we return (it', 'a') such that
  // [it', last) = { 1, 1, 0 }
  auto decode_single =
      [top](Iter it, Iter const last) -> std::pair<Iter, char> {
    node const* current_node = top;

    for (; it != last; ++it) {
      if (auto l = dynamic_cast<leaf const*>(current_node)) {
        return std::make_pair(it, l->key);
      } else if (auto b = dynamic_cast<branch const*>(current_node)) {
        if (*it) {
          current_node = b->rhs.get();
        } else {
          current_node = b->lhs.get();
        }
      } else {
        unreachable();
      }
    }

    if (auto l = dynamic_cast<leaf const*>(current_node)) {
      return std::make_pair(last, l->key);
    } else {
      throw std::invalid_argument(
          "The range was not encoded with this huffman set");
    }
  };

  for (auto it = first; it != last;) {
    auto p = decode_single(it, last);
    it = p.first;
    ret.push_back(p.second);
  }

  return ret;
}

} // namespace huffman

int main() {
  std::string to_be_encoded = R"(bibbity bobbity)";

  auto encoded = huffman::encoded_string(to_be_encoded);

  std::cout << "Encoded, the string looks like: ";
  for (bool b : encoded.string) {
    std::cout << b;
  }
  std::cout << "\nand decoded, the string looks like: " << encoded.decoded();
  std::cout << "\n\nAs opposed to the original, which is "
            << to_be_encoded.size() * 8 << " bits, the encoded has size "
            << encoded.string.size() << ".\n";
}

;; earthfail
(ns experiments.core)

;; get a vector with chars and frequencies

(defn tree-string [st]
  "take a string st and return the huffmantree with the frequency of
   each character included"
  ;; vector of [character frequency] pair
  ;; for every char in string, added it to hash-map
  ;; with value one if it doesn't exist or increment its value
  (def cf-vec (vec
               (reduce (fn [m c]
                         (assoc m c (inc (get m c 0))))
                       {}
                       st)))
  ;; make a sorted list with nodes with bigger frequencies first
  ;; take the last two which will help in dividing the tree
  ;; the first and last elements before and after
  ;; the smallest two in the tree shouldn't change
  (loop [tree (sort-by last > cf-vec)]
    (if (< (count tree) 2)
      (first tree) ; only of tree is one node or nil
      (let [sorted-tree (sort-by last > tree)
            mid (take-last 2 sorted-tree)
            set-mid (set mid)
            func (complement (partial contains? set-mid))
            firsty (take-while func tree)
            [middle lasty] (split-at 2
                                     (drop-while func tree))]
        (recur
         (concat
          firsty
          ;; make a list with the two element in one list and
          ;; the sum of their frequencies e.g
          ;; '(((node1 f1) (node2 f2)) f1+f2)
          (list (list middle (reduce #(+ %1 (last %2)) 0 middle)))
          lasty))))))

(defn remove-freq [tree]
  "remove the frequencies in the huffmantree tree"
  (cond
    (char? tree) tree                 ; check if this is a branch
    ;; if the tree is a node and frequency then ignore frequency
    (integer? (second tree)) (remove-freq (first tree)) ;remove the frequency
    ;; if the tree consists of two nodes then apply to both and combine
    :else (list (remove-freq (first tree))
                (remove-freq (second tree)))))

(defn hash-tree [tree]
  "make a hashmap with code for each letter as key and the letter as
  value"
  (cond
    (char? tree) {"" tree}
    :else
    (let [left-map (hash-tree (first tree))
          right-map (hash-tree (second tree))
          func #(apply hash-map         ; apply hash-map because
                                        ; interleave return a seq
                       (interleave
                        (map (partial str %2) (keys %1)) ;add 0 or 1
                                        ;to the start
                                        ;of the keys
                        (vals %1)))]
      ;; add "0" to the keys of left nodes and "1" to the right nodes
      (merge (func left-map "0") (func right-map "1")))))


(defn coder [s hash-coder]
  "take a string s and return a coded string"
  (apply str (map hash-coder s)))

(defn decoder [s hash-decoder]
  "takes a string s and a hash-map hash-decoder and decode s"
  ;; code keyword in hashmap is for storing codes untill they are
  ;; complete and can be decoded with the decoder
  (get (reduce (fn [m code]             ; reduce return {:message
                                        ; message,:code _}
                 (let [new-code (str (m :code) code)]
                   (if-let  [letter (get hash-decoder new-code)]
                     ;; if there is a letter then add it to :message
                     ;; and revert :code to empty
                     (assoc (update m :message #(str % letter))
                            :code "")
                     ;; if there is not a letter then just add the
                     ;; code letter to the :code
                     (update m :code #(str % code)))))
               {:message "",:code ""}
               s)
       :message))           ;extract :message  value
;; ----------------EXAMPLE----------------
(def st "(bibbity bobbity)")

(def hash-decoder (->>
                   st
                   tree-string
                   remove-freq
                   hash-tree))
(def hash-coder (clojure.set/map-invert hash-decoder))
(println "coding...")
(def code (coder st hash-coder))
(clojure.pprint/pprint code)

(println "\ndecoding...")
(clojure.pprint/pprint (decoder code hash-decoder))

# Huffman Encoding
# Python 2.7+
# Submitted by Matthew Giallourakis

from collections import Counter

# constructs the tree
def build_huffman_tree(message):

    # get sorted list of character and frequency pairs
    frequencies = Counter(message)
    trees = frequencies.most_common()

    # while there is more than one tree
    while len(trees) > 1:

        # pop off the two trees of least weight from the trees list
        tree_left,weight_left = trees.pop()
        tree_right,weight_right = trees.pop()

        # combine the nodes and add back to the nodes list
        new_tree = [tree_left, tree_right]
        new_weight = weight_left + weight_right

        # find the first tree that has a weight smaller than new_weight and returns its index in the list
        # If no such tree can be found, use len(trees) instead to append
        index = next((i for i, tree in enumerate(trees) if tree[1] < new_weight), len(trees))

        # insert the new tree there
        trees.insert(index, (new_tree, new_weight))

    # Return an empty list if the message was empty, else the tree
    huffman_tree = [] if not trees else trees[0][0]
    return huffman_tree

# constructs the mapping with recursion
def build_codebook(tree, code=''):
    # Check whether our tree contains more than 1 element or not
    if not tree:
        return []
    elif type(tree) != list:
        return [(tree, '0')]

    codebook = []

    # split the tree
    left_tree, right_tree = tree

    # if the left node has children, find the mapping of those children
    # else pair the character with the current code + 0
    if type(left_tree) is list:
        codebook += build_codebook(left_tree, code+'0')
    else:
        codebook.append((left_tree, code+'0'))

    # if the right node has children, find the mapping of those children
    # else pair the character with the current code + 1
    if type(right_tree) is list:
        codebook += build_codebook(right_tree, code+'1')
    else:
        codebook.append((right_tree, code+'1'))
    return codebook

# encodes the message
def huffman_encode(codebook, message):

    encoded_message = ''

    # build a char -> code dictionary
    forward_dict = dict(codebook)

    # replace each character with its code
    for char in message:
        encoded_message += forward_dict[char]

    return encoded_message

# decodes a message
def huffman_decode(codebook, encoded_message):

    decoded_message = ''
    key = ''

    # build a code -> char dictionary
    inverse_dict = dict([(v, k) for k, v in codebook])

    # for each bit in the encoding
    # if the bit is in the dictionary, replace the bit with the paired character
    # else look at the bit and the following bits together until a match occurs
    # move to the next bit not yet looked at
    for index, bit in enumerate(encoded_message):
        key += bit
        if key in inverse_dict:
            decoded_message += inverse_dict[key]
            key = ''

    return decoded_message

def main():

    # test example
    message = 'bibbity_bobbity'
    tree = build_huffman_tree(message)
    codebook = build_codebook(tree)
    encoded_message = huffman_encode(codebook, message)
    decoded_message = huffman_decode(codebook, encoded_message)

    print('message: ' + message)
    print('huffman tree: ' + str(tree))
    print('codebook: ' + str(codebook))
    print('encoded message: ' + encoded_message)
    print('decoded message: ' + decoded_message)

    # prints the following:
    #
    #  message: bibbity_bobbity
    #  huffman_tree: ['b', [[['_', 'o'], 'y'], ['t', 'i']]]
    #  codebook: [('b', '0'), ('_', '1000'), ('o', '1001'),
    #             ('y', '101'), ('t', '110'), ('i', '111')]
    #  encoded_message: 01110011111010110000100100111110101
    #  decoded_message: bibbity_bobbity

if __name__ == '__main__':
    main()

function encode(str) {
  const tree = createTree(str);
  const codebook = createCodebook(tree);
  return {
    string: [...str].map(c => codebook[c]).join(""),
    tree,
    codebook
  };

  function createTree(str) {
    const chars = [...str];
    const charCounts = chars.reduce((counts, char) => {
      counts[char] = (counts[char] || 0) + 1;
      return counts;
    }, {});

    const nodes = Object.entries(charCounts).map(([key, weight]) => ({ key, weight }));
    const priorityQueue = makeQueue(nodes);
    while (priorityQueue.data.length > 1) {
      const left = priorityQueue.dequeue();
      const right = priorityQueue.dequeue();
      priorityQueue.enqueue({ weight: left.weight + right.weight, left, right });
    }
    return priorityQueue.dequeue();
  }

  function createCodebook(tree) {
    return recurse(tree, "", {});

    function recurse(node, bitstring, dict) {
      if (!node.left && !node.right) {
        dict[node.key] = bitstring;
      } else {
        if (node.left) {
          recurse(node.left, bitstring + "0", dict);
        }

        if (node.right) {
          recurse(node.right, bitstring + "1", dict);
        }
      }
      return dict;
    }
  }
}

function decode(bitstring, tree) {
  const result = [];
  let node = tree;

  for (const bit of [...bitstring]) {
    node = bit === "0" ? node.left : node.right;
    if (!node.left && !node.right) {
      result.push(node.key);
      node = tree;
    }
  }

  return result.join("");
}

// This queue implementation is horribly inefficient, but a proper, heap-based implementation would
// be longer that the algorithm itself
function makeQueue(iterable) {
  return {
    data: [...iterable].sort((a, b) => a.weight - b.weight),
    enqueue(value) {
      const target = this.data.findIndex(x => x.weight > value.weight);
      if (target === -1) {
        this.data.push(value);
      } else {
        this.data = [...this.data.slice(0, target), value, ...this.data.slice(target)];
      }
    },
    dequeue() {
      return this.data.shift();
    }
  };
}

const encoded = encode("bibbity bobbity");
const decoded = decode(encoded.string, encoded.tree);
console.log(encoded.string);
console.log(decoded);

import java.util.*;

class Huffman {
    public static void main(String[] args) {
        HuffmanTree huffmanTree = new HuffmanTree("bibbity_bobbity");
        huffmanTree.createTree();
        String encoded = huffmanTree.encode();
        System.out.println("Encoded String: " + encoded);
        System.out.println("Decoded String: " + huffmanTree.decode(encoded));
    }
}

class TreeNode {
    String letter = "";
    int frequency = 0;
    TreeNode left = null, right = null;

    public TreeNode(String letter, int frequency) {
        this.letter = letter;
        this.frequency = frequency;
    }

    public TreeNode(int frequency, TreeNode left, TreeNode right) {
        this.frequency = frequency;
        this.left = left;
        this.right = right;
    }
}

class HuffmanTree {
    private Map<String, Integer> frequencyMap = new HashMap<>();
    private Map<String, String> codeBook = new HashMap<>(), reverseCodeBook = new HashMap<>();
    private TreeNode root;
    private String stringToEncode;

    public HuffmanTree(String stringToEncode) {
        this.stringToEncode = stringToEncode;
    }

    public void createTree() {
        for (int i = 0; i < stringToEncode.length(); i++) {
            String key = Character.toString(stringToEncode.charAt(i));
            if (!frequencyMap.containsKey(key)) {
                frequencyMap.put(key, 1);
            } else {
                int frequency = frequencyMap.get(key) + 1;
                frequencyMap.replace(key, frequency);
            }
        }
        Queue<TreeNode> priorityQueue = new PriorityQueue<>(Comparator.comparingInt(o -> o.frequency));
        for (Map.Entry<String, Integer> m : frequencyMap.entrySet()) {
            priorityQueue.add(new TreeNode(m.getKey(), m.getValue()));
        }
        while (priorityQueue.size() > 1) {
            TreeNode left = priorityQueue.remove();
            TreeNode right = priorityQueue.remove();
            priorityQueue.add(new TreeNode(left.frequency + right.frequency, left, right));
        }
        root = priorityQueue.remove();
    }

    private void traverse(TreeNode node, StringBuilder code) {
        if (node.left == null && node.right == null) {
            codeBook.put(node.letter, code.toString());
        }
        if (node.left != null) {
            traverse(node.left, code.append(0));
            code.deleteCharAt(code.length() - 1);
        }
        if (node.right != null) {
            traverse(node.right, code.append(1));
            code.deleteCharAt(code.length() - 1);
        }
    }

    public void printCodeBook() {
        System.out.println("Code Book");
        for (Map.Entry<String, String> m : codeBook.entrySet()) {
            System.out.println(m.getKey() + "\t" + m.getValue());
        }
        System.out.println();
    }

    private void CodeBookReverse() {
        for (Map.Entry<String, String> m : codeBook.entrySet()) {
            reverseCodeBook.put(m.getValue(), m.getKey());
        }
    }

    public String encode() {
        traverse(root, new StringBuilder());
        StringBuilder encode = new StringBuilder();
        for (int i = 0; i < stringToEncode.length(); i++) {
            String k = Character.toString(stringToEncode.charAt(i));
            encode.append(codeBook.get(k));
        }
        printCodeBook();
        return encode.toString();
    }

    public String decode(String encoded) {
        StringBuilder decoded = new StringBuilder(), key = new StringBuilder();
        CodeBookReverse();
        for (int i = 0; i < encoded.length(); i++) {
            key = key.append(encoded.charAt(i));
            if (reverseCodeBook.containsKey(key.toString())) {
                decoded.append(reverseCodeBook.get(key.toString()));
                key = new StringBuilder();
            }
        }
        return decoded.toString();
    }
}

package main

import (
    "container/heap"
    "fmt"
)

type node struct {
    freq  int
    char  rune
    left  *node
    right *node
}

type codebook map[rune]string
type nodeHeap []*node

func (n nodeHeap) Len() int           { return len(n) }
func (n nodeHeap) Less(i, j int) bool { return n[i].freq > n[j].freq }
func (n nodeHeap) Swap(i, j int)      { n[i], n[j] = n[j], n[i] }

func (n *nodeHeap) Push(x interface{}) {
    if node, ok := x.(*node); ok {
        *n = append(*n, node)
    } else {
        fmt.Printf("I got a node of Type %T\n", x)
    }
}

func (n *nodeHeap) Pop() interface{} {
    old := *n
    l := len(old)
    x := old[l-1]
    *n = old[0 : l-1]
    return x
}

func buildTree(message string) *node {
    freqMap := make(map[rune]*node)
    h := new(nodeHeap)
    heap.Init(h) // really needed?

    for _, char := range message {
        if _, ok := freqMap[char]; ok {
            freqMap[char].freq++
        } else {
            newNode := new(node)
            newNode.freq = 1
            newNode.char = char
            freqMap[char] = newNode
            heap.Push(h, newNode)
        }
    }

    for h.Len() > 1 {
        left, right := h.Pop().(*node), h.Pop().(*node)
        branch := new(node)
        branch.freq = right.freq + left.freq
        branch.left = left
        branch.right = right
        heap.Push(h, branch)
    }

    root := heap.Pop(h).(*node)
    return root
}

func codebookRecurse(node *node, cb *codebook, code []rune) {
    if node == nil {
        return
    }

    if node.left == nil && node.right == nil {
        (*cb)[node.char] = string(code)
    }

    code = append(code, '0')
    codebookRecurse(node.left, cb, code)
    code = append(code[:len(code)-1], '1')
    codebookRecurse(node.right, cb, code)
}

func encode(message string) (string, *node, codebook) {
    ret := ""
    root := buildTree(message)
    cb := generateCodebook(root)
    for _, char := range message {
        ret += cb[char]
    }

    return ret, root, cb
}

func decode(message string, root *node) string {
    cur := root
    ret := ""

    for _, char := range message {
        if cur == nil {
            return message
        }

        switch string(char) {
        case "0":
            if cur.left == nil {
                ret += string(cur.char)
                cur = root.left
            } else {
                cur = cur.left
            }
        case "1":
            if cur.right == nil {
                ret += string(cur.char)
                cur = root.right
            } else {
                cur = cur.right
            }
        }
    }

    if cur.char != 0 {
        ret += string(cur.char)
    }

    return ret
}

func generateCodebook(root *node) codebook {
    cb := make(codebook)
    codeArr := make([]rune, 0)
    codebookRecurse(root, &cb, codeArr)
    return cb
}

func main() {
    enc, root, cb := encode("bibbity_bobbity")
    fmt.Println("Codebook:")
    for r, c := range cb {
        fmt.Println(string(r), "->", c)
    }
    fmt.Println("\nEncoded:", enc)
    fmt.Println("Decoded:", decode(enc, root))
}

.intel_syntax noprefix

# System V calling convention cheatsheet
# Params: rdi, rsi, rdx, rcx, r8, r9, xmm0-7
# Return: rax (int 64 bits), rax:rdx (int 128 bits), xmm0 (float)
# Callee cleanup: rbx, rbp, r12-15
# Scratch: rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11

.section .rodata
  text:       .string "bibbity bobbity"
  original:   .string "Original message: %s\n"
  encoded:    .string "Encoded message: "
  decoded:    .string "Decoded message: %s\n"

  .equ bitstr_len,       32
  .equ bitstr_size,      40
  .equ codebook_size,    256 * bitstr_size

  .equ tree_left,        0
  .equ tree_right,       8
  .equ tree_count,       16
  .equ tree_value,       20
  .equ tree_size,        24

  .equ heap_len,         0
  .equ heap_data,        4
  .equ heap_size,        512 * 8 + 16         # 512 ptrs + 4 byte length + 12 byte padding
  .equ counts_size,      256 * 4

  .equ msg_len,          0
  .equ msg_data,         8
.section .text
  .global main
  .extern printf, calloc, malloc, memset, puts

main:
  push   r12
  push   r13
  sub    rsp, codebook_size + 16              # 8 extra bytes for the Huffman-tree ptr, 8 bytes for padding

  # Print the original text
  mov    rdi, OFFSET original
  mov    rsi, OFFSET text
  xor    rax, rax
  call   printf

  # First encode the text. This will also initialize the Huffman-tree and the codebook
  mov    rdi, OFFSET text
  mov    rsi, rsp
  lea    rdx, [rsp + codebook_size]
  call   encode
  mov    r12, rax                             # Save the returned message ptr

  # Print the codebook and the encoded message
  mov    rdi, rsp
  call   print_codebook
  mov    rdi, OFFSET encoded
  xor    rax, rax
  call   printf
  mov    rdi, r12
  call   print_message

  # Decode and print the message
  mov    rdi, r12
  mov    rsi, QWORD PTR [rsp + codebook_size]
  call   decode
  mov    r13, rax
  mov    rdi, OFFSET decoded
  mov    rsi, r13
  xor    rax, rax
  call   printf

  # Free allocated resources
  mov    rdi, r12
  call   free
  mov    rdi, r13
  call   free
  mov    rdi, QWORD PTR [rsp + codebook_size]
  call   free_tree

  add    rsp, codebook_size + 16
  pop    r13
  pop    r12

  # Indiciate success with a 0 exit code
  xor    rax, rax
  ret

# rdi - text
# rsi - codebook ptr
# rdx - Huffman-tree ptr
# RET rax - encoded message ptr
encode:
  push   r12
  push   r13
  push   r14
  mov    r12, rdi                             # Save the original arguments
  mov    r13, rsi
  mov    r14, rdx
  call   generate_tree                        # The text is already in rdi
  mov    QWORD PTR [r14], rax                 # Save the Huffman-tree's root
  mov    rdi, r13                             # Set up the parameters for codebook generation: codebook ptr, Huffman-tree root
  mov    rsi, rax
  call   generate_codebook
  xor    rax, rax
  xor    r14, r14                             # We'll use r14 to keep track of the length of the message
  mov    rcx, r12                             # Make a copy of the pointer to the message to be encoded
encode_calculate_length:
  mov    al, BYTE PTR [rcx]
  test   al, al                               # If we're at the terminating null character then we're ready to encode
  jz     encode_message
  lea    rdx, [rax + 4*rax]                   # We get the codebook entry at the specific index
  lea    r8, [r13 + 8*rdx]
  add    r14, QWORD PTR [r8 + bitstr_len]     # And add the encoded word length to the total
  inc    rcx
  jmp    encode_calculate_length
encode_message:
  mov    rdi, 1
  lea    rsi, [r14 + 7]                       # Calculate the number of bytes we need to allocate to fit all the bits
  shr    rsi, 3                               # length % 8 rounded up = (length + 8 - 1) / 8
  lea    rsi, [rsi + 8]                       # Make space for an 8-byte length field
  call   calloc                               # Allocate the necessary memory, the message will be in rax
  mov    QWORD PTR [rax], r14                 # Save the length of the message
  # Registers:
  #   - r12: text
  #   - r13: codebook_ptr
  #   - rax: message ptr
  #   - free to use: rdi, rsi, rcx, rdx, r8, r9, r10, r11, r14
  xor    r8, r8                               # Bit offset
  lea    r9, [rax + 8]                        # 8-byte message block
encode_message_bits:
  xor    rdi, rdi                             # We need to clear rdi because moving a single byte to dil doesn't do so
  mov    dil, BYTE PTR [r12]                  # Iterate the message again
  test   dil, dil                             # If we're at the the null terminator we're done
  jz     encode_done
  lea    rdx, [rdi + 4*rdi]                   # Get the codebook entry
  lea    r10, [r13 + 8*rdx]
  mov    r11, QWORD PTR [r10 + bitstr_len]    # Load the bitstring length
  lea    r14, [r10]                           # The bitstring qword we're currently processing
encode_message_bits_qword:
  mov    rdi, QWORD PTR [r14]                 # Calculate the first mask: [code qword] << [bit offset]
  mov    rsi, rdi                             # Get a second copy of the code's current qword
  mov    rcx, r8
  shl    rdi, cl
  or     QWORD PTR [r9], rdi                  # Apply the mask to the current block
  mov    rcx, 64                              # Calculate the second mask: [code qword] >> [64 - bit offset]
  sub    rcx, r8
  shr    rsi, cl
  mov    rcx, r11                             # Copy the code length so we can manipulate it without destroying the original value
  sub    rcx, 64
  jle    encode_message_bits_try_overflow     # If the length was less than or equal to 64, check if the code qword would overflow the current message block
  mov    r11, rcx                             # We wanted to subtract 64 from the code length anyway
  lea    r9, [r9 + 8]                         # Load the next message block
  or     QWORD PTR [r9], rsi                  # Save the second mask to the new message block
  jmp    encode_message_bits_qword
encode_message_bits_try_overflow:
  add    rcx, r8                              # Calculate [code length] + [bit offset] - 64
  jl     encode_calculate_new_bit_offset      # If the result is less than 0 then we have no remaining bits -> calculate the new bit offset
  mov    r8, rcx                              # Otherwise this also happens to be our new bit offset
  lea    r9, [r9 + 8]                         # Load the next message block
  or     QWORD PTR [r9], rsi                  # Save the second mask to the new message block
  inc    r12                                  # Go to the next character in the input
  jmp    encode_message_bits
encode_calculate_new_bit_offset:
  lea    r8, [r8 + r11]                       # Calculate the bit offset for the next code qword
  inc    r12
  jmp    encode_message_bits
encode_done:
  pop    r14
  pop    r13
  pop    r12
  ret

# rdi - encoded message
# rsi - Huffman-tree root (ptr)
# RET rax - the decoded message
decode:
  push   r12
  push   r13
  push   r14
  mov    r12, rdi
  mov    r13, rsi
  mov    rdi, QWORD PTR [r12]                 # Load the length of the message
  mov    r14, rdi                             # We'll use the length of the message as a loop counter later
  lea    rdi, [rdi + 1]                       # The null terminator
  call   malloc                               # This will usually be more than enough memory to contain the whole decoded message (we don't handle pathological cases right now)
  mov    rdi, r12                             # The single-character decoder doesn't touch rdi so we can hoist it before the loop
  xor    rcx, rcx
  mov    rdx, rax                             # The current byte in the output string
decode_loop:
  cmp    rcx, r14                             # The encoded message bit counter
  jge    decode_done
  mov    rsi, r13                             # The current node in the Huffman-tree
decode_loop_char:
  test   rsi, rsi                             # If the Huffman-tree node is null then we reached a dead-end -> start over
  jz     decode_loop
  cmp    QWORD PTR [rsi + tree_left], 0       # If the node has either a left or a right child, treat it as a branch
  jnz    decode_loop_char_branch
  cmp    QWORD PTR [rsi + tree_right], 0
  jnz    decode_loop_char_branch
  mov    r9d, DWORD PTR [rsi + tree_value]    # Load the value in this node in case the next iteration needs it
  mov    BYTE PTR [rdx], r9b                  # And save it to the output
  lea    rdx, [rdx + 1]                       # Advance the output string
  jmp    decode_loop
decode_loop_char_branch:
  mov    r9, rcx                              # First, load the byte of the message the current bit is in
  shr    r9, 3
  mov    r10b, BYTE PTR [rdi + r9 + msg_data]
  mov    r11, rcx                             # Save rcx in another register temporarily so we can restore it without push/pop
  and    rcx, 7
  shr    r10, cl                              # Get the bit we're interested in to position 0
  lea    rcx, [r11 + 1]                       # Restore rcx and immediately add 1 to get the next bit to decode
  and    r10, 0x1                             # Zero out all other bits
  mov    r8, rsi
  mov    rsi, QWORD PTR [r8 + tree_left]      # Take the left branch for 0, the right branch for a non-zero bit
  cmovnz rsi, QWORD PTR [r8 + tree_right]
  jmp    decode_loop_char
decode_done:
  mov    BYTE PTR [rdx], 0                    # Write the null terminator at the end of the string
  pop    r14
  pop    r13
  pop    r12
  ret

# rdi - The starting address of the codebook we want to generate
# rsi - Huffman-tree root (ptr)
generate_codebook:
  push   r12
  sub    rsp, bitstr_size + 16                # 16 extra bytes for alignment
  mov    r12, rsi
  xorps  xmm0, xmm0                           # Create a 0-initialized bitstring. This will be
  movaps XMMWORD PTR [rsp], xmm0              # used in the recursive function calls
  movaps XMMWORD PTR [rsp + 16], xmm0
  mov    QWORD PTR [rsp + 32], 0
  xor    rsi, rsi
  mov    rdx, codebook_size
  call   memset
  mov    rdi, rax
  mov    rsi, r12
  mov    rdx, rsp
  call   generate_codebook_recurse
  add    rsp, bitstr_size + 16
  pop    r12
  ret

# rdi - The codebook's starting address
# rsi - The current Huffman-tree node
# rdx - The bitstring used for code generation
generate_codebook_recurse:
  push   rbp
  push   r12
  push   r13
  test   rsi, rsi                             # If we reached a null pointer we're done
  jz     generate_codebook_recurse_done
  mov    r12, rsi
  cmp    QWORD PTR [r12 + tree_left], 0       # If at least one of the children is not null
  jnz    generate_codebook_branch             # then we need to treat the current node as a branch
  cmp    QWORD PTR [r12 + tree_right], 0
  jnz    generate_codebook_branch
  mov    r8d, DWORD PTR [r12 + tree_value]    # Get the value of the current node
  movaps xmm0, XMMWORD PTR [rdx]              # Get the values of the current bitstring into some registers
  movaps xmm1, XMMWORD PTR [rdx + 16]
  mov    r9, QWORD PTR [rdx + 32]
  lea    rax, [r8 + 4*r8]                     # The index calculation needs to add 40 * index. With lea arithmetic this can be represented as
  lea    r10, [rdi + 8*rax]                   # base address + 8 * (5 * index). This is done in two lea instructions
  movups XMMWORD PTR [r10], xmm0              # And copy the data over to it
  movups XMMWORD PTR [r10 + 16], xmm1
  mov    QWORD PTR [r10 + 32], r9
  jmp    generate_codebook_recurse_done
generate_codebook_branch:
  # First, calculate the necessary indices and bitmask to use for the bitstring
  mov    r13, QWORD PTR [rdx + bitstr_len]    # Load the current length of the bitstring
  mov    rcx, r13                             # This will be used to index into the bitstring data. We'll need two copies for it
  shr    r13, 6                               # We first get which 64 bit chunk of the bitstring we want to modify
  and    rcx, 63                              # Then the bit we want to change
  mov    rbp, 1                               # Generate the mask we'll use to set the correct bit
  shl    rbp, cl
  # We'll start with the right branch
  or     QWORD PTR [rdx + 8*r13], rbp         # Set the bit
  inc    QWORD PTR [rdx + bitstr_len]         # Increase the bitstring length
  mov    rsi, QWORD PTR [r12 + tree_right]
  call   generate_codebook_recurse
  # Now we move on to the left branch: rbx - left child, r13 - bitstring index, rbp - mask
  not    rbp
  and    QWORD PTR [rdx + 8*r13], rbp
  mov    rsi, QWORD PTR [r12 + tree_left]
  call   generate_codebook_recurse
  dec    QWORD PTR [rdx + bitstr_len]         # Decrease the bitstring length
generate_codebook_recurse_done:
  pop    r13
  pop    r12
  pop    rbp
  ret

# rdi - text
# RET rax - Huffman-tree root (ptr)
generate_tree:
  push   r12
  push   r13
  sub    rsp, 5128                            # 1024 bytes for the char counts, 4 bytes for heap length, 4096 bytes for the heap, 4 byte padding
  mov    r12, rdi                             # Save the original text so it doesn't get clobbered
  mov    rdi, rsp                             # Zero out the character counts and the heap length
  xor    rsi, rsi
  mov    rdx, 1040
  call   memset
  xor    rax, rax
generate_tree_count_chars:
  mov    al, BYTE PTR [r12]
  test   al, al
  jz     generate_tree_leaves_setup
  inc    DWORD PTR [rsp + 4*rax]
  inc    r12
  jmp    generate_tree_count_chars
generate_tree_leaves_setup:
  mov    r12, 255                             # The loop counter. We can only get here if the "test" on line 301 resulted in a zero so the next jl instruction will do the right thing
generate_tree_leaves:
  jl     generate_tree_one_leaf               # If not then it's time to generate the branches
  mov    r13d, DWORD PTR [rsp + 4*r12]        # Load the count at the ith position
  test   r13d, r13d                           # And check if it's zero
  jz     generate_tree_leaves_counters        # If it is we can skip this iteration
  mov    rdi, 1                               # If not, we need to allocate a new leaf node
  mov    rsi, tree_size
  call   calloc
  mov    DWORD PTR [rax + tree_value], r12d   # Save the value and the count to the tree
  mov    DWORD PTR [rax + tree_count], r13d
  lea    rdi, [rsp + counts_size]             # Then push it onto the heap
  mov    rsi, rax
  call   heap_push
generate_tree_leaves_counters:
  dec    r12                                  # Decrement the loop counter and start over
  jmp    generate_tree_leaves
generate_tree_one_leaf:
  cmp    DWORD PTR [rsp + counts_size], 1     # Check if there is only one element in the heap
  jne    generate_tree_branches
  lea    rdi, [rsp + counts_size]             # Get the element
  call   heap_pop
  mov    r12, rax
  mov    rdi, tree_size                       # Create the new tree node, the pointer to it will be in rax
  call   malloc
  mov    QWORD PTR [rax + tree_left], r12     # Save element in the left node
  mov    ecx, DWORD PTR [r12 + tree_count]    # Save element count in branch
  mov    DWORD PTR [rax + tree_count], ecx
  jmp    generate_tree_ret                    # Returning
generate_tree_branches:
  cmp    DWORD PTR [rsp + counts_size], 1     # Check if there are still at least two elements in the heap
  jle    generate_tree_done                   # If not, we're done
  lea    rdi, [rsp + counts_size]             # Get the left child
  call   heap_pop
  mov    r12, rax
  lea    rdi, [rsp + counts_size]             # Get the right child
  call   heap_pop
  mov    r13, rax
  mov    rdi, tree_size                       # Create the new tree node, the pointer to it will be in rax
  call   malloc
  mov    ecx, DWORD PTR [r12 + tree_count]    # The new node's count: left count + right count
  add    ecx, DWORD PTR [r13 + tree_count]
  mov    QWORD PTR [rax + tree_left], r12     # Save the new node's fields: left, right, count (leave value unititialized, it shouldn't be used with branch nodes)
  mov    QWORD PTR [rax + tree_right], r13
  mov    DWORD PTR [rax + tree_count], ecx
  lea    rdi, [rsp + counts_size]             # Add the branch to the heap
  mov    rsi, rax
  call   heap_push
  jmp    generate_tree_branches
generate_tree_done:
  lea    rdi, [rsp + counts_size]             # The tree's root will be in rax after the pop
  call   heap_pop
generate_tree_ret:
  add    rsp, 5128
  pop    r13
  pop    r12
  ret

# rdi - heap ptr
# rsi - tree ptr
heap_push:
  lea    rax, QWORD PTR [rdi + heap_data]     # We load the heap's data ptr and length to the respective registers
  mov    ecx, DWORD PTR [rdi + heap_len]      # Load the current length
  lea    edx, [ecx + 1]                       # First, calculate the new length (length + 1)
  mov    DWORD PTR [rdi + heap_len], edx      # Then save it
  mov    QWORD PTR [rax + 8*rcx], rsi         # And finally add the new value at the end of the array
heap_push_sift_up:
  test   rcx, rcx                             # Test if we got to the root (index == 0)
  jz     heap_push_done
  lea    rdx, [rcx - 1]                       # Calculate the parent index: (index - 1) / 2
  shr    rdx, 1
  lea    r8, [rax + 8*rcx]                    # Get the pointer to the current and parent elements
  lea    r9, [rax + 8*rdx]
  mov    r10, QWORD PTR [r8]                  # Load the current and the parent elements
  mov    r11, QWORD PTR [r9]
  mov    esi, DWORD PTR [r10 + tree_count]    # Load the current tree's count
  cmp    DWORD PTR [r11 + tree_count], esi    # If parent count <= current count
  jle    heap_push_done                       # Then we're done
  mov    QWORD PTR [r8], r11                  # Otherwise swap the two elements
  mov    QWORD PTR [r9], r10
  mov    rcx, rdx
  jmp    heap_push_sift_up
heap_push_done:
  ret

# rdi - heap ptr
# RET rax - tree ptr
heap_pop:
  mov    r8d, DWORD PTR [rdi + heap_len]      # Load the heap's length
  test   r8d, r8d                             # If it's 0 then the heap's empty
  jz     heap_empty
  lea    rdx, [rdi + heap_data]               # Get the heap's data ptr
  mov    rax, QWORD PTR [rdx]                 # The return value will be the tree's current root
  lea    r8d, [r8d - 1]                       # Calculate the new length
  mov    DWORD PTR [rdi + heap_len], r8d      # And save it
  mov    rsi, QWORD PTR [rdx + 8*r8]          # Load the element we're going to swap with the root
  mov    QWORD PTR [rdx], rsi                 # Swap the root and the last element
  mov    QWORD PTR [rdx + 8*r8], rax
  xor    r9, r9                               # The loop index
heap_pop_sift_down:
  mov    rcx, r9                              # Save the target index at the start of the loop
  lea    r10, [r9 + r9 + 1]                   # The left child index
  lea    r11, [r9 + r9 + 2]                   # The right child index
  cmp    r10, r8
  jge    heap_pop_check_right
  mov    rdi, QWORD PTR [rdx + 8*r10]         # Load the left child
  mov    rsi, QWORD PTR [rdx + 8*rcx]         # Load the target
  mov    esi, DWORD PTR [rsi + tree_count]    # Load the target tree count
  cmp    DWORD PTR [rdi + tree_count], esi    # If the left tree count < target tree count
  jge    heap_pop_check_right
  mov    rcx, r10
heap_pop_check_right:
  cmp    r11, r8
  jge    heap_pop_compare_indices
  mov    rdi, QWORD PTR [rdx + 8*r11]         # Load the right child
  mov    rsi, QWORD PTR [rdx + 8*rcx]         # Load the target
  mov    esi, DWORD PTR [rsi + tree_count]    # Load the target tree count
  cmp    DWORD PTR [rdi + tree_count], esi    # If the right tree count < target tree count
  jge    heap_pop_compare_indices
  mov    rcx, r11
heap_pop_compare_indices:
  cmp    r9, rcx                              # If the target index == current index we're done
  je     heap_pop_done
  mov    rdi, QWORD PTR [rdx + 8*r9]          # Otherwise we swap the values
  mov    rsi, QWORD PTR [rdx + 8*rcx]
  mov    QWORD PTR [rdx + 8*r9], rsi
  mov    QWORD PTR [rdx + 8*rcx], rdi
  mov    r9, rcx
  jmp    heap_pop_sift_down
heap_empty:
  xor    rax, rax                             # Return a null pointer to indicate the heap was empty
heap_pop_done:
  ret

# rdi - codebook start ptr
print_codebook:
  push   rbx
  push   r12
  sub    rsp, 272                             # The bitstring we're going to print
  mov    r12, rdi
  xor    rbx, rbx                             # Save the loop counter into a register that doesn't get clobbered
print_codebook_loop:
  cmp    rbx, 255
  jg     print_codebook_done
  lea    rax, [rbx + 4*rbx]                   # We get the codebook entry at the specific index
  lea    r10, [r12 + 8*rax]
  mov    rdx, QWORD PTR [r10 + bitstr_len]    # Load the length of the bitstring
  test   rdx, rdx                             # If it's zero then the codepoint didn't exist in the original alphabet, skip
  jz     print_codebook_counters
print_codebook_char:
  mov    BYTE PTR [rsp], bl                   # First, the character we're printing the code for
  mov    WORD PTR [rsp + 1], 0x203a           # Then ": "
  mov    BYTE PTR [rsp + rdx + 3], 0x00       # At the end add the null terminator
print_codebook_generate_binary:
  dec    rdx
  jl     print_codebook_binary
  mov    r9, rdx                              # Two copies of the loop counter
  mov    rcx, rdx
  shr    r9, 6                                # Calculate the bitstring part we're going to load
  and    rcx, 63                              # The bit we're interested in
  mov    rsi, QWORD PTR [r10 + r9]            # One of the 4, 64 bit parts of the bitstring we're going to print
  shr    rsi, cl                              # Get the relevant bit into the 0th position
  and    rsi, 1                               # Mask the rest of the bits
  add    rsi, '0'                             # Convert it to ASCII
  mov    BYTE PTR [rsp + rdx + 3], sil        # And copy it into the string
  jmp    print_codebook_generate_binary
print_codebook_binary:
  mov    rdi, rsp                             # Print the current bitstring
  call   puts
print_codebook_counters:
  inc    rbx                                  # And go to the next codebook entry
  jmp    print_codebook_loop
print_codebook_done:
  add    rsp, 272
  pop    r12
  pop    rbx
  ret

# rdi - message ptr
# This would run out of stack space for long messages but it will do for now
print_message:
  push   r12
  push   r13
  mov    r12, rdi
  mov    r13, QWORD PTR [rdi]                 # Get the length of the message
  lea    rdi, [r13 + 1]                       # For the length of the string we'll need an additional the null terminator
  call   malloc
  xor    rdx, rdx
print_message_generate_string:
  cmp    rdx, r13
  jge    print_message_puts
  mov    r8, rdx                              # Get two copies of the current index
  mov    rcx, rdx
  shr    r8, 3                                # We first get the byte we want to print
  mov    r10b, BYTE PTR [r12 + r8 + msg_data]
  and    rcx, 7                               # Then the bit in that byte
  shr    r10, cl
  and    r10, 0x1                             # Mask it so only the bit we're interested in is visible
  add    r10, '0'                             # Convert it to ASCII
  mov    BYTE PTR [rax + rdx], r10b           # Write it into the printable string
  inc    rdx
  jmp    print_message_generate_string
print_message_puts:
  mov    BYTE PTR [rax + rdx], 0x00           # Write the null terminator
  mov    rdi, rax                             # And print the string
  call   puts
  pop    r13
  pop    r12
  ret

# rdi - tree ptr
free_tree:
  push   rbx
  mov    rbx, rdi
  test   rbx, rbx                             # When the tree ptr we're trying to free is already null we reached the termination condition
  jz     free_tree_done
  mov    rdi, [rbx + tree_left]               # Otherwise free the left child first
  call   free_tree
  mov    rdi, [rbx + tree_right]              # Then the right child
  call   free_tree
  mov    rdi, rbx                             # And finally, the node itself
  call   free
free_tree_done:
  pop    rbx
  ret

import scala.collection.mutable.{Map, PriorityQueue}

object HuffmanEncoding {

  trait Node {
    var weight: Int
  }

  case class Leaf(char: Char, var weight: Int) extends Node

  case class Branch(left: Node, right: Node, var weight: Int) extends Node

  def createTree(phrase: String): Option[Node] = {

    val tree = PriorityQueue[Node]()(Ordering.by(-_.weight))
    tree ++= phrase
      .groupBy(identity)
      .mapValues(_.length)
      .map{
        case (char, count) => Leaf(char, count)
      }

    while (tree.size > 1) {
      val node1 = tree.dequeue()
      val node2 = tree.dequeue()
      tree += Branch(node1, node2, node1.weight + node2.weight)
    }

    tree.headOption
  }


  def createCodeBook(maybeRoot: Option[Node]): Map[Char, String] = {
    val codeBook = Map[Char, String]()

    def codeBookRecurse(node: Node, code: String): Unit =
      node match {
        case Leaf(symbol, _) => codeBook.put(symbol, code)
        case Branch(left, right, _) => {
          codeBookRecurse(left, code + "0")
          codeBookRecurse(right, code + "1")
        }
      }

    maybeRoot.foreach(c => codeBookRecurse(c, ""))

    codeBook
  }


  def encode(phrase: String, codeBook: Map[Char, String]): String = {
    phrase.flatMap(c => codeBook.getOrElse(c, "?"))
  }

  def decode(encoded: String, maybeRoot: Option[Node]): String = {
    val root = maybeRoot.getOrElse(Leaf('?', 0))
    var currentNode = root

    def chooseTreeBranch(bit: Char) =
      currentNode match {
        case Branch(left, right, _) =>
          currentNode = if (bit == '0') left else right
        case _ =>
      }

    def maybeGetACharacter =
      currentNode match {
        case Leaf(c, _) => {
          currentNode = root
          Some(c)
        }
        case _ => None
      }

    encoded
      .flatMap(bit => {
        chooseTreeBranch(bit)
        maybeGetACharacter
      })
  }

  def main(args: Array[String]): Unit = {
    val originalText = "bibbity_bobbity"
    println("Original Text: " + originalText)

    val tree = createTree(originalText)
    val codeBook = createCodeBook(tree)
    println("CodeBook is: " + codeBook)

    val encoded = encode(originalText, codeBook)
    println("Encoded text: " + encoded)

    val decoded = decode(encoded, tree)
    println("Decoded text: " + decoded)

  }

}

The code snippet was taken from this scratch project

from collections import Counter, deque
from bisect import bisect

class Tree

data Empty() from Tree
data Leaf(char, int(n)) from Tree:
    def __str__(self):
        return f'Leaf({self.char}, {self.n})'

    __repr__ = __str__

data Node(Tree(left), Tree(right)) from Tree:
    def __str__(self):
        return f'Node({str(self.left)}, {str(self.right)})'
    __repr__ = __str__

def weight(Tree()) = 0
addpattern def weight(Leaf(char, n)) = n
addpattern def weight(Node(left, right)) = weight(left) + weight(right)

def build_huffman_tree(message):

    # get sorted list of character and frequency pairs
    frequencies = Counter(message)
    trees = frequencies.most_common() |> map$(t -> Leaf(*t)) |> reversed |> deque

    if not trees:
        return Empty()

    # while there is more than one tree
    while len(trees) > 1:

        # pop off the two trees of least weight from the trees list
        tree_left = trees.popleft()
        tree_right = trees.popleft()

        # combine the nodes and add back to the nodes list
        new_tree = Node(tree_left, tree_right)

        # find the first tree that has a weight smaller than new_weight
        # and returns its index in the list.
        # If no such tree can be found, use len(trees) instead to append
        index = bisect(trees |> map$(weight) |> list, weight(new_tree))

        # insert the new tree there
        trees.insert(index, new_tree)

    huffman_tree = trees[0]
    return huffman_tree


def build_codebook(Empty(), code='') = []
addpattern def build_codebook(Leaf(char, n), code='') = [(char, code)]
addpattern def build_codebook(Node(left, right), code='') =
    build_codebook(left, code+'0') + build_codebook(right, code+'1')

def huffman_encode(codebook, message):

    if len(codebook) == 1:
        return '0' * len(message)

    # build a char -> code dictionary
    forward_dict = dict(codebook)

    return ''.join(message |> map$(forward_dict[]))

def huffman_decode(codebook, encoded_message):

    decoded_message = []
    key = ''

    if not codebook:
        return ''
    elif len(codebook) == 1:
        return codebook[0][0] * len(encoded_message)

    # build a code -> char dictionary
    inverse_dict = dict((v, k) for k, v in codebook)

    # for each bit in the encoding
    # if the bit is in the dictionary, replace the bit with the paired
    # character else look at the bit and the following bits together
    # until a match occurs move to the next bit not yet looked at.
    if encoded_message == '':
        return inverse_dict['']

    for bit in encoded_message:
        key += bit
        if key in inverse_dict:
            decoded_message.append(inverse_dict[key])
            key = ''

    return ''.join(decoded_message)


if __name__ == '__main__':
    # test example
    message = 'bibbity_bobbity'
    tree = build_huffman_tree(message)
    codebook = build_codebook(tree)
    encoded_message = huffman_encode(codebook, message)
    decoded_message = huffman_decode(codebook, encoded_message)

    print('message:', message)
    print('huffman tree:', tree)
    print('codebook:', codebook)
    print('encoded message:', encoded_message)
    print('decoded message:', decoded_message)

    # prints the following:
    #
    #  message: bibbity_bobbity
    #  huffman_tree: Node(Leaf(b, 6), Node(Node(Leaf(y, 2), Leaf(t, 2)),
    #                     Node(Node(Leaf(o, 1), Leaf(_, 1)), Leaf(i, 3))))
    #  codebook: [('b', '0'), ('y', '100'), ('t', '101'),
    #             ('o', '1100'), ('_', '1101'), ('i', '111')]
    #  encoded_message: 01110011110110011010110000111101100
    #  decoded_message: bibbity_bobbity

License

Code Examples

The code examples are licensed under the MIT license (found in LICENSE.md).

Text

The text of this chapter was written by James Schloss and is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Images/Graphics

The image "huffman_tree" was created by Julian Schacher and is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Pull Requests

After initial licensing (#560), the following pull requests have modified the text or graphics of this chapter:

none

Huffman Encoding

Huffman Encoding

Video Explanation

Example Code

HuffmanCoding.cs

Program.cs

License

Code Examples

Text

Images/Graphics

Pull Requests

results matching ""

No results matching ""