on the representation of trees


Trees are fucking everywhere: filesystems, parsers, forests, Dyck languages, structural induction, types, between the cracks in pavement...

The question then arises about how to efficiently represent tries. Naievely, you can treat them similarily to a linked list, but that requires a lot of overhead. The question then arises of how to efficiently represent binary trees, and perhaps more importantly, do it in a way that doesn't make common operations on it incredibly painful (goddamn).

This blog post is intended as an exploration of trees and their representations. I'm not looking to Review the Literature™, but rather to summarize some interesting things I've found while trying to figure out how to represent inductive data types efficiently. Don't sue me™.

succinct is a funny word

What we're really looking for is the notion of a succinct representation. That is, a representation asymptotically "close" to the smallest a tree can be represented. Various types of trees have slightly different information-theoretic minima, but who cares I guess. We specifically allow linear overhead to that information-theoretic minimum, since computer scientists are all about that linear space complexity.

higher-arity trees

So far, we've been talking just about binary trees, which, does need justification. Trust me, I haven't been doing that for no reason. I've been doing it because it's funny.

And by funny, I mean you can represent any tree as a binary tree, and vice-versa. For 1-ary trees, the encoding is trivial, since 1-ary trees are already binary trees. We leave 0-ary trees as an exercise for the reader.

For trees of higher arity, we can encode as binary trees in the following way: for a node A with children B1, B2, B3, ..., through BN: make B1 the left child of A, then make B2 the right child of B1, B3 the right child of B2, and so on.

In effect, we encode the sibling relation as the right child, and first-child relation as the left child. In another view, we represent the children of a node as a linked list, with the first node in the linked list being linked to by the parent.

This is a bijection. We can trivially reverse the encoding, turning any binary tree into an n-ary tree, but that's kinda boring so you can do it yourself.

Crucially, though, this encoding preserves node count. Which I think would be a requirement to it being a bijection in the first place (a tree homomorphism?) but whatever. So, I'm gonna refer to binary trees and higher-arity trees interchangably as just trees. Such is the power of the forest.

terminology of trees (treeminology?)

Papers on succinct representations of trees (from what I've seen) tend to differentiate between so-called ordinal trees and cardinal trees.

From what I can tell, the main difference is in how children are enumerated. Cardinal trees have children in specific "slots", where you can, for example, have a "Child 1" and "Child 7" out of slots allowing children 0 through 9 (for a 10-ary tree). However, an ordinal tree treats its children as an ordered set, where the indexing of children is an after-the-fact imposition. That is, there is a "First Child" and "Second Child", not "Child 1" and "Child 2".

We can also see them as an analogy to their names. A cardinal tree numbers its children cardinally, that is, absolutely. Ordinal trees number their children ordinally, that is, relatively, or based on an order. This is just as cardinal numbers (one, two, three, etc) differ from ordinal numbers (first, second, third, etc).

Free trees impose no specific order or slots on children: they simply form a set, where no ordering is be imposed. The difference between cardinal and ordinal trees is small, but can be significant based on your application.

array representation

The prototypical succinct representation of a binary tree is the array representation. This representation requires extremely low overhead, but requires the tree to be complete (all levels except the last are completely filled, and all nodes in the last level are all to a predetermined side). We can uniquely represent such trees by their number of nodes, and can further store any labels through a level-order (or breadth-first) enumeration, stored as an array.

If every node is labelled, we can even store the entire tree as a null-terminated string, where the first element is the label for the root, next two for the children of the root node (which is a banging-ass band name), next four for the root's children's children, etc.

This does support several constant-time implementations as well, such as getting the labels for each node, and finding the parent and children for a node. Probably others as well, idk.

succinct representations

I guess the point of a lot of this was just to communicate the necessary prerequisites to be able to find research on succinct representations of trees. If you want any other examples of representations like that, try the search terms "Succinct Representations of Ordinal Trees" or "Succinct Representations of Cardinal Trees" or even "Uniform Succinct Representations of (...) Trees". I don't really feel like reading those papers though.

Succinct representations are all about keeping representations small while keeping important operations constant-time. I'm trying to represent inductive data types with efficient pattern-matching. So, getting the subtrees rooted at the root's children is my main focus. This, I don't think has been the main focus of currently-known succinct representations. I may be wrong though! It sure seems like it should be wrong.


So, I'm gonna try to figure out a good enough representation for this case. There'll probably be a blog post later on the design of the language I'm working on, including a tree representation.

See the sequel post the agony of a counted bark for a construction of a tree representation!