on the agony of a counted bark

a continuation

This article continues from on the representation of trees, in which we, among usther things, established the countability of binary trees, as well as a bijection between higher-arity trees and binary trees (and thus, the countability of trees). However, the countability of trees wasn't quite constructive. I intend to provide a, hopefully succinct and efficient, representation of trees here.

Same disclaimers as the last article apply though.


So, I managed to find a paper on this shit. It's "Constant Time Generation of Rooted Trees" by Terry Beyer and Sandra Hedetniemi. It's not perfect for my case, of course, but it's good enough to adapt.

The paper was written to figure out how to generate arbitrary binary trees, up to subtree ordering. That is, generating free trees. What I'm specifically interested in here is the notion of a level sequence.

Define a level sequence as a preorder traversal of a given tree, mapping nodes to their height. That is, enumerate through a each node, then its children, in order. Take their heights (1 for the root, 2 for the Children of the Root Node, 3 for the root's children's children, etc) and create a sequence in that enumerated order. A tree represented in parentheses notion as ( (()) () ) would have its level sequence be [1 2 3 2].

fuzzy bark

Now, that seems useful and great. However, we can improve this representation. For one, combining trees is slow: we would have to shift each level up by 1 (in linear time) before concatenating with another tree. This is the fundamental operation of data construction, so it should be efficient. For two, there's some unnecessary aspects of the representation.

Note the following aspects of the representation: that it's absolutely indexed, and that for every two adjacent increasing numbers in the level sequence, it may only increase by 1. We shall explore the first aspect first.

Let's make the representation relative, where each number in the level sequence represents a change in level, rather than a level itself. For the above example, [1 2 3 2] becomes [1 1 -1]. A tree with level sequence [1 2 3 4 2 3 3] becomes [1 1 1 -2 1 0]. This primarily acts to fix the tree combination problem in our representation. With an absolute level sequence, combining [1 2] and [1 2 2] would require a linear-time operation, resulting in [1 2 3 2 3 3]. Combining the equivalent relative level sequences [1] and [1 0] would result in [1 1 -1 0].

This... doesn't quite fix the linear-time problem. That -1 still requires a scan-through of the first level sequence to calculate. So, we'll use the Power of Memoization and introduce a tail to the sequence, and call these fur-relative level sequences (shortened as furry levseq). We suffix each furry levseq with a single number, as if the tree ends with another node at level 1 (which, is impossible by definition of a tree). So, [1] becomes [1 -1] and [1 0] becomes [1 0 -1]. You can pretend these originate from the absolute levseqs [1 2 1] and [1 2 2 1] instead of [1 2] and [1 2 2], respectively.

Furry levseqs fix the linear-time problem. To combine two furry levseqs [1 -1] and [1 0 -1], we create a new furry levseq as follows: start with 1, concatenate the first furry levseq, then concatenate the second furry levseq, but with its tail decremented. Then, [1 -1] and [1 0 -1] combined becomes [1 1 -1 1 0 -2].

dry bark

We're still left with the second aspect: that level sequences only ever increase by one. This means, then, that we can add together adjacent positive (nonzero) numbers in a furry levseq without any loss of information.

Given any furry levseq, we may create what we deem a squeeshed levseq by adding together any two adjacent positive (nonzero) numbers. The equivalent squeeshed levseq of the furry levseq [1 1 -1 1 0 -2], then, is [2 -1 1 0 -2]. We may just as easily reverse this by expanding any positive (nonzero) number into ones.

In fact, that strikes at an important part of all of these different types of levseqs I just made up: they're all in bijection. That is, they're lossless transformations. We can take any squeeshed levseq and turn it back into an absolute levseq with no loss of information. Going from absolute, to squeeshed, and back gives us the exact levseq we started with.

All we've done is compress and optimize for our use case. That is, an on-disk representation of arbitrary trees. Squeeshed levseqs aren't any better or worse. They're optimized for disk representations and type induction, while absolute levseqs are optimized for enumerating and generating tres from scratch. Squeeshed levseqs would be shit at that.

mortar and pestle

So: we've reduced our problem of storing arbitrary tree structures into storing a sequence of numbers. Which, is perfect.

This is intended for a programming language I'm calling Kemmer. Any data type is really just a labelled tree, where each node is labelled with a number (in the case of sum types or enums: which of the possible types it evaluates to) and some kind of data (in the case of product types or structs: what external information does the type carry).

Having the tree's structure already stored, all we need to do is also, alongside it, provide these labels. We can just do that as an array postpended to a squeeshed levseq, together with a memoized size (for ease of combination).

I'm getting tired

Hopefully that was coherent enough. This was meant as an abstract outline of my thought process putting together this representation, as well as a way for me to develop this representation (and write it down :p).

It's probably (definately) going to be changed as I remember other design goals for Kemmer. I think I forgot to consider the existance of destructors in this optimization. And partial pattern-matching.

So it goes.