Roslyn and my favorite persistent data structure

This post is not a propaganda about perks and benefits of immutable persistent data structures. People familiar with the concept already know when it can tremendously help your code and when it cannot. And for those who aren’t familiar I cannot give a better introduction than those three links:

In this post I want to talk about my favorite persistent data structure. Or, if you wish, the one I consider most beautiful. For a long time the answer to this question was Zipper¹. However, two years ago this question popped up on StackOverflow:

Are Roslyn SyntaxNodes reused?

The team who built Roslyn, the new compiler infrastructure for C# and VB.NET languages, had to come up with an AST data structure that would satisfy a set of mutually incompatible requirements. The tree had to (a) have a cheap access to the parent field; (b) maintain an up-to-date character span in the document for each node; (c) persistently reuse most of the nodes on every update. As Eric Lippert describes, they solved this problem by maintaining two trees, one of which is immutable and persistent, and another has parent references and is re-built on-demand for each query, but only partially. The latter tree forms a facade around the former one, and together they solve all of the requirements at once.

The post is a great read, I encourage everyone to look at Eric’s description. Because Roslyn is now open-source, the entire implementation is now available on GitHub (hint: search for SyntaxNode/RedNode/GreenNode). For a long time I was also surprised that the team never published anything that formally describes their idea. However, today I found out that they at least filed a public patent. Its text contains many more interesting explanations and details on how this combination of two parse trees works as one unified persistent parse tree.

What is your favorite persistent data structure?

By the way, did you know that a zipper is a derivative of a surrounding container in a semifield of algebraic data types? This is a beautiful connection that also relates to generating functions, Taylor series, linear logic, and information theory. See, for example, [1], [2], [3], [4], [5]. ↩