Hashing Code and UCM

Ever since I saw this presentation by Rúnar Bjarnason, I’ve been fascinated by the concept of the Unison Codebase Manager (UCM). The idea of a codebase that is hashed and versioned based on its contents is 🔥.

Unison is a language that tries to solve many problems, and I’m not sure I’m sold on their way of doing things. However, I have been thinking about how to implement a UCM-like system that is generic over programming languages. What might this look like?

The Why

There are a few key concepts that I think are important to understand about UCM, and why it is so interesting.

First of all, you don’t have to worry about organizing your files. In fact, you can’t, because that’s not how you interact with the codebase. When you use UCM, you write your definitions into a scratch file, which is then synthesized by UCM and appended to the codebase. This has several cool implications:

If you define the same function twice, it’s hashed to the same value, and thus only stored once. There’s a bit here about aliasing, but it’s not the important bit.
You don’t have to care one bit about formatting, because that’s not how your code is stored. The textual representation is just how you interface with the UCM.
You don’t have to care about file organization, because, again, not how your code is stored!
Your code is type-checked at the time of synthesis, which means that you incrementally build up a type-checked codebase, and only need to perform type-checking on the code you’re currently working on. If we want to be general over languages, we’d have to settle for weaker guarantees in languages that don’t have strong type systems, but that’s fine.
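
To make the first two points concrete, here is a minimal sketch in Python, using the standard `ast` module as a stand-in for a real synthesis step. The names `hash_tree` and `Codebase` are made up for illustration, and this is not how UCM works internally. In particular, this toy version keeps the definition's name inside the hash, and parsing is only a very weak substitute for type-checking.

```python
import ast
import hashlib


def hash_tree(tree: ast.AST) -> str:
    """Hash a parsed definition by its structure, not its text."""
    # ast.dump leaves out line/column attributes by default, so
    # whitespace, comments, and redundant parentheses do not change
    # the digest.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


class Codebase:
    """Toy content-addressed store (hypothetical): hash -> source text."""

    def __init__(self) -> None:
        self.definitions: dict[str, str] = {}

    def add(self, source: str) -> str:
        # Parsing acts as a (weak) check at synthesis time: invalid
        # code raises SyntaxError and never enters the codebase.
        tree = ast.parse(source)
        digest = hash_tree(tree)
        # Keep one canonical rendering; adding the same definition
        # again is a no-op.
        self.definitions.setdefault(digest, ast.unparse(tree))
        return digest
```

Adding `def add(x, y): return x + y` twice with different spacing or comments should give the same hash and a single stored entry. One caveat with this shortcut: the output of `ast.dump` can differ between Python versions, so a real tool would need its own stable serialization.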

Now, there are some downsides to this approach. For example, you need some way to browse your code, and when you do view it, you want to see the textual representation you’re used to, not a hash tree. Unison has solutions for this, and I think they’re pretty cool. They’ve even talked about the possibility of UCM handling several different syntaxes and being able to switch between them. The same could in theory apply to formatting or code organization: you could use tabs, your colleague could use spaces, and the codebase manager would handle it.

The How

Now for the hard part. I’m not sure yet! I’ve thought about two approaches that might be viable, and both have a lot of prior work to build on.

Using tree-sitter, which parses source code into a concrete syntax tree (CST), and then hashing the CST.
Building on top of the Language Server Protocol (LSP). I haven’t looked into this much, but I think it might be possible to use the LSP to get the abstract syntax tree (AST) of a file, and then hash that.
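
Whichever parser ends up producing the tree, the hashing step itself could look roughly like the sketch below. It hashes a tree bottom-up, Merkle-style, so every subtree gets its own hash. The `Node` class is a hypothetical stand-in for whatever node type the parser provides, not the tree-sitter or LSP API.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parser's syntax node."""
    kind: str                  # e.g. "binary_expression"
    text: str = ""             # token text, only used for leaves
    children: list["Node"] = field(default_factory=list)


def hash_node(node: Node) -> str:
    """Hash a subtree from its kind and the hashes of its children."""
    h = hashlib.sha256()
    if node.children:
        # Interior node: combine the node kind with the child digests.
        h.update(b"node\x00" + node.kind.encode("utf-8") + b"\x00")
        for child in node.children:
            h.update(bytes.fromhex(hash_node(child)))
    else:
        # Leaf: combine the node kind with the token text.
        h.update(b"leaf\x00" + node.kind.encode("utf-8") + b"\x00")
        h.update(node.text.encode("utf-8"))
    return h.hexdigest()
```

Because the hash only looks at node kinds and token text, layout never enters into it. Depending on the grammar, comments may show up as nodes of their own, in which case they would have to be filtered out before hashing if they shouldn't affect identity.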

I think the second approach is more interesting, because it might make it pretty straightforward to provide the typical IDE features, like autocomplete, jump to definition, etc.

But that’s just the hashing part. The other part is managing the codebase, browsing definitions, and providing a convenient interface for developers, who are used to dealing with code as text files.
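
As a rough idea of what the management side might involve, here is a toy name layer sitting on top of a hash-keyed store. Everything in it (`Namespace`, `alias`, `rename`, `view`, `find`) is hypothetical, and it hashes raw text only as a placeholder for a proper syntax-tree hash.

```python
import hashlib


class Namespace:
    """Toy name layer (hypothetical) over a content-addressed store."""

    def __init__(self) -> None:
        self.store: dict[str, str] = {}  # hash -> definition text
        self.names: dict[str, str] = {}  # human-readable name -> hash

    def add(self, name: str, text: str) -> str:
        # Placeholder: hashes the raw text. A real system would hash
        # a syntax tree instead.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self.store.setdefault(digest, text)
        self.names[name] = digest
        return digest

    def alias(self, existing: str, new: str) -> None:
        # Two names can point at the same definition.
        self.names[new] = self.names[existing]

    def rename(self, old: str, new: str) -> None:
        # Renaming only touches the name table, never the store.
        self.names[new] = self.names.pop(old)

    def view(self, name: str) -> str:
        # Browsing means looking up the hash and rendering text.
        return self.store[self.names[name]]

    def find(self, prefix: str) -> list[str]:
        return sorted(n for n in self.names if n.startswith(prefix))
```

The point of the split is that names become cheap metadata: renaming or aliasing a definition never changes its stored content, and a text view is something the tool generates on demand.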

The When

I’m not sure. Hopefully this summer. We’ll see.