*It makes me wonder, what is next? What new astonishing thing will happen in ver...

viraptor · on Dec 18, 2011

This should be much simpler once we stop using those silly text files and start storing everything as a proper representation of AST. Then again, at that point we can get rid of the silly text-diff-based systems and just store everything as versioned trees.

tikhonj · on Dec 18, 2011

I actually wrote a very simple system like this for a hackathon several months ago. The idea is that we would take some basic Scheme code (boy did we aim high :), parse it and commit the result. We would then diff the trees and keep track of the changes that way. Finally we had a cute web front-end that pretty printed the code from the AST and could show the diffs visually.
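To make the parse-then-diff idea concrete, here is a minimal sketch in Python, not the hackathon code. It assumes well-formed input in a tiny Scheme-like subset (no string literals, no quote syntax), and the names `tokenize`, `parse`, `read`, and `diff` are made up for illustration.

```python
def tokenize(src):
    # Drop ';' comments, then treat each paren as its own token.
    # Assumption: no string literals, so ';' always starts a comment.
    lines = [line.split(";", 1)[0] for line in src.splitlines()]
    return " ".join(lines).replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one expression from a token list into nested Python lists."""
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # discard the closing ')'
        return node
    return tok  # an atom, kept as a plain string


def read(src):
    return parse(tokenize(src))


def diff(old, new, path=()):
    """Yield (path, old_subtree, new_subtree) for each differing position."""
    if old == new:
        return
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        # Same shape at this level: descend and report only the changed children.
        for i, (a, b) in enumerate(zip(old, new)):
            yield from diff(a, b, path + (i,))
    else:
        # Different shape or different atoms: report the whole subtree.
        yield (path, old, new)


old = read("(define (square x) (* x x))")
new = read("(define (square x)\n  (* x x x)) ; oops")
for change in diff(old, new):
    print(change)
```

The path is a tuple of child indices from the root, so a front-end could use it to highlight the changed subtree. A real tree diff would also need to handle insertions and moves, which this positional comparison does not.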

We got the basics working, including simple diffs. One goal was to link the same variables between two versions; we did not manage to make that work, but had a very hacky approach that looked like it worked.

Doing any sort of merging with this data is nontrivial. We were planning to implement it, but unfortunately ran out of time. Still, we did have a cute demo of some commits and some diffs in the end--it actually worked a little, which is much more than I expected starting out.

However, despite not implementing merging, we did throw in some nice features. In particular, we were able to identify commits that did not change the function of the code (whitespace and comment changes only) and mark them. This was very easy yet still useful, and a good indicator of the sorts of things one could do with a system like that.
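The whitespace-and-comments check is easy to sketch. The version below uses Python's standard `ast` module as a stand-in for the Scheme parser; the function name `same_program` is hypothetical.

```python
import ast


def same_program(old_src: str, new_src: str) -> bool:
    """Return True when both sources parse to the same syntax tree."""
    # ast.parse discards comments and layout, and ast.dump leaves out
    # line/column attributes by default, so only tree structure is compared.
    return ast.dump(ast.parse(old_src)) == ast.dump(ast.parse(new_src))


before = "def f(x):\n    return x+1\n"
after = "def f(x):\n    # add one\n    return x + 1   \n"
print(same_program(before, after))
```

One caveat: in Python a docstring is part of the tree, so a docstring-only edit would still count as a change here.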

After the hackathon, one of my friends found some papers about a system just like ours. I don't remember where they were from, but if you're interested you could look for them. (I think the phrase "semantic version control" is good for Googling; that's what we called our project.)

Overall I think that it's a neat domain but in hindsight maybe it was a little too much for 18 hours of coding :) We did have fun, and it was cool, so I have no regrets.

gbog · on Dec 18, 2011

This sounds like a bad idea. Code is text. If an AST helps with merging diffs, why not use it for analysis when there is a conflict, and keep the code as text?

tikhonj · on Dec 18, 2011

I think it's most accurate to say that code is a textual representation of an AST. Saying that it's just text is just like saying it's just a bunch of numbers--both technically true but missing the bigger picture.

One potential reason not to store code as text is that there are many equivalent programs that differ only in inconsequential text. A perfect example is trailing whitespace.

There are also some benefits of storing code as an AST. For one, it would make it trivial to identify commits that did not change the actual code--things like updated comments. This would help you filter out commits when looking for bugs. Another benefit would be better organized historical data: in a perfect system, you would be able to look at the progress of a function even if it got renamed part of the way through.
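Following a function across a rename could look roughly like this. It is only a sketch using Python's `ast` module: the helper names `fingerprints` and `find_renames` are invented, and it matches only top-level functions whose signature and body are completely unchanged.

```python
import ast


def fingerprints(src: str) -> dict:
    """Map each top-level function name to a dump of its signature and body."""
    out = {}
    for node in ast.parse(src).body:
        if isinstance(node, ast.FunctionDef):
            # The name is left out of the fingerprint so a pure rename still matches.
            body = tuple(ast.dump(stmt) for stmt in node.body)
            out[node.name] = (ast.dump(node.args), body)
    return out


def find_renames(old_src: str, new_src: str) -> dict:
    """Return {old_name: new_name} for functions whose body is unchanged."""
    old, new = fingerprints(old_src), fingerprints(new_src)
    # Only names that appear in the new version alone are rename candidates.
    added = {fp: name for name, fp in new.items() if name not in old}
    return {name: added[fp] for name, fp in old.items()
            if name not in new and fp in added}


v1 = "def total(xs):\n    return sum(xs)\n\ndef helper():\n    return 1\n"
v2 = "def helper():\n    return 1\n\ndef add_all(xs):\n    # renamed\n    return sum(xs)\n"
print(find_renames(v1, v2))
```

This misses a function that is renamed and edited in the same commit, and a recursive function, whose body mentions its own name. A real system would need some similarity measure instead of exact equality.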

adambyrtek · on Dec 18, 2011

But then you end up with a version control system that is not generic, but dependent on a particular language. The story of Smalltalk suggests that the added value might not be worth the coupling and complexity it requires.

tikhonj · on Dec 18, 2011

You should be able to write a generic version control system like this where you can just plug the appropriate parser in and it would work for that language. As a fallback, it could still keep some files as text.
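A rough sketch of the plug-in idea, assuming a registry keyed by file extension. `PARSERS` and `canonical_form` are hypothetical names, and treating JSON key order as insignificant is my assumption, not something every project would want.

```python
import ast
import json
from pathlib import PurePath
from typing import Callable, Dict, Tuple

# Hypothetical registry: file extension -> function returning a canonical form.
PARSERS: Dict[str, Callable[[str], str]] = {
    ".py": lambda src: ast.dump(ast.parse(src)),
    ".json": lambda src: json.dumps(json.loads(src), sort_keys=True),
}


def canonical_form(filename: str, src: str) -> Tuple[str, str]:
    """Return (kind, payload): a parsed form if a parser exists, else raw text."""
    parser = PARSERS.get(PurePath(filename).suffix)
    if parser is None:
        return ("text", src)  # no parser registered: keep the file as plain text
    return ("tree", parser(src))


print(canonical_form("notes.txt", "hello ")[0])
print(canonical_form("a.py", "x = 1") == canonical_form("a.py", "x  =  1  # c"))
```

Supporting another language would mean adding one entry to the registry. Files that fail to parse raise an exception in this sketch; a real tool would have to decide whether to reject them or store them as text.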

viraptor · on Dec 18, 2011

Because you can always convert from AST to text, but not always from text to AST. Also, it's easier and faster to convert AST->text, and you can do it only when needed. Additionally, you'd never commit a syntax error. Why is it a bad idea?
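Both directions can be sketched with Python's `ast` module. `try_commit` and `checkout` are hypothetical names, and `ast.unparse` needs Python 3.9 or later.

```python
import ast


def try_commit(src: str):
    """Return the parsed tree, or None when the text is not a valid program."""
    try:
        return ast.parse(src)
    except SyntaxError:
        return None  # text -> AST can fail, so this commit would be refused


def checkout(tree: ast.AST) -> str:
    """Regenerate source text from a stored tree."""
    return ast.unparse(tree)  # available in Python 3.9+


print(try_commit("def f(:\n    pass") is None)

tree = try_commit("x = (1 +   2)  # sum")
text = checkout(tree)
# The regenerated text parses back to an identical tree.
print(ast.dump(ast.parse(text)) == ast.dump(tree))
```

Note that the regenerated text is a pretty-printed version, not the original: with this parser, comments and the author's formatting are gone, so a real system would have to store them in the tree somehow.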

kfool · on Dec 18, 2011

Now it's time to bring the idea beyond code

Exactly! Try versioning data with ChronicDB:

http://chronicdb.com