Hackathon in Delft: Gone!

I’ve been at it again. In the previous autumn/winter season, I spent a month in Delft, hacking on the Stratego/Spoofax program transformation infrastructure.

Now that another winter is here, I’ve had the good fortune to repeat what is slowly becoming a tradition. This year I even found time to reserve two months for a more extended hackathon.

Many really good things came of out of the stay. A few of the immediate results are:

I am also involved in several on-going projects for bringing the necessary parts of the Spoofax infrastructure to the web. Interesting results should start popping up over the new few months.

As is the case for most startups, finding the time to contribute back to the upstream projects we use is always difficult. Whenever this happens, though, the payoff is often substantial.

An Eclipse Console for Spoofax

More good news, everyone! I have found time to integrate the command line shell for Spoofax into Eclipse. You can now have all sorts of tricky conversations with the Spoofax interpreter inside a perfectly innocent-looking Eclipse console:

A conversiation with Spoofax

As you can see, there are still rough edges to be ironed out. One of them is clearly the color palette. Another is the lack of inline rules, which are not supported yet. Neither are dynamic rules (and it is doubtful they ever will be — we are currently exploring other opportunities for dealing with context-sensitive rewrite rules).

You can bring up the actual console by using the console page selector in the top right-hand corner of your console, like you do for the other types of consoles:

Console page selector

Remember that you can always report the issues you come across in YellowGrass and tag me (@karltk).

Hackathon in Delft: Go!

It’s that time of the year again. The glorious month of intensive parser implementation, compiler engineering and language workbenches — the essentials of any IDE — has arrived. I’ve retreated to the TU Delft campus for the month of November to hack on interactive language infrastructure for our startup, and to think big thoughts about IDEs and DSLs in general.

Our startup is a user of both Stratego and Spoofax, so it was only natural to join forces with Eelco Visser and some of his henchmen here in Delft, who are the maintainers of said projects.

For somewhat practical and somewhat nostalgic reasons, I decided to stay at the TU Delft campus. Campus life here already brings back memories of my year living at Cambridgelaan: the access to lightning fast broadband in your dorm room, the four minute walk to the lab, the 24-hour party people (aka Erasmus students), and the freedom to follow your biological clock.

Time for some commits:)

Prototype of a Scannerless, Generalized Left-to-right Rightmost (SGLR) derivation parser for JavaScript

Need a Scannerless, Generalized Left-to-right Rightmost (SGLR) derivation parser for JavaScript? Then my JSGLR-GWT proof-of-concept might interest you. (If you don’t know what a scannerless GLR parser is, then you are probably not interested in writing modular grammars for your programming language implementation anyway — or you might want to read a few sections down).

A word of warning: I’m not claiming it is written in JavaScript, merely for JavaScript. I simply took our old Java-based SGLR implementation and massaged it quite heavily until GWT could spit out a working version for the JavaScript interpreters.

Performance

The performance is, quite frankly, beyond my wildest expectations. I’ve not made a single attempt at performance tweaking for the JavaScript engines yet, and still I manage to parse through 830 lines of Stratego code in ~3 seconds (Chromium 7.0.519.0). For smaller fragments, even Firefox 3.5 manages to keep up: it manages ~20 lines in ~1s (Chromium does that test in 74ms).

To most people, this probably doesn’t sound very fast at all, but I was much worse off when I initially wrote JSGLR back in 2006 — and that used a term(tree) library that wasn’t hacked together over night. All in all, I think there are plenty of performance tweaking opportunities, and the JavaScript engines are getting faster almost by the minute.

Now, another reason for my optimism, is my use case. I’m interested only interested in JavaScript port of SGLR for interactive use. Think web IDE. My aim is to attain interactive speeds when editing text. A few optimization ideas off the top of my head:

  • Tweak the parser to only reparse the lines that have changed. This alone might be enough to survive with the current level of performance.
  • Reduce the extreme amount of garbage created currently by using object pools.
  • Do the initial parse on the server-side, at least for larger files.

What is a Scannerless GLR parser anyway?

(optional reading)

SGLR does not separate your parser into the usual two steps: (1) lexer/tokenizer/scanner followed by (2) LL- or LR-style parsing of the token stream. Instead, SGLR runs the parser on every character. This is why it’s scannerless.

The next point, GLR, is tricker to explain briefly. Remember that a parser takes a string of text and creates a (parse) tree for it. There might, at some stage during the parsing, be multiple possible trees that might fit the text seen so far. This is called an ambiguity. We don’t yet know what is meant by the text. Only by continue reading, will it (hopefully) become apparent. However, this is where the k comes in. LL(k)/LALR(k) lets you look only k tokens into the future. GLR allows you to look all tokens into the future (for SGLR, I should probably say all characters into the future.).

Typically, LL and LALR-parsers, such as the archaic Yacc and Bison parsers, force you to contort your grammar a lot to meet the k-token lookahead. If you have ever run into the dreaded “shift/reduce” conflict, you know what I’m talking about.

The combination of scannerless and GLR gives us something wonderful: grammars are closed under composition. So, if you write grammars for languages L1 and L2, you can just make a new module L3, include L1 and L2, et voila — you have L3 = L1 ∪ L2.

Trying such a stunt with LALR will almost invariably push you into the “shift/reduce”-conflict corner. Admittedly, there is a significant trade-off or gotcha with the (S)GLR approach. And it is very similar to the trade-offs/gotchas with early vs late binding, or static vs dynamic typing.

If, after processing the entire text, there still are multiple possible parse trees, GLR will give you all of them. So, you catch the ambiguity at runtime. But you can use GLR to say more: the set of grammars you can express with GLR is larger than that of LL(k) and LALR(k).

The trade-off is simple: with LL(k)/LALR(k), you are guaranteed to only have one, unambiguous, tree at the end — but the you might not be able to express the language you want to design. With SGLR, you are free to express the full class of context-free grammars, but the parsing result might be a forest of trees. It’s up to you to prune the forest afterwards, finding the tree you want.

I’m not claiming that SGLR is the best solution for all kinds for parsing needs. With great power comes great responsibility. You might not always need great power, and great responsibilty is usually just a liability, anyway;)

Implementation Details

To interested parties, this is what I did:

  • I ripped out the dependence on the CWI ATerm library and (yet again) wrote my own. Mostly hacked together based from scratch, but with the Textual ATerm Format (TAF) parser stolen from the BasicStrategoTerm implementation.
  • Ripped out all references to java.io and replaced with my own stuff, based on strings.
  • Wrote a tiny RemoteServiceServlet to allow the client side to access the parse tables from server side.
  • Added a tiny hack to also fetch parse tables as TAFs and load these on the client side.
  • Various other minor fixes all over the place.