- Tweak the parser to only reparse the lines that have changed. This alone might be enough to get by at the current level of performance.
- Reduce the extreme amount of garbage created currently by using object pools.
- Do the initial parse on the server-side, at least for larger files.
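The object-pool idea from the list above can be sketched roughly as follows. This is a hypothetical illustration, not the actual parser's code: `NodePool` and `StackNode` are stand-in names, and a real pool for parser-internal objects would have more fields and a size cap.

```java
import java.util.ArrayDeque;

// Hypothetical sketch: a tiny object pool that recycles parser-internal
// objects (here a stand-in "StackNode") instead of letting each reparse
// flood the garbage collector with short-lived allocations.
public class NodePool {
    // Stand-in for a pooled parser object; the real field set would differ.
    static final class StackNode {
        int state;
        StackNode reset(int state) { this.state = state; return this; }
    }

    private final ArrayDeque<StackNode> free = new ArrayDeque<>();

    // Reuse a retired node when one is available, allocate otherwise.
    StackNode acquire(int state) {
        StackNode n = free.isEmpty() ? new StackNode() : free.pop();
        return n.reset(state);
    }

    // Return a node to the pool once the parser is done with it.
    void release(StackNode n) {
        free.push(n);
    }
}
```

The parse loop would call `acquire` where it currently calls `new`, and `release` when a node is popped for good; the win is that steady-state parsing allocates almost nothing.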
What is a Scannerless GLR parser anyway?
SGLR does not separate your parser into the usual two steps: (1) lexer/tokenizer/scanner followed by (2) LL- or LR-style parsing of the token stream. Instead, SGLR runs the parser on every character. This is why it’s scannerless.
The next point, GLR, is trickier to explain briefly. Remember that a parser takes a string of text and creates a (parse) tree for it. At some stage during the parsing, there might be multiple possible trees that fit the text seen so far. This is called an ambiguity. We don’t yet know what is meant by the text. Only by continuing to read will it (hopefully) become apparent. This is where the k comes in. LL(k)/LALR(k) lets you look only k tokens into the future. GLR allows you to look all tokens into the future (for SGLR, I should probably say all characters into the future).
Typically, LL- and LALR-based tools, such as the archaic Yacc and Bison, force you to contort your grammar a lot to fit within the k-token lookahead. If you have ever run into the dreaded “shift/reduce” conflict, you know what I’m talking about.
The combination of scannerless and GLR gives us something wonderful: grammars are closed under composition. So, if you write grammars for languages L1 and L2, you can just make a new module L3, include L1 and L2, et voilà — you have L3 = L1 ∪ L2.
Trying such a stunt with LALR will almost invariably push you into the “shift/reduce”-conflict corner. Admittedly, there is a significant trade-off or gotcha with the (S)GLR approach. And it is very similar to the trade-offs/gotchas with early vs late binding, or static vs dynamic typing.
If, after processing the entire text, there still are multiple possible parse trees, GLR will give you all of them. So, you catch the ambiguity at runtime. GLR also lets you say more: the set of grammars you can express with GLR is strictly larger than that of LL(k) and LALR(k).
The trade-off is simple: with LL(k)/LALR(k), you are guaranteed to have exactly one, unambiguous, tree at the end — but then you might not be able to express the language you want to design. With SGLR, you are free to express the full class of context-free grammars, but the parsing result might be a forest of trees. It’s up to you to prune the forest afterwards, finding the tree you want.
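The forest-then-prune idea can be illustrated with the classic ambiguous grammar E ::= E "-" E | num, where "1-2-3" has two readings. The sketch below is purely illustrative (the names `ForestDemo`, `parseForest`, and `pruneLeftAssoc` are made up, and trees are shown as parenthesized strings rather than real ATerms): it enumerates the whole forest, then applies a disambiguation filter that keeps the left-associative reading.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: the ambiguous grammar E ::= E "-" E | num
// applied to "1-2-3" yields a parse *forest* of two trees. A GLR parser
// hands you all of them; a post-parse filter then prunes to the one you want.
public class ForestDemo {
    // Build every tree shape over the operand range [lo, hi] (inclusive).
    static List<String> parseForest(String[] nums, int lo, int hi) {
        List<String> trees = new ArrayList<>();
        if (lo == hi) {
            trees.add(nums[lo]);
            return trees;
        }
        for (int split = lo; split < hi; split++)
            for (String left : parseForest(nums, lo, split))
                for (String right : parseForest(nums, split + 1, hi))
                    trees.add("(" + left + "-" + right + ")");
        return trees;
    }

    // Disambiguation filter: keep the left-associative reading, i.e. the
    // tree that never has a parenthesized subtree directly after a '-'.
    static String pruneLeftAssoc(List<String> forest) {
        for (String t : forest)
            if (!t.contains("-(")) return t;
        throw new IllegalStateException("no left-associative tree in forest");
    }

    public static void main(String[] args) {
        List<String> forest = parseForest(new String[] {"1", "2", "3"}, 0, 2);
        System.out.println(forest);                 // both ambiguous readings
        System.out.println(pruneLeftAssoc(forest)); // the left-associative one
    }
}
```

In a real SGLR setup this pruning is usually driven by priority and associativity declarations in the grammar rather than a hand-written filter, but the principle is the same: parse first, disambiguate second.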
I’m not claiming that SGLR is the best solution for all kinds of parsing needs. With great power comes great responsibility. You might not always need great power, and great responsibility is usually just a liability, anyway ;)
To interested parties, this is what I did:
- I ripped out the dependence on the CWI ATerm library and (yet again) wrote my own. Mostly hacked together from scratch, but with the Textual ATerm Format (TAF) parser stolen from the BasicStrategoTerm implementation.
- Ripped out all references to java.io and replaced them with my own stuff, based on strings.
- Wrote a tiny RemoteServiceServlet to allow the client side to access the parse tables from the server side.
- Added a tiny hack to also fetch parse tables as TAFs and load these on the client side.
- Various other minor fixes all over the place.
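The java.io replacement mentioned above can be sketched along these lines. This is a guess at the shape of it, not the actual code: `StringReaderLite` is a made-up name, and only the two operations a character-level parser minimally needs are shown.

```java
// Hypothetical sketch of a string-backed reader standing in for
// java.io.Reader, for a scannerless parser that only ever needs to
// consume and peek at characters of an in-memory string.
public final class StringReaderLite {
    private final String text;
    private int pos = 0;

    public StringReaderLite(String text) { this.text = text; }

    // Next character, or -1 at end of input (mirrors Reader.read()).
    public int read() { return pos < text.length() ? text.charAt(pos++) : -1; }

    // Look at the next character without consuming it.
    public int peek() { return pos < text.length() ? text.charAt(pos) : -1; }
}
```

Backing the reader with a plain String is what makes the client-side (GWT) port feasible, since the java.io stream classes are not available there.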