GLR parsing

In a conventional LR(N) parser, the state transition taken on each shift or reduction is determined by the next N input token values. If the grammar has two possible transitions that could occur for the same set of N lookahead tokens, a state machine cannot be generated without increasing the value of N until there are unique sets of lookahead tokens for each possible transition in the grammar. In practice, a value of N=1 is used for most parser generators, and the input grammar defined to be deliberately and LR(1) grammar. The complexity of the parser rises non-linearly with the value of N, meaning that parsers with increasing values of N rapidly become impractical to implement.

There are circumstances where a grammar is needed for which a large value of N would be required, or for which the ambiguity between the possible valid parses of an input token sequence cannot be resolved until late in the input token sequence. In these circumstances, a generalised LR parser may be a more practical alternative. An example of such a grammar might be a parser that tried to extract the meaning from sentences in English by parsing the sequence of words. Imagine an input grammar recogniser trying to parse the sentence: "The registrar married the bank clerk and the midwife attended." Clearly there is a significant change of meaning when the parse proceeds beyond 'bank', beyond 'clerk' and beyond 'midwife'!

ParseLR builds a canonical LR(1) parser as its default parser type. This means that it uses just the next input token to determine what state change to make. If for a given state, there is more than one possible transition (shift or reduction) that is valid for a particular next input token value, the grammar is ambiguous and either a 'shift-reduce' conflict or a 'reduce-reduce' conflict is flagged. No parser is generated by default.

ParseLR does introduce a little flexibility over other LR(1) parser generators here. Each token can be accompanied by a boolean guard condition that is tested along with the next input token type, allowing the same next token to be used for multiple transitions form a particular state if the guard conditions can differentiate between them. This is still sometimes inadequate for more complex grammars requiring a longer lookahead sequence to determine which transition to take.

An LR(1) parser builds a single stack to represent the set of tokens or part-recognised grammar rules it has seen so far in the parser input stream. The next input token (and possibly a guard) are used to perform either a shift (pushes the input token onto the stack, and jumps to a different state) or a reduction (pops a number of tokens from the stack, replacing them with a single token representing the LHS of a grammar rule, and jumps to a different state).

A GLR parser works exactly the same, except that on encountering an input token that is ambiguous, it clones the stack and subsequently runs two parses side by side on the input token stream. At some point later in the input stream, one of the parses may encounter an invalid input token in the input stream. This stack is then dropped from the set of candidate stacks as it constitutes an invalid parse. At the end of the input token sequence, ideally only one stack has survived, representing the correct parse of the input sequence. Potentially multiple stacks may have survived, meaning that the input sequence was genuinely ambiguous. In this case the parser has captured all valid interpretations of that sequence.

Documentation

In order to be able to write the grammar for and use a GLR parser created by ParseLR, you should have a good understanding of how to write the grammar and code for an LR(1) parser, as described in the LR(1) documentation for ParseLR. You then will need to read the following pages:

There is also an implementation of the calculator example using a GLR parser to resolve operator precedence. A description of the code will be found here. As this is built into the existing calculator demo as an alternative grammar implementation, you can visit the same code to compare the LR(1) implementation against the GLR implementation.