Parsing guard expressions in LR(1) grammars

In this document, we explain how the parser transforms a grammar containing non-terminals that have guard conditions applied into a grammar where guards only appear on terminal tokens in rules. This enables parsers with a simple extension to support guarded terminal tokens to handle the more complex grammars that allow guards on non-terminal tokens.

Guards on terminal symbols

When appying a guard expression to a terminal token in a grammar rule, this is interpreted as meaning that the specified terminal token must be seen in the input stream, and the guard must evaluate to true once the token has been seen, before the rule can shift to the next token, or reduce to the parent rule if the guard is applied to the last token in the rule. Applying guards to terminal tokens in an LR grammar is a simple but useful extension to the parsing engine.

Rewriting grammars with guards on non-terminal symbols

A guard condition on a non-terminal symbol in a rule production implies that the guard condition should be applied to the last terminal token within each expansion of that non-terminal. Hence a new set of productions should be added for the non-terminal with that condition applied. The rules for generating a grammar containing these expansions are described below.

We are going to show the transformations that are applied using the following sample input grammar. In this grammar, lower-case identifiers represent non-terminal tokens, while upper-case identifiers are terminal tokens as received from the input tokeniser. Note that the top-level non-terminal token that will have been recognised if the input token sequence obeys that grammar is topRule. Note also that there are two places where a guard condition is applied to a non-terminal token, both of them being the application of the guard condition [C1] to the non-terminal token z.

The numbered alpha symbols in curly braces represent the places in the grammar where an action is executed in response to a production being recognised, immediately prior to its reduction. Our aim is to show that the actions still get executed at the correct time and in the correct order, despite the transformations to the grammar:


topRule: // The original top-level rule
   Z w {α0}
|  w {α1};

w: x y z[C1] V {α2}
|  x y z[C1] x {α3};

y: Y Y {α4}
|  ε {α5};

z: Z y {α6}
|  y W {α7}
|  ε {α8};

x: X Y {α9}
|  ε {α10};

The first transformation ensures you start with an expanded top-level grammar rule, as required by LR parsers. This guarantees that the entry rule never appears on the right hand side of an expansion. Note that the expanded grammar rule, called _start in the code sample below, now includes a synthetic first terminal token (SOF) before the top rule. This is because the algorithm used to recognise guards on non-terminals recursively applies the guard condition to tokens on the non-terminal’s left, if the non terminal includes the empty expansion among its productions. Thus in the extreme case of an empty grammar being a valid completion of the grammar, we have a token to which we can attach a guard condition that has moved left through the grammar right up to the parent rule. The grammar is rewritten below:


_start: SOF topRule; // Auto-expanded rule

topRule: // The original top-level rule
   Z w{α0}
|  w {α1};

w: x y z[C1] V {α2}
|  x y z[C1] x {α3};

y: Y Y {α4}
|  ε {α5};

z: Z y {α6}
|  y W {α7}
|  ε {α8};

x: X Y {α9}
|  ε {α10};

Next, for each non-terminal with a guard condition, add a new set of productions for that non-terminal guard pair as the LHS of the production. The RHS will have its last token modified to take the same guard condition.


_start: SOF topRule;

topRule:
   Z w{α0}
|  w {α1};

w: x y z[C1] V {α2}
|  x y z[C1] x {α3};

y: Y Y {α4}
|  {α5};

z: Z y {α6}
|  y W {α7}
|  {α8};

x: X Y {α9}
|  {α10};

zC1:
   Z y[C1] {α6}
|  y W[C1] {α7}
|  [C1] {α8};

Apply this rule recursively for any further non-terminals that the guard condition gets moved to in the expansion:


yC1:
   Y Y[C1] {α4}
|  [C1] {α5};

Now we notice that for some of the productions in the examples given above, some of the expansions result in a guard condition at the beginning of the rule, rather than on a terminal or non-terminal symbol in the rule. These are what we shall term carry-back guard conditions. For convenience in later expansions, we shall divide these rules into two sets, the set consisting of an embedded guard condition but with no guard at the beginning of the rule (denoted using subscript φ), and a set beginning with the guard condition (denoted using subscript ε) as follows:


zφC1: Z y[C1] {α6}
|    y W[C1] {α7};

zεC1: [C1] {α8};

yφC1: Y Y[C1] {α4};

yεC1: [C1] {α5};

The carry-back guards now have to be applied to the rules in which the guarded non-terminals appear. Any place in a rule where we see p: q r[C] s, and r[C] has a carry-back set of rules, we shall replace it with two rules. One rule merely replaces r[C] with rφC, while the other rule in which we carry back the condition to the previous symbol replaces r[C] with its empty set rule: p: q[C] rεC s. In our grammar, this means we shall replace rules as follows:


w: x y zφC1 V {α2}
|  x y zφC1 x {α3}
|  x y[C1] zεC1 V {α2}
|  x y[C1] zεC1 x {α3};

We also have a similar modification to apply in one of our previously added rules:


zφC1: Z yφC1 {α6}
|    Z[C1] yεC1 {α6};

The newly written rules for w have uncovered some more non-terminals followed by guard conditions (y[C1]), so we apply the same rule replacement steps in w again, but for y[C1]:


w: x yφC1 zεC1 V {α2}
|  x[C1] yεC1 zεC1 V {α2}
|  x yφC1 zεC1 x {α3}
|  x[C1] yεC1 zεC1 x {α3};

We now notice that the non-terminal and guard x[C1] has not been expanded yet, so we need to apply that expansion:


xφC1: X Y[C1] {α9};

xεC1: [C1] {α10};

Causing the rules for w to be remodelled as:


w: x y zφC1 V {α2}
|  x y zφC1 x {α3}
|  x yφC1 zεC1 V {α2}
|  xφC1 yεC1 zεC1 V {α2}
|  [C1] xεC1 yεC1 zεC1 V {α2}
|  x yφC1 zεC1 x {α3}
|  xφC1 yεC1 zεC1 x {α3}
|  [C1] xεC1 yεC1 zεC1;

As we now have caused carry-backs for w, so must we split the rules into two groups:


wφ: x y zφC1 V {α2}
|  x y zφC1 x {α3}
|  x yφC1 zεC1 V {α2}
|  xφC1 yεC1 zεC1 V {α2}
|  x yφC1 zεC1 x {α3}
|  xφC1 yεC1 zεC1 x {α3};

wεC1:
   [C1] xεC1 yεC1 zεC1 V {α2}
|  [C1] xεC1 yεC1 zεC1;

As w has now been split into two sets of rules, one with a guard condition to be carried back, we next recursively apply carry-back rule rewrites on all rules in the grammar containing w among the tokens on the right hand side of any production:


topRule: Z wφ {α0}
| wφ {α1}
| Z[C1] wεC1 {α0}
| [C1] wεC1 {α1};

Once again, we now see that topRule has its own carry-back rule, so we split it into two groups:


topRuleφ: Z wφ{α0}
| wφ {α1};
| Z[C1] wεC1 {α0};

topRuleεC1: [C1] wεC1 {α1};

Since this uncovers a carry back for topRule that must be applied to rules containing topRule in the RHS expansion, we rewrite any rules in which topRule appears on the RHS:


_start: SOF topRuleφ
| SOF[C1] topRuleεC1; 

Notice that by rewriting the extended starting rule to have a synthetic terminal first token that precedes any real tokens in the input stream, we guarantee that the starting token _start will never have a carry-back token of its own. Hence the start of the grammar applies any carry-back for the entire grammar onto the synthetic token. Notice too that at this point, no part of the grammar contains a guard condition applied to a non-terminal. Guard conditions are always applied to preceding terminal tokens. The fully reconstructed grammar is:


_start:
   SOF topRuleφ
|  SOF[C1] topRuleεC1;

topRuleφ:
   Z wφ {α0}
|  wφ {α1}
|  Z[C1] wεC1 {α0};

topRuleεC1:
   wεC1 {α1}; // Carries back [C1]

wφ: x y zφC1 V {α2}
|  x y zφC1 x {α3}
|  x yφC1 zεC1 V {α2}
|  xφC1 yεC1 zεC1 V {α2}
|  x yφC1 zεC1 x {α3}
|  xφC1 yεC1 zεC1 x {α3};

wεC1: xεC1 yεC1 zεC1 V {α2} // Carries back [C1]
|    xεC1 yεC1 zεC1 x {α3}; // Carries back [C1]

x: X Y {α9}
|  {α10};

xφC1: X Y[C1] {α9};

xεC1: {α10}; // Carries back [C1]

y: Y Y {α4}
|  {α5};

yφC1: Y Y[C1] {α4};

yεC1: {α5}; // Carries back [C1]

z: Z y {α6}
|  y W {α7}
|  {α8};

zφC1: Z yφC1 {α6}
|    Z[C1] yεC1 {α6}
|    y W[C1] {α7};

zεC1: {α8}; // Carries back [C1]

This can now be used to construct the first sets and item sets for the LR(1) parser in the usual way. A few points to note on the implementation of these rules are worth mentioning:

Both the in-line and the offline parser generators used in the ParseLR suite apply the transformations described above, if they encounter non-terminal tokens that have guard conditions attached to them. They also perform canonicalisation of boolean guard conditions to test them for equivalence against each other as part of the optimisation process. Hence, for example, the expressions [!X & !Y] and [!(X | Y)] would be treated as the same guard condition when encountered.