Strongly typed token values

Background

By default, parsers generated using ParseLR consume input tokens or events as a stream of objects, each exposing the IToken interface. Each input token exposes its token or event type through the Type property, which is held as an integer. To translate between these integer values and meaningful names for the token types, a parser generated by ParseLR exposes a TwoWayMap<string, int> called Tokens. The parser factory for the parser class exposes the same map as a static property, which makes it easy to look up a token value from its name where no reference to the parser object is available.

Much of the power of a parser comes from processing the additional data that accompanies each input token. The type of that data may differ from one token type to another. Hence the IToken interface exposes a Value property of type object, so that it can carry any data type. It is left to the programmer to choose the right cast for a given token type when writing the action code in the grammar. This usually results in casts appearing before occurrences of the $N parameters.

This style of coding is not type-safe as it relies on the developer knowing what to cast action parameters to when writing action code. It also makes the code cluttered, as operator precedence requires extra parentheses when accessing members of an object that has just been cast to its real type.
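The clutter described above can be seen in a small stand-alone C# sketch. The parameters p0 and p1 are hypothetical stand-ins for the untyped $0 and $1 parameters of an action block; they are not part of ParseLR, and the sketch only illustrates the casting pattern, not generated parser code.

```csharp
using System;
using System.Collections.Generic;

public static class CastClutterSketch
{
    // p0 and p1 are hypothetical stand-ins for untyped $0 and $1 action
    // parameters: both are statically typed as object.
    public static int AppendUntyped(object p0, object p1)
    {
        // Each cast must be wrapped in parentheses before a member can be
        // accessed, because member access binds more tightly than a cast.
        ((List<string>)p0).Add((string)p1);
        return ((List<string>)p0).Count;
    }

    // The same operation when the values already have their real types,
    // which is how strongly typed parameters read.
    public static int AppendTyped(List<string> p0, string p1)
    {
        p0.Add(p1);
        return p0.Count;
    }

    public static void Main()
    {
        object adjectives = new List<string>();
        Console.WriteLine(AppendUntyped(adjectives, "quick"));
        Console.WriteLine(AppendTyped(new List<string>(), "brown"));
    }
}
```

In the untyped version, a wrong cast still compiles and only fails at run time with an InvalidCastException, whereas the typed version rejects a mismatched argument at compile time.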

It is possible to add strong typing to the input token values in a grammar, so that their $N parameters no longer need to be cast but already have the correct type. It is also possible to strongly type the non-terminals that appear on the left of grammar rules. Assignments to the $$ value parameter in action code, which represents the new value assigned to the non-terminal token, are then type checked as well. Similarly, when a non-terminal appears on the right side of a grammar rule, its value as a $N parameter in the action code is strongly typed.

Strongly typed terminal tokens

In the tokens or events section of the grammar, we give a name and an optional integer type value for each of the terminal tokens that might be returned from an input tokeniser. It is also possible to provide a data type for the Value property of the IToken objects that carry that particular token type. This is achieved using the same syntax that C# uses for generic type arguments: the data type for the object stored in the Value property of the IToken appears after the name of the terminal token, between less-than and greater-than symbols. An example appears below:

tokens
{
	INTEGER<int> = 1,        // User-specified token type value of 1, & type for the token data
	PLUS,                    // Will be allocated a type value beyond 16384
	MINUS,                   // Absence of data type makes token data type default to 'object'
	TIMES,
	DIVIDE,
	IDENTIFIER<string>,      // User-specified data type, but token type beyond 16384
	LPAREN,
	RPAREN
}

Strongly typed non-terminals

To stipulate the data type for the Value property of a non-terminal IToken, we place a data type between less-than and greater-than symbols to the right of the non-terminal's name in its first rule definition, just before the colon. This ensures that the $N parameter in any action code automatically has the correct type when the non-terminal appears in a rule that is being reduced. It also ensures that the type of any expression assigned to $$ in action code is checked against the type of the non-terminal on the left of that rule. An example follows:

    adjectives <List<string>> :
        adjectives ADJECTIVE
        {
            // Assume that the terminal token ADJECTIVE was also strongly
            // typed as a string in the tokens section of the grammar
            
            $0.Add($1); // Append a string onto a List<string>
            $$ = $0;    // Ensure the list of strings is passed to the
                        // left hand side non-terminal on rule reduction
        }
    |
        {
            $$ = new List<string>();
        }
    ;

Strong typing and multiplicity

A multiplicity symbol placed after a token changes its strong type to match the selected multiplicity. If a terminal or non-terminal token has type X and appears with a * or + multiplicity, the type of that element when used as a $N parameter is IList<X>. Similarly, if the multiplicity symbol is the optional symbol ?, the type of the $N parameter is IOptional<X>, an interface containing a boolean HasValue property and a Value property of type X. Consider the following sample code, whose comments indicate the types of the various rule elements:

...
tokens
{
    ...
    ATERMINAL<TSomeType>,
    ...
}
grammar(rootNonterminal)
{
    ...
    nonTerminal1<TNonterm1> : ... rules for nonTerminal1 ... ;
    ...
    nonTerminal2 :
        nonTerminal1* ATERMINAL+ nonTerminal1?
        {
            if($0.Count > 0)  // $0 has type IList<TNonterm1>
                foreach(TSomeType tst in $1)  // $1 has type IList<TSomeType>
                    tst.SomeMemberFunction();
            if($2.HasValue)  // $2 has type IOptional<TNonterm1>
                $2.Value.SomeOtherMemberFunction();
        }
    ;
}
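To make the shape of an optional element more concrete, here is a minimal C# sketch of an interface with the two members described above. It is declared locally under the hypothetical name IOptionalSketch<T>, together with a hypothetical implementing class; the real IOptional<X> supplied with ParseLR may declare further members or differ in detail.

```csharp
using System;

// Hypothetical stand-in mirroring the two members described for IOptional<X>.
public interface IOptionalSketch<T>
{
    bool HasValue { get; }
    T Value { get; }
}

// Hypothetical implementation, for illustration only.
public sealed class OptionalSketch<T> : IOptionalSketch<T>
{
    private readonly T value;

    public bool HasValue { get; }

    public T Value
    {
        get
        {
            // Assumption of this sketch: reading an absent value is an error.
            if (!HasValue)
                throw new InvalidOperationException("No value present");
            return value;
        }
    }

    private OptionalSketch(bool hasValue, T value)
    {
        HasValue = hasValue;
        this.value = value;
    }

    public static OptionalSketch<T> Some(T value) => new OptionalSketch<T>(true, value);
    public static OptionalSketch<T> None() => new OptionalSketch<T>(false, default(T));
}

public static class OptionalSketchDemo
{
    // Mirrors the guarded access pattern used on $2 in the grammar above.
    public static string Describe(IOptionalSketch<string> element)
    {
        return element.HasValue ? "present: " + element.Value : "absent";
    }

    public static void Main()
    {
        Console.WriteLine(Describe(OptionalSketch<string>.Some("x")));
        Console.WriteLine(Describe(OptionalSketch<string>.None()));
    }
}
```

Whether the real Value property throws when the optional element was not matched is not stated here; the sketch throws purely to make unguarded access visible, which is why the action code above tests HasValue first.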

Writing the input tokeniser

Interestingly, the input tokeniser needs no changes to support strong typing. It still implements the standard IToken interface on each of the tokens it returns to the parser engine, so the Value property is still filled in with data that is treated as if it were of type object. It is the autogenerated parser code created from the grammar that imposes the strong typing on the $N and $$ parameters of the action code blocks. For more details on how to write input tokenisers, see Writing an input tokeniser.
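As a rough illustration of why the tokeniser is unaffected, the following C# sketch shows a token object whose Value is filled in as a plain object, and a cast of the kind that a typed grammar makes unnecessary in hand-written action code. ISimpleToken, SimpleToken and TokeniserSketch are hypothetical stand-ins that declare only the Type and Value properties mentioned in this article; they are not the ParseLR IToken interface, and the code that ParseLR actually generates is not shown here.

```csharp
using System;

// Hypothetical stand-in with only the two properties described in this article.
public interface ISimpleToken
{
    int Type { get; }
    object Value { get; }
}

public sealed class SimpleToken : ISimpleToken
{
    public int Type { get; }
    public object Value { get; }

    public SimpleToken(int type, object value)
    {
        Type = type;
        Value = value;
    }
}

public static class TokeniserSketch
{
    // Matches 'INTEGER<int> = 1' in the tokens example earlier in this article.
    public const int INTEGER = 1;

    // The tokeniser side: the integer is boxed into the object-typed Value.
    public static ISimpleToken MakeInteger(string text)
    {
        return new SimpleToken(INTEGER, int.Parse(text));
    }

    // The consuming side: one cast recovers the declared type. With a typed
    // grammar, this kind of cast no longer has to be written in action code.
    public static int ReadInteger(ISimpleToken token)
    {
        return (int)token.Value;
    }

    public static void Main()
    {
        ISimpleToken token = MakeInteger("42");
        Console.WriteLine(ReadInteger(token) + 1);
    }
}
```

The tokeniser remains responsible for storing a value of the declared type: if it stored a string for a token declared as INTEGER<int>, a cast like the one in ReadInteger would fail at run time.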