Grammar rules

The grammar section of a grammar description consists of a list of grammar rules. Each grammar rule defines what sequence of input symbols would have to be recognised before a particular rule construct has been recognised. For example, a rule construct that describes a real number might be defined as an optional plus or minus sign, followed by zero or more digits, followed by an optional period and subsequent arbitrary number of digits. A grammar rule is a formal way of writing down such a rule construct.

To understand how a rule is written down we need to define the two terms: terminal token and non-terminal token.

A terminal token is one of the tokens returned by the input tokeniser. A terminal token has a token type and optionally an associated value. The list of token types the input tokeniser can generate is given in the tokens or events section of the input grammar description. The input tokeniser's job is to recognise the real input stream of events or symbols and to transform them into a stream of tokens that are defined as being recognised in the grammar description. For example, an input tokeniser that reads characters from an input file might transform individual characters in the range '0' to '9' into a token of type DIGIT where the associated value is the numeric value 0 to 9 corresponding to the character received. It will be the token type DIGIT that will be recognised within a grammar rule.

A non-terminal token is an identifier used as an abbreviation for a sequence of other tokens that has been recognised, be they terminal or non-terminal or a combination of the two. In practice, every non-terminal used within a grammar description appears as the left hand side of at least one of the grammar rules. That rule defines what sequence of tokens has to be recognised on the input stream before that non-terminal token has been observed.

As an example, we shall consider the real number rule described above. Written using the grammar rule syntax, assuming SIGN, PERIOD and DIGIT are terminal token types, the following group of rules describe valid real numbers:


realNumber: optSign digits optFractionalPart ;

optSign: SIGN ;
optSign: ;

digits: digits DIGIT ;
digits: DIGIT ;

optFractionalPart: PERIOD digits ;
optFractionalPart: ;       

Notice the basic syntax for a grammar rule. A single non-terminal token that is being defined appears to the left of a colon character. After the colon appear zero or more terminal or non-terminal tokens. The whole grammar rule is terminated by a semi-colon. The top rule of the grammar above says that a realNumber has only been recognised on the input stream when an optSign has been recognised, immediately followed by a digits non-terminal, followed by an optFractionalPart.

Notice too that the grammar nests rules. In order for a realNumber to be recognised, one of its input tokens must have been a digits token. In order for the digits token to have been recognised though, a sequence of one or more contiguous DIGIT terminal tokens must have been received, as defined by the two rules with digits to the left of their colon.

The grammar rules may be recursive. One rule in the above list states that in order for a digits token to have been seen on the input stream, a digits token followed by a single DIGIT terminal token must have been seen.  This rule looks like it might never be recognised completely in the input stream, as in order to complete itself, it must have already seen a complete version of itself followed by a DIGIT. However, there is a second rule that also identifies a second circumstance in which we can say we have recognised a digits token, namely a single DIGIT non-terminal token.

Assume that we have a two-digit input sequence for example. The first digit character causes the input tokeniser to return a DIGIT token to the grammar parsing engine. The engine knows that it has a rule saying receipt of a single DIGIT can be recognised as a digits non-terminal, so it converts (reduces) its record of an input DIGIT into a received digits token. The second digit character to be read from the input tokeniser causes the parsing engine to recognise that its previously recorded digits token followed by a DIGIT terminal token is the right hand side of the other digits grammar rule, and so it replaces both recorded tokens with the single recognised digits token. Hence the two digits rules define the fact that an arbitrary length sequence of one or more digit characters reduces to a single recognised digits token. As implied in this section, replacing a sequence of terminals and non-terminals that have been recognised as matching the right side of a rule with the single non-terminal token that appears at the left side of the rule is known as reducing by that rule.

Notice that some rules have no tokens between the colon and the terminating semi-colon. This means that that rule has been recognised when no further input tokens have been received at that point in the grammar. These empty rules are applied only if the non-empty rules don't apply. Hence in the grammar above, an optSign has been recognised if a SIGN (a '+' or a '-') is seen, but if the next input token is neither a '+' or a '-', we just assume we've recognised an optSign anyway. This allows the optSign in the top rule (the rule defining realNumber) to be recognised even if there is no sign character before the digit sequence.

A common abbreviation of the grammar allows us to group all the rules that share the same left hand token, and to only write the left token name once. We separate the several rules that all reduce to the same non-terminal using the 'or' operator, which is the same vertical bar character used for bitwise-or in the C family of languages. Hence the grammar rules above would more commonly be written as:


realNumber:
    optSign digits optFractionalPart 
;

optSign: 
    SIGN 
|   // This rule is empty
;

digits: 
    digits DIGIT
|   DIGIT
;

optFractionalPart: 
    PERIOD digits
|
;     

Note that the layout used here is not required, as white space and newlines are merely separators. It is however a common layout so that the separate rules and rule groups are easily identified. Note too that the comment begun with two forward slash characters is not required for empty rules, but is included above to highlight an example of an empty rule.