The grammar
section of a grammar description consists of a list of grammar
rules. Each grammar rule defines what sequence of input symbols would have to be
recognised before a particular rule construct has been recognised. For example,
a rule construct that describes a real number might be defined as an optional
plus or minus sign, followed by zero or more digits, followed by an optional
period and subsequent arbitrary number of digits. A grammar rule is a formal way
of writing down such a rule construct.
To understand how a rule is written down we need to define the two terms: terminal token and non-terminal token.
A terminal token is one of the tokens returned by the input
tokeniser. A terminal token has a token type and optionally an associated value.
The list of token types the input tokeniser can generate is given in the
tokens
or events
section of the input grammar description.
The input tokeniser's job is to recognise the real input stream of events or
symbols and to transform them into a stream of tokens that are defined as being
recognised in the grammar description. For example, an input tokeniser that
reads characters from an input file might transform individual characters in the
range '0' to '9' into a token of type DIGIT
where the associated
value is the numeric value 0 to 9 corresponding to the character received. It
will be the token type DIGIT
that will be recognised within a
grammar rule.
A non-terminal token is an identifier used as an abbreviation for a sequence of other tokens that has been recognised, be they terminal or non-terminal or a combination of the two. In practice, every non-terminal used within a grammar description appears as the left hand side of at least one of the grammar rules. That rule defines what sequence of tokens has to be recognised on the input stream before that non-terminal token has been observed.
As an example, we shall consider the real number rule described above.
Written using the grammar rule syntax, assuming SIGN
, PERIOD
and
DIGIT
are
terminal token types, the following group of rules describe valid real numbers:
realNumber: optSign digits optFractionalPart ;
optSign: SIGN ;
optSign: ;
digits: digits DIGIT ;
digits: DIGIT ;
optFractionalPart: PERIOD digits ;
optFractionalPart: ;
Notice the basic syntax for a grammar rule. A single non-terminal token
that is being defined appears to the left of a colon character. After the colon
appear zero or more terminal or non-terminal tokens. The whole grammar rule is
terminated by a semi-colon. The top rule of the grammar above says that a
realNumber
has only been recognised on the input stream when an
optSign
has been recognised, immediately followed by a digits
non-terminal, followed by an optFractionalPart
.
Notice too that the grammar nests rules. In order for a realNumber
to be recognised, one of its input tokens must have been a digits
token. In order for the digits
token to have been recognised
though, a sequence of one or more contiguous DIGIT
terminal tokens
must have been received, as defined by the two rules with digits
to
the left of their colon.
The grammar rules may be recursive. One rule in the above list states that in
order for a digits
token to have been seen on the input stream, a
digits
token followed by a single DIGIT
terminal token
must have been seen. This rule looks like it might never be recognised
completely in the input stream, as in order to complete itself, it must have
already seen a complete version of itself followed by a DIGIT
.
However, there is a second rule that also identifies a second circumstance in
which we can say we have recognised a digits
token, namely a single
DIGIT
non-terminal token.
Assume that we have a two-digit input sequence for example. The first digit
character causes the input tokeniser to return a DIGIT
token to the
grammar parsing engine. The engine knows that it has a rule saying receipt of a
single DIGIT
can be recognised as a digits
non-terminal, so it converts (reduces) its record of an input
DIGIT
into a received digits
token. The second digit
character to be read from the input tokeniser causes the parsing engine to
recognise that its previously recorded digits
token followed by a
DIGIT
terminal token is the right hand side of the other
digits
grammar rule, and so it replaces both recorded tokens with the
single recognised digits
token. Hence the two digits
rules define the fact that an arbitrary length sequence of one or more digit
characters reduces to a single recognised digits
token. As implied
in this section, replacing a sequence of terminals and non-terminals that have
been recognised as matching the right side of a rule with the single
non-terminal token that appears at the left side of the rule is known as
reducing by that rule.
Notice that some rules have no tokens between the colon and the terminating
semi-colon. This means that that rule has been recognised when no further input
tokens have been received at that point in the grammar. These empty rules are
applied only if the non-empty rules don't apply. Hence in the grammar above, an
optSign
has been recognised if a SIGN
(a '+' or a '-')
is seen, but if the next input token is neither a '+' or a '-', we just assume
we've recognised an optSign
anyway. This allows the optSign
in the top rule (the rule defining realNumber
) to be recognised
even if there is no sign character before the digit sequence.
A common abbreviation of the grammar allows us to group all the rules that share the same left hand token, and to only write the left token name once. We separate the several rules that all reduce to the same non-terminal using the 'or' operator, which is the same vertical bar character used for bitwise-or in the C family of languages. Hence the grammar rules above would more commonly be written as:
realNumber:
optSign digits optFractionalPart
;
optSign:
SIGN
| // This rule is empty
;
digits:
digits DIGIT
| DIGIT
;
optFractionalPart:
PERIOD digits
|
;
Note that the layout used here is not required, as white space and newlines are merely separators. It is however a common layout so that the separate rules and rule groups are easily identified. Note too that the comment begun with two forward slash characters is not required for empty rules, but is included above to highlight an example of an empty rule.