Writing a Parser in Java: Introduction to the LL(1) Grammar

In the previous post in this series we wrote a tokenizer which splits the input into short segments called tokens. Each type of token is given a unique code and the input is reduced to a series of token codes. In this and the next instalment we will be creating a grammar to parse mathematical expressions. A grammar is a set of rules that specifies how the input is structured. Designing a grammar is a somewhat abstract process and is probably a little bit daunting for many. I will try to explain the steps as clearly as possible.

A grammar is made up of a collection of rules. Before we start explaining the rules, we have to talk about what the rules act upon. The tokenizer gave us a sequence of tokens and I said before that each type of token has a unique code. These token codes are called the terminal symbols of the grammar.

In addition to the terminal symbols we also introduce so-called non-terminal symbols. The non-terminal symbols represent somewhat more abstract concepts. For example a sum might be a non-terminal. A sum is made up of a sequence of terms separated by plus or minus signs. A sum does not correspond to an individual token but states how tokens must be arranged in the input. Usually non-terminals are written in lower case while terminal symbols are written in upper case.

Now we come back to the rules of the grammar. Each rule expresses a way to convert a non-terminal symbol into a sequence of terminal and non-terminal symbols. There might be more than one rule converting a specific non-terminal. A grammar also specifies a special non-terminal which is the starting point.

A simple example

Before we start designing our real grammar form mathematical expressions, let’s look at a simple example.

(1) sum -> NUMBER PLUS NUMBER
(2) sum -> NUMBER PLUS sum

In this grammar we start with the non-terminal sum. The two rules allow us to express any sum of numbers separated by plus signs. We start with the initial non-terminal

sum

By applying rule (2) we can replace sum

(2) -> NUMBER PLUS sum

Now we can apply rule (2) again to the sum on the right hand side, and we end up with

(2) -> NUMBER PLUS NUMBER PLUS sum

We finally apply rule (1) to the non-terminal sum which is still present in our right hand side. This leads us to

(1) -> NUMBER PLUS NUMBER PLUS NUMBER PLUS NUMBER

Now we have created a sequence that is purely made up of non-terminal symbols and expresses a sum of four numbers. In general, the grammar specified above lets us generate a sum of any length. We just have to keep applying rule (2) again and again, until we have enough NUMBERs to add.

Parsing the grammar

One problem remains however. When we are parsing an input from the user, we need to decide which rules to apply. This has to be somehow deduced from the input sequence. We successively apply rules and match them with the input sequence while, at the same time, reading token by token from the input. In the simplest type of grammars, the so-called LL(1) grammars, looking at the next token on the input should be enough to decide which rule to apply.

Let’s take the example above again and assume our input from the tokenizer is

NUMBER PLUS NUMBER PLUS NUMBER

We start with the non-terminal sum, and we can immediately see that we have to apply first rule (2) and then rule (1). But with the restrictions of the LL(1) grammar, the parser can only see the next symbol. So, when we start parsing, the parser will only see

NUMBER ...

The parser does not know how many numbers are in the sum and can’t decide whether to use rule (1) or rule (2). In order to help the parser we have to re-design our grammar. We use a new terminal symbol called EPSILON which is used when a non-terminal can be removed if it doesn’t match any input.

(1) sum -> NUMBER sum_op
(2) sum_op -> PLUS NUMBER sum_op
(3) sum_op -> EPSILON

Now, let’s follow how the parser reacts to the input. We start by placing our starting non-terminal on the stack.

Stack: sum
Input: NUMBER (PLUS NUMBER PLUS NUMBER)

The input in brackets is the input that is still in the stream but is not visible to the parser. The parser now decides to apply rule (1) based on the next token.

Rules applied: (1)
Stack: NUMBER sum_op
Input: NUMBER (PLUS NUMBER PLUS NUMBER)

Now the stack starts with the terminal NUMBER that matches in the terminal on the input stream. In this case we can discard the matching terminals.

Rules applied: (1)
Stack: sum_op
Input: PLUS (NUMBER PLUS NUMBER)

Now we see a PLUS on the input and the only way to convert sum_op into a string that starts with PLUS is by applying rule (2)

Rules applied: (1) (2)
Stack: PLUS NUMBER sum_op
Input: PLUS (NUMBER PLUS NUMBER)

We reduce again by discarding the matching terminal from the beginning of the stack.

Rules applied: (1) (2)
Stack: NUMBER sum_op
Input: NUMBER (PLUS NUMBER)
Rules applied: (1) (2)
Stack: sum_op
Input: PLUS (NUMBER)

Note how we can successively discard all matching terminals, the NUMBER and the PLUS. Again we have to apply rule (2) and reduce

Rules applied: (1) (2) (2)
Stack: PLUS NUMBER sum_op
Input: PLUS (NUMBER)
Rules applied: (1) (2) (2)
Stack: NUMBER sum_op
Input: NUMBER
Rules applied: (1) (2) (2)
Stack: sum_op
Input:

Now the input string is empty. This is where the EPSILON rule comes in. The EPSILON rule is used when there is no other rule that can be used. The EPSILON is a terminal that can be discarded without matching any input.

Rules applied: (1) (2) (2) (3)
Stack:
Input:

Now both the stack and the input string are empty. This means we are finished with parsing the input and we have a match.

This concludes our little introduction to LL(1) grammars. With the knowledge we have now, we can set about designing a grammar for mathematical expressions. We will do this in the next instalment of this series.