We first add a class **Token** that holds the information about the tokens in the input.

    public class Token {
        // codes for the different types of tokens
        public static final int EPSILON = 0;
        public static final int PLUSMINUS = 1;
        public static final int MULTDIV = 2;
        public static final int RAISED = 3;
        public static final int FUNCTION = 4;
        public static final int OPEN_BRACKET = 5;
        public static final int CLOSE_BRACKET = 6;
        public static final int NUMBER = 7;
        public static final int VARIABLE = 8;

        // data of the individual token
        public final int token;
        public final String sequence;

        public Token(int token, String sequence) {
            super();
            this.token = token;
            this.sequence = sequence;
        }
    }

In short, this class defines a number of static constants for the different types of tokens and a couple of fields that hold the data for the individual token. Note how we made the fields public final. This makes it immediately clear that we are dealing with an immutable object and there is no need for a multitude of getters in the class.

Next we write a class called **Parser** which does the actual parsing of the expression. The tokens are stored in a **LinkedList** of **Token** objects and one **Token** object is stored as the lookahead.

The parser we are about to write will not do much except throw an exception if the expression is invalid. If the parser runs without an exception being thrown we know that the expression is valid. In one of the future posts we will add to this parser and construct an internal representation of the expression that can be used for calculations.

    import java.util.LinkedList;
    import java.util.List;

    public class Parser {
        LinkedList<Token> tokens;
        Token lookahead;

The main method of the parser is called **parse** and takes the tokens as parameter.

    public void parse(List<Token> tokens) {
        // work on a shallow copy so the caller's list is left untouched
        this.tokens = new LinkedList<Token>(tokens);
        lookahead = this.tokens.getFirst();

        expression();

        if (lookahead.token != Token.EPSILON)
            throw new ParserException("Unexpected symbol " + lookahead.sequence + " found");
    }

In the **parse** method we first create a shallow copy of the token list because we will be taking elements out of the list and we don’t want to create side effects on the parameters. Then **lookahead** is assigned the first token in the list. After these initialisations we call a method called **expression()**. The parser will have one method for every non-terminal symbol of the grammar that we designed in the last post. This means the method **expression()** will parse the non-terminal symbol **expression**.
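To see why the copy matters, here is a minimal sketch. One way to take such a copy is the **LinkedList** copy constructor. The class and method names (**CopyDemo**, **consumeCopy**) are made up for this illustration, and plain strings stand in for **Token** objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class CopyDemo {
    // Consumes a private copy of the input and reports how many
    // elements the caller's list still holds afterwards.
    static int consumeCopy(List<String> input) {
        // shallow copy: a new list object holding the same elements
        LinkedList<String> work = new LinkedList<>(input);
        while (!work.isEmpty()) {
            work.pop(); // removes from the copy only
        }
        return input.size();
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>(Arrays.asList("3", "+", "2"));
        System.out.println(consumeCopy(tokens)); // the caller's list keeps its 3 elements
    }
}
```

The copy is shallow, so the elements themselves are shared. That is harmless here because the tokens are immutable.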

Once the expression has been parsed completely there should be no symbols left in the list. This means that the **lookahead** should be equal to **Token.EPSILON**. If there is still a symbol left in the lookahead it means that there is an error in the input. After parsing the expression we can therefore perform an error check. This takes care of balancing parentheses and making sure the input is a valid expression.

Before we write the **expression()** method we create a small utility method that reads the next token from the list.

    private void nextToken() {
        tokens.pop();
        // at the end of input we return an epsilon token
        if (tokens.isEmpty())
            lookahead = new Token(Token.EPSILON, "");
        else
            lookahead = tokens.getFirst();
    }

This method will be used frequently for reducing the matched terminal symbols. We pop the first token off the list and set the lookahead to the new head of the list. In case the list is empty, we create the special **EPSILON** symbol which signals to the parser that the input has finished.
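The following sketch walks a short token list in the same way, so you can watch the lookahead move and end on **EPSILON**. It is only an illustration: the class **LookaheadDemo** and its **trace** method are invented here, and bare integer codes stand in for **Token** objects.

```java
import java.util.Arrays;
import java.util.LinkedList;

class LookaheadDemo {
    // same code values as in the Token class
    static final int EPSILON = 0;
    static final int PLUSMINUS = 1;
    static final int NUMBER = 7;

    // Consumes the list token by token and records every lookahead value.
    static String trace(Integer... codes) {
        LinkedList<Integer> tokens = new LinkedList<>(Arrays.asList(codes));
        StringBuilder seen = new StringBuilder();
        int lookahead = tokens.isEmpty() ? EPSILON : tokens.getFirst();
        while (lookahead != EPSILON) {
            seen.append(lookahead).append(' ');
            tokens.pop(); // discard the matched token
            // an empty list means the input is finished
            lookahead = tokens.isEmpty() ? EPSILON : tokens.getFirst();
        }
        seen.append(lookahead); // the final EPSILON
        return seen.toString();
    }

    public static void main(String[] args) {
        // NUMBER PLUSMINUS NUMBER, e.g. "3 + 2"
        System.out.println(trace(NUMBER, PLUSMINUS, NUMBER)); // prints "7 1 7 0"
    }
}
```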

Now for the **expression()** method. The **expression** non-terminal only has one rule so the method becomes quite simple.

    private void expression() {
        // expression -> signed_term sum_op
        signedTerm();
        sumOp();
    }

For each non-terminal on the right hand side of the rule we call the appropriate method.

We continue with writing a method for the **sum_op** non-terminal.

    private void sumOp() {
        if (lookahead.token == Token.PLUSMINUS) {
            // sum_op -> PLUSMINUS term sum_op
            nextToken();
            term();
            sumOp();
        } else {
            // sum_op -> EPSILON
        }
    }

If the next symbol is **PLUSMINUS**, that is a plus or a minus, we apply the rule

sum_op -> PLUSMINUS term sum_op

Because the right hand side starts with a terminal that is allowed by the grammar, we can eat it up by calling **nextToken()**. Then we call the methods corresponding to the remaining non-terminals in the rule. If, on the other hand, the next token is not **PLUSMINUS** there is no other match for **sum_op** except the **EPSILON** rule. This means that, in this case, we just do nothing and return out of the **sumOp()** method.

We can continue in the same way with **signedTerm()**.

    private void signedTerm() {
        if (lookahead.token == Token.PLUSMINUS) {
            // signed_term -> PLUSMINUS term
            nextToken();
            term();
        } else {
            // signed_term -> term
            term();
        }
    }

Again, there are two possible rules. If the next token is **PLUSMINUS** we can eat it up and then parse the non-terminal **term**. Otherwise we parse the non-terminal **term** directly.

The next symbols are handled in pretty much the same way.

    private void term() {
        // term -> factor term_op
        factor();
        termOp();
    }

    private void termOp() {
        if (lookahead.token == Token.MULTDIV) {
            // term_op -> MULTDIV signed_factor term_op
            nextToken();
            signedFactor();
            termOp();
        } else {
            // term_op -> EPSILON
        }
    }

The previous two methods will handle a product of an arbitrary number of factors. Each factor, except the first one, can be preceded by a **PLUSMINUS**. This is handled by the following method.

    private void signedFactor() {
        if (lookahead.token == Token.PLUSMINUS) {
            // signed_factor -> PLUSMINUS factor
            nextToken();
            factor();
        } else {
            // signed_factor -> factor
            factor();
        }
    }

The following two methods will handle factors which can contain exponentiation of an arbitrary number of terms.

    private void factor() {
        // factor -> argument factor_op
        argument();
        factorOp();
    }

    private void factorOp() {
        if (lookahead.token == Token.RAISED) {
            // factor_op -> RAISED signed_factor
            nextToken();
            signedFactor();
        } else {
            // factor_op -> EPSILON
        }
    }

The following method, **argument()**, handles either a fixed value as given by a variable or constant, a function, or a bracketed expression.

    private void argument() {
        if (lookahead.token == Token.FUNCTION) {
            // argument -> FUNCTION argument
            nextToken();
            argument();
        } else if (lookahead.token == Token.OPEN_BRACKET) {
            // argument -> OPEN_BRACKET expression CLOSE_BRACKET
            nextToken();
            expression();

            if (lookahead.token != Token.CLOSE_BRACKET)
                throw new ParserException("Closing bracket expected and "
                    + lookahead.sequence + " found instead");

            nextToken();
        } else {
            // argument -> value
            value();
        }
    }

When we are parsing a bracketed expression, we first eat up the opening bracket by calling **nextToken()**. Then we call **expression()** to parse the expression inside the brackets. After that is done we are expecting the next token to be a closing bracket. If that is not the case then we have encountered a syntax error. In this case we throw an exception informing about the error.

Finally we have a method for the non-terminal value which can either expand to a **NUMBER** or to a **VARIABLE**.

    private void value() {
        if (lookahead.token == Token.NUMBER) {
            // value -> NUMBER
            nextToken();
        } else if (lookahead.token == Token.VARIABLE) {
            // value -> VARIABLE
            nextToken();
        } else {
            throw new ParserException("Unexpected symbol " + lookahead.sequence + " found");
        }
    }

This concludes the coding of a recursive descent parser. If you haven’t figured it out by now, the name “recursive descent” comes from the fact that the parser performs a depth first search by recursively calling the same methods. You can spot the recursion in the **argument()** method which is indirectly called by the **expression()** method but also calls the **expression()** method itself.
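If you want to play with the idea without a tokenizer at hand, the condensed sketch below follows the same grammar but works directly on an array of token codes. It is not the **Parser** class from this post: the name **MiniRecognizer**, the **isValid** method and the use of **IllegalStateException** in place of **ParserException** are assumptions made to keep the example self-contained.

```java
class MiniRecognizer {
    // token codes, same values as in the Token class
    static final int EPSILON = 0, PLUSMINUS = 1, MULTDIV = 2, RAISED = 3,
            FUNCTION = 4, OPEN_BRACKET = 5, CLOSE_BRACKET = 6, NUMBER = 7, VARIABLE = 8;

    private final int[] codes;
    private int pos = 0;

    private MiniRecognizer(int[] codes) { this.codes = codes; }

    // the lookahead is EPSILON once the input is used up
    private int lookahead() { return pos < codes.length ? codes[pos] : EPSILON; }

    // Returns true if the token codes form a valid expression.
    static boolean isValid(int... codes) {
        MiniRecognizer r = new MiniRecognizer(codes);
        try {
            r.expression();
            return r.lookahead() == EPSILON; // nothing may be left over
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // one method per non-terminal, as in the post
    private void expression() { signedTerm(); sumOp(); }
    private void signedTerm() { if (lookahead() == PLUSMINUS) pos++; term(); }
    private void sumOp() { if (lookahead() == PLUSMINUS) { pos++; term(); sumOp(); } }
    private void term() { factor(); termOp(); }
    private void termOp() { if (lookahead() == MULTDIV) { pos++; signedFactor(); termOp(); } }
    private void signedFactor() { if (lookahead() == PLUSMINUS) pos++; factor(); }
    private void factor() { argument(); factorOp(); }
    private void factorOp() { if (lookahead() == RAISED) { pos++; signedFactor(); } }

    private void argument() {
        if (lookahead() == FUNCTION) {
            pos++;
            argument();
        } else if (lookahead() == OPEN_BRACKET) {
            pos++;
            expression(); // the indirect recursion back into expression()
            if (lookahead() != CLOSE_BRACKET)
                throw new IllegalStateException("closing bracket expected");
            pos++;
        } else if (lookahead() == NUMBER || lookahead() == VARIABLE) {
            pos++; // value -> NUMBER or value -> VARIABLE
        } else {
            throw new IllegalStateException("unexpected symbol");
        }
    }

    public static void main(String[] args) {
        // 3 + 2 * 5
        System.out.println(isValid(NUMBER, PLUSMINUS, NUMBER, MULTDIV, NUMBER)); // true
        // ( 3 + 2   with the closing bracket missing
        System.out.println(isValid(OPEN_BRACKET, NUMBER, PLUSMINUS, NUMBER));    // false
    }
}
```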

In the next post in this series we will be constructing an internal representation of the expression.

Before we actually get down to designing the grammar we have to define all the terminal symbols that we expect.

| Terminal | Meaning |
| --- | --- |
| PLUSMINUS | + or - |
| MULTDIV | * or / |
| RAISED | ^ |
| FUNCTION | any mathematical function: sin, cos, exp, sqrt, ... |
| OPEN_BRACKET | ( |
| CLOSE_BRACKET | ) |
| NUMBER | any number |
| VARIABLE | the name of a variable |

These are all the symbols that we allow in the user input. We assume that the tokenizer has already split the input into these terminal symbols and the parser now gets a stream of these symbols.

Remember from the last post that the parser works with terminal and non-terminal symbols. The non-terminal symbols can be expanded by the parser’s rules while the terminal symbols cannot be changed. The parsing starts with a single non-terminal. For our expression parser we call that symbol **expression**.

We want our parser to not only decide if the input is a valid expression. We also want it to generate an internal representation of that expression. This means we have to make sure that the mathematical operations are carried out in the right order. In an expression like **3 + 2*5** the multiplication has to be carried out first because the multiplication has a higher precedence than the addition (remember multiplication before addition).

For reasons that we will see later the rules start with the weakest operation. Essentially, in a forthcoming post, we will be creating a tree structure with the lowest precedence operations near the root.

On the top level an expression can be interpreted as a sum of one or more terms. The first term can start with a plus or minus.

    expression -> signed_term sum_op
    sum_op -> PLUSMINUS term sum_op
    sum_op -> EPSILON

Remember from the last post that the way these rules are designed is necessary to make sure that we always know which rule to apply just by reading the next symbol from the input. The **signed_term** can be a term preceded by **PLUSMINUS** or just a term.

    signed_term -> PLUSMINUS term
    signed_term -> term

Now we continue in the same way with products. Each term is a product of one or more factors.

    term -> factor term_op
    term_op -> MULTDIV signed_factor term_op
    term_op -> EPSILON

A **signed_factor** is a **factor** that can start with a **PLUSMINUS** or just a factor.

    signed_factor -> PLUSMINUS factor
    signed_factor -> factor

A factor is an argument, which may be a function, optionally raised to some power.

    factor -> argument factor_op
    factor_op -> RAISED signed_factor
    factor_op -> EPSILON

The first rule allows a simple argument, because **factor_op** can be eliminated using the **EPSILON** rule. It also allows exponentiation via the second rule.

An argument is either just a value, a function applied to an argument, or a full expression enclosed in brackets.

    argument -> value
    argument -> FUNCTION argument
    argument -> OPEN_BRACKET expression CLOSE_BRACKET

The third of these rules allows us to build expressions of arbitrary complexity because we can keep on putting expressions within expressions as long as they are enclosed in brackets. The second rule allows function expressions to be used as arguments. Note that an expression like **sin x** is possible without any brackets around the **x** because the function argument can simply be a value.

But what happens when the expression reads something like **sin 2*x**? This results in an ambiguity: should **sin 2*x** be treated as **sin(2*x)** or as **(sin 2)*x**? I have tried typing the expression into Google calculator and it turns out that it makes the distinction between **sin 2x** which it interprets as **sin(2*x)** and **sin 2*x** which it interprets as **sin(2)*x**. WolframAlpha on the other hand will interpret both as **sin(2*x)** but when you add spaces, **sin 2 x** and **sin 2 * x** will be interpreted as **sin(2)*x**.

In our case, using the simple LL(1) grammar and ignoring white spaces, we won’t be able to implement such fine grained behaviour. This means our grammar will always interpret the expression as **sin(2)*x**. Also note that this means that an expression such as **sin x^2** will be interpreted as **(sin(x))^2**.
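To check these two interpretations, here is a small sketch that follows the grammar and prints the grouping it finds with explicit brackets. Everything in it is an assumption made for the demonstration: the class name **GroupingDemo**, the input as pre-split strings instead of real tokens, the short list of function names, and the left-to-right grouping of chained sums and products (the grammar itself does not say how those chains are grouped).

```java
import java.util.Arrays;
import java.util.List;

class GroupingDemo {
    // hypothetical list of recognised function names
    private static final List<String> FUNCTIONS = Arrays.asList("sin", "cos", "exp", "sqrt");
    private final String[] toks;
    private int pos = 0;

    private GroupingDemo(String[] toks) { this.toks = toks; }

    // Parses pre-split tokens and returns the expression with explicit brackets.
    static String group(String... toks) {
        GroupingDemo g = new GroupingDemo(toks);
        String result = g.expression();
        if (g.pos != toks.length)
            throw new IllegalArgumentException("unexpected '" + g.peek() + "'");
        return result;
    }

    private String peek() { return pos < toks.length ? toks[pos] : ""; }
    private boolean isPlusMinus() { return peek().equals("+") || peek().equals("-"); }

    private String expression() {          // expression -> signed_term sum_op
        String left = isPlusMinus() ? toks[pos++] + term() : term();
        while (isPlusMinus()) {            // sum_op written as a loop
            String op = toks[pos++];
            left = "(" + left + op + term() + ")";
        }
        return left;
    }

    private String term() {                // term -> factor term_op
        String left = factor();
        while (peek().equals("*") || peek().equals("/")) {
            String op = toks[pos++];
            left = "(" + left + op + signedFactor() + ")";
        }
        return left;
    }

    private String signedFactor() {        // optional sign, then a factor
        return isPlusMinus() ? toks[pos++] + factor() : factor();
    }

    private String factor() {              // factor -> argument factor_op
        String base = argument();
        if (peek().equals("^")) {
            pos++;
            return "(" + base + "^" + signedFactor() + ")";
        }
        return base;
    }

    private String argument() {
        String t = peek();
        if (FUNCTIONS.contains(t)) {       // argument -> FUNCTION argument
            pos++;
            return t + "(" + argument() + ")";
        }
        if (t.equals("(")) {               // argument -> OPEN_BRACKET expression CLOSE_BRACKET
            pos++;
            String inner = expression();
            if (!peek().equals(")"))
                throw new IllegalArgumentException("closing bracket expected");
            pos++;
            return inner;
        }
        if (t.isEmpty() || "+-*/^)".contains(t))
            throw new IllegalArgumentException("unexpected '" + t + "'");
        pos++;
        return t;                          // value -> NUMBER or VARIABLE
    }

    public static void main(String[] args) {
        System.out.println(group("sin", "2", "*", "x")); // prints (sin(2)*x)
        System.out.println(group("sin", "x", "^", "2")); // prints (sin(x)^2)
    }
}
```

The function binds only to the argument that directly follows it, which is why both the multiplication and the exponent end up outside the **sin**.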

Finally, a value can be a **NUMBER** or a **VARIABLE**.

    value -> NUMBER
    value -> VARIABLE

This is the complete grammar for mathematical expressions. Why not try for yourself to analyse an expression using this grammar?

In the next post in this series I will program a recursive descent parser based on this grammar.

A grammar is made up of a collection of rules. Before we start explaining the rules, we have to talk about what the rules act upon. The tokenizer gave us a sequence of tokens and I said before that each type of token has a unique code. These token codes are called the terminal symbols of the grammar.

In addition to the terminal symbols we also introduce so-called non-terminal symbols. The non-terminal symbols represent somewhat more abstract concepts. For example a sum might be a non-terminal. A sum is made up of a sequence of terms separated by plus or minus signs. A sum does not correspond to an individual token but states how tokens must be arranged in the input. Usually non-terminals are written in lower case while terminal symbols are written in upper case.

Now we come back to the rules of the grammar. Each rule expresses a way to convert a non-terminal symbol into a sequence of terminal and non-terminal symbols. There might be more than one rule converting a specific non-terminal. A grammar also specifies a special non-terminal which is the starting point.

Before we start designing our real grammar for mathematical expressions, let’s look at a simple example.

    (1) sum -> NUMBER PLUS NUMBER
    (2) sum -> NUMBER PLUS sum

In this grammar we start with the non-terminal **sum**. The two rules allow us to express any sum of numbers separated by plus signs. We start with the initial non-terminal

sum

By applying rule (2) we can replace **sum**

(2) -> NUMBER PLUS sum

Now we can apply rule (2) again to the **sum** on the right hand side, and we end up with

(2) -> NUMBER PLUS NUMBER PLUS sum

We finally apply rule (1) to the non-terminal **sum** which is still present in our right hand side. This leads us to

(1) -> NUMBER PLUS NUMBER PLUS NUMBER PLUS NUMBER

Now we have created a sequence that is purely made up of terminal symbols and expresses a sum of four numbers. In general, the grammar specified above lets us generate a sum of any length. We just have to keep applying rule (2) again and again, until we have enough **NUMBER**s to add.
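The derivation can be replayed mechanically. The sketch below applies rule (2) as often as needed and then finishes with rule (1), using plain string replacement. The class **SumDerivation** and its **derive** method are names invented for this illustration.

```java
class SumDerivation {
    // Derives a sum of n numbers (n >= 2) from the start symbol "sum".
    static String derive(int n) {
        if (n < 2)
            throw new IllegalArgumentException("these two rules need at least two numbers");
        String form = "sum"; // the starting non-terminal
        for (int i = 0; i < n - 2; i++) {
            form = form.replace("sum", "NUMBER PLUS sum"); // rule (2)
        }
        return form.replace("sum", "NUMBER PLUS NUMBER");   // rule (1)
    }

    public static void main(String[] args) {
        // prints NUMBER PLUS NUMBER PLUS NUMBER PLUS NUMBER
        System.out.println(derive(4));
    }
}
```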

One problem remains however. When we are parsing an input from the user, we need to decide which rules to apply. This has to be somehow deduced from the input sequence. We successively apply rules and match them with the input sequence while, at the same time, reading token by token from the input. In the simplest type of grammars, the so-called LL(1) grammars, looking at the next token on the input should be enough to decide which rule to apply.

Let’s take the example above again and assume our input from the tokenizer is

NUMBER PLUS NUMBER PLUS NUMBER

We start with the non-terminal **sum**, and we can immediately see that we have to apply first rule (2) and then rule (1). But with the restrictions of the LL(1) grammar, the parser can only see the next symbol. So, when we start parsing, the parser will only see

NUMBER ...

The parser does not know how many numbers are in the sum and can’t decide whether to use rule (1) or rule (2). In order to help the parser we have to re-design our grammar. We use a new terminal symbol called **EPSILON** which is used when a non-terminal can be removed if it doesn’t match any input.

    (1) sum -> NUMBER sum_op
    (2) sum_op -> PLUS NUMBER sum_op
    (3) sum_op -> EPSILON

Now, let’s follow how the parser reacts to the input. We start by placing our starting non-terminal on the stack.

    Stack: sum
    Input: NUMBER (PLUS NUMBER PLUS NUMBER)

The input in brackets is the input that is still in the stream but is not visible to the parser. The parser now decides to apply rule (1) based on the next token.

    Rules applied: (1)
    Stack: NUMBER sum_op
    Input: NUMBER (PLUS NUMBER PLUS NUMBER)

Now the stack starts with the terminal **NUMBER** that matches the terminal on the input stream. In this case we can discard the matching terminals.

    Rules applied: (1)
    Stack: sum_op
    Input: PLUS (NUMBER PLUS NUMBER)

Now we see a **PLUS** on the input and the only way to convert **sum_op** into a string that starts with **PLUS** is by applying rule (2)

    Rules applied: (1) (2)
    Stack: PLUS NUMBER sum_op
    Input: PLUS (NUMBER PLUS NUMBER)

We reduce again by discarding the matching terminal from the beginning of the stack.

    Rules applied: (1) (2)
    Stack: NUMBER sum_op
    Input: NUMBER (PLUS NUMBER)

    Rules applied: (1) (2)
    Stack: sum_op
    Input: PLUS (NUMBER)

Note how we can successively discard all matching terminals, the **NUMBER** and the **PLUS**. Again we have to apply rule (2) and reduce

    Rules applied: (1) (2) (2)
    Stack: PLUS NUMBER sum_op
    Input: PLUS (NUMBER)

    Rules applied: (1) (2) (2)
    Stack: NUMBER sum_op
    Input: NUMBER

    Rules applied: (1) (2) (2)
    Stack: sum_op
    Input:

Now the input string is empty. This is where the **EPSILON** rule comes in. The **EPSILON** rule is used when there is no other rule that can be used. The **EPSILON** is a terminal that can be discarded without matching any input.

    Rules applied: (1) (2) (2) (3)
    Stack:
    Input:

Now both the stack and the input string are empty. This means we are finished with parsing the input and we have a match.
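The walk-through above can be turned into a few lines of code. The following is a rough sketch of a stack-based matcher for this three-rule grammar only. The class **StackParserDemo** and the method **rulesApplied** are hypothetical names, and the input is given as plain strings rather than tokens. It returns the rules in the order they were applied, or **null** if the input does not match.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackParserDemo {
    // Returns the applied rules, e.g. "(1) (2) (3)", or null if there is no match.
    static String rulesApplied(String... input) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("sum"); // the starting non-terminal
        StringBuilder rules = new StringBuilder();
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            // only the next symbol is visible; "" marks the end of the input
            String next = pos < input.length ? input[pos] : "";
            if (top.equals("sum")) {
                rules.append("(1) ");      // sum -> NUMBER sum_op
                stack.push("sum_op");
                stack.push("NUMBER");
            } else if (top.equals("sum_op")) {
                if (next.equals("PLUS")) {
                    rules.append("(2) ");  // sum_op -> PLUS NUMBER sum_op
                    stack.push("sum_op");
                    stack.push("NUMBER");
                    stack.push("PLUS");
                } else {
                    rules.append("(3) ");  // sum_op -> EPSILON, nothing is pushed
                }
            } else if (top.equals(next)) {
                pos++;                     // matching terminals are discarded
            } else {
                return null;               // terminal on the stack does not match the input
            }
        }
        // a match needs the input to be used up as well
        return pos == input.length ? rules.toString().trim() : null;
    }

    public static void main(String[] args) {
        // prints (1) (2) (2) (3)
        System.out.println(rulesApplied("NUMBER", "PLUS", "NUMBER", "PLUS", "NUMBER"));
    }
}
```

Symbols are pushed in reverse order so that the leftmost symbol of a rule ends up on top of the stack.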

This concludes our little introduction to LL(1) grammars. With the knowledge we have now, we can set about designing a grammar for mathematical expressions. We will do this in the next instalment of this series.
