Remember from the post in which we implemented the parser that every non-terminal symbol in the grammar was given a method inside the **Parser** class. When we defined these methods they were plain void methods that did not take any arguments. This is about to change. We will now modify all these methods to return an **ExpressionNode**.

    private ExpressionNode expression() ...
    private ExpressionNode sumOp(ExpressionNode expr) ...
    private ExpressionNode signedTerm() ...
    private ExpressionNode term() ...
    private ExpressionNode termOp(ExpressionNode expr) ...
    private ExpressionNode signedFactor() ...
    private ExpressionNode factor() ...
    private ExpressionNode factorOp(ExpressionNode expr) ...
    private ExpressionNode argument() ...
    private ExpressionNode value() ...

Inside each of these methods we now have to create the expression nodes that correspond to the parsed expression and return the result. Let’s start with the leaf nodes which are parsed inside **value()**.

    private ExpressionNode value() {
        // value -> NUMBER
        if (lookahead.token == Token.NUMBER) {
            ExpressionNode expr = new ConstantExpressionNode(lookahead.sequence);
            nextToken();
            return expr;
        }

        // value -> VARIABLE
        if (lookahead.token == Token.VARIABLE) {
            ExpressionNode expr = new VariableExpressionNode(lookahead.sequence);
            nextToken();
            return expr;
        }

        // neither rule matched: report what we found instead
        if (lookahead.token == Token.EPSILON)
            throw new ParserException("Unexpected end of input");
        else
            throw new ParserException("Unexpected symbol %s found", lookahead);
    }

If we encounter a number then we create a new **ConstantExpressionNode**. We can use **lookahead.sequence** to initialise the constant. **lookahead.sequence** contains the string that corresponds to the current token. In case of a number, it will contain the string representation of the numerical value. **ConstantExpressionNode** had a convenience constructor that takes a string, converts it into a number and stores it internally as the value of the constant.
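The convenience constructor described above could look roughly like the following. This is only a minimal sketch, not the class from the previous post: the name `ConstantNodeSketch` is made up for illustration, and it assumes the tokenizer only hands over strings that `Double.parseDouble` accepts.

```java
// Hypothetical stand-in for ConstantExpressionNode; names are illustrative only.
final class ConstantNodeSketch {
    private final double value;

    ConstantNodeSketch(double value) {
        this.value = value;
    }

    // Convenience constructor: converts the token text into a number.
    ConstantNodeSketch(String sequence) {
        this(Double.parseDouble(sequence));
    }

    double getValue() {
        return value;
    }

    public static void main(String[] args) {
        System.out.println(new ConstantNodeSketch("2.5").getValue());
    }
}
```

With a constructor like this the parser never has to convert the token text itself; it just passes **lookahead.sequence** along.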

The code added for variables is almost identical except that we create a new **VariableExpressionNode**. Note how, in both cases, we create the node before calling **nextToken()**. We do this because **nextToken()** modifies the lookahead, which would mean we lose the information about the current token.

Next up is the argument() method.

    private ExpressionNode argument() {
        // argument -> FUNCTION argument
        if (lookahead.token == Token.FUNCTION) {
            int function = FunctionExpressionNode.stringToFunction(lookahead.sequence);
            nextToken();
            ExpressionNode expr = argument();
            return new FunctionExpressionNode(function, expr);
        }
        // argument -> OPEN_BRACKET expression CLOSE_BRACKET
        else if (lookahead.token == Token.OPEN_BRACKET) {
            nextToken();
            ExpressionNode expr = expression();
            if (lookahead.token != Token.CLOSE_BRACKET)
                throw new ParserException("Closing brackets expected", lookahead);
            nextToken();
            return expr;
        }

        // argument -> value
        return value();
    }

The argument non-terminal does not actually correspond to any type of node in the expression tree. All we need to do here is pass on the nodes we obtain from the sub-expressions. So, depending on the rule, we either return a terminal node via the **value()** method or a more complex sub-tree via the **expression()** method or we create a **FunctionExpressionNode** if we encounter a function.

If we find a function then we have to create a **FunctionExpressionNode**. To do this we first have to convert the string in **lookahead.sequence** to a function identifier. In the last post we defined a static helper method called **stringToFunction()** that can do this for us. This method will throw an exception if the string does not correspond to a supported function.
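To illustrate the kind of lookup such a helper performs, here is a hedged sketch. It is not the implementation from the last post: the class name, the three constants and the use of `IllegalArgumentException` are assumptions chosen for this example.

```java
// Hypothetical sketch of a name-to-identifier lookup; the constants and the
// exception type are assumptions, not the ones from FunctionExpressionNode.
final class FunctionLookupSketch {
    static final int SIN = 1;
    static final int COS = 2;
    static final int SQRT = 3;

    static int stringToFunction(String name) {
        switch (name) {
            case "sin":  return SIN;
            case "cos":  return COS;
            case "sqrt": return SQRT;
            default:
                // unknown names are rejected rather than silently ignored
                throw new IllegalArgumentException("Unexpected function " + name + " found");
        }
    }

    public static void main(String[] args) {
        System.out.println(stringToFunction("sqrt"));
    }
}
```

Converting the name before calling **nextToken()** matters for the same reason as in **value()**: once the lookahead advances, the function name is gone.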

The factor() method parses an argument non-terminal, optionally followed by a **factor_op**.

    private ExpressionNode factor() {
        // factor -> argument factor_op
        ExpressionNode a = argument();
        return factorOp(a);
    }

The rule for **factor** produces an **argument** and a **factor_op** non-terminal. The **factor_op** non-terminal can include an exponentiation, or it might not. In order for **factor_op** to create the exponentiation it needs to know the base, which is parsed by **argument()**. For this reason we pass the node produced by **argument()** to the **factorOp()** method.

    private ExpressionNode factorOp(ExpressionNode expression) {
        // factor_op -> RAISED signed_factor
        if (lookahead.token == Token.RAISED) {
            nextToken();
            ExpressionNode exponent = signedFactor();
            return new ExponentiationExpressionNode(expression, exponent);
        }

        // factor_op -> EPSILON
        return expression;
    }

If we find a **RAISED** token then **factorOp()** creates an **ExponentiationExpressionNode** with the base given by the expression passed to the method and the exponent parsed by a call to **signedFactor()**, which in turn calls **factor()** recursively. If the **EPSILON** rule is applied then we simply return the expression that was passed in.

We have called the **signedFactor()** method for the exponent which corresponds to the **signed_factor** non-terminal.
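One consequence of parsing the exponent recursively is that a chain of exponentiations nests to the right. The following sketch shows this on a deliberately reduced grammar: single-digit numbers and `^` only, no signs, evaluating directly instead of building nodes. The class `PowerChainSketch` is hypothetical and exists only for this illustration.

```java
// Hypothetical, reduced version of factor()/factorOp() working on characters.
final class PowerChainSketch {
    private final String input;
    private int pos = 0;

    PowerChainSketch(String input) {
        this.input = input;
    }

    // factor -> digit factor_op
    double factor() {
        double base = input.charAt(pos++) - '0';
        return factorOp(base);
    }

    // factor_op -> '^' factor | EPSILON
    double factorOp(double base) {
        if (pos < input.length() && input.charAt(pos) == '^') {
            pos++;
            double exponent = factor(); // the recursion makes '^' right-associative
            return Math.pow(base, exponent);
        }
        return base;
    }

    static double eval(String input) {
        return new PowerChainSketch(input).factor();
    }

    public static void main(String[] args) {
        // parsed as 2^(3^2) = 2^9, not (2^3)^2
        System.out.println(eval("2^3^2")); // prints 512.0
    }
}
```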

    private ExpressionNode signedFactor() {
        // signed_factor -> PLUSMINUS factor
        if (lookahead.token == Token.PLUSMINUS) {
            boolean positive = lookahead.sequence.equals("+");
            nextToken();
            ExpressionNode t = factor();
            if (positive)
                return t;
            else
                return new AdditionExpressionNode(t, false);
        }

        // signed_factor -> factor
        return factor();
    }

A signed factor can start with a plus or minus sign. In the case of a minus sign, an **AdditionExpressionNode** is generated to represent the negative of the factor. In all other cases the expression from **factor()** is simply passed on.

The term non-terminal again does not correspond to any node in the expression tree. The implementation of **term()**, therefore, just passes on the expression nodes.

    private ExpressionNode term() {
        // term -> factor term_op
        ExpressionNode f = factor();
        return termOp(f);
    }

Again, **termOp()** might or might not produce a product, depending on which symbols are found next. In order for **termOp()** to be able to create a product including the first factor, we pass the factor to **termOp()**.

    private ExpressionNode termOp(ExpressionNode expression) {
        // term_op -> MULTDIV signed_factor term_op
        if (lookahead.token == Token.MULTDIV) {
            MultiplicationExpressionNode prod;

            // reuse an existing product node instead of nesting a new one
            if (expression.getType() == ExpressionNode.MULTIPLICATION_NODE)
                prod = (MultiplicationExpressionNode)expression;
            else
                prod = new MultiplicationExpressionNode(expression, true);

            boolean positive = lookahead.sequence.equals("*");
            nextToken();
            ExpressionNode f = signedFactor();
            prod.add(f, positive);

            return termOp(prod);
        }

        // term_op -> EPSILON
        return expression;
    }

The implementation of **termOp()** is slightly more complicated. Remember from the last post that we can add an arbitrary number of factors to a **MultiplicationExpressionNode**. This means, as we continue calling **termOp()** recursively we want to keep adding factors to the multiplication. When we discover a **MULTDIV** token we know that we are dealing with a product or division. We create a new **MultiplicationExpressionNode** only if the node that was passed to us is not already a **MultiplicationExpressionNode**. Then we add the next factor to the multiplication.

The **EPSILON** rule, on the other hand, simply returns the expression that was passed to **termOp()**.
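The effect of reusing the same node is that a chain such as `6 * 4 / 3` ends up as one flat product with three entries rather than a nested tree. The sketch below mimics this with plain numbers instead of sub-expressions. `ProductSketch` is a hypothetical, simplified stand-in and not the **MultiplicationExpressionNode** from the last post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for MultiplicationExpressionNode that
// stores plain numbers instead of sub-expressions.
final class ProductSketch {
    private final List<Double> factors = new ArrayList<>();
    private final List<Boolean> multiply = new ArrayList<>();

    ProductSketch(double first, boolean positive) {
        add(first, positive);
    }

    // positive == true multiplies by the factor, false divides by it
    void add(double factor, boolean positive) {
        factors.add(factor);
        multiply.add(positive);
    }

    double getValue() {
        double result = 1.0;
        for (int i = 0; i < factors.size(); i++) {
            if (multiply.get(i)) result *= factors.get(i);
            else result /= factors.get(i);
        }
        return result;
    }

    int size() {
        return factors.size();
    }

    public static void main(String[] args) {
        // 6 * 4 / 3 collected into a single node
        ProductSketch prod = new ProductSketch(6.0, true);
        prod.add(4.0, true);
        prod.add(3.0, false);
        System.out.println(prod.getValue()); // prints 8.0
    }
}
```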

A generalisation of **term** is **signed_term**. This can include a leading plus or minus symbol.

    private ExpressionNode signedTerm() {
        // signed_term -> PLUSMINUS term
        if (lookahead.token == Token.PLUSMINUS) {
            boolean positive = lookahead.sequence.equals("+");
            nextToken();
            ExpressionNode t = term();
            if (positive)
                return t;
            else
                return new AdditionExpressionNode(t, false);
        }

        // signed_term -> term
        return term();
    }

Only in the case that the leading symbol is a minus, we have to create an **AdditionExpressionNode** and add the term with a negative sign. In all other cases we simply pass on the node obtained from **term()**.

The methods **expression()** and **sumOp()** follow pretty much the same pattern as **term()** and **termOp()**.

    private ExpressionNode expression() {
        // expression -> signed_term sum_op
        ExpressionNode expr = signedTerm();
        return sumOp(expr);
    }

The method **expression()** does not produce any nodes itself but passes the result of **signedTerm()** to **sumOp()**, which can then assemble an **AdditionExpressionNode** if it has to.

    private ExpressionNode sumOp(ExpressionNode expr) {
        // sum_op -> PLUSMINUS term sum_op
        if (lookahead.token == Token.PLUSMINUS) {
            AdditionExpressionNode sum;

            // reuse an existing sum node instead of nesting a new one
            if (expr.getType() == ExpressionNode.ADDITION_NODE)
                sum = (AdditionExpressionNode)expr;
            else
                sum = new AdditionExpressionNode(expr, true);

            boolean positive = lookahead.sequence.equals("+");
            nextToken();
            ExpressionNode t = term();
            sum.add(t, positive);

            return sumOp(sum);
        }

        // sum_op -> EPSILON
        return expr;
    }

As with **termOp()**, we continue calling **sumOp()** recursively and keep adding terms to the sum. When we discover a **PLUSMINUS** token we know that we are dealing with an addition or subtraction. We create a new **AdditionExpressionNode** only if the node that was passed to us is not already an **AdditionExpressionNode**. Then we add the next term to the sum.

The **EPSILON** rule, on the other hand, simply returns the expression that was passed to **sumOp()**.

Finally we modify the parse() method to return the ExpressionNode created in the process of parsing the input.

    public ExpressionNode parse(LinkedList<Token> tokens) {
        // work on a copy so the caller's list is not modified
        this.tokens = (LinkedList<Token>)tokens.clone();
        lookahead = this.tokens.getFirst();

        ExpressionNode expr = expression();

        if (lookahead.token != Token.EPSILON)
            throw new ParserException("Unexpected symbol %s found", lookahead);

        return expr;
    }

With all the above we are ready to use the parser. In our main() method we can now test the following.

    Parser parser = new Parser();
    try {
        ExpressionNode expression = parser.parse("3*2^4 + sqrt(1+3)");
        System.out.println("The value of the expression is " + expression.getValue());
    }
    catch (ParserException e) {
        System.out.println(e.getMessage());
    }
    catch (EvaluationException e) {
        System.out.println(e.getMessage());
    }

This will produce the following output.

The value of the expression is 50.0
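The result is easy to verify by hand: 2^4 is 16, three times that is 48, the square root of 1+3 is 2, and 48 plus 2 gives 50. The following small check performs the same steps with `java.lang.Math`, independently of the parser. The class name `ExpectedValueCheck` is made up for this illustration.

```java
// Recomputes 3*2^4 + sqrt(1+3) step by step without using the parser.
final class ExpectedValueCheck {
    static double expected() {
        double power = Math.pow(2, 4);   // 2^4 = 16
        double product = 3 * power;      // 3 * 16 = 48
        double root = Math.sqrt(1 + 3);  // sqrt(4) = 2
        return product + root;           // 48 + 2 = 50
    }

    public static void main(String[] args) {
        System.out.println(expected()); // prints 50.0
    }
}
```

Note that the exponentiation binds more tightly than the multiplication, which is exactly the precedence the grammar encodes by placing **factor_op** below **term_op**.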

I hope you feel the same sense of exhilaration when seeing this result. It just feels like everything finally came together and the parser works exactly as expected.

Unfortunately we are not quite done. While we are able to parse variables and named constants, we do not yet have a mechanism for setting the values of those variables. This will be the topic of the next and final post in this series. To make things a bit more interesting, I will introduce the visitor design pattern to achieve this.

We first add a class **Token** that holds the information about the tokens in the input.

    public class Token {
        public static final int EPSILON = 0;
        public static final int PLUSMINUS = 1;
        public static final int MULTDIV = 2;
        public static final int RAISED = 3;
        public static final int FUNCTION = 4;
        public static final int OPEN_BRACKET = 5;
        public static final int CLOSE_BRACKET = 6;
        public static final int NUMBER = 7;
        public static final int VARIABLE = 8;

        public final int token;
        public final String sequence;

        public Token(int token, String sequence) {
            super();
            this.token = token;
            this.sequence = sequence;
        }
    }

In short, this class defines a number of static constants for the different types of tokens and a couple of fields that hold the data for the individual token. Note how we made the fields public final. This makes it immediately clear that we are dealing with an immutable object and there is no need for a multitude of getters in the class.

Next we write a class called **Parser** which does the actual parsing of the expression. The tokens are stored in a **List** of **Token** objects and one **Token** object is stored as the lookahead.

The parser we are about to write will not do much except throw an exception if the expression is invalid. If the parser runs without an exception being thrown we know that the expression is valid. In one of the future posts we will add to this parser and construct an internal representation of the expression that can be used for calculations.

    public class Parser {
        LinkedList<Token> tokens;
        Token lookahead;

The main method of the parser is called **parse** and takes the tokens as parameter.

    public void parse(LinkedList<Token> tokens) {
        // clone() is defined on LinkedList, not on the List interface
        this.tokens = (LinkedList<Token>) tokens.clone();
        lookahead = this.tokens.getFirst();

        expression();

        if (lookahead.token != Token.EPSILON)
            throw new ParserException("Unexpected symbol %s found", lookahead);
    }

In the **parse** method we first create a shallow copy of the token list because we will be taking elements out of the list and we don’t want to create side effects on the parameters. Then **lookahead** is assigned the first token in the list. After these initialisations we call a method called **expression()**. The parser will have one method for every non-terminal symbol of the grammar that we designed in the last post. This means the method **expression()** will parse the non-terminal symbol **expression**.

Once the expression has been parsed completely there should be no symbols left in the list. This means that the **lookahead** should be equal to **Token.EPSILON**. If there is still a symbol left in the lookahead it means that there is an error in the input. After parsing the expression we can therefore perform an error check. This takes care of balancing parentheses and making sure the input is a valid expression.

Before we write the **expression()** method we create a small utility method that reads the next token from the list.

    private void nextToken() {
        tokens.pop();
        // at the end of input we return an epsilon token
        if (tokens.isEmpty())
            lookahead = new Token(Token.EPSILON, "");
        else
            lookahead = tokens.getFirst();
    }

This method will be used frequently for reducing the matched terminal symbols. We pop the first token off the list and set the lookahead to the new head of the list. In case the list is empty, we create the special **EPSILON** symbol which shows to the parser that the input has finished.
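The pop-then-peek behaviour can be tried out in isolation. The sketch below uses plain strings instead of **Token** objects and a made-up end marker. The class `LookaheadSketch` and the `"<end>"` marker are assumptions for this example, and unlike the method above it also guards against popping from an empty list.

```java
import java.util.Arrays;
import java.util.LinkedList;

// Hypothetical cursor over a list of token strings with an end-of-input marker.
final class LookaheadSketch {
    static final String EPSILON = "<end>"; // stands in for the EPSILON token

    private final LinkedList<String> tokens;
    private String lookahead;

    LookaheadSketch(String... input) {
        tokens = new LinkedList<>(Arrays.asList(input));
        lookahead = tokens.isEmpty() ? EPSILON : tokens.getFirst();
    }

    String lookahead() {
        return lookahead;
    }

    void nextToken() {
        // drop the token we just matched, if there is one
        if (!tokens.isEmpty()) tokens.pop();
        // at the end of input the lookahead becomes the end marker
        lookahead = tokens.isEmpty() ? EPSILON : tokens.getFirst();
    }

    public static void main(String[] args) {
        LookaheadSketch cursor = new LookaheadSketch("3", "+", "4");
        while (!cursor.lookahead().equals(EPSILON)) {
            System.out.println(cursor.lookahead());
            cursor.nextToken();
        }
    }
}
```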

Now for the **expression()** method. The **expression** non-terminal only has one rule so the method becomes quite simple.

    private void expression() {
        // expression -> signed_term sum_op
        signedTerm();
        sumOp();
    }

For each non-terminal on the right hand side of the rule we call the appropriate method.

We continue with writing a method for the **sum_op** non-terminal.

    private void sumOp() {
        if (lookahead.token == Token.PLUSMINUS) {
            // sum_op -> PLUSMINUS term sum_op
            nextToken();
            term();
            sumOp();
        }
        else {
            // sum_op -> EPSILON
        }
    }

If the next symbol is **PLUSMINUS**, that is a plus or a minus, we apply the rule

sum_op -> PLUSMINUS term sum_op

Because the right hand side starts with a terminal that is allowed by the grammar, we can eat it up by calling **nextToken()**. Then we call the methods corresponding to the remaining non-terminals in the rule. If, on the other hand the next token is not **PLUSMINUS** there is no other match for **sum_op** except the **EPSILON** rule. This means that, in this case, we just do nothing and return out of the **sumOp()** method.

We can continue in the same way with **signedTerm()**

    private void signedTerm() {
        if (lookahead.token == Token.PLUSMINUS) {
            // signed_term -> PLUSMINUS term
            nextToken();
            term();
        }
        else {
            // signed_term -> term
            term();
        }
    }

Again, there are two possible rules. If the next token is **PLUSMINUS** we can eat it up and then parse the non-terminal **term**. Otherwise we parse the non-terminal **term** directly.

The next symbols are handled in pretty much the same way.

    private void term() {
        // term -> factor term_op
        factor();
        termOp();
    }

    private void termOp() {
        if (lookahead.token == Token.MULTDIV) {
            // term_op -> MULTDIV signed_factor term_op
            nextToken();
            signedFactor();
            termOp();
        }
        else {
            // term_op -> EPSILON
        }
    }

The previous two methods will handle a multiplication of an arbitrary number of factors. The individual factors, except the first one, can be preceded by a **PLUSMINUS**. This is handled by the following method.

    private void signedFactor() {
        if (lookahead.token == Token.PLUSMINUS) {
            // signed_factor -> PLUSMINUS factor
            nextToken();
            factor();
        }
        else {
            // signed_factor -> factor
            factor();
        }
    }

The following two methods will handle factors which can contain exponentiation of an arbitrary number of terms.

    private void factor() {
        // factor -> argument factor_op
        argument();
        factorOp();
    }

    private void factorOp() {
        if (lookahead.token == Token.RAISED) {
            // factor_op -> RAISED signed_factor
            nextToken();
            signedFactor();
        }
        else {
            // factor_op -> EPSILON
        }
    }

The following method, **argument()**, handles either a fixed value as given by a variable or constant, a function, or a bracketed expression.

    private void argument() {
        if (lookahead.token == Token.FUNCTION) {
            // argument -> FUNCTION argument
            nextToken();
            argument();
        }
        else if (lookahead.token == Token.OPEN_BRACKET) {
            // argument -> OPEN_BRACKET expression CLOSE_BRACKET
            nextToken();
            expression();

            if (lookahead.token != Token.CLOSE_BRACKET)
                throw new ParserException("Closing brackets expected and "
                    + lookahead.sequence + " found instead");

            nextToken();
        }
        else {
            // argument -> value
            value();
        }
    }

When we are parsing a bracketed expression, we first eat up the opening bracket by calling **nextToken()**. Then we call **expression()** to parse the expression inside the brackets. After that is done we expect the next token to be a closing bracket. If that is not the case then we have encountered a syntax error, and we throw an exception describing it.

Finally we have a method for the non-terminal value which can either expand to a **NUMBER** or to a **VARIABLE**.

    private void value() {
        if (lookahead.token == Token.NUMBER) {
            // value -> NUMBER
            nextToken();
        }
        else if (lookahead.token == Token.VARIABLE) {
            // value -> VARIABLE
            nextToken();
        }
        else {
            throw new ParserException(
                "Unexpected symbol " + lookahead.sequence + " found");
        }
    }

This concludes the coding of a recursive descent parser. If you haven’t figured it out by now, the name “recursive descent” comes from the fact that the parser performs a depth first search by recursively calling the same methods. You can spot the recursion in the **argument()** method which is indirectly called by the **expression()** method but also calls the **expression()** method itself.

In the next post in this series we will be constructing an internal representation of the expression.
