So let’s think about what exactly we want to to. Once the expression is parsed and we have an internal representation in the form of an ExpressionNode, we want to iterate through the whole tree and find VariableExpressionNode objects. If the nodes found in this way contain a variable with the name that we want to set, then we will assign the value to it.

The most straightforward way of doing this would be to add a new method to the ExpressionNode interface that each concrete implementation will have to define. This method might have the following signature.

public void setVariable(String name, double value);

The implementation in most of the concrete node types would simply pass the information on to the child nodes by calling their setVariable() method. Only the VariableExpressionNode would compare the name passed in the argument with its own name and set its value accordingly. This is indeed not a bad programming style but it has one disadvantage. Let’s think a bit into the future. There might be other operations that we want to perform on the expression tree. Maybe we want to print out the contents in a particular format or we want to find sub-expressions that match certain criteria. In each case we would have to iterate through the tree and in each case we would have to add a new method to the ExpressionNode interface and all its implementations. As your software grows you might have a multitude of different methods for different tasks defined in this way. Not only is this difficult to maintain because the code for a specific task is always spread over many classes but it also makes is necessary to keep modifying existing code for every new operation on the tree.

Ideally we would like to separate all the code for a specific task into a single class with the ability of adding new functionality without modifying existing code. This can be achieved with the visitor design pattern. In this software design pattern, we create an interface that acts as a visitor to the class hierarchy. It is useful when the class hierarchy can be expected to be relatively stable while the number of operations on the classes can increase over time. The visitor will have one method for each class that is part of the hierarchy.

public interface ExpressionNodeVisitor { public void visit(VariableExpressionNode node); public void visit(ConstantExpressionNode node); public void visit(AdditionExpressionNode node); public void visit(MultiplicationExpressionNode node); public void visit(ExponentiationExpressionNode node); public void visit(FunctionExpressionNode node); }

Next we add a method called accept() the ExpressionNode interface which takes an ExpressionNodeVisitor.

public void accept(ExpressionNodeVisitor visitor);

Now that we have defined the method in the interface we have to define it in every class that implements the interface. For the VariableExpressionNode and the ConstantExpressionNode the implementation is a one-liner

public class VariableExpressionNode implements ExpressionNode { ... public void accept(ExpressionNodeVisitor visitor) { visitor.visit(this); } } public class ConstantExpressionNode implements ExpressionNode { ... public void accept(ExpressionNodeVisitor visitor) { visitor.visit(this); } }

When passing an **ExpressionNodeVisitor** to the **accept()** method, the method will call the appropriate visit method of the visitor class. This is achieved by a combination of overriding and overloading of methods. Don’t be fooled. Even though both of the implementations look the same, they call different **visit()** methods in the **ExpressionNodeVisitor** because the **this** member has a different type in each case.

In order for the visitor to be visited by all the nodes in the tree, the **accept()** methods of the other class implementations will also have to call the **accept()** methods of their children.

public class AdditionExpressionNode extends SequenceExpressionNode { ... public void accept(ExpressionNodeVisitor visitor) { visitor.visit(this); for (Term t: terms) t.expression.accept(visitor); } } public class MultiplicationExpressionNode extends SequenceExpressionNode { ... public void accept(ExpressionNodeVisitor visitor) { visitor.visit(this); for (Term t: terms) t.expression.accept(visitor); } }

The implementations again look identical but have slightly different effects because of the different types of **this**.

Finally the FunctionExpressionNode and the ExponentiationExpressionNode follow a similar pattern

public class FunctionExpressionNode implements ExpressionNode { ... public void accept(ExpressionNodeVisitor visitor) { visitor.visit(this); argument.accept(visitor); } } public class ExponentiationExpressionNode implements ExpressionNode { ... public void accept(ExpressionNodeVisitor visitor) { visitor.visit(this); base.accept(visitor); exponent.accept(visitor); } }

With these few lines of code we have added a very powerful mechanism to or class design. We can now freely create new functionality by implementing the ExpressionNodeVisitor interface.

Let’s get back to the task at hand. We would like to set the value of variables with a given name. To do this we create a new class, SetVariable, that implements ExpressionNodeVisitor.

public class SetVariable implements ExpressionNodeVisitor { private String name; private double value; public SetVariable(String name, double value) { super(); this.name = name; this.value = value; } public void visit(VariableExpressionNode node) { if (node.getName().equals(name)) node.setValue(value); } public void visit(ConstantExpressionNode node) {} public void visit(AdditionExpressionNode node) {} public void visit(MultiplicationExpressionNode node) {} public void visit(ExponentiationExpressionNode node) {} public void visit(FunctionExpressionNode node) {} }

The class contains the name and the value of the variable to be set and these are initialised in the standard way in the constructor. Because we are implementing the ExpressionNodeVisitor we need to define all the corresponding methods. However, as you can see most of these methods are empty and do nothing because we are really only interested in the VariableExpressionNode elements in the tree. Only two lines of code are needed to compare and set the variable. All the iteration through the tree is handled by the accept() methods of the individual classes and in the visitor we can concentrate on writing the code that actually does something.

Using the visitor is also a one-liner. Let’s say we have an ExpressionNode called expression, then the following code will set the value of pi.

expression.accept(new SetVariable("pi", Math.PI));

To test this, let’s update the main method from the last post.

Parser parser = new Parser(); try { ExpressionNode expression = parser.parse("sin(pi/2)"); expression.accept(new SetVariable("pi", Math.PI)); System.out.println("The value of the expression is " + expression.getValue()); } catch (ParserException e) { System.out.println(e.getMessage()); } catch (EvaluationException e) { System.out.println(e.getMessage()); }

As expected, the program prints out

The value of the expression is 1.0

This is concludes our series of the Java expression parser. You can download the complete parser here. It is released under the MIT license, so feel free to use it in your own projects but please keep the copyright notice.

The code reference can be found here.

We have used the parser in our Android app DiffIt! which can calculate derivatives of mathematical expressions.

]]>Remember from the post in which we implemented the parser that every non-terminal symbol in the grammar was given a method inside the **Parser** class. When we defined these methods they were plain void methods that did not take any arguments. This is about to change. We will now modify all these methods to return an **ExpressionNode**.

private ExpressionNode expression() ... private ExpressionNode sumOp(ExpressionNode expr) ... private ExpressionNode signedTerm() ... private ExpressionNode term() ... private ExpressionNode termOp(ExpressionNode expr) ... private ExpressionNode signedFactor() ... private ExpressionNode factor() ... private ExpressionNode factorOp(ExpressionNode expr) ... private ExpressionNode argument() ... private ExpressionNode value() ...

Inside each of these methods we now have to create the expression nodes that correspond to the parsed expression and return the result. Let’s start with the leaf nodes which are parsed inside **value()**.

private ExpressionNode value() { // argument -> NUMBER if (lookahead.token == Token.NUMBER) { ExpressionNode expr = new ConstantExpressionNode(lookahead.sequence); nextToken(); return expr; } // argument -> VARIABLE if (lookahead.token == Token.VARIABLE) { ExpressionNode expr = new VariableExpressionNode(lookahead.sequence); nextToken(); return expr; } if (lookahead.token == Token.EPSILON) throw new ParserException("Unexpected end of input"); else throw new ParserException("Unexpected symbol %s found", lookahead); }

If we encounter a number then we create a new **ConstantExpressionNode**. We can use **lookahead.sequence** to initialise the constant. **lookahead.sequence** contains the string that corresponds to the current token. In case of a number, it will contain the string representation of the numerical value. **ConstantExpressionNode** had a convenience constructor that takes a string, converts it into a number and stores it internally as the value of the constant.

The code added for variables is almost identical except that we create a new **VariableExpressionNode**. Note how, in both cases, we create the node before calling **nextToken()**. We do this because **nextToken()** modifies the lookahead which would mean we loose the information about the current token.

Next up is the argument() method.

private ExpressionNode argument() { // argument -> FUNCTION argument if (lookahead.token == Token.FUNCTION) { int function = FunctionExpressionNode.stringToFunction(lookahead.sequence); nextToken(); ExpressionNode expr = argument(); return new FunctionExpressionNode(function, expr); } // argument -> OPEN_BRACKET sum CLOSE_BRACKET else if (lookahead.token == Token.OPEN_BRACKET) { nextToken(); ExpressionNode expr = expression(); if (lookahead.token != Token.CLOSE_BRACKET) throw new ParserException("Closing brackets expected", lookahead); nextToken(); return expr; } // argument -> value return value(); }

The argument non-terminal does not actually correspond to any type of node in the expression tree. All we need to do here is pass on the nodes we obtain from the sub-expressions. So, depending on the rule, we either return a terminal node via the **value()** method or a more complex sub-tree via the **expression()** method or we create a **FunctionExpressionNode** if we encounter a function.

If we find a function then we have to create a **FunctionExpressionNode**. To do this we first have to convert the string in** lookahead.sequence** to a function identifier. In the last post we defined a static helper method called **stringToFunction()** that can do this for us. This function will throw an exception if the string does not correspond to a supported function.

The factor() method produces argument non-terminal optionally wrapped by **functionOp**.

private ExpressionNode factor() { // factor -> argument factor_op ExpressionNode a = argument(); return factorOp(a); }

The rule for **factor** produces and **argument** and a **factor_op** non-terminal. The **factor_op** non-terminal can include an exponentiation, or it might not. In order for **factor_op** to create the exponentiation it needs to know the base which is parsed by **argument()**. For this reason we decide to pass the node produced by **argument()** to the **factorOp()** method.

private ExpressionNode factorOp(ExpressionNode expression) { // factor_op -> RAISED factor if (lookahead.token == Token.RAISED) { nextToken(); ExpressionNode exponent = signedFactor(); return new ExponentiationExpressionNode(expression, exponent); } // factor_op -> EPSILON return expression; }

If we find a **RAISED** token then **factorOp()** creates an **ExponentiationExpressionNode** with the base given by the expression passed to the the method and the exponent which is parsed from a recursive call to **factor()**. If the **EPSILON** rule is applied then we simply return expression.

We have called the **signedFactor()** method for the exponent which corresponds to the **signed_factor** non-terminal.

private ExpressionNode signedFactor() { // signed_factor -> PLUSMINUS factor if (lookahead.token == Token.PLUSMINUS) { boolean positive = lookahead.sequence.equals("+"); nextToken(); ExpressionNode t = factor(); if (positive) return t; else return new AdditionExpressionNode(t, false); } // signed_factor -> factor return factor(); }

A signed factor can start with a plus or minus sign. In the case of a minus sign, an **AdditionExpressionNode** is generated to represent the negative of the factor. In all other cases the expression from **factor()** is simply passed on.

The term non-terminal again does not correspond to any node in the expression tree. The implementation of **term()**, therefore, just passes on the expression nodes.

private ExpressionNode term() { // term -> factor term_op ExpressionNode f = factor(); return termOp(f); }

Again, **termOp()** might or might not produce a product, depending on which symbols are found next. In order for **termOp()** to be able to create a product including the first factor, we pass the factor to **termOp()**.

private ExpressionNode termOp(ExpressionNode expression) { // term_op -> MULTDIV factor term_op if (lookahead.token == Token.MULTDIV) { MultiplicationExpressionNode prod; if (expression.getType() == ExpressionNode.MULTIPLICATION_NODE) prod = (MultiplicationExpressionNode)expression; else prod = new MultiplicationExpressionNode(expression, true); boolean positive = lookahead.sequence.equals("*"); nextToken(); ExpressionNode f = signedFactor(); prod.add(f, positive); return termOp(prod); } // term_op -> EPSILON return expression; }

The implementation of **termOp()** is slightly more complicated. Remember from the last post that we can add an arbitrary number of factors to a **MultiplicationExpressionNode**. This means, as we continue calling **termOp()** recursively we want to keep adding factors to the multiplication. When we discover a **MULTDIV** token we know that we are dealing with a product or division. We create a new **MultiplicationExpressionNode** only if the node that was passed to us is not already a **MultiplicationExpressionNode**. Then we add the next factor to the multiplication.

The **EPSILON** rule, on the other hand, simply returns the expression that was passed to **termOp()**.

A generalisation of **term** is **signed_term**. This can include a leading plus or minus symbol.

private ExpressionNode signedTerm() { // signed_term -> PLUSMINUS term if (lookahead.token == Token.PLUSMINUS) { boolean positive = lookahead.sequence.equals("+"); nextToken(); ExpressionNode t = term(); if (positive) return t; else return new AdditionExpressionNode(t, false); } // signed_term -> term return term(); }

Only in the case that the leading symbol is a minus, we have to create an **AdditionExpressionNode** and add the term with a negative sign. In all other cases we simply pass on the node obtained from **term()**.

The methods **expression()** and **sumOp()** follow pretty much the same pattern as **term()** and **termOp()**.

private ExpressionNode expression() { // expression -> signed_term sum_op ExpressionNode expr = signedTerm(); return sumOp(expr); }

The method **expression()** does not produce any nodes itself but passes the result of **signedTerm()** to **sumOp()** which can then assemble and **AdditionExpressionNode** if it has to.

private ExpressionNode sumOp(ExpressionNode expr) { // sum_op -> PLUSMINUS term sum_op if (lookahead.token == Token.PLUSMINUS) { AdditionExpressionNode sum; if (expr.getType() == ExpressionNode.ADDITION_NODE) sum = (AdditionExpressionNode)expr; else sum = new AdditionExpressionNode(expr, true); boolean positive = lookahead.sequence.equals("+"); nextToken(); ExpressionNode t = term(); sum.add(t, positive); return sumOp(sum); } // sum_op -> EPSILON return expr; }

As with **termOp()** we continue calling **sumOp()** recursively we want to keep adding terms to the sum. When we discover a **PLUSMINUS** token we know that we are dealing with an addition or subtraction. We create a new **AdditionExpressionNode** only if the node that was passed to us is not already a **AdditionExpressionNode**. Then we add the next term to the sum.

The **EPSILON** rule, on the other hand, simply returns the expression that was passed to **sumOp()**.

Finally we modify the parse() method to return the ExpressionNode created in the process of parsing the input.

public ExpressionNode parse(LinkedList<Token> tokens) { this.tokens = (LinkedList<Token>)tokens.clone(); lookahead = this.tokens.getFirst(); ExpressionNode expr = expression(); if (lookahead.token != Token.EPSILON) throw new ParserException("Unexpected symbol %s found", lookahead); return expr; }

With all the above we are ready to use the parser. In our main() method we can now test the following.

Parser parser = new Parser(); try { ExpressionNode expression = parser.parse("3*2^4 + sqrt(1+3)"); System.out.println("The value of the expression is "+expression.getValue()); } catch (ParserException e) { System.out.println(e.getMessage()); } catch (EvaluationException e) { System.out.println(e.getMessage()); }

This will produce the following output.

The value of the expression is 50.0

I hope you feel the same of exhilaration when seeing this result. It just feels like everything finally came together and the parser works exactly as expected.

Unfortunately we are not quite done. While we are able to parse variables and named constants, we have not yet got a mechanism for setting the values of those constants. This will be the topic of the next and final post in this series. To make thins a bit more interesting, I will introduce the visitor design pattern to achieve this.

]]>The last post in this series showed you how a mathematical expression parser was implemented from the grammar that we designed earlier. This was all fine and good but the parser did not do much except confirm if a mathematical expression was valid or not. In the end there was no way of evaluating that expression because we never stored any of the information that the parser was able to retrieve from the input. We will have to do this by adding side effects in the parser’s methods, but before we do we have to decide how our mathematical expressions should be stored. That is the topic of this post.

The typical way of storing expressions is in a tree structure. Each node in the tree represents either an operation or a value. The outermost operation (the one that is performed last in the evaluation of the expression) is at the root of the tree and the values are stored in the leaf node. The figure on the right shows the tree for the expression **3*2^4 + sqrt(1+3)**. The outermost operation is the **+** which adds **3*2^4** and **sqrt(1+3)**. Evaluating the expression is done from the leaves upward. In our example we can start on the left side and evaluate 2^4 and replace the exponentiation with the result, 16. The resulting tree is shown below.

We continue in the same way with the **3*16** nodes and replace the multiplication with the result **48**. On the right hand side of the tree we can evaluate the addition **1+3** and replace it with the sum **4**. Then we evaluate **sqrt(4)** which results **2**. The sqrt node is replaced and the tree, after all these operations is shown in the figure below. The final result of the expression is obtained from adding the two numbers which then gives **50** as the answer.

To store the tree in some kind of data structure we need to cast it into some sort of object hierarchy. We will create a base interface for the nodes of the tree and a number of classes that implement the interface for different types of nodes, that is one class for additions, one class for multiplications and so on. The interface is called **ExpressionNode**.

public interface ExpressionNode { public static final int VARIABLE_NODE = 1; public static final int CONSTANT_NODE = 2; public static final int ADDITION_NODE = 3; public static final int MULTIPLICATION_NODE = 4; public static final int EXPONENTIATION_NODE = 5; public static final int FUNCTION_NODE = 6; public int getType(); public double getValue(); }

ExpressionNode defines a couple of static integer constants that can be used to identify different node types. The abstract method getType() is intended to return the type of the node, that is every derived class should return one of the above constants. The method getValue() should return the evaluated result of the sub-tree which is rooted in the current node. In the example above, the node corresponding to the square root should return 2.

We now implement this interface and create a class called ConstantExpressionNode for holding numeric constants, that is numbers like 2 or 3.14159.

public class ConstantExpressionNode implements ExpressionNode { private double value; public ConstantExpressionNode(double value) { this.value = value; } public ConstantExpressionNode(String value) { this.value = Double.valueOf(value); } public double getValue() { return value; } public int getType() { return ExpressionNode.CONSTANT_NODE; } }

The method **getType()** returns **ExpressionNode.CONSTANT_NODE** and **getValue()** returns the value stored in a private member of type double. There are two constructors, one will take a double value of the constant and the other will take a which is converted to the double value.

The expression node that holds variables or named constant is called VariableExpressionNode.

public class VariableExpressionNode implements ExpressionNode { private String name; private double value; private boolean valueSet; public VariableExpressionNode(String name) { this.name = name; valueSet = false; } public int getType() { return ExpressionNode.VARIABLE_NODE; } public void setValue(double value) { this.value = value; this.valueSet = true; } public double getValue() { if (valueSet) return value; else throw new EvaluationException("Variable '" + name + "' was not initialized."); } }

**VariableExpressionNode** has an internal parameter, **name**, that stores the name of the variable along with a double, **value**, that stores the value of the variable if it is set. A boolean, **valueSet**, indicates if the value has been set. The **getValue()** method will return the value or throw an exception, depending on the state of **valueSet**.

Next we want to create a class that captures additions. Instead of holding just two terms which are added, we decide that an addition should hold an arbitrary number of terms which are either added or subtracted. We will have a similar requirement for multiplications and we will separate out the commonalities in a class called **SequenceExpressionNode**. First we define an inner class called **Term** which holds a single term of the sum together with a boolean indicating the sign.

public class Term { public boolean positive; public ExpressionNode expression; public Term(boolean positive, ExpressionNode expression) { super(); this.positive = positive; this.expression = expression; } }

This class is used by the **SequenceExpressionNode**. Note how the class is declared abstract because we are not yet implementing the methods of the **ExpressionNode** interface.

public abstract class SequenceExpressionNode implements ExpressionNode { protected LinkedList<Term> terms; public SequenceExpressionNode() { this.terms = new LinkedList<Term>(); } public SequenceExpressionNode(ExpressionNode a, boolean positive) { this.terms = new LinkedList<Term>(); this.terms.add(new Term(positive, a)); } public void add(ExpressionNode a, boolean positive) { this.terms.add(new Term(positive, a)); } }

The linked list **terms** contains all the terms of the sum or product. Two constructors are available, one creates an empty sequence, the other immediately adds the first term into the sequence. The second constructor is a there for convenience so that sequences can be created easily. The add() method allows terms to be added to the sum or product. With this class defined, we can now go ahead and write the **AdditionExpressionNode** class.

public class AdditionExpressionNode extends SequenceExpressionNode { public AdditionExpressionNode() { super(); } public AdditionExpressionNode(ExpressionNode a, boolean positive) { super(a, positive); } public int getType() { return ExpressionNode.ADDITION_NODE; } public double getValue() { double sum = 0.0; for (Term t : terms) { if (t.positive) sum += t.expression.getValue(); else sum -= t.expression.getValue(); } return sum; } }

The constructors and the **getType()** method are implemented as expected. **getValue()** iterates over the terms and adds or subtracts them from the sum, depending of the **positive** flag inside the terms. In order to get the values of the sub-expressions that are added and subtracted, we can recursively call the **getValue()** methods of all the terms.

The **MultiplicationExpressionNode** is very similar.

public class MultiplicationExpressionNode extends SequenceExpressionNode { public MultiplicationExpressionNode() { super(); } public MultiplicationExpressionNode(ExpressionNode a, boolean positive) { super(a, positive); } public int getType() { return ExpressionNode.MULTIPLICATION_NODE; } public double getValue() { double prod = 1.0; for (Term t : terms) { if (t.positive) prod *= t.expression.getValue(); else prod /= t.expression.getValue(); } return prod; } }

The only new thing here is the **getValue()** method. Instead of adding or subtracting terms from a sum that is initially zero we now multiply or divide from a product. The product has to be initialised to 1.0 instead of 0.0 like the sum in the **AdditionExpressionNode**. In contrast to the previous two operations, exponentiation only takes two arguments, the base and the exponent. Accordingly, the ExponentiationExpressionNode looks slightly simpler.

public class ExponentiationExpressionNode implements ExpressionNode { private ExpressionNode base; private ExpressionNode exponent; public ExponentiationExpressionNode(ExpressionNode base, ExpressionNode exponent) { this.base = base; this.exponent = exponent; } public int getType() { return ExpressionNode.EXPONENTIATION_NODE; } public double getValue() { return Math.pow(base.getValue(), exponent.getValue()); } }

The constructor assigns the private base and exponent expressions and **getValue()** uses Math.pow() to calculate the result of the exponentiation. Again, because the base and the exponent are both expression nodes themselves, we call their **getValue()** methods to obtain a value.

The last type of expression node that needs implementing will handle mathematical functions. This is the only class which will be slightly more laborious because we have to use control statements to tell the different functions apart.

public class FunctionExpressionNode implements ExpressionNode { public static final int SIN = 1; public static final int COS = 2; public static final int TAN = 3; public static final int ASIN = 4; public static final int ACOS = 5; public static final int ATAN = 6; public static final int SQRT = 7; public static final int EXP = 8; public static final int LN = 9; public static final int LOG = 10; public static final int LOG2 = 11; private int function; private ExpressionNode argument; public FunctionExpressionNode(int function, ExpressionNode argument) { super(); this.function = function; this.argument = argument; } public int getType() { return ExpressionNode.FUNCTION_NODE; } public double getValue() { switch (function) { case SIN: return Math.sin(argument.getValue()); case COS: return Math.cos(argument.getValue()); case TAN: return Math.tan(argument.getValue()); case ASIN: return Math.asin(argument.getValue()); case ACOS: return Math.acos(argument.getValue()); case ATAN: return Math.atan(argument.getValue()); case SQRT: return Math.sqrt(argument.getValue()); case EXP: return Math.exp(argument.getValue()); case LN: return Math.log(argument.getValue()); case LOG: return Math.log(argument.getValue()) * 0.43429448190325182765; case LOG2: return Math.log(argument.getValue()) * 1.442695040888963407360; } throw new EvaluationException("Invalid function id "+function+"!"); } public static int stringToFunction(String str) { if (str.equals("sin")) return FunctionExpressionNode.SIN; if (str.equals("cos")) return FunctionExpressionNode.COS; if (str.equals("tan")) return FunctionExpressionNode.TAN; if (str.equals("asin")) return FunctionExpressionNode.ASIN; if (str.equals("acos")) return FunctionExpressionNode.ACOS; if (str.equals("atan")) return FunctionExpressionNode.ATAN; if (str.equals("sqrt")) return FunctionExpressionNode.SQRT; if (str.equals("exp")) return FunctionExpressionNode.EXP; if (str.equals("ln")) return FunctionExpressionNode.LN; if (str.equals("log"))return FunctionExpressionNode.LOG; if (str.equals("log2")) return FunctionExpressionNode.LOG2; throw ParseException("Unexpected Function "+str+" found!"); } public static String getAllFunctions() { return "sin|cos|tan|asin|acos|atan|sqrt|exp|ln|log|log2"; } }

We first define a series of integer constants for the different mathematical functions available. The private member **function** will be set to one of these pre-specified values. The expression node **argument** holds the argument of the function. The method **getValue()** decides, in a switch statement, which function to apply to the argument.

Two static methods are provided to help tokenizing and parsing an expression from a String. The method **stringToFunction()** will take a string of a function name and return the corresponding integer function identifier. If the method is passed a string that is not a recognised function then an exception is thrown. This method will be used by the parser to create new **FunctionExpressionNode** objects. The method **getAllFunctions()** returns a string containing all supported functions, separated by a |. This string may be used by the tokenizer to create a regular expression that identifies functions inside a string.

These are all the classes needed to represent mathematical expressions. But for now we have not created the code necessary to create the expression tree when parsing an expression from a string. This means that, for now, if we want to test our code we will have to create the tree by hand. The code for creating the expression **3*2^4 + sqrt(1+3)** is given below.

AdditionExpressionNode innerSum = new AdditionExpressionNode(); innerSum.add(new ConstantExpressionNode(1), true); innerSum.add(new ConstantExpressionNode(3), true); ExpressionNode sqrt = new FunctionExpressionNode( FunctionExpressionNode.SQRT, innerSum); ExpressionNode expo = new ExponentiationExpressionNode( new ConstantExpressionNode(2), new ConstantExpressionNode(4)); MultiplicationExpressionNode prod = new MultiplicationExpressionNode(); prod.add(new ConstantExpressionNode(3), true); prod.add(expo, true); AdditionExpressionNode expression = new AdditionExpressionNode(); expression.add(prod, true); expression.add(sqrt, true); System.out.println("The result is " + expression.getValue());

If we have done everything right then this code will print out

The result is 50.0

In the next contribution to this series I will modify the parser to generate the expression tree while parsing the input.

]]>We first add to our class **Token** that holds the information about the tokens in the input.

public class Token { public static final int EPSILON = 0; public static final int PLUSMINUS = 1; public static final int MULTDIV = 2; public static final int RAISED = 3; public static final int FUNCTION = 4; public static final int OPEN_BRACKET = 5; public static final int CLOSE_BRACKET = 6; public static final int NUMBER = 7; public static final int VARIABLE = 8; public final int token; public final String sequence; public Token(int token, String sequence) { super(); this.token = token; this.sequence = sequence; } }

In short, this class defines a number of static constants for the different types of tokens and a couple of fields that hold the data for the individual token. Note how we made the fields public final. This makes it immediately clear that we are dealing with an immutable object and there is no need for a multitude of getters in the class.

Next we write class called **Parser** which does the actual parsing of the expression. The tokens are stored in a **List** of **Token** objects and one **Token** object is stored as the lookahead.

The parser we are about to write will not do much except throwing an excpetion if the expression is invalid. If the parser runs without an exception being thrown we know that the expression is valid. In one of the future posts we will add to this parser and construct an internal representation of the expression that can be used for calculations.

public class Parser { LinkedList<Token> tokens; Token lookahead;

The main method of the parser is called **parse** and takes the tokens as parameter.

public void parse(List<Token> tokens) { this.tokens = (LinkedList<Token>) tokens.clone(); lookahead = this.tokens.getFirst(); expression(); if (lookahead.token != Token.EPSILON) throw new ParserException("Unexpected symbol %s found", lookahead); }

In the **parse** method we first create a shallow copy of the token list because we will be taking elements out of the list and we don’t want to create side effects on the parameters. Then **lookahead** is assigned the first token in the list. After these initialisations we call a method called **expression()**. The parser will have one method for every non-terminal symbol of the grammar that we designed in the last post. This means the method **expression()** will parse the non-terminal symbol **expression**.

Once the expression has been parsed completely there should be no symbols left in the list. This means that the **lookahead** should be equal to **Token.EPSILON**. If there is still a symbol left in the lookahead it means that there is an error in the input. After parsing the expression we can therefore perform an error check. This takes care of balancing parentheses and making sure the input is a valid expression.

Before we write the **expression()** method we create a small utility method that reads the next token from the list.

private void nextToken() { tokens.pop(); // at the end of input we return an epsilon token if (tokens.isEmpty()) lookahead = new Token(Token.EPSILON, "", -1); else lookahead = tokens.getFirst(); }

This method will be used frequently for reducing the matched terminal symbols. We pop the first token off the list and set the lookahead to the new head of the list. In case the list is empty, we create the special **EPSILON** symbol which shows to the parser that the input has finished.

Now for the **expression()** method. The **expression** non-terminal only has one rule so the method becomes quite simple.

private void expression() { // expression -> signed_term sum_op signedTerm(); sumOp(); }

For each non-terminal on the right hand side of the rule we call the appropriate method.

We continue with writing a method for the **sum_op** non-terminal.

private void sumOp() { if (lookahead.token == Token.PLUSMINUS) { // sum_op -> PLUSMINUS term sum_op nextToken(); term(); sumOp(); } else { // sum_op -> EPSILON } }

If the next symbol is **PLUSMINUS**, that is a plus or a minus, we apply the rule

sum_op -> PLUSMINUS term sum_op

Because the right hand side starts with a terminal that is allowed by the grammar, we can eat it up by calling **nextToken()**. Then we call the methods corresponding to the remaining non-terminals in the rule. If, on the other hand the next token is not **PLUSMINUS** there is no other match for **sum_op** except the **EPSILON** rule. This means that, in this case, we just do nothing and return out of the **sumOp()** method.

We can continue in the same way with **signedTerm()**

private void signedTerm() { if (lookahead.token == Token.PLUSMINUS) { // signed_term -> PLUSMINUS term nextToken(); term(); } else { // signed_term -> term term(); } }

Again, there are two possible rules. If the next token is **PLUSMINUS** we can eat it up and then parse the non-terminal **term**. Otherwise we parse the non-terminal **term** directly.

The next symbols are handled in pretty much the same way.

private void term() { // term -> factor term_op factor(); termOp(); } private void termOp() { if (lookahead.token == Token.MULTDIV) { // term_op -> MULTDIV factor term_op nextToken(); signedFactor(); termOp(); } else { // term_op -> EPSILON } }

The previous two methods will handle a multiplication of an arbitrary number of terms. The individual factors, except the first one, can be preceded by a **PLUSMINUS**. This is handled by the following method.

private void signedFactor() { if (lookahead.token == Token.PLUSMINUS) { // signed_factor -> PLUSMINUS factor nextToken(); factor(); } else { // signed_factor -> factor factor(); } }

The following two methods will handle factors which can contain exponentiation of an arbitrary number of terms.

private void factor() { // factor -> argument factor_op argument(); factorOp(); } private void factorOp() { if (lookahead.token == Token.RAISED) { // factor_op -> RAISED expression nextToken(); signedFactor(); } else { // factor_op -> EPSILON } }

The following method, **argument()**, handles either a fixed value as given by a variable or constant, a function, or a bracketed expression.

private void argument() { if (lookahead.token == Token.FUNCTION) { // argument -> FUNCTION argument nextToken(); argument(); } else if (lookahead.token == Token.OPEN_BRACKET) { // argument -> OPEN_BRACKET sum CLOSE_BRACKET nextToken(); expression(); if (lookahead.token != Token.CLOSE_BRACKET) throw new ParserException("Closing brackets expected and " + lookahead.sequence + " found instead"); nextToken(); } else { // argument -> value value(); } }

When we are parsing a bracketed expression, we first eat up the opening bracket by calling **nextToken()**. Then we call **expression()** to parse the expression inside the brackets. After that is done we are expecting the next token to be a closing bracket. If that is not the case than we have encountered a syntax error. In this case we throw an exception informing about the error.

Finally we have a method for the non-terminal value which can either expand to a **NUMBER** or to a **VARIABLE**.

private void value() { if (lookahead.token == Token.NUMBER) { // argument -> NUMBER nextToken(); } else if (lookahead.token == Token.VARIABLE) { // argument -> VARIABLE nextToken(); } else { throw new ParserException( "Unexpected symbol "+lookahead.sequence+" found"); } }

This concludes the coding of a recursive descent parser. If you haven’t figured it out by now, the name “recursive descent” comes from the fact that the parser performs a depth first search by recursively calling the same methods. You can spot the recursion in the **argument()** method which is indirectly called by the **expression()** method but also calls the **expression()** method itself.

In the next post in this series we will be constructing an internal representation of the expression.

]]>

Before we actually get down to designing the grammar we have to define all the terminal symbols that we expect.

Terminal Meaning PLUSMINUS + or - MULTDIV * or / RAISED ^ FUNCTION any mathematical function: sin, cos, exp, sqrt, ... OPEN_BRACKET ( CLOSE_BRACKET ) NUMBER any number VARIABLE the name of a variable

These are all the symbols that we allow in the user input. We assume that the tokenizer has already split the input into these terminal symbols and the parser now gets a stream of these symbols.

Remember from the last post that the parser works with terminal and non-terminal symbols. The non-terminal symbols can be expanded by the parser’s rules while the terminal symbols cannot be changed. The parsing starts with a single non-terminal. For our expression parser we call that symbol **expression**.

We want our parser to not only decide if the input is a valid expression. We also want it to generate an internal representation of that expression. This means we have to make sure that the mathematical operations are carried out in the right order. In an expression like **3 + 2*5** the multiplication has to be carried out first because the multiplication has a higher precedence than the addition (remember multiplication before addition).

For reasons that we will see later the rules start with the weakest operation. Essentially, in a forthcoming post, we will be creating a tree structure with the lowest precedence operations near the root.

On the top level an expression can be interpreted as a sum of one or more terms. The first term can start with a plus or minus.

expression -> signed_term sum_op sum_op -> PLUSMINUS term sum_op sum_op -> EPSILON

Remember from the last post that the way these rules are designed is necessary to make sure that we always know which rule to apply just by reading the next symbol from the input. The **signed_term** can be a term preceded by **PLUSMINUS** or just a term.

signed_term -> PLUSMINUS term signed_term -> term

Now we continue in the same way with products. Each term is a product of one or more factors.

term -> factor term_op term_op -> MULTDIV signed_factor term_op term_op -> EPSILON

A **signed_factor** is a **factor** that can start with a **PLUSMINUS** or just a factor.

signed_factor -> PLUSMINUS factor signed_factor -> factor

A factor can be either a function or an argument which might be raised to some power.

factor -> argument factor_op factor_op -> RAISED signed_factor factor_op -> EPSILON

The second rule here allows either a simple argument, because **factor_op** can be eliminated using the **EPSILON** rule. It also allows exponentiation via the third rule.

An argument is either just a value or a full expression enclosed in brackets.

argument -> value argument -> FUNCTION argument argument -> OPEN_BRACKET expression CLOSE_BRACKET

The third of these two rules allows us to build expressions of arbitrary complexity because we can keep on putting expressions within expressions as long as they are enclosed in brackets. The second rule allows function expressions to be used as arguments. Note that an expression like **sin x** is possible without any brackets around the **x** because the function argument can simply be a value. But what happens when the expression reads something like **sin 2*x**. This results in an ambiguity should **sin 2*x** be treated as **sin(2*x)** or as **(sin 2)*x**? I have tried typing the expression into Google calculator and it turns out that it makes the distinction between **sin 2x** which it interprets as **sin(2*x)** and **sin 2*x** which it interprets as **sin(2)*x**. WolframAlpha on the other hand will interpret both as **sin(2*x)** but when you add spaces, **sin 2 x** and **sin 2 * x**, will be interpreted as **sin(2)*x**. In our case, using the simple LL(1) grammar and ignoring white spaces, we won’t be able to implement such fine grained behaviour. This means our grammar will always interpret the expression as **sin(2)*x**. Also note that this means that an expression such as **sin x^2**. will be interpreted as **(sin(x))^2**.

Finally, a value can be a **NUMBER** or a **VARIABLE**

value -> NUMBER value -> VARIABLE

This is the complete grammar for mathematical expressions. Why not try out yourself if you can analyse an expression using this grammar?

In the next post in this series I will program a recursive descent parser based on this grammar.

]]>A grammar is made up of a collection of rules. Before we start explaining the rules, we have to talk about what the rules act upon. The tokenizer gave us a sequence of tokens and I said before that each type of token has a unique code. These token codes are called the terminal symbols of the grammar.

In addition to the terminal symbols we also introduce so-called non-terminal symbols. The non-terminal symbols represent somewhat more abstract concepts. For example a sum might be a non-terminal. A sum is made up of a sequence of terms separated by plus or minus signs. A sum does not correspond to an individual token but states how tokens must be arranged in the input. Usually non-terminals are written in lower case while terminal symbols are written in upper case.

Now we come back to the rules of the grammar. Each rule expresses a way to convert a non-terminal symbol into a sequence of terminal and non-terminal symbols. There might be more than one rule converting a specific non-terminal. A grammar also specifies a special non-terminal which is the starting point.

Before we start designing our real grammar form mathematical expressions, let’s look at a simple example.

(1) sum -> NUMBER PLUS NUMBER (2) sum -> NUMBER PLUS sum

In this grammar we start with the non-terminal **sum**. The two rules allow us to express any sum of numbers separated by plus signs. We start with the initial non-terminal

sum

By applying rule (2) we can replace **sum**

(2) -> NUMBER PLUS sum

Now we can apply rule (2) again to the **sum** on the right hand side, and we end up with

(2) -> NUMBER PLUS NUMBER PLUS sum

We finally apply rule (1) to the non-terminal **sum** which is still present in our right hand side. This leads us to

(1) -> NUMBER PLUS NUMBER PLUS NUMBER PLUS NUMBER

Now we have created a sequence that is purely made up of non-terminal symbols and expresses a sum of four numbers. In general, the grammar specified above lets us generate a sum of any length. We just have to keep applying rule (2) again and again, until we have enough **NUMBER**s to add.

One problem remains however. When we are parsing an input from the user, we need to decide which rules to apply. This has to be somehow deduced from the input sequence. We successively apply rules and match them with the input sequence while, at the same time, reading token by token from the input. In the simplest type of grammars, the so-called LL(1) grammars, looking at the next token on the input should be enough to decide which rule to apply.

Let’s take the example above again and assume our input from the tokenizer is

NUMBER PLUS NUMBER PLUS NUMBER

We start with the non-terminal **sum**, and we can immediately see that we have to apply first rule (2) and then rule (1). But with the restrictions of the LL(1) grammar, the parser can only see the next symbol. So, when we start parsing, the parser will only see

NUMBER ...

The parser does not know how many numbers are in the sum and can’t decide whether to use rule (1) or rule (2). In order to help the parser we have to re-design our grammar. We use a new terminal symbol called **EPSILON** which is used when a non-terminal can be removed if it doesn’t match any input.

(1) sum -> NUMBER sum_op (2) sum_op -> PLUS NUMBER sum_op (3) sum_op -> EPSILON

Now, let’s follow how the parser reacts to the input. We start by placing our starting non-terminal on the stack.

Stack: sum Input: NUMBER (PLUS NUMBER PLUS NUMBER)

The input in brackets is the input that is still in the stream but is not visible to the parser. The parser now decides to apply rule (1) based on the next token.

Rules applied: (1) Stack: NUMBER sum_op Input: NUMBER (PLUS NUMBER PLUS NUMBER)

Now the stack starts with the terminal **NUMBER** that matches in the terminal on the input stream. In this case we can discard the matching terminals.

Rules applied: (1) Stack: sum_op Input: PLUS (NUMBER PLUS NUMBER)

Now we see a **PLUS** on the input and the only way to convert **sum_op** into a string that starts with **PLUS** is by applying rule (2)

Rules applied: (1) (2) Stack: PLUS NUMBER sum_op Input: PLUS (NUMBER PLUS NUMBER)

We reduce again by discarding the matching terminal from the beginning of the stack.

Rules applied: (1) (2) Stack: NUMBER sum_op Input: NUMBER (PLUS NUMBER)

Rules applied: (1) (2) Stack: sum_op Input: PLUS (NUMBER)

Note how we can successively discard all matching terminals, the **NUMBER** and the **PLUS**. Again we have to apply rule (2) and reduce

Rules applied: (1) (2) (2) Stack: PLUS NUMBER sum_op Input: PLUS (NUMBER)

Rules applied: (1) (2) (2) Stack: NUMBER sum_op Input: NUMBER

Rules applied: (1) (2) (2) Stack: sum_op Input:

Now the input string is empty. This is where the **EPSILON** rule comes in. The **EPSILON** rule is used when there is no other rule that can be used. The **EPSILON** is a terminal that can be discarded without matching any input.

Rules applied: (1) (2) (2) (3) Stack: Input:

Now both the stack and the input string are empty. This means we are finished with parsing the input and we have a match.

This concludes our little introduction to LL(1) grammars. With the knowledge we have now, we can set about designing a grammar for mathematical expressions. We will do this in the next instalment of this series.

]]>The first step in writing a parser is to tokenize the input string. This means to separate the input string into short bits that represent the basic entities in the expression. We could do this by hand, reading one character at a time and assembling the tokens character by character. This would be tedious, difficult to maintain and hard to extend should we decide to add more features to the tokenizer.

Instead we decide to use the java.util.regex package to do the work for us. This will allow us to implement quite a general tokenizer that can be used for any kind of grammar.

Let’s start by creating a class called **Tokenizer** and defining an internal class **TokenInfo** that holds information about the individual tokens.

import java.util.regex.Pattern; public class Tokenizer { private class TokenInfo { public final Pattern regex; public final int token; public TokenInfo(Pattern regex, int token) { super(); this.regex = regex; this.token = token; } } }

As you can see **TokenInfo** contains two fields. The regular expression that is used to match the input string against the token is stored in the **Pattern regex**. We store **Pattern** objects instead of the regular expression string to improve performance. Regular expressions have to be compiled which takes time. **Pattern** stores the regular expression in compiled form. The code of the token is given by the integer value ‘token’. Each type of token should have its own code.

private LinkedList<TokenInfo> tokenInfos; public Tokenizer() { tokenInfos = new LinkedList<TokenInfo>(); }

We define a linked list, called **tokenInfos**, to hold the information for all the tokens. The linked list is created in the constructor of our **Tokenizer** class. Next we need to add the token information in our list. We do this via the **add()** method.

public void add(String regex, int token) { tokenInfos.add( new TokenInfo( Pattern.compile("^("+regex+")"), token)); }

The user can pass a regular expression string and a token code to the method. The method will then the “^” character to the user supplied regular expression. It causes the regular expression to match only the beginning of a string. This is needed because we will be removing any token always looking for the next token at the beginning of the input string.

We also want to store the information about the tokens we have seen. We need to store the token code and the string that corresponds to the token. The string is needed because the token code does not retain the full information about the input. When we have found a variable we will give the token a special code for variable tokens but we need to also keep the name of the variable so that we can later use it to store or retrieve information. To this end we define another internal class called Token and a linked list of these tokens.

public class Token { public final int token; public final String sequence; public Token(int token, String sequence) { super(); this.token = token; this.sequence = sequence; } }

private LinkedList<Token> tokens;

In our constructor of **Tokenizer** we now also have to initialize the tokens list.

public Tokenizer() { tokenInfos = new LinkedList<TokenInfo>(); tokens = new LinkedList<Token>(); }

We can now add token information in our main method.

public static void main(String[] args) { Tokenizer tokenizer = new Tokenizer(); tokenizer.add("sin|cos|exp|ln|sqrt", 1); // function tokenizer.add("\\(", 2); // open bracket tokenizer.add("\\)", 3); // close bracket tokenizer.add("[+-]", 4); // plus or minus tokenizer.add("[*/]", 5); // mult or divide tokenizer.add("\\^", 6); // raised tokenizer.add("[0-9]+",7); // integer number tokenizer.add("[a-zA-Z][a-zA-Z0-9_]*", 8); // variable }

The code above adds regular expressions that match functions, brackets, mathematical operators, integer numbers and variables. Note how each type of token gets a unique code. Note also how we have to escape special characters in the regular expressions with a double backslash.

We are now ready to tokenize an input string. This is done in the tokenize method. In this method we first define a local string that contains the input string but without any leading or trailing spaces. Also we clear the tokens list from any previous data.

public void tokenize(String str) { String s = new String(str); tokens.clear();

The main loop is carried out until the local input string is empty. When we find a token we will remove it from the front of the string. If we are successful and the whole string could be tokenized then we will eventually end up with an empty string.

while (!s.equals("")) { boolean match = false;

The **match** variable indicates if any of the tokens provided a match with the beginning of the input string. Initially we have no match. We now loop through all the token infos. and create a Matcher for the input string.

for (TokenInfo info : tokenInfos) { Matcher m = info.regex.matcher(s); if (m.find()) { match = true; String tok = m.group().trim(); tokens.add(new Token(info.token, tok)); s = m.replaceFirst(""); break; } }

If the pattern is found in the input string then match is set to true. The **group()** method of the **Matcher** returns the string of the last match. A new **Token** object is appended to the list of tokens. The token object contains the code of the token that resulted in the match and the matched string. In the next step, the matcher is used to remove the token from the beginning of the input string. We do this with the **replaceFirst()** method which replaces the first (and only) match with an empty string. Finally we break out of the loop over the token infos because we don’t need to check any other token types in this round.

After the loop has finished we check if a match was found. If not we throw an exception. We are using a **ParserException** which is extends **RuntimeException** without adding new functionality to it.

if (!match) throw new ParserException( "Unexpected character in input: "+s); } }

This is it! We are done with tokenizing the user input.

If you paid close attention you might have realized that the regular expression for variable tokens also matches any function token. This is not a bug. By storing the token information in a linked list, and always ensuring that we look for matches from the beginning of the list, we are giving precedence to patterns added first. If the input sequence starts with alphabetic characters the function pattern will first be tested and will match any one of the specified function. Only if there is no match the variable pattern will be tested.

To access the result of tokenizing an input string we define the getter for the tokens list.

public LinkedList<Token> getTokens() { return tokens; }

Now we are ready to tokenize the input. In our main method we add the following code:

try { tokenizer.tokenize(" sin(x) * (1 + var_12) "); for (Tokenizer.Token tok : tokenizer.getTokens()) { System.out.println("" + tok.token + " " + tok.sequence); } } catch (ParserException e) { System.out.println(e.getMessage()); }

The output of our program is

1 sin 2 ( 8 x 3 ) 5 * 2 ( 7 1 4 + 8 var_12 3 )

You can download the Java sources for the tokenizer from this link.

In the next instalment of this series we will be turning our attention towards designing a grammar for mathematical expressions.

]]>In this little series we will explain how to code a simple expression parser in Java. Sometimes it is necessary to read user input, such as formula expressions, some simple markup, or even formatting strings. This user input has to be parsed and converted into a form that you program can handle. Maybe you want to construct an object hierarchy or you wan to produce a formatted output string which reflects what the user has typed in. In addition you want you parser to tell the user when there was a problem with the input, such as a syntax error, and produce a meaningful error message.

The process of parsing user input contains a series of steps.

1) The input needs to be split into tokens. A token can be a single character or a series of characters. In a mathematical expression tokens might include the operators such as the plus sign ‘+’ or the minus sign ‘-‘, but also variable names like ‘x’ or function names such as ‘sin’. For an XML parser tokens could be ‘<‘ or a tag name. The task of recognizing tokens is handled by the tokenizer.

2) A grammar needs to be constructed. The grammar specifies a set of rules wich specify what kind of input is expected. The grammar works on the token level. A simple rule for parsing the sum of two variables could be

sum -> var + var

At this level, the grammar does not distinguish between different variables and in the rule above the two ‘var’ tokens could stand for different variables. Creating a grammar is a somewhat abstract process and is usually done on a piece of paper.

3) The grammar is turned into a parser. Different algorithms exist for generating the code of a parser. Depending on the complexity of the grammar this task is often carried out using a parser generator. Parser generators generate code which you can feed tokens into. The generated code will then evaluate the tokens and decide on the grammar rules to apply. In this series we will look at a very simple grammar only. For simple grammars the parser can be coded by hand using the recursive descent algorithm.

4) Actions are taken when specific rules of the grammar are applied. These actions are needed to turn the input string into a data structure that can be used by the program later on. Actions could involve building a tree structure of objects or writing a piece of formatted string into a stream.

In the continuation of this series I will talk about each of these four steps individually and step by step, we will be building a recursive descent parser for analysing mathematical expressions.

]]>