Writing a Parser in Java: Creating the Expression Tree

In the last post I  created classes that can represent a mathematical expression in the form of a tree data structure. We saw how expressions could be stored and evaluated but the code was not linked to the parser yet. The only way up to now to create an expression tree structure is to assemble it by hand. This is tedious and, of course, is not what we want from a parser. In this post I will extend the parser presented in a previous post to include code that will generate an expression tree on the fly.

Remember from the post in which we implemented the parser that every non-terminal symbol in the grammar was given a method inside the Parser class. When we defined these methods they were plain void methods that did not take any arguments. This is about to change. We will now modify all these methods to return an ExpressionNode.

private ExpressionNode expression() ...
private ExpressionNode sumOp(ExpressionNode expr) ...
private ExpressionNode signedTerm() ...
private ExpressionNode term() ...
private ExpressionNode termOp(ExpressionNode expr) ...
private ExpressionNode signedFactor() ...
private ExpressionNode factor() ...
private ExpressionNode factorOp(ExpressionNode expr) ...
private ExpressionNode argument() ...
private ExpressionNode value() ...

Inside each of these methods we now have to create the expression nodes that correspond to the parsed expression and return the result. Let’s start with the leaf nodes which are parsed inside value().

private ExpressionNode value() {
  // argument -> NUMBER
  if (lookahead.token == Token.NUMBER) {
    ExpressionNode expr = new ConstantExpressionNode(lookahead.sequence);
    nextToken();
    return expr;
  }

  // argument -> VARIABLE
  if (lookahead.token == Token.VARIABLE) {
    ExpressionNode expr = new VariableExpressionNode(lookahead.sequence);
    nextToken();
    return expr;
  }

  if (lookahead.token == Token.EPSILON)
    throw new ParserException("Unexpected end of input");
  else
    throw new ParserException("Unexpected symbol %s found", lookahead);
}

If we encounter a number then we create a new ConstantExpressionNode. We can use lookahead.sequence to initialise the constant. lookahead.sequence contains the string that corresponds to the current token. In case of a number, it will contain the string representation of the numerical value. ConstantExpressionNode had a convenience constructor that takes a string, converts it into a number and stores it internally as the value of the constant.

The code added for variables is almost identical except that we create a new VariableExpressionNode. Note how, in both cases, we create the node before calling nextToken(). We do this because nextToken() modifies the lookahead which would mean we loose the information about the current token.

Next up is the argument() method.

private ExpressionNode argument()
{
  // argument -> FUNCTION argument
  if (lookahead.token == Token.FUNCTION)
  {
    int function = FunctionExpressionNode.stringToFunction(lookahead.sequence);
    nextToken();
    ExpressionNode expr = argument();
    return new FunctionExpressionNode(function, expr);
  }
  // argument -> OPEN_BRACKET sum CLOSE_BRACKET
  else if (lookahead.token == Token.OPEN_BRACKET)
  {
    nextToken();
    ExpressionNode expr = expression();
    if (lookahead.token != Token.CLOSE_BRACKET)
    throw new ParserException("Closing brackets expected", lookahead);
    nextToken();
    return expr;
  }

  // argument -> value
  return value();
}

The argument non-terminal does not actually correspond to any type of node in the expression tree. All we need to do here is pass on the nodes we obtain from the sub-expressions. So, depending on the rule, we either return a terminal node via the value() method or a more complex sub-tree via the expression() method or we create a FunctionExpressionNode if we encounter a function.

If we find a function then we have to create a FunctionExpressionNode. To do this we first have to convert the string in lookahead.sequence to a function identifier. In the last post we defined a static helper method called stringToFunction() that can do this for us. This function will throw an exception if the string does not correspond to a supported function.

The factor() method produces argument non-terminal optionally wrapped by functionOp.

private ExpressionNode factor()
{
  // factor -> argument factor_op
  ExpressionNode a = argument();
  return factorOp(a);
}

The rule for factor produces and argument and a factor_op non-terminal. The factor_op non-terminal can include an exponentiation, or it might not. In order for factor_op to create the exponentiation it needs to know the base which is parsed by argument(). For this reason we decide to pass the node produced by argument() to the factorOp() method.

private ExpressionNode factorOp(ExpressionNode expression) {
  // factor_op -> RAISED factor
  if (lookahead.token == Token.RAISED) {
    nextToken();
    ExpressionNode exponent = signedFactor();

    return new ExponentiationExpressionNode(expression, exponent);
  }

  // factor_op -> EPSILON
  return expression;
}

If we find a RAISED token then factorOp() creates an ExponentiationExpressionNode with the base given by the expression passed to the the method and the exponent which is parsed from a recursive call to factor(). If the EPSILON rule is applied then we simply return expression.

We have called the signedFactor() method for the exponent which corresponds to the signed_factor non-terminal.

private ExpressionNode signedFactor()
{
  // signed_factor -> PLUSMINUS factor
  if (lookahead.token == Token.PLUSMINUS)
  {
    boolean positive = lookahead.sequence.equals("+");
    nextToken();
    ExpressionNode t = factor();
    if (positive)
      return t;
    else
      return new AdditionExpressionNode(t, false);
  }

  // signed_factor -> factor
  return factor();
}

A signed factor can start with a plus or minus sign. In the case of a minus sign, an AdditionExpressionNode is generated to represent the negative of the factor. In all other cases the expression from factor() is simply passed on.

The term non-terminal again does not correspond to any node in the expression tree. The implementation of term(), therefore, just passes on the expression nodes.

private ExpressionNode term() {
  // term -> factor term_op
  ExpressionNode f = factor();
  return termOp(f);
}

Again, termOp() might or might not produce a product, depending on which symbols are found next. In order for termOp() to be able to create a product including the first factor, we pass the factor to termOp().

private ExpressionNode termOp(ExpressionNode expression) {
  // term_op -> MULTDIV factor term_op
  if (lookahead.token == Token.MULTDIV) {
    MultiplicationExpressionNode prod;

    if (expression.getType() == ExpressionNode.MULTIPLICATION_NODE)
      prod = (MultiplicationExpressionNode)expression;
    else
      prod = new MultiplicationExpressionNode(expression, true);

    boolean positive = lookahead.sequence.equals("*");
    nextToken();
    ExpressionNode f = signedFactor();
    prod.add(f, positive);

    return termOp(prod);
  }

  // term_op -> EPSILON
  return expression;
}

The implementation of termOp() is slightly more complicated. Remember from the last post that we can add an arbitrary number of factors to a MultiplicationExpressionNode. This means, as we continue calling termOp() recursively we want to keep adding factors to the multiplication. When we discover a MULTDIV token we know that we are dealing with a product or division. We create a new MultiplicationExpressionNode only if the node that was passed to us is not already a MultiplicationExpressionNode. Then we add the next factor to the multiplication.

The EPSILON rule, on the other hand, simply returns the expression that was passed to termOp().

A generalisation of term is signed_term. This can include a leading plus or minus symbol.

private ExpressionNode signedTerm() {
  // signed_term -> PLUSMINUS term
  if (lookahead.token == Token.PLUSMINUS) {
    boolean positive = lookahead.sequence.equals("+");
    nextToken();
    ExpressionNode t = term();
    if (positive)
      return t;
    else
      return new AdditionExpressionNode(t, false);
  }

  // signed_term -> term
  return term();
}

Only in the case that the leading symbol is a minus, we have to create an AdditionExpressionNode and add the term with a negative sign. In all other cases we simply pass on the node obtained from term().

The methods expression() and sumOp() follow pretty much the same pattern as term() and termOp().

private ExpressionNode expression() {
  // expression -> signed_term sum_op
  ExpressionNode expr = signedTerm();
  return sumOp(expr);
}

The method expression() does not produce any nodes itself but passes the result of signedTerm() to sumOp() which can then assemble and AdditionExpressionNode if it has to.

private ExpressionNode sumOp(ExpressionNode expr) {
  // sum_op -> PLUSMINUS term sum_op
  if (lookahead.token == Token.PLUSMINUS) {
    AdditionExpressionNode sum;
    if (expr.getType() == ExpressionNode.ADDITION_NODE)
      sum = (AdditionExpressionNode)expr;
    else
      sum = new AdditionExpressionNode(expr, true);

    boolean positive = lookahead.sequence.equals("+");
    nextToken();
    ExpressionNode t = term();
    sum.add(t, positive);

    return sumOp(sum);
  }

  // sum_op -> EPSILON
  return expr;
}

As with termOp() we continue calling sumOp() recursively we want to keep adding terms to the sum. When we discover a PLUSMINUS token we know that we are dealing with an addition or subtraction. We create a new AdditionExpressionNode only if the node that was passed to us is not already a AdditionExpressionNode. Then we add the next term to the sum.

The EPSILON rule, on the other hand, simply returns the expression that was passed to sumOp().

Finally we modify the parse() method to return the ExpressionNode created in the process of parsing the input.

public ExpressionNode parse(LinkedList<Token> tokens) {
  this.tokens = (LinkedList<Token>)tokens.clone();
  lookahead = this.tokens.getFirst();

  ExpressionNode expr = expression();

  if (lookahead.token != Token.EPSILON)
    throw new ParserException("Unexpected symbol %s found", lookahead);

  return expr;
}

Using the parser

With all the above we are ready to use the parser. In our main() method we can now test the following.

Parser parser = new Parser();
try {
  ExpressionNode expression = parser.parse("3*2^4 + sqrt(1+3)");
  System.out.println("The value of the expression is "+expression.getValue());
}
catch (ParserException e) {
  System.out.println(e.getMessage());
}
catch (EvaluationException e) {
  System.out.println(e.getMessage());
}

This will produce the following output.

The value of the expression is 50.0

I hope you feel the same of exhilaration  when seeing this result. It just feels like everything finally came together and the parser works exactly as expected.

Unfortunately we are not quite done. While we are able to parse variables and named constants, we have not yet got a mechanism for setting the values of those constants. This will be the topic of the next and final post in this series. To make thins a bit more interesting, I will introduce the visitor design pattern to achieve this.

6 Comments

  1. Nico

    I hate to point out little typos, especially in an article as well written and informative as this series, but you do have some inconsistency in naming, even in your code samples, such as the method sum_op() which calls sumOp() recursively. But apart from that, thanks for this amazing amount of information, its been really interesting to follow along (in haxe, for me, but same thing).

    Reply
    1. lzshads

      Nico, try looking the article where the parser is implemented, now there’s suport to raised operation.

      Reply
  2. Nico

    Also, you call: ExpressionNode expression = parser.parse(“3*2^4 + sqrt(1+3)”);

    while parse() takes a list of tokens, and the tokenizer hasnt been updated for the new things such as RAISED, so even passing tokenizer.getTokens will still fail. I’ll check your next article to see if you talk about that, otherwise I’ll just read your source, as it seems some parts are missing from this article

    Reply
    1. Robin

      It’s because when you add the tokens in the tokenizer, the token numbers are not the same as in the Token class.

      Reply
  3. 2017

    St. John Lutheran Church

    Reply
  4. multiply

    A Short History of C. elegans Research | WormClassroom

    Reply

Leave a Comment

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>