Writing a Parser in Java: Implementing the Parser

In the last post about the Java expression parser we designed a grammar for analysing a mathematical expression. That was certainly all a bit abstract and theoretical for many readers. Now we are ready to put some meat on these abstract concepts and implement the parser based on that grammar.

We first add to our class Token that holds the information about the tokens in the input.

public class Token
{
  public static final int EPSILON = 0;
  public static final int PLUSMINUS = 1;
  public static final int MULTDIV = 2;
  public static final int RAISED = 3;
  public static final int FUNCTION = 4;
  public static final int OPEN_BRACKET = 5;
  public static final int CLOSE_BRACKET = 6;
  public static final int NUMBER = 7;
  public static final int VARIABLE = 8;

  public final int token;
  public final String sequence;

  public Token(int token, String sequence)
  {
    super();
    this.token = token;
    this.sequence = sequence;
  }
}

In short, this class defines a number of static constants for the different types of tokens and a couple of fields that hold the data for the individual token. Note how we made the fields public final. This makes it immediately clear that we are dealing with an immutable object and there is no need for a multitude of getters in the class.

Next we write class called Parser which does the actual parsing of the expression. The tokens are stored in a List of Token objects and one Token object is stored as the lookahead.

The parser we are about to write will not do much except throwing an excpetion if the expression is invalid. If the parser runs without an exception being thrown we know that the expression is valid. In one of the future posts we will add to this parser and construct an internal representation of the expression that can be used for calculations.

public class Parser
{
  LinkedList<Token> tokens;
  Token lookahead;

The main method of the parser is called parse and takes the tokens as parameter.

  public void parse(List<Token> tokens)
  {
    this.tokens = (LinkedList<Token>) tokens.clone();
    lookahead = this.tokens.getFirst();

    expression();

    if (lookahead.token != Token.EPSILON)
      throw new ParserException("Unexpected symbol %s found", lookahead);
  }

In the parse method we first create a shallow copy of the token list because we will be taking elements out of the list and we don’t want to create side effects on the parameters. Then lookahead is assigned the first token in the list. After these initialisations we call a method called expression(). The parser will have one method for every non-terminal symbol of the grammar that we designed in the last post. This means the method expression() will parse the non-terminal symbol expression.

Once the expression has been parsed completely there should be no symbols left in the list. This means that the lookahead should be equal to Token.EPSILON. If there is still a symbol left in the lookahead it means that there is an error in the input. After parsing the expression we can therefore perform an error check. This takes care of balancing parentheses and making sure the input is a valid expression.

Before we write the expression() method we create a small utility method that reads the next token from the list.

  private void nextToken()
  {
    tokens.pop();
    // at the end of input we return an epsilon token
    if (tokens.isEmpty())
      lookahead = new Token(Token.EPSILON, "", -1);
    else
      lookahead = tokens.getFirst();
  }

This method will be used frequently for reducing the matched terminal symbols. We pop the first token off the list and set the lookahead to the new head of the list. In case the list is empty, we create the special EPSILON symbol which shows to the parser that the input has finished.

Now for the expression() method. The expression non-terminal only has one rule so the method becomes quite simple.

  private void expression()
  {
    // expression -> signed_term sum_op
    signedTerm();
    sumOp();
  }

For each non-terminal on the right hand side of the rule we call the appropriate method.

We continue with writing a method for the sum_op non-terminal.

  private void sumOp()
  {
    if (lookahead.token == Token.PLUSMINUS)
    {
      // sum_op -> PLUSMINUS term sum_op
      nextToken();
      term();
      sumOp();
    }
    else
    {
      // sum_op -> EPSILON
    }
  }

If the next symbol is PLUSMINUS, that is a plus or a minus, we apply the rule

sum_op -> PLUSMINUS term sum_op

Because the right hand side starts with a terminal that is allowed by the grammar, we can eat it up by calling nextToken(). Then we call the methods corresponding to the remaining non-terminals in the rule. If, on the other hand the next token is not PLUSMINUS there is no other match for sum_op except the EPSILON rule. This means that, in this case, we just do nothing and return out of the sumOp() method.

We can continue in the same way with signedTerm()

  private void signedTerm()
  {
    if (lookahead.token == Token.PLUSMINUS)
    {
      // signed_term -> PLUSMINUS term
      nextToken();
      term();
    }
    else
    {
      // signed_term -> term
      term();
    }
  }

Again, there are two possible rules. If the next token is PLUSMINUS we can eat it up and then parse the non-terminal term. Otherwise we parse the non-terminal term directly.

The next symbols are handled in pretty much the same way.

  private void term()
  {
    // term -> factor term_op
    factor();
    termOp();
  }

  private void termOp()
  {
    if (lookahead.token == Token.MULTDIV)
    {
      // term_op -> MULTDIV factor term_op
      nextToken();
      signedFactor();
      termOp();
    }
    else
    {
      // term_op -> EPSILON
    }
  }

The previous two methods will handle a multiplication of an arbitrary number of terms. The individual factors, except the first one, can be preceded by a PLUSMINUS. This is handled by the following method.

  private void signedFactor()
  {
    if (lookahead.token == Token.PLUSMINUS)
    {
      // signed_factor -> PLUSMINUS factor
      nextToken();
      factor();
    }
    else
    {
      // signed_factor -> factor
      factor();
    }
  }

The following two methods will handle factors which can contain exponentiation of an arbitrary number of terms.

  private void factor()
  {
    // factor -> argument factor_op
    argument();
    factorOp();
  }

  private void factorOp()
  {
    if (lookahead.token == Token.RAISED)
    {
      // factor_op -> RAISED expression
      nextToken();
      signedFactor();
    }
    else
    {
      // factor_op -> EPSILON
    }
  }

The following method, argument(), handles either a fixed value as given by a variable or constant, a function, or a bracketed expression.

  private void argument()
  {
    if (lookahead.token == Token.FUNCTION)
    {
      // argument -> FUNCTION argument
      nextToken();
      argument();
    }
    else if (lookahead.token == Token.OPEN_BRACKET)
    {
      // argument -> OPEN_BRACKET sum CLOSE_BRACKET
      nextToken();
      expression();

      if (lookahead.token != Token.CLOSE_BRACKET)
        throw new ParserException("Closing brackets expected and "
          + lookahead.sequence + " found instead");

      nextToken();
    }
    else
    {
      // argument -> value
      value();
    }
  }

When we are parsing a bracketed expression, we first eat up the opening bracket by calling nextToken(). Then we call expression() to parse the expression inside the brackets. After that is done we are expecting the next token to be a closing bracket. If that is not the case than we have encountered a syntax error. In this case we throw an exception informing about the error.

Finally we have a method for the non-terminal value which can either expand to a NUMBER or to a VARIABLE.

  private void value()
  {
    if (lookahead.token == Token.NUMBER)
    {
      // argument -> NUMBER
      nextToken();
    }
    else if (lookahead.token == Token.VARIABLE)
    {
      // argument -> VARIABLE
      nextToken();
    }
    else
    {
      throw new ParserException(
        "Unexpected symbol "+lookahead.sequence+" found");
    }
  }

This concludes the coding of a recursive descent parser. If you haven’t figured it out by now, the name “recursive descent” comes from the fact that the parser performs a depth first search by recursively calling the same methods. You can spot the recursion in the argument() method which is indirectly called by the expression() method but also calls the expression() method itself.

In the next post in this series we will be constructing an internal representation of the expression.

 

14 Comments

  1. Stereo

    What should I do if I want to implement Boolean expression too? And if I want to use dinamic Typing for the Variables?

    Reply
    1. mikail

      To implement boolean expressions you have to extend the grammar to allow for boolean operators. You should look at the introduction to LL(1) grammars first and then extend the grammar I introduced in this post.

      For dynamic typing of the variables you have to allow for different leaf nodes in the expression tree representing the different basic types (int, float, boolean). The classes representing the operators should be aware of the types and have a type promotion mechanism, eg.
      int + float will produce a float

      Reply
  2. mieden

    How can i print every translation or every step of the parsing?

    Reply
    1. Mikail

      You can either include a System.out.println in the nextToken method or you can print a message every time you enter one of the methods that represents a non-terminal symbol. That should give you enough information on the parsing progress.

      Reply
  3. vinny

    Hello, nice tutorial , it helped me so much, but there is somehting i’d like to know, how can i introduce an error checking ? like a number after + or – , or parenthesys balancing ??
    thanks

    Reply
    1. Mikail

      I have added error checking in the parser. It simply consists of checking that there are no more symbols left once the expression has finished parsing. I have updated the tutorial accordingly and added a shot discussion after the parser() method.

      Reply
  4. Nico

    why is factorOp(a); called, when the method signature is: private void factor_op()? is something special happening here?

    Reply
    1. Mikail

      You are correct, the call should simply be factorOp(); and the method signature should be private void factorOp(). I have corrected the tutorial.

      Reply
  5. Luiz Felix

    I noticed that your constructor of Token has changed from Token(int, string) to Token(int, string, int), also a I exprerienced some problems due access level when creating a Token object for the Epsilon at the expression’s end.
    Trying to solve these things I looked at your GitHub code and noticed that Token was moved to and external file and the int position field was created, BUT none of this was mentioned here in the tutorial.

    Reply
  6. Sybrand Bakker

    I can not get this code
    this.tokens = (LinkedList) tokens.clone();
    to compile.
    It barfs because the clone method of Object is protected.
    I’m using Jdk 1.8.0_45

    Reply
    1. Harry

      My comment below is actually a reply to yours…

      Reply
  7. Harry

    The reason your code won’t compile is that ‘tokens’ in the above article is defined as a List, and a List doesn’t have the clone() method that a LinkedList supports. It appears that the code in the article is buggy, but the download code is better, so you might consider downloading the code and just reading the article.

    Reply
    1. Aly

      How I download the source code?

      Reply
  8. Sean

    So, to build your token list, you read an entire record in the input file and parse it out?

    Reply

Leave a Comment

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>