# Writing a Parser in Java: The Tokenizer

In this short series I am talking about how to write a parser that analyses mathematical expressions and turns them into an object tree that can evaluate the expression.

The first step in writing a parser is to tokenize the input string. This means separating the input string into short pieces that represent the basic entities in the expression. We could do this by hand, reading one character at a time and assembling the tokens character by character. This would be tedious, difficult to maintain and hard to extend should we decide to add more features to the tokenizer.

Instead we decide to use the java.util.regex package to do the work for us. This will allow us to implement quite a general tokenizer that can be used for any kind of grammar.

Let’s start by creating a class called Tokenizer and defining an internal class TokenInfo that holds information about the individual tokens.

import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer
{
    private class TokenInfo {
        public final Pattern regex;
        public final int token;

        public TokenInfo(Pattern regex, int token) {
            super();
            this.regex = regex;
            this.token = token;
        }
    }
}

As you can see TokenInfo contains two fields. The regular expression that is used to match the input string against the token is stored in the Pattern regex. We store Pattern objects instead of the regular expression string to improve performance. Regular expressions have to be compiled which takes time. Pattern stores the regular expression in compiled form. The code of the token is given by the integer value ‘token’. Each type of token should have its own code.
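As a small standalone illustration of this point (the class name and helper method below are only for demonstration and are not part of the Tokenizer), a Pattern can be compiled once and then reused for any number of input strings:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReuse {
    // Compile the regular expression once and keep the Pattern object.
    static final Pattern NUMBER = Pattern.compile("^([0-9]+)");

    // Reuse the compiled Pattern for every input string.
    static String firstNumber(String s) {
        Matcher m = NUMBER.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("42+x")); // prints 42
        System.out.println(firstNumber("foo"));  // prints null
    }
}
```

Compiling inside a loop instead would repeat the same work for every input, which is exactly what storing Pattern objects in TokenInfo avoids.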

private LinkedList<TokenInfo> tokenInfos;

public Tokenizer() {
    tokenInfos = new LinkedList<TokenInfo>();
}

We define a linked list, called tokenInfos, to hold the information for all the tokens. The linked list is created in the constructor of our Tokenizer class. Next we need to add token information to our list. We do this via the add() method.

public void add(String regex, int token) {
    tokenInfos.add(new TokenInfo(
        Pattern.compile("^(" + regex + ")"), token));
}

The user can pass a regular expression string and a token code to the method. The method prepends the “^” character to the user-supplied regular expression and surrounds it with brackets. The “^” causes the regular expression to match only at the beginning of a string. This is needed because, after removing each token, we will always be looking for the next token at the beginning of the remaining input string.
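To see the effect of the “^” anchor, here is a small standalone sketch (the class name and helper method are illustrative, not part of the tutorial code):

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // Returns true if the pattern matches at the very beginning of the input.
    static boolean matchesAtStart(String regex, String input) {
        return Pattern.compile("^(" + regex + ")").matcher(input).find();
    }

    public static void main(String[] args) {
        // Digits at the start of the string: match.
        System.out.println(matchesAtStart("[0-9]+", "12+x")); // true
        // Digits further in: the anchor prevents a match.
        System.out.println(matchesAtStart("[0-9]+", "x+12")); // false
    }
}
```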

We also want to store the information about the tokens we have seen. We need to store the token code and the string that corresponds to the token. The string is needed because the token code does not retain the full information about the input. When we have found a variable we will give the token a special code for variable tokens but we need to also keep the name of the variable so that we can later use it to store or retrieve information. To this end we define another internal class called Token and a linked list of these tokens.

public class Token {
    public final int token;
    public final String sequence;

    public Token(int token, String sequence) {
        super();
        this.token = token;
        this.sequence = sequence;
    }
}

private LinkedList<Token> tokens;

In our constructor of Tokenizer we now also have to initialize the tokens list.

public Tokenizer() {
    tokenInfos = new LinkedList<TokenInfo>();
    tokens = new LinkedList<Token>();
}

We can now add token information in our main method.

public static void main(String[] args) {
    Tokenizer tokenizer = new Tokenizer();
    tokenizer.add("sin|cos|exp|ln|sqrt", 1); // function
    tokenizer.add("\\(", 2); // open bracket
    tokenizer.add("\\)", 3); // close bracket
    tokenizer.add("[+-]", 4); // plus or minus
    tokenizer.add("[*/]", 5); // mult or divide
    tokenizer.add("\\^", 6); // raise to a power
    tokenizer.add("[0-9]+", 7); // integer number
    tokenizer.add("[a-zA-Z][a-zA-Z0-9_]*", 8); // variable
}

The code above adds regular expressions that match functions, brackets, mathematical operators, integer numbers and variables. Note how each type of token gets a unique code. Note also how we have to escape special characters in the regular expressions with a double backslash.

We are now ready to tokenize an input string. This is done in the tokenize() method. In this method we first define a local string that contains the input string without any leading or trailing spaces. We also clear the tokens list of any previous data.

public void tokenize(String str) {
    String s = str.trim();
    tokens.clear();

The main loop is carried out until the local input string is empty. When we find a token we will remove it from the front of the string. If we are successful and the whole string could be tokenized then we will eventually end up with an empty string.

    while (!s.equals("")) {
        boolean match = false;

The match variable indicates whether any of the tokens provided a match at the beginning of the input string. Initially we have no match. We now loop through all the token infos and create a Matcher for the input string.

        for (TokenInfo info : tokenInfos) {
            Matcher m = info.regex.matcher(s);
            if (m.find()) {
                match = true;

                String tok = m.group().trim();
                tokens.add(new Token(info.token, tok));

                // remove the token and any following whitespace
                // from the front of the input string
                s = m.replaceFirst("").trim();
                break;
            }
        }

If the pattern is found in the input string then match is set to true. The group() method of the Matcher returns the string of the last match. A new Token object is appended to the list of tokens; it contains the code of the token that produced the match and the matched string. In the next step, the matcher is used to remove the token from the beginning of the input string. We do this with the replaceFirst() method, which replaces the first (and only) match with an empty string. Finally we break out of the loop over the token infos because we don’t need to check any other token types in this round.
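The matching-and-removal step can be tried in isolation. The helper below is only a sketch of how find() and replaceFirst() interact, not the tutorial’s actual class:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsumeDemo {
    // Remove a matching token from the front of the input, if present.
    static String consume(String regex, String input) {
        Matcher m = Pattern.compile("^(" + regex + ")").matcher(input);
        // replaceFirst() deletes the (only) anchored match; trim() drops
        // any whitespace sitting in front of the next token.
        return m.find() ? m.replaceFirst("").trim() : input;
    }

    public static void main(String[] args) {
        System.out.println(consume("sin|cos", "sin(x) * 2")); // prints (x) * 2
    }
}
```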

After the loop has finished we check whether a match was found. If not, we throw an exception. We are using a ParserException which extends RuntimeException without adding any new functionality.

        if (!match) throw new ParserException(
            "Unexpected character in input: " + s);
    }
}

This is it! We are done with tokenizing the user input.

If you paid close attention you might have realized that the regular expression for variable tokens also matches any function token. This is not a bug. By storing the token information in a linked list, and always looking for matches from the beginning of the list, we give precedence to patterns that were added first. If the input sequence starts with alphabetic characters, the function pattern will be tested first and will match any one of the specified functions. Only if there is no match will the variable pattern be tested.
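This first-match-wins rule can be demonstrated with a minimal sketch. The patterns and codes below mirror the ones used above; the class itself is purely illustrative:

```java
import java.util.regex.Pattern;

public class PrecedenceDemo {
    // Mimics the tokenizer's lookup: patterns are tried in insertion order,
    // and the first one that matches the start of the input wins.
    static int firstMatch(String input) {
        String[] regexes = { "sin|cos|exp", "[a-zA-Z][a-zA-Z0-9_]*" };
        int[] codes = { 1, 8 }; // 1 = function, 8 = variable
        for (int i = 0; i < regexes.length; i++) {
            if (Pattern.compile("^(" + regexes[i] + ")").matcher(input).find())
                return codes[i];
        }
        return -1; // no match at all
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("sin(x)")); // prints 1: function pattern wins
        System.out.println(firstMatch("x12"));    // prints 8: falls through to variable
    }
}
```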

To access the result of tokenizing an input string we define the getter for the tokens list.

public LinkedList<Token> getTokens() {
    return tokens;
}

Now we are ready to tokenize the input. In our main method we add the following code:

try {
    tokenizer.tokenize(" sin(x) * (1 + var_12) ");

    for (Tokenizer.Token tok : tokenizer.getTokens()) {
        System.out.println("" + tok.token + " " + tok.sequence);
    }
}
catch (ParserException e) {
    System.out.println(e.getMessage());
}

The output of our program is

1 sin
2 (
8 x
3 )
5 *
2 (
7 1
4 +
8 var_12
3 )

In the next instalment of this series we will be turning our attention towards designing a grammar for mathematical expressions.

1. roshan

What would be the regex to capture whitespace between words and the newline at end of a line?

1. cogitolearning (Post author)

The regular expression to match a whitespace is \s. If you enclose it in a Java string you have to escape the backslash and it becomes “\\s”. The expression for newlines is a bit more complicated because Unix and DOS/Windows use different conventions. For DOS it is \r\n and for Unix it is \n. If you want to capture either one you can use \r\n|\n or “\\r\\n|\\n” in a Java string.
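A quick sketch of these expressions in Java (the class name is just for illustration):

```java
import java.util.regex.Pattern;

public class WhitespaceDemo {
    public static void main(String[] args) {
        // "\\s" matches a single whitespace character.
        System.out.println(Pattern.matches("\\s", " "));           // true
        // "\\r\\n|\\n" matches a DOS or a Unix line ending.
        System.out.println(Pattern.matches("\\r\\n|\\n", "\r\n")); // true
        System.out.println(Pattern.matches("\\r\\n|\\n", "\n"));   // true
    }
}
```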

2. __init__

For anyone wondering what the ParserException looks like, this is mine.

public class ParserException extends RuntimeException {
    public ParserException(String msg) {
        super(msg);
    }
}

3. __init__

Also, this post is not the same as the src in the .zip. You need to trim stuff. The code here will not work with spaces as used above. Other than that, great tutorial, helped me immensely.

4. Roberto

Awesome tutorial!!

One observation: using “+|-” and “*|/” actually messes up the order of the original expression. Try an expression like “3 - (4 - 9)” with the code on this blog and you will get the first “-” added before the “3”. Same with “/”.

My work around was to add “-” and “/” separately but with the same token code:

1. Mikail

I have finally found the time to look into the problem. The “^” that is added to the regex before compiling into a Pattern only applies to the first option. Thus the regular expression “^+|-” matches a “+” at the beginning of the line or a minus anywhere. The correct expression should be “^(+|-)”. The bug is fixed by surrounding the regex with brackets before compiling.

I have updated the tutorial and the code file.

5. Luricos

Thanks for this tutorial! One thing to note: you should point out that when the pattern is invalid, a PatternSyntaxException will be thrown during tokenizer.add(..).