Before we actually get down to designing the grammar we have to define all the terminal symbols that we expect.

| Terminal | Meaning |
|---|---|
| PLUSMINUS | + or - |
| MULTDIV | * or / |
| RAISED | ^ |
| FUNCTION | any mathematical function: sin, cos, exp, sqrt, ... |
| OPEN_BRACKET | ( |
| CLOSE_BRACKET | ) |
| NUMBER | any number |
| VARIABLE | the name of a variable |

These are all the symbols that we allow in the user input. We assume that the tokenizer has already split the input into these terminal symbols and the parser now gets a stream of these symbols.

Remember from the last post that the parser works with terminal and non-terminal symbols. The non-terminal symbols can be expanded by the parser’s rules while the terminal symbols cannot be changed. The parsing starts with a single non-terminal. For our expression parser we call that symbol **expression**.

We want our parser to not only decide if the input is a valid expression. We also want it to generate an internal representation of that expression. This means we have to make sure that the mathematical operations are carried out in the right order. In an expression like **3 + 2*5** the multiplication has to be carried out first because the multiplication has a higher precedence than the addition (remember multiplication before addition).

For reasons that we will see later, the rules start with the weakest operation. Essentially, in a forthcoming post, we will be creating a tree structure with the lowest-precedence operations near the root.

On the top level an expression can be interpreted as a sum of one or more terms. The first term can start with a plus or minus.

expression -> signed_term sum_op
sum_op -> PLUSMINUS term sum_op
sum_op -> EPSILON

Remember from the last post that the way these rules are designed is necessary to make sure that we always know which rule to apply just by reading the next symbol from the input. The **signed_term** can be a term preceded by **PLUSMINUS** or just a term.

signed_term -> PLUSMINUS term
signed_term -> term

Now we continue in the same way with products. Each term is a product of one or more factors.

term -> factor term_op
term_op -> MULTDIV signed_factor term_op
term_op -> EPSILON

A **signed_factor** is a **factor** that can start with a **PLUSMINUS** or just a factor.

signed_factor -> PLUSMINUS factor
signed_factor -> factor

A factor can be either a function or an argument which might be raised to some power.

factor -> argument factor_op
factor_op -> RAISED signed_factor
factor_op -> EPSILON

The first rule here allows either a simple argument, because **factor_op** can be eliminated using the **EPSILON** rule (the third rule), or exponentiation via the second rule.

An argument is either just a value or a full expression enclosed in brackets.

argument -> value
argument -> FUNCTION argument
argument -> OPEN_BRACKET expression CLOSE_BRACKET

The third of these rules allows us to build expressions of arbitrary complexity, because we can keep putting expressions within expressions as long as they are enclosed in brackets. The second rule allows function expressions to be used as arguments. Note that an expression like **sin x** is possible without any brackets around the **x**, because the function argument can simply be a value. But what happens when the expression reads something like **sin 2*x**? This results in an ambiguity: should **sin 2*x** be treated as **sin(2*x)** or as **(sin 2)*x**? I have tried typing the expression into the Google calculator, and it turns out that it distinguishes between **sin 2x**, which it interprets as **sin(2*x)**, and **sin 2*x**, which it interprets as **sin(2)*x**. WolframAlpha, on the other hand, will interpret both as **sin(2*x)**, but when you add spaces, **sin 2 x** and **sin 2 * x** will be interpreted as **sin(2)*x**. In our case, using a simple LL(1) grammar and ignoring white space, we won't be able to implement such fine-grained behaviour. This means our grammar will always interpret the expression as **sin(2)*x**. Also note that an expression such as **sin x^2** will be interpreted as **(sin(x))^2**.
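The derivation below makes this concrete. Starting from **term** and abbreviating some of the intermediate **argument** and **value** steps, **sin 2*x** is derived as follows:

```
term
-> factor term_op
-> argument factor_op term_op
-> FUNCTION argument factor_op term_op             (matches sin)
-> FUNCTION value factor_op term_op
-> FUNCTION NUMBER factor_op term_op               (matches 2)
-> FUNCTION NUMBER term_op                         (factor_op -> EPSILON)
-> FUNCTION NUMBER MULTDIV signed_factor term_op   (matches *)
-> FUNCTION NUMBER MULTDIV VARIABLE                (matches x, term_op -> EPSILON)
```

The function argument is consumed entirely at the **argument** level, so the **MULTDIV** can only be picked up at the **term** level above it. That is why the grammar always yields the **sin(2)*x** reading.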

Finally, a value can be a **NUMBER** or a **VARIABLE**.

value -> NUMBER
value -> VARIABLE
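As an illustration, here is how the earlier example **3 + 2*5** is derived with these rules (abbreviating the **argument** and **value** steps to keep the derivation short):

```
expression
-> signed_term sum_op
-> term sum_op
-> factor term_op sum_op
-> NUMBER term_op sum_op                    (matches 3)
-> NUMBER sum_op                            (term_op -> EPSILON)
-> NUMBER PLUSMINUS term sum_op             (matches +)
-> NUMBER PLUSMINUS factor term_op sum_op
-> NUMBER PLUSMINUS NUMBER term_op sum_op   (matches 2)
-> NUMBER PLUSMINUS NUMBER MULTDIV signed_factor term_op sum_op   (matches *)
-> NUMBER PLUSMINUS NUMBER MULTDIV NUMBER   (matches 5, remaining symbols -> EPSILON)
```

Note how **2*5** ends up grouped inside a single **term** while the **PLUSMINUS** sits at the **expression** level, which is exactly the grouping required by operator precedence.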

This is the complete grammar for mathematical expressions. Why not try for yourself whether you can analyse an expression using this grammar?
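As a small taste of what the grammar gives us, here is a minimal recursive-descent sketch of just the **expression** and **term** levels. It works directly on a character string instead of a token stream, and it leaves out brackets, functions and exponentiation; the class name **ExprSketch** and the character-based handling are illustrative assumptions, not the parser we will build.

```java
// Minimal recursive-descent sketch of the top two grammar levels only.
// It evaluates integer expressions with +, -, * and /, reading characters
// directly instead of tokens; brackets, functions and '^' are omitted.
public class ExprSketch {
    private final String s; // input without spaces, e.g. "3+2*5"
    private int pos = 0;

    private ExprSketch(String s) { this.s = s; }

    private char peek() { return pos < s.length() ? s.charAt(pos) : '\0'; }

    // expression -> signed_term sum_op
    private int expression() {
        int sign = 1;
        if (peek() == '+' || peek() == '-') {
            if (s.charAt(pos++) == '-') sign = -1;
        }
        int value = sign * term();
        while (peek() == '+' || peek() == '-') { // sum_op
            if (s.charAt(pos++) == '+') value += term();
            else value -= term();
        }
        return value;
    }

    // term -> factor term_op
    private int term() {
        int value = factor();
        while (peek() == '*' || peek() == '/') { // term_op
            if (s.charAt(pos++) == '*') value *= factor();
            else value /= factor();
        }
        return value;
    }

    // factor -> NUMBER (the other factor rules are left out of this sketch)
    private int factor() {
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        return Integer.parseInt(s.substring(start, pos));
    }

    public static int eval(String s) { return new ExprSketch(s).expression(); }
}
```

Because **term()** consumes all its multiplications before returning to **expression()**, `ExprSketch.eval("3+2*5")` evaluates the product first, as the grammar demands.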

In the next post in this series I will program a recursive descent parser based on this grammar.

A grammar is made up of a collection of rules. Before we start explaining the rules, we have to talk about what the rules act upon. The tokenizer gave us a sequence of tokens and I said before that each type of token has a unique code. These token codes are called the terminal symbols of the grammar.

In addition to the terminal symbols we also introduce so-called non-terminal symbols. The non-terminal symbols represent somewhat more abstract concepts. For example a sum might be a non-terminal. A sum is made up of a sequence of terms separated by plus or minus signs. A sum does not correspond to an individual token but states how tokens must be arranged in the input. Usually non-terminals are written in lower case while terminal symbols are written in upper case.

Now we come back to the rules of the grammar. Each rule expresses a way to convert a non-terminal symbol into a sequence of terminal and non-terminal symbols. There might be more than one rule converting a specific non-terminal. A grammar also specifies a special non-terminal which is the starting point.

Before we start designing our real grammar for mathematical expressions, let’s look at a simple example.

(1) sum -> NUMBER PLUS NUMBER
(2) sum -> NUMBER PLUS sum

In this grammar we start with the non-terminal **sum**. The two rules allow us to express any sum of numbers separated by plus signs. We start with the initial non-terminal

sum

By applying rule (2) we can replace **sum**

(2) -> NUMBER PLUS sum

Now we can apply rule (2) again to the **sum** on the right hand side, and we end up with

(2) -> NUMBER PLUS NUMBER PLUS sum

We finally apply rule (1) to the non-terminal **sum** which is still present in our right hand side. This leads us to

(1) -> NUMBER PLUS NUMBER PLUS NUMBER PLUS NUMBER

Now we have created a sequence that is purely made up of terminal symbols and expresses a sum of four numbers. In general, the grammar specified above lets us generate a sum of any length. We just have to keep applying rule (2) again and again, until we have enough **NUMBER**s to add.

One problem remains however. When we are parsing an input from the user, we need to decide which rules to apply. This has to be somehow deduced from the input sequence. We successively apply rules and match them with the input sequence while, at the same time, reading token by token from the input. In the simplest type of grammars, the so-called LL(1) grammars, looking at the next token on the input should be enough to decide which rule to apply.

Let’s take the example above again and assume our input from the tokenizer is

NUMBER PLUS NUMBER PLUS NUMBER

We start with the non-terminal **sum**, and we can immediately see that we have to apply first rule (2) and then rule (1). But with the restrictions of the LL(1) grammar, the parser can only see the next symbol. So, when we start parsing, the parser will only see

NUMBER ...

The parser does not know how many numbers are in the sum and can’t decide whether to use rule (1) or rule (2). In order to help the parser we have to re-design our grammar. We introduce a new terminal symbol called **EPSILON**, which is used when a non-terminal can be removed without matching any input.

(1) sum -> NUMBER sum_op
(2) sum_op -> PLUS NUMBER sum_op
(3) sum_op -> EPSILON

Now, let’s follow how the parser reacts to the input. We start by placing our starting non-terminal on the stack.

Stack: sum
Input: NUMBER (PLUS NUMBER PLUS NUMBER)

The input in brackets is the input that is still in the stream but is not visible to the parser. The parser now decides to apply rule (1) based on the next token.

Rules applied: (1)
Stack: NUMBER sum_op
Input: NUMBER (PLUS NUMBER PLUS NUMBER)

Now the stack starts with the terminal **NUMBER**, which matches the terminal on the input stream. In this case we can discard the matching terminals.

Rules applied: (1)
Stack: sum_op
Input: PLUS (NUMBER PLUS NUMBER)

Now we see a **PLUS** on the input and the only way to convert **sum_op** into a string that starts with **PLUS** is by applying rule (2)

Rules applied: (1) (2)
Stack: PLUS NUMBER sum_op
Input: PLUS (NUMBER PLUS NUMBER)

We reduce again by discarding the matching terminal from the beginning of the stack.

Rules applied: (1) (2)
Stack: NUMBER sum_op
Input: NUMBER (PLUS NUMBER)

Rules applied: (1) (2)
Stack: sum_op
Input: PLUS (NUMBER)

Note how we can successively discard all matching terminals, the **NUMBER** and the **PLUS**. Again we have to apply rule (2) and reduce

Rules applied: (1) (2) (2)
Stack: PLUS NUMBER sum_op
Input: PLUS (NUMBER)

Rules applied: (1) (2) (2)
Stack: NUMBER sum_op
Input: NUMBER

Rules applied: (1) (2) (2)
Stack: sum_op
Input:

Now the input string is empty. This is where the **EPSILON** rule comes in. The **EPSILON** rule is used when no other rule can be applied. **EPSILON** is a terminal that can be discarded without matching any input.

Rules applied: (1) (2) (2) (3)
Stack:
Input:

Now both the stack and the input string are empty. This means we are finished with parsing the input and we have a match.
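The hand trace above can be condensed into a few lines of code. The following is a minimal sketch of the stack-based procedure for this little sum grammar; the class name **SumParser** and the string-valued symbols are illustrative assumptions, not part of the parser we will build later.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class SumParser {
    // Symbol names are illustrative; a real parser would use token codes.
    static final String NUMBER = "NUMBER", PLUS = "PLUS",
                        SUM = "sum", SUM_OP = "sum_op";

    // Returns true if the input matches the grammar
    // (1) sum -> NUMBER sum_op
    // (2) sum_op -> PLUS NUMBER sum_op
    // (3) sum_op -> EPSILON
    public static boolean parse(List<String> input) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push(SUM);
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String next = pos < input.size() ? input.get(pos) : null;
            if (top.equals(SUM)) {
                // rule (1): push the right-hand side in reverse order
                stack.push(SUM_OP);
                stack.push(NUMBER);
            } else if (top.equals(SUM_OP)) {
                if (PLUS.equals(next)) {
                    // rule (2), chosen by looking at the next token only
                    stack.push(SUM_OP);
                    stack.push(NUMBER);
                    stack.push(PLUS);
                }
                // otherwise rule (3): EPSILON, discard sum_op
            } else if (top.equals(next)) {
                pos++; // matching terminals are discarded
            } else {
                return false; // terminal mismatch
            }
        }
        return pos == input.size(); // all input must be consumed
    }
}
```

Running `SumParser.parse` on the sequence NUMBER PLUS NUMBER PLUS NUMBER applies rules (1), (2), (2) and (3), exactly as in the trace.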

This concludes our little introduction to LL(1) grammars. With the knowledge we have now, we can set about designing a grammar for mathematical expressions. We will do this in the next instalment of this series.

The first step in writing a parser is to tokenize the input string. This means to separate the input string into short bits that represent the basic entities in the expression. We could do this by hand, reading one character at a time and assembling the tokens character by character. This would be tedious, difficult to maintain and hard to extend should we decide to add more features to the tokenizer.

Instead we decide to use the java.util.regex package to do the work for us. This will allow us to implement quite a general tokenizer that can be used for any kind of grammar.

Let’s start by creating a class called **Tokenizer** and defining an internal class **TokenInfo** that holds information about the individual tokens.

import java.util.regex.Pattern;

public class Tokenizer {
  private class TokenInfo {
    public final Pattern regex;
    public final int token;

    public TokenInfo(Pattern regex, int token) {
      super();
      this.regex = regex;
      this.token = token;
    }
  }
}

As you can see **TokenInfo** contains two fields. The regular expression that is used to match the input string against the token is stored in the **Pattern regex**. We store **Pattern** objects instead of regular expression strings to improve performance: regular expressions have to be compiled, which takes time, and a **Pattern** stores the regular expression in compiled form. The code of the token is given by the integer value **token**. Each type of token should have its own code.

private LinkedList<TokenInfo> tokenInfos;

public Tokenizer() {
  tokenInfos = new LinkedList<TokenInfo>();
}

We define a linked list, called **tokenInfos**, to hold the information for all the tokens. The linked list is created in the constructor of our **Tokenizer** class. Next we need to add the token information in our list. We do this via the **add()** method.

public void add(String regex, int token) {
  // "^" anchors the match at the start of the string; "\\s*" additionally
  // skips any whitespace in front of the token itself.
  tokenInfos.add(new TokenInfo(Pattern.compile("^\\s*(" + regex + ")"), token));
}

The user can pass a regular expression string and a token code to the method. The method then prepends the “^” character to the user-supplied regular expression. It causes the regular expression to match only the beginning of a string. This is needed because we always remove the matched token and look for the next token at the beginning of the remaining input string.

We also want to store the information about the tokens we have seen. We need to store the token code and the string that corresponds to the token. The string is needed because the token code does not retain the full information about the input. When we have found a variable we will give the token a special code for variable tokens but we need to also keep the name of the variable so that we can later use it to store or retrieve information. To this end we define another internal class called Token and a linked list of these tokens.

public class Token {
  public final int token;
  public final String sequence;

  public Token(int token, String sequence) {
    super();
    this.token = token;
    this.sequence = sequence;
  }
}

private LinkedList<Token> tokens;

In our constructor of **Tokenizer** we now also have to initialize the tokens list.

public Tokenizer() {
  tokenInfos = new LinkedList<TokenInfo>();
  tokens = new LinkedList<Token>();
}

We can now add token information in our main method.

public static void main(String[] args) {
  Tokenizer tokenizer = new Tokenizer();
  tokenizer.add("sin|cos|exp|ln|sqrt", 1); // function
  tokenizer.add("\\(", 2);                 // open bracket
  tokenizer.add("\\)", 3);                 // close bracket
  tokenizer.add("[+-]", 4);                // plus or minus
  tokenizer.add("[*/]", 5);                // mult or divide
  tokenizer.add("\\^", 6);                 // raised
  tokenizer.add("[0-9]+", 7);              // integer number
  tokenizer.add("[a-zA-Z][a-zA-Z0-9_]*", 8); // variable
}

The code above adds regular expressions that match functions, brackets, mathematical operators, integer numbers and variables. Note how each type of token gets a unique code. Note also how we have to escape special characters in the regular expressions with a double backslash.

We are now ready to tokenize an input string. This is done in the tokenize method. In this method we first define a local string that contains the input string but without any leading or trailing spaces. Also we clear the tokens list from any previous data.

public void tokenize(String str) {
  String s = str.trim();
  tokens.clear();

The main loop is carried out until the local input string is empty. When we find a token we will remove it from the front of the string. If we are successful and the whole string could be tokenized then we will eventually end up with an empty string.

  while (!s.equals("")) {
    boolean match = false;

The **match** variable indicates whether any of the tokens provided a match at the beginning of the input string. Initially we have no match. We now loop through all the token infos and create a Matcher for the input string.

    for (TokenInfo info : tokenInfos) {
      Matcher m = info.regex.matcher(s);
      if (m.find()) {
        match = true;
        String tok = m.group().trim();
        tokens.add(new Token(info.token, tok));
        s = m.replaceFirst("");
        break;
      }
    }

If the pattern is found in the input string then match is set to true. The **group()** method of the **Matcher** returns the string of the last match. A new **Token** object is appended to the list of tokens. The token object contains the code of the token that resulted in the match and the matched string. In the next step, the matcher is used to remove the token from the beginning of the input string. We do this with the **replaceFirst()** method which replaces the first (and only) match with an empty string. Finally we break out of the loop over the token infos because we don’t need to check any other token types in this round.

After the loop has finished we check if a match was found. If not, we throw an exception. We are using a **ParserException**, which extends **RuntimeException** without adding any new functionality.

    if (!match)
      throw new ParserException("Unexpected character in input: " + s);
  }
}

This is it! We are done with tokenizing the user input.

If you paid close attention you might have realized that the regular expression for variable tokens also matches any function token. This is not a bug. By storing the token information in a linked list, and always looking for matches from the beginning of the list, we give precedence to the patterns added first. If the input sequence starts with alphabetic characters, the function pattern is tested first and will match any one of the specified functions. Only if there is no match will the variable pattern be tested.

To access the result of tokenizing an input string we define the getter for the tokens list.

public LinkedList<Token> getTokens() {
  return tokens;
}

Now we are ready to tokenize the input. In our main method we add the following code:

try {
  tokenizer.tokenize(" sin(x) * (1 + var_12) ");
  for (Tokenizer.Token tok : tokenizer.getTokens()) {
    System.out.println("" + tok.token + " " + tok.sequence);
  }
} catch (ParserException e) {
  System.out.println(e.getMessage());
}

The output of our program is

1 sin
2 (
8 x
3 )
5 *
2 (
7 1
4 +
8 var_12
3 )

You can download the Java sources for the tokenizer from this link.

In the next instalment of this series we will be turning our attention towards designing a grammar for mathematical expressions.

class LastFixingQuote : public Quote, public Observer

The class inherits from **Quote** but also from **Observer**. LastFixingQuote has to react to changes in the index value. It does this by observing the Index objects that it wraps.

LastFixingQuote(const boost::shared_ptr<Index>& index);

The constructor takes a shared pointer to an Index object. Once the index is set it cannot be changed again.

Real value() const; bool isValid() const;

The **value()** method and the **isValid()** method are inherited from **Quote** and are implemented here. **value()** returns the latest available fixing of the index. **isValid()** returns true if the index contains any fixings and false otherwise.

Date referenceDate() const;

The reference date is the last date in the time series contained in the index. Thus it is the date of the latest available fixing.

void update();

The **update()** method implements the pure virtual method inherited from **Observer**. It takes no action other than calling **notifyObservers()** of the **Observable**. This makes sure that any observers of the quote are notified when the underlying index changes.

const boost::shared_ptr<Index>& index() const;

Access to the index is granted through the index() method.
