# Writing a Parser in Java: A Grammar for Mathematical Expressions

Previously I introduced the concept on the LL(1) grammar and explained how a parser would go about parsing an input. In this post I want to design a grammar that is capable of parsing mathematical expressions.

Before we actually get down to designing the grammar we have to define all the terminal symbols that we expect.

Terminal        Meaning

PLUSMINUS       + or -

MULTDIV         * or /

RAISED          ^

FUNCTION        any mathematical function:
sin, cos, exp, sqrt, ...

OPEN_BRACKET    (

CLOSE_BRACKET   )

NUMBER          any number

VARIABLE        the name of a variable

These are all the symbols that we allow in the user input. We assume that the tokenizer has already split the input into these terminal symbols and the parser now gets a stream of these symbols.

Remember from the last post that the parser works with terminal and non-terminal symbols. The non-terminal symbols can be expanded by the parser’s rules while the terminal symbols cannot be changed. The parsing starts with a single non-terminal. For our expression parser we call that symbol expression.

We want our parser to not only decide if the input is a valid expression. We also want it to generate an internal representation of that expression. This means we have to make sure that the mathematical operations are carried out in the right order. In an expression like 3 + 2*5 the multiplication has to be carried out first because the multiplication has a higher precedence than the addition (remember multiplication before addition).

For reasons that we will see later the rules start with the weakest operation. Essentially, in a forthcoming post, we will be creating a tree structure with the lowest precedence operations near the root.

On the top level an expression can be interpreted as a sum of one or more terms. The first term can start with a plus or minus.

expression  -> signed_term sum_op
sum_op      -> PLUSMINUS term sum_op
sum_op      -> EPSILON

Remember from the last post that the way these rules are designed is necessary to make sure that we always know which rule to apply just by reading the next symbol from the input. The signed_term can be a term preceded by PLUSMINUS or just a term.

signed_term -> PLUSMINUS term
signed_term -> term

Now we continue in the same way with products. Each term is a product of one or more factors.

term        -> factor term_op
term_op     -> MULTDIV signed_factor term_op
term_op     -> EPSILON

A signed_factor is a factor that can start with a PLUSMINUS or just a factor.

signed_factor -> PLUSMINUS factor
signed_factor -> factor

A factor can be either a function or an argument which might be raised to some power.

factor      -> argument factor_op
factor_op   -> RAISED signed_factor
factor_op   -> EPSILON

The second rule here allows either a simple argument, because factor_op can be eliminated using the EPSILON rule. It also allows exponentiation via the third rule.

An argument is either just a value or a full expression enclosed in brackets.

argument    -> value
argument    -> FUNCTION argument
argument    -> OPEN_BRACKET expression CLOSE_BRACKET

The third of these two rules allows us to build expressions of arbitrary complexity because we can keep on putting expressions within expressions as long as they are enclosed in brackets. The second rule allows function expressions to be used as arguments. Note that an expression like sin x is possible without any brackets around the x because the function argument can simply be a value. But what happens when the expression reads something like sin 2*x. This results in an ambiguity should sin 2*x be treated as sin(2*x) or as (sin 2)*x? I have tried typing the expression into Google calculator and it turns out that it makes the distinction between sin 2x which it interprets as sin(2*x) and sin 2*x which it interprets as sin(2)*x. WolframAlpha on the other hand will interpret both as sin(2*x) but when you add spaces, sin 2 x and sin 2 * x, will be interpreted as sin(2)*x. In our case, using the simple LL(1) grammar and ignoring white spaces, we won’t be able to implement such fine grained behaviour. This means our grammar will always interpret the expression as sin(2)*x. Also note that this means that an expression such as sin x^2. will be interpreted as (sin(x))^2.

Finally, a value can be a NUMBER or a VARIABLE

value       -> NUMBER
value       -> VARIABLE

This is the complete grammar for mathematical expressions. Why not try out yourself if you can analyse an expression using this grammar?

In the next post in this series I will program a recursive descent parser based on this grammar.

1. Markus

Where did you define EPSILON?

1. Tim

I’m new to the parsing scene, but in what I’ve read so far EPSILON is a generally recognized symbol for an empty string, and is usually used to depict the end of a recursively defined chaining of rules.

term -> factor term_op
term_op -> MULTDIV signed_factor term_op
term_op -> EPSILON

Notice how here the term_op definition ends with term_op. This is to facilitate multiple “MULTDIV signed_factor” groups in a row. The EPSILON then, is to indicate an end to that chain. I’ve also seen it written as :
term_op -> MULTDIV signed_factor term_op | EPSILON