Previously I introduced the concept on the LL(1) grammar and explained how a parser would go about parsing an input. In this post I want to design a grammar that is capable of parsing mathematical expressions.

Before we actually get down to designing the grammar we have to define all the terminal symbols that we expect.

Terminal Meaning PLUSMINUS + or - MULTDIV * or / RAISED ^ FUNCTION any mathematical function: sin, cos, exp, sqrt, ... OPEN_BRACKET ( CLOSE_BRACKET ) NUMBER any number VARIABLE the name of a variable

These are all the symbols that we allow in the user input. We assume that the tokenizer has already split the input into these terminal symbols and the parser now gets a stream of these symbols.

Remember from the last post that the parser works with terminal and non-terminal symbols. The non-terminal symbols can be expanded by the parser’s rules while the terminal symbols cannot be changed. The parsing starts with a single non-terminal. For our expression parser we call that symbol **expression**.

We want our parser to not only decide if the input is a valid expression. We also want it to generate an internal representation of that expression. This means we have to make sure that the mathematical operations are carried out in the right order. In an expression like **3 + 2*5** the multiplication has to be carried out first because the multiplication has a higher precedence than the addition (remember multiplication before addition).

For reasons that we will see later the rules start with the weakest operation. Essentially, in a forthcoming post, we will be creating a tree structure with the lowest precedence operations near the root.

On the top level an expression can be interpreted as a sum of one or more terms. The first term can start with a plus or minus.

expression -> signed_term sum_op sum_op -> PLUSMINUS term sum_op sum_op -> EPSILON

Remember from the last post that the way these rules are designed is necessary to make sure that we always know which rule to apply just by reading the next symbol from the input. The **signed_term** can be a term preceded by **PLUSMINUS** or just a term.

signed_term -> PLUSMINUS term signed_term -> term

Now we continue in the same way with products. Each term is a product of one or more factors.

term -> factor term_op term_op -> MULTDIV signed_factor term_op term_op -> EPSILON

A **signed_factor** is a **factor** that can start with a **PLUSMINUS** or just a factor.

signed_factor -> PLUSMINUS factor signed_factor -> factor

A factor can be either a function or an argument which might be raised to some power.

factor -> argument factor_op factor_op -> RAISED signed_factor factor_op -> EPSILON

The second rule here allows either a simple argument, because **factor_op** can be eliminated using the **EPSILON** rule. It also allows exponentiation via the third rule.

An argument is either just a value or a full expression enclosed in brackets.

argument -> value argument -> FUNCTION argument argument -> OPEN_BRACKET expression CLOSE_BRACKET

The third of these two rules allows us to build expressions of arbitrary complexity because we can keep on putting expressions within expressions as long as they are enclosed in brackets. The second rule allows function expressions to be used as arguments. Note that an expression like **sin x** is possible without any brackets around the **x** because the function argument can simply be a value. But what happens when the expression reads something like **sin 2*x**. This results in an ambiguity should **sin 2*x** be treated as **sin(2*x)** or as **(sin 2)*x**? I have tried typing the expression into Google calculator and it turns out that it makes the distinction between **sin 2x** which it interprets as **sin(2*x)** and **sin 2*x** which it interprets as **sin(2)*x**. WolframAlpha on the other hand will interpret both as **sin(2*x)** but when you add spaces, **sin 2 x** and **sin 2 * x**, will be interpreted as **sin(2)*x**. In our case, using the simple LL(1) grammar and ignoring white spaces, we won’t be able to implement such fine grained behaviour. This means our grammar will always interpret the expression as **sin(2)*x**. Also note that this means that an expression such as **sin x^2**. will be interpreted as **(sin(x))^2**.

Finally, a value can be a **NUMBER** or a **VARIABLE**

value -> NUMBER value -> VARIABLE

This is the complete grammar for mathematical expressions. Why not try out yourself if you can analyse an expression using this grammar?

In the next post in this series I will program a recursive descent parser based on this grammar.

MarkusWhere did you define EPSILON?

TimI’m new to the parsing scene, but in what I’ve read so far EPSILON is a generally recognized symbol for an empty string, and is usually used to depict the end of a recursively defined chaining of rules.

term -> factor term_op

term_op -> MULTDIV signed_factor term_op

term_op -> EPSILON

Notice how here the term_op definition ends with term_op. This is to facilitate multiple “MULTDIV signed_factor” groups in a row. The EPSILON then, is to indicate an end to that chain. I’ve also seen it written as :

term_op -> MULTDIV signed_factor term_op | EPSILON