Previously I introduced the concept on the LL(1) grammar and explained how a parser would go about parsing an input. In this post I want to design a grammar that is capable of parsing mathematical expressions.
Before we actually get down to designing the grammar we have to define all the terminal symbols that we expect.
Terminal Meaning PLUSMINUS + or - MULTDIV * or / RAISED ^ FUNCTION any mathematical function: sin, cos, exp, sqrt, ... OPEN_BRACKET ( CLOSE_BRACKET ) NUMBER any number VARIABLE the name of a variable
These are all the symbols that we allow in the user input. We assume that the tokenizer has already split the input into these terminal symbols and the parser now gets a stream of these symbols.
Remember from the last post that the parser works with terminal and non-terminal symbols. The non-terminal symbols can be expanded by the parser’s rules while the terminal symbols cannot be changed. The parsing starts with a single non-terminal. For our expression parser we call that symbol expression.
We want our parser to not only decide if the input is a valid expression. We also want it to generate an internal representation of that expression. This means we have to make sure that the mathematical operations are carried out in the right order. In an expression like 3 + 2*5 the multiplication has to be carried out first because the multiplication has a higher precedence than the addition (remember multiplication before addition).
For reasons that we will see later the rules start with the weakest operation. Essentially, in a forthcoming post, we will be creating a tree structure with the lowest precedence operations near the root.
On the top level an expression can be interpreted as a sum of one or more terms. The first term can start with a plus or minus.
expression -> signed_term sum_op sum_op -> PLUSMINUS term sum_op sum_op -> EPSILON
Remember from the last post that the way these rules are designed is necessary to make sure that we always know which rule to apply just by reading the next symbol from the input. The signed_term can be a term preceded by PLUSMINUS or just a term.
signed_term -> PLUSMINUS term signed_term -> term
Now we continue in the same way with products. Each term is a product of one or more factors.
term -> factor term_op term_op -> MULTDIV signed_factor term_op term_op -> EPSILON
A signed_factor is a factor that can start with a PLUSMINUS or just a factor.
signed_factor -> PLUSMINUS factor signed_factor -> factor
A factor can be either a function or an argument which might be raised to some power.
factor -> argument factor_op factor_op -> RAISED signed_factor factor_op -> EPSILON
The second rule here allows either a simple argument, because factor_op can be eliminated using the EPSILON rule. It also allows exponentiation via the third rule.
An argument is either just a value or a full expression enclosed in brackets.
argument -> value argument -> FUNCTION argument argument -> OPEN_BRACKET expression CLOSE_BRACKET
The third of these two rules allows us to build expressions of arbitrary complexity because we can keep on putting expressions within expressions as long as they are enclosed in brackets. The second rule allows function expressions to be used as arguments. Note that an expression like sin x is possible without any brackets around the x because the function argument can simply be a value. But what happens when the expression reads something like sin 2*x. This results in an ambiguity should sin 2*x be treated as sin(2*x) or as (sin 2)*x? I have tried typing the expression into Google calculator and it turns out that it makes the distinction between sin 2x which it interprets as sin(2*x) and sin 2*x which it interprets as sin(2)*x. WolframAlpha on the other hand will interpret both as sin(2*x) but when you add spaces, sin 2 x and sin 2 * x, will be interpreted as sin(2)*x. In our case, using the simple LL(1) grammar and ignoring white spaces, we won’t be able to implement such fine grained behaviour. This means our grammar will always interpret the expression as sin(2)*x. Also note that this means that an expression such as sin x^2. will be interpreted as (sin(x))^2.
Finally, a value can be a NUMBER or a VARIABLE
value -> NUMBER value -> VARIABLE
This is the complete grammar for mathematical expressions. Why not try out yourself if you can analyse an expression using this grammar?
In the next post in this series I will program a recursive descent parser based on this grammar.