Writing a Parser in Java: Introduction

Edit: The full parser is available for download here.

In this little series we will explain how to code a simple expression parser in Java. Sometimes it is necessary to read user input, such as formula expressions, some simple markup, or even formatting strings. This user input has to be parsed and converted into a form that you program can handle. Maybe you want to construct an object hierarchy or you wan to produce a formatted output string which reflects what the user has typed in. In addition you want you parser to tell the user when there was a problem with the input, such as a syntax error, and produce a meaningful error message.

The process of parsing user input contains a series of steps.

1) The input needs to be split into tokens. A token can be a single character or a series of characters. In a mathematical expression tokens might include the operators such as the plus sign ‘+’ or the minus sign ‘-‘, but also variable names like ‘x’ or function names such as ‘sin’. For an XML parser tokens could be ‘<‘ or a tag name. The task of recognizing tokens is handled by the tokenizer.

2) A grammar needs to be constructed. The grammar specifies a set of rules wich specify what kind of input is expected. The grammar works on the token level. A simple rule for parsing the sum of two variables could be

sum -> var + var

At this level, the grammar does not distinguish between different variables and in the rule above the two ‘var’ tokens could stand for different variables. Creating a grammar is a somewhat abstract process and is usually done on a piece of paper.

3) The grammar is turned into a parser. Different algorithms exist for generating the code of a parser. Depending on the complexity of the grammar this task is often carried out using a parser generator. Parser generators generate code which you can feed tokens into. The generated code will then evaluate the tokens and decide on the grammar rules to apply. In this series we will look at a very simple grammar only. For simple grammars the parser can be coded by hand using the recursive descent algorithm.

4) Actions are taken when specific rules of the grammar are applied. These actions are needed to turn the input string into a data structure that can be used by the program later on. Actions could involve building a tree structure of objects or writing a piece of formatted string into a stream.

In the continuation of this series I will talk about each of these four steps individually and step by step, we will be building a recursive descent parser for analysing mathematical expressions.



  1. lanre

    Please, i need a complete lesson on building parsers in Java for an NLP project i am about to carry out. PLEASE give me the complete lessons as soon as possible or send it to my mail box. And PLEASE i need to contact you so we can talk

    1. cogitolearning (Post author)

      This series of posts will only be treating a simple parser based on the LL(1) grammar. I don’t think that an LL(1) parser will be able to cope with any realistic natural language processing project. The forthcoming articles in this series will be published within the coming weeks.

Comments are closed.