Mouse: from Parsing Expressions to a practical parser

Version 1.9.2

Parsing Expression Grammar (PEG), introduced by Ford in [1], is a way to define the syntax of a programming language. It encodes a recursive-descent parser for that language. Mouse is a tool to transcribe PEG into an executable parser. Both Mouse and the resulting parser are written in Java, which makes them operating-system independent. An integral feature of Mouse is the mechanism for specifying semantics (also in Java).

Recursive-descent parsing with backtracking

A recursive-descent parser consists of procedures that correspond to syntax rules. The procedures call each other recursively, each being responsible for recognizing input strings defined by its rule. The syntax definition and its parser can be then viewed as the same thing, presented in two different ways. This design is very transparent and easy to modify when the language evolves.

The idea can be traced back to Lucas [2], who suggested it for grammars that later became known as the Extended Backus-Naur Form (EBNF). It has two problems. First, the langauage must not contain left-recursion, as this results in an infinite descent. Second, procedures corresponding to certain types of syntax rules must decide which procedure to call next.

There is so far no simple solution to the first problem. The second is easily solved for a class of languages that have the so-called LL(1) property: the decision can be made by looking at the next input symbol. Some syntax defintions can be transformed to satisfy this property. However, forcing the language into the LL(1) mold can make the grammar -- and the parser -- unreadable.

A more general solution is backtracking. It means that the parser proceeds by trial and error: tries the alternatives one after another and goes back if it took a wrong decision. However, an exhaustive search may require exponential time. A reasonable compromise is limited backtracking, also called "fast-back" in [3].

Limited backtracking was adopted in at least two of the early top-down designs: the Atlas Compiler Compiler of Brooker and Morris [4,5], and TMG (the TransMoGrifier) of McClure [6]. The syntax specification used in TMG was later formalized and analyzed by Birman and Ullman [7,8]. It appears in [9] as "Top-Down Parsing Language" (TDPL) and "Generalized TDPL" (GTDPL). Parsing Expression Grammar is a development of this latter.

A Web site devoted to PEG contains a list of publications and a link to discussion forum.

PEG programming

Parsers defined by PEG do not require a separate "lexer" or "scanner". Together with lifting of the LL(1) restriction, this gives a very convenient tool when you need an ad-hoc parser for some application. However, the limitation of backtracking may have unexpected effects that give an impression of PEG being unpredictable.

For some grammars, the limited backtracking is "efficient", in the sense that it finds everything that would be found by full backtracking. This is, in particular, the case for all langauges having LL(1) property. PEG with efficient backtracking agrees well with intuition and is therefore easier to understand. The Mouse package includes a tool, the PEG Explorer, that assists you in checking whether your PEG has efficient backtracking. It is described here.

In its external appearance, PEG is very like an EBNF grammar. It has been shown in [10,11] that PEG with efficient backtracking defines exactly the same language as its EBNF look-alike.

PEG has a feature that does not have an EBNF counterpart: the syntactic predicates !e and &e. As these operations are recognition-based, they cannot be embedded in the construction-based EBNF. The equivalence just quoted applies thus only to PEG without predicates.

Packrat parsing

Even the limited backtracking may require a lot of time. In [12,13] PEG was introduced together with a technique called packrat parsing. Packrat parsing handles backtracking by extensive memoization: storing all results of parsing procedures. It guarantees linear parsing time at a large memory cost.

(The name "packrat" comes from pack rat - a small rodent (Neotoma cinerea) known for hoarding unnecessary items. "Memoization", introduced in [14], is the technique of reusing stored results of function calls instead of recomputing them.)

Wikipedia lists a number of generators producing packrat parsers from PEG.

Mouse - not a pack rat

The amount of backtracking does not matter in small interactive applications where the input is short and performance not critical. Moreover, the usual programming languages do not require much backtracking. The experiments reported in [15,16] demonstrated a moderate backtracking activity in PEG parsers for programming languages Java and C.

In view of these facts, it is useful to construct PEG parsers where the complexity of packrat technology is abandoned in favor of simple and transparent design. This is the idea of Mouse: a parser generator that transcribes PEG into a set of recursive procedures that closely follow the grammar. The name Mouse was chosen in contrast to Rats!, one of the first generators producing packrat parsers [17]. Optionally, Mouse can offer a small amount of memoization using the technique described in [15].


This is an example of PEG:

Sum     = Space Sign Number (AddOp Number)* !_ {} ;
Number  = Digits? "." Digits Space {}
        / Digits Space {} ;
Sign    = ("-" Space)? ;
AddOp   = [-+] Space ;
Digits  = [0-9]+ ;
Space   = " "* ;

Mouse transcribes it into an executable parser written in Java. Each rule of PEG becomes a Java method named after the rule. For example, the first rule is transcribed into this method;

//  Sum = Space Sign Number (AddOp Number)* !_ {} ;
private boolean Sum()
    if (!Number()) return reject();
    while (Sum_0());
    if (!aheadNot()) return reject();
    return accept();

A pair of brackets {} at the end of a PEG rule indicates that you provide a semantic action for the rule. It is a method in a separate Java class and may look like this:

//  Sum = Space Sign Number AddOp Number ... AddOp Number
//          0     1    2      3     4         n-2   n-1
void Sum()
    int n = rhsSize();
    int s = (Double)rhs(2).get();
    if (!rhs(1).isEmpty()) s = -s;
    for (int i=4;i<n;i+=2)
      if (rhs(i-1).charAt(0)=='+')
        s += (Double)rhs(i).get();
        s -= (Double)rhs(i).get();
Special functions like "rhsSize()" and "rhs(i)" provide access to parts of the parsed text.


  1. ^ B. Ford. Parsing Expression Grammars: A Recognition-Based Syntactic Foundation. Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Venice 2004, pp. 111 - 122.
  2. ^ P. Lucas. The Structure of Formula-Translators.
    ALGOL Bulletin Supplement 16, September 1961, pp. 1 - 21.
  3. ^ F. R. A. Hopgood. Compiling Techniques, MacDonalds 1969.
  4. ^ P. A. Brooker, D. Morris. A General Translation Program for Phrase Structure Languages. The Computer Journal 9, 1 (1962), pp. 1 - 10.
  5. ^ S. Rosen. A Compiler-Building System Developed by Brooker and Morris. Communications of ACM 7, 7 (1964) pp. 403 - 414.
  6. ^ R. M. McClure. TMG - a syntax directed compiler.
    Proceedings of the 20th ACM National Conference, Cleveland, Ohio 1965, pp. 262 - 274.
  7. ^ A. Birman. The TMG Recognition Schema. Ph.D. thesis, Princeton University, 1970.
  8. ^ A. Birman and J. D. Ullman. Parsing Algorithms with Backtrack.
    Information and Control 23 (1973), pp. 1 - 34.
  9. ^ A. V. Aho and J. D. Ullman. The Theory of Parsing, Translation and Compiling,
    Vol. I, Parsing. Prentice Hall, 1972.
  10. ^ F. Mascarenhas, S. Medeiros and R. Ierusalimschy. On the Relation between Context-Free Grammars and Parsing Expression Grammars.
    Science of Computer Programming 89 (2014) pp. 235 - 250.
  11. ^ R. R. Redziejowski. From EBNF to PEG.
    Fundamenta Informaticae 128, 1-2 (2013), pp. 177 - 191.
  12. ^ B. Ford. Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking.
    M.Sc. thesis, MIT 2002.
  13. ^ B. Ford. Packrat Parsing: Simple, Powerful, Lazy, Linear Time.
    Proceedings of the 2002 International Conference on Functional Programming, Pittsburgh 2002, pp. 36 - 47.
  14. ^ D. Michie. Memo Functions and Machine Learning. Nature 218 (1968), pp. 19 - 22.
  15. ^ R. R. Redziejowski. Parsing Expression Grammar as a Primitive Recursive-Descent Parser with Backtracking. Fundamenta Informaticae 79, 3-4 (2007), pp. 513 - 524.
  16. ^ R. R. Redziejowski. Some aspects of Parsing Expression Grammar.
    Fundamenta Informaticae 85, 1-4 (2008), pp. 441 - 454.
  17. ^ R. Grimm. Practical Packrat Parsing.
    Technical Report TR2004-854, Dept. of Computer Science, New York University (2004).



Download user's manual / tutorial (PDF file).
Download the Mouse package (gzipped TAR file).

License: Apache Version 2.

Sample grammars

Parsing Expression Grammar for Java 1.8

Based on Java Language Specification, Java SE8 Edition dated 2014-03-03, with corrections based on test results and javac parser code.

Download the grammar in Mouse format (text file).

Parsing Expression Grammar for Java 1.7

Based on Chapters 3 and 18 Based on of Java Language Specification, Java SE7 Edition dated 2012-07-27, with corrections based on other chapters and test results.

Download the grammar in Mouse format (text file).

Parsing Expression Grammar for Java 1.6

Based on Chapters 3 and 18 of Java Language Specification, Third Edition, with corrections based on other chapters and test results.

It is an updated version of the grammar used for experiments described in the paper Parsing Expression Grammar as a Primitive Recursive-Descent Parser with Backtracking.

Download the grammar in Mouse format (text file).

Parsing Expression Grammar for C

Based on ISO/IEC 9899.1999:TC2 standard without preprocessor directives,
meaning that it applies to preprocessed C source.

It is the grammar used for experiments described in the paper Some aspects of Parsing Expression Grammar. It has been tested against 100 preprocessed C files from different open source projects.

The typedef feature makes the syntax of C context-dependent, which cannot be expressed in PEG formalism. The problem is solved by semantic actions provided together with the grammar.

Download the grammar in Mouse format (text file).
Download semantic actions for the grammar (Java source).

Latest change 2017-12-14