# An overview of parsing algorithms

Part of "Parsing" series

Parser users tend to separate themselves into bottom-up and top-down tribes. Top-down users value the readability of recursive descent (

`RD`

) implementations of`LL`

parsing along with the ease of semantic action incorporation. Bottom-up users value the extended parsing power of`LR`

parsers, in particular the admissibility of left recursive grammars, although`LR`

parsers cannot cope with hidden left recursion and even`LR(0)`

parse tables can be exponential in the size of the grammar, while an`LL`

parser is linear in the size of the grammar.

A chart parser is a type of parser suited to parsing ambiguous grammars [23]. Chart parsers avoid exponential blowup in parsing time arising from the nondeterminism of a grammar by reducing duplication of work through the use of memoization. Top-down chart parsers (such as packrat parsers) use memoized recursion, whereas bottom-up chart parsers more specifically use dynamic programming (Section 1.7). The

`Earley`

parser is a top-down chart parser, and is mainly used for parsing natural language in computational linguistics [14]. It can parse any context-free grammar, including left-recursive grammars. The`Earley`

parser executes in cubic time in the general case, quadratic time for unambiguous grammars, and linear time for all`LR(k)`

grammars… The`Earley`

parser may be converted from top-down memoized recursive form into bottom-up dynamic programming form [43] Parsing with pictures is a chart parsing algorithm that provides an alternative approach to parsing context-free languages. The authors claim that this method is simpler and easier to understand than standard parsers using derivations or pushdown automata [35]. This parsing method unifies`Earley`

,`SLL`

,`LL`

,`SLR`

, and`LR`

parsers, and demonstrates that`Earley`

parsing is the most fundamental Chomskyan context-free parsing algorithm, from which all others derive.

## Algorithms

List of algorithms (based on this page):

`CYK`

(Cocke–Younger–Kasami) algorithm- bottom-up, dynamic programming, chart parser
- supports any context-free grammar in Chomsky Normal Form
`O(n^3)`

- online demo 1, online demo 2, list of part-of-speech tags

`Earley`

parser (Jay Earley, 1968):- dynamic programming, chart parser
- supports any context-free grammar
`O(n^3)`

`Earley`

gave an outline of a method for turning his recognizers into parsers butit turns out that this method is incorrect. Tomita’s`GLR`

parser returns a shared packed parse forest (SPPF) representation of all derivations of a given string from a given CFG but is worst-case unbounded polynomial order.

### LL

left-to-right, leftmost derivation (top-down), “recursive descent”

`LL(k)`

(Lewis and Stearns, 1968)`k`

tokens of lookahead- very expensive (when introduced)

`LL(1)`

- exactly one token lookahead
- much more efficient
- uses
`FIRST()`

and`FOLLOW()`

to build parse tables - online demo 1, online demo 2

`Efficient LL(k)`

(Terence Parr, 1990)`k`

tokens of lookahead`k`

is gradually increased w/ backtracking as a fallback- basis of original
`ANTLR`

In terms of recognition strength,

`LL`

techniques are widely held to be inferior to`LR`

parsers. The fact that any`LR(k)`

grammar can be rewritten to be`LR(1)`

, whereas`LL(k)`

is stronger than`LL(1)`

, appears to give`LR`

techniques the additional benefit of not requiring k-token lookahead and its associated overhead.In this paper, we suggest thatFurther, a practical method of generating efficient`LL(k)`

is actually superior to`LR(1)`

when translation, rather than acceptance, is the goal.`LL(k)`

parsers is presented. This practical approach is based on the fact that most parsing decisions in a typical`LL(k)`

grammar can be made without comparing k-tuples and often do not even require the full k tokens of look ahead. We denote such`"optimized" LL(k)`

parsers

`GLL`

Generalised`LL`

(Bernard Lang, 1974 & Scott and Johnstone, 2010)`LL`

analogue to`GLR`

- maintains multiple parsing descriptors on a stack

Recursive Descent (RD) parsers are popular because their control flow follows the structure of the grammar and hence they are easy to write and to debug. However, the class of grammars which admit RD parsers is very limited. Backtracking techniques may be used to extend this class, but can have explosive run-times and cannot deal with grammars with left recursion. Tomita-style

`RNGLR`

parsers are fully general but are based on`LR`

techniques and do not have the direct relationship with the grammar that an RD parser has. We develop the fully general`GLL`

parsing technique which is recursive descent-like, and has the property that the parse follows closely the structure of the grammar rules, but uses`RNGLR`

-like machinery to handle non-determinism.The resulting recognisers run in worst-case cubic time and can be built even for left recursive grammars.

`LL(*)`

(Terence Parr and Kathleen Fisher, 2011)`LL`

analogue to`LAR(m)`

- regular lookaheads w/ cyclic DFAs
- basis of
`ANTLR 3`

(?)

Despite the power of Parser Expression Grammars (

`PEG`

s) and`GLR`

, parsing is not a solved problem. Adding nondeterminism (parser speculation) to traditional`LL`

and`LR`

parsers can lead to unexpected parse-time behavior and introduces practical issues with error handling, single-step debugging, and side-effecting embedded grammar actions. This paper introduces the`LL(*)`

parsing strategy and an associated grammar analysis algorithm that constructs`LL(*)`

parsing decisions from ANTLR grammars. At parse-time, decisions gracefully throttle up from conventional fixed k≥1 lookahead to arbitrary lookahead and, finally, fail over to backtracking depending on the complexity of the parsing decision and the input symbols.`LL(*)`

parsing strength reaches into the context-sensitive languages, in some cases beyond what`GLR`

and`PEG`

s can express. By statically removing as much speculation as possible,`LL(*)`

provides the expressivity of`PEG`

s while retaining`LL`

’s good error handling and unrestricted grammar actions.

`ALL(*)`

Adaptive`LL(*)`

(Parsing: The Power of Dynamic Analysis, Terence Parr, Sam Harwell, and Kathleen Fisher, 2014)- parallel regular lookaheads
- dynamic; adapts to input sentences
- simulates augmented recursive transition networks (ATNs)
- basis of
`ANTLR 4`

Despite the advances made by modern parsing strategies such as

`PEG`

,`LL(*)`

,`GLR`

, and`GLL`

, parsing is not a solved problem. Existing approaches suffer from a number of weaknesses, including difficulties supporting side-effecting embedded actions, slow and/or unpredictable performance, and counter-intuitive matching strategies. This paper introduces the`ALL(*)`

parsing strategy that combines the simplicity, efficiency, and predictability of conventional top-down`LL(k)`

parsers with the power of a`GLR`

-like mechanism to make parsing decisions. The critical innovation is to move grammar analysis to parse-time, which lets`ALL(*)`

handle any non-left-recursive context-free grammar.ANTLR 4 generates`ALL(*)`

is O(n4) in theory but consistently performs linearly on grammars used in practice, outperform in general strategies such as`GLL`

and`GLR`

by orders of magnitude.`ALL(*)`

parsers and supports direct left-recursion through grammar rewriting.

A new parsing method called

`LLLR`

parsing is defined and a method for producing`LLLR`

parsers is described. An`LLLR`

parser uses an`LL`

parser as its backbone and parses as much of its input string using`LL`

parsing as possible. To resolve`LL`

conflicts it triggers small embedded`LR`

parsers. An embedded`LR`

parser starts parsing the remaining input and once the`LL`

conflict is resolved, the`LR`

parser produces the left parse of the substring it has just parsed and passes the control back to the backbone`LL`

parser. The`LLLR(k)`

parser can be constructed for any`LR(k)`

grammar. It produces the left parse of the input string without any backtracking and, if used for a syntax-directed translation, it evaluates semantic actions using the top-down strategy just like the canonical`LL(k)`

parser. An`LLLR(k)`

parser is appropriate for grammars where the`LL(k)`

conflicting nonterminals either appear relatively close to the bottom of the derivation trees or produce short substrings. In such cases an`LLLR`

parser can perform a significantly better error recovery than an`LR`

parser since the most part of the input string is parsed with the backbone`LL`

parser.`LLLR`

parsing is similar to`LL(∗)`

parsing except that it (a) uses`LR(k)`

parsers instead of finite automata to resolve the`LL(k)`

conflicts and (b) does not perform any backtracking.

### LR

left-to-right, rightmost derivation (“bottom-up”), “shift/reduce”

- Canonical
`LR`

(Don Knuth, 1965)- allows duplicated states
- fewer conflicts; much larger parse tables
- online demo

`SLR`

(Frank DeRemer)- simple
`LR`

:`LR(0)`

states and their transitions `FOLLOW()`

used to compute lookaheads- online demo

- simple
`LALR`

(Frank DeRemer, 1969)- similar to
`SLR`

but w/ smaller lookaheads - equivalently, a simplified version of canonical
`LR`

- more complicated lookahead calculation
- basis of yacc/bison
- recursive ascent (Thomas Penello, 1986)
- online demo

- similar to
`GLR`

(Masaru Tomita, 1984)- generalized
`LR(k)`

: returns all valid parses - spawns parallel parsing processes on conflicts
- graph-structured stack (
`GSS`

) - more efficient than Earley

- generalized
`LAR(m)`

(Practical arbitrary lookahead LR parsing, Bermudez and Schimpf, 1990)- regular lookaheads w/ cyclic DFAs

`GLR*`

(GLR* - An Efficient Noise-skipping Parsing Algorithm For Context Free Grammars Alon Lavie and Masaru Tomita, 1993)`SGLR`

Scannerless`GLR`

(Scannerless generalized-LR parsing, Eelco Visser, 1999)

Current deterministic parsing techniques have a number of problems. These include the limitations of parser generators for deterministic languages and the complex interface between scanner and parser. Scannerless parsing is a parsing technique in which lexical and context-free syntax are integrated into one grammar and are all handled by a single context-free analysis phase. This approach has a number of advantages including discarding of the scanner and lexical disambiguation by means of the context in which a lexical token occurs. Scannerless parsing generates a number of interesting problems as well. Integrated grammars do not fit the requirements of the conventional deterministic parsing techniques. A plain context-free grammar formalism leads to unwieldy grammars, if all lexical information is included. Lexical disambiguation needs to be reformulated for use in context-free parsing. The

`scannerless generalized-LR`

parsing approach presented in this paper solves these problems. Grammar normalization is used to support an expressive grammar formalism without complicating the underlying machinery. Follow restrictions are used to express longest match lexical disambiguation. Reject productions are used to express the prefer keywords rule for lexical disambiguation. The`SLR(1)`

parser generation algorithm is adapted to implement disambiguation by general priority and associativity declarations and to interpret follow restrictions.`Generalized-LR`

parsing is used to provide dynamic lookahead and to support parsing of arbitrary context-free grammars including ambiguous ones. An adaptation of the`GLR`

algorithm supports the interpretation of grammars with reject productions.

`RIGLR`

Reduction Incorporated Generalized`LR`

(Elizabeth Scott and Adrian Johnstone. Generalised bottom up parsers with reduced stack activity, 2005)

We describe a generalized bottom up parser in which non-embedded recursive rules are handled directly by the underlying automaton, thus limiting stack activity to the activation of rules displaying embedded recursion. Our strategy is motivated by Aycock and Horspool’s approach, but uses a different automaton construction and leads to parsers that are correct for all context-free grammars, including those with hidden left recursion. The automaton features edges which directly connect states containing reduction actions with their associated goto state: hence we call the approach reduction incorporated generalized LR parsing. Our parser constructs shared packed parse forests in a style similar to that of Tomita parsers. We give formal proofs of the correctness of our algorithm, and compare it with Tomita’s algorithm in terms of the space and time requirements of the running parsers and the size of the parsers' tables.

`RNGLR`

Right nulled`GLR`

parsers (Elizabeth Scott and Adrian Johnstone. Right nulled GLR parsers, 2006)

The right nulled generalized

`LR`

parsing algorithm is a new generalization of`LR`

parsing which provides an elegant correction to, and extension of, Tomita’s`GLR`

methods whereby we extend the notion of a reduction in a shift-reduce parser to include right nulled items. The result is a parsing technique which runs in linear time on`LR(1)`

grammars and whose performance degrades gracefully to a polynomial bound in the presence of non`LR(1)`

rules. Compared to other`GLR`

-based techniques, our algorithm is simpler and faster.

`BRNGLR`

Binary Right Nulled GLR (BRNGLR: a cubic Tomita-style GLR parsing algorithm, Elizabeth Scott, Adrian Johnstone, Rob Economopoulos, 2007)

Tomita-style generalised

`LR`

(`GLR`

) algorithms extend the standard`LR`

algorithm to non-deterministic grammars by performing all possible choices of action. Cubic complexity is achieved if all rules are of length at most two. In this paper we shall show how to achieve cubic time bounds for all grammars by binarising the search performed whilst executing reduce actions in a`GLR`

-style parser. We call the resulting algorithm Binary Right Nulled`GLR`

(`BRNGLR`

) parsing. The binarisation process generates run-time behaviour that is related to that shown by a parser which pre-processes its grammar or parse table into a binary form, but without the increase in table size and with a reduced run-time space overhead.BRNGLR parsers have worst-case cubic run time on all grammars, linear behaviour on`LR(1)`

grammars and produce, in worst-case cubic time, a cubic size binary SPPF representation of all the derivations of a given sentence.

A major research goal for compilers and environments is the automatic derivation of tools from formal speciﬁcations. However, the formal model of the language is often inadequate; in particular,

`LR(k)`

grammars are unable to describe the natural syntax of many languages, such as C++ and Fortran, which are inherently non-deterministic. Designers of batch compilers work around such limitations by combining generated components with ad hoc techniques (for instance, performing partial type and scope analysis in tandem with parsing). Unfortunately, the complexity of incremental systems precludes the use of batch solutions. The inability to generate incremental tools for important languages inhibits the widespread use of language-rich interactive environments. We address this problem by extending the language model itself, introducing a program representation based on parse DAGs that is suitable for both batch and incremental analysis. Ambiguities unresolved by one stage are retained in this representation until further stages can complete the analysis, even if the resolution depends on further actions by the user. Representing ambiguity explicitly increases the number and variety of languages that can be analyzed incrementally using existing methods.

`MSGLR`

Modular`SGLR`

(A Modular SGLR Parsing Architecture for Systematic Performance Optimization, Denkers, Jasper et al., 2018)

`SGLR`

parsing is an approach that enables parsing of context-free languages by means of declarative, concise and maintainable syntax definition. Existing implementations suffer from performance issues and their architectures are often highly coupled without clear separation between their components. This work introduces a modular`SGLR`

architecture with several variants implemented for its components to systematically benchmark and improve performance. This work evaluates these variants both independently and combined using artificial and real world programming languages grammars. The architecture is implemented in Java as JSGLR2, the successor of the original parser in Spoofax, interpreting parse tables generated by SDF3.The improvements combined result into a parsing and imploding time speedup from 3x on Java to 10x on GreenMarl with respect to the previous JSGLR implementation.

We present the Incremental Scannerless Generalized

`LR`

(`ISGLR`

) parsing algorithm, which combines the benefits of Incremental Generalized`LR`

(`IGLR`

) parsing and Scanner-less Generalized LR (`SGLR`

) parsing. The`ISGLR`

parser can reuse parse trees from unchanged regions in the input and thus only needs to parse changed regions. We also present incremental techniques for imploding the parse tree to an Abstract Syntax Tree (AST) and syntax highlighting. Scannerless parsing relies heavily on non-determinism during parsing, negatively impacting the incrementality of`ISGLR`

parsing.We evaluated the`ISGLR`

parsing algorithm using file histories from Git, achieving a speedup of up to 25 times over non-incremental`SGLR`

### PEG

`TDPL`

or`TS`

(The tmg recognition schema, Alexander Birman, 1970)`PEG`

: parsing expression grammar (Parsing Expression Grammars: A recognition-based Syntactic Foundation, Bryan Ford, 2004)- similar to CFG-based grammars, but alternation is inherently ordered
- often implemented with “packrat” parsers that memoize partial results
- recursive descent with backtracking

For decades we have been using Chomsky’s generative system of grammars, particularly context-free grammars(CFGs)and regular expressions(REs), to express the syntax of programming languages and protocols. The power of generative grammars to express ambiguity is crucial to their original purpose of modeling natural languages, but this very power makes it unnecessarily difficult both to express and to parse machine-oriented languages using CFGs. Parsing Expression Grammars(

`PEG`

s) provide an alternative recognition-based formal foundation for describing machine-oriented syntax, which solves the ambiguity problem by not introducing ambiguity in the first place Where CFG express nondeterministic choice between alternatives,`PEG`

s instead use prioritized choice.`PEG`

s address frequently felt expressiveness limitations of CFGs and REs, simplifying syntax definitions and making it unnecessary to separate their lexical and hierarchical components.A linear-time parser can be built for any PEG, avoiding both the complexity and fickleness of`LR`

parsers and the inefficiency of generalized CFG parsing. While PEGs provide a rich set of operators for constructing grammars, they are reducible to two minimal recognition schemas developed around 1970,`TS/TDPL`

and`gTS/GTDPL`

, which are here proven equivalent ineffective recognition power.

Abstract. Parsing Expression Grammars (

`PEG`

s) are specifications of unambiguous recursive-descent style parsers. PEGs incorporate both lexing and parsing phases and have valuable properties, such as being closed under composition. In common with most recursive-descent systems, raw`PEG`

s cannot handle left-recursion; traditional approaches to left-recursion elimination lead to incorrect parses. In this paper, I show how the approach proposed for direct left-recursive Packrat parsing by Warth et al. can be adapted for ‘pure’`PEG`

s. I then demonstrate that this approach results in incorrect parses for some`PEG`

s, before outlining a restrictive subset of left-recursive`PEG`

s which can safely work with this algorithm. Finally I suggest an alteration to Warth et al.’s algorithm that can correctly parse a less restrictive subset of directly recursive`PEG`

s.

Packrat parsing is a linear-time implementation method of recursive descent parsers. The trick is a memoization mechanism, where all parsing results are memorized to avoid redundant parsing in cases of backtracking. An arising problem is extremely huge heap consumption in memoization, resulting in the fact that the cost of memoization is likely to outweigh its benefits. In many cases, developers need to make a difficult choice to abandon packrat parsing despite the possible exponential time parsing. Elastic packrat parsing is developed in order to avoid such a difficult choice. The heap consumption is upper-bounded since memorized results are stored on a sliding window buffer. In addition, the buffer capacity is adjusted by tracing each of nonterminal backtracking activities at runtime. Elastic packrat parsing is implemented in a part of our Nez parser. We demonstrate that the elastic packrat parsing achieves stable and robust performance against a variety of inputs with different backtracking activities.

A recursive descent parser is built from a set of mutually-recursive functions, where each function directly implements one of the nonterminals of a grammar. A

`packrat`

parser uses memoization to reduce the time complexity for recursive descent parsing from exponential to linear in the length of the input. Recursive descent parsers are extremely simple to write, but suffer from two significant problems:(i) left-recursive grammars cause the parser to get stuck in infinite recursion, and (ii) it can be difficult or impossible to optimally recover the parse state and continue parsing after a syntax error. Both problems are solved by the, a novel reformulation of`pika`

parser`packrat`

parsing as a dynamic programming algorithm, which requires parsing the input in reverse: bottom-up and right to left, rather than top-down and left to right. This reversed parsing order enables`pika`

parsers to handle grammars that use either direct or indirect left recursion to achieve left associativity, simplifying grammar writing, and also enables optimal recovery from syntax errors, which is a crucial property for IDEs and compilers.`Pika`

parsing maintains the linear-time performance characteristics of`packrat`

parsing as a function of input length. The`pika`

`parser`

was benchmarked against the widely-used Parboiled2 and ANTLR4 parsing libraries, and the pika`parser`

performed significantly better than the other`parsers`

for an expression grammar, although for a complex grammar implementing the Java language specification, a large constant performance impact was incurred per input character for the`pika`

parser, which allowed Parboiled2 and ANTLR4 to perform significantly better than the`pika`

parser for this grammar (in spite of ANTLR4’s parsing time scaling between quadratically and cubically in the length of the input with the Java grammar).Therefore, if performance is important,Several new insights into precedence, associativity, and left recursion are presented.`pika`

parsing is best applied to simple to moderate-sized grammars, or to very large inputs, if other parsing alternatives do not scale linearly in the length of the input.

**See also**:

We show how one way of removing non-determinism from this formalism yields a formalism with the semantics of PEGs. We also prove, based on these new formalisms, how

`LL(1)`

grammars define the same language whether interpreted as CFGs or as`PEG`

s, and also show how strong-`LL(k)`

, right-linear, and LL-regular grammars have simple language-preserving translations from CFGs to`PEG`

s– On the relation between context-free grammars and parsing expression grammars

- Adaptable Parsing Expression Grammars (APEG), 2014
- Parsing Expression Grammars with Unordered Choices, 2017

### Earley variations

We present the design and theory of a new parsing engine, YAKKER, capable of satisfying the many needs of modern programmers and modern data processing applications. In particular, our new parsing engine handles (1) full scannerless context-free grammars with (2) regular expressions as right-hand sides for defining nonterminals. YAKKER also includes (3) facilities for binding variables to intermediate parse results and (4) using such bindings within arbitrary constraints to control parsing. These facilities allow the kind of data-dependent parsing commonly needed in systems applications, particularly those that operate over binary data. In addition, (5) nonterminals may be parameterized by arbitrary values, which gives the system good modularity and abstraction properties in the presence of data-dependent parsing. Finally, (6) legacy parsing libraries, such as sophisticated libraries for dates and times, may be directly incorporated into parser specifications… We prove the correctness of our translation of data-dependent grammars into these new automata and then show how to implement the automata efficiently using a variation of Earley’s parsing algorithm.

We present a new, virtual machine based approach to parsing, heavily based on the original Earley parser. We show how to translate grammars into virtual machine instruction sequences that are then used by the parsing algorithm. Additionally, we introduce an optimization that merges shared rule prefixes to increase parsing performance. Finally, we present and evaluate an implementation of scannerless Earley Virtual Machine

### Parsing with derivatives

Might et al. (2011) introduced parsing with derivatives,which handles arbitrary context-free grammars while be-ing both easy to understand and simple to implement. De-spite much initial enthusiasm and a multitude of independent implementations, its worst-case complexity has never been proven to be better than exponential. In fact, high-level arguments claiming it is fundamentally exponential have been advanced and even accepted as part of the folklore. Performance ended up being sluggish in practice, and this sluggishness was taken as informal evidence of exponentiality – On the Complexity and Performance of Parsing with Derivatives

- Derivatives of Regular Expressions, JANUSZ A. BRZOZOWSKI, 1964
- Regular-expression derivatives re-examined, 2009
- Parsing with Derivatives, 2011
- POSIX Regular Expression Parsing with Derivatives, 2014
- LL(1) Parsing with Derivatives and Zippers, 2019
- Derivative Grammars: A Symbolic Approach to Parsing with Derivatives, 2019

## PS

History:

- If you want to know more history of BNF and REG see Guy Steele talk.
- Parsing: a timeline

Visualizing algorithms:

About ambiguity:

- The Usability of Ambiguity Detection Methods for Context-Free Grammars
- Ambiguity Detection for Context-FreeGrammars in Eli

More reading:

- Unifying LL and LR parsing
- Fast Context-Free Grammar Parsing Requires Fast Boolean Matrix Multiplication
- Comparison of Parsing Techniques
- Critical Analysis of Extensible Parsing Tools and Techniques
- Awesome binary parsing

If you want to learn more about dynamic programming read this.