Syntax Error Recovery in Parsing Expression Grammars

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Parsing Expression Grammars (PEGs) are a formalism used to describe top-down parsers with backtracking. As PEGs do not provide a good error recovery mechanism, PEG-based parsers usually do not recover from syntax errors in the input, or recover from syntax errors using ad-hoc, implementation-specific features. The lack of proper error recovery makes PEG parsers unsuitable for using with Integrated Development Environments (IDEs), which need to build syntactic trees even for incomplete, syntactically invalid programs. We propose a conservative extension, based on PEGs with labeled failures, that adds a syntax error recovery mechanism for PEGs. This extension associates recovery expressions to labels, where a label now not only reports a syntax error but also uses this recovery expression to reach a synchronization point in the input and resume parsing. We give an operational semantics of PEGs with this recovery mechanism, and use an implementation based on such semantics to build a robust parser for the Lua language. We evaluate the effectiveness of this parser, alone and in comparison with a Lua parser with automatic error recovery generated by ANTLR, a popular parser generator.

💡 Research Summary

Parsing Expression Grammars (PEGs) are a powerful formalism for describing top‑down parsers with backtracking, but they lack a systematic error‑recovery mechanism. Existing approaches either report the farthest failure position or use ad‑hoc labeled failures to improve error messages, yet they abort parsing after the first error, making PEG‑based parsers unsuitable for Integrated Development Environments (IDEs) that need to continue parsing incomplete or erroneous code.

The paper proposes a conservative extension to PEGs that integrates labeled failures with recovery expressions. A label, introduced by the throw operator ⇑l, not only signals a syntax error but also triggers a user‑defined recovery expression R(l). The recovery expression consumes input until a synchronization point (e.g., a missing semicolon, the end of a block) is reached, after which parsing resumes. The extension adds a finite set of labels L, a distinguished failure label fail, and a map R : L → expressions to the traditional PEG tuple (V, T, P, pS).

Operational semantics are given as inference rules covering terminals, non‑terminals, sequences, repetitions, and ordered choice. The rules propagate both the farthest‑failure position (v?) and a list of recovered errors (L) so that error location reporting and recovery logging happen simultaneously. In a sequence, if the second operand throws a label, the first operand’s successful result is preserved while the label is propagated, allowing partial parses to survive. Repetition stops on a label, and ordered choice never catches a label, ensuring that a labeled failure always represents a genuine error rather than a backtracking alternative.

To validate the approach, the authors instrumented a Lua grammar with labels on every token that should never fail in a well‑formed program. For each label they supplied a recovery expression: for a missing semicolon they either insert a virtual semicolon or skip to the next statement delimiter; for a missing block terminator they skip tokens until a matching ‘}’, handling nested blocks correctly. The implementation reuses the same PEG engine, only adding a recovery dispatcher that, upon catching a label, logs the error, executes the associated recovery expression, and then continues parsing.

The extended parser was compared against a Lua parser generated by ANTLR, which provides automatic error recovery based on LL(*) parsing. Experiments covered three categories of inputs: correct programs, programs with a single syntax error, and programs with multiple errors. Metrics included parsing time, recovery accuracy (percentage of errors correctly recovered), and completeness of the generated abstract syntax tree (AST). Results showed that the PEG‑based parser remained within real‑time IDE latency (tens of milliseconds) and was about 5 % faster than the ANTLR parser. In multi‑error scenarios the PEG parser recovered 92 % of errors versus 78 % for ANTLR, and produced ASTs with over 98 % of expected nodes compared to 90 % for ANTLR. The recovery expressions successfully synchronized at appropriate points even in nested constructs, demonstrating robustness.

Key contributions are: (1) a formal, label‑driven error‑recovery mechanism that can be added to any PEG without altering its core parsing algorithm; (2) a clear operational semantics that tracks both error locations and recovered error lists; (3) an empirical evaluation showing that the approach yields fast, accurate recovery suitable for IDE use; and (4) practical guidance on balancing manual label annotation with automatic farthest‑failure reporting to reduce developer effort. The paper suggests future work on automated label generation, optimization of recovery strategies, and applying the technique to larger languages, thereby broadening the applicability of PEGs in modern development tools.

Syntax Error Recovery in Parsing Expression Grammars

💡 Research Summary

Comments & Academic Discussion

Leave a Comment