Problems & pains in parsing: a story of lexer-hack

A tale of popular issues in parsing mainstream languages because of design oversights.

  • By Swarnim
  • ·
  • Insights
  • C
  • C++
Last updated on Nov 6, 2022

Parsing is defined as the process of turning streams of bits into meaningful structures, generally trees or bounded graphs. Perhaps we could even define it as the process of understanding data streams. But this definition leads to several questions, like "What does it mean to parse a language?", or "What kinds of weird issues might one face when looking into this abyss?". I will try to go over these in this post. Hopefully, it will also help us understand why language design matters not just to the consumers of a language but also to the developers and people trying to make tools for it.

Before starting, I would like to clarify that we will only discuss the parsing of programming languages. Therefore, we'll only be concerned with "Syntax Trees" as the primary output.

How is language parsing approached?

A crucial step to parsing any language is deciding on the grammar to be used. Language grammars have a lot of categories depending on the "parsing complexity" of the language and how "expressive" or "flexible" the language is at expressing various semantics.

Each of the categories below is a superset of the ones above it.

The primary generalizations of grammar are:

  • Regular Grammar: The least flexible but most straightforward.
    • Regular expressions are a good example (It's even in the name!).
  • Context-Free Grammar(CFG): Much more flexible; it can express constructs of higher complexity.
    • Many programming language grammars (e.g., SQL) and all regular grammars are a part of this family.
  • Context-Sensitive Grammar(CSG): Extremely flexible and absurdly complicated.
    • e.g., "C++" and any grammar from previous categories.

Furthermore, there is, in general, one last category of grammar to consider:

  • Mostly Context-Free Grammar: These grammars are not entirely CFGs but their design incorporates context sensitivity, allowing for greater flexibility while keeping complexity from increasing significantly. For example - "Range concatenation grammar." These are often also referred to as "mildly context-sensitive grammars".

These generalizations allow us to do the following things:

  • Understand language grammars
  • Create new ones easily
  • Constrain the definition of a valid program in a language
  • Create improved tooling for a language

Example of CFG toy language

There are many standard tools available for parsing or generating a parser using a language's CFG, such as yacc or its slightly more feature-rich superset, bison. Some libraries, such as ANTLR, are even more capable, allowing use beyond simple CFGs.

Such tools commonly accept a form of PEG (Parsing Expression Grammar), generally "BNF (Backus-Naur Form)". BNF is a formal notation capable of representing context-free grammar. However, grammars created with it are inflexible in that a BNF grammar must be unambiguous; i.e., any input that matches a BNF grammar must always parse into only one parse tree.

Let's consider some code we'd like to parse:

LET X = 10      # initialize with value 10
SET X = X + 10  # write to X
PRINT X         # read X

And this is one possible syntax tree (aka, parse tree) that could be generated for the snippet above:

SOURCE
  LET_OP
  | IDENTIFIER “X”
  | INTEGER “10”
  SET_OP
  | IDENTIFIER “X”
  | BIN_EXPR
  | | OP ADD
  | | LHS
  | | | IDENTIFIER “X”
  | | RHS
  | | | INTEGER “10”
  FUNCTION_CALL
  | FUNCTION_NAME “PRINT”
  | ARGS
  | | IDENTIFIER “X”

Another popular format for representing CFGs is ABNF (Augmented Backus-Naur Form). Here's how an ABNF grammar for our code snippet would look:

<LET_OP>        ::= "LET" <IDENTIFIER> "=" <EXPR>
<SET_OP>        ::= "SET" <IDENTIFIER> "=" <EXPR>

<EXPR>          ::= <BIN_EXPR> | <UN_EXPR> | <IDENTIFIER> | <LITERAL>
<BIN_EXPR>      ::= <LHS> <OP> <RHS>
<LHS>           ::= <EXPR>
<UN_EXPR>       ::= <OP> <EXPR>

<FUNCTION_CALL> ::= <FUNCTION_NAME> <ARGS>
<ARGS>          ::= <EXPR> "," <ARGS> | <EXPR>

<LITERAL>       ::= <INTEGER> | <FLOAT> | <STRING> | <CHAR>

<OP>            ::= <ADD> | <SUB> | <MUL> | <DIV>
<ADD>           ::= "+"
<SUB>           ::= "-"
<MUL>           ::= "*"
<DIV>           ::= "/"

But what’s the story of the problems and pains with parsing C?

C is a prime example of a language that is extremely popular, with many large and robust codebases written in it, but has been plagued by problems in its grammar's design.

The main underlying problem with C is that it is a context-sensitive language. However, what exacerbates these is that C requires type information (i.e., information about the structure of data used within the program) to produce valid syntax trees.

Here's is a typical example of this issue:

(A) * B

This expression could be one of two things:

  • Dereference B and typecast it to the type A, i.e., (A)(*B)
  • Multiply A and B, i.e., A * B

How the code should be understood completely depends on the expression A. Such ambiguities harm readability, thereby requiring an IDE for reading simple C programs. This makes it a rather annoying language to deal with. This also makes it harder to debug issues in isolation, as you may need to parse the entire codebase to make sense of a short snippet.

The classic solution: Lexer-hack

"Lexer hack" works by first performing a "rough" parse, collecting information about symbols (the names of variables and data types), and then feeding that data back into the lexer, so it can make more informed decisions when parsing.

Generally, the first parse would collect type aliases (created from typedef statements), raw type definitions (struct declarations), and some rudimentary scoping data to figure out overlapping definitions for each scope.

This information is then fed back to the lexer, which will correctly infer context based on type information at any point in code.

lexer(src) => tokens
-> parser(tokens) => parse-tree
-> semantics(parse-tree) => typedefs table
-> lexer(tokens + typedefs) => valid tokens
-> parser() + semantics() => valid syntax-tree
-> Final Result

This solution not only makes it harder to implement a C parser with general-purpose tools, but also hurts the parser's performance when compared to a typical top-down or bottom-up parsing strategy with limited lookahead.

Another common pitfall is scoping restrictions. For instance:

typedef struct {
  ...
} A;
int main() {
  int A = 10; // redefinition, sad :( we can't resolve A as a type definition now...
  int B = 20;
  ...
  int C = (A)*B;
}

This happens because in C, without the added context, the lexer can't distinguish between typedefs and variables. Both types and variables are identifiers, so they have the same format in most syntax trees.

Using the lexer hack, we give the lexer extra context to infer whether A is a typedef alias or a variable. In this case, A would be interpreted as a variable, and the expression (A)*B would be parsed as (A * B).

An alternative solution: Lexerless parsing

In software development, there rarely is just one answer to a problem. So it follows that there are alternatives to the lexer hack. The first of these is "lexerless parsing". It's relatively more complicated to implement as it requires a conceptual rethinking of the parsing process, where the stream of characters is both tokenized and parsed simultaneously.

This can be a rather ugly solution, with the key disadvantage of being much slower due to the lack of a pipeline between lexing and parsing steps. It also ends up producing complicated parsers even for trivial grammars. Thus, it is only used in situations where having a distinction between the lexer and the parser would complicate the system or cause issues in parsing (as in the case of C). Examples of languages that use lexerless parsing are TeX and Makefiles.

The best solution: Clang's semantic parsing

The best solution, in my opinion, is present in Clang.

Simply put, Clang's lexer does not require semantic analysis. It does not attempt to distinguish between variables and types. Instead, it generates generic identifier tokens. The responsibility of determining if a token is a variable or a type definition falls on the parser, which uses semantic analysis to disambiguate individual cases as per the symbols present in the current scope.

This approach provides much greater flexibility with a comparatively lower performance overhead. It is also the more popular approach in languages where the type or class name identifiers and variable name identifiers aren't separated.

You can look at the source code of Clang's parser here.

Concluding thoughts

Parsers and research in language theory have come a long way since the inception of computer science as a field of study, and have led to some exciting developments in how analysis tools are built.

Understanding the importance of having a language with non-ambiguous syntax (a sane grammar) is essential to building any language with its associated tools and ensuring it is readable. Even if your language requires certain features that can't be implemented very cleanly within the constraints of a CFG, you should still try to limit the scope of such features.

An excellent example to explain this fact is Rust, which is context-free, though it still has a few non-context-free elements (one example would be raw string constants). Rust is a language for which a semantically correct parsable (Yes, this definition requires all three of these words in this particular order, which is annoying) syntax cannot be represented by a CFG.

But the highlight is that Rust has taken extra care to ensure that context sensitivity depends only on syntax, and not on a project's global semantic scope. As such, this allows languages like Rust to be parsed by Range Concatenation Grammars (RCG), which are powerful enough to parse complex inputs such as raw string constants while still delegating to a CFG-style parser for other parts of the syntax.

Coming back to C, this is not the end of the parsing issues present in the language. But let's put aside topics for later. If you ask me, preferably never.

And let's not forget we haven't even started with C++, from templates and operator overloading to constructors and everything else. C++ is perhaps one of the most challenging languages to parse, no matter which version of the standard you pick up. For this reason, C++ compilers end up with bloat, even without supporting modern language features or a standard library.

Parsing C++ is literally undecidable. Here's a quick example:

A < B, C > D();

Feel free to rack your brain on the snippet above, and when you finally feel confused enough, go ahead and visit "Most Vexing Parse" (from Effective STL by Scott Meyers). There are many issues with parsing C++, and one life isn't enough to discuss all of them, so I present these immortal words to live by:

Just use libclang.

So, is C++ impossible to parse? No, of course not. If it were, the Clang and MSVC compilers wouldn't exist. It's just that it is hard - really hard - especially since C++'s template metaprogramming aspects are Turing complete. In the real world, this problem is mitigated by limiting the template instantiation step size. In short, C++ templates run into that familiar issue known as the Halting Problem.

So the following can be undecidable:

auto x = A<T>::V;

Some would say this only makes type resolution undecidable, and not the language itself. But since the lexer hack forces the compiler to depend on type information in the lexing phase, Q.E.D. the type info is required from the beginning of the compilation anyway.

auto x = (A<B>::ValueOrType) * B;

*Plasters the snake-eating-its-own-tail meme*

References and further reading

Ship clean and secure code.