Writing A Recursive Descent Parser For C Grammar

Matthis · ·
parsingprecedence climbingrecursive descentc

1. Introduction And Motivation

I recently wrote a compiler for a subset of C, which did not feature the many interesting caveats of actually compiling C code. The most interesting part (at least for me) is writing the compiler backend. But well, before arriving there, I had to go through the depths of parsing correct C grammar.

Parsing C grammar might sound relatively easy at first, but once you get to the nitpicky details of it, it becomes unreasonably hard and confusing (e.g. the lexer hack or declaration grammar, to name a few).

I’ll start with parsing expressions and then work up to the harder parts.

2. Parsing Expressions

Parsing expressions is relatively straightforward. As you might have guessed, I do this using precedence climbing.

2.1 A Brief Overview Of Precedence Climbing

I start by defining the operators and their precedence. I made use of a static map to handle this

const static std::unordered_map<ast::BinaryExpr::BinaryOp, PrecedenceInfo> precedence_table{
    {ast::BinaryExpr::BinaryOp::Comma, {.precedence = 1, .is_left_assoc = true}},

    // Assign (right assoc with prec = 2)
    {ast::BinaryExpr::BinaryOp::Assign, {.precedence = 2, .is_left_assoc = false}},
    {ast::BinaryExpr::BinaryOp::AddAssign, {.precedence = 2, .is_left_assoc = false}},
    {ast::BinaryExpr::BinaryOp::SubAssign, {.precedence = 2, .is_left_assoc = false}},
    {ast::BinaryExpr::BinaryOp::MulAssign, {.precedence = 2, .is_left_assoc = false}},
    {ast::BinaryExpr::BinaryOp::DivAssign, {.precedence = 2, .is_left_assoc = false}},
    ...

This table defines the precedence of all binary operators and their associativity, meaning whether they bind from left to right or from right to left. You might have noticed I defined comma as a binary operator. If you start to look into a bit of C, you will eventually come across an expression of this form a = 0,b = 1; this is what’s known as the comma operator. The expressions here will be evaluated from left to right and eventually (if the comma operator is being used as an RValue) take the value of the result of the last expression.

In the context of precedence climbing, associativity means in what order the operators bind. For example assignments bind right to left, because otherwise this would not work

a = b = c = 1;

which is equivalent to

a = (b = (c = 1));

If assignments were binding left to right, this expression would yield an error.

2.2 The Algorithm

The parser loops over operators, comparing their precedence to a running minimum. If the next operator’s precedence meets or exceeds the minimum, the parser recurses to build the right-hand side. If it’s lower, the loop breaks, letting the caller claim the result as its operand.

parse_expression(min_precedence):
  lhs = parse_atom()
  while prec(current_operator) >= min_prec:
    min_prec = prec(current_operator)
    op = current_operator
    new_min_prec = right_assoc ? min_prec : min_prec + 1
    advance()
    rhs = parse_expression(new_min_prec)
    lhs = make_binary_expression(lhs, op, rhs)

return lhs

Consider this expression

3 * 4 + 5

The parser calls parse_expr(0). It parses 3 as the left-hand side, then sees * with precedence 2. Since 2 ≥ 0, the parser enters the loop and recurses with parse_expr(3) (left-associative, so min = 2 + 1).

Inside that recursive call, the parser picks up 4 as the left-hand side, then sees + with precedence 1. Since 1 < 3, the loop doesn’t execute. The call returns 4.

Back in the outer call, the parser now has lhs = BinaryExpr(3 * 4). It sees + with precedence 1. Since 1 ≥ 0, the parser enters the loop again and recurses with parse_expr(2).

That call parses 5, sees no more operators, and returns 5.

This produces the final result: BinaryExpr(BinaryExpr(3 * 4) + 5).

The key moment is when parse_expr(3) sees +. Its precedence is too low to continue, so 4 gets returned to the * expression rather than being captured by +. This is how lower-precedence operators end up as the outer expression.

I also include postfix expression parsing for call expressions, member expressions, and array accesses.

In practice the method for parsing expressions looks like this

ast::Expr* Parser::parse_expr(unsigned min_precedence) {
  ast::Expr* lhs = parse_unary();
  while (true) {
    auto operation = token_binop_map.find(current_.type);
    if (operation == token_binop_map.end()) {
      break;
    }
    auto precedence_info = precedence_table.find(operation->second);
    if (precedence_info == precedence_table.end()) {
      break;
    }

    auto [prec, is_left_assoc] = precedence_info->second;

    if (prec < min_precedence) {
      break;
    }

    advance();

    // This is for left associative operations
    unsigned new_min = is_left_assoc ? prec + 1 : prec;

    auto* rhs = parse_expr(new_min);
    lhs = arena_.create<ast::BinaryExpr>(lhs, rhs, operation->second, lhs->span());
  }

  return lhs;
}

I won’t include the rest of the methods as for full C these become quite bloated with edge cases.

3. Statements

Statements in C can be loops, switch statements, or generally the main language concepts. They can also be declarations and expressions.

Parsing these statements is generally the easier part of parsing C source code. It follows the same rules as Java for example, with a few exceptions here and there. The main logic is the same, so the parser branches on the keywords it sees and parses the corresponding statement. If no keyword is found, the parser either treats it as a typename (which becomes a declaration statement) or tries to parse an expression.

A key issue is disambiguating between declarations and expressions. The parser can’t locally decide whether Foo * x introduces a pointer declaration or is an expression involving multiplication, since Foo may or may not be a type. This makes C grammar context-sensitive and a real pain to parse.

For this disambiguation I need to track declared typenames while parsing. I have done this using a scoped typedef table. This is also known as the “lexer hack” I mentioned earlier.

class TypedefTable {
  Scope* current_scope;
  arena::Arena& arena_;

 public:
  explicit TypedefTable(arena::Arena& arena) : arena_(arena) {
    current_scope = arena.create<Scope>(nullptr);
  }

  void push_scope() {
    Scope* parent = current_scope;
    auto* new_scope = arena_.create<Scope>(parent);
    current_scope = new_scope;
  }

  void pop_scope() {
    if (current_scope->parent != nullptr) {
      current_scope = current_scope->parent;
    }
  }

  void register_typedef(std::string_view typedef_, type::QualType type) {
    current_scope->typedefs.insert({typedef_, type});
  }

  [[nodiscard]] std::optional<type::QualType> get_typedef(std::string_view typedef_) const {
    return current_scope->lookup(typedef_);
  }
};

This is just a stack implementation pushing and popping scopes at compound statements for tracking typedefs. The parser then checks whether the current token is a decl_spec token (so a typename or type qualifiers or storage class) and if so parses a declaration.

ast::Stmt* Parser::parse_stmt() {
  if (is(TokenType::Return)) {
    return parse_return_stmt();
  }
  ...
  if (is_decl_spec()) {
    return parse_decl_stmt();
  }
  ast::ExprStmt* expr_stmt = parse_expr_stmt();
  if (expr_stmt == nullptr) {
    diag_.error(current_.span, "Expected expression").emit();
    synchronize_statement();
    return arena_.create<ast::ErrorStmt>(current_.span);
  }
  return expr_stmt;
}

4. Declarations

Parsing declarations is probably the hardest part of parsing C source code. The core concept of declarations is to be read and parsed inside-out rather than from left to right.

Declarations consist of the declaration specifier and the declarator.

4.1 Declaration Specifier

The declaration specifier holds information about the storage class of the declaration, the type qualifiers, and the base type of the declaration. A complex example would be

static const volatile long long int ...

This inherently means the declaration that is being parsed has the storage class static and the base type is a const volatile long long int.

4.2 Declarators

Declarators, on the other hand, describe how the base type is “wrapped”. An identifier is specified (in the case of a non-abstract declarator), and then the declaration is read from the identifier outwards to the type specifier.

int *(*fp(int))[10]

This then reads as

To tackle this, I don’t eagerly build a declaration AST node. Instead, I first look at how the declarator is structured and then build the type it annotates.

I chose to go down the Clang route, first constructing DeclaratorChunks which are then used to wrap the type.

I define structures for declaration chunks and the declarator. A declaration chunk can be one of three kinds: Pointer, Array, and Function.


struct DeclChunk {
  enum class Kind : uint8_t { Pointer, Array, Function } kind;

  type::Qualifiers pointer_quals{};

  std::optional<std::size_t> array_size;

  std::vector<type::QualType> param_types;
  bool is_variadic = false;
};

// Intermediate result from the declarator parser
struct RawDeclarator {
  std::optional<std::string_view> name;
  std::vector<DeclChunk> chunks;
};

The parser reads the declarator from left to right to determine the chunks. It starts by searching for a pointer token; if there is one, it recurses and pushes a pointer chunk into the declarator structure.

RawDeclarator Parser::parse_declarator() {
  if (matches(TokenType::Star)) {
    // ... (Parse the qualifiers for the pointer)
    RawDeclarator inner = parse_declarator();
    inner.chunks.push_back({.kind = DeclChunk::Kind::Pointer,
                            .pointer_quals = quals,
                            .array_size = 0,
                            .param_types = {}});
    return inner;
  }
  return parse_direct_declarator();
}

If there is no pointer, the parser tries to parse the “direct” declarator, looking for an identifier and other declarator structures.


RawDeclarator Parser::parse_direct_declarator() {
  RawDeclarator result;

  if (is(TokenType::Identifier)) {
    result.name = current_.lexeme;
    advance();
  } else if (matches(TokenType::LParen)) {
    result = parse_declarator();
    expect(TokenType::RParen);
  }

  while (is(TokenType::LBracket) || is(TokenType::LParen)) {
    if (matches(TokenType::LBracket)) {
      // ... (Size handling)
      expect(TokenType::RBracket);
      result.chunks.push_back(
          {.kind = DeclChunk::Kind::Array, .array_size = size, .param_types = {}});
    } else if (matches(TokenType::LParen)) {
      std::vector<type::QualType> params;
      if (!is(TokenType::RParen)) {
        do {
          // ... Recursively parse declarators for parameters as those are also specifiers with declarators
          params.push_back(resolved_type);
        } while (matches(TokenType::Comma));
      }
      expect(TokenType::RParen);
      result.chunks.push_back(
          {.kind = DeclChunk::Kind::Function, .array_size = 0, .param_types = params});
    }
  }

  return result;
}

If the parser sees an identifier, it sets the declarator name and continues parsing the full declarator. Then it handles postfix declarator parts, which are arrays and functions. After building all of this, the parser has a RawDeclarator with a name and the chunks it is made out of. This structure is then used to wrap the base type from the declaration specifiers inside-out.


type::QualType Parser::resolve_declarator_type(type::QualType base, const RawDeclarator& decl) {
  type::QualType current = base;
  for (const auto& chunk : std::views::reverse(decl.chunks)) {
    switch (chunk.kind) {
      case DeclChunk::Kind::Pointer:
        current = type::QualType{types_.get_pointer(current), chunk.pointer_quals};
        break;
      case DeclChunk::Kind::Array:
        current = type::QualType{types_.get_array(current, *chunk.array_size)};
        break;
      case DeclChunk::Kind::Function:
        current = type::QualType{types_.get_function(
            {.return_type = current, .params = chunk.param_types, .variadic = false})};
        break;
    }
  }
  return current;
}

5. Closing Thoughts

If you walk away from this post with one thing, let it be this: C’s grammar is not context-free, and that single fact poisons everything.

Expression parsing is the easy part. Precedence climbing is a well-understood algorithm, and once your operator table is right, it mostly just works. The real difficulty begins the moment you need to answer the question “is this token the start of a declaration or an expression?” and that question haunts you at every level of the parser.

The declarator grammar is where C becomes torture toward parser writers. The inside-out reading order means you can’t just build an AST node as you go, you have to collect structural information first and then fold it back into a type after the fact. The DeclChunk approach I borrowed from Clang works, but it took me longer to get right than every other part of the parser combined.

There are things I didn’t cover here that made the implementation even messier in practice:

Struct and union declarations, which nest their own declarators and introduce new typenames mid-parse, or initializer lists with designated initializers, which look simple in the spec and are anything but.

Most compiler tutorials teach you to parse a clean, minimal language and then move on to the fun parts: type checking, optimization, codegen. That’s reasonable. But if you want to parse real C, you should know upfront that the parser will be the ugliest part of your compiler, and that’s fine. It’s ugly because the language is ugly in this specific way, and no amount of clever architecture will fully hide that. This is not because C is a bad language, but mostly because requirements for compilers were just different back in the day. If you want your parser to look good, choose another language to compile.

You can definitely try to use parser generators, but be warned: those will also need special care for the context-sensitive grammar and declarations in C.

I will release the full parser source code, once it is in a state where it can parse full C programs.

Comments