crawfish

simple programming language for the layman

Caution

The compiler can't compile crawfish programs yet.

Installation (Building from Source)

Dependencies

  • Rust Compiler
  • LLVM

Steps

  1. Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  2. Install llvm-config-20 and libpolly-20-dev
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
sudo ./llvm.sh 20
sudo apt install libpolly-20-dev
  3. Git clone the repository
git clone https://github.com/simontran7/crawfish.git
  4. cd into the crawfish/ directory, then build the project with Cargo
cd crawfish/
cargo build --release
  5. Move the target/release/crawfish binary to a desired location (e.g. in /Users/<your name>), then add it to your PATH by adding the following line to your .bashrc file
# in your .bashrc
export PATH=$PATH:<path to the crawfish compiler executable>

Usage

Crawfish compiler

Usage:
    crawfish [OPTIONS] <COMMAND> [ARGS...]

Command:
    compile <file>.crw            compile the current file

Options:
    -h, --help                    print possible commands
    -v, --version                 print compiler version

See docs/tour-of-crawfish.md for a tour of the language.

Architecture

Compilers typically follow one of two architectures:

  • Pipeline Architecture: Compilation is organized as a sequence of discrete phases, where the output of one phase becomes the input of the next. Data flows in one direction through the pipeline, and each phase has no knowledge of what came before or after it.
    • Streaming Architecture: Compilation stages communicate through function calls, processing data incrementally. No complete intermediate representation is ever fully materialized in memory. For instance, a token or AST node is passed forward and processed as soon as it is produced.
    • Materializing Architecture: Each compilation phase produces a complete data structure that is then consumed by the next phase in a clear, sequential flow. For example, the lexer fully produces a token list, which the parser then fully consumes to produce a complete AST, which the type checker then fully consumes, and so on.
  • Query-based Architecture: Rather than pushing data through a fixed sequence of phases, computations are expressed as queries that pull the information they need on demand. Each query (e.g. "what is the type of this expression?") is answered by invoking other queries recursively, and results are memoized so that redundant work is avoided. This approach naturally supports incremental compilation — when a source file changes, only the queries whose inputs have changed need to be re-evaluated — and it eliminates the strict ordering constraint of the pipeline model, since any phase can request information from any other phase as needed.

Since it was my first time writing a compiler, I decided to go with the more conventional pipeline architecture, specifically the materializing variant. According to Carbon's documentation, the materializing architecture is generally superior for two key reasons:

  1. clarity: each phase has explicit input and output structures, making data flow transparent rather than obscured through function calls
  2. performance: it exhibits better cache locality by keeping related data together, avoiding the memory fragmentation inherent in the streaming architecture.

Consider the lexing stage: in a batched approach, the lex() method allocates a token buffer (typically a dynamic array) and repeatedly calls next_token(), appending results sequentially. This design leverages modern CPU architecture effectively (source):

  • I-cache locality: Tight loops reduce instruction cache misses
  • D-cache prefetching: Sequential access patterns enable effective hardware prefetching when data fits in cache
  • Branch prediction: Predictable, localized control flow improves branch predictor accuracy
  • Meta-compilation optimizations: The host compiler can inline helpers and unroll loops in the lexing code itself

Beyond using a pipeline-style architecture, each compilation pass should accept an input and produce a new data structure as output, rather than modifying the input in place. This immutable approach offers several advantages (source):

  • Correctness: Incomplete construction becomes impossible (e.g., nodes must be created with all required fields), which allows the compiler to catch missing data at compile-time rather than runtime.
  • Debuggability: Since immutability provides a complete audit trail of what happened at each compilation stage, it's easier to trace how and where the error arose.
  • Compactness: Each data structure is tailored to its specific purpose and may omit unnecessary information, leading to compact data structures. Compactness improves cache performance, and simplifies unit testing.

Overall, the crawfish compiler has the following stages:

flowchart TD
  A[Source] -->|Lexical Analysis I| B[List of tokens]
  B -->|Lexical Analysis II| C[List of token trees]
  C -->|Syntactic Analysis| D[Abstract Syntax Tree]
  D -->|Semantic Analysis| E[High-level Intermediate Representation]
  E -->|IR Lowering| F[Mid-level Intermediate Representation]
  F -->|LLVM IR generation| G[LLVM IR]
  G -->|LLVM codegen| H[Machine Code]

Diagnostics

Diagnostics, rendered using the ariadne crate, are collected in a list during each compiler stage. At the end of each stage, if there are no errors, the compiler proceeds to the next stage. If there are errors, the diagnostics are emitted and the compiler stops (source). This is to prevent cascading effects from earlier stages polluting later ones.

Stage 1: Lexical Analysis

Lexical analysis is the first stage of the compiler pipeline. It is the stage that transforms raw source code into a stream of meaningful units called tokens.

Tokens

Each token consists of two fields:

  • kind: TokenKind: indicates the token type (e.g., identifier, literal, operator)
  • span: Span: a pair of 4-byte positions marking the start and exclusive end in the source code

Since only two token kinds carry an associated lexeme (identifiers and literals), we forgo a lexeme field and instead attach a handle (to the lexeme stored in the interner) as a payload on the kind.
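A sketch of this token layout (the names and variants here are illustrative, not the actual crawfish definitions):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: u32, // inclusive byte offset into the source
    end: u32,   // exclusive byte offset
}

// Handle into the string interner.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Symbol(u32);

#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    // Kinds with a lexeme carry an interner handle as payload.
    Identifier(Symbol),
    IntLiteral(Symbol),
    // Kinds without a lexeme carry no payload.
    Plus,
    OpenParen,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Token {
    kind: TokenKind,
    span: Span,
}
```

Because the payload rides on the kind, a token stays small and `Copy`, and no token ever stores the lexeme text itself.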

Token Trees

Traditional lexers often emit a list of tokens. However, inspired by Rust's lexical analysis, crawfish's lexer emits a list of token trees instead.

A token tree is a data structure constructed from the list of tokens. Each node is either:

  • A leaf node: a single token
  • A subtree: delimited groups (i.e., (...), [...], or {...}), containing more token trees and the delimiters at the root

For example, for the following source code:

func foo(x : Int) {
    let y = x + 1;
}

Traditionally, lexical analysis involves a lexer outputting a list of tokens from the source code:

[func, foo, (, x, :, Int, ), {, let, y, =, x, +, 1, ;, }]

Instead, our lexical analysis stage involves an additional step: the list of tokens produced by the lexer is transformed into a list of token trees, which is the final output passed on to the parser:

<root>
├── func
├── foo
├── ()
│   ├── x
│   ├── :
│   └── Int
└── {}
    ├── let
    ├── y
    ├── =
    ├── x
    ├── +
    ├── 1
    └── ;

and in code:

vec![
    TokenTree::Token(func),
    TokenTree::Token(foo),
    TokenTree::Delimited {
        open: Token(OpenParen, ...),
        close: Token(CloseParen, ...),
        inner: vec![
            TokenTree::Token(x),
            TokenTree::Token(:),
            TokenTree::Token(Int),
        ],
        span: ...,
    },
    TokenTree::Delimited {
        open: Token(OpenBrace, ...),
        close: Token(CloseBrace, ...),
        inner: vec![
            TokenTree::Token(let),
            TokenTree::Token(y),
            TokenTree::Token(=),
            TokenTree::Token(x),
            TokenTree::Token(+),
            TokenTree::Token(1),
            TokenTree::Token(;),
        ],
        span: ...,
    },
]

Once more, u/matthieum outlines that creating token trees validates delimiter balancing before the actual parsing. For instance, a block expression with a { but no closing } can be caught during token tree construction. Creating token trees also gives you a critical property: errors are bounded. If there's a syntax error inside a block expression { ... }, the braces essentially act as a firewall: the syntax error cannot leak out and confuse parsing of the surrounding code, because we can sync to the closing delimiter with confidence by simply moving past the whole Delimited token tree it's in.

fn foo() { ... }

fn bar() {
    let x = @#$%^&   // garbage here, but it's contained
}

fn baz() { ... }  // this still parses fine!

Note

For a nice visual of a token tree, see this.

Tokenizer

The tokenizer is hand-written, and is conceptually a deterministic finite automaton (DFA) that recognizes a regular language. Lexer generators and regular expressions are convenient, but they don't offer as much flexibility as a hand-written tokenizer, and a generator is another dependency to maintain.

Structurally, the lexer is a simple struct with two fields:

  • source: a string slice of the user's source code, useful for accurate span tracking and error messages
  • cursor: an iterator over the characters of the source (Chars<'a>)

The public API exposes two key methods:

impl<'a> Lexer<'a> {
    /// Creates a new lexer from the given source code.
    pub fn new(source: &'a str) -> Lexer<'a>;

    /// Tokenizes the entire input source and returns a list of tokens.
    pub fn lex(&mut self) -> Vec<Token>;
}
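The loop behind such an API can be sketched as follows. This is a toy DFA-style tokenizer over a hypothetical token set (identifiers and +), not crawfish's actual lexer:

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Ident(String),
    Plus,
}

fn tokenize(source: &str) -> Vec<Tok> {
    let mut tokens = Vec::new();
    let mut cursor = source.chars().peekable();
    while let Some(&c) = cursor.peek() {
        match c {
            // skip whitespace between tokens
            c if c.is_whitespace() => {
                cursor.next();
            }
            '+' => {
                cursor.next();
                tokens.push(Tok::Plus);
            }
            // identifier "state": consume while alphanumeric or '_'
            c if c.is_alphabetic() => {
                let mut ident = String::new();
                while let Some(&c) = cursor.peek() {
                    if c.is_alphanumeric() || c == '_' {
                        ident.push(c);
                        cursor.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Tok::Ident(ident));
            }
            // a real lexer would emit an error token here (see Error Recovery)
            _ => {
                cursor.next();
            }
        }
    }
    tokens
}
```

Each match arm corresponds to a DFA state entered on the first character of a token, which is exactly the structure the batched lex() loop exploits for branch prediction.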

Token Tree Parser

The token tree parser is responsible for turning the list of tokens produced from the lexer into a list of token trees.

Its API is as follows:

impl<'a> TokenTreeParser<'a> {
    /// Creates and returns an instance of `TokenTreeParser`.
    pub fn new(source: &'a str) -> TokenTreeParser<'a>;

    /// Parses the tokens into token trees, returning any delimiter errors encountered.
    pub fn parse(&mut self) -> Result<Vec<TokenTree>, Vec<DelimiterError>>;
}

As we see, it's possible to encounter three types of delimiter errors (i.e., specific unbalanced delimiter cases) when building the token tree:

  • An opening delimiter was not closed.
  • A closing delimiter was found without a matching opening delimiter.
  • A closing delimiter did not match the expected opener.

According to u/matthieum's earlier comment, it's possible to build a resilient token tree parser (use indentation as a heuristic to guess where the "virtual" closing brace should be, recover, and continue parsing), but instead, we simply round up all the delimiter errors and return them if there are any.

Interning

Interning is a common memory optimization at the lexer stage. We typically intern identifiers and literals.

An interner has two methods:

impl Interner {
    /// Interns `string` and returns a string id.
    pub fn intern(&mut self, string: &str) -> Symbol;

    /// Resolves a string id back to its original string slice.
    pub fn resolve(&self, id: Symbol) -> Option<&str>;
}

By interning, our symbol tables' keys (used in semantic analysis) are now simply integers, which allows for $O(1)$ hashing (instead of $O(\text{string length})$) and fast equality checks for lookups (source).

Note

A non-thread-safe interner is much simpler to create (one HashMap plus one Vec) than a thread-safe one.
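A minimal sketch of such a non-thread-safe interner, using one HashMap for lookups and one Vec for resolution (names are illustrative, not the actual crawfish implementation):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Symbol(u32);

#[derive(Default)]
struct Interner {
    map: HashMap<String, Symbol>, // string → id, for deduplication
    strings: Vec<String>,         // id → string, for resolution
}

impl Interner {
    fn intern(&mut self, string: &str) -> Symbol {
        if let Some(&sym) = self.map.get(string) {
            return sym; // already interned: reuse the existing id
        }
        let sym = Symbol(self.strings.len() as u32);
        self.strings.push(string.to_owned());
        self.map.insert(string.to_owned(), sym);
        sym
    }

    fn resolve(&self, id: Symbol) -> Option<&str> {
        self.strings.get(id.0 as usize).map(String::as_str)
    }
}
```

Interning the same string twice returns the same Symbol, which is what makes symbol comparison a plain integer comparison later on.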

Error Recovery

To create a resilient lexer, as explained by matklad, the simplest approach is to produce an error token whenever we encounter an invalid token.

Stage 2: Syntactic Analysis

Syntactic analysis is the second stage, where the parser consumes the tokens and imposes structure by producing a syntax tree. This syntax tree may be a concrete syntax tree (CST), which corresponds to a context-free grammar, or, more commonly, an abstract syntax tree (AST), which corresponds to an abstract grammar.

Abstract Syntax Tree

Credits to u/matthieum for the AST's design (Comment 1, Comment 2)

Rather than employing a conventional object-oriented arena allocator and tree-of-pointers design (where each node requires heap allocation and 64-bit pointers), I opted for a Structure of Arrays (SoA) layout, where each field corresponds to a list of a specific node type (e.g., one field for if statements, another for integer literals, etc.). To point to other nodes, we use 32-bit handles rather than raw pointers. Since the compiler must know which array a handle refers to, the node type is encoded into the high-order bits of the handle, while we allocate the remaining lower bits for the actual index:

untyped handle = 00100 000 0000 0000 0000 0000 0000 0101
                 └─┬─┘ └──────────────┬────────────────┘
           kind (5 bits)       index (27 bits)

         kind  = 00100 → BinaryOp
         index = 5
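The bit packing this implies can be sketched as follows, assuming the 5-bit kind / 27-bit index split shown above (the concrete kind values are hypothetical):

```rust
const INDEX_BITS: u32 = 27;
const INDEX_MASK: u32 = (1 << INDEX_BITS) - 1; // low 27 bits

/// Packs a node kind and an array index into a single 32-bit handle.
fn pack(kind: u32, index: u32) -> u32 {
    debug_assert!(kind < 32, "kind must fit in 5 bits");
    debug_assert!(index <= INDEX_MASK, "index must fit in 27 bits");
    (kind << INDEX_BITS) | index
}

/// Splits a handle back into (kind, index).
fn unpack(handle: u32) -> (u32, u32) {
    (handle >> INDEX_BITS, handle & INDEX_MASK)
}
```

With kind = 0b00100 and index = 5, pack produces 0x2000_0005, matching the bit diagram above.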

This SoA layout has huge performance gains, while also conveniently enabling as-is serialization for fast incremental builds (if a source file hasn't changed, the compiler can simply remap its pre-parsed AST). Specifically, the AST can be dumped directly to disk and later reloaded via mmap() without reallocation or pointer patching.

For the different AST node types, I designed them based on common practices I've observed (where AST Explorer proved invaluable). Every crawfish source code file contains a SourceFileNode which serves as the "root" of the tree. It has a field named body, which is a list of node handles to items at the top level (e.g., function definitions, or constant definitions). Other nodes include LetStatementNode, IdentifierNode, IntLiteralNode, etc. Some nodes may reference other nodes by storing handles in their fields. For example, a LetStatementNode has two notable fields: name and value, where name stores a handle to a node of type IdentifierNode. Additionally, every node shares a span field, which is identical to the span found in tokens.

The AST's API contains insertion methods to create nodes, returning a node handle. There are also getters which return the nodes, given some handle. It also contains a dump() method, which returns a heap-allocated string for pretty printing.

Parser

The parsing state is controlled by the struct Parser, which manages it through a few core fields:

  • cursor: an object to iterate through a slice of token trees
  • errors: a list of syntactic diagnostics
  • interner: a reference to the global string interner
  • ast: an instance of Ast produced by the parser

The public API also exposes two core methods:

impl<'a> Parser<'a> {
    /// Creates a new parser from the list of token trees produced by the token tree parser.
    pub fn new(source: &'a str, token_trees: &'a [TokenTree], interner: &'a StringInterner) -> Self;

    /// Parses and returns an abstract syntax tree and a list of errors encountered during parsing.
    pub fn parse(mut self) -> Result<Ast, Vec<SyntacticDiagnostic>>;
}

Recursive Descent Parsing

Like the lexer, the parser is hand-written (conceptually a pushdown automaton recognizing a context-free language). Specifically, it is a recursive descent parser with Pratt parsing for expressions—a combination that excels at error recovery and producing clear diagnostics.

Recursive descent parsing is a top-down parsing technique that constructs a parse tree by starting from the root and working downward toward the leaves. It maps every non-terminal in a BNF grammar to a concrete parse_<non-terminal>() method. As a recap from theory of computation: a non-terminal is a symbol representing a syntactic category that can be replaced by a sequence of other symbols, while a terminal is a fundamental, indivisible symbol that constitutes the language being defined. Essentially, recursive descent parsing translates the grammar's rules into imperative code (credits to Crafting Interpreters for this table):

Grammar notation    Code representation
Terminal            Code to match and consume a token
Nonterminal         Call to that rule's function
|                   if or switch statement
* or +              while or for loop
?                   if statement

The parse() method is the star of the show. It iterates through the list of token trees produced by the lexical analysis stage and carefully checks for tokens that may indicate the beginning of a valid top-level declaration. If it finds one, it calls the corresponding parse method.

Note

I learned from this article a helpful trick that keeps the code uniform: ensure that the parser consumes tokens where they belong, not where they're recognized. For example, we recognize the func keyword in parse_top_level_item(), but it gets consumed in parse_function_definition() because func is part of the function definition's syntax. Recognition and ownership are separate concerns!

Pratt Parsing

Recursive descent parsing works remarkably well for parsing statements and declarations, but less so for expressions. This is because parsing expressions is tricky to get right, as expressions may be ambiguous. The parser must parse according to the language's operator precedence (which determines which operations are performed first when you have multiple different operators) and operator associativity (which determines the direction of evaluation when you have multiple operators of the same precedence level). For instance, consider the expression $8 - 4 - 2$: with left associativity, it becomes $(8 - 4) - 2 = 2$, but with right associativity, it becomes $8 - (4 - 2) = 6$. In programming languages, most arithmetic operators are left-associative (addition, subtraction, multiplication, division), while assignment and exponentiation are typically right-associative.

To cleanly parse expressions, we can use a clever technique called Top-down Operator Parsing, also known as Pratt Parsing. At its core, Pratt parsing assigns each operator an integer called a binding power. Operator precedence and binding power follow a positive linear relationship: the higher the precedence, the higher the binding power. Each operator has a left binding power and a right binding power, where the left binding power is used to bind operands on the operator's left. The goal is that when an operand has an operator on its left and/or right side, it "binds" to the operator with the higher binding power, creating the effect of operator precedence.

To determine expression groupings, we first assign each operator its binding powers according to the rules above (and treat the empty "spots" beside the outermost operands as having binding power 0). Then, each operand "moves" toward whichever adjacent operator has the higher binding power. Every resulting grouping becomes a tree, which then acts as an "operand" and undergoes the same process, until we are left with a single expression tree.

However, when an operand has operators on its left and right with equal binding power, the question arises of which operator it should bind to. This is where operator associativity comes into play, and why each operator gets two binding powers. To make an operator left-associative, we make its right binding power slightly higher than its left binding power; to make it right-associative, we make its left binding power slightly higher than its right. This gives the syntax deterministic operator associativity. In essence, Pratt parsing binds operands to their operators from strongest binding power to weakest.

For example, the following depicts the relationship between operator precedence and binding power (from low to high) for the basic arithmetic operators:

precedence       operator    binding power
  high             *, /         (3, 4)

  low              +, -         (1, 2)

To change the default precedence, we enclose subexpressions in parentheses to ensure they are evaluated first.

Implementation-wise, Pratt parsing occurs in the parse_expression() method and makes use of two types of helper methods:

  • nud() (aka null denotation): applies to a token when there is no unclaimed parsed expression to the left. Notably, it handles identifiers, literals, prefix operators, and grouped expressions.
  • led() (aka left denotation): applies to a token when there is an unclaimed parsed expression to its left. Notably, it handles infix and postfix expressions.

This means that if nothing has been parsed yet (i.e., no "left" expression), the parser uses the nud() method, and if something has already been parsed (i.e., there is a "left" expression), the parser uses the led() function. For example, in the expression 5 + 3:

  • 5 uses nud() since it's a literal with nothing to the left
  • + uses led() (it's an infix operator with 5 to its left)
  • 3 uses nud() (it's the right operand)

How do we use parse_expression()? Within another parse_<non-terminal>() method that may be composed of expressions, we call parse_expression(0), where 0 represents the right binding power of an "empty" operator as discussed above. Setting it to 0 ensures that parse_expression() will always start parsing. This makes sense because we can imagine the beginning of an expression as having no operator to its left:

expression:            a     *      b     +      c
binding power: (0, 0)  N/A  (3, 4)  N/A  (1, 2)  N/A (0, 0)

For infix operators, the goal is to continue recursing until we find an operator where the current left binding power is no longer higher than the previous right binding power.
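The binding-power loop above can be sketched as follows. This is an illustrative evaluator (not crawfish's actual parser), assuming single-digit operands with the binding powers from the table above: (1, 2) for + and -, (3, 4) for * and /. It evaluates directly instead of building an AST, purely to show the recursion:

```rust
use std::iter::Peekable;
use std::str::Chars;

fn infix_binding_power(op: char) -> Option<(u8, u8)> {
    match op {
        '+' | '-' => Some((1, 2)),
        '*' | '/' => Some((3, 4)),
        _ => None,
    }
}

fn parse_expr(tokens: &mut Peekable<Chars<'_>>, min_bp: u8) -> i64 {
    // nud: an operand with no unclaimed expression to its left
    let mut lhs = tokens.next().unwrap().to_digit(10).unwrap() as i64;
    // led: keep folding infix operators while their left binding power
    // is at least the right binding power handed down to us
    while let Some(&op) = tokens.peek() {
        let Some((left_bp, right_bp)) = infix_binding_power(op) else { break };
        if left_bp < min_bp {
            break; // the previous operator binds tighter: return to it
        }
        tokens.next(); // consume the operator where it belongs
        let rhs = parse_expr(tokens, right_bp);
        lhs = match op {
            '+' => lhs + rhs,
            '-' => lhs - rhs,
            '*' => lhs * rhs,
            '/' => lhs / rhs,
            _ => unreachable!(),
        };
    }
    lhs
}

fn eval(src: &str) -> i64 {
    parse_expr(&mut src.chars().peekable(), 0)
}
```

Because - has binding powers (1, 2), the recursive call for its right operand uses min_bp = 2, which stops at the next -, yielding left associativity: 8-4-2 groups as (8 - 4) - 2.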

Note

Top-down operator precedence (a.k.a. Pratt parsing) is similar to precedence climbing and the Shunting Yard. Pratt differs from precedence climbing in that the latter uses a precedence table while the former uses explicit binding powers. The Shunting Yard algorithm differs from Pratt parsing through the use of an explicit stack, rather than the implicit call stack used in Pratt parsing.

Error Recovery

Whenever programmers write code, it often contains syntax errors (mistakes in the code's structure). However, many parsers stop immediately when they find the first syntax error. This means the compiler only tells you about one error at a time. You have to fix that one error, then recompile to find the next one, which makes finding all errors slow and frustrating. Instead, a good parser should try to find as many syntax errors as possible in one run. To do this, the parser must recover from errors instead of halting, and always emit an AST. We call this sort of parser a resilient parser.

To produce a resilient parser, there are two common error recovery techniques: synchronization and error nodes. Synchronization means that whenever the parser encounters a syntax error, it enters a panic state and attempts to return to a stable state where it knows how to parse, by looking for specific synchronization points:

  • Delimiters: semicolons (;), closing parentheses ()), or closing braces (})
  • Keywords that start new constructs: struct, func, if, while, for, return, let, etc.

Without synchronizing, one syntax error would cause the parser to misinterpret subsequent valid code as erroneous, creating cascading errors.

Additionally, the parser should insert error nodes into the AST. Error nodes are placeholder nodes marking where the error occurred; they allow the current parse to complete and accumulate multiple errors in one pass, rather than bailing at the first failure. The parser should also aim to create only the smallest possible error node that allows it to continue parsing correctly. This includes creating error nodes for missing "connective" tokens (i.e., =, :, ->), since we cannot reliably infer the user's intent there, so it's best for the parser to create an error node and move on.

However, we don't emit error nodes for missing delimiters; instead, we recover by inserting a synthetic token (i.e., the parser assumes the token is there and trucks along), while of course also emitting a diagnostic.

Stage 3: Semantic Analysis

Semantic Analysis involves traversing the AST starting at the root node, and constructing a high-level intermediate representation (HIR). Specifically, semantic analysis involves three phases:

  • Name Resolution: resolve identifiers at use-sites to their bindings
  • Type Checking: verify type correctness of expressions and statements
  • Desugaring: As the HIR is constructed, certain AST constructs are desugared along the way.

Symbol Table

An identifier's scope is the part of a program where it's accessible. An identifier may refer to different values in different parts of the program. Crawfish particularly has static scope, which means the visibility and accessibility of an identifier are determined by its physical location within the source code, at compile-time.

The symbol table is an essential data structure which maps identifiers to their semantic information. It is typically designed as a stack of scopes, where each scope is represented as a single hash map HashMap<Symbol, SemanticInfo>. When entering a scope, the semantic analyzer pushes a new scope frame (i.e., a new hash map) onto the stack; when exiting a scope, it pops the topmost hash map off the stack. To add a symbol to the current scope, the semantic analyzer adds an entry to the topmost hash map, and to look up a name, it searches from the topmost hash map down to the bottom of the stack so that it finds the first binding that isn't shadowed (source). However, name lookup works differently around nested function definitions, because nested non-closure functions in crawfish do not capture the surrounding environment (unlike closures). In other words, a nested func cannot access the parameters or local bindings of its enclosing function.

// Example: nested function that tries to access outer variable
func outer(x: i32) -> i32 {
    let y = 10;

    func inner() -> i32 {  // <- Push RibKind::Item here
        x + y  // Error: can't see x or y (they're locals from outer)
    }

    inner()
}

This means that scope frames are not always "transparent" (i.e., the semantic analyzer can see outer names). As such, each scope frame has an associated ScopeKind to be used as a flag that tells lookup when to start filtering out locals. Most scopes are ScopeKind::Normal, which behave like ordinary lexical scopes. However, when the semantic analyzer enters a nested function item, it pushes a ScopeKind::Item scope, which acts as an item boundary. During lookup, once resolution crosses an Item scope, outer local bindings (BindingId::Local) are no longer visible, while item bindings (BindingId::Item) remain visible.
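The lookup rule can be sketched roughly as follows; ScopeKind, Binding, and the frame layout are simplified stand-ins for the actual crawfish types:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum ScopeKind {
    Normal, // ordinary lexical scope
    Item,   // item boundary (e.g. a nested `func`)
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Binding {
    Local, // parameter or `let` binding
    Item,  // function or other item definition
}

struct Scope {
    kind: ScopeKind,
    names: HashMap<&'static str, Binding>,
}

/// Searches from the innermost (topmost) scope outward. Once lookup
/// crosses an `Item` boundary, outer `Local` bindings become invisible,
/// while `Item` bindings stay visible.
fn lookup(scopes: &[Scope], name: &str) -> Option<Binding> {
    let mut crossed_item_boundary = false;
    for scope in scopes.iter().rev() {
        if let Some(&binding) = scope.names.get(name) {
            if !(crossed_item_boundary && binding == Binding::Local) {
                return Some(binding);
            }
        }
        if scope.kind == ScopeKind::Item {
            crossed_item_boundary = true;
        }
    }
    None
}
```

In the earlier example, looking up x or y from inside inner() crosses the Item scope and fails, while another top-level func would still resolve.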

Note

Since most code doesn't nest very deeply, we may further optimize the symbol table (specifically, avoiding allocation churn) by pre-allocating around 4 to 8 empty hash maps and reusing cleared hash maps instead of allocating and dropping them (source).

Type System

Crawfish is statically typed (all types resolved at compile time) and strongly typed (no implicit coercions between incompatible types).

Crawfish notably has full Algebraic Data Types (ADTs):

  • Product types: struct, both named or tuple structs
  • Sum types: enum, where each variant can carry different data, making it a tagged union.

Subtyping is a semantic relationship between types: A is a subtype of B if A can be used wherever B is expected. There are two kinds of subtyping mechanisms:

  • Nominal Typing: identity by name. Two types are compatible only if they explicitly declare a relationship (same name, or one explicitly implements/extends the other).
class Dog { bark() {} }
class Cat { bark() {} }  // also has bark

function makeNoise(d: Dog) {}
makeNoise(new Cat())  // Error (Cat is not Dog, even though it has the same shape)
  • Structural Typing: identity by shape. Two types are compatible if they have the same structure, regardless of name. Duck typing (with the analogy "If it walks like a duck and quacks like a duck, it's a duck") is the runtime analog of structural typing (i.e., compatibility is checked at call time rather than statically).
interface Barker {
    bark();
}

class Dog { bark() {} }
class Cat { bark() {} }

function makeNoise(b: Barker) {}
makeNoise(new Cat())  // Ok (Cat has bark(), so it satisfies Barker)

Crawfish's subtyping mechanism is nominal typing. As such, its mechanism for static dispatch is through trait bounds (monomorphization).

func foo[T: Bar](x: T) { ... }

The compiler generates a separate foo for every concrete T at compile time. There's no runtime cost, but code size grows with each instantiation. Dynamic dispatch is achieved through dyn Trait (vtable).

func foo(x: &dyn Bar) { ... }

The concrete type is erased. At runtime, x is a fat pointer: (data pointer, vtable pointer). The vtable holds function pointers for each method. Method calls go through an indirect function pointer dereference, which incurs a small, but nonzero runtime cost.
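The same two dispatch styles can be sketched in Rust (crawfish's syntax differs, but the mechanics described above are the same):

```rust
trait Bar {
    fn value(&self) -> i64;
}

struct Two;
impl Bar for Two {
    fn value(&self) -> i64 {
        2
    }
}

// Static dispatch: a separate `foo_static::<Two>` is monomorphized at
// compile time; the call is direct (and inlinable).
fn foo_static<T: Bar>(x: &T) -> i64 {
    x.value()
}

// Dynamic dispatch: `x` is a fat pointer (data pointer, vtable pointer),
// and the call goes through an indirect function pointer in the vtable.
fn foo_dyn(x: &dyn Bar) -> i64 {
    x.value()
}
```

The fat-pointer claim is directly observable: a &dyn Bar is twice the size of a plain reference, since it carries the vtable pointer alongside the data pointer.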

High-Level Intermediate Representation (HIR)

Our high-level intermediate representation is close to our AST, and should be able to support high-level optimizations such as inlining and constant folding (source).

There are two common approaches for high-level IRs (source):

  • Add a type field to AST nodes (i.e., mutating the AST)
  • Storing the type of every expression in a side table (hash map), where the keys are expressions, and the values, the type.

A third approach, similar to the first, is to produce an entirely new data structure during semantic analysis — a graph-based high-level intermediate representation where each node carries its resolved type directly, rather than annotating the original AST in place.

Unlike the AST, the HIR doesn't follow a SoA design because it's less ergonomic for semantic analysis due to the several passes involved (source).

Semantic Analyzer

The semantic analysis state is controlled by the struct SemanticAnalyzer, which manages the semantic analysis state with several fields:

  • ...
  • ...

The public API also exposes two core methods:

impl<'ast> SemanticAnalyzer<'ast> {
    /// Creates a new semantic analyzer from the AST produced by the parser.
    pub fn new(ast: &'ast Ast, string_interner: &'ast StringInterner) -> Self;

    /// Analyzes and returns an HIR and a list of errors encountered during semantic analysis.
    pub fn analyze(mut self) -> Result<Hir, Vec<SemanticDiagnostic>>;
}

The semantic analyzer is responsible for name resolution and type checking. It also involves a pre-pass that scans top-level item definitions to populate the symbol table, which allows top-level and block-level forward references (source, source 2 (page 23)).

Name resolution and type checking must be interleaved when resolution depends on types (source). For instance:

// Which `foo`? Depends on type of `x`
x.foo()

// Which `+`? Depends on types of operands (if you have overloading)
a + b

The semantic analyzer type checks using the bidirectional typing technique (popularized by Dunfield & Krishnaswami's paper Bidirectional Typing). The idea of bidirectional typing is that the checker "checks" certain expressions against a known type and "synthesizes" types for others (source):

  • Checking (outside-in): You arrive at an AST node already holding a type. Someone above you in the tree decided what type this expression needs to be, and your job is to confirm it. For instance, 7 + x checked against int, you don't need to figure out what type + produces, as you already know the result must be int, so you push that expectation down into the operands. Both 7 and x must be int because you're adding two things that need to produce an int. You check each leaf against that expectation.
  • Synthesis (inside-out): You arrive at a node with no type in hand. Your job is to produce one. For 7 + x, you start at the leaves: 7 synthesizes int, then you look up x in the symbol table and find its type. Now you reason about +: it requires both operands to match, so you verify they do, and then you return a type to whoever called you. The type of the whole expression bubbles up.

The concrete difference is the direction of information flow through your recursive calls:

  • In checking, the recursive call looks like check(subexpr, expected_type) (i.e., you're passing a type into the subtree).
  • In synthesis, the recursive call looks like ty = infer(subexpr) (i.e., you're receiving a type from the subtree).
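The two directions can be sketched for a toy expression language (the names and types here are illustrative, not crawfish's actual HIR):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int,
    Bool,
}

enum Expr {
    IntLit(i64),
    BoolLit(bool),
    Add(Box<Expr>, Box<Expr>),
}

// Synthesis (inside-out): produce a type and hand it up to the caller.
fn infer(e: &Expr) -> Result<Ty, String> {
    match e {
        Expr::IntLit(_) => Ok(Ty::Int),
        Expr::BoolLit(_) => Ok(Ty::Bool),
        // `+` requires int operands, so push `Int` down into each side.
        Expr::Add(l, r) => {
            check(l, Ty::Int)?;
            check(r, Ty::Int)?;
            Ok(Ty::Int)
        }
    }
}

// Checking (outside-in): arrive holding an expected type and confirm it.
fn check(e: &Expr, expected: Ty) -> Result<(), String> {
    let actual = infer(e)?; // fall back to synthesis, then compare
    if actual == expected {
        Ok(())
    } else {
        Err(format!("expected {expected:?}, found {actual:?}"))
    }
}
```

Note the signatures mirror the bullets above: check takes a type in, infer hands a type out.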

Checking is particularly helpful because it helps localize errors: the expected type is already known at the failing node, so the diagnostic can point at the exact subexpression that disagrees (source).

The semantic analyzer also performs unification-based type checking. This technique allows types to be deferred and inferred from wherever the necessary information eventually appears in the program. It works by introducing a unification variable ?a whenever a type is not yet known, then accumulating constraints on what ?a must be as more information becomes available. For example, if two types need to match but one or both are unknown, instead of failing immediately, the analyzer records an equality constraint — "?a must equal int", or "?a must equal ?b". As the program is traversed and constraints are collected, the analyzer builds up a substitution: a mapping from each variable to what it resolves to. Solving is therefore the process of finding a consistent substitution that satisfies all constraints simultaneously.
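A minimal sketch of this substitution machinery, assuming a toy type language with only int and unification variables (crawfish's real types are richer):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int,
    Var(usize), // unification variable ?n
}

struct Subst {
    // bindings[n] is what ?n currently resolves to (possibly another var)
    bindings: Vec<Option<Ty>>,
}

impl Subst {
    /// Introduces a fresh, unconstrained unification variable.
    fn fresh(&mut self) -> Ty {
        self.bindings.push(None);
        Ty::Var(self.bindings.len() - 1)
    }

    /// Follows variable bindings until reaching a concrete type or an
    /// unbound variable.
    fn find(&self, ty: Ty) -> Ty {
        match ty {
            Ty::Var(n) => match self.bindings[n] {
                Some(t) => self.find(t),
                None => ty,
            },
            _ => ty,
        }
    }

    /// Records the equality constraint `a = b`, failing on a concrete
    /// mismatch.
    fn unify(&mut self, a: Ty, b: Ty) -> Result<(), String> {
        match (self.find(a), self.find(b)) {
            (x, y) if x == y => Ok(()),
            (Ty::Var(n), t) | (t, Ty::Var(n)) => {
                self.bindings[n] = Some(t); // "?n must equal t"
                Ok(())
            }
            (x, y) => Err(format!("type mismatch: {x:?} vs {y:?}")),
        }
    }
}
```

Recording "?a must equal ?b" followed by "?b must equal int" makes find(?a) resolve to int, which is exactly the deferred inference described above.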

Stage 4: HIR Lowering

Rather than going from the HIR directly to LLVM IR, it makes sense to first lower the high-level intermediate representation to another intermediate representation (source).

Mid-level Intermediate Representation (MIR)

The mid-level intermediate representation (MIR) is a control-flow graph in static single-assignment (SSA) form.

Stage 5: Code generation (Codegen)

The code generation stage turns the MIR into LLVM IR (an IR resembling a typed assembly language with many annotations), which gets passed to LLVM. From there, LLVM performs optimizations on it, then emits machine code.

The Rust Compiler Development Guide outlines a few benefits to using LLVM.

LLVM IR
