A compiler + bytecode VM for Jzero — a small subset of Java — written entirely in Rust.
This project follows the book Build Your Own Programming Language (Edition 2) by Clinton L. Jeffery, replacing the original Java/Unicon tooling with idiomatic Rust equivalents.
Add to your Cargo.toml:
[dependencies]
jzero = "0.1.0"
Then use the Compiler API:
use jzero::Compiler;
let out = Compiler::new()
.source(r#"
public class hello {
public static void main(String argv[]) {
System.out.println("hello, jzero!");
}
}
"#)
.run(&[])
.unwrap();
println!("{}", out.stdout); // hello, jzero!
Jzero is a strict subset of Java designed for teaching compiler construction. Every valid Jzero program is also a valid Java program. It supports a minimal but complete set of features: classes, methods, control flow, basic types (int, double, bool, string), arrays, and simple I/O.
| Phase | Crate | Book Chapter | Status |
|---|---|---|---|
| Lexical Analysis | jzero-lexer | Ch. 3 | ✅ Done |
| Parsing (accept/reject) | jzero-parser | Ch. 4 | ✅ Done |
| Syntax Tree Construction | jzero-ast, jzero-parser | Ch. 5 | ✅ Done |
| Symbol Tables | jzero-symtab, jzero-semantic | Ch. 6 | ✅ Done |
| Type Checking (base types) | jzero-semantic | Ch. 7 | ✅ Done |
| Type Checking (arrays, methods, structs) | jzero-semantic | Ch. 8 | ✅ Done |
| Intermediate Code Generation | jzero-codegen | Ch. 9 | ✅ Done |
| Bytecode Generation | jzero-codegen | Ch. 13 | ✅ Done |
| Bytecode Interpreter / VM | jzero-vm | Ch. 12 | ✅ Done |
| Operators & Built-in Functions | jzero-codegen, jzero-vm | Ch. 15 | ✅ Done |
Chapters 10 (IDE/syntax coloring), 11 (transpiler), 14 (native code), 16 (domain control structures), and 17 (garbage collection) were intentionally skipped as detours unrelated to the core compiler pipeline or beyond the scope of Jzero.
jzero-rs/
├── Cargo.toml
├── crates/
│ ├── jzero/ # Public facade crate (published to crates.io)
│ ├── jzero-lexer/ # Lexical analysis (Logos)
│ ├── jzero-parser/ # Parsing & syntax tree construction (LALRPOP)
│ ├── jzero-ast/ # Syntax tree data structures & DOT output
│ ├── jzero-symtab/ # Symbol table types (SymTab, SymTabEntry, TypeInfo)
│ ├── jzero-semantic/ # Symbol table construction & type checking
│ ├── jzero-codegen/ # TAC + bytecode generation
│ ├── jzero-vm/ # Bytecode interpreter + string pool
│ └── jzero-cli/ # CLI tool (j0, not published)
└── tests/
└── examples/
├── hello.java # Minimal hello-world
├── hello_loop.java # Loop + array + I/O
├── concat.java # String concatenation
├── countdown.java # While loop + argv.length
├── fizzbuzz.java # Nested if/else + modulo + String.valueOf
└── greet.java # String concatenation + String.valueOf
Clean one-way dependency chain:
jzero-lexer → jzero-symtab → jzero-ast → jzero-parser → jzero-semantic → jzero-codegen → jzero-vm → jzero
| Original (Book) | Rust Equivalent | Notes |
|---|---|---|
| JFlex | Logos | Derive-macro based, zero-copy |
| BYACC/J | LALRPOP | LR(1) with lane table algorithm |
| tree.java | jzero-ast crate | Uniform Tree struct with DOT output |
| symtab.java | jzero-symtab crate | Rust enum-based type hierarchy |
| typeinfo.java + subclasses | TypeInfo enum | Base, Array, Method, Class variants |
| address.java + tac.java | jzero-codegen crate | Address enum + Tac struct |
| byc.java + j0machine.java | jzero-codegen + jzero-vm | Bytecode format + stack-machine VM |
| stringpool.java | StringPool in jzero-vm | Runtime string interning for concatenation |
# Build everything
cargo build
# Run all tests (single-threaded to avoid global ID counter races)
cargo test --workspace -- --test-threads=1
# Parse a file and visualize the syntax tree
cargo run --bin j0 -- tests/examples/hello.java --png
# Print TAC intermediate code (Chapter 9)
cargo run --bin j0 -- tests/examples/hello_loop.java --codegen
# Compile to bytecode and print assembler listing (Chapter 13)
cargo run --bin j0 -- tests/examples/hello_loop.java --bytecode
# Compile and execute in the VM (Chapters 12+13)
cargo run --bin j0 -- tests/examples/hello_loop.java --run a b c d e
# String concatenation example (Chapter 15)
cargo run --bin j0 -- tests/examples/concat.java --run
// tests/examples/hello_loop.java
public class hello_loop {
public static void main(String argv[]) {
int x;
x = argv.length;
x = x + 2;
while (x > 3) {
System.out.println("hello, jzero!");
x = x - 1;
}
}
}
Running j0 tests/examples/hello_loop.java --run a b c d e executes the program end-to-end:
hello, jzero!
hello, jzero!
hello, jzero!
hello, jzero!
no errors
// tests/examples/concat.java
public class concat {
public static void main(String argv[]) {
String s;
s = "hello, " + "jzero!";
System.out.println(s);
}
}
Running j0 tests/examples/concat.java --run produces:
hello, jzero!
no errors
A two-pass tree traversal builds one SymTab per scope:
- First pass registers all field and method signatures, enabling forward references within a class.
- Second pass walks method bodies to insert parameters and local variables.
- System.out.println is pre-registered in the global scope.
- Each Tree node gets its stab field set to the nearest enclosing scope (inherited, top-down).
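The two passes above can be sketched as follows. This is a minimal illustration, not the code in jzero-semantic: the Member enum, the Scope alias, the build_scopes function, and the use of plain strings for types are all hypothetical simplifications.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified class member; the real crates walk Tree nodes.
enum Member {
    Field { name: String, ty: String },
    Method {
        name: String,
        ret: String,
        params: Vec<(String, String)>, // (name, type)
        locals: Vec<(String, String)>, // (name, type)
    },
}

/// One scope mapping a name to a type description (a stand-in for SymTab).
type Scope = HashMap<String, String>;

/// Returns the class scope plus one scope per method.
fn build_scopes(members: &[Member]) -> (Scope, HashMap<String, Scope>) {
    let mut class_scope = Scope::new();

    // Pass 1: register every field and method signature, so that a method
    // body may refer to members declared later in the class.
    for m in members {
        match m {
            Member::Field { name, ty } => {
                class_scope.insert(name.clone(), ty.clone());
            }
            Member::Method { name, ret, params, .. } => {
                let sig: Vec<&str> = params.iter().map(|(_, t)| t.as_str()).collect();
                class_scope.insert(name.clone(), format!("({}) -> {}", sig.join(", "), ret));
            }
        }
    }

    // Pass 2: walk method bodies, inserting parameters and local variables.
    let mut method_scopes = HashMap::new();
    for m in members {
        if let Member::Method { name, ret, params, locals } = m {
            let mut scope = Scope::new();
            // The "return" dummy entry carries the declared return type.
            scope.insert("return".to_string(), ret.clone());
            for (n, t) in params.iter().chain(locals.iter()) {
                scope.insert(n.clone(), t.clone());
            }
            method_scopes.insert(name.clone(), scope);
        }
    }
    (class_scope, method_scopes)
}
```

Because pass 1 finishes before any body is visited, a method declared first can still resolve a field declared after it.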
Three cooperative functions — check_type, check_kids, check_types — perform a selective post-order traversal of method bodies.
Key design decisions:
- TypeInfo enum replaces the book's typeinfo class hierarchy: Base, Array, Method, Class variants with exhaustive matching.
- "return" dummy symbol: each method's stab gets a "return" entry with the declared return type. ReturnStmt nodes look it up directly.
- mkcls pass: after symbol tables are fully built, a dedicated pass constructs a complete ClassType for each class.
- String concatenation: String + String is allowed by the type checker and emits SADD in the TAC layer. String - String is rejected.
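A rough sketch of how an enum with these four variants supports exhaustive matching and the String + String rule. The variant payloads, the describe and check_binop functions, and the exact set of accepted operators are assumptions for illustration; the real TypeInfo in jzero-symtab may be shaped differently.

```rust
/// Hypothetical approximation of TypeInfo; payloads are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum TypeInfo {
    Base(String), // e.g. "int", "double", "String"
    Array(Box<TypeInfo>),
    Method { ret: Box<TypeInfo>, params: Vec<TypeInfo> },
    Class(String),
}

/// Exhaustive match with no wildcard arm: adding a fifth variant
/// would make this function fail to compile until it is handled.
fn describe(t: &TypeInfo) -> String {
    match t {
        TypeInfo::Base(name) => name.clone(),
        TypeInfo::Array(elem) => format!("{}[]", describe(elem)),
        TypeInfo::Method { ret, params } => {
            let ps: Vec<String> = params.iter().map(describe).collect();
            format!("({}) -> {}", ps.join(", "), describe(ret))
        }
        TypeInfo::Class(name) => format!("class {name}"),
    }
}

/// Result type of `lhs op rhs`, or an error message.
/// Assumption: both operands must have the same base type.
fn check_binop(op: char, lhs: &TypeInfo, rhs: &TypeInfo) -> Result<TypeInfo, String> {
    match (lhs, rhs) {
        (TypeInfo::Base(l), TypeInfo::Base(r)) if l == r => match (l.as_str(), op) {
            ("int" | "double", '+' | '-' | '*' | '/' | '%') => Ok(lhs.clone()),
            // Only concatenation is defined on strings.
            ("String", '+') => Ok(lhs.clone()),
            _ => Err(format!("operator {op} is not defined for {l}")),
        },
        _ => Err(format!(
            "incompatible operand types {} and {}",
            describe(lhs),
            describe(rhs)
        )),
    }
}
```

With this shape, check_binop('+', ...) on two String operands succeeds while check_binop('-', ...) on the same operands returns an error.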
Five passes over the typed syntax tree produce TAC:
assign_addresses → genfirst → genfollow → gentargets → gencode → emit
Each instruction is a fixed 8-byte word: [opcode:1][region:1][operand:6, little-endian].
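As an illustration of that word layout, the sketch below packs and unpacks one instruction. The function names encode and decode are hypothetical, and treating the 6-byte operand as a sign-extended 48-bit integer is an assumption; the text only states that the operand is little-endian.

```rust
/// Pack one instruction into an 8-byte word:
/// [opcode:1][region:1][operand:6, little-endian].
fn encode(opcode: u8, region: u8, operand: i64) -> [u8; 8] {
    let mut word = [0u8; 8];
    word[0] = opcode;
    word[1] = region;
    // Keep only the low 48 bits of the operand.
    word[2..8].copy_from_slice(&operand.to_le_bytes()[..6]);
    word
}

/// Unpack a word into (opcode, region, operand).
fn decode(word: [u8; 8]) -> (u8, u8, i64) {
    let mut buf = [0u8; 8];
    buf[..6].copy_from_slice(&word[2..8]);
    // Assumption: operands are signed, so sign-extend from 48 bits
    // by shifting up and arithmetically shifting back down.
    let operand = (i64::from_le_bytes(buf) << 16) >> 16;
    (word[0], word[1], operand)
}
```

For example, encode(9, 2, 16) yields the bytes 09 02 10 00 00 00 00 00, and decode returns (9, 2, 16) for that word.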
The .j0 binary file layout:
Word 0: magic "Jzero!!\0"
Word 1: version "1.0\0\0\0\0\0"
Word 2: first-instruction word offset
Word 3…: data section (string literals, NUL-terminated)
Word N…: startup sequence + instructions
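A minimal sketch of reading the three header words from such an image. The Header struct and parse_header function are hypothetical names, and decoding word 2 as a plain little-endian u64 is an assumption; the layout above does not state how that word is encoded.

```rust
/// Parsed header of a .j0 image (hypothetical struct).
#[derive(Debug, PartialEq)]
struct Header {
    version: String,
    first_instruction_word: u64,
}

fn parse_header(image: &[u8]) -> Result<Header, String> {
    if image.len() < 24 {
        return Err("image is shorter than the 3-word header".to_string());
    }
    // Word 0: magic bytes.
    if !image.starts_with(b"Jzero!!\0") {
        return Err("bad magic".to_string());
    }
    // Word 1: version string, NUL-padded to 8 bytes.
    let version: String = image[8..16]
        .iter()
        .take_while(|&&b| b != 0)
        .map(|&b| b as char)
        .collect();
    // Word 2: word offset of the first instruction.
    // Assumption: stored as a little-endian 64-bit integer.
    let mut w = [0u8; 8];
    w.copy_from_slice(&image[16..24]);
    Ok(Header {
        version,
        first_instruction_word: u64::from_le_bytes(w),
    })
}
```

Under the same assumption, the byte position of the first instruction would be first_instruction_word * 8.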
The startup sequence calls main with the real argc (number of CLI arguments passed after --run), so argv.length behaves correctly without hardcoding.
HALT NOOP ADD SUB MUL DIV MOD NEG PUSH POP CALL RETURN GOTO BIF
LT LE GT GE EQ NEQ LOCAL LOAD STORE
SPUSH SPOP SADD ITOS
SPUSH, SPOP, SADD were added in Chapter 15 for string operations.
ITOS converts an integer to a string pool key (String.valueOf).
| TAC | Bytecode |
|---|---|
| ADD op1,op2,op3 | PUSH op2, PUSH op3, ADD, POP op1 |
| ASN op1,op2 | PUSH op2, POP op1 |
| BGT label,op2,op3 | PUSH op2, PUSH op3, GT, BIF label |
| PARM arg + CALL fn,n | PUSH fn_addr, PUSH arg, …, CALL n |
| Proc (method entry) | LOCAL n (pre-allocates local slots) |
| SADD op1,op2,op3 | SPUSH op2, SPUSH op3, SADD, SPOP op1 |
| String.valueOf(x) | ITOS op1,op2 (int → string pool key) |
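Four rows of the mapping above, written as a lowering function. The Tac and Byc enums here are hypothetical stand-ins that use plain strings for operands; the real jzero-codegen types use an Address enum and a Tac struct instead.

```rust
/// Hypothetical three-address instructions (operands as plain strings).
#[derive(Debug, Clone, PartialEq)]
enum Tac {
    Add(String, String, String),  // ADD op1,op2,op3
    Asn(String, String),          // ASN op1,op2
    Bgt(String, String, String),  // BGT label,op2,op3
    Sadd(String, String, String), // SADD op1,op2,op3
}

/// Hypothetical stack-machine instructions.
#[derive(Debug, Clone, PartialEq)]
enum Byc {
    Push(String),
    Pop(String),
    Add,
    Gt,
    Bif(String),
    Spush(String),
    Spop(String),
    Sadd,
}

/// Expand one TAC instruction into its bytecode sequence.
fn lower(t: &Tac) -> Vec<Byc> {
    match t {
        Tac::Add(dst, a, b) => vec![
            Byc::Push(a.clone()),
            Byc::Push(b.clone()),
            Byc::Add,
            Byc::Pop(dst.clone()),
        ],
        Tac::Asn(dst, src) => vec![Byc::Push(src.clone()), Byc::Pop(dst.clone())],
        Tac::Bgt(label, a, b) => vec![
            Byc::Push(a.clone()),
            Byc::Push(b.clone()),
            Byc::Gt,
            Byc::Bif(label.clone()),
        ],
        Tac::Sadd(dst, a, b) => vec![
            Byc::Spush(a.clone()),
            Byc::Spush(b.clone()),
            Byc::Sadd,
            Byc::Spop(dst.clone()),
        ],
    }
}
```

Each three-address instruction becomes a push of its sources, one stack operation, and a pop into its destination, which is why the bytecode listing is longer than the TAC listing.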
Label resolution is two-pass: pass 1 records byte offsets, pass 2 patches branch targets. All GOTO/BIF targets are relocated by code_base_bytes to be absolute offsets from word 0.
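The two-pass scheme can be sketched like this. The Item enum and resolve function are hypothetical, and the sketch assumes that labels occupy no space in the output while every other item is one 8-byte word.

```rust
use std::collections::HashMap;

/// Hypothetical assembler input: labels interleaved with instructions.
enum Item {
    Label(String),
    Goto(String),
    Bif(String),
    Other, // any non-branch instruction
}

/// For each emitted instruction, returns Some(absolute byte target)
/// for branches and None for everything else.
fn resolve(items: &[Item], code_base_bytes: u64) -> Result<Vec<Option<u64>>, String> {
    const WORD: u64 = 8;

    // Pass 1: record the byte offset of every label within the code section.
    let mut offsets: HashMap<&str, u64> = HashMap::new();
    let mut pc = 0u64;
    for it in items {
        match it {
            Item::Label(name) => {
                offsets.insert(name.as_str(), pc);
            }
            _ => pc += WORD,
        }
    }

    // Pass 2: patch branch targets, relocated so they are absolute
    // offsets from word 0 of the file.
    let mut out = Vec::new();
    for it in items {
        match it {
            Item::Label(_) => {}
            Item::Goto(l) | Item::Bif(l) => {
                let off = offsets
                    .get(l.as_str())
                    .ok_or_else(|| format!("undefined label {l}"))?;
                out.push(Some(code_base_bytes + off));
            }
            Item::Other => out.push(None),
        }
    }
    Ok(out)
}
```

A single pass would not be enough because a forward branch, such as a BIF that exits a loop, names a label whose offset is not yet known when the branch is emitted.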
Saved registers (ip, bp, fn_slot) are kept in an off-stack call_stack: Vec<(usize, i64, i64)>, leaving the data stack clean for locals:
stack[bp+0] = fn_addr (loc:0)
stack[bp+1] = arg0 (loc:8)
stack[bp+2] = local0 (loc:16)
…
LOCAL n pre-allocates all local slots at function entry so expression temporaries never overwrite them.
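A simplified sketch of this frame discipline. Only the call_stack type comes from the text; the Vm struct and its methods are hypothetical. Three points are assumptions: fn_slot is read as the stack index of the fn_addr slot, LOCAL n is modeled as pushing n zeroed slots, and the return value is modeled as replacing the whole frame.

```rust
struct Vm {
    stack: Vec<i64>,
    /// (saved ip, saved bp, fn_slot), kept off the data stack.
    call_stack: Vec<(usize, i64, i64)>,
    ip: usize, // byte offset of the current instruction
    bp: i64,   // index of the current frame's fn_addr slot
}

impl Vm {
    fn new() -> Self {
        Vm { stack: Vec::new(), call_stack: Vec::new(), ip: 0, bp: 0 }
    }

    /// CALL n: fn_addr and n arguments are already on the data stack.
    fn call(&mut self, n: usize) {
        let fn_slot = (self.stack.len() - n - 1) as i64;
        // Resume at the next 8-byte word after the CALL.
        self.call_stack.push((self.ip + 8, self.bp, fn_slot));
        self.bp = fn_slot;
        self.ip = self.stack[fn_slot as usize] as usize;
    }

    /// LOCAL n: reserve n zeroed slots above the arguments.
    fn local(&mut self, n: usize) {
        let len = self.stack.len();
        self.stack.resize(len + n, 0);
    }

    /// Map a loc:<byte offset> operand to a stack index (loc:8 is bp+1).
    fn slot(&self, loc: i64) -> usize {
        (self.bp + loc / 8) as usize
    }

    /// RETURN: drop the frame and leave the result in its place (assumption).
    fn ret(&mut self, value: i64) {
        let (ip, bp, fn_slot) = self.call_stack.pop().expect("RETURN without CALL");
        self.stack.truncate(fn_slot as usize);
        self.stack.push(value);
        self.ip = ip;
        self.bp = bp;
    }
}
```

Because the locals are reserved before any expression is evaluated, temporaries pushed later land above them and cannot overwrite a local slot.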
String concatenation (+) is implemented via a runtime StringPool:
- SPUSH offset: reads a NUL-terminated string from the data section at offset (or resolves a pool key if the value is negative), interns it in the pool, and pushes a negative integer key onto the stack.
- SADD: pops two pool keys, concatenates the strings, interns the result, and pushes the new key.
- SPOP dst: pops a pool key and stores it in the destination stack slot.
- ITOS: pops an integer, converts it to a decimal string (the String.valueOf operation), interns it in the pool, and pushes the pool key.
- do_println: resolves its argument via resolve_string(), which handles both data-section offsets (≥ 0) and pool keys (< 0) transparently.
Pool keys are negative integers (-1, -2, …) so they are visually distinct from data-section offsets on the stack.
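A minimal sketch of a pool with negative keys, together with SADD and the offset-or-key resolution described above. The names StringPool and resolve_string come from the text, but their signatures, the Vec-backed storage, and the free-standing sadd helper are assumptions rather than the jzero-vm implementation.

```rust
/// Interned strings; the key -1 maps to index 0, -2 to index 1, and so on.
struct StringPool {
    strings: Vec<String>,
}

impl StringPool {
    fn new() -> Self {
        StringPool { strings: Vec::new() }
    }

    /// Intern a string and return its negative key.
    /// Interning the same text twice returns the same key.
    fn intern(&mut self, s: &str) -> i64 {
        if let Some(i) = self.strings.iter().position(|t| t == s) {
            return -(i as i64) - 1;
        }
        self.strings.push(s.to_string());
        -(self.strings.len() as i64)
    }

    /// Look up a pool key; non-negative values are not pool keys.
    fn get(&self, key: i64) -> Option<&str> {
        if key >= 0 {
            return None;
        }
        self.strings.get((-(key + 1)) as usize).map(String::as_str)
    }
}

/// Values >= 0 are data-section offsets of NUL-terminated strings;
/// values < 0 are pool keys.
fn resolve_string(data: &[u8], pool: &StringPool, v: i64) -> Option<String> {
    if v < 0 {
        return pool.get(v).map(str::to_string);
    }
    let start = usize::try_from(v).ok()?;
    let bytes = data.get(start..)?;
    let end = bytes.iter().position(|&b| b == 0)?;
    String::from_utf8(bytes[..end].to_vec()).ok()
}

/// SADD: pop two pool keys, concatenate, intern, push the new key.
fn sadd(pool: &mut StringPool, stack: &mut Vec<i64>) -> Option<()> {
    let rhs = stack.pop()?;
    let lhs = stack.pop()?;
    let joined = format!("{}{}", pool.get(lhs)?, pool.get(rhs)?);
    stack.push(pool.intern(&joined));
    Some(())
}
```

Since the sign alone distinguishes the two kinds of value, a routine like do_println can accept either without a separate tag.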
Why LALRPOP over grmtools/lrpar?
The original grammar has inherent LALR(1) conflicts. grmtools resolved them silently in ways that broke dotted method calls like System.out.println(...). LALRPOP's LR(1) lane table algorithm handles more grammars without conflicts, and its explicit conflict reporting made it easier to restructure the grammar correctly.
- Build Your Own Programming Language (Edition 2) — Clinton L. Jeffery
- Logos — fast lexer generator for Rust
- LALRPOP — LR(1) parser generator for Rust