Skip to content

emfeltham/FormulaCompiler.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

487 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FormulaCompiler.jl

CI Documentation Docs Docs

FormulaCompiler.jl provides efficient per-row model matrix evaluation for Julia statistical models through position-mapped compilation and compile-time specialization. The package transforms StatsModels.jl formulas into specialized evaluators that achieve zero-allocation performance for repeated row-wise operations.

Overview

The package resolves StatsModels.jl formula complexity at compile time, thereby enabling type-stable execution. This approach supports efficient counterfactual analysis, marginal effects computation, and scenarios requiring many model matrix evaluations.

Key characteristics include:

  • Zero-allocation evaluation after initial compilation
  • O(1) memory overhead for counterfactual analysis via variable override system
  • Support for categorical mixtures in marginal effects calculations
  • Dual automatic differentiation (via ForwardDiff.jl) and finite difference backends for derivatives
  • Built on the StatsModels.jl ecosystem (GLM.jl, MixedModels.jl, StandardizedPredictors.jl) allowing users to work with the existing and flexible domain-specific language.

Installation

The package is registered, and the current version is available via

using Pkg
Pkg.add(url="https://github.com/emfeltham/FormulaCompiler.jl")

Basic Usage

using FormulaCompiler, GLM, DataFrames, Tables

# Fit model
df = DataFrame(
    y = randn(1000),
    x = randn(1000),
    z = abs.(randn(1000)) .+ 0.1,
    group = categorical(rand(["A", "B", "C"], 1000))
)

model = lm(@formula(y ~ x * group + log(z)), df)

# Compile formula once
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

# Evaluate rows without allocation
compiled(row_vec, data, 1)  # Zero allocations after warmup

Core Interfaces

Direct Compilation Interface

The most efficient interface for performance-critical applications:

# Pre-compile for optimal performance
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

# Evaluate individual rows
compiled(row_vec, data, 1)    # Row 1
compiled(row_vec, data, 100)  # Row 100

# Evaluate multiple rows
matrix = Matrix{Float64}(undef, 10, length(compiled))
for i in 1:10
    compiled(view(matrix, i, :), data, i)
end

Convenience Interface

Alternative interface with automatic caching:

# Single row (allocating)
row_values = modelrow(model, data, 1)

# Multiple rows (allocating)
matrix = modelrow(model, data, [1, 5, 10, 50])

# In-place with automatic caching
row_vec = Vector{Float64}(undef, size(modelmatrix(model), 2))
modelrow!(row_vec, model, data, 1)

Object-Based Interface

Stateful evaluator for repeated use:

evaluator = ModelRowEvaluator(model, df)
result = evaluator(1)           # Row 1
evaluator(row_vec, 1)          # In-place evaluation

Counterfactual Analysis

FormulaCompiler implements a type-stable counterfactual vector system that enables variable substitution with O(1) memory overhead, avoiding data duplication. This is particularly useful for policy analysis and treatment effect evaluation.

Basic Counterfactual Evaluation

data = Tables.columntable(df)
compiled = compile_formula(model, data)

# Build counterfactual data structure
data_cf, cf_vecs = build_counterfactual_data(data, [:treatment], 1)
treatment_cf = cf_vecs[1]

# Evaluate with counterfactual values
row_vec = Vector{Float64}(undef, length(compiled))
update_counterfactual_replacement!(treatment_cf, true)
compiled(row_vec, data_cf, 1)

Multi-Variable Sensitivity Analysis

# Define analysis grid
treatment_values = [false, true]
dose_values = [50.0, 100.0, 150.0]
region_values = ["North", "South"]

# Create counterfactual data structure
data_cf, cf_vecs = build_counterfactual_data(data, [:treatment, :dose, :region], 1)
treatment_cf, dose_cf, region_cf = cf_vecs

# Evaluate all combinations (2×3×2 = 12 scenarios)
results = Matrix{Float64}(undef, 12, length(compiled))
row_vec = Vector{Float64}(undef, length(compiled))
scenario_idx = 1

for treatment in treatment_values
    for dose in dose_values
        for region in region_values
            update_counterfactual_replacement!(treatment_cf, treatment)
            update_counterfactual_replacement!(dose_cf, dose)
            update_counterfactual_replacement!(region_cf, region)

            compiled(row_vec, data_cf, 1)
            results[scenario_idx, :] .= row_vec
            scenario_idx += 1
        end
    end
end

Memory Efficiency

The counterfactual vector system uses O(1) memory regardless of dataset size:

# Traditional approach: full column copy
traditional = copy(data.x)
traditional[500_000] = 42.0

# CounterfactualVector approach: ~32 bytes overhead
cf_vec = counterfactualvector(data.x, 500_000)
update_counterfactual_replacement!(cf_vec, 42.0)

# Identical interface, minimal memory usage
@assert traditional[500_000] == cf_vec[500_000]

Ecosystem Integration

GLM.jl Models

using GLM

# Linear models
linear_model = lm(@formula(mpg ~ hp * cyl + log(wt)), mtcars)
compiled = compile_formula(linear_model, Tables.columntable(mtcars))

# Generalized linear models
logit_model = glm(@formula(vs ~ hp + wt), mtcars, Binomial(), LogitLink())
compiled_logit = compile_formula(logit_model, Tables.columntable(mtcars))

MixedModels.jl Integration

The package automatically extracts fixed effects from mixed models:

using MixedModels

mixed_model = fit(MixedModel, @formula(y ~ x + z + (1|group) + (1+x|cluster)), df)
compiled = compile_formula(mixed_model, Tables.columntable(df))  # Fixed effects only

StandardizedPredictors.jl Integration

All three StandardizedPredictors.jl transformations are supported: ZScore(), Center(), and Scale():

using StandardizedPredictors

contrasts = Dict(:x => ZScore(), :z => Center(), :w => Scale())
model = lm(@formula(y ~ x + z + w + group), df, contrasts=contrasts)
compiled = compile_formula(model, Tables.columntable(df))

Categorical Mixtures

FormulaCompiler supports categorical mixtures—weighted combinations of categorical levels—for efficient profile-based marginal effects computation:

using FormulaCompiler, Margins

# Create mixture specification
reference_grid = DataFrame(
    x = [1.0, 2.0, 3.0],
    group = mix("Treatment" => 0.6, "Control" => 0.4)
)

# Compile and evaluate
compiled = compile_formula(model, Tables.columntable(reference_grid))
output = Vector{Float64}(undef, length(compiled))
compiled(output, reference_grid, 1)

# Multiple mixture variables
grid = DataFrame(
    age = [30, 40, 50],
    treatment = mix("Control" => 0.3, "Drug_A" => 0.4, "Drug_B" => 0.3),
    dose = mix("Low" => 0.25, "Medium" => 0.5, "High" => 0.25)
)
compiled = compile_formula(model, Tables.columntable(grid))

Mixture evaluation maintains zero-allocation performance through compile-time specialization. Each mixture specification generates a specialized evaluation method.

Derivatives

FormulaCompiler provides computational primitives for zero-allocation derivative computation. For marginal effects, cf. Margins.jl.

using FormulaCompiler, GLM

# Build derivative evaluator
vars = [:x, :z]
de = derivativeevaluator(:ad, compiled, data, vars)

# Compute Jacobian: J[i,j] = ∂modelmatrix[i]/∂vars[j]
J = Matrix{Float64}(undef, length(compiled), length(vars))
derivative_modelrow!(J, de, 1)  # Row 1

Backend Selection

While automatic differentiation is the strongly preferred default option, two backends are available:

  • :ad (automatic differentiation via ForwardDiff.jl): Recommended for standard formulas. Provides machine-precision accuracy and approximately 20% faster performance than finite differences.
  • :fd (finite differences): Recommended for formulas containing boolean predicates. Guarantees zero allocations for all formula types.

Both backends achieve zero-allocation performance through pre-allocated buffers and in-place operations.

Supported Link Functions

The following GLM link functions are supported for marginal effects on the mean response μ:

  • Identity
  • Log
  • Logit
  • Probit
  • Cloglog
  • Cauchit
  • Inverse
  • Sqrt
  • InverseSquare

Supported Formula Features

The package supports the complete StatsModels.jl formula language:

  • Basic terms: x, log(z), x^2, (x > 0)
  • Boolean variables: treated as continuous (false → 0.0, true → 1.0)
  • Categorical variables: all contrast types, ordered and unordered
  • Categorical mixtures: weighted combinations (e.g., mix("A" => 0.6, "B" => 0.4))
  • Interactions: two-way and multi-way (x * group, x * y * z)
  • Mathematical functions: log, exp, sqrt, sin, cos, abs, ^
  • Boolean predicates: (x > 0), (z >= mean(z))
  • Complex formulas: x * log(z) * group + sqrt(abs(y))
  • Standardized predictors: ZScore, Center, Scale, and custom transformations
  • Mixed models: automatic fixed-effects extraction

Performance Considerations

Compilation Pattern

The package is designed for scenarios where compilation cost is amortized over many evaluations:

# Compile once
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

# Evaluate many times (zero allocations)
for i in 1:n_rows
    compiled(row_vec, data, i)
end

Memory Layout

Column-table format provides optimal performance:

data = Tables.columntable(df)  # Convert once, reuse

Pre-allocation

Pre-allocate output vectors for zero-allocation performance:

row_vec = Vector{Float64}(undef, length(compiled))  # Reuse across calls

Batch Operations

Batch evaluation is more efficient than individual allocating calls:

# Efficient: batch evaluation with pre-allocation
matrix = Matrix{Float64}(undef, 1000, length(compiled))
for i in 1:1000
    compiled(view(matrix, i, :), data, i)
end

# Inefficient: repeated allocation
results = [modelrow(model, data, i) for i in 1:1000]

Performance Characteristics

Evaluation Performance

Typical performance for formula evaluation after compilation:

using BenchmarkTools

# Traditional approach
@benchmark modelmatrix(model)[1, :]  # Constructs entire matrix

# FormulaCompiler approach
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
@benchmark compiled(row_vec, data, 1)  # Zero allocations

FormulaCompiler typically achieves a significant speedup compared to full model matrix construction for single-row evaluation, with zero allocations after compilation.

Derivative Performance

Both derivative backends achieve zero-allocation performance:

  • Automatic differentiation (:ad): approximately 50-60ns per row for standard formulas
  • Finite differences (:fd): approximately 65-85ns per row for standard formulas

All operations use pre-allocated buffers, validated through comprehensive allocation tests.

Architecture

The package uses a compilation pipeline based on position mapping:

  • Position mapping: Formula terms are mapped to fixed scratch and output positions during compilation, with all position information encoded in type parameters.
  • Type specialization: Each unique formula generates a specialized evaluation method with concrete types throughout.
  • Adaptive dispatch: Small formulas (≤25 operations) use recursive dispatch; larger formulas use generated functions to respect Julia's compilation limits.
  • Zero-allocation execution: All memory layouts are determined at compile time, enabling allocation-free runtime evaluation.

Use Cases

The package is designed for applications requiring many model matrix evaluations:

  • Marginal effects computation requiring numerical derivatives
  • Monte Carlo simulations requiring millions of model evaluations
  • Bootstrap resampling with repeated matrix construction
  • Applications requiring low-latency prediction
  • Large-scale inference with memory constraints

Related Packages

Documentation

Additional documentation is available in the docs/ directory and the online documentation.

Contributing

Contributions are welcome. Please open an issue or pull request on GitHub.

License

MIT License. See LICENSE for details.

Citation

@misc{feltham_formulacompilerjl_2026,
  title = {{{FormulaCompiler}}.Jl and {{Margins}}.Jl: {{Efficient Marginal Effects}} in {{Julia}}},
  shorttitle = {{{FormulaCompiler}}.Jl and {{Margins}}.Jl},
  author = {Feltham, Eric},
  year = {2026},
  month = jan,
  number = {arXiv:2601.07065},
  eprint = {2601.07065},
  primaryclass = {stat},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2601.07065},
  urldate = {2026-01-13},
  abstract = {Marginal effects analysis is fundamental to interpreting statistical models, yet existing implementations face computational constraints that limit analysis at scale. We introduce two Julia packages that address this gap. Margins.jl provides a clean two-function API organizing analysis around a 2-by-2 framework: evaluation context (population vs profile) by analytical target (effects vs predictions). The package supports interaction analysis through second differences, elasticity measures, categorical mixtures for representative profiles, and robust standard errors. FormulaCompiler.jl provides the computational foundation, transforming statistical formulas into zero-allocation, type-specialized evaluators that enable O(p) per-row computation independent of dataset size. Together, these packages achieve 622x average speedup and 460x memory reduction compared to R's marginaleffects package, with successful computation of average marginal effects and delta-method standard errors on 500,000 observations where R fails due to memory exhaustion, providing the first comprehensive and efficient marginal effects implementation for Julia's statistical ecosystem.},
  archiveprefix = {arXiv},
  keywords = {Statistics - Computation},
}

About

High‑performance compilation of statistical formulas to zero‑allocation evaluators with FD/AD derivative support, serving as a stable computational foundation for post-estimation analysis and simulation with formula-based linear models in Julia.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors