A small ML compiler and runtime for transformer inference and training.
JAX-shaped IR + autodiff + transforms (jvp, hvp, vmap) on top of
backend-specific kernels for CPU, Apple Silicon (Metal / MLX), NVIDIA
(CUDA), AMD (ROCm), Google TPU, cross-platform GPU (wgpu), and
microcontrollers (Cortex-M).
Status: 0.2.9, Apple-Silicon-first. The CPU and Apple GPU paths are mature; CUDA / ROCm / TPU / WGPU work but have seen less mileage; Cortex-M is a separate INT8 product. Multi-backend runtime helpers (`GraphDevices`, `DeviceRouter`): see `docs/backend-selection.md`. In-graph RNG (`Op::RngNormal` / `Op::RngUniform`) with ONNX `Random*` import: see `CHANGELOG.md` and `rlx-runtime/README.md`.
Most ML stacks pick a side: either a graph compiler (XLA, TVM, MLIR) or
a kernel runtime (cuDNN, MPS, MLX). RLX is both, end-to-end, in one
language, with a vocabulary modelled on jax.lax. The IR knows about
Op::Scan, Op::DenseSolve, Op::FakeQuantize, attention with
MaskKind, and an Op::Custom / Op::CustomFn extension surface;
the optimizer knows about fusion, AMP precision policy, autodiff in
both directions, vmap, broadcast legalization, and PTQ insertion; the
runtime knows about backend dispatch, compile caching, and
heterogeneous cost-based device selection.
The `rlx` prelude crate is the recommended entry point: it pulls in
the IR, optimizer, runtime, and re-exports the common types:

    [dependencies]
    rlx = { version = "0.2", features = ["cpu"] }

For Apple Silicon GPU acceleration (note: `mlx` is git-only for 0.1.0; see below):

    rlx = { version = "0.1", features = ["cpu", "metal"] }

`mlx` and `rocm` aren't on crates.io for 0.1.0 (workspace-relative submodule / kernel-source paths). For those features, use the workspace git tree:

    rlx = { git = "https://github.com/MIT-RLX/rlx", features = ["mlx"] }

| feature | backend | platform |
|---|---|---|
| `cpu` (default) | NEON / AVX + Accelerate / OpenBLAS | every host |
| `metal` | Metal Performance Shaders + MSL | macOS (Apple Silicon) |
| `mlx` | Apple MLX (vendored) | macOS (Apple Silicon) |
| `gpu` | wgpu (Vulkan / DX12 / WebGPU / Metal) | cross-platform |
| `cuda` | cuBLAS / cuDNN / NVRTC | Linux / Windows + NVIDIA |
| `rocm` | hipBLAS / MIOpen | Linux + AMD |
| `tpu` | libtpu PJRT plugin | Linux + GCP TPU |
| `blas-accelerate` | macOS Accelerate | macOS |
| `blas-mkl` | Intel MKL | Intel / AMD CPUs |
| `blas-openblas` | OpenBLAS | cross-platform CPU |
Off by default; enable per workload:
| feature | what |
|---|---|
| `gguf` | GGUF v1 / v2 / v3 parser + dequant |
| `bench` | uniform benchmark harness |
| `sparse` | sparse linear algebra (custom-op scaffold) |
| `linalg` | dense linalg via LAPACK (custom-op scaffold) |
| `splat` | 3D Gaussian splatting (CPU reference render custom op) |
The `Backend` model doesn't fit microcontrollers or hardware synthesis.
For those, depend on the standalone crates directly; they're not
exposed through the prelude:
- `rlx-cortexm`: `no_std` ARMv7E-M INT8 kernels.
- `rlx-fpga`: IR → SystemVerilog → bitstream.

A single `use rlx::prelude::*;` covers the common surface: graph
types, `Session`, `Device`, ops + activations, and `Result`.

    use rlx::prelude::*;

    let mut g = Graph::new("hello");
    let x = g.input("x", Shape::new(&[1, 4], DType::F32));
    let w = g.param("w", Shape::new(&[4, 2], DType::F32));
    let y = g.matmul(x, w, Shape::new(&[1, 2], DType::F32));
    g.set_outputs(vec![y]);

    let mut compiled = Session::new(Device::Cpu).compile(g);
    compiled.set_param("w", &[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]);
    let out = compiled.run(&[("x", &[1.0, 2.0, 3.0, 4.0])]);

Prefer NumPy-style expressions? The `rlx::tensor` DSL (`rlx-tensor`)
builds the same IR with operator-overloaded, lazy `Tensor` handles:
`(&a + &b).relu()` traces instead of executing, then fuses + memory-plans
across any backend when you call `.to_vec()`:

    use rlx_tensor::Tensor; // crate `rlx-tensor`, feature `eval`

    let a = Tensor::from_vec(vec![1.0, 2.0, 3.0], [3]);
    let c = (&a + &Tensor::ones([3])).relu();
    assert_eq!(c.to_vec(), vec![2.0, 3.0, 4.0]); // auto-picks the fastest backend

Domain-specific namespaces if you want narrower star-imports:
`rlx::ops::*` (IR helper enums), `rlx::quant::*`, `rlx::autodiff::*`.
Or the full per-crate surface
via `rlx::ir::…` / `rlx::opt::…` / `rlx::runtime::…` etc.; every
workspace crate is reachable as a module on `rlx`.

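To make "traces instead of executing" concrete, here is a minimal, self-contained sketch of a lazy handle with operator overloading. It is not `rlx-tensor`'s implementation: `LazyTensor`, `Node`, and `trace_len` are hypothetical names, there is no fusion or memory planning, and evaluation is a plain recursive walk on the host.

```rust
use std::ops::Add;
use std::rc::Rc;

/// One node of the recorded expression graph. Building nodes computes nothing.
#[derive(Debug)]
enum Node {
    Const(Vec<f32>),
    Add(Rc<Node>, Rc<Node>),
    Relu(Rc<Node>),
}

/// Hypothetical lazy handle: arithmetic on it only records graph nodes.
#[derive(Clone, Debug)]
pub struct LazyTensor(Rc<Node>);

impl LazyTensor {
    pub fn from_vec(data: Vec<f32>) -> Self {
        LazyTensor(Rc::new(Node::Const(data)))
    }

    pub fn ones(len: usize) -> Self {
        Self::from_vec(vec![1.0; len])
    }

    /// Records a ReLU node on top of this handle.
    pub fn relu(&self) -> Self {
        LazyTensor(Rc::new(Node::Relu(Rc::clone(&self.0))))
    }

    /// Number of nodes in the recorded expression tree.
    pub fn trace_len(&self) -> usize {
        fn count(n: &Node) -> usize {
            match n {
                Node::Const(_) => 1,
                Node::Add(a, b) => 1 + count(a) + count(b),
                Node::Relu(a) => 1 + count(a),
            }
        }
        count(&self.0)
    }

    /// The only place where arithmetic runs: walk the graph and evaluate it.
    pub fn to_vec(&self) -> Vec<f32> {
        fn eval(n: &Node) -> Vec<f32> {
            match n {
                Node::Const(v) => v.clone(),
                Node::Add(a, b) => {
                    let (a, b) = (eval(a), eval(b));
                    assert_eq!(a.len(), b.len(), "shape mismatch");
                    a.iter().zip(&b).map(|(x, y)| x + y).collect()
                }
                Node::Relu(a) => eval(a).into_iter().map(|x| x.max(0.0)).collect(),
            }
        }
        eval(&self.0)
    }
}

/// `&a + &b` records an Add node instead of adding element-wise.
impl<'a, 'b> Add<&'b LazyTensor> for &'a LazyTensor {
    type Output = LazyTensor;
    fn add(self, rhs: &'b LazyTensor) -> LazyTensor {
        LazyTensor(Rc::new(Node::Add(Rc::clone(&self.0), Rc::clone(&rhs.0))))
    }
}
```

With this toy, `(&a + &LazyTensor::ones(3)).relu()` only allocates three graph nodes on top of the two constants' storage; the additions and the clamp run when `to_vec()` walks the graph. A real compiler gets its chance to fuse and plan memory in exactly that gap between recording and evaluation.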
Op::Fft is a first-class IR primitive with CPU, Metal, MLX, CUDA,
ROCm, wgpu, and TPU lowering. Graph helpers in rlx_ir::Graph cover
real-input spectra and signal-processing workflows:
- `fft_real` / `rfft` / `irfft`: Hermitian `irfft` mirrors the conjugate half
- `fftfreq` / `rfftfreq`: sample-frequency constants
- `psd` / `psd_real`: power spectral density
- `stft`, `fft_conv1d`: short-time FFT and frequency-domain convolution

Pow-2 f32 transforms use native GPU kernels on CUDA / ROCm / wgpu / Metal;
non-pow2 and f64 / C64 fall back to partial host sync. Bench with
`cargo run -p rlx-bench --release --example bench_fft --features metal,gpu`.
Python bindings: `pyrlx.Graph.fft`, `.rfft`, `.irfft`, `.fftfreq` (see
`pyrlx/tests/test_fft.py`).

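For readers unfamiliar with the helper names, the sketch below shows what sample-frequency constants and Hermitian mirroring usually mean. It assumes the NumPy conventions for `fftfreq` / `rfftfreq`; the README does not state which convention RLX follows, and these free functions (including the hypothetical `hermitian_extend`) are plain host-side Rust, not the `rlx_ir::Graph` helpers.

```rust
/// Bin frequencies for an n-point FFT with sample spacing `d` (NumPy convention):
/// [0, 1, ..., ceil(n/2) - 1, -floor(n/2), ..., -1] / (n * d).
pub fn fftfreq(n: usize, d: f64) -> Vec<f64> {
    let scale = 1.0 / (n as f64 * d);
    let non_negative = (n + 1) / 2; // count of bins with frequency >= 0
    (0..n)
        .map(|k| {
            let idx = if k < non_negative { k as i64 } else { k as i64 - n as i64 };
            idx as f64 * scale
        })
        .collect()
}

/// Bin frequencies for a real-input FFT: only the n/2 + 1 non-negative bins.
pub fn rfftfreq(n: usize, d: f64) -> Vec<f64> {
    let scale = 1.0 / (n as f64 * d);
    (0..=n / 2).map(|k| k as f64 * scale).collect()
}

/// Rebuild the full n-bin spectrum of a real signal from its half spectrum,
/// using X[k] = conj(X[n - k]). Complex values are (re, im) pairs.
pub fn hermitian_extend(half: &[(f64, f64)], n: usize) -> Vec<(f64, f64)> {
    assert!(n > 0, "transform length must be positive");
    assert_eq!(half.len(), n / 2 + 1, "half spectrum must hold n/2 + 1 bins");
    let mut full = half.to_vec();
    for k in (n / 2 + 1)..n {
        let (re, im) = half[n - k];
        full.push((re, -im)); // conjugate of the mirrored bin
    }
    full
}
```

An inverse real FFT needs only the half spectrum because the other half carries no new information; `hermitian_extend` makes that redundancy explicit.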
Or depend on each crate directly (`rlx-ir`, `rlx-opt`, `rlx-runtime`,
…) for the smallest possible dep tree.

    rlx            prelude: re-exports framework crates + common types
    rlx-ir         leaf: types, shape, op enum, verifier, HIR hooks
    rlx-tensor     NumPy-style symbolic Tensor DSL (lazy, trace → fuse → any backend)
    rlx-flow       block assembly-line API for model builders
    rlx-fusion     MIR fusion passes + unfuse for AD
    rlx-autodiff   grad / jvp / hvp / vmap on MIR
    rlx-compile    CompilePipeline, legalization, memory plan, precision
    rlx-opt        facade: re-exports fusion + autodiff + compile
    rlx-driver     Device enum + cross-cutting types
    rlx-cpu        CPU kernels (NEON / AVX / Accelerate / OpenBLAS)
    rlx-metal      Apple Metal native (MSL + MPSGraph + ICB)
    rlx-mlx        Apple MLX (vendored, hand-rolled C++ shim)
    rlx-cuda       NVIDIA CUDA (cuBLAS + cuDNN + NVRTC + Graphs)
    rlx-rocm       AMD ROCm/HIP (hipBLAS + MIOpen + hipGraph)
    rlx-tpu        Google TPU via libtpu PJRT
    rlx-wgpu       Cross-platform GPU via wgpu
    rlx-cortexm    ARMv7E-M INT8 kernels (no_std)
    rlx-fpga       IR → Verilog → bitstream
    rlx-runtime    user-facing Session / CompiledGraph
    rlx-gguf       standalone GGUF parser + dequant (every llama.cpp scheme: Q4_0..Q8_0, Q2_K..Q8_K, IQ1..IQ4, TQ1/TQ2, MXFP4, NVFP4)
    rlx-macros     #[rlx_model] AOT macro
    rlx-bench      benchmark harness
    rlx-sparse     downstream: CSR LU / mat-vec / CG (custom-op scaffold)
    rlx-linalg     downstream: dense linalg via LAPACK (custom-op scaffold)
    rlx-splat      downstream: 3D Gaussian splatting (self-contained; `rlx_splat::register()`)
    pyrlx          Python bindings via PyO3

Each crate has its own `README.md` covering public surface, build
commands, and internal gotchas.

    cargo build --release                        # cpu only
    cargo build --release --features metal,mlx   # apple silicon GPU
    cargo test --release --workspace             # 865 tests

For Apple Silicon, MLX is a git submodule under `rlx-mlx-sys`:

    git submodule update --init rlx-mlx-sys/vendor/mlx
    # or: git clone --recurse-submodules …

RLX keeps native fast paths as the default while still allowing transparent fallback when a backend has not wired an op yet.

| Path | When | Effect |
|---|---|---|
| Native | `OpKind` is in the backend's `supported_ops` claim | Backend thunk (MSL, CUDA, CPU ref, …) |
| Common IR | Registered logical kernel, not in `supported_ops` | Lowered to primitive MIR (`MatMul`, `Reduce`, …); portable, often slower |
| Rewritten | Structural unfuse / lower (e.g. fused matmul → primitives) | Same semantics, different graph shape |
| Unsupported | Still illegal after rewrite | Compile fails with a diagnostic report |
Policy (default PreferNative): native if claimed, else common IR.
Override globally with RLX_KERNEL_DISPATCH=common|native, or per compile
via CompileOptions::kernel_dispatch and
force_common_kinds / force_native_kinds.
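The default `PreferNative` decision in the table above can be modelled in a few lines. This is a simplified sketch, not the RLX implementation: `Kind`, `Path`, `Backend`, `classify`, and `claim_native` are hypothetical stand-ins, and the real pipeline rewrites the graph and re-checks it rather than consulting a static "rewritable" set.

```rust
use std::collections::HashSet;

/// Hypothetical stand-ins for a handful of op kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Kind {
    MatMul,
    FusedMatMul,
    Splat,
    Unknown,
}

/// The four outcomes listed in the dispatch table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Path {
    Native,
    CommonIr,
    Rewritten,
    Unsupported,
}

/// Toy backend description: what it claims natively, which kinds have a
/// registered common-IR lowering, and which kinds can be structurally rewritten.
pub struct Backend {
    supported_ops: HashSet<Kind>,
    logical_kernels: HashSet<Kind>,
    rewritable: HashSet<Kind>,
}

impl Backend {
    pub fn new(supported: &[Kind], logical: &[Kind], rewritable: &[Kind]) -> Self {
        Backend {
            supported_ops: supported.iter().cloned().collect(),
            logical_kernels: logical.iter().cloned().collect(),
            rewritable: rewritable.iter().cloned().collect(),
        }
    }

    /// Models "add the kind to the backend's supported_ops" after writing a thunk.
    pub fn claim_native(&mut self, kind: Kind) {
        self.supported_ops.insert(kind);
    }
}

/// PreferNative: native if claimed, else common IR, else rewrite, else fail.
pub fn classify(kind: Kind, backend: &Backend) -> Path {
    if backend.supported_ops.contains(&kind) {
        Path::Native
    } else if backend.logical_kernels.contains(&kind) {
        Path::CommonIr
    } else if backend.rewritable.contains(&kind) {
        Path::Rewritten
    } else {
        Path::Unsupported
    }
}
```

In this model, speeding up a kind means moving it from the second branch to the first: once `claim_native(Kind::Splat)` is called, `classify` returns `Path::Native` for it.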
See what a compile will do: set `RLX_DISPATCH_REPORT=1` or
`RLX_VERBOSE=1` before `Session::compile`; the runtime prints a per-kind
summary (native / common-ir / rewritten / missing). On failure, the error
includes both legalization details and the dispatch report.

    use rlx::prelude::*;
    use rlx::runtime::{
        dispatch_report_for_device, legalize_graph_for_device_with_options, CompileOptions,
        ModelReflection,
    };
    use rlx::opt::format_dispatch_report;
    use rlx_flow::ModelExecutionConfig;
    use rlx_ir::CompilationMode;

    // Unified component (variant + dispatch + eager/lazy/AOT + profile + layer stack)
    let config = ModelExecutionConfig::qwen35_prefill(1, 512)
        .with_compilation_mode(CompilationMode::Lazy);
    let _key = config.cache_key();

    // Static probe (common-ir kinds only; no unfuse)
    let report = dispatch_report_for_device(&graph, Device::Metal)?;
    eprintln!("{}", format_dispatch_report(&report));

    // Full rewrite + legalize probe (same path as compile)
    let opts = CompileOptions::new(); // or compile_options_for_device(&config, Device::Metal)
    let (graph, report) =
        legalize_graph_for_device_with_options(graph, Device::Metal, &opts)?;

`supports_graph` uses the backend `supported_ops` claim set when a backend
is registered, so device picking stays aligned with compile rather than
hand-maintained op tables.

More detail: rlx-ir/README-logical-kernels.md
(registered logical kernels, splat example, API table).
To speed up a workload: implement the native thunk, add the OpKind to
that backend's supported_ops, and re-run with RLX_DISPATCH_REPORT=1
until the kind moves from common-ir to native.
- Fast local gate: `just ci` (build, workspace tests, lint, pyrlx pytest).
- Always gate benches on throttle. `scripts/check-throttle.sh` refuses to proceed under thermal pressure (`pmset -g therm`). Silent 10× slowdowns are a real failure mode on Macs. `--warn` mode for CI; `RLX_ALLOW_THROTTLE=1` for one-off bypass.
- Use `rlx_ir::Tick` for sub-ms timing (CNTVCT_EL0 directly, not `Instant::now`). Re-exported from `rlx_runtime` for convenience.
- Touch every backend when you add an Op. New ops mean: rlx-ir (op.rs, infer.rs, graph.rs, verify.rs), every backend's thunks + cost models (rlx-cpu, rlx-metal, rlx-mlx, rlx-cuda, rlx-rocm, rlx-tpu, rlx-wgpu; sister-crate ports are usually mechanical), the optimizer fusion patterns, and ideally a parity test in burnembed. Use `RLX_DISPATCH_REPORT=1` after compile to confirm native vs common-ir.
- Bench every change in burnembed. The integration testbed at `/Users/Shared/burnembed` is the canonical bench loop: `cargo run --release --example bench_rlx_single --features ndarray,blas-accelerate,rlx,hf-download -- --model minilm6`. Models pulled live from HF.
- PLAN.md drives priorities; the `## Landed` section at the bottom tracks what's already in tree, with bench deltas. PRs targeting plan items are expected to add a delta line; even "within noise" is data worth recording.

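The sub-millisecond timing advice in the list above refers to reading the ARM virtual counter register instead of going through `Instant::now`. The sketch below shows one way that can look. It is an illustration only, not the source of `rlx_ir::Tick`: `raw_ticks`, `ticks_per_second`, and `time_ns` are hypothetical names, and the non-aarch64 branch is a fallback added here so the example builds on other hosts.

```rust
#[cfg(target_arch = "aarch64")]
use std::arch::asm;
#[cfg(not(target_arch = "aarch64"))]
use std::time::Instant;

/// Raw counter value. On aarch64 this reads CNTVCT_EL0 directly.
#[cfg(target_arch = "aarch64")]
pub fn raw_ticks() -> u64 {
    let ticks: u64;
    // SAFETY: reading the virtual counter has no side effects and touches no memory.
    unsafe { asm!("mrs {t}, cntvct_el0", t = out(reg) ticks, options(nostack, preserves_flags)) };
    ticks
}

/// Counter frequency in Hz, read from CNTFRQ_EL0.
#[cfg(target_arch = "aarch64")]
pub fn ticks_per_second() -> u64 {
    let hz: u64;
    // SAFETY: reading the counter frequency register has no side effects.
    unsafe { asm!("mrs {f}, cntfrq_el0", f = out(reg) hz, options(nostack, preserves_flags)) };
    hz
}

/// Fallback for other architectures: nanoseconds since the first call.
#[cfg(not(target_arch = "aarch64"))]
pub fn raw_ticks() -> u64 {
    use std::sync::OnceLock;
    static START: OnceLock<Instant> = OnceLock::new();
    START.get_or_init(Instant::now).elapsed().as_nanos() as u64
}

/// The fallback counter already counts nanoseconds.
#[cfg(not(target_arch = "aarch64"))]
pub fn ticks_per_second() -> u64 {
    1_000_000_000
}

/// Run `f` once and return its result plus the elapsed time in nanoseconds.
pub fn time_ns<T>(f: impl FnOnce() -> T) -> (T, u64) {
    let start = raw_ticks();
    let out = f();
    let end = raw_ticks();
    let elapsed = end.wrapping_sub(start) as u128;
    let ns = elapsed * 1_000_000_000 / ticks_per_second() as u128;
    (out, ns as u64)
}
```

The counter's resolution is bounded by its frequency, so very short regions are better timed over many iterations and divided, rather than one call at a time.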
Recent phases (from git log) — A → J: dtype dispatch, AutoMixed precision, cast-tax elimination, segmented ICB, f16 reduction kernels, MPSGraph extension. K → L: rlx-cuda full stack (cuBLAS/Lt + cuDNN + WMMA + CUDA Graphs + multi-stream + mixed-precision GemmEx + NVRTC disk cache + NVTX), followed by rlx-rocm sister crate at parity.
Pre-1.0; 0.x minor bumps may include breaking IR changes. The Op
enum and the Graph builder API in particular are still evolving as
new ops land. Pin exact versions in production until 1.0.
| Area | State |
|---|---|
| CPU forward + backward | Mature; 26 unit tests + integration suites |
| Apple Metal forward | Mature; 78-warning third-party noise silenced |
| Apple MLX forward + backward | Mature; tier-1/2/3 backward parity |
| NVIDIA CUDA | Functional; less battle-tested |
| AMD ROCm | Sister-crate parity to CUDA |
| TPU | Real-model E2E parity (MiniLM-L6) via PJRT |
| WGPU | Functional; coop-matrix paths under test |
| Cortex-M (INT8) | Production: 96.6% MNIST on nRF52840 hardware |
| FPGA | Per-graph datapath + bitstream emit |
| Reverse-mode AD | Phase 1–9 complete; SelectiveScan, FusedTL |
| Forward-mode AD (`jvp`/`hvp`) | Functional; thin public API |
| `vmap` | MVP: leading-axis batching |
| QAT (PTQ + STE + LSQ) | Complete: EMA, Fixed, PerBatch, propagation |
| Qwen3 LM (safetensors + GGUF) | End-to-end on Metal: 100% top-1 parity vs HF; matches/beats Python MPS on most prefill shapes. Q4_K_M GGUF loads + runs |
| Op::DequantMatMul GGUF schemes | Every llama.cpp scheme covered: Q2..Q8 K-quants, Q4_0/Q8_0, IQ4_NL/XS, IQ2_XXS/XS/S, IQ3_XXS/S, IQ1_S/M, TQ1_0/TQ2_0, MXFP4, NVFP4. CPU + Metal + CUDA have native fused kernels (per-block dequant + cuBLAS/MPS sgemm); MLX/ROCm/wgpu use host-side dequant via rlx-cpu::gguf_matmul. CoreML/ANE covers the K-quant subset via constexpr_blockwise_shift_scale. |
| Sampler chain | SamplerChain in rlx-runtime::samplers: Temperature, DynamicTemperature, TopK, TopP, TopNSigma, TypicalP, Mirostat v1/v2, XTC, DRY, RepetitionPenalty. Wired into SampleOpts::into_chain(); classic top-k/top-p stay on the fast path via is_classic(). |
| Quantized KV cache | Per-layer K/V stored as q4_0 / q5_0 / q8_0 / f16 blocks via rlx-runtime::quantized_kv. Optional mmap-kv feature spills to a file-backed mapping for long contexts. |
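As background for the quantized KV cache row, the sketch below shows q8_0-style block quantization as it is commonly defined in llama.cpp: 32 values per block, one scale per block, signed 8-bit quants. It is an assumption that `rlx-runtime::quantized_kv` follows the same layout; `BlockQ8`, `quantize_q8_0`, and `dequantize_q8_0` are hypothetical names, and the scale is kept as `f32` here where the on-disk format uses f16.

```rust
/// Values per block in the q8_0 scheme.
pub const BLOCK: usize = 32;

/// One quantized block: a shared scale plus 32 signed 8-bit quants.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockQ8 {
    pub scale: f32, // f16 in the real format; f32 keeps this sketch std-only
    pub quants: [i8; BLOCK],
}

/// Quantize a slice whose length is a multiple of BLOCK.
pub fn quantize_q8_0(values: &[f32]) -> Vec<BlockQ8> {
    assert!(values.len() % BLOCK == 0, "length must be a multiple of 32");
    values
        .chunks_exact(BLOCK)
        .map(|chunk| {
            // The largest magnitude in the block maps to +/-127.
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut quants = [0i8; BLOCK];
            for (q, v) in quants.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            BlockQ8 { scale, quants }
        })
        .collect()
}

/// Reconstruct approximate f32 values from quantized blocks.
pub fn dequantize_q8_0(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.quants.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}
```

Each reconstructed value is off by at most about half a quantization step (`scale / 2`), which is why a single outlier in a block costs precision for its 31 neighbours.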
Eugene Hauptmann, Nataliya Kosmyna (MIT-RLX).
PRs welcome; the roadmap (PLAN.md) drives priorities. Per-crate
README.md files document build commands and gotchas; treat them as
the canonical "how does this crate work" reference.
GPL-3.0-only. See LICENSE.