microsoft/BitNet
Official inference framework for 1-bit and 1.58-bit quantized large language models. BitNet runs a 100B-parameter model on a single CPU at roughly 5–7 tokens per second, comparable to human reading speed, with no GPU required. Its kernels are optimized for ARM (Apple Silicon, Snapdragon) and x86 (Intel, AMD) architectures with architecture-specific, low-level implementations.
Why It Matters
The GPU has been the price of admission to frontier AI for the past two years. BitNet changes that. Quantizing weights to 1 or 1.58 bits (ternary values of -1, 0, +1) shrinks both memory footprint and arithmetic cost so dramatically that a 100B-parameter model fits in commodity RAM and runs on CPU integer operations alone: with ternary weights, matrix multiplication reduces to additions and subtractions, so no FP16 math or tensor cores are needed. For teams concerned with inference cost, edge deployment, data privacy (local execution, no cloud), or simply getting AI running in environments where GPU provisioning is impractical, BitNet is one of the most important infrastructure projects of 2026. It currently supports llama.cpp-compatible models and ships with pre-quantized BitNet b1.58 models in a range of sizes.
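To make the multiply-free idea concrete, here is a minimal NumPy sketch of 1.58-bit ternary quantization in the style described in the BitNet b1.58 paper (absmean scaling, rounding to {-1, 0, +1}) and a matrix-vector product that uses only additions and subtractions. This is an illustration of the technique, not the project's actual optimized kernels; the function names are invented for this example.

```python
import numpy as np

def quantize_ternary(W):
    """Absmean quantization: scale by the mean |w|, round to {-1, 0, +1}."""
    scale = np.mean(np.abs(W)) + 1e-8          # avoid division by zero
    Wq = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return Wq, scale

def ternary_matvec(Wq, x, scale):
    """Multiply-free matvec: add x where w=+1, subtract where w=-1."""
    out = np.zeros(Wq.shape[0], dtype=x.dtype)
    for i in range(Wq.shape[0]):
        row = Wq[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out * scale                          # one rescale per output row

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)

Wq, s = quantize_ternary(W)
approx = ternary_matvec(Wq, x, s)  # addition/subtraction only
exact = W @ x                      # full-precision reference
```

The inner loop never multiplies a weight by an activation; selecting and summing activation entries replaces the multiply-accumulate, which is why these models run well on plain CPU integer/SIMD units.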