ggml-org/llama.cpp
llama.cpp is a high-performance C/C++ implementation of large language model inference, originally created by Georgi Gerganov in March 2023 shortly after Meta released its LLaMA models. With over 96,000 GitHub stars and more than 15,000 forks, it stands as one of the most consequential open-source AI projects ever built. The core premise is deceptively simple: strip away the Python runtime overhead and GPU requirements that dominate the LLM ecosystem, and rewrite inference from scratch in portable C/C++ with zero external dependencies.

The project supports an extraordinary range of hardware backends -- eleven and counting -- including Apple Metal, NVIDIA CUDA, AMD HIP, Intel SYCL, Vulkan for cross-vendor GPU access, Ascend NPU via CANN, OpenCL for Adreno mobile GPUs, and experimental WebGPU for browser-based inference. This backend diversity means llama.cpp runs on everything from high-end data center GPUs to Raspberry Pis, Android phones, and iOS devices. A CPU+GPU hybrid inference mode lets models that exceed available VRAM spill over into system RAM, making it practical to run 30B+ parameter models on consumer hardware.

Quantization is where llama.cpp truly differentiates itself. The GGUF file format, which llama.cpp pioneered, supports quantization levels from 1.5-bit to 8-bit integers alongside standard float32, float16, and bfloat16. Aggressive quantization (Q2_K through Q4_K) can cut memory requirements by up to 75 percent, enabling models like LLaMA 2 13B or Mixtral 8x7B to run on machines with as little as 6-8 GB of RAM; the rough arithmetic behind these figures is sketched below. GGUF has become a de facto standard, with Hugging Face providing native GGUF support and dedicated tools like GGUF-my-repo for model conversion.

Beyond raw inference, llama.cpp ships with a production-ready HTTP server (llama-server) that exposes an OpenAI-compatible API, so existing applications built against the OpenAI API can be pointed at a local llama.cpp server with minimal code changes. The server supports advanced features including speculative decoding for 1.5-2x throughput improvements on structured prompts, grammar-constrained output for reliable structured generation and function calling, and multimodal inference for vision-language models like LLaVA, MiniCPM, and Qwen2-VL. The project also includes VS Code and Vim/Neovim plugins for local code completion, a built-in web UI for interactive chat, and RPC-based distributed inference across multiple machines.

Model compatibility spans over 50 text-only architectures (LLaMA, Mistral, Qwen, Phi, Gemma, Mamba, and many more) and 10+ vision-language models. Active development continues at a rapid pace, with over 8,200 commits and 758 open pull requests as of early 2026, including recent work on WebGPU shader optimization, CDNA3 tensor core flash attention for AMD MI300X GPUs, and native MXFP4 format support.
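To make the memory arithmetic behind those quantization claims concrete, here is a back-of-the-envelope sketch. The bits-per-weight values below are rough averages for the mixed K-quant formats rather than exact figures from the llama.cpp source, so treat the output as an order-of-magnitude estimate of the weight footprint only (KV cache and activations add more on top).

```python
# Back-of-the-envelope estimate of model weight footprint at different
# GGUF quantization levels. Bits-per-weight values are approximate
# averages for the mixed K-quant formats, not exact figures.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,   # roughly 4-5 bits: a common "sweet spot" preset
    "Q2_K": 2.6,      # among the most aggressive K-quants
}

def weight_footprint_gib(n_params_billion: float, fmt: str) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    bits = BITS_PER_WEIGHT[fmt]
    return n_params_billion * 1e9 * bits / 8 / 2**30

if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        size = weight_footprint_gib(13, fmt)  # e.g. a 13B-parameter model
        print(f"13B @ {fmt:7s} ~ {size:5.1f} GiB")
```

Running this shows why a 13B model that needs roughly 24 GiB of weights at F16 fits in about 7 GiB at Q4_K_M and under 4 GiB at Q2_K, which is where the "up to 75 percent" reduction and the 6-8 GB figures come from.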
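Grammar-constrained output is exposed through the server's native completion endpoint, which accepts a GBNF grammar alongside the prompt. The following is a minimal sketch assuming a llama-server instance on localhost:8080 and the field names documented for the /completion endpoint (prompt, grammar, n_predict); check the server README of your build for the exact schema.

```python
import requests  # third-party HTTP client: pip install requests

# A tiny GBNF grammar that restricts the model's output to "yes" or "no".
GRAMMAR = 'root ::= "yes" | "no"'

# Assumes llama-server is already running locally; field names follow the
# /completion endpoint as described in the server documentation.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Is the sky blue? Answer with one word: ",
        "grammar": GRAMMAR,
        "n_predict": 4,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["content"])  # generation constrained to "yes" or "no"
```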
Why It Matters
llama.cpp is arguably the single most important project in the local AI movement. Before its release in March 2023, running large language models required expensive cloud GPU instances or high-end NVIDIA hardware with complex Python toolchains. Georgi Gerganov demonstrated that careful C/C++ engineering could make LLM inference accessible on ordinary laptops and even mobile phones, fundamentally shifting the economics and accessibility of AI.

The GGUF quantization format that llama.cpp popularized has become the backbone of an entire ecosystem. Thousands of quantized models are hosted on Hugging Face, Docker Hub now supports direct GGUF model pulls, and major applications like LM Studio, Ollama, Jan, GPT4All, and koboldcpp all depend on llama.cpp as their inference engine. When Gerganov and team joined Hugging Face, it cemented the project's role as critical open-source AI infrastructure.

For developers and organizations, llama.cpp solves three pressing problems simultaneously: it eliminates cloud inference costs by enabling fully local deployment, it preserves data privacy by keeping all processing on-device, and it provides production-grade serving through its OpenAI-compatible API. The project's cross-platform reach -- spanning Windows, macOS, Linux, iOS, and Android across CPU, GPU, and NPU backends -- means there is virtually no hardware configuration where llama.cpp cannot run. This universality, combined with MIT licensing, makes it foundational infrastructure for anyone building AI applications that need to work outside the cloud.
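As a concrete illustration of that serving path, the sketch below points the official openai Python client at a local llama-server instance. It assumes the server was started on the default port 8080 (for example with something like llama-server -m model.gguf -ngl 35, where -ngl offloads part of the layers to the GPU and the rest stay in system RAM); the model name in the request is largely informational when the server hosts a single model.

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at a local llama-server instance.
# Assumes the server is already running, e.g. started with something like:
#   llama-server -m model.gguf -ngl 35 --port 8080
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",  # llama-server does not require a key by default
)

response = client.chat.completions.create(
    model="local-model",  # informational when the server hosts a single model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what llama.cpp does in one sentence."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the wire format matches the OpenAI API, swapping between a hosted model and a local one becomes a configuration change rather than a code change.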