
mistralai/mistral-inference

Mistral Small 3.1 matches Llama 3.3 70B on benchmarks while running 3x faster on the same hardware -- and mistral-inference is the official Python library that lets you run it locally with a single pip install. With 10.7K stars and backing from a company valued at over $6B, this is not a hobby project. It is the sanctioned way to run every open-weight Mistral model on your own GPUs, from the 7B base model all the way up to Mixtral 8x22B and the multimodal Pixtral Large.

The library covers an unusually broad surface for a "minimal" inference repo. You get text generation, multi-turn chat, function calling, fill-in-the-middle code completion (via Codestral), math reasoning (via Mathstral), and vision understanding (via Pixtral and Mistral Small 3.1) -- all through a consistent Python API. Multi-GPU inference works out of the box with torchrun, no orchestration framework required. The architecture leans on xformers for efficient attention, which means you need a CUDA GPU to even install the package -- no CPU-only fallback here.

Getting started takes about two minutes: pip install mistral-inference, download a model from Hugging Face, and you are generating tokens. The CLI includes an interactive chat mode that is surprisingly useful for quick experiments. For programmatic use, the Python API exposes a clean generate method that handles tokenization, sampling, and structured output without the bloat of larger inference frameworks like vLLM or TGI.

The project ships under the Apache 2.0 license, so there are zero legal surprises for commercial deployments. Development follows Mistral AI's model release cadence -- v1.6.0 landed in March 2025 with Mistral Small 3.1 vision support, and new model families get added within days of their public release. If you are building on Mistral models and want a dependency that tracks upstream perfectly, this is the only library maintained by the same team that trains the weights.
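The two-minute path looks roughly like this as shell commands. The model repo and local directory are illustrative, and the CLI name and flags follow the project README, so they may differ across versions:

```shell
# Install the library (requires a CUDA GPU -- xformers has no CPU-only build)
pip install mistral-inference

# Fetch instruct weights from Hugging Face (repo and path are examples)
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
  --local-dir ~/mistral_models/7B-Instruct-v0.3

# Start the interactive chat CLI against the downloaded weights
mistral-chat ~/mistral_models/7B-Instruct-v0.3 --instruct --max_tokens 256
```

Note that mistral-inference expects the raw consolidated checkpoint format (params.json, tokenizer file, consolidated weights) rather than a Transformers-format export, so download from a repo that ships those files.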
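For the programmatic route, a sketch of the generate API, with imports and call signatures as shown in the project README for recent versions. The model directory is a placeholder, and actually running this needs downloaded weights plus a CUDA GPU:

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

model_path = "~/mistral_models/7B-Instruct-v0.3"  # placeholder path

# Load the tokenizer and model weights from the downloaded checkpoint.
tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model.v3")
model = Transformer.from_folder(model_path)

# Apply the correct chat template and encode the conversation.
request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain KV caching in one sentence.")]
)
tokens = tokenizer.encode_chat_completion(request).tokens

# Generate, then decode the completion back to text.
out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=128,
    temperature=0.35,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```

The tokenizer handles the chat template, so you never hand-assemble instruction tokens -- which is exactly the class of per-model detail this library gets right.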

llm inference
Python

Why It Matters

If you are running Mistral models locally, you have two real options: use a general-purpose inference server like vLLM, or use the library built by the people who trained the models. mistral-inference is that second option, and it matters because model-specific optimizations -- tokenizer quirks, chat templates, function calling schemas, vision preprocessing -- are always correct on day one. No waiting for third-party libraries to catch up after a new release.

The broader context is that Mistral AI is Europe's leading open-weight model provider, and their Small 3.1 model hits 81% on MMLU while fitting on a single GPU. For teams that need on-premise inference without the infrastructure complexity of distributed serving, mistral-inference is the shortest path from zero to production. It does one thing -- run Mistral models correctly -- and it does it without the configuration overhead of heavier frameworks.
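Stripped of framework machinery, generation in any such library reduces to encode, forward pass, sample next token, repeat. The sampling step is ordinary temperature-scaled softmax sampling; a generic, self-contained illustration of that step (not code from mistral-inference):

```python
import math
import random


def sample_token(logits, temperature=0.7, seed=None):
    """Pick a token index from raw logits via temperature sampling."""
    if temperature == 0.0:
        # Greedy decoding: deterministically take the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = random.Random(seed)
    # Scale logits, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Draw from the resulting categorical distribution.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if acc >= r:
            return i
    return len(probs) - 1


# Greedy decoding always returns the argmax index.
print(sample_token([1.0, 3.5, 0.2], temperature=0.0))  # → 1
```

Lower temperatures sharpen the distribution toward the argmax; at 0 the loop degenerates into greedy decoding, which is why temperature 0 gives reproducible output.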

Repository Stats

Stars: 10.7k
Forks: 1.0k
Last Commit: 2/26/2026
