datalab-to/chandra
Chandra 2 is the current state-of-the-art OCR model, scoring 85.9% on the olmOCR benchmark, the highest of any open model. It converts images and PDFs into structured Markdown, HTML, or JSON while preserving the full layout: tables, forms, handwriting, equations, code blocks, checkboxes, and image captions.

At 4B parameters, the model is less than half the size of the 9B Chandra 1, yet it is more accurate in every category. On a 43-language multilingual benchmark, Chandra 2 averages 77.8%, with particularly large gains in South Asian scripts: Bengali +27.2, Kannada +42.6, Malayalam +46.2, Tamil +26.9, and Telugu +39.1 percentage points over the previous version.

Installation is a single pip command. You can run Chandra as a CLI tool, as a web interface with Streamlit, or deploy it behind a vLLM server for production throughput; on an H100 GPU it processes approximately 1.44 pages per second.

Chandra handles the messy documents that break other OCR tools: handwritten notes with mixed print text, tax forms with nested tables, academic papers with inline equations and figure captions, and medical records with checkboxes and signatures. The model outputs clean, structured data regardless of input complexity.

With 7,600+ GitHub stars and support for 90+ languages, Chandra has become the default recommendation for developers who need production-grade document extraction without relying on proprietary APIs such as Google Document AI or Azure Form Recognizer.

Related: if you are building AI-powered document workflows, pair Chandra with tools like Claude for downstream reasoning on extracted text, or use it alongside frontier models in security-sensitive document processing pipelines.
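The vLLM deployment path mentioned above exposes an OpenAI-compatible HTTP API, so a client only needs to send a chat completion request with the page image attached. Below is a minimal Python sketch of building such a request. The served model id (`datalab-to/chandra`) and the prompt text are assumptions for illustration; check the repo README for the exact serving instructions.

```python
import base64


def build_ocr_request(
    image_bytes: bytes,
    model: str = "datalab-to/chandra",  # assumed model id, not confirmed by the repo
    prompt: str = "Convert this page to Markdown.",  # assumed prompt for illustration
) -> dict:
    """Build an OpenAI-compatible chat payload for a vLLM server.

    The image is embedded as a base64 data URL, the standard way to pass
    images to OpenAI-compatible multimodal chat endpoints.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }


# Placeholder bytes for illustration; in practice, read a real page image.
payload = build_ocr_request(b"\x89PNG fake bytes")
# POST this payload as JSON to http://localhost:8000/v1/chat/completions
```

The response's message content would then contain the extracted Markdown, ready to hand off to a downstream model for reasoning.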
Why It Matters
Chandra 2 replaces a 9B model with a 4B one that scores higher on every benchmark. For developers processing documents at scale, this means better accuracy at lower compute cost. The 90+ language support and layout preservation make it the first truly universal open-source OCR solution.