google/langextract

Google's open-source Python library for extracting structured information from unstructured text using LLMs, with precise source grounding that maps every extracted entity back to its exact character-level offset in the source document. Built on Gemini with support for OpenAI, Ollama, and Vertex AI backends. Handles long documents via chunking and parallel processing, and generates interactive HTML visualizations for reviewing thousands of annotations.

data engineering
Python

Why It Matters

Many LLM extraction pipelines can hallucinate quietly: they produce structured output, but you cannot verify which part of the source document supports each extraction. LangExtract addresses this with source grounding. Every extracted entity is linked to its exact character position in the original text, which gives the traceability that regulated domains such as healthcare, legal, and finance require. The interactive HTML visualization lets you review and audit extractions against the source in a single file, with no additional tooling. Released by Google under the Apache 2.0 license, it is not tied to Gemini: OpenAI, Ollama, and Vertex AI backends are also supported, and running local models through Ollama makes it suitable for privacy-sensitive on-premises deployments.
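
To make the idea of source grounding concrete, here is a minimal standard-library sketch. It is not LangExtract's API or its alignment algorithm: the names `GroundedExtraction`, `ground`, and `is_supported` are hypothetical, and exact substring matching stands in for whatever alignment the library actually performs. It only illustrates how a character span lets a reviewer check an extraction against the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    """Hypothetical record: an extracted entity plus its supporting span."""
    label: str
    text: str
    start: Optional[int]  # inclusive character offset, None if ungrounded
    end: Optional[int]    # exclusive character offset, None if ungrounded


def ground(source: str, label: str, text: str) -> GroundedExtraction:
    """Attach character offsets by exact substring match (an assumption,
    chosen for simplicity; real aligners may tolerate small differences)."""
    start = source.find(text)
    if start == -1:
        # The model produced text that does not occur in the source.
        return GroundedExtraction(label, text, None, None)
    return GroundedExtraction(label, text, start, start + len(text))


def is_supported(source: str, item: GroundedExtraction) -> bool:
    """An extraction is supported only if its span reproduces its text."""
    if item.start is None or item.end is None:
        return False
    return source[item.start:item.end] == item.text


source = "Patient was given 250 mg of amoxicillin twice daily."
# Pretend these came back from an LLM; the last one is a hallucination.
candidates = [
    ("dosage", "250 mg"),
    ("medication", "amoxicillin"),
    ("medication", "ibuprofen"),
]
results = [ground(source, label, text) for label, text in candidates]

for item in results:
    status = "grounded" if is_supported(source, item) else "UNSUPPORTED"
    print(f"{item.label:<10} {item.text!r:<14} span=({item.start}, {item.end}) {status}")
```

In this sketch the hallucinated "ibuprofen" has no span and is flagged as unsupported, while the other two extractions can be checked by slicing the source at their offsets. That slice-and-compare step is what makes an extraction auditable.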

Repository Stats

Stars: 34.7k
Forks: 2.3k
Last Commit: 2/25/2026
