Announcing Kreuzberg v2.0: A Lightweight, Modern Python Text Extraction Library

## 🔍 What’s Kreuzberg?
Kreuzberg is a Python library that provides a unified async/sync interface for extracting text from PDFs, images, Office documents, and more.
## Key Features
- Async First: Asynchronous by design, using anyio worker processes.
- Minimal Dependencies: A much smaller footprint than the alternatives.
- Serverless and Docker Ready: Well suited to serverless functions and containerized deployments.
- Local Processing: All processing is done locally, with no API calls or cloud services.
- Modern Python: Built for Python 3.9+ with rigorous typing and extensive testing.
- Versatile: Supports various formats, including PDFs, spreadsheets, Markdown, LaTeX, and more.
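
To make the "unified async/sync interface" idea concrete, here is a minimal sketch of the general pattern. It is not Kreuzberg's actual API or implementation: the names `extract_text`, `extract_text_sync`, and `_read_text_blocking` are hypothetical, the "extractor" only reads a UTF-8 text file, and it uses a standard-library worker thread (`asyncio.to_thread`) in place of anyio worker processes so that it runs without extra dependencies.

```python
import asyncio
from pathlib import Path


def _read_text_blocking(path: Path) -> str:
    """Hypothetical stand-in for a blocking extractor: it only reads UTF-8 text."""
    return path.read_text(encoding="utf-8")


async def extract_text(path: Path) -> str:
    """Async entry point: run the blocking work in a worker thread."""
    return await asyncio.to_thread(_read_text_blocking, path)


def extract_text_sync(path: Path) -> str:
    """Sync entry point: drive the async function to completion."""
    return asyncio.run(extract_text(path))
```

From a plain script you would call `extract_text_sync(Path("notes.txt"))`. Inside an already running event loop you would `await extract_text(...)` instead, because `asyncio.run` cannot be called from within a running loop.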
## 🚀 What’s New in Version 2.0?
Kreuzberg v2.0 brings significant enhancements to performance, usability, and feature set. Here’s what’s new:
- Sync APIs: Kreuzberg now supports synchronous extraction methods alongside async extraction.
- Batch Processing: Efficiently process multiple files or byte streams in parallel.
- Smart PDF Handling: Automatically falls back to OCR when direct text extraction fails.
- Metadata Extraction: Retrieve metadata such as document titles or creators using Pandoc.
- Excel Multi-Sheet Support: Handle workbooks with multiple sheets.
- Enhanced Performance: Worker processes for faster, resource-efficient extraction.
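
The batch processing and OCR fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Kreuzberg's code: `extract_with_fallback`, `batch_extract`, `fake_direct`, and `fake_ocr` are all hypothetical names, the two "extractors" are toy stand-ins, and `asyncio.gather` gives concurrency on one event loop rather than true multi-process parallelism.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# An extractor takes raw bytes and returns text asynchronously.
Extractor = Callable[[bytes], Awaitable[str]]


async def extract_with_fallback(data: bytes, direct: Extractor, ocr: Extractor) -> str:
    """Try direct extraction; fall back to OCR if it raises or yields no text."""
    try:
        text = await direct(data)
    except ValueError:
        text = ""
    if text.strip():
        return text
    return await ocr(data)


async def batch_extract(items: Sequence[bytes], direct: Extractor, ocr: Extractor) -> List[str]:
    """Process all items concurrently; results keep the input order."""
    tasks = (extract_with_fallback(item, direct, ocr) for item in items)
    return list(await asyncio.gather(*tasks))


async def fake_direct(data: bytes) -> str:
    """Toy stand-in: only inputs starting with b'TEXT:' have a text layer."""
    return data[5:].decode() if data.startswith(b"TEXT:") else ""


async def fake_ocr(data: bytes) -> str:
    """Toy stand-in for OCR: reports the input size instead of recognizing text."""
    return "ocr:" + str(len(data))
```

Passing the extractors in as arguments keeps the fallback logic independent of any particular PDF or OCR backend, so real implementations could be swapped in without touching `batch_extract`.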
## 🎯 Who’s It For?
Kreuzberg is ideal for developers building:
- Retrieval-Augmented Generation (RAG) systems
- LLM-powered applications
- Document indexing, analysis, and automation tools
Kreuzberg is a great choice if you’re looking for a lightweight, efficient solution for text extraction.
## ⚖️ How Kreuzberg Compares
Here’s how Kreuzberg stacks up against alternatives:
1. Python OSS Libraries
   - Unstructured.io: Feature-rich but heavy, making it unsuitable for serverless or low-resource environments.
   - Docling: Another strong alternative, but larger and heavier, so it is better suited for high-volume, GPU-based workloads.
2. Non-Python OSS Libraries
   - Apache Tika: Requires a Java server running as a sidecar, with Python client libraries available.
   - Grobid: Excellent for structured research text extraction but comes with a ~20GB Docker image.
3. Commercial APIs
   - Paid solutions like Azure Document Intelligence or AWS Textract offer best-in-class OCR and layout extraction. However, they cost money and depend on cloud services, whereas Kreuzberg runs entirely locally.
## Starring ⭐ is Caring
If Kreuzberg sounds like the library you’ve been looking for, check it out on GitHub.
Please star the repo ⭐ — it helps others discover the project and motivates me to keep improving it!