Announcing Kreuzberg v2.0: A Lightweight, Modern Python Text Extraction Library

Na'aman Hirschfeld
Feb 15, 2025


🔍 What’s Kreuzberg?

Kreuzberg is a Python library that provides a unified async/sync interface for extracting text from PDFs, images, Office documents, and more.
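To make the idea of a unified async/sync interface concrete, here is a minimal sketch of the general pattern using only the standard library. The names `extract_text` and `extract_text_sync` are hypothetical stand-ins and not Kreuzberg's actual API. The "extraction" here simply decodes bytes, and the sketch uses a thread via `asyncio.to_thread` rather than anyio worker processes.

```python
import asyncio


def _decode_blocking(data: bytes) -> str:
    # Placeholder for real, potentially slow extraction work (hypothetical).
    return data.decode("utf-8")


async def extract_text(data: bytes) -> str:
    # Async entry point: run the blocking work off the event loop thread.
    return await asyncio.to_thread(_decode_blocking, data)


def extract_text_sync(data: bytes) -> str:
    # Sync entry point: drive the async implementation to completion.
    return asyncio.run(extract_text(data))
```

With this shape, async callers `await extract_text(...)` while scripts and notebooks without an event loop call `extract_text_sync(...)`. Note that a wrapper built on `asyncio.run` cannot be called from inside an already running event loop.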

Key Features

Async First: An async-first design that offloads work to anyio worker processes.

Minimal Dependencies: Much smaller footprint compared to alternatives.

Serverless and Docker Ready: Well suited to serverless functions and containerized deployments.

Local Processing: All processing is done locally, with no API calls or cloud services.

Modern Python: Built for Python 3.9+ with rigorous typing and extensive testing.

Versatile: Supports various formats, including PDFs, spreadsheets, Markdown, LaTeX, and more.

🚀 What’s New in Version 2.0?

Kreuzberg v2.0 brings significant enhancements to performance, usability, and feature set. Here’s what’s new:

Sync APIs: Kreuzberg now offers synchronous extraction methods alongside the async ones.

Batch Processing: Efficiently process multiple files or byte streams in parallel.

Smart PDF Handling: Automatically fall back to OCR when direct text extraction fails.

Metadata Extraction: Retrieve metadata like document titles or creators using Pandoc.

Excel Multi-Sheet Support: Extract text from workbooks with multiple sheets, including complex spreadsheets.

Enhanced Performance: Worker processes for faster, resource-efficient extraction.
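As an illustration of how batch processing and the OCR fallback could fit together, here is a minimal sketch built on `asyncio` alone. Everything in it is an assumption made for illustration: `extract_with_fallback`, `batch_extract`, and the `direct` and `ocr` callables are hypothetical names, not Kreuzberg's API, and treating "empty text or a `ValueError`" as a failed direct extraction is just one plausible rule. `asyncio.gather` runs the items concurrently on one event loop, whereas Kreuzberg itself uses worker processes.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# An extractor takes raw bytes and asynchronously returns text (hypothetical type).
Extractor = Callable[[bytes], Awaitable[str]]


async def extract_with_fallback(data: bytes, direct: Extractor, ocr: Extractor) -> str:
    # Try the cheap direct path first.
    try:
        text = await direct(data)
    except ValueError:
        # Assumed failure signal for "no usable text layer".
        text = ""
    # Fall back to OCR when direct extraction raised or produced only whitespace.
    if text.strip():
        return text
    return await ocr(data)


async def batch_extract(
    items: Sequence[bytes], direct: Extractor, ocr: Extractor
) -> list[str]:
    # gather() schedules all items concurrently and preserves input order.
    return await asyncio.gather(
        *(extract_with_fallback(item, direct, ocr) for item in items)
    )
```

Passing the extractors in as parameters keeps the fallback logic independent of any particular PDF or OCR backend, which also makes it easy to exercise with stub functions.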

## 🎯 Who’s It For?

Kreuzberg is ideal for developers building:

  • Retrieval-Augmented Generation (RAG) systems
  • LLM-powered applications
  • Document indexing, analysis, and automation tools

Kreuzberg is a great choice if you’re looking for a lightweight, efficient solution for text extraction.

⚖️ How Kreuzberg Compares

Here’s how Kreuzberg stacks up against alternatives:

1. Python OSS Libraries

Unstructured.io: Feature-rich but heavy, making it unsuitable for serverless or low-resource environments.

Docling: Another strong alternative, but larger and heavier; it is better suited to high-volume, GPU-based workloads.

2. Non-Python OSS Libraries

Apache Tika: Requires a Java server running as a sidecar, with Python client libraries available.

Grobid: Excellent for structured research text extraction but comes with a ~20GB Docker image.

3. Commercial APIs

Paid solutions like Azure Document Intelligence or AWS Textract offer best-in-class OCR and layout extraction. However, they raise pricing concerns and tie you to a cloud service, whereas Kreuzberg does all processing locally.

## Starring ⭐ is Caring

If Kreuzberg sounds like the library you’ve been looking for, check it out on GitHub.

Please star the repo ⭐ — it helps others discover the project and motivates me to keep improving it!
