PDF Extract API

This open-source, self-hosted API converts PDFs, images, and Office documents to Markdown or structured JSON. It combines local OCR with LLMs served through Ollama to improve accuracy, and it keeps data private by running without external cloud dependencies.

Visit github.com →

Questions & Answers

What is PDF Extract API?
PDF Extract API is an open-source, self-hosted solution that converts images, PDFs, and Office documents into Markdown text or structured JSON. It uses local OCR and LLMs to achieve high accuracy in text and data extraction.
Who is the target user for this API?
This API is suitable for developers and organizations that require a private and self-contained document processing solution. It's ideal for scenarios where data privacy is critical, as it operates without external cloud dependencies.
How does this API ensure data privacy compared to other services?
It ensures data privacy by operating entirely locally; no data is sent to external cloud services. It integrates PyTorch-based OCR (EasyOCR) and Ollama for LLM processing directly within your environment, configurable via Docker Compose.
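To illustrate what "entirely local" LLM processing can look like, here is a minimal sketch that sends OCR output to an Ollama server on the same machine. It is not taken from the project's code: the function names, the prompt wording, and the model name `llama3.1` are assumptions for illustration. The only external fact relied on is Ollama's default local endpoint, `http://localhost:11434/api/generate`, which returns the generated text in a `response` field when `stream` is false.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_cleanup_request(ocr_text: str, model: str = "llama3.1") -> dict:
    """Build the JSON payload for Ollama's /api/generate endpoint.

    The prompt wording and default model are hypothetical choices.
    """
    prompt = (
        "Fix OCR errors in the text below and return it as Markdown. "
        "Do not add information that is not present.\n\n" + ocr_text
    )
    # stream=False asks Ollama for a single JSON object instead of chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def clean_with_local_llm(ocr_text: str, model: str = "llama3.1") -> str:
    """Send OCR text to the local Ollama server and return the cleaned text."""
    payload = json.dumps(build_cleanup_request(ocr_text, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `clean_with_local_llm(raw_text)` requires a running Ollama instance with the chosen model already pulled; `build_cleanup_request` can be inspected on its own without any server.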
When would one choose to use this PDF extraction tool?
One would choose this tool when high-accuracy conversion of documents to Markdown or JSON is needed, especially if sensitive data is involved. It is also beneficial for improving OCR results with LLMs or removing PII, all within a self-controlled environment.
What are some of the key technical components of this API?
The API is built with FastAPI and uses Celery for asynchronous task processing, so extraction jobs run through a distributed task queue. Redis caches OCR results, and the API supports switchable storage strategies.
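The idea of caching OCR results can be sketched as follows. This is an assumption-laden illustration, not the project's implementation: the key scheme, the function names, and the use of a plain `dict` in place of a Redis client are all hypothetical. The point is only that identical file bytes processed with the same OCR strategy can skip the expensive OCR step.

```python
import hashlib
from typing import Callable, MutableMapping


def cache_key(file_bytes: bytes, strategy: str) -> str:
    """Derive a cache key from the file content and the OCR strategy name."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return f"ocr:{strategy}:{digest}"  # hypothetical key layout


def ocr_with_cache(
    file_bytes: bytes,
    strategy: str,
    run_ocr: Callable[[bytes], str],
    cache: MutableMapping[str, str],
) -> str:
    """Return cached OCR text when available, otherwise run OCR and store it.

    `cache` is any mapping; a dict stands in here for a Redis-backed store.
    """
    key = cache_key(file_bytes, strategy)
    if key in cache:
        return cache[key]
    text = run_ocr(file_bytes)
    cache[key] = text
    return text
```

In a real deployment the mapping would be replaced by a Redis client with an expiry policy, and `run_ocr` would typically be invoked inside a Celery task rather than in the request handler.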