Welcome!
A Flask-based web application that intelligently extracts text from PDF files. It automatically determines whether the PDF contains selectable text or is a scanned document, using PyMuPDF for direct text extraction and Tesseract OCR for scanned images.
🚀 Key Features
- Smart Text Extraction - Automatically detects if a PDF has selectable text or needs OCR
- Multiple Extraction Methods:
    - Direct text extraction using PyMuPDF for standard PDFs
    - OCR processing with Tesseract for scanned documents
- Multiple Deployment Options:
    - Full stack deployment with Docker Compose (recommended)
    - Local installation
- Multi-architecture Support - Docker images built for amd64, arm64, and arm/v7
- Web Interface and API Access - Upload PDFs through a browser
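The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the `needs_ocr` and `extract_text` helpers, the `MIN_CHARS_PER_PAGE` threshold, and the `"pymupdf"` method label are all assumptions made for the example. It also simplifies by deciding once for the whole document, so a PDF that mixes text pages and scanned pages would need a per-page decision instead.

```python
from typing import Iterable, List

# Hypothetical threshold: a page with fewer non-whitespace characters than
# this is treated as having no usable text layer.
MIN_CHARS_PER_PAGE = 20


def needs_ocr(page_texts: Iterable[str], min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when no page carries enough selectable text."""
    for text in page_texts:
        # Count characters after removing all whitespace.
        if len("".join(text.split())) >= min_chars:
            return False
    return True


def extract_text(pdf_path: str) -> dict:
    """Extract text directly when possible, otherwise fall back to OCR."""
    # Imported lazily so needs_ocr() can be used without these packages.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page_texts: List[str] = [page.get_text() for page in doc]
        if not needs_ocr(page_texts):
            return {"method": "pymupdf", "text": "\n".join(page_texts)}

        import pytesseract
        from PIL import Image

        ocr_texts = []
        for page in doc:
            # Render the page to an RGB raster image, then run Tesseract on it.
            pix = page.get_pixmap(dpi=300)
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            ocr_texts.append(pytesseract.image_to_string(image))
        return {"method": "tesseract", "text": "\n".join(ocr_texts)}
```

Keeping the decision in a small pure function makes the threshold easy to tune and test without needing a PDF or an OCR engine at hand.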
📦 Installation
The recommended way to run the application is with Docker Compose, which sets up all necessary services (Flask, Redis, PostgreSQL) in a single command.
- Download the docker-compose.yml file.
- Optionally, create a .env file. Default values will be used if it is not provided.
- Start the services.
- Open your browser and navigate to http://localhost:8080
If you prefer to run the application locally without Docker, you can clone the repository and install the required dependencies.
Prerequisites
Depending on your system, you may need to install the following dependencies:
- Python 3.8 or higher
- pip
- Tesseract OCR (for OCR processing)
- SQLite (for local database storage)
Install Dependencies
To run the application locally, follow these steps:
- Clone the repository.
- Install the required dependencies.
- Start the Flask application on port 8080.
- Open your browser and navigate to http://localhost:8080
If you prefer to run the application directly with Docker:
- Pull the latest image.
- Run the container.
- Open your browser and navigate to http://localhost:8080
📚 Documentation
Looking for more detailed information? Check out these guides:
- Running with Docker Compose - Full stack deployment with Docker
- Installation Guide - Installing locally
- API Reference - Using the REST API
- Troubleshooting - Common issues and solutions
🛠️ Usage
🌐 Web Interface
- Access the web interface at http://localhost:8080
- Drag and drop PDF files or click to select files
- View extracted text in the browser
🔗 API
Use the API to extract text programmatically. Uploading a PDF returns a task reference:
{
  "status": "processing",
  "task_id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
  "filename": "example.pdf"
}
To view the result, use the task_id:
{
  "text": "Extracted text from the PDF here...",
  "status": "success",
  "method": "tesseract",
  "filename": "example.pdf",
  "datetime": "2025-03-21T12:34:56.789012+00:00",
  "duration_ms": 12.3
}
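The two responses above suggest an asynchronous flow: the upload returns a task_id while the status is "processing", and the result is fetched later. A client might poll for the result along these lines. This is only a sketch under that assumption. `wait_for_result` and its `fetch_status` callable are hypothetical names: `fetch_status` stands in for whatever HTTP request the API reference describes, and no endpoint path is assumed here.

```python
import time
from typing import Callable, Dict


def wait_for_result(
    fetch_status: Callable[[str], Dict],
    task_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> Dict:
    """Poll until the task leaves the "processing" state or the timeout expires.

    fetch_status is a caller-supplied (hypothetical) function that takes a
    task_id and returns the decoded JSON response as a dict.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch_status(task_id)
        # Any status other than "processing" is treated as final.
        if payload.get("status") != "processing":
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still processing after {timeout_s}s")
        time.sleep(interval_s)
```

Passing the HTTP call in as a function keeps the polling logic independent of any particular client library and makes it straightforward to test with a stub.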
The full API documentation is available here: API Documentation.
📋 Job History
The application maintains a history of processing jobs, which you can view at http://localhost:8080/jobs.
📝 License
© 2025 Kaj Kowalski. All Rights Reserved.
This software and associated documentation files are proprietary. Private use is permitted without restrictions. For commercial use, distribution, or modification, prior written approval from the owner is required.