Welcome!

A Flask-based web application that intelligently extracts text from PDF files. It automatically determines whether the PDF contains selectable text or is a scanned document, using PyMuPDF for direct text extraction and Tesseract OCR for scanned images.

🚀 Key Features

- Smart Text Extraction - Automatically detects whether a PDF has selectable text or needs OCR
- Multiple Extraction Methods:
    - Direct text extraction using PyMuPDF for standard PDFs
    - OCR processing with Tesseract for scanned documents
- Multiple Deployment Options:
    - Full stack deployment with Docker Compose (recommended)
    - Local installation
- Multi-architecture Support - Docker images built for amd64, arm64, and arm/v7
- Web Interface and API Access - Upload PDFs through a browser or programmatically via the REST API
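
The detection step could look roughly like the sketch below. It is not taken from this project's source: the character threshold, the function names, and the `"pymupdf"` label are assumptions made for illustration (only `"tesseract"` appears as a `method` value in the API example further down). The idea is to read each page's text layer with PyMuPDF and fall back to OCR when almost nothing comes back.

```python
from typing import List


def choose_method(page_texts: List[str], min_chars_per_page: int = 20) -> str:
    """Return "pymupdf" if the PDF seems to have a text layer, else "tesseract".

    page_texts holds the text read directly from each page.
    min_chars_per_page is an assumed threshold, not a value from this project.
    """
    if not page_texts:
        return "tesseract"
    total_chars = sum(len(text.strip()) for text in page_texts)
    average = total_chars / len(page_texts)
    return "pymupdf" if average >= min_chars_per_page else "tesseract"


def read_page_texts(pdf_path: str) -> List[str]:
    """Read the text layer of every page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF; only needed when a real file is read

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

With a real file, `choose_method(read_page_texts("example.pdf"))` would return which path to take. A scanned PDF typically yields empty strings for every page, so it lands on the OCR branch.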

📦 Installation

Installation options: Docker Compose, Clone and Run Locally, Direct Docker

The recommended way to run the application is with Docker Compose, which sets up all necessary services (Flask, Redis, PostgreSQL) in a single command.

Download the docker-compose.yml file:

wget https://raw.githubusercontent.com/kjanat/pdf-extract-with-ocr/docker/docker-compose.yml

Create a .env file (optional):

.env

# API configuration
API_PORT=8080
UPLOADS_DIR=./uploads

# Database configuration
POSTGRES_USER=ocruser
POSTGRES_PASSWORD=ocrpass
POSTGRES_DB=ocr
DATABASE_URL=postgresql://ocruser:ocrpass@db:5432/ocr

# Celery configuration
CELERY_BROKER_URL=redis://redis:6379/0

Start the services:
```
docker-compose up -d
```
Open your browser and navigate to http://localhost:8080

Note: any variable not set in the .env file falls back to its default value.
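
In the example .env, `DATABASE_URL` repeats the values of `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `POSTGRES_DB`, so changing one without the other would likely break the database connection. The sketch below splits the URL into its parts with the standard library to make that relationship visible. The helper name `parse_database_url` is hypothetical, and reading the `db` host as the Compose service name is an assumption.

```python
from urllib.parse import urlsplit


def parse_database_url(url: str) -> dict:
    """Split a postgresql:// URL into the pieces the .env file also defines."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,      # should match POSTGRES_USER
        "password": parts.password,  # should match POSTGRES_PASSWORD
        "host": parts.hostname,      # assumed to be the Compose service name
        "port": parts.port,
        "database": parts.path.lstrip("/"),  # should match POSTGRES_DB
    }


settings = parse_database_url("postgresql://ocruser:ocrpass@db:5432/ocr")
print(settings)
```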

If you prefer to run the application locally without Docker, you can clone the repository and install the required dependencies.

Prerequisites

Depending on your system, you may need to install the following dependencies:

Python 3.8 or higher
pip
Tesseract OCR (for OCR processing)
SQLite (for local database storage)

Install Dependencies

Linux (Debian/Ubuntu)

sudo apt-get install -y \
    tesseract-ocr \
    redis-server \
    sqlite3

Windows (PowerShell, launch as Administrator)

'tesseract-ocr.tesseract', 'SQLite.SQLite' | 
    % { winget install --id=$_ }

macOS (Homebrew)

brew install \
    tesseract \
    redis \
    sqlite
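
After installing, a quick check that the prerequisites are visible to Python can save a failed first run. The sketch below uses only the standard library. The binary names `tesseract` and `sqlite3` are assumptions about what the packages above put on your PATH and may differ by platform.

```python
import shutil
import sys


def check_prerequisites(binaries=("tesseract", "sqlite3")) -> dict:
    """Report whether Python is new enough and each binary is on the PATH."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    for name in binaries:
        # shutil.which returns None when the executable cannot be found
        report[name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"{item}: {'found' if ok else 'MISSING'}")
```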

Clone the repository:

git clone https://github.com/kjanat/pdf-extract-with-ocr.git
cd pdf-extract-with-ocr

Install the required dependencies:
```
pip install -r requirements.txt
```
Start the Flask application on port 8080:
```
FLASK_RUN_PORT=8080 flask run
```
Open your browser and navigate to http://localhost:8080

If you prefer to run the application directly with Docker:

Pull the latest image:

docker pull kjanat/pdf-extract-with-ocr:latest

Run the container:

docker run -d -p 8080:80 -e IS_DOCKER_CONTAINER=true kjanat/pdf-extract-with-ocr:latest

Open your browser and navigate to http://localhost:8080

📚 Documentation

Looking for more detailed information? Check out these guides:

Running with Docker Compose - Full stack deployment with Docker
Installation Guide - Installing locally
API Reference - Using the REST API
Troubleshooting - Common issues and solutions

🛠️ Usage

🌐 Web Interface

Access the web interface at http://localhost:8080
Drag and drop PDF files or click to select files
View extracted text in the browser

🔗 API

Use the API to extract text programmatically:

Upload PDF

curl -X POST -F file=@path/to/your/file.pdf http://localhost:8080/upload

Response example

{
  "status": "processing",
  "task_id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
  "filename": "example.pdf"
}
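
The same upload can be done from Python without third-party packages. The sketch below builds the multipart/form-data body that `curl -F file=@...` sends and posts it to `/upload`. The function names are hypothetical, and the server is assumed to accept `application/pdf` as the part's content type.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path: str, base_url: str = "http://localhost:8080") -> dict:
    """POST the PDF to /upload and return the decoded JSON response."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(
            "file", os.path.basename(path), handle.read()
        )
    request = urllib.request.Request(
        f"{base_url}/upload",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the service running, `upload_pdf("path/to/your/file.pdf")["task_id"]` would give the identifier used to fetch the result.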

To view the result, use the task_id:

View result

curl http://localhost:8080/api/result/a1b2c3d4-e5f6-7890-abcd-1234567890ab

Response example

{
  "text": "Extracted text from the PDF here...",
  "status": "success",
  "method": "tesseract",
  "filename": "example.pdf",
  "datetime": "2025-03-21T12:34:56.789012+00:00",
  "duration_ms": 12.3
}

The full API documentation is available here: API Documentation.
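
Because processing is asynchronous, a client usually has to poll the result endpoint until the job finishes. The sketch below does that with the standard library. It assumes the endpoint reports `"status": "processing"` while the job is still pending (as the upload response does), which may not match the actual API, and the function names, interval, and attempt limit are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_result(task_id: str, base_url: str = "http://localhost:8080") -> dict:
    """GET /api/result/<task_id> and return the decoded JSON."""
    with urllib.request.urlopen(f"{base_url}/api/result/{task_id}") as response:
        return json.load(response)


def wait_for_result(
    task_id: str,
    fetch: Callable[[str], dict] = fetch_result,
    interval_s: float = 1.0,
    max_attempts: int = 30,
) -> Optional[dict]:
    """Poll until the status is no longer "processing"; None if it never is."""
    for attempt in range(max_attempts):
        result = fetch(task_id)
        if result.get("status") != "processing":
            return result
        if attempt < max_attempts - 1:
            time.sleep(interval_s)  # wait before asking again
    return None
```

Passing `fetch` as a parameter keeps the polling loop testable without a running server. In normal use the default is enough: `wait_for_result(task_id)`.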

📋 Job History

The application maintains a history of processing jobs, which you can view at http://localhost:8080/jobs.

📝 License

This software and associated documentation files are proprietary. Private use is permitted without restrictions. For commercial use, distribution, or modification, prior written approval from the owner is required.