Welcome!

A Flask-based web application that intelligently extracts text from PDF files. It automatically determines whether the PDF contains selectable text or is a scanned document, using PyMuPDF for direct text extraction and Tesseract OCR for scanned images.

🚀 Key Features

  • Smart Text Extraction - Automatically detects if a PDF has selectable text or needs OCR
  • Multiple Extraction Methods:
    • Direct text extraction using PyMuPDF for standard PDFs
    • OCR processing with Tesseract for scanned documents
  • Multiple Deployment Options:
    • Full stack deployment with Docker Compose (recommended)
    • Local installation
  • Multi-architecture Support - Docker images built for: amd64, arm64, and arm/v7
  • Web Interface and API Access - Upload PDFs through a browser or the REST API
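
The detection step listed above could be sketched roughly as follows. This is an illustration, not the application's actual code: the function names, the 20-character threshold, the 300 dpi render resolution, and the "pymupdf" method label are all assumptions. It needs PyMuPDF (`fitz`), `pytesseract`, and Pillow installed to process a real file.

```python
import io


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when it yields almost no selectable text.

    The 20-character threshold is an assumed heuristic, not a project setting.
    """
    return len(page_text.strip()) < min_chars


def extract_text(pdf_path: str) -> dict:
    """Extract text page by page, falling back to OCR for scanned pages."""
    # Third-party imports stay local so needs_ocr() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    parts = []
    method = "pymupdf"  # assumed label; the docs only show "tesseract"
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to a PNG in memory and run Tesseract on it.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
                method = "tesseract"
            parts.append(text)
    return {"text": "\n".join(parts), "method": method}
```

Deciding per page rather than per file lets a mixed document (typed pages plus scanned attachments) use direct extraction wherever it works. The application itself may decide at the file level instead.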

📦 Installation

The recommended way to run the application is with Docker Compose, which sets up all necessary services (Flask, Redis, PostgreSQL) in a single command.

  1. Download the docker-compose.yml file:

    wget https://raw.githubusercontent.com/kjanat/pdf-extract-with-ocr/docker/docker-compose.yml
    
  2. Optionally, create a .env file (1):

    # .env
    # API configuration
    API_PORT=8080
    UPLOADS_DIR=./uploads
    
    # Database configuration
    POSTGRES_USER=ocruser
    POSTGRES_PASSWORD=ocrpass
    POSTGRES_DB=ocr
    DATABASE_URL=postgresql://ocruser:ocrpass@db:5432/ocr
    
    # Celery configuration
    CELERY_BROKER_URL=redis://redis:6379/0
    
  3. Start the services:

    docker-compose up -d
    
  4. Open your browser and navigate to http://localhost:8080

  (1) Default values are used for any variable that is not provided.
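
In the example .env, DATABASE_URL repeats the user, password, and database name from the POSTGRES_* variables, so the two presumably need to stay in sync when you change one of them. The sketch below checks that. The helper names are hypothetical, and the host `db` and port `5432` are simply taken from the example URL.

```python
from urllib.parse import quote


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def build_database_url(env: dict, host: str = "db", port: int = 5432) -> str:
    """Compose a PostgreSQL URL from the POSTGRES_* variables."""
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")  # escape special characters
    return f"postgresql://{user}:{password}@{host}:{port}/{env['POSTGRES_DB']}"


def database_url_matches(env: dict) -> bool:
    """True when DATABASE_URL agrees with the POSTGRES_* variables."""
    return env.get("DATABASE_URL") == build_database_url(env)
```

For example, `database_url_matches(parse_env(open(".env").read()))` reports whether the file is internally consistent before you start the stack.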

If you prefer to run the application locally without Docker, you can clone the repository and install the required dependencies.

Prerequisites

Depending on your system, you may need to install the following dependencies:

  • Python 3.8 or higher
  • pip
  • Tesseract OCR (for OCR processing)
  • SQLite (for local database storage)

Install Dependencies

Debian/Ubuntu:

    sudo apt-get install -y \
        tesseract-ocr \
        redis-server \
        sqlite3

PowerShell (launch as Administrator):

    'tesseract-ocr.tesseract', 'SQLite.SQLite' |
        % { winget install --id=$_ }

Homebrew:

    brew install \
        tesseract \
        redis \
        sqlite
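
To confirm the prerequisites are in place before installing the Python packages, a small check like the one below may help. It is only a sketch: the function name is hypothetical, and the executable names (`tesseract`, `redis-server`, `sqlite3`) are assumed from the package names above and may differ on your platform.

```python
import shutil
import sys

# Assumed executable names; adjust them for your platform if needed.
REQUIRED_TOOLS = ("tesseract", "redis-server", "sqlite3")


def missing_prerequisites(tools=REQUIRED_TOOLS, which=shutil.which, version=None):
    """Return a list of human-readable problems; empty means all good."""
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version < (3, 8):
        problems.append(f"Python 3.8+ required, found {version[0]}.{version[1]}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The `which` and `version` parameters exist only so the check can be exercised without touching the real system.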

With the prerequisites installed, set up and run the application:

  1. Clone the repository:

    git clone https://github.com/kjanat/pdf-extract-with-ocr.git
    cd pdf-extract-with-ocr
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    
  3. Start the Flask application on port 8080:

    FLASK_RUN_PORT=8080 flask run
    
  4. Open your browser and navigate to http://localhost:8080

If you prefer to run the application directly with Docker:

  1. Pull the latest image:

    docker pull kjanat/pdf-extract-with-ocr:latest
    
  2. Run the container:

    docker run -d -p 8080:80 -e IS_DOCKER_CONTAINER=true kjanat/pdf-extract-with-ocr:latest
    
  3. Open your browser and navigate to http://localhost:8080
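
The container may take a moment to start accepting connections. If you script the setup, a readiness check along these lines can wait for it. This is a sketch with hypothetical function names; the retry count and delay are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timed out, or similar


def wait_until_up(url="http://localhost:8080", attempts=30, delay=1.0,
                  probe=is_up, sleep=time.sleep) -> bool:
    """Probe the URL up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_up()` after `docker run` returns `True` once the web interface responds, or `False` if it never does within the allotted attempts.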

📚 Documentation

Looking for more detailed information? Check out these guides:

  • Running with Docker Compose - Full stack deployment with Docker
  • Installation Guide - Installing locally
  • API Reference - Using the REST API
  • Troubleshooting - Common issues and solutions

🛠️ Usage

🌐 Web Interface

  1. Access the web interface at http://localhost:8080
  2. Drag and drop PDF files or click to select files
  3. View extracted text in the browser

🔗 API

Use the API to extract text programmatically:

Upload PDF:

    curl -X POST -F file=@path/to/your/file.pdf http://localhost:8080/upload

Response example:

    {
      "status": "processing",
      "task_id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
      "filename": "example.pdf"
    }

To view the result, use the task_id:

View result:

    curl http://localhost:8080/api/result/a1b2c3d4-e5f6-7890-abcd-1234567890ab

Response example:

    {
      "text": "Extracted text from the PDF here...",
      "status": "success",
      "method": "tesseract",
      "filename": "example.pdf",
      "datetime": "2025-03-21T12:34:56.789012+00:00",
      "duration_ms": 12.3
    }
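
The same upload-then-poll flow can be scripted with the Python standard library alone. The sketch below uses the two endpoints and the `file` form field from the curl examples. The function names are hypothetical, and it assumes the result endpoint keeps returning `"status": "processing"` until the job finishes, which the examples above do not confirm.

```python
import json
import os
import time
import urllib.request
import uuid

BASE_URL = "http://localhost:8080"


def encode_multipart(field: str, filename: str, data: bytes):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path: str, base_url: str = BASE_URL) -> str:
    """POST the file to /upload and return the task_id from the response."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        f"{base_url}/upload", data=body,
        headers={"Content-Type": content_type}, method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["task_id"]


def fetch_result(task_id: str, base_url: str = BASE_URL,
                 attempts: int = 60, delay: float = 1.0) -> dict:
    """Poll /api/result/<task_id> until the status is no longer "processing"."""
    for _ in range(attempts):
        with urllib.request.urlopen(f"{base_url}/api/result/{task_id}") as response:
            result = json.load(response)
        # Assumption: any status other than "processing" is final.
        if result.get("status") != "processing":
            return result
        time.sleep(delay)
    raise TimeoutError(f"task {task_id} still processing after {attempts} attempts")
```

A typical call would be `fetch_result(upload_pdf("path/to/your/file.pdf"))["text"]`. If the server signals a pending job differently (for example with a non-200 status), the polling condition needs adjusting.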

For the complete list of endpoints, see the API Documentation.

📋 Job History

The application maintains a history of processing jobs, which you can view at http://localhost:8080/jobs.

📝 License

© 2025 Kaj Kowalski. All Rights Reserved.

This software and associated documentation files are proprietary. Private use is permitted without restrictions. For commercial use, distribution, or modification, prior written approval from the owner is required.