Running with Docker Compose

This guide explains how to run PDF Extract with OCR using Docker Compose, which is the recommended approach for deploying the full application stack.

Prerequisites

  • Docker installed on your system
  • Docker Compose installed on your system
  • Basic knowledge of Docker and Docker Compose

Quick Start

  1. Clone the repository or download the docker-compose.yml file
  2. Optionally, create a .env file in the same directory (see Environment Variables below). If no .env file is provided, the default values are used.
  3. Run the application:
docker-compose up -d

This will start all required services and make the API available at http://localhost:8080.
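
For example, a typical first run looks like the following. This is a sketch: replace <repository-url> with the project's GitHub repository URL, or skip the clone and place docker-compose.yml (and optionally .env) in a directory of your own.

# Clone the repository and enter it
git clone <repository-url> pdf-extract-with-ocr
cd pdf-extract-with-ocr

# Optional: start from the example environment file (shown under Reference files below)
cp .env.example .env

# Start the stack in the background and confirm the containers are up
docker-compose up -d
docker-compose ps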

Environment Variables

Create a .env file in the same directory as your docker-compose.yml with the following environment variables:

.env
# API configuration
API_PORT=8080                         # Port for the API service
UPLOADS_DIR=./uploads                 # Directory to store uploaded PDFs

# Database configuration (choose SQLite or PostgreSQL)
# SQLite is used for local development, while PostgreSQL is recommended for production.
# Uncomment the one you want to use and comment the other.
#
# SQLite:
# DATABASE_URL=sqlite:///local.db
#
# PostgreSQL:
# POSTGRES_USER=ocruser                 # PostgreSQL username
# POSTGRES_PASSWORD=ocrpass             # PostgreSQL password
# POSTGRES_DB=ocr                       # PostgreSQL database name
# DATABASE_URL=postgresql://ocruser:ocrpass@db:5432/ocr

# Celery configuration
CELERY_BROKER_URL=redis://redis:6379/0
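
After editing the .env file, you can check how the variables will be interpolated before starting anything by rendering the effective configuration:

# Print the fully resolved compose configuration, with values from .env substituted
docker-compose config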

Services Overview

The docker-compose.yml file defines four services:

API Service

The API service exposes the Flask web server that provides the REST API and web interface for uploading PDFs. It's available at http://localhost:8080 by default.

api:
  image: kjanat/pdf-extract-with-ocr:latest
  container_name: pdf-extract-api
  ports:
    - "${API_PORT:-8080}:80"
  env_file:
    - .env
  volumes:
    - ${UPLOADS_DIR:-./uploads}:/app/uploads
  # ...
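
Only the host side of the port mapping is configurable; inside the container the server always listens on port 80. As a sketch, assuming port 9090 is free on your host and API_PORT is not already set in your .env, you could move the API like this:

# Expose the API on host port 9090 instead of the default 8080
echo "API_PORT=9090" >> .env
docker-compose up -d
curl http://localhost:9090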

Worker Service

The worker service processes PDF files asynchronously using Celery. It performs the actual text extraction from PDFs.

worker:
  image: kjanat/pdf-extract-with-ocr:latest
  container_name: pdf-extract-celery-worker
  env_file:
    - .env
  volumes:
    - ${UPLOADS_DIR:-./uploads}:/app/uploads
  command: python -m celery -A celery_worker.celery worker --loglevel=info
  # ...

Redis Service

Redis is used as a message broker for Celery tasks.

redis:
  image: redis:latest
  container_name: pdf-extract-redis
  # ...
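
The reference compose file health-checks Redis with redis-cli ping; you can run the same check manually to confirm the broker is reachable:

# Should print PONG if Redis is healthy
docker-compose exec redis redis-cli ping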

Database Service

PostgreSQL database for storing extraction results and job history.

db:
  image: postgres:latest
  container_name: pdf-extract-postgres-db
  volumes:
    - postgres_data:/var/lib/postgresql/data
  env_file:
    - .env
  # ...
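
To inspect the stored data directly, you can open a psql session in the database container. This is a sketch using the example credentials from .env.example; adjust the user and database name to match your .env:

# Open an interactive psql shell inside the database container
docker-compose exec db psql -U ocruser -d ocr

# Or list the tables non-interactively
docker-compose exec db psql -U ocruser -d ocr -c '\dt'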

Architecture Support

The Docker images are built for multiple architectures:

  • linux/amd64 (x86_64)
  • linux/arm64 (aarch64)

Docker will automatically pull the correct image for your system architecture.
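
If you want to confirm which variant was pulled, you can compare your host architecture with the OS/architecture recorded in the local image:

# Show the host architecture (x86_64 or aarch64)
uname -m

# Show the OS/architecture of the pulled image
docker image inspect kjanat/pdf-extract-with-ocr:latest --format '{{.Os}}/{{.Architecture}}'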

Usage After Deployment

Once the services are up and running:

  1. Access the web interface at http://localhost:8080
  2. Use the API endpoint for programmatic access:
curl -X POST -F "file=@path/to/your/file.pdf" http://localhost:8080/upload

Management Commands

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# View logs for a specific service
docker-compose logs -f api

# Stop all services
docker-compose down

# Stop services and remove volumes
docker-compose down -v

# Update to the latest images and recreate the containers
docker-compose pull
docker-compose up -d

Troubleshooting

If you encounter issues:

  1. Check the logs: docker-compose logs -f
  2. Verify that all services are running: docker-compose ps
  3. Make sure your .env file has the correct configuration
  4. Check if ports are already in use on your host system
  5. Ensure the uploads directory exists and has the correct permissions (example commands for this and the port check are shown below)
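
For the last two checks, the following commands are a rough sketch, assuming a Linux host, the default port 8080, and the default ./uploads directory; adjust them to match your .env:

# Check whether the API port is already taken by another process
ss -tlnp | grep 8080

# Make sure the uploads directory exists and check its ownership and permissions
mkdir -p ./uploads
ls -ld ./uploads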

For more information about the application, refer to the GitHub repository.

Reference files

docker-compose.yml

docker-compose.yml
# Common healthcheck settings
x-healthcheck: &default-healthcheck
  interval: 5s
  timeout: 5s
  retries: 5

services:
  api:
    image: kjanat/pdf-extract-with-ocr:latest
    container_name: pdf-extract-api
    ports:
      - "${API_PORT:-8080}:80"
    environment:
      - IS_DOCKER_CONTAINER=true
    env_file:
      - .env
    volumes:
      - ${UPLOADS_DIR:-./uploads}:/app/uploads
    depends_on:
      - db
      - redis
      - worker
    restart: unless-stopped
    networks:
      - app-network

  worker:
    image: kjanat/pdf-extract-with-ocr:latest
    container_name: pdf-extract-celery-worker
    environment:
      - IS_DOCKER_CONTAINER=true
    env_file:
      - .env
    volumes:
      - ${UPLOADS_DIR:-./uploads}:/app/uploads
    depends_on:
      - db
      - redis
    command: python -m celery -A celery_worker.celery worker --loglevel=info
    restart: unless-stopped
    networks:
      - app-network

  redis:
    image: redis:latest
    container_name: pdf-extract-redis
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      <<: *default-healthcheck
    restart: unless-stopped
    networks:
      - app-network

  db:
    image: postgres:latest
    container_name: pdf-extract-postgres-db
    volumes:
      - postgres_data:/var/lib/postgresql/data
    env_file:
      - .env
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-ocruser} -d ${POSTGRES_DB:-ocr}"]
      <<: *default-healthcheck
    restart: unless-stopped
    networks:
      - app-network

volumes:
  postgres_data:

networks:
  app-network:
    driver: bridge

.env

.env.example
# API configuration
API_PORT=8080                         # Port for the API service
UPLOADS_DIR=./uploads                 # Directory to store uploaded PDFs

# Database configuration (choose SQLite or PostgreSQL)
# SQLite is used for local development, while PostgreSQL is recommended for production.
# Uncomment the one you want to use and comment the other.
#
# SQLite:
# DATABASE_URL=sqlite:///local.db
#
# PostgreSQL:
# POSTGRES_USER=ocruser                 # PostgreSQL username
# POSTGRES_PASSWORD=ocrpass             # PostgreSQL password
# POSTGRES_DB=ocr                       # PostgreSQL database name
# DATABASE_URL=postgresql://ocruser:ocrpass@db:5432/ocr

# Celery configuration
CELERY_BROKER_URL=redis://redis:6379/0