Running with Docker Compose¶
This guide explains how to run PDF Extract with OCR using Docker Compose, which is the recommended approach for deploying the full application stack.
Prerequisites¶
- Docker installed on your system
- Docker Compose installed on your system
- Basic knowledge of Docker and Docker Compose
Quick Start¶
- Clone the repository or download the docker-compose.yml file
- Create a .env file in the same directory (optional, see Environment Variables below)
- Run the application with the command shown below:
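```bash
# Start the full stack in the background
docker-compose up -d
```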
If no .env file is provided, the default values will be used.
This will start all required services and make the API available at http://localhost:8080.
Environment Variables¶
Create a .env file in the same directory as your docker-compose.yml with the following environment variables:
```bash
# API configuration
API_PORT=8080          # Port for the API service
UPLOADS_DIR=./uploads  # Directory to store uploaded PDFs

# Database configuration (choose SQLite or PostgreSQL)
# SQLite is used for local development, while PostgreSQL is recommended for production.
# Uncomment the one you want to use and comment out the other.
#
# SQLite:
# DATABASE_URL=sqlite:///local.db
#
# PostgreSQL:
# POSTGRES_USER=ocruser    # PostgreSQL username
# POSTGRES_PASSWORD=ocrpass # PostgreSQL password
# POSTGRES_DB=ocr          # PostgreSQL database name
# DATABASE_URL=postgresql://ocruser:ocrpass@db:5432/ocr

# Celery configuration
CELERY_BROKER_URL=redis://redis:6379/0
```
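After editing the .env file, you can check how your values are substituted into the Compose file before starting anything:

```bash
# Render the fully resolved compose file, with ${VAR:-default}
# placeholders replaced by values from .env (or their defaults)
docker-compose config
```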
Services Overview¶
The docker-compose.yml file defines four services:
API Service¶
The API service exposes the Flask web server that provides the REST API and web interface for uploading PDFs. It's available at http://localhost:8080 by default.
```yaml
api:
  image: kjanat/pdf-extract-with-ocr:latest
  container_name: pdf-extract-api
  ports:
    - "${API_PORT:-8080}:80"
  env_file:
    - .env
  volumes:
    - ${UPLOADS_DIR:-./uploads}:/app/uploads
  # ...
```
Worker Service¶
The worker service processes PDF files asynchronously using Celery. It performs the actual text extraction from PDFs.
```yaml
worker:
  image: kjanat/pdf-extract-with-ocr:latest
  container_name: pdf-extract-celery-worker
  env_file:
    - .env
  volumes:
    - ${UPLOADS_DIR:-./uploads}:/app/uploads
  command: python -m celery -A celery_worker.celery worker --loglevel=info
  # ...
```
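To confirm the worker has connected to the broker and is accepting tasks, you can query it through the Celery CLI inside the container, using the same module path as the command above:

```bash
# Ping all running workers through the Redis broker
docker-compose exec worker python -m celery -A celery_worker.celery status

# List the tasks currently being executed
docker-compose exec worker python -m celery -A celery_worker.celery inspect active
```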
Redis Service¶
Redis is used as a message broker for Celery tasks.
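You can verify the broker is reachable with the same command the service's healthcheck uses:

```bash
# Should print PONG if Redis is healthy
docker-compose exec redis redis-cli ping
```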
Database Service¶
PostgreSQL database for storing extraction results and job history.
```yaml
db:
  image: postgres:latest
  container_name: pdf-extract-postgres-db
  volumes:
    - postgres_data:/var/lib/postgresql/data
  env_file:
    - .env
  # ...
```
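To inspect stored results directly, you can open a psql shell inside the container; the credentials below are the defaults from the sample .env:

```bash
# Open an interactive PostgreSQL shell in the db container
docker-compose exec db psql -U ocruser -d ocr
```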
Architecture Support¶
The Docker images are built for multiple architectures:
- linux/amd64 (x86_64)
- linux/arm64 (aarch64)
Docker will automatically pull the correct image for your system architecture.
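If you want to confirm which platforms a tag provides, you can inspect the image manifest without pulling it:

```bash
# List the platforms published for the image tag
docker buildx imagetools inspect kjanat/pdf-extract-with-ocr:latest
```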
Usage After Deployment¶
Once the services are up and running:
- Access the web interface at http://localhost:8080
- Use the API endpoint for programmatic access, for example:
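As a sketch of programmatic access, uploading a PDF with curl might look like the following; the endpoint path and form field name are assumptions here, so check the repository's API documentation for the actual route:

```bash
# Hypothetical upload route and field name -- see the API docs
# in the repository for the real ones
curl -F "file=@document.pdf" http://localhost:8080/upload
```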
Management Commands¶
```bash
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# View logs for a specific service
docker-compose logs -f api

# Stop all services
docker-compose down

# Stop services and remove volumes
docker-compose down -v

# Update to the latest images and recreate the containers
docker-compose pull
docker-compose up -d
```
Troubleshooting¶
If you encounter issues:
- Check the logs: docker-compose logs -f
- Verify that all services are running: docker-compose ps
- Make sure your .env file has the correct configuration
- Check if ports are already in use on your host system
- Ensure the uploads directory has the correct permissions
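Since the redis and db services define healthchecks, you can also read their health status directly:

```bash
# Show the latest healthcheck results for the database container
docker inspect --format '{{json .State.Health}}' pdf-extract-postgres-db
```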
For more information about the application, refer to the GitHub repository.
Reference files¶
docker-compose.yml¶
```yaml
# Common healthcheck settings
x-healthcheck: &default-healthcheck
  interval: 5s
  timeout: 5s
  retries: 5

services:
  api:
    image: kjanat/pdf-extract-with-ocr:latest
    container_name: pdf-extract-api
    ports:
      - "${API_PORT:-8080}:80"
    environment:
      - IS_DOCKER_CONTAINER=true
    env_file:
      - .env
    volumes:
      - ${UPLOADS_DIR:-./uploads}:/app/uploads
    depends_on:
      - db
      - redis
      - worker
    restart: unless-stopped
    networks:
      - app-network

  worker:
    image: kjanat/pdf-extract-with-ocr:latest
    container_name: pdf-extract-celery-worker
    environment:
      - IS_DOCKER_CONTAINER=true
    env_file:
      - .env
    volumes:
      - ${UPLOADS_DIR:-./uploads}:/app/uploads
    depends_on:
      - db
      - redis
    command: python -m celery -A celery_worker.celery worker --loglevel=info
    restart: unless-stopped
    networks:
      - app-network

  redis:
    image: redis:latest
    container_name: pdf-extract-redis
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      <<: *default-healthcheck
    restart: unless-stopped
    networks:
      # Join the shared network so api and worker can resolve the
      # hostname "redis" used in CELERY_BROKER_URL
      - app-network

  db:
    image: postgres:latest
    container_name: pdf-extract-postgres-db
    volumes:
      - postgres_data:/var/lib/postgresql/data
    env_file:
      - .env
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-ocruser} -d ${POSTGRES_DB:-ocr}"]
      <<: *default-healthcheck
    restart: unless-stopped
    networks:
      - app-network

volumes:
  postgres_data:

networks:
  app-network:
    driver: bridge
```
.env¶
```bash
# API configuration
API_PORT=8080          # Port for the API service
UPLOADS_DIR=./uploads  # Directory to store uploaded PDFs

# Database configuration (choose SQLite or PostgreSQL)
# SQLite is used for local development, while PostgreSQL is recommended for production.
# Uncomment the one you want to use and comment out the other.
#
# SQLite:
# DATABASE_URL=sqlite:///local.db
#
# PostgreSQL:
# POSTGRES_USER=ocruser    # PostgreSQL username
# POSTGRES_PASSWORD=ocrpass # PostgreSQL password
# POSTGRES_DB=ocr          # PostgreSQL database name
# DATABASE_URL=postgresql://ocruser:ocrpass@db:5432/ocr

# Celery configuration
CELERY_BROKER_URL=redis://redis:6379/0
```