Running from Source

This guide explains how to install and run PDF Extract with OCR directly on your host system without containers.

Prerequisites

Before installing, make sure the following are available:

Python 3.8 or higher
pip (Python package manager)
Git (optional, for cloning the repository)

System Dependencies

Install the required system dependencies based on your operating system:

Linux (Debian/Ubuntu):

sudo apt-get install -y \
    tesseract-ocr \
    redis-server \
    sqlite3

Windows (PowerShell):

# Using winget (run as Administrator)
'tesseract-ocr.tesseract', 'SQLite.SQLite' |
    % { winget install --id=$_ }

For Redis on Windows, you can either:

Install using Redis Windows
Use the Windows Subsystem for Linux (WSL)
Skip Redis by using a SQLite-based task queue

macOS:

# Using Homebrew
brew install \
    tesseract \
    redis \
    sqlite
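After installing the system packages, a quick preflight check can confirm that the tools are actually reachable. The following is a minimal sketch, not part of the project: the function names are hypothetical, and the executable names (`tesseract`, `sqlite3`, `redis-server`, `git`) are assumed from the install commands above and may differ on your system.

```python
import shutil
import sys

# Executable names assumed from the install commands; adjust if your
# platform installs them under different names.
REQUIRED_TOOLS = ["tesseract", "sqlite3"]
OPTIONAL_TOOLS = ["redis-server", "git"]


def python_ok(minimum=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def missing_tools(names):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Python >= 3.8:", python_ok())
    print("Missing required tools:", missing_tools(REQUIRED_TOOLS))
    print("Missing optional tools:", missing_tools(OPTIONAL_TOOLS))
```

An empty list for the required tools means both were found on PATH. Redis is listed as optional here because the guide allows running without it.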

Installation Steps

Clone the repository (or download the source code):

git clone https://github.com/kjanat/pdf-extract-with-ocr.git
cd pdf-extract-with-ocr

Create a virtual environment and activate it:

Linux (Debian/Ubuntu) and macOS:

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate

Windows (PowerShell):

# Create virtual environment
python -m venv venv

# Activate virtual environment
.\venv\Scripts\Activate.ps1
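If you are unsure whether activation worked, the interpreter itself can tell you. This is a small sketch (the helper name `in_virtualenv` is hypothetical) that relies on the standard behaviour of `python -m venv`: inside an environment, `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
    print("Interpreter:", sys.executable)
```

When the environment is active, the printed interpreter path should point inside the `venv` directory.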

Install the required Python dependencies:
```
pip install -U -r requirements.txt
```
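To confirm that the install step completed, you can compare `requirements.txt` against what is installed in the active environment. The sketch below is illustrative only: the helper names are hypothetical, and the name extraction is deliberately simple (it ignores version specifiers and extras, and does not handle URL or VCS requirements).

```python
import re
from importlib import metadata  # available in Python 3.8+
from pathlib import Path


def requirement_names(lines):
    """Extract distribution names from requirements-style lines.

    Comments, blank lines, and option lines such as "-r other.txt"
    are skipped.
    """
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    req_file = Path("requirements.txt")
    if req_file.exists():
        names = requirement_names(req_file.read_text().splitlines())
        print("Not installed:", missing_packages(names))
```

Run it from the repository root with the virtual environment activated; an empty list suggests every listed package was found.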

Configuration

Create a .env file by copying the example file:
```
cp .env.example .env
```

Edit the .env file to set your configuration options:

.env

# API configuration
API_PORT=8080
UPLOADS_DIR=./uploads

# Database configuration (choose SQLite or PostgreSQL)
# For SQLite:
DATABASE_URL=sqlite:///local.db
# For PostgreSQL:
# DATABASE_URL=postgresql://user:password@localhost:5432/ocr

# Celery configuration
CELERY_BROKER_URL=redis://localhost:6379/0
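To illustrate how these settings might be consumed, here is a minimal sketch that parses a `.env` file with the standard library and applies fallbacks. It is not the application's actual loader, which may use a different mechanism: `parse_env` and `load_settings` are hypothetical names, and the fallback values simply mirror the example file above.

```python
import os
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_settings(path=".env"):
    """Merge .env values with the process environment (environment wins)."""
    env_file = Path(path)
    file_values = parse_env(env_file.read_text()) if env_file.exists() else {}
    merged = {**file_values, **os.environ}
    # Fallbacks below are assumptions copied from the example .env file.
    return {
        "port": int(merged.get("API_PORT", "8080")),
        "uploads_dir": merged.get("UPLOADS_DIR", "./uploads"),
        "database_url": merged.get("DATABASE_URL", "sqlite:///local.db"),
        "broker_url": merged.get("CELERY_BROKER_URL", "redis://localhost:6379/0"),
    }


if __name__ == "__main__":
    print(load_settings())
```

Letting real environment variables override the file is a common convention, so a value exported in the shell takes precedence over the one in `.env`.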

Running the Application

Option 1: Run the Flask application only

For simple usage where background processing isn't needed:

Linux (Debian/Ubuntu) and macOS:

# Activate the virtual environment (if not already activated)
source venv/bin/activate

# Set the port (optional; `flask run` defaults to 5000)
# FLASK_RUN_PORT is read by `flask run`; `python app.py` honors it
# only if app.py reads the variable itself
export FLASK_RUN_PORT=8080

# Run the application
python app.py
# Or alternatively:
# flask run

Windows (PowerShell):

# Activate the virtual environment (if not already activated)
.\venv\Scripts\Activate.ps1

# Set the port (optional; `flask run` defaults to 5000)
# FLASK_RUN_PORT is read by `flask run`; `python app.py` honors it
# only if app.py reads the variable itself
$env:FLASK_RUN_PORT=8080

# Run the application
python app.py
# Or alternatively:
# flask run

Option 2: Run with background processing (recommended)

For optimal performance with background task processing:

Start Redis (if not already running):

# On Linux/macOS
redis-server

# On Windows, Redis may already be running if it was installed as a service

Start the Celery worker in a separate terminal:

# Make sure your virtual environment is activated
celery -A celery_worker.celery worker --loglevel=info

Start the Flask application:
```
python app.py
```

Accessing the Application

Once running, you can access:

Web interface: http://localhost:8080 (or the port you configured)
API endpoint: http://localhost:8080/upload
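To verify from a script that the web interface is up, a plain HTTP request is enough. The sketch below uses only the standard library; `http_status` is a hypothetical helper, and the URL assumes the default port shown above.

```python
import urllib.error
import urllib.request

# Bypass any configured HTTP proxy, since the target is a local server.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def http_status(url, timeout=3.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 404 or 500).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Nothing is listening, or the connection timed out.
        return None


if __name__ == "__main__":
    status = http_status("http://localhost:8080")
    print("Web interface status:", status if status is not None else "unreachable")
```

A result of `None` usually means the application is not running or is listening on a different port than the one requested.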

Troubleshooting

Common Issues

Tesseract not found: Ensure Tesseract is installed and in your PATH
Redis connection errors: Verify Redis is running (redis-cli ping should return PONG)
Database errors: Check your database configuration in .env
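If `redis-cli` is not installed, the same Redis check can be approximated from Python. This sketch sends an inline `PING` over a raw socket and looks for the `+PONG` reply; the helper names are hypothetical, and the default URL is the `CELERY_BROKER_URL` from the example configuration. A server that requires authentication will answer with an error instead, which this sketch reports as `False`.

```python
import socket
from urllib.parse import urlparse


def broker_address(url="redis://localhost:6379/0"):
    """Extract (host, port) from a redis:// URL, defaulting to port 6379."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or 6379


def redis_ping(host, port, timeout=2.0):
    """Send an inline PING and report whether the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Connection refused, timed out, or reset.
        return False


if __name__ == "__main__":
    host, port = broker_address()
    print(f"Redis at {host}:{port} reachable:", redis_ping(host, port))
```

If this prints `False` while Redis is supposed to be running, check that the host and port in `.env` match the ones the server is bound to.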

Logs

Check the application logs for more detailed error information:

# Application logs are printed to the console
# Celery worker logs are in the worker terminal

For more help, refer to the GitHub repository or open an issue.