API Documentation¶

This page provides comprehensive documentation for the PDF Extract with OCR REST API.

API Endpoints¶

The API provides several endpoints for uploading PDFs, retrieving results, and managing job history.

Upload a PDF¶

Upload a PDF file for text extraction.

Endpoint: POST /upload
Content-Type: multipart/form-data
Request Parameters:

Parameter	Type	Required	Description
file	File	Yes	The PDF file to process

Response:

{
  "status": "processing",
  "task_id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
  "filename": "example.pdf"
}

The response includes a task ID that you'll need to poll for results.

Example:

curl -X POST -F file=@path/to/your/file.pdf http://localhost:8080/upload

Check Processing Status¶

Check the status of a PDF processing task.

Endpoint: GET /api/result/{task_id}
Path Parameters:

Parameter	Type	Required	Description
task_id	String	Yes	The task ID returned from the upload endpoint

Response when processing is complete:

{
  "text": "Extracted text from the PDF here...",
  "status": "success",
  "method": "tesseract",
  "filename": "example.pdf",
  "datetime": "2025-03-21T12:34:56.789012+00:00",
  "duration_ms": 12.3
}

Response if still processing:

{
  "status": "processing"
}

Example:

curl http://localhost:8080/api/result/a1b2c3d4-e5f6-7890-abcd-1234567890ab

List Processing Jobs¶

Retrieve the history of PDF processing jobs.

Endpoint: GET /api/jobs
Query Parameters:

Parameter	Type	Required	Description
limit	Integer	No	Maximum number of jobs to return (default: 50)
page	Integer	No	Page number for pagination (default: 1)

Response:

{
  "jobs": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
      "filename": "example.pdf",
      "status": "SUCCESS",
      "created_at": "2025-03-21T12:34:56.789012+00:00",
      "completed_at": "2025-03-21T12:35:02.123456+00:00",
      "method": "tesseract",
      "duration_ms": 5334
    },
    // Additional job entries...
  ],
  "pagination": {
    "page": 1,
    "total_pages": 5,
    "total_jobs": 124
  }
}

Example:

curl http://localhost:8080/api/jobs?limit=10&page=1

Complete Workflow Example¶

Here's a complete example of how to use the API to extract text from a PDF:

Upload the PDF file:

curl -X POST -F file=@path/to/your/file.pdf http://localhost:8080/upload

Response:

{
"status": "processing",
"task_id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
"filename": "example.pdf"
}

Poll for results using the task ID:

curl http://localhost:8080/api/result/a1b2c3d4-e5f6-7890-abcd-1234567890ab

Initial response (still processing):

{
"status": "processing"
}

Final response (processing complete):

{
"text": "Extracted text from the PDF here...",
"status": "success",
"method": "tesseract",
"filename": "example.pdf",
"datetime": "2025-03-21T12:34:56.789012+00:00",
"duration_ms": 12.3
}

Error Responses¶

The API may return the following error responses:

Status Code	Description	Possible Cause
400	Bad Request	Missing file, invalid file type
404	Not Found	Invalid task ID, job not found
500	Internal Server Error	Server error during processing

Error response format:

{
  "error": "Error message describing the issue"
}

Notes¶

The maximum file size for uploads is determined by server configuration
Supported file formats: PDF only
Processing time varies based on PDF size, complexity, and whether OCR is needed
Large PDFs may take longer to process, especially when OCR is required

For frontend integration examples, see the JavaScript in the script.js file.