Skip to main content

PDF OCR Engines - Deep Dive Research

Overview

Tổng quan về các OCR engines phổ biến nhất cho document processing:

EngineStrengthsWeaknessesBest For
TesseractFree, mature, many languagesOld architecture, weak on complex layoutsSimple documents
PaddleOCRFast, accurate, multi-language, PP-OCRv5Larger footprint, GPU recommendedProduction pipelines
EasyOCREasy API, good accuracySlow inference, memory heavyPrototyping, simple use
SuryaTable detection, layout analysisNewer, less documentedMulti-tasks document

1. PaddleOCR - Chi tiết

Giới thiệu

PaddleOCR là open-source OCR toolkit từ Baidu, viết bằng Python với PaddlePaddle deep learning framework. Phiên bản mới nhất: PP-OCRv5 (2024).

Features

  • Text Detection: DB (Differentiable Binarization) - phát hiện text regions
  • Text Recognition: CRNN (Convolutional Recurrent Neural Network) + CTC loss
  • Multi-language: hỗ trợ 80+ languages (bao gồm tiếng Việt)
  • Angle Classification: phát hiện rotated text
  • Table Recognition: có thể kết hợp với PaddleStructure
  • Layout Analysis: PP-PicStruct cho document layout understanding
  • PP-OCRv5 improvements: Nhẹ hơn, nhanh hơn, accuracy cao hơn

Performance Benchmarks

ModelPrecisionRecallF1-ScoreInference Time (CPU)
PP-OCRv5 (Server)97.2%96.8%97.0%~150ms/page
PP-OCRv5 (Mobile)95.5%94.2%94.8%~50ms/page
EasyOCR96.0%95.5%95.7%~400ms/page
Tesseract 590.0%88.0%89.0%~200ms/page

Installation

# Basic install
pip install paddlepaddle paddleocr

# For better performance (with GPU support)
pip install paddlepaddle-gpu

# For table extraction
pip install paddlepaddle paddlestructure

Basic Usage

from paddleocr import PaddleOCR

# Initialize (downloads models automatically)
ocr = PaddleOCR(
use_angle_cls=True, # Enable angle classification
lang='en', # 'en', 'vi', 'ch', 'ja', etc.
use_gpu=True, # Set False for CPU
show_log=False # Disable debug logs
)

# Single image
result = ocr.ocr('document.png', cls=True)

# Parse results
for line in result[0]:
box = line[0] # Bounding box
text = line[1][0] # Text content
confidence = line[1][1] # Confidence score
print(f"{text} ({confidence:.2f})")

Advanced Usage - Batch Processing

from paddleocr import PaddleOCR
from pathlib import Path
import json

ocr = PaddleOCR(use_angle_cls=True, lang='vi', use_gpu=True)

def process_pdf(pdf_path, output_dir='output'):
"""Process all pages of a PDF"""
results = {}

# Use pdf2image to convert PDF to images
from pdf2image import convert_from_path

images = convert_from_path(pdf_path)

for i, img in enumerate(images):
img_path = f'/tmp/page_{i}.jpg'
img.save(img_path, 'JPEG')

result = ocr.ocr(img_path, cls=True)
results[f'page_{i+1}'] = result

return results

# Save to JSON
with open('ocr_results.json', 'w', encoding='utf-8') as f:
json.dump(results, f, ensure_ascii=False, indent=2)

PDF with Layout Analysis

from paddleocr import PaddleOCR, draw_ocr

# Enable layout analysis mode
ocr = PaddleOCR(
use_angle_cls=True,
lang='vi',
use_gpu=True,
layout=True # Enable layout analysis
)

# Returns structured result with layout info
result = ocr.ocr('document.png', cls=True, layout=True)

Limitations

  1. Model size: Full models ~200MB download
  2. Memory: Requires 4GB+ RAM for smooth processing
  3. GPU recommended: CPU-only khá chậm cho batch processing
  4. Complex tables: Cần thêm PaddleStructure cho table extraction
  5. Fine-tuning: Khó customize cho domain-specific documents
  6. Deployment: PaddlePaddle ecosystem có thể khó integrate vào production không dùng Paddle

Integration Examples

FastAPI Service

from fastapi import FastAPI, UploadFile
from paddleocr import PaddleOCR
import tempfile
import os

app = FastAPI()
ocr = PaddleOCR(use_angle_cls=True, lang='vi', use_gpu=True)

@app.post("/ocr")
async def ocr_image(file: UploadFile):
with tempfile.NamedTemporaryFile(delete=False) as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name

result = ocr.ocr(tmp_path, cls=True)
os.unlink(tmp_path)

return {"text": [line[1][0] for line in result[0]]}

Docker Deployment

FROM python:3.10-slim

# Install PaddlePaddle (CPU version)
RUN pip install paddlepaddle paddleocr pdf2image poppler-utils

WORKDIR /app
COPY app.py .

CMD ["uvicorn", "app:app", "--host", "0.0.0.0"]

2. EasyOCR - Alternative dễ dùng

Installation

pip install easyocr

Usage

import easyocr

# Initialize (downloads ~140MB models on first run)
reader = easyocr.Reader(['en', 'vi'], gpu=True)

# Read image
results = reader.readtext('document.png')

# Parse
for bbox, text, confidence in results:
print(f"{text} (conf: {confidence:.2f})")

Pros/Cons

ProsCons
Simple APIChậm (~4x slower than PaddleOCR)
Good accuracyHigh memory usage
Many languagesModels lớn
No config neededLimited customization

3. Tesseract - Classic OCR

Installation

# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-vie

# Python
pip install pytesseract

Usage

import pytesseract
from PIL import Image

img = Image.open('document.png')
text = pytesseract.image_to_string(img, lang='vie')
print(text)

Best for

  • Simple, clean documents
  • When you need quick prototype
  • Server với limited resources
  • Cases không cần high accuracy

4. Table Extraction

Options

  1. PaddleStructure (paddlestructure)

    from paddlestructure import StructureEngine

    engine = StructureEngine(show_log=False)
    result = engine.ocr('table.png')
    # Returns structured table data
  2. Camelot (Python, cho PDFs)

    pip install camelot-py[cv]
    import camelot

    tables = camelot.read_pdf('table.pdf')
    tables[0].df # Returns DataFrame
  3. Tabula (Java-based)

    # Extract tables from PDF
    tabula extract -a -o output.csv input.pdf
  4. Surya (newer, OCR + layout)

    pip install surya-ocr
    from surya.ocr import run_ocr
    from surya.model.detection.segformer import load_model as load_det_model
    from surya.model.recognition.decoder import load_model as load_rec_model
    from surya.schema import LanguageRecognitionResult, OCRResult

    # Surya handles layout, tables, and OCR in one pass

5. Document Layout Models

LayoutLM (Microsoft)

  • Use case: Document understanding, information extraction
  • Models: LayoutLMv3 (latest), LayoutLM-base
  • Strength: Combine text + layout + visual features
# Example with HuggingFace
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base")

DiT (Document Image Transformer)

  • State-of-the-art for document understanding (2023+)
  • Use case: Document classification, layout analysis, table detection
  • From: Microsoft Research
# Using HuggingFace
from transformers import AutoImageProcessor, AutoModelForDocumentLayoutDetection

processor = AutoImageProcessor.from_pretrained("microsoft/dit-base")
model = AutoModelForDocumentLayoutDetection.from_pretrained("microsoft/dit-base")

Recommendations cho Use Cases

Use CaseRecommended Solution
Fast production OCRPaddleOCR PP-OCRv5
Quick prototypeEasyOCR
Simple PDFs, low resourcesTesseract
Complex documents + tablesPaddleOCR + PaddleStructure
Document understandingLayoutLMv3 or DiT
All-in-one (OCR + layout)Surya

Performance Tips

  1. GPU is essential - 10x faster on GPU
  2. Batch processing - Process multiple images together
  3. Use appropriate model size - Mobile models for speed, server for accuracy
  4. Pre-processing - Enhance image quality (contrast, denoising) trước khi OCR
  5. Language setting - Specify correct language để improve accuracy
  6. Use angle classification - For rotated documents

Conclusion

PaddleOCR is recommended for production use cases với:

  • Balance tốt giữa speed và accuracy
  • Active development từ Baidu
  • Good documentation và community
  • Comprehensive ecosystem (OCR + layout + table)

For simpler needs, EasyOCR is easier to start. For enterprise document understanding, consider LayoutLMv3 or DiT.


Research completed: 2026-03-22 Topic: PDF OCR Engines Deep Dive