A Python tool for extracting tables and figures from complex, multi-layered PDF documents using Tabula, Camelot, pdfplumber, and PyMuPDF.
Features
Layer-Aware Extraction: Handles complex PDFs with multiple layers
Multi-Method Table Extraction: Uses Tabula, Camelot, and pdfplumber
Advanced Figure Extraction: Vector graphics (SVG) + high-res rendering
Rich Reporting: Markdown report with image previews and table samples
Smart Detection: Heuristic analysis to identify tables vs figures
CLI Interface: Easy-to-use command line interface
Installation
Requirements
Python 3.12+
Java Runtime Environment (for Tabula)
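Both requirements can be checked before installing anything else. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name, not part of the tool, and it only tests whether a `java` executable is on the PATH, not which version it is.

```python
import shutil
import sys


def check_environment(min_python=(3, 12)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # sys.version_info compares element-wise against a plain tuple.
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {wanted}+ required, found {found}")
    # Tabula runs on the JVM, so a `java` executable must be on the PATH.
    if shutil.which("java") is None:
        problems.append("Java runtime not found on PATH (needed for Tabula)")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

If nothing is printed, both checks passed.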
Setup
# 1. Create virtual environment
python3.12 -m venv pdf_extraction_env
# 2. Activate environment
source pdf_extraction_env/bin/activate # Linux/Mac
# or
pdf_extraction_env\Scripts\activate # Windows
# 3. Install dependencies from requirements file
pip install -r requirements-pdfextractor.txt
# Alternative: Manual installation
pip install tabula-py "camelot-py[cv]" PyMuPDF pdfplumber pandas numpy Pillow opencv-python
Usage
Basic Usage
# Extract from PDF to default directory (extracted_content_FILENAME)
python advanced_pdf_extractor.py document.pdf
# Extract with verbose output
python advanced_pdf_extractor.py PDF_Proof.PDF --verbose
# Extract to custom directory
python advanced_pdf_extractor.py report.pdf --output my_extraction
Command Line Options
usage: advanced_pdf_extractor.py [-h] [-o OUTPUT_DIR] [-v] [--version] pdf_file
positional arguments:
pdf_file Path to the PDF file to extract content from
options:
-h, --help Show help message
-o OUTPUT_DIR Output directory (default: extracted_content_FILENAME)
-v, --verbose Enable verbose output
--version Show version number
Output Structure
The tool writes its results to the following directory structure:
extracted_content_FILENAME/
├── tables/                          # CSV files with extracted table data
│   ├── tabula_lattice_table_1.csv
│   ├── camelot_stream_table_2.csv
│   └── pdfplumber_page_5_table_1.csv
├── figures/                         # PNG and SVG files with extracted figures
│   ├── figure_page_1_highres.png    # High-resolution renders
│   ├── figure_page_2_vector.svg     # Vector graphics
│   └── embedded_page_3_img_1.png    # Embedded images
├── raw_data/                        # Intermediate processing files
├── EXTRACTION_REPORT.md             # Comprehensive report with previews
└── extraction_report.json           # Machine-readable results
Extraction Methods
Table Extraction
Tabula: Best for form-based tables and structured data
Camelot: Excellent for complex layouts and scientific papers
pdfplumber: Precise text extraction and simple tables
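How the three libraries can be run side by side is sketched here. This is an illustration under assumptions, not the tool's actual code: the function names are hypothetical, and the `lattice` and `stream` modes are guessed from example output names such as `tabula_lattice_table_1.csv` and `camelot_stream_table_2.csv`. Each backend is imported inside its own function, so a missing or failing library only empties that method's result.

```python
def tabula_lattice(pdf_path):
    import tabula  # tabula-py; needs a Java runtime
    # lattice=True targets tables whose cells are separated by ruling lines.
    return tabula.read_pdf(pdf_path, pages="all", lattice=True, multiple_tables=True)


def camelot_stream(pdf_path):
    import camelot
    # flavor="stream" infers columns from whitespace instead of ruling lines.
    tables = camelot.read_pdf(pdf_path, pages="all", flavor="stream")
    return [table.df for table in tables]


def pdfplumber_tables(pdf_path):
    import pdfplumber
    found = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Each table is returned as a list of rows (lists of cell strings).
            found.extend(page.extract_tables())
    return found


def run_all(pdf_path, methods):
    """Run every method and keep going when one of them fails."""
    results = {}
    for name, method in methods.items():
        try:
            results[name] = list(method(pdf_path))
        except Exception as exc:  # includes ImportError for a missing backend
            print(f"{name} failed: {exc}")
            results[name] = []
    return results


DEFAULT_METHODS = {
    "tabula": tabula_lattice,
    "camelot": camelot_stream,
    "pdfplumber": pdfplumber_tables,
}
```

Calling `run_all("document.pdf", DEFAULT_METHODS)` returns one list of tables per method, which makes it easy to compare what each library found on the same file.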
Figure Extraction
SVG Vector: Extracts vector graphics from all PDF layers
High-Resolution Render: 3x scaling for figure-rich pages
Embedded Images: Extracts actual embedded image files
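The three figure paths map onto PyMuPDF calls roughly as follows. This is a sketch, not the tool's implementation: `extract_figures` and `render_dpi` are hypothetical names, and the sketch processes every page, whereas the tool is described as rendering only figure-rich pages. The file names follow the examples in the output structure.

```python
from pathlib import Path


def render_dpi(scale):
    """PDF user space is 72 units per inch, so a 3x zoom renders at 216 DPI."""
    return 72 * scale


def extract_figures(pdf_path, out_dir, scale=3.0):
    """Write an SVG, a scaled PNG, and any embedded images for every page."""
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    doc = fitz.open(pdf_path)
    try:
        for number, page in enumerate(doc, start=1):
            # Vector export of the whole page.
            svg_path = out / f"figure_page_{number}_vector.svg"
            svg_path.write_text(page.get_svg_image(), encoding="utf-8")
            written.append(svg_path)

            # Raster render; Matrix(scale, scale) zooms both axes.
            pixmap = page.get_pixmap(matrix=fitz.Matrix(scale, scale))
            png_path = out / f"figure_page_{number}_highres.png"
            pixmap.save(str(png_path))
            written.append(png_path)

            # Embedded images, saved in their original format.
            for index, image in enumerate(page.get_images(full=True), start=1):
                info = doc.extract_image(image[0])  # image[0] is the xref
                img_path = out / f"embedded_page_{number}_img_{index}.{info['ext']}"
                img_path.write_bytes(info["image"])
                written.append(img_path)
    finally:
        doc.close()
    return written
```

The function returns the paths it wrote, so a caller can list them in a report.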
Report Features
The generated EXTRACTION_REPORT.md includes:
Summary statistics and method performance
Image gallery with PNG previews
Table previews showing first 10 rows as Markdown tables
Complete file listing with paths and metadata
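A table preview of this kind can be built from an extracted CSV with the standard library alone. The sketch below is an assumption about how such a preview might be produced, not the tool's code; `markdown_preview` is a hypothetical name, and it treats the first CSV row as the header and counts the 10 rows after it.

```python
import csv
import io


def markdown_preview(csv_text, max_rows=10):
    """Render the header plus the first max_rows data rows as a Markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1 : max_rows + 1]
    width = len(header)

    def fmt(cells):
        # Pad or trim each row to the header width and escape literal pipes.
        cells = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


print(markdown_preview("page,value\n1,0.5\n2,0.7\n"))
```

Ragged rows are padded to the header width so the Markdown table stays rectangular even when an extractor returns uneven rows.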
Examples
Example 1: Academic Paper
python advanced_pdf_extractor.py research_paper.pdf --verbose
Output: extracted_content_research_paper/ with tables and figures
Example 2: Custom Directory
python advanced_pdf_extractor.py complex_report.pdf -o report_extraction
Output: report_extraction/ with extracted content
Example 3: Processing PDF_Proof.PDF
python advanced_pdf_extractor.py PDF_Proof.PDF
Result:
126 tables extracted (65 Tabula + 55 Camelot + 6 pdfplumber)
67 figures extracted (46 vector + 19 high-res + 2 embedded)
Output in extracted_content_PDF_Proof/
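A per-method breakdown like the one in Example 3 can be recomputed from the output directory. The sketch assumes that every CSV in `tables/` starts with the method name followed by an underscore, as in the sample file names; `count_tables_by_method` is a hypothetical helper, not part of the tool.

```python
from collections import Counter
from pathlib import Path


def count_tables_by_method(output_dir):
    """Count the CSV files in <output_dir>/tables by extraction method."""
    tables_dir = Path(output_dir) / "tables"
    # The method is taken to be the text before the first underscore.
    return Counter(path.name.split("_", 1)[0] for path in tables_dir.glob("*.csv"))
```

Run against `extracted_content_PDF_Proof`, this should reproduce the Tabula, Camelot, and pdfplumber counts listed above, provided the file names follow that pattern.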
Troubleshooting
Common Issues
Missing Dependencies: The tool reports the missing package and the install command instead of failing with a traceback
Java not found: Install OpenJDK 11+ for Tabula support
Font warnings: Normal for complex PDFs, extraction continues
Memory usage: Large PDFs may require more RAM
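One way to handle missing dependencies this way is to parse the command line first and look for third-party packages only afterwards, so that `--help` and `--version` never touch them. The sketch below is an assumption about the approach, not the tool's source: `build_parser`, `missing_modules`, and `main` are hypothetical names, the long form `--output` is inferred from the usage examples, and the module list uses the import names of the packages (`fitz` for PyMuPDF, `cv2` for opencv-python, `PIL` for Pillow).

```python
import argparse
import importlib.util

REQUIRED = ["pandas", "numpy", "fitz", "pdfplumber", "tabula", "camelot", "cv2", "PIL"]


def build_parser():
    parser = argparse.ArgumentParser(prog="advanced_pdf_extractor.py")
    parser.add_argument("pdf_file", help="Path to the PDF file to extract content from")
    parser.add_argument("-o", "--output", dest="output_dir",
                        help="Output directory (default: extracted_content_FILENAME)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose output")
    parser.add_argument("--version", action="version", version="%(prog)s 2.0")
    return parser


def missing_modules(names):
    # find_spec locates a top-level module without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


def main(argv=None, required=REQUIRED):
    # --help and --version exit inside parse_args, before any dependency check.
    args = build_parser().parse_args(argv)
    missing = missing_modules(required)
    if missing:
        print(f"Error: Required dependency not found: {', '.join(missing)}")
        print("To install all required dependencies, run:")
        print("  pip install -r requirements-pdfextractor.txt")
        return 1
    print(f"Would extract {args.pdf_file} (verbose={args.verbose})")
    return 0
```

A script built this way would typically end with `sys.exit(main())` under an `if __name__ == "__main__":` guard, after importing `sys`.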
Dependency Management
The tool is designed to show help and version information even without dependencies:
# These work without any packages installed:
python advanced_pdf_extractor.py --help # Always works
python advanced_pdf_extractor.py --version # Always works
# This shows a helpful error if dependencies are missing:
python advanced_pdf_extractor.py document.pdf
# Error: Required dependency not found: No module named 'pandas'
# To install all required dependencies, run:
# pip install -r requirements-pdfextractor.txt
Font Warnings
Font warnings like "Start marker missing" are common with academic PDFs and don't affect extraction quality.
Performance
PDF_Proof.PDF Results (46 pages):
Processing time: ~2 minutes
Items extracted: 126 tables + 67 figures
Methods used: All extraction methods successfully applied
Output size: ~15 MB (high-resolution images included)
Technical Details
Dependencies
tabula-py: Java-based table extraction
camelot-py[cv]: Computer vision table detection
PyMuPDF: PDF manipulation and rendering
pdfplumber: Text-based PDF analysis
pandas: Data manipulation
numpy: Numerical operations
Pillow: Image processing
opencv-python: Computer vision
File Naming Convention
Tables: {method}_{type}_table_{id}.csv
Figures: figure_page_{page}_{type}.{ext}
Output: extracted_content_{clean_filename}/
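The three patterns can be expressed as small formatting helpers. This is a sketch with hypothetical function names. The source does not say how `clean_filename` is derived, so the rule used here (drop the extension, replace anything other than letters, digits, `_`, and `-` with an underscore) is an assumption.

```python
import re
from pathlib import Path


def clean_filename(pdf_path):
    """Assumed cleaning rule: file stem with unsafe characters replaced by '_'."""
    stem = Path(pdf_path).stem
    return re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")


def table_name(method, kind, table_id):
    return f"{method}_{kind}_table_{table_id}.csv"


def figure_name(page, kind, ext):
    return f"figure_page_{page}_{kind}.{ext}"


def output_dir(pdf_path):
    return f"extracted_content_{clean_filename(pdf_path)}/"


print(table_name("tabula", "lattice", 1))  # tabula_lattice_table_1.csv
print(figure_name(2, "vector", "svg"))     # figure_page_2_vector.svg
print(output_dir("PDF_Proof.PDF"))         # extracted_content_PDF_Proof/
```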
License
This tool is designed for academic and research use with complex PDF documents.
Version: 2.0
Python: 3.12+
Last Updated: August 2025