User Guide

This guide covers how to use the Entomological Label Information Extraction system, from preparing images to interpreting and exporting results.

System Overview

The system is designed to extract and digitize text from museum specimen labels using AI and OCR technologies. It supports three processing pipelines:

  • Gemini Pipeline (recommended): Cloud-based detection, classification, OCR (printed + handwritten), entity recognition, and Darwin Core export via the Google Gemini API

  • Multi-Label Images (MLI): Full specimen photos processed with local models (Detectron2, TensorFlow, Tesseract)

  • Single-Label Images (SLI): Pre-cropped individual label images processed with local models

Architecture

Gemini Pipeline:

Input Images → Gemini Detection + Classification + Rotation → OCR/HTR → Post-processing → Entity Recognition + GBIF/OSM → Darwin Core Export

Traditional Pipelines (MLI / SLI):

Input Images → Detection (Faster R-CNN) → Classification (TensorFlow) → OCR (Tesseract/Google Vision) → Post-processing → Structured Output

Core Components:

  • Gemini API: Cloud-based detection, classification, OCR, HTR, and entity extraction (recommended)

  • Label detection using Faster R-CNN (traditional)

  • Classification models for label types (traditional)

  • OCR using Tesseract and Google Vision API (traditional)

  • Entity recognition with GBIF validation and OSM geocoding

  • Post-processing for text cleaning and structuring

  • Darwin Core / OpenDS export

Preprocessing and Thresholds

  • Stage 1 (Image Processing) is restricted to geometric normalization and routing only: label detection and cropping, classification (identifier vs. not, handwritten vs. printed, multi- vs. single-label), and rotation normalization to 0°/90°/180°/270°. No intensity-based enhancements (e.g., CLAHE, histogram equalization, global normalization) are applied in Stage 1 to preserve cues learned by the detectors/classifiers.

  • Stage 2 (OCR preprocessing, printed labels) applies grayscale conversion, Gaussian/median denoising, binarization via Otsu or adaptive mean/Gaussian (block size and C tunable), skew estimation within ±10° and deskew, and optional morphological clean-up (dilation/erosion) before Tesseract OCR. Google Vision is called on the rotated ROI without thresholding.

  • Empty-label detection thresholds: the system crops a 10% border from all sides, counts a pixel as "dark" when its mean RGB value is below 100, and classifies a label as empty if the dark-pixel proportion p_dark < 0.01 (1%).
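
The empty-label rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the image is represented as a plain list of rows of (R, G, B) tuples, and the function names dark_pixel_proportion and is_empty_label are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0-255
Image = List[List[Pixel]]      # rows of pixels (assumed representation)


def dark_pixel_proportion(image: Image, border: float = 0.10,
                          dark_threshold: float = 100.0) -> float:
    """Crop `border` from every side and return the share of dark pixels."""
    height, width = len(image), len(image[0])
    dy, dx = int(height * border), int(width * border)
    cropped = [row[dx:width - dx] for row in image[dy:height - dy]]
    total = sum(len(row) for row in cropped)
    if total == 0:
        return 0.0
    # A pixel is "dark" when the mean of its three channels is below the threshold.
    dark = sum(1 for row in cropped for (r, g, b) in row
               if (r + g + b) / 3 < dark_threshold)
    return dark / total


def is_empty_label(image: Image, max_dark: float = 0.01) -> bool:
    """A label counts as empty when fewer than 1% of cropped pixels are dark."""
    return dark_pixel_proportion(image) < max_dark
```

Cropping the border first keeps shadows and paper edges from being counted as ink, which is presumably why the rule discards the outer 10% before measuring.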

Preparing Your Data

Image Requirements

Quality Guidelines

  • Resolution: 300 DPI or higher recommended

  • Format: JPEG, PNG

  • Lighting: Even, sufficient contrast

  • Focus: Sharp, minimal blur

  • Orientation: Any (system handles rotation)

Multi-Label Images

  • Full specimen photos showing multiple labels

  • Include collection labels, determination labels, locality labels

  • Ensure all labels are visible and readable

Single-Label Images

  • Individual label images, pre-cropped

  • One label per image

  • Include some margin around the label text

Directory Structure

Organize your data as follows:

project/
├── data/
│   ├── MLI/
│   │   ├── input/          # Multi-label input images
│   │   └── output/         # Processing results
│   └── SLI/
│       ├── input/          # Single-label input images
│       └── output/         # Processing results
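
The layout above can be created with a short script. The following is a minimal sketch using only the standard library; create_layout is a hypothetical helper name, and the project root is whatever directory you pass in.

```python
from pathlib import Path
from typing import List


def create_layout(project_root: str) -> List[Path]:
    """Create data/{MLI,SLI}/{input,output} under project_root."""
    created = []
    for pipeline in ("MLI", "SLI"):
        for sub in ("input", "output"):
            folder = Path(project_root) / "data" / pipeline / sub
            # exist_ok makes the call safe to repeat on an existing project
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created
```

Calling create_layout("project") returns the four directories it ensured exist.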

Using the Interface

Starting the Interface

# Recommended: Quick launch
python launch.py

# Alternative: Streamlit directly
streamlit run interfaces/launch_streamlit.py

# Alternative: Desktop GUI
python interfaces/launch_gui.py

The Streamlit web interface (recommended) provides:

  • Pipeline Selection: Choose between Gemini (recommended), MLI, or SLI

  • Interactive Web UI: Modern browser-based interface

  • Real-time Progress: Live progress tracking with job duration display

  • Processing Dashboard: System metrics and performance monitoring

  • Results Browser: Interactive file preview and analysis

  • OCR Correction: Edit transcribed text directly in the browser

  • Entity Viewer: See extracted scientific names, collectors, dates, localities

  • Re-run Entity Recognition: After correcting OCR text, re-extract entities with one click

  • Darwin Core Export: Download standardised Darwin Core records (JSON/CSV)

Interface Workflow

  1. Select Input Directory: Browse and choose folder containing your images

  2. Choose Pipeline Type: Gemini (recommended), MLI for specimen photos, SLI for cropped labels

  3. Configure Settings: Set processing options (OCR engine, entity recognition, exports)

  4. Start Processing: Click "Start Processing" and monitor real-time progress

  5. View Results: Browse generated files, charts, and structured data

  6. Correct OCR (optional): Edit transcribed text in the label explorer

  7. Re-run Entity Recognition (optional): Extract entities again after corrections

  8. Export: Download all outputs (JSON, CSV, Darwin Core)

Command Line Usage

Traditional Pipeline Commands

Multi-Label Processing:

# Basic detection
python scripts/processing/detection.py -j data/MLI/input -o data/MLI/output

# With custom confidence threshold
python scripts/processing/detection.py -j data/MLI/input -o data/MLI/output --confidence 0.7

Single-Label Processing (sequential):

# 1) Empty label filtering
python scripts/processing/analysis.py -i data/SLI/input -o data/SLI/output

# 2) Classify identifiers and text type
python scripts/processing/classifiers.py -m 1 -j data/SLI/input -o data/SLI/output  # identifier/not_identifier
python scripts/processing/classifiers.py -m 2 -j data/SLI/input -o data/SLI/output  # handwritten/printed

# 3) Rotation correction for printed labels
python scripts/processing/rotation.py -i data/SLI/output/printed -o data/SLI/output/printed/rotated

# 4) OCR (choose one)
# Option A: Tesseract OCR
python scripts/processing/tesseract.py -d data/SLI/output/printed/rotated -o data/SLI/output

# Option B: Google Vision API
python scripts/processing/vision.py -c credentials.json -d data/SLI/output/printed/rotated -o data/SLI/output
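
The sequential steps above can be chained from Python so that a failing step stops the run. This is a sketch only: it reuses the script paths and flags exactly as listed in this section, picks Tesseract as the OCR option, and the helper names sli_commands and run_all are hypothetical.

```python
import subprocess
from typing import List


def sli_commands(input_dir: str, output_dir: str) -> List[List[str]]:
    """Build the SLI command sequence (Tesseract variant) as argument lists."""
    printed = f"{output_dir}/printed"
    rotated = f"{printed}/rotated"
    scripts = "scripts/processing"
    return [
        ["python", f"{scripts}/analysis.py", "-i", input_dir, "-o", output_dir],
        ["python", f"{scripts}/classifiers.py", "-m", "1", "-j", input_dir, "-o", output_dir],
        ["python", f"{scripts}/classifiers.py", "-m", "2", "-j", input_dir, "-o", output_dir],
        ["python", f"{scripts}/rotation.py", "-i", printed, "-o", rotated],
        ["python", f"{scripts}/tesseract.py", "-d", rotated, "-o", output_dir],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it too unless dry_run is set."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run(cmd, check=True)
```

Running run_all(sli_commands("data/SLI/input", "data/SLI/output")) with the default dry_run=True only prints the commands, which is a cheap way to review paths before a long run.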

Advanced Options

Detection Parameters:

python scripts/processing/detection.py \
  -j data/MLI/input \
  -o data/MLI/output \
  --confidence 0.5 \
  --batch-size 16 \
  --device auto \
  --no-cache        # optional

# Cache maintenance
python scripts/processing/detection.py --clear-cache

OCR Configuration:

# Tesseract (printed labels after rotation)
python scripts/processing/tesseract.py \
  -d data/SLI/output/printed/rotated \
  -o data/SLI/output \
  -t 1            # 1=Otsu, 2=Adaptive-Mean, 3=Adaptive-Gaussian

# Google Vision (printed labels after rotation)
python scripts/processing/vision.py \
  -c credentials.json \
  -d data/SLI/output/printed/rotated \
  -o data/SLI/output
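
For orientation, option -t 1 selects Otsu binarization. The sketch below shows what Otsu's method computes on a flat list of grayscale values: the threshold that maximizes the variance between the dark and bright classes. It is a pure-Python illustration with hypothetical function names, not the code in tesseract.py, which may rely on an image library instead.

```python
from typing import List


def otsu_threshold(gray: List[int]) -> int:
    """Return the threshold t (0-255) maximizing between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for value in gray:
        hist[value] += 1
    total = len(gray)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of their gray values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue        # dark class still empty
        if w1 == 0:
            break           # bright class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(gray: List[int], threshold: int) -> List[int]:
    """Map values above the threshold to 255 and all others to 0."""
    return [255 if value > threshold else 0 for value in gray]
```

Otsu picks one global threshold per image, so it suits evenly lit labels; the adaptive modes (-t 2 and -t 3) compute thresholds per neighbourhood and tend to cope better with uneven lighting.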

Manual Pipeline Scripts

Direct Script Execution

For advanced users or batch processing, run pipeline scripts directly:

# Multi-label pipeline (conda-based)
./tools/pipelines/run_mli_pipeline_conda.sh

# Single-label pipeline (conda-based)
./tools/pipelines/run_sli_pipeline_conda.sh

# Set custom input/output paths
INPUT_DIR=/path/to/input OUTPUT_DIR=/path/to/output ./tools/pipelines/run_mli_pipeline_conda.sh

Benefits of Direct Scripts:

  • Full control over environment

  • Custom path configuration

  • Batch processing integration

  • Debugging and development

Understanding Results

Output Structure

Gemini Pipeline Results:

data/MLI/output/
├── entity_master.json             # All labels with entities, GBIF, OSM
├── consolidated_results.json      # Labels with OCR text and metadata
├── quality_report.json            # Extraction quality scores per label
├── darwin_core.json               # Darwin Core formatted records
├── darwin_core.csv                # Same in CSV format
├── validated_results.json         # After manual OCR corrections (Streamlit)
└── input_cropped/                 # Cropped label images

Traditional Multi-Label Results:

data/MLI/output/
├── input_predictions.csv          # Detection coordinates and confidence
├── input_cropped/                 # Individual label images
├── detection_stats.json           # Processing statistics
└── consolidated_results.json      # Complete detection report

Traditional Single-Label Results:

data/SLI/output/
├── classification/
│   ├── empty/                     # Empty labels
│   ├── handwritten/               # Handwritten labels
│   ├── printed/                   # Printed labels
│   └── identifier/                # QR codes, barcodes
├── ocr_results/
│   ├── tesseract/                 # Tesseract OCR output
│   └── google_vision/             # Google Vision API output
├── processed/
│   ├── corrected_transcripts.json # Cleaned and corrected text
│   ├── plausible_transcripts.json # High-confidence results
│   └── metadata.json              # Processing metadata
└── consolidated_results.json      # Final structured output
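
A quick way to sanity-check a finished SLI run is to count the images in each classification folder. The sketch below assumes the folder layout shown above and that label images are JPEG or PNG files; count_labels_per_class is a hypothetical helper, not part of the project.

```python
from pathlib import Path
from typing import Dict

CLASS_FOLDERS = ("empty", "handwritten", "printed", "identifier")
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_labels_per_class(output_dir: str) -> Dict[str, int]:
    """Count image files in each classification subfolder (0 if missing)."""
    base = Path(output_dir) / "classification"
    counts = {}
    for name in CLASS_FOLDERS:
        folder = base / name
        if folder.is_dir():
            counts[name] = sum(
                1 for path in folder.iterdir()
                if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
            )
        else:
            counts[name] = 0
    return counts
```

An unexpectedly large "empty" count, for example, is a hint to revisit image quality before running OCR.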

Key Output Files

entity_master.json (Gemini pipeline)

Complete results with extracted entities per label including:

  • Scientific names validated against GBIF

  • Collector names and collection dates

  • Localities geocoded with OpenStreetMap

  • Confidence scores and quality metrics

darwin_core.json / darwin_core.csv (Gemini pipeline)

Standardised Darwin Core records suitable for:

  • Direct import into biodiversity databases

  • GBIF data publishing

  • Research data sharing

consolidated_results.json

Complete processing results including:

  • Original image metadata

  • Detection/classification results

  • OCR transcriptions

  • Confidence scores

  • Processing timestamps

corrected_transcripts.json (traditional pipelines)

Post-processed text with:

  • Spelling corrections

  • Format standardization

  • Confidence ratings

plausible_transcripts.json (traditional pipelines)

High-quality extractions suitable for:

  • Automated database entry

  • Research analysis

  • Publication-ready data

Quality Assessment

Confidence Scores:

  • Detection confidence: Probability of correct label detection

  • Classification confidence: Accuracy of label type identification

  • OCR confidence: Text extraction reliability

Quality Indicators:

  • Image resolution and clarity

  • Text contrast and legibility

  • Processing success rates

  • Manual review recommendations
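
One way to turn the three confidence scores into a manual-review recommendation is to look at the weakest stage of each record. The sketch below is an assumption-laden illustration: the field names detection_confidence, classification_confidence and ocr_confidence and the 0.8 cut-off are hypothetical and may not match the keys in your output files.

```python
from typing import Dict, Optional, Tuple

# Hypothetical field names; adjust to the keys present in your results.
CONFIDENCE_KEYS = (
    "detection_confidence",
    "classification_confidence",
    "ocr_confidence",
)
REVIEW_THRESHOLD = 0.8  # assumed cut-off, tune per collection


def weakest_stage(record: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (field, value) of the lowest confidence present, or None."""
    present = {key: record[key] for key in CONFIDENCE_KEYS if key in record}
    if not present:
        return None
    key = min(present, key=present.get)
    return key, present[key]


def needs_review(record: Dict[str, float],
                 threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag records with a weak stage, or with no confidence data at all."""
    weakest = weakest_stage(record)
    return weakest is None or weakest[1] < threshold
```

Reporting which stage is weakest, rather than a single averaged score, tells the reviewer whether to re-crop, re-classify, or simply correct the transcription.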

Processing Workflows

Complete Museum Digitization

  1. Image Capture

    # Photograph specimens with multiple labels
    # Save as high-resolution JPEG files
    
  2. Multi-Label Detection

    python scripts/processing/detection.py -j photos/ -o detections/
    
  3. Label Extraction

    # Copy cropped labels into the SLI input directory
    cp detections/input_cropped/* data/SLI/input/
    
  4. Single-Label Processing

    python scripts/processing/analysis.py -i data/SLI/input -o data/SLI/output
    
  5. Quality Control

    python scripts/evaluation/analysis_eval.py -i data/SLI/output/
    

Research Data Extraction

  1. Direct Processing

    # Process pre-cropped research labels
    python scripts/processing/analysis.py -i research_labels/ -o results/
    
  2. High-Confidence Filtering

    # Extract reliable data
    jq '.[] | select(.confidence > 0.8)' results/plausible_transcripts.json
    
  3. Data Export

    # Convert to CSV for analysis
    python scripts/postprocessing/consolidate_results.py -i results/ -f csv
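
The filtering and export steps above can also be done in Python. The sketch assumes, as the jq filter does, that plausible_transcripts.json holds a list of objects with a numeric confidence field; the helper names are hypothetical, and this is not a replacement for consolidate_results.py.

```python
import csv
import json
from typing import Any, Dict, List


def load_records(json_path: str) -> List[Dict[str, Any]]:
    """Load a JSON file that is assumed to contain a list of objects."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def high_confidence(records: List[Dict[str, Any]],
                    threshold: float = 0.8) -> List[Dict[str, Any]]:
    """Keep records whose confidence exceeds the threshold.

    Records without a confidence field are treated as 0 and dropped.
    """
    return [r for r in records if r.get("confidence", 0) > threshold]


def write_csv(records: List[Dict[str, Any]], csv_path: str) -> None:
    """Write records to CSV, using the union of all keys as columns."""
    fieldnames = sorted({key for record in records for key in record})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys become empty cells
```

A typical call would be write_csv(high_confidence(load_records("results/plausible_transcripts.json")), "reliable.csv"), where reliable.csv is an arbitrary output name.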
    

Batch Processing

For large datasets:

# Process in batches of 50 images
find data/MLI/input -name "*.jpg" | split -l 50 - batch_

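# Process each batch in its own working directory (work/ is an arbitrary name)
for batch in batch_*; do
    mkdir -p "work/$batch/input" "work/$batch/output"
    # Copy the images listed in this batch file
    while IFS= read -r img; do cp "$img" "work/$batch/input/"; done < "$batch"
    python scripts/processing/detection.py -j "work/$batch/input" -o "work/$batch/output"
    # Consolidate results
done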
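
The same batching idea can be expressed in Python when the batches need to be driven from a script. This is a minimal sketch with hypothetical helper names; it only groups the file paths and leaves the call to detection.py to the caller.

```python
from pathlib import Path
from typing import Iterator, List, Sequence


def list_images(input_dir: str, pattern: str = "*.jpg") -> List[Path]:
    """Recursively collect matching images, sorted for reproducible batches."""
    return sorted(Path(input_dir).rglob(pattern))


def batches(items: Sequence, size: int = 50) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Iterating over batches(list_images("data/MLI/input")) yields lists of up to 50 paths, with a shorter final list when the total is not a multiple of 50.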

Troubleshooting

Common Issues

Low Detection Accuracy

  • Check image quality and resolution

  • Adjust confidence thresholds

  • Verify lighting and contrast

  • Consider manual cropping for difficult cases

OCR Errors

  • Try different OCR methods (Tesseract vs Google Vision)

  • Adjust language settings

  • Check for proper rotation correction

  • Review image preprocessing steps

Memory Issues

  • Reduce batch sizes

  • Process images sequentially

  • Close other applications

  • Consider using Docker for memory management

Performance Problems

  • Use GPU acceleration when available

  • Optimize image sizes

  • Process in smaller batches

  • Monitor system resources

Getting Help

When encountering issues:

  1. Check log files for error messages

  2. Verify input data format and quality

  3. Test with sample images first

  4. Consult the troubleshooting documentation

  5. Report issues with detailed error information

Best Practices

Image Preparation

  • Standardize lighting conditions

  • Maintain consistent resolution

  • Remove dust and debris from labels

  • Ensure labels are flat and unfolded

Processing Strategy

  • Start with small test batches

  • Validate results before large-scale processing

  • Keep original images as backups

  • Document processing parameters used

Quality Control

  • Review classification results manually

  • Validate high-confidence OCR outputs

  • Check for systematic errors

  • Maintain processing logs

Data Management

  • Organize results by processing date

  • Archive original images separately

  • Document metadata and provenance

  • Plan for long-term data storage

Entity Recognition (Gemini Pipeline)

The Gemini pipeline includes an entity recognition step that extracts structured data from OCR text:

  • Scientific names: Genus, species, subspecies (validated against the GBIF Backbone Taxonomy)

  • Collectors: Person names associated with specimens

  • Collection dates: Parsed and normalised dates

  • Localities: Place names geocoded with OpenStreetMap (Nominatim)

  • Other fields: Altitude, habitat, collection methods, identifiers

Results are exported as Darwin Core records (darwin_core.json and darwin_core.csv) for direct use in biodiversity databases.
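
To make the export step concrete, the sketch below maps one label's extracted entities onto Darwin Core terms. The Darwin Core term names on the right are standard; the entity keys on the left are hypothetical and may differ from the keys in entity_master.json, so treat this as an illustration rather than the pipeline's actual mapping.

```python
from typing import Any, Dict

# Left: assumed entity keys (hypothetical). Right: Darwin Core terms.
FIELD_MAP = {
    "scientific_name": "scientificName",
    "collector": "recordedBy",
    "collection_date": "eventDate",
    "locality": "locality",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
    "altitude": "verbatimElevation",
    "habitat": "habitat",
}


def to_darwin_core(entities: Dict[str, Any]) -> Dict[str, Any]:
    """Rename known entity keys to Darwin Core terms, skipping empty values."""
    return {
        dwc_term: entities[key]
        for key, dwc_term in FIELD_MAP.items()
        if entities.get(key) not in (None, "")
    }
```

Empty fields are omitted rather than exported as blanks, so a record only claims what was actually read from the label.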

Advanced Features

Custom Configuration

Create custom processing configurations:

# config/custom_settings.py
DETECTION_CONFIDENCE = 0.85
OCR_METHOD = 'google'
LANGUAGE = 'eng+fra'  # Multi-language support
OUTPUT_FORMAT = 'json'

Programmatic Access

Use the system programmatically:

from label_processing import LabelProcessor

processor = LabelProcessor()
results = processor.process_directory('data/SLI/input')
processor.save_results(results, 'output.json')

Integration

Integrate with existing systems:

# Database integration example
import json
from your_database import Database

with open('consolidated_results.json') as f:
    data = json.load(f)

db = Database()
for record in data:
    db.insert_specimen_data(record)
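
As a concrete variant of the placeholder example above, the following sketch stores each record as a JSON string in SQLite using only the standard library. The table name specimens, its columns, and the helper store_records are assumptions made for illustration; a production schema would normally use typed columns per field.

```python
import json
import sqlite3
from typing import Any, Dict, List


def store_records(records: List[Dict[str, Any]],
                  db_path: str = ":memory:") -> sqlite3.Connection:
    """Insert each record as a JSON string and return the open connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specimens ("
        "id INTEGER PRIMARY KEY, record TEXT NOT NULL)"
    )
    # Parameterized insert; one row per record
    conn.executemany(
        "INSERT INTO specimens (record) VALUES (?)",
        [(json.dumps(record),) for record in records],
    )
    conn.commit()
    return conn
```

Pass a file path instead of the default ":memory:" to keep the database on disk, and close the returned connection when finished.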