User Guide
This guide covers data preparation, the graphical and command-line interfaces, output files, processing workflows, and troubleshooting for the Entomological Label Information Extraction system.
System Overview
The system is designed to extract and digitize text from museum specimen labels using AI and OCR technologies. It supports three processing pipelines:
Gemini Pipeline (recommended): Cloud-based detection, classification, OCR (printed + handwritten), entity recognition, and Darwin Core export via the Google Gemini API
Multi-Label Images (MLI): Full specimen photos processed with local models (Detectron2, TensorFlow, Tesseract)
Single-Label Images (SLI): Pre-cropped individual label images processed with local models
Architecture
Gemini Pipeline:
Input Images → Gemini Detection + Classification + Rotation → OCR/HTR → Post-processing → Entity Recognition + GBIF/OSM → Darwin Core Export
Traditional Pipelines (MLI / SLI):
Input Images → Detection (Faster R-CNN) → Classification (TensorFlow) → OCR (Tesseract/Google Vision) → Post-processing → Structured Output
Core Components:
Gemini API: Cloud-based detection, classification, OCR, HTR, and entity extraction (recommended)
Label detection using Faster R-CNN (traditional)
Classification models for label types (traditional)
OCR using Tesseract and Google Vision API (traditional)
Entity recognition with GBIF validation and OSM geocoding
Post-processing for text cleaning and structuring
Darwin Core / OpenDS export
Preprocessing and Thresholds
Stage 1 (Image Processing) is restricted to geometric normalization and routing only: label detection and cropping, classification (identifier vs. not, handwritten vs. printed, multi- vs. single-label), and rotation normalization to 0°/90°/180°/270°. No intensity-based enhancements (e.g., CLAHE, histogram equalization, global normalization) are applied in Stage 1, to preserve cues learned by the detectors/classifiers.
Stage 2 (OCR preprocessing, printed labels) applies grayscale conversion, Gaussian/median denoising, binarization via Otsu or adaptive mean/Gaussian (block size and C tunable), skew estimation within ±10° and deskew, and optional morphological clean-up (dilation/erosion) before Tesseract OCR. Google Vision is called on the rotated ROI without thresholding.
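As a rough illustration of the Otsu option in Stage 2, the sketch below selects a global threshold by maximizing the between-class variance of a grayscale histogram. It uses only the standard library and flat lists of 0-255 values. The function names otsu_threshold and binarize are illustrative and are not taken from the project's code, which may rely on an image-processing library instead.

```python
def otsu_threshold(gray_values):
    """Return the threshold t (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]            # number of pixels with value <= t
        sum_bg += t * hist[t]
        w_fg = total - w_bg        # number of pixels with value > t
        if w_bg == 0 or w_fg == 0:
            continue               # one class is empty, variance undefined
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(gray_values, threshold):
    """Map values above the threshold to 255 (white) and the rest to 0 (black)."""
    return [255 if v > threshold else 0 for v in gray_values]


# Example: dark ink (value 30) on a light background (value 220).
pixels = [30] * 40 + [220] * 160
t = otsu_threshold(pixels)
binary = binarize(pixels, t)
print(t, binary.count(0), binary.count(255))
```

Any threshold between the two intensity clusters separates them equally well, so the sketch keeps the first one it finds.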
Empty-label detection thresholds: we crop a 10% border on all sides, count "dark" pixels as those with mean RGB < 100, and classify a label as empty if the dark-pixel proportion p_dark < 0.01 (1%).
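The empty-label rule can be sketched in a few lines. This is a minimal, standard-library illustration of the stated thresholds, not the project's implementation: the function name is_empty_label, its parameter names, and the nested-list image representation are assumptions made for the example.

```python
def is_empty_label(pixels, border=0.10, dark_below=100, max_dark_fraction=0.01):
    """Return True if the cropped label contains almost no dark pixels.

    pixels is a list of rows; each row is a list of (r, g, b) tuples.
    """
    height, width = len(pixels), len(pixels[0])
    dy, dx = int(height * border), int(width * border)
    # Drop the border on all four sides before counting.
    inner = [row[dx:width - dx] for row in pixels[dy:height - dy]]

    total = sum(len(row) for row in inner)
    if total == 0:
        return True  # assumption: an image with no interior counts as empty

    # A pixel is "dark" when the mean of its RGB channels is below dark_below.
    dark = sum(1 for row in inner for r, g, b in row if (r + g + b) / 3 < dark_below)
    p_dark = dark / total
    return p_dark < max_dark_fraction


WHITE, BLACK = (255, 255, 255), (0, 0, 0)

# A blank 20x20 label.
blank = [[WHITE] * 20 for _ in range(20)]

# The same label with a 4x4 block of ink in the middle.
inked = [row[:] for row in blank]
for y in range(8, 12):
    for x in range(8, 12):
        inked[y][x] = BLACK

print(is_empty_label(blank))  # True
print(is_empty_label(inked))  # False
```

Cropping the border first means that dark edges from the label's outline or the pin shadow do not count toward p_dark.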
Preparing Your Data
Image Requirements
Quality Guidelines
Resolution: 300 DPI or higher recommended
Format: JPEG, PNG
Lighting: Even, sufficient contrast
Focus: Sharp, minimal blur
Orientation: Any (system handles rotation)
Multi-Label Images
Full specimen photos showing multiple labels
Include collection labels, determination labels, locality labels
Ensure all labels are visible and readable
Single-Label Images
Individual label images, pre-cropped
One label per image
Include some margin around the label text
Directory Structure
Organize your data as follows:
project/
├── data/
│   ├── MLI/
│   │   ├── input/     # Multi-label input images
│   │   └── output/    # Processing results
│   └── SLI/
│       ├── input/     # Single-label input images
│       └── output/    # Processing results
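If the folders do not exist yet, a short script can create them. The sketch below builds the layout shown above under a root of your choice. The function name create_layout is illustrative, and the example runs against a temporary directory so that it does not touch a real project.

```python
import tempfile
from pathlib import Path


def create_layout(root):
    """Create data/{MLI,SLI}/{input,output} under root and return the paths."""
    root = Path(root)
    created = []
    for pipeline in ("MLI", "SLI"):
        for sub in ("input", "output"):
            path = root / "data" / pipeline / sub
            path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
            created.append(path)
    return created


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for path in create_layout(Path(tmp) / "project"):
        print(path.relative_to(tmp))
```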
Using the Interface
Starting the Interface
# Recommended: Quick launch
python launch.py
# Alternative: Streamlit directly
streamlit run interfaces/launch_streamlit.py
# Alternative: Desktop GUI
python interfaces/launch_gui.py
The Streamlit web interface (recommended) provides:
Pipeline Selection: Choose between Gemini (recommended), MLI, or SLI
Interactive Web UI: Modern browser-based interface
Real-time Progress: Live progress tracking with job duration display
Processing Dashboard: System metrics and performance monitoring
Results Browser: Interactive file preview and analysis
OCR Correction: Edit transcribed text directly in the browser
Entity Viewer: See extracted scientific names, collectors, dates, localities
Re-run Entity Recognition: After correcting OCR text, re-extract entities with one click
Darwin Core Export: Download standardised Darwin Core records (JSON/CSV)
Interface Workflow
Select Input Directory: Browse and choose folder containing your images
Choose Pipeline Type: Gemini (recommended), MLI for specimen photos, SLI for cropped labels
Configure Settings: Set processing options (OCR engine, entity recognition, exports)
Start Processing: Click "Start Processing" and monitor real-time progress
View Results: Browse generated files, charts, and structured data
Correct OCR (optional): Edit transcribed text in the label explorer
Re-run Entity Recognition (optional): Extract entities again after corrections
Export: Download all outputs (JSON, CSV, Darwin Core)
Command Line Usage
Gemini Pipeline (Recommended)
# Set your API key
export GEMINI_API_KEY="<your-api-key>"
# Run the full Gemini pipeline
./tools/pipelines/run_gemini_pipeline_conda.sh
# With custom options
INPUT_DIR=data/MLI/input OUTPUT_DIR=data/MLI/output \
ENTITY_RECOGNITION=true EXPORT_DWC=true EXPORT_CSV=true \
./tools/pipelines/run_gemini_pipeline_conda.sh
The Gemini pipeline handles detection, classification, OCR (printed and handwritten), post-processing, entity recognition, GBIF validation, OSM geocoding, and Darwin Core export in a single run.
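The same run can be started from Python by passing the options as environment variables. The sketch below is one way to do that, under stated assumptions: the helper names build_env and run_gemini_pipeline are illustrative, and writing "false" to switch an option off is a guess, since the commands above only show "true".

```python
import os
import subprocess

SCRIPT = "./tools/pipelines/run_gemini_pipeline_conda.sh"


def build_env(input_dir, output_dir, entity_recognition=True,
              export_dwc=True, export_csv=True, base_env=None):
    """Return a copy of the environment with the pipeline options set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "INPUT_DIR": input_dir,
        "OUTPUT_DIR": output_dir,
        # Assumption: the script reads lowercase "true"/"false" strings.
        "ENTITY_RECOGNITION": str(entity_recognition).lower(),
        "EXPORT_DWC": str(export_dwc).lower(),
        "EXPORT_CSV": str(export_csv).lower(),
    })
    return env


def run_gemini_pipeline(input_dir, output_dir, **options):
    """Run the pipeline script; raises if the API key is missing or the run fails."""
    env = build_env(input_dir, output_dir, **options)
    if not env.get("GEMINI_API_KEY"):
        raise RuntimeError("GEMINI_API_KEY is not set")
    return subprocess.run([SCRIPT], env=env, check=True)


# Inspect the options without starting a run.
preview = build_env("data/MLI/input", "data/MLI/output", base_env={})
for name in sorted(preview):
    print(f"{name}={preview[name]}")
```

Calling run_gemini_pipeline("data/MLI/input", "data/MLI/output") from the project root would then start the script with these settings.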
Traditional Pipeline Commands
Multi-Label Processing:
# Basic detection
python scripts/processing/detection.py -j data/MLI/input -o data/MLI/output
# With custom confidence threshold
python scripts/processing/detection.py -j data/MLI/input -o data/MLI/output --confidence 0.7
Single-Label Processing (sequential):
# 1) Empty label filtering
python scripts/processing/analysis.py -i data/SLI/input -o data/SLI/output
# 2) Classify identifiers and text type
python scripts/processing/classifiers.py -m 1 -j data/SLI/input -o data/SLI/output # identifier/not_identifier
python scripts/processing/classifiers.py -m 2 -j data/SLI/input -o data/SLI/output # handwritten/printed
# 3) Rotation correction for printed labels
python scripts/processing/rotation.py -i data/SLI/output/printed -o data/SLI/output/printed/rotated
# 4) OCR (choose one)
# Option A: Tesseract OCR
python scripts/processing/tesseract.py -d data/SLI/output/printed/rotated -o data/SLI/output
# Option B: Google Vision API
python scripts/processing/vision.py -c credentials.json -d data/SLI/output/printed/rotated -o data/SLI/output
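Because the single-label steps must run in order, it can help to generate the command list in one place. The following sketch only assembles the commands listed above, with the same scripts and flags. The function name sli_commands and its parameters are illustrative. It prints the commands instead of running them.

```python
import shlex


def sli_commands(input_dir, output_dir, ocr="tesseract", credentials=None):
    """Return the SLI steps, in order, as argument lists."""
    scripts = "scripts/processing"
    printed = f"{output_dir}/printed"
    rotated = f"{printed}/rotated"
    commands = [
        # 1) Empty label filtering
        ["python", f"{scripts}/analysis.py", "-i", input_dir, "-o", output_dir],
        # 2) identifier/not_identifier, then handwritten/printed
        ["python", f"{scripts}/classifiers.py", "-m", "1", "-j", input_dir, "-o", output_dir],
        ["python", f"{scripts}/classifiers.py", "-m", "2", "-j", input_dir, "-o", output_dir],
        # 3) Rotation correction for printed labels
        ["python", f"{scripts}/rotation.py", "-i", printed, "-o", rotated],
    ]
    # 4) OCR with exactly one engine
    if ocr == "tesseract":
        commands.append(["python", f"{scripts}/tesseract.py", "-d", rotated, "-o", output_dir])
    elif ocr == "vision":
        if credentials is None:
            raise ValueError("Google Vision requires a credentials file")
        commands.append(["python", f"{scripts}/vision.py", "-c", credentials,
                         "-d", rotated, "-o", output_dir])
    else:
        raise ValueError(f"unknown OCR engine: {ocr}")
    return commands


for command in sli_commands("data/SLI/input", "data/SLI/output"):
    print(shlex.join(command))
```

To execute the steps, pass each list to subprocess.run(command, check=True) so that a failing step stops the sequence.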
Advanced Options
Detection Parameters:
python scripts/processing/detection.py \
-j data/MLI/input \
-o data/MLI/output \
--confidence 0.5 \
--batch-size 16 \
--device auto \
--no-cache # optional
# Cache maintenance
python scripts/processing/detection.py --clear-cache
OCR Configuration:
# Tesseract (printed labels after rotation)
python scripts/processing/tesseract.py \
-d data/SLI/output/printed/rotated \
-o data/SLI/output \
-t 1 # 1=Otsu, 2=Adaptive-Mean, 3=Adaptive-Gaussian
# Google Vision (printed labels after rotation)
python scripts/processing/vision.py \
-c credentials.json \
-d data/SLI/output/printed/rotated \
-o data/SLI/output
Manual Pipeline Scripts
Direct Script Execution
For advanced users or batch processing, run pipeline scripts directly:
# Multi-label pipeline (conda-based)
./tools/pipelines/run_mli_pipeline_conda.sh
# Single-label pipeline (conda-based)
./tools/pipelines/run_sli_pipeline_conda.sh
# Set custom input/output paths
INPUT_DIR=/path/to/input OUTPUT_DIR=/path/to/output ./tools/pipelines/run_mli_pipeline_conda.sh
Benefits of Direct Scripts:
- Full control over environment
- Custom path configuration
- Batch processing integration
- Debugging and development
Understanding Results
Output Structure
Gemini Pipeline Results:
data/MLI/output/
├── entity_master.json           # All labels with entities, GBIF, OSM
├── consolidated_results.json    # Labels with OCR text and metadata
├── quality_report.json          # Extraction quality scores per label
├── darwin_core.json             # Darwin Core formatted records
├── darwin_core.csv              # Same in CSV format
├── validated_results.json       # After manual OCR corrections (Streamlit)
└── input_cropped/               # Cropped label images
Traditional Multi-Label Results:
data/MLI/output/
├── input_predictions.csv        # Detection coordinates and confidence
├── input_cropped/               # Individual label images
├── detection_stats.json         # Processing statistics
└── consolidated_results.json    # Complete detection report
Traditional Single-Label Results:
data/SLI/output/
├── classification/
│   ├── empty/                        # Empty labels
│   ├── handwritten/                  # Handwritten labels
│   ├── printed/                      # Printed labels
│   └── identifier/                   # QR codes, barcodes
├── ocr_results/
│   ├── tesseract/                    # Tesseract OCR output
│   └── google_vision/                # Google Vision API output
├── processed/
│   ├── corrected_transcripts.json    # Cleaned and corrected text
│   ├── plausible_transcripts.json    # High-confidence results
│   └── metadata.json                 # Processing metadata
└── consolidated_results.json         # Final structured output
Key Output Files
entity_master.json (Gemini pipeline)
Complete results with extracted entities per label, including:
- Scientific names validated against GBIF
- Collector names and collection dates
- Localities geocoded with OpenStreetMap
- Confidence scores and quality metrics
darwin_core.json / darwin_core.csv (Gemini pipeline)
Standardised Darwin Core records suitable for:
- Direct import into biodiversity databases
- GBIF data publishing
- Research data sharing
consolidated_results.json
Complete processing results, including:
- Original image metadata
- Detection/classification results
- OCR transcriptions
- Confidence scores
- Processing timestamps
corrected_transcripts.json (traditional pipelines)
Post-processed text with:
- Spelling corrections
- Format standardization
- Confidence ratings
plausible_transcripts.json (traditional pipelines)
High-quality extractions suitable for:
- Automated database entry
- Research analysis
- Publication-ready data
Quality Assessment
Confidence Scores:
- Detection confidence: Probability of correct label detection
- Classification confidence: Accuracy of label type identification
- OCR confidence: Text extraction reliability
Quality Indicators:
- Image resolution and clarity
- Text contrast and legibility
- Processing success rates
- Manual review recommendations
Processing Workflows
Complete Museum Digitization
Image Capture
# Photograph specimens with multiple labels
# Save as high-resolution JPEG files
Multi-Label Detection
python scripts/processing/detection.py -j photos/ -o detections/
Label Extraction
# Copy cropped labels to the SLI pipeline input
cp detections/input_cropped/* data/SLI/input/
Single-Label Processing
python scripts/processing/analysis.py -j data/SLI/input -o data/SLI/output
Quality Control
python scripts/evaluation/analysis_eval.py -i data/SLI/output/
Research Data Extraction
Direct Processing
# Process pre-cropped research labels
python scripts/processing/analysis.py -j research_labels/ -o results/
High-Confidence Filtering
# Extract reliable data
jq '.[] | select(.confidence > 0.8)' results/plausible_transcripts.json
Data Export
# Convert to CSV for analysis
python scripts/postprocessing/consolidate_results.py -i results/ -f csv
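Where jq is not available, the high-confidence filter can be done in Python. The sketch below mirrors the jq expression above: it assumes, as that expression does, that the file holds an array of objects with a numeric confidence field. The function name high_confidence and the text field in the sample data are illustrative.

```python
import json


def high_confidence(records, threshold=0.8):
    """Keep records whose numeric 'confidence' is strictly above the threshold."""
    kept = []
    for record in records:
        confidence = record.get("confidence")
        # Records without a numeric confidence are dropped, as jq's select would do.
        if isinstance(confidence, (int, float)) and confidence > threshold:
            kept.append(record)
    return kept


# Inline sample standing in for results/plausible_transcripts.json.
sample = json.loads("""
[
  {"text": "Apis mellifera", "confidence": 0.93},
  {"text": "Ap1s mel1ifera", "confidence": 0.41},
  {"text": "no score recorded"}
]
""")

reliable = high_confidence(sample)
print(json.dumps(reliable, indent=2))
```

For a real run, replace the inline sample with json.load on the opened results file.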
Batch Processing
For large datasets:
# Split the image list into batches of 50 images
find data/MLI/input -name "*.jpg" | split -l 50 - batch_
# Process each batch in its own input/output directories
for batch in batch_*; do
  in_dir="batches/${batch}/input"
  out_dir="batches/${batch}/output"
  mkdir -p "$in_dir" "$out_dir"
  while IFS= read -r img; do cp "$img" "$in_dir/"; done < "$batch"
  python scripts/processing/detection.py -j "$in_dir" -o "$out_dir"
  # Consolidate results
done
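The batching step can also be written in Python, which avoids the intermediate list files. This is a minimal sketch of fixed-size chunking. The helper names list_images and batches are illustrative, and the example uses made-up file names instead of scanning a directory.

```python
from pathlib import Path


def list_images(input_dir, pattern="*.jpg"):
    """Recursively collect image paths, like `find input_dir -name "*.jpg"`."""
    return sorted(Path(input_dir).rglob(pattern))


def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 120 hypothetical file names split into batches of 50.
names = [f"img_{i:03d}.jpg" for i in range(120)]
chunks = list(batches(names))
print([len(chunk) for chunk in chunks])  # [50, 50, 20]
```

Each chunk can then be copied to its own input directory and passed to detection.py.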
Troubleshooting
Common Issues
Low Detection Accuracy
- Check image quality and resolution
- Adjust confidence thresholds
- Verify lighting and contrast
- Consider manual cropping for difficult cases
OCR Errors
- Try different OCR methods (Tesseract vs. Google Vision)
- Adjust language settings
- Check for proper rotation correction
- Review image preprocessing steps
Memory Issues
- Reduce batch sizes
- Process images sequentially
- Close other applications
- Consider using Docker for memory management
Performance Problems
- Use GPU acceleration when available
- Optimize image sizes
- Process in smaller batches
- Monitor system resources
Getting Help
When encountering issues:
Check log files for error messages
Verify input data format and quality
Test with sample images first
Consult the troubleshooting documentation
Report issues with detailed error information
Best Practices
Image Preparation
Standardize lighting conditions
Maintain consistent resolution
Remove dust and debris from labels
Ensure labels are flat and unfolded
Processing Strategy
Start with small test batches
Validate results before large-scale processing
Keep original images as backups
Document processing parameters used
Quality Control
Review classification results manually
Validate high-confidence OCR outputs
Check for systematic errors
Maintain processing logs
Data Management
Organize results by processing date
Archive original images separately
Document metadata and provenance
Plan for long-term data storage
Entity Recognition (Gemini Pipeline)
The Gemini pipeline includes an entity recognition step that extracts structured data from OCR text:
Scientific names: Genus, species, subspecies (validated against the GBIF Backbone Taxonomy)
Collectors: Person names associated with specimens
Collection dates: Parsed and normalised dates
Localities: Place names geocoded with OpenStreetMap (Nominatim)
Other fields: Altitude, habitat, collection methods, identifiers
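To spot-check a name or locality by hand, you can query the public GBIF and Nominatim services directly. The sketch below only builds the request URLs for GBIF's species match endpoint and Nominatim's search endpoint. Whether the pipeline uses these exact endpoints and parameters is an assumption, and the function names are illustrative.

```python
from urllib.parse import urlencode

GBIF_MATCH = "https://api.gbif.org/v1/species/match"
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def gbif_match_url(scientific_name):
    """URL that asks GBIF to match a name against its backbone taxonomy."""
    return f"{GBIF_MATCH}?{urlencode({'name': scientific_name})}"


def nominatim_search_url(locality, limit=1):
    """URL that asks Nominatim for JSON geocoding results for a place name."""
    query = {"q": locality, "format": "json", "limit": limit}
    return f"{NOMINATIM_SEARCH}?{urlencode(query)}"


print(gbif_match_url("Apis mellifera"))
print(nominatim_search_url("Berlin"))
```

When sending these requests, note that the public Nominatim service expects an identifying User-Agent header and a low request rate.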
Results are exported as Darwin Core records (darwin_core.json and darwin_core.csv) for direct use in biodiversity databases.
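For orientation, the sketch below shows what mapping extracted entities to Darwin Core terms can look like. The Darwin Core term names (scientificName, recordedBy, eventDate, locality, decimalLatitude, decimalLongitude) are standard. The input keys on the left of FIELD_MAP and the sample record are hypothetical and do not reflect the actual layout of entity_master.json.

```python
import csv
import io

# Hypothetical entity key -> Darwin Core term
FIELD_MAP = {
    "scientific_name": "scientificName",
    "collector": "recordedBy",
    "date": "eventDate",
    "locality": "locality",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
}


def to_darwin_core(entity):
    """Rename known keys to Darwin Core terms; missing values become empty."""
    return {term: entity.get(key, "") for key, term in FIELD_MAP.items()}


def write_csv(records):
    """Serialize Darwin Core records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_MAP.values()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up example record for illustration only.
entity = {
    "scientific_name": "Apis mellifera",
    "collector": "A. Example",
    "date": "1932-06-14",
    "locality": "Berlin",
}
record = to_darwin_core(entity)
print(write_csv([record]))
```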
Advanced Features
Custom Configuration
Create custom processing configurations:
# config/custom_settings.py
DETECTION_CONFIDENCE = 0.85
OCR_METHOD = 'google'
LANGUAGE = 'eng+fra' # Multi-language support
OUTPUT_FORMAT = 'json'
Programmatic Access
Use the system programmatically:
from label_processing import LabelProcessor
processor = LabelProcessor()
results = processor.process_directory('data/SLI/input')
processor.save_results(results, 'output.json')
Integration
Integrate with existing systems:
# Database integration example
import json
from your_database import Database
with open('consolidated_results.json') as f:
    data = json.load(f)
db = Database()
for record in data:
    db.insert_specimen_data(record)
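As a concrete stand-in for the placeholder Database class above, the following sketch stores each record as a JSON string in SQLite using only the standard library. The table name specimens, the function store_records, and the sample record are assumptions for illustration. Because the exact fields of consolidated_results.json are not specified here, the record is kept as a single JSON column.

```python
import json
import sqlite3


def store_records(conn, records):
    """Insert each record as a JSON string and return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specimens ("
        "id INTEGER PRIMARY KEY, record TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO specimens (record) VALUES (?)",
        [(json.dumps(record),) for record in records],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM specimens").fetchone()[0]


# In-memory database and a made-up record for demonstration.
conn = sqlite3.connect(":memory:")
records = [{"file": "label_001.jpg", "text": "Apis mellifera"}]
count = store_records(conn, records)
print(count)  # 1
```

Pointing sqlite3.connect at a file path instead of ":memory:" would persist the records between runs.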