Quick Start Guide

🎆 Get running in 5 minutes! Perfect for first-time users who want to test the system quickly.

Tip

📋 Prerequisites: Make sure you have Git and Conda installed first!

🔧 Option 2: Traditional Pipelines (Offline)

For offline use with local models (Detectron2 + TensorFlow + Tesseract). Printed labels only.

# 1. Get the code
git clone https://github.com/MargotBelot/entomological-label-information-extraction.git
cd entomological-label-information-extraction

# 2. Setup environment
conda env create -f environment.yml
conda activate ELIE
pip install -e .

# 3. Install Tesseract OCR (required for traditional pipelines)
# macOS: brew install tesseract
# Linux: sudo apt install tesseract-ocr

# 4. Choose your interface:
# Streamlit (recommended)
python launch.py

# OR Desktop GUI (Tkinter-based)
python interfaces/launch_gui.py

# OR Manual pipeline scripts
./tools/pipelines/run_gemini_pipeline_conda.sh  # Gemini (recommended)
./tools/pipelines/run_mli_pipeline_conda.sh     # Multi-label (traditional)
./tools/pipelines/run_sli_pipeline_conda.sh     # Single-label (traditional)
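Before running the setup steps above, it can help to confirm that the required command-line tools are reachable. The following is a minimal sketch, not part of the project: the helper name `missing_tools` is hypothetical, and it assumes that `git`, `conda`, and `tesseract` are exposed as executables on your `PATH` (some Conda installations expose `conda` only as a shell function, in which case this check may report a false negative).

```python
import shutil

# Tools this guide asks for; adjust the list if your setup differs.
REQUIRED_TOOLS = ["git", "conda", "tesseract"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```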

🎯 What Happens Next?

After processing, you'll find your results in the output folders:

Gemini Pipeline Output:

data/MLI/output/
├── entity_master.json          # 📊 All labels with entities, GBIF, OSM
├── consolidated_results.json   # 📝 Labels with OCR text and metadata
├── quality_report.json         # ✅ Extraction quality scores
├── darwin_core.json            # 🧬 Darwin Core formatted records
├── darwin_core.csv             # 📄 Same in CSV format
└── input_cropped/              # 🖼️ Cropped label images

Traditional Pipeline Output:

data/MLI/output/
├── consolidated_results.json   # 📊 Complete summary
├── input_predictions.csv       # 🗺 Label locations
└── input_cropped/              # 🖼️ Cropped label images

data/SLI/output/
├── consolidated_results.json   # 📊 Complete summary
├── corrected_transcripts.json  # 🧹 Clean text results
└── classification/             # 📁 Sorted by label type

📈 Quick Results Check

Open consolidated_results.json to see all your extracted text and confidence scores!

# Preview your results
head -n 20 data/SLI/output/consolidated_results.json
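For a structured look at the same file, the results can be loaded with Python's standard `json` module. This is a minimal sketch: the schema of `consolidated_results.json` is not described in this guide, so the hypothetical helper `preview_json` only reports the top-level type, the number of entries, and a small sample rather than assuming any field names.

```python
import json
from pathlib import Path


def preview_json(path, limit=3):
    """Load a JSON file and return a small, printable summary of it."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        # A list of records: show the first few entries.
        return {"type": "list", "count": len(data), "sample": data[:limit]}
    if isinstance(data, dict):
        # A mapping: show the first few top-level keys.
        keys = list(data)
        return {"type": "dict", "count": len(keys), "sample": keys[:limit]}
    return {"type": type(data).__name__, "count": 1, "sample": data}


if __name__ == "__main__":
    results = Path("data/SLI/output/consolidated_results.json")
    if results.exists():
        print(json.dumps(preview_json(results), indent=2, ensure_ascii=False))
    else:
        print(f"{results} not found; run a pipeline first.")
```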

🚑 Need Help?

Understanding Pipeline Types

Multi-Label Images (MLI)

Use when: You have full specimen photos containing multiple labels

# Place images here
data/MLI/input/specimen_001.jpg
data/MLI/input/specimen_002.jpg

What happens:

  1. System detects individual labels in each image

  2. Crops each detected label

  3. Saves cropped labels for further processing

  4. Generates detection results

Output: Detected labels and bounding box coordinates

Single-Label Images (SLI)

Use when: You have pre-cropped individual label images

# Place images here
data/SLI/input/label_001.jpg
data/SLI/input/label_002.jpg

What happens:

  1. Classifies each label (empty/handwritten/printed/identifier)

  2. Corrects rotation if needed

  3. Extracts text using OCR

  4. Post-processes and structures results

Output: Structured text data with metadata

Basic Usage Examples

Command Line Method

Gemini Pipeline (Recommended):

# Full Gemini pipeline: detection, classification, OCR, entities
./tools/pipelines/run_gemini_pipeline_conda.sh

# With custom options
INPUT_DIR=data/MLI/input OUTPUT_DIR=data/MLI/output \
ENTITY_RECOGNITION=true EXPORT_DWC=true EXPORT_CSV=true \
./tools/pipelines/run_gemini_pipeline_conda.sh
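The same environment-variable configuration can be set up from Python when the pipeline is launched from another script. This is a sketch under stated assumptions: the helper names `gemini_env` and `run_gemini` are hypothetical, the variable names are the ones shown in the command above, and it assumes the script treats the string `false` as "disabled" (the guide only shows `true`).

```python
import os
import subprocess

PIPELINE = "./tools/pipelines/run_gemini_pipeline_conda.sh"


def gemini_env(input_dir, output_dir, entity_recognition=True,
               export_dwc=True, export_csv=True):
    """Build the environment passed to the Gemini pipeline script."""
    env = dict(os.environ)  # keep PATH, Conda settings, and so on
    env.update(
        INPUT_DIR=str(input_dir),
        OUTPUT_DIR=str(output_dir),
        # Assumption: the script expects lowercase "true"/"false".
        ENTITY_RECOGNITION=str(entity_recognition).lower(),
        EXPORT_DWC=str(export_dwc).lower(),
        EXPORT_CSV=str(export_csv).lower(),
    )
    return env


def run_gemini(input_dir="data/MLI/input", output_dir="data/MLI/output"):
    """Run the script from the repository root; raises on a non-zero exit."""
    subprocess.run([PIPELINE], env=gemini_env(input_dir, output_dir), check=True)
```

Calling `run_gemini()` from the repository root would then be equivalent to the shell invocation with custom options shown above.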

Traditional Multi-Label Processing:

# Run detection on multi-label images
python scripts/processing/detection.py -j data/MLI/input -o data/MLI/output

Traditional Single-Label Processing:

# Run SLI components sequentially
python scripts/processing/analysis.py -i data/SLI/input -o data/SLI/output  # empty label filtering
python scripts/processing/classifiers.py -m 1 -j data/SLI/input -o data/SLI/output  # identifier/not_identifier
python scripts/processing/classifiers.py -m 2 -j data/SLI/input -o data/SLI/output  # handwritten/printed
python scripts/processing/rotation.py -i data/SLI/output/printed -o data/SLI/output/printed/rotated

# OCR (choose one)
python scripts/processing/tesseract.py -d data/SLI/output/printed/rotated -o data/SLI/output
python scripts/processing/vision.py -c credentials.json -d data/SLI/output/printed/rotated -o data/SLI/output
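Because each step reads the output of the previous one, it can be convenient to run the sequence from a single script that stops at the first failure. The sketch below simply mirrors the commands and flags listed above, using Tesseract as the OCR step; the helper names `sli_commands` and `run_all` are hypothetical and not part of the project.

```python
import subprocess
import sys


def sli_commands(input_dir="data/SLI/input", output_dir="data/SLI/output"):
    """Return the SLI steps from this guide as argument lists, in order."""
    scripts = "scripts/processing"
    printed = f"{output_dir}/printed"
    rotated = f"{printed}/rotated"
    py = sys.executable  # use the interpreter of the active environment
    return [
        [py, f"{scripts}/analysis.py", "-i", input_dir, "-o", output_dir],
        [py, f"{scripts}/classifiers.py", "-m", "1", "-j", input_dir, "-o", output_dir],
        [py, f"{scripts}/classifiers.py", "-m", "2", "-j", input_dir, "-o", output_dir],
        [py, f"{scripts}/rotation.py", "-i", printed, "-o", rotated],
        [py, f"{scripts}/tesseract.py", "-d", rotated, "-o", output_dir],
    ]


def run_all(commands):
    """Run each command in turn; check=True aborts on the first failure."""
    for command in commands:
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Running `run_all(sli_commands())` from the repository root, inside the activated Conda environment, would execute the five steps in order.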

Pipeline Scripts

# Gemini pipeline (recommended)
./tools/pipelines/run_gemini_pipeline_conda.sh

# Multi-label pipeline (traditional, conda-based)
./tools/pipelines/run_mli_pipeline_conda.sh

# Single-label pipeline (traditional, conda-based)
./tools/pipelines/run_sli_pipeline_conda.sh

Understanding Results

Multi-Label Results

After MLI processing, you'll find:

data/MLI/output/
├── input_predictions.csv          # Detection results
├── input_cropped/                 # Cropped label images
│   ├── specimen_001_label_0.jpg
│   ├── specimen_001_label_1.jpg
│   └── ...
└── consolidated_results.json      # Summary report
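To get a first impression of the detection results, `input_predictions.csv` can be read with Python's standard `csv` module. The column names are not documented in this guide, so this sketch (with a hypothetical helper name, `summarize_predictions`) reports whatever header the file contains instead of assuming specific fields.

```python
import csv


def summarize_predictions(csv_path):
    """Return the column names and the number of data rows in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)  # one dict per detection row
        return reader.fieldnames, len(rows)
```

For example, `summarize_predictions("data/MLI/output/input_predictions.csv")` returns the header and the number of rows, which is a quick way to see which columns are available before loading the file into a spreadsheet or data frame.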

Single-Label Results

After SLI processing, you'll find:

data/SLI/output/
├── empty/                         # Empty labels
├── handwritten/                   # Manual transcription needed
├── printed/                       # OCR processing
│   └── rotated/                   # Rotation-corrected labels
├── identifier/                    # QR codes, barcodes
├── ocr_preprocessed.json          # Tesseract results
├── ocr_google_vision.json         # Google Vision results
├── corrected_transcripts.json     # Cleaned text
├── plausible_transcripts.json     # High-confidence text
└── consolidated_results.json      # Final structured output
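A quick way to see how the labels were distributed across the class folders in this layout is to count the files in each one. This is a minimal sketch with a hypothetical helper, `count_by_class`; it assumes the four folder names shown above and counts only files directly inside each folder, so `printed/rotated/` is not included in the `printed` total.

```python
from pathlib import Path

CLASS_FOLDERS = ["empty", "handwritten", "printed", "identifier"]


def count_by_class(output_dir="data/SLI/output", folders=CLASS_FOLDERS):
    """Count files directly inside each class folder (missing folders count 0)."""
    counts = {}
    for name in folders:
        folder = Path(output_dir) / name
        if folder.is_dir():
            counts[name] = sum(1 for item in folder.iterdir() if item.is_file())
        else:
            counts[name] = 0
    return counts
```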

Key Output Files

consolidated_results.json

Complete results with all extracted text, confidence scores, and metadata

corrected_transcripts.json

Post-processed text with corrections and standardizations

plausible_transcripts.json

High-confidence extractions suitable for automated processing

Common Workflows

Museum Digitization

# 1. Photograph specimens (multi-label images)
# 2. Process with MLI pipeline
python scripts/processing/detection.py -j photos/ -o detections/

# 3. Move cropped labels to SLI input
mv detections/input_cropped/* data/SLI/input/

# 4. Process individual labels
python scripts/processing/analysis.py -i data/SLI/input -o data/SLI/output
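The `mv` in step 3 removes the crops from the detection output and silently overwrites files that share a name. If you would rather keep the originals and avoid overwriting, a copy-based variant could look like the sketch below. The helper name `stage_crops` is hypothetical, and the default paths are the ones used in this workflow.

```python
import shutil
from pathlib import Path


def stage_crops(src="detections/input_cropped", dst="data/SLI/input"):
    """Copy cropped labels into the SLI input folder, skipping name clashes."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied, skipped = [], []
    for item in sorted(src.iterdir()):
        if not item.is_file():
            continue  # ignore subfolders
        target = dst / item.name
        if target.exists():
            skipped.append(item.name)  # never overwrite an existing label
        else:
            shutil.copy2(item, target)  # copy2 also preserves timestamps
            copied.append(item.name)
    return copied, skipped
```

The function returns the names it copied and the names it skipped, so clashes can be reviewed before the SLI steps are run.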

Research Data Preparation

# 1. Process pre-cropped labels directly
python scripts/processing/analysis.py -i research_labels/ -o results/

# 2. Extract high-confidence text
cat results/plausible_transcripts.json

# 3. Run evaluation metrics
python scripts/evaluation/ocr_eval.py -i results/

Quality Assessment

# Generate comprehensive evaluation report
python scripts/evaluation/analysis_eval.py -i data/SLI/output/

# Check clustering analysis
python scripts/evaluation/cluster_eval.py -i data/SLI/output/

# Evaluate classification accuracy
python scripts/evaluation/classifiers_eval.py -i data/SLI/output/

Next Steps

Now that you have the basics working:

  1. User Guide: Read the User Guide for end-to-end instructions

  2. API Documentation: Browse API Reference for programmatic usage

  3. Troubleshooting: Consult Troubleshooting for common issues

  4. Contributing: See Contributing to get involved

Tips for Success

Image Quality

  • Use high-resolution images (300+ DPI)

  • Ensure good lighting and contrast

  • Minimize blur and skew

Batch Processing

  • Process images in batches of 10-50 for optimal performance

  • Monitor memory usage with large datasets

  • Use Docker for consistent results across systems
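Splitting a large image collection into batches of this size takes only a few lines of Python. The following is a minimal sketch; the helper name `batches` and the default size of 25 are illustrative choices within the 10-50 range suggested above, not project settings.

```python
def batches(items, size=25):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

For instance, `list(batches(sorted(Path("data/SLI/input").glob("*.jpg"))))` (with `Path` imported from `pathlib`) groups the input images into lists of up to 25 files, each of which can be copied to its own input folder and processed separately.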

Result Validation

  • Always review high-confidence results manually

  • Check empty label classifications

  • Verify handwritten label identification
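For large runs, checking every file by hand is rarely practical, so a reproducible random sample from a class folder (for example `data/SLI/output/empty/` or `data/SLI/output/handwritten/`) is a reasonable starting point. This is a sketch with a hypothetical helper, `sample_for_review`; the sample size and seed are arbitrary defaults.

```python
import random
from pathlib import Path


def sample_for_review(folder, k=10, seed=0):
    """Return up to `k` randomly chosen files from `folder`, reproducibly."""
    files = sorted(item for item in Path(folder).iterdir() if item.is_file())
    rng = random.Random(seed)  # fixed seed gives the same sample on each run
    return rng.sample(files, min(k, len(files)))
```

Using a fixed seed means two reviewers looking at the same output folder get the same sample; change the seed to draw a different one.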

Performance Optimization

  • Use GPU acceleration when available (traditional pipelines)

  • Adjust batch sizes based on available memory

  • Consider the Gemini pipeline for best accuracy on both printed and handwritten labels