Quick Start Guide
Get running in 5 minutes! Perfect for first-time users who want to test the system quickly.
Option 1: Gemini Pipeline (Recommended)
The simplest and most powerful way to get started; it handles both printed and handwritten labels:
# 1. Get the code
git clone https://github.com/MargotBelot/entomological-label-information-extraction.git
cd entomological-label-information-extraction
# 2. One-command setup
conda env create -f environment.yml
conda activate ELIE
pip install -e .
# 3. Set up your Gemini API key (free from https://aistudio.google.com/apikey)
export GEMINI_API_KEY="your-api-key"  # replace with your own key; keep the quotes
# 4. Add your images
cp /path/to/your/photos/*.jpg data/MLI/input/
# 5. Launch the interface
python launch.py
That's it! The Streamlit web interface will open, where you can:
Select the Gemini pipeline and configure options
Monitor real-time processing progress
Browse results and correct OCR text directly
View extracted entities (species, collectors, dates, localities)
Re-run entity recognition after corrections
Export Darwin Core records (JSON/CSV)
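If the interface starts but no job can run, the usual suspects are a missing API key or an empty input folder. The following is a minimal preflight sketch, not part of the project: the function name `preflight` and the set of accepted image extensions are assumptions made for illustration, while `GEMINI_API_KEY` and `data/MLI/input` come from the steps above.

```python
import os
from pathlib import Path

# Assumption: these extensions are illustrative; the pipeline may accept others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def preflight(input_dir, env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    folder = Path(input_dir)
    if not folder.is_dir():
        problems.append(f"{folder} does not exist")
    elif not any(p.suffix.lower() in IMAGE_SUFFIXES for p in folder.iterdir()):
        problems.append(f"{folder} contains no images")
    return problems
```

Calling `preflight("data/MLI/input")` from the repository root before `python launch.py` returns an empty list when both checks pass.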
Option 2: Traditional Pipelines (Offline)
For offline use with local models (Detectron2 + TensorFlow + Tesseract). Printed labels only.
# 1. Get the code
git clone https://github.com/MargotBelot/entomological-label-information-extraction.git
cd entomological-label-information-extraction
# 2. Setup environment
conda env create -f environment.yml
conda activate ELIE
pip install -e .
# 3. Install Tesseract OCR (required for traditional pipelines)
# macOS: brew install tesseract
# Linux: sudo apt install tesseract-ocr
# 4. Choose your interface:
# Streamlit (recommended)
python launch.py
# OR Desktop GUI (Tkinter-based)
python interfaces/launch_gui.py
# OR Manual pipeline scripts
./tools/pipelines/run_gemini_pipeline_conda.sh # Gemini (recommended)
./tools/pipelines/run_mli_pipeline_conda.sh # Multi-label (traditional)
./tools/pipelines/run_sli_pipeline_conda.sh # Single-label (traditional)
What Happens Next?
After processing, you'll find your results in the output folders:
Gemini Pipeline Output:
data/MLI/output/
├── entity_master.json          # All labels with entities, GBIF, OSM
├── consolidated_results.json   # Labels with OCR text and metadata
├── quality_report.json         # Extraction quality scores
├── darwin_core.json            # Darwin Core formatted records
├── darwin_core.csv             # Same in CSV format
└── input_cropped/              # Cropped label images
Traditional Pipeline Output:
data/MLI/output/
├── consolidated_results.json   # Complete summary
├── input_predictions.csv       # Label locations
└── input_cropped/              # Cropped label images
data/SLI/output/
├── consolidated_results.json   # Complete summary
├── corrected_transcripts.json  # Clean text results
└── classification/             # Sorted by label type
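To confirm that a Gemini run produced everything in its listing, a short script can compare the folder against the expected names. This is a sketch under stated assumptions: the helper `missing_outputs` is hypothetical, the names are copied from the Gemini output listing in this guide, and the Darwin Core files may legitimately be absent if those exports were switched off.

```python
from pathlib import Path

# Names copied from the Gemini output listing in this guide.
EXPECTED_GEMINI_OUTPUTS = [
    "entity_master.json",
    "consolidated_results.json",
    "quality_report.json",
    "darwin_core.json",
    "darwin_core.csv",
    "input_cropped",
]


def missing_outputs(output_dir, expected=EXPECTED_GEMINI_OUTPUTS):
    """Return the expected files or folders that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).exists()]
```

For example, `missing_outputs("data/MLI/output")` returns an empty list when the run is complete.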
Quick Results Check
Open consolidated_results.json to see all your extracted text and confidence scores!
# Preview your results
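head -n 20 data/MLI/output/consolidated_results.json   # Gemini or multi-label pipeline
head -n 20 data/SLI/output/consolidated_results.json   # single-label pipeline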
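The first lines of a large JSON file are not always informative, so a structural summary can be more useful. The sketch below makes no assumption about the schema of consolidated_results.json: it only reports whether the top level is an object or an array, how many entries it has, and a short preview. The function name `summarize_json` is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path, preview=3):
    """Describe the top-level shape of a JSON file without assuming its schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # For objects, preview the first few keys.
        return {"type": "object", "size": len(data), "preview": list(data)[:preview]}
    if isinstance(data, list):
        # For arrays, preview the first few items.
        return {"type": "array", "size": len(data), "preview": data[:preview]}
    return {"type": type(data).__name__, "size": 1, "preview": [data]}
```

Running `summarize_json("data/SLI/output/consolidated_results.json")` shows which keys or records to look at next.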
Need Help?
Weird results? → Check Troubleshooting
Ready for production? → Read the full User Guide
Want to contribute? → See Contributing
Found a bug? → Report it on GitHub Issues
Understanding Pipeline Types
Multi-Label Images (MLI)
Use when: You have full specimen photos containing multiple labels
# Place images here
data/MLI/input/specimen_001.jpg
data/MLI/input/specimen_002.jpg
What happens:
1. The system detects individual labels in each image
2. Crops each detected label
3. Saves the cropped labels for further processing
4. Generates detection results
Output: Detected labels and bounding box coordinates
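The detection results land in input_predictions.csv, which can be tallied to see how many labels were found per specimen photo. The sketch below is illustrative only: the column layout of that file is not documented in this guide, so the name of the column holding the source image is passed in as a parameter and must be read from the CSV header first. The function name `labels_per_image` is hypothetical.

```python
import csv
from collections import Counter


def labels_per_image(csv_path, image_column):
    """Count detection rows per source image.

    image_column is an assumption supplied by the caller: check the header
    of the predictions CSV to find the column that names the source image.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if image_column not in (reader.fieldnames or []):
            raise KeyError(f"{image_column!r} not in columns {reader.fieldnames}")
        return Counter(row[image_column] for row in reader)
```

A specimen that appears with a count far below the others is a good candidate for a manual look at its cropped labels.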
Single-Label Images (SLI)
Use when: You have pre-cropped individual label images
# Place images here
data/SLI/input/label_001.jpg
data/SLI/input/label_002.jpg
What happens:
1. Classifies each label (empty/handwritten/printed/identifier)
2. Corrects rotation if needed
3. Extracts text using OCR
4. Post-processes and structures results
Output: Structured text data with metadata
Basic Usage Examples
Streamlit Interface (Recommended)
# Quick launch
python launch.py
# OR launch Streamlit directly
streamlit run interfaces/launch_streamlit.py
The Streamlit interface provides:
Interactive web-based UI with pipeline selection (Gemini, MLI, SLI)
Real-time progress tracking with job duration display
Live processing dashboard with system metrics
Results browser with file preview
OCR text correction and entity recognition re-run
Darwin Core export
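Once a Darwin Core CSV has been exported, a quick way to judge it is to see how often each column is actually filled in. The following sketch assumes nothing about which Darwin Core terms the export contains; it reads whatever header the file has. The function name `column_fill_rates` is hypothetical.

```python
import csv


def column_fill_rates(csv_path):
    """Return, for every column, the fraction of rows with a non-empty value."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fields = reader.fieldnames or []
        filled = dict.fromkeys(fields, 0)
        total = 0
        for row in reader:
            total += 1
            for field in fields:
                # Short rows yield None for missing cells, so guard with "or".
                if (row.get(field) or "").strip():
                    filled[field] += 1
    return {field: (filled[field] / total if total else 0.0) for field in fields}
```

Columns with a low fill rate point at entity types that the labels rarely state or that extraction tends to miss.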
Command Line Method
Gemini Pipeline (Recommended):
# Full Gemini pipeline: detection, classification, OCR, entities
./tools/pipelines/run_gemini_pipeline_conda.sh
# With custom options
INPUT_DIR=data/MLI/input OUTPUT_DIR=data/MLI/output \
ENTITY_RECOGNITION=true EXPORT_DWC=true EXPORT_CSV=true \
./tools/pipelines/run_gemini_pipeline_conda.sh
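The same environment-variable configuration can be driven from Python, which helps when several folders need to be processed in a loop. This is a sketch, not the project's API: the helper names `gemini_env` and `run_gemini` are hypothetical, and passing "false" to turn an option off is an assumption, since the example above only shows "true".

```python
import os
import subprocess

SCRIPT = "./tools/pipelines/run_gemini_pipeline_conda.sh"


def gemini_env(input_dir, output_dir, entity_recognition=True,
               export_dwc=True, export_csv=True, base=None):
    """Build the environment for one pipeline run, on top of the current one."""
    env = dict(os.environ if base is None else base)
    env.update({
        "INPUT_DIR": str(input_dir),
        "OUTPUT_DIR": str(output_dir),
        # Assumption: the script reads lowercase "true"/"false" strings.
        "ENTITY_RECOGNITION": str(entity_recognition).lower(),
        "EXPORT_DWC": str(export_dwc).lower(),
        "EXPORT_CSV": str(export_csv).lower(),
    })
    return env


def run_gemini(input_dir, output_dir, **options):
    """Run the pipeline script; raises CalledProcessError on a non-zero exit."""
    env = gemini_env(input_dir, output_dir, **options)
    return subprocess.run([SCRIPT], env=env, check=True)
```

`run_gemini("data/MLI/input", "data/MLI/output")` must be called from the repository root with the ELIE environment active, because the script path is relative.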
Traditional Multi-Label Processing:
# Run detection on multi-label images
python scripts/processing/detection.py -j data/MLI/input -o data/MLI/output
Traditional Single-Label Processing:
# Run SLI components sequentially
python scripts/processing/analysis.py -i data/SLI/input -o data/SLI/output # empty label filtering
python scripts/processing/classifiers.py -m 1 -j data/SLI/input -o data/SLI/output # identifier/not_identifier
python scripts/processing/classifiers.py -m 2 -j data/SLI/input -o data/SLI/output # handwritten/printed
python scripts/processing/rotation.py -i data/SLI/output/printed -o data/SLI/output/printed/rotated
# OCR (choose one)
python scripts/processing/tesseract.py -d data/SLI/output/printed/rotated -o data/SLI/output
python scripts/processing/vision.py -c credentials.json -d data/SLI/output/printed/rotated -o data/SLI/output
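When the SLI components are run by hand, it is easy to continue after a failed step and end up with partial output. The sketch below chains the commands shown above and stops at the first failure. It mirrors the Tesseract variant only; the helper names `sli_commands` and `run_all` are hypothetical, and using `sys.executable` assumes the script is started from the activated ELIE environment.

```python
import subprocess
import sys


def sli_commands(input_dir="data/SLI/input", output_dir="data/SLI/output"):
    """Return the SLI steps from this guide, in order, as argument lists."""
    printed = f"{output_dir}/printed"
    rotated = f"{printed}/rotated"
    py = sys.executable  # the interpreter of the active environment
    return [
        [py, "scripts/processing/analysis.py", "-i", input_dir, "-o", output_dir],
        [py, "scripts/processing/classifiers.py", "-m", "1", "-j", input_dir, "-o", output_dir],
        [py, "scripts/processing/classifiers.py", "-m", "2", "-j", input_dir, "-o", output_dir],
        [py, "scripts/processing/rotation.py", "-i", printed, "-o", rotated],
        [py, "scripts/processing/tesseract.py", "-d", rotated, "-o", output_dir],
    ]


def run_all(commands):
    """Run each command in turn; check=True aborts on the first non-zero exit."""
    for command in commands:
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_all(sli_commands())` from the repository root reproduces the manual sequence; swapping the last entry for the vision.py command gives the Google Vision variant.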
Pipeline Scripts
# Gemini pipeline (recommended)
./tools/pipelines/run_gemini_pipeline_conda.sh
# Multi-label pipeline (traditional, conda-based)
./tools/pipelines/run_mli_pipeline_conda.sh
# Single-label pipeline (traditional, conda-based)
./tools/pipelines/run_sli_pipeline_conda.sh
Understanding Results
Multi-Label Results
After MLI processing, you'll find:
data/MLI/output/
├── input_predictions.csv       # Detection results
├── input_cropped/              # Cropped label images
│   ├── specimen_001_label_0.jpg
│   ├── specimen_001_label_1.jpg
│   └── ...
└── consolidated_results.json   # Summary report
Single-Label Results
After SLI processing, you'll find:
data/SLI/output/
├── empty/                      # Empty labels
├── handwritten/                # Manual transcription needed
├── printed/                    # OCR processing
│   └── rotated/                # Rotation-corrected labels
├── identifier/                 # QR codes, barcodes
├── ocr_preprocessed.json       # Tesseract results
├── ocr_google_vision.json      # Google Vision results
├── corrected_transcripts.json  # Cleaned text
├── plausible_transcripts.json  # High-confidence text
└── consolidated_results.json   # Final structured output
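A per-category count gives a fast sanity check on the classification step, for instance to spot a run where nearly everything ended up in empty/. The sketch below counts image files directly inside each category folder from the listing above; the function name `count_by_category` and the extension list are assumptions for illustration.

```python
from pathlib import Path

# Folder names taken from the SLI output listing in this guide.
CATEGORIES = ("empty", "handwritten", "printed", "identifier")


def count_by_category(output_dir, suffixes=(".jpg", ".jpeg", ".png")):
    """Count image files directly inside each category folder.

    The count is not recursive, so printed/rotated/ is not counted twice.
    """
    root = Path(output_dir)
    counts = {}
    for name in CATEGORIES:
        folder = root / name
        if folder.is_dir():
            counts[name] = sum(
                1 for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() in suffixes
            )
        else:
            counts[name] = 0
    return counts
```

For example, `count_by_category("data/SLI/output")` returns a dictionary with one count per category.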
Key Output Files
- consolidated_results.json
Complete results with all extracted text, confidence scores, and metadata
- corrected_transcripts.json
Post-processed text with corrections and standardizations
- plausible_transcripts.json
High-confidence extractions suitable for automated processing
Common Workflows
Museum Digitization
# 1. Photograph specimens (multi-label images)
# 2. Process with MLI pipeline
python scripts/processing/detection.py -j photos/ -o detections/
# 3. Move cropped labels to SLI input
mv detections/input_cropped/* data/SLI/input/
# 4. Process individual labels
python scripts/processing/analysis.py -i data/SLI/input -o data/SLI/output
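A plain mv in step 3 silently overwrites files that share a name with labels already in the SLI input folder. As a hedged alternative, the sketch below moves the cropped labels but skips any name that already exists at the destination and reports it. The function name `move_cropped` is hypothetical; the folder names are the ones used in this workflow.

```python
import shutil
from pathlib import Path


def move_cropped(src_dir, dst_dir):
    """Move files from src_dir to dst_dir without overwriting existing files.

    Returns two lists of file names: those moved and those skipped.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved, skipped = [], []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        target = dst / path.name
        if target.exists():
            skipped.append(path.name)  # keep the existing file untouched
            continue
        shutil.move(str(path), str(target))
        moved.append(path.name)
    return moved, skipped
```

`move_cropped("detections/input_cropped", "data/SLI/input")` then takes the place of the mv command, and a non-empty skipped list signals a naming collision to resolve by hand.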
Research Data Preparation
# 1. Process pre-cropped labels directly
python scripts/processing/analysis.py -i research_labels/ -o results/
# 2. Extract high-confidence text
cat results/plausible_transcripts.json
# 3. Run evaluation metrics
python scripts/evaluation/ocr_eval.py -i results/
Quality Assessment
# Generate comprehensive evaluation report
python scripts/evaluation/analysis_eval.py -i data/SLI/output/
# Check clustering analysis
python scripts/evaluation/cluster_eval.py -i data/SLI/output/
# Evaluate classification accuracy
python scripts/evaluation/classifiers_eval.py -i data/SLI/output/
Next Steps
Now that you have the basics working:
User Guide: Read the User Guide for end-to-end instructions
API Documentation: Browse API Reference for programmatic usage
Troubleshooting: Consult Troubleshooting for common issues
Contributing: See Contributing to get involved
Tips for Success
Image Quality
Use high-resolution images (300+ DPI)
Ensure good lighting and contrast
Minimize blur and skew
Batch Processing
Process images in batches of 10-50 for optimal performance
Monitor memory usage with large datasets
Use Docker for consistent results across systems
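Splitting a large image set into batches of the suggested size is simple to script. The sketch below only does the splitting; how each batch is then fed to a pipeline (for example by copying it into the input folder) is left open. The function name `batches` and the default size of 25 are illustrative choices within the 10-50 range mentioned above.

```python
def batches(items, size=25):
    """Split a sequence into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]
```

With `sorted(Path("photos").glob("*.jpg"))` as input, each returned chunk is a list of paths that can be processed and checked before moving on to the next.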
Result Validation
Spot-check results manually before relying on them, including high-confidence ones
Check empty label classifications
Verify handwritten label identification
Performance Optimization
Use GPU acceleration when available (traditional pipelines)
Adjust batch sizes based on available memory
Consider the Gemini pipeline for best accuracy on both printed and handwritten labels