Entomological Label Information Extraction
AI-powered text extraction from insect specimen labels
Extract and digitize text from museum specimen labels automatically using computer vision and OCR. Perfect for museum digitization, research data preparation, and biodiversity informatics.
Note
New to the project? Start with the Quick Start Guide for a 5-minute setup!
Main Documentation
Key Features
What makes this special:
- Gemini Pipeline (recommended): Cloud-based detection, classification, OCR, and handwritten text recognition via the Google Gemini API
- Smart Detection: Automatically finds labels in specimen photos
- AI Classification: Distinguishes handwritten, printed, and empty labels
- Triple OCR Support: Gemini API (recommended), Tesseract (free/offline), or Google Vision
- Entity Recognition: Extracts structured entities (species, collectors, dates, localities) with GBIF validation and OSM geocoding
- Darwin Core Export: Outputs standardised Darwin Core records (JSON and CSV)
- Easy to Use: Streamlit web interface + command line + Docker options
- Museum Ready: Designed specifically for scientific specimens
- Open Source: MIT license, fully extensible
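To make the Darwin Core export concrete, here is a minimal sketch of how extracted entities could be mapped to Darwin Core terms and written as JSON and CSV. The Darwin Core term names (`scientificName`, `recordedBy`, `eventDate`, `locality`) are real terms from the standard. The input keys (`species`, `collector`, `date`, `locality`), the helper names, and the example record are assumptions for illustration, not this project's actual output schema or API.

```python
import csv
import io
import json

# Assumed entity keys (left) mapped to real Darwin Core terms (right).
# The keys on the left are hypothetical; check the project's output for the real ones.
ENTITY_TO_DWC = {
    "species": "scientificName",
    "collector": "recordedBy",
    "date": "eventDate",
    "locality": "locality",
}


def to_darwin_core(entities):
    """Rename known entity keys to Darwin Core terms, dropping empty values."""
    return {
        ENTITY_TO_DWC[key]: value
        for key, value in entities.items()
        if key in ENTITY_TO_DWC and value
    }


def records_to_json(records):
    """Serialise a list of Darwin Core records as a JSON string."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def records_to_csv(records):
    """Serialise records as CSV with one column per Darwin Core term."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(ENTITY_TO_DWC.values()),
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)  # missing terms are written as empty cells
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented example label content, for illustration only.
    example = {
        "species": "Apis mellifera",
        "collector": "J. Smith",
        "date": "1987-06-14",
        "locality": "Berlin",
    }
    records = [to_darwin_core(example)]
    print(records_to_json(records))
    print(records_to_csv(records))
```

Keeping the mapping in a single dictionary means that supporting another Darwin Core term only requires one new entry, and both output formats pick it up.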
Supported Workflows
- Option 1: Gemini Pipeline (Recommended)
Specimen photos or pre-cropped labels → Gemini detection + classification + OCR/HTR → Entity recognition → GBIF/OSM enrichment → Darwin Core export
- Option 2: Multi-Label Images (MLI)
Full specimen photos → Detect labels (Detectron2) → Crop → Classify → Tesseract OCR → Structured output
- Option 3: Single-Label Images (SLI)
Pre-cropped labels → Classify → OCR → Clean text → Structured output
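The three workflows differ mainly in which stages run and in what order. The sketch below models that idea as ordered lists of stage names with a small runner. It is an illustration of the stage sequences listed above, not the project's actual code: the stage identifiers, the `WORKFLOWS` registry, and `run_workflow` are all hypothetical names, and the stages are stubs that only record that they ran.

```python
from typing import Any, Callable, Dict, List

# Stage order taken from the workflow descriptions; identifiers are hypothetical.
WORKFLOWS: Dict[str, List[str]] = {
    "gemini": [
        "gemini_detect_classify_ocr",
        "entity_recognition",
        "gbif_osm_enrichment",
        "darwin_core_export",
    ],
    "mli": ["detect_labels", "crop", "classify", "tesseract_ocr", "structured_output"],
    "sli": ["classify", "ocr", "clean_text", "structured_output"],
}

Stage = Callable[[Any], Any]


def run_workflow(name: str, stages: Dict[str, Stage], data: Any) -> Any:
    """Feed `data` through each stage of the named workflow, in order."""
    for stage_name in WORKFLOWS[name]:
        data = stages[stage_name](data)
    return data


def make_tracing_stage(stage_name: str) -> Stage:
    """Build a stub stage that appends its own name to a trace list."""
    def stage(trace: List[str]) -> List[str]:
        return trace + [stage_name]
    return stage


# One stub per stage name used by any workflow.
stub_stages: Dict[str, Stage] = {
    stage_name: make_tracing_stage(stage_name)
    for stage_names in WORKFLOWS.values()
    for stage_name in stage_names
}

if __name__ == "__main__":
    print(run_workflow("sli", stub_stages, []))
    # ['classify', 'ocr', 'clean_text', 'structured_output']
```

In a real pipeline each stub would be replaced by a function that takes the previous stage's output (images, crops, text) and returns the next stage's input.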
Performance Stats
| Metric | Performance |
|---|---|
| Detection Accuracy | 90%+ F1-score |
| Classification Accuracy | 95%+ overall |
| OCR Character Error Rate | <5% on quality images |
| Processing Speed | 100+ images/hour |
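For readers who want to check these figures on their own data, the sketch below shows the standard definitions of two of the metrics: character error rate (edit distance divided by reference length) and F1-score (harmonic mean of precision and recall). It assumes the project uses these conventional definitions, which the table does not state. The function names and sample strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance between OCR output and ground truth / ground-truth length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


if __name__ == "__main__":
    # Invented example: one misread character ("l" read as "1") in 11 characters.
    print(character_error_rate("Berlin 1987", "Ber1in 1987"))
    # 90 correct detections, 10 spurious, 10 missed.
    print(f1_score(90, 10, 10))
```

As a point of scale, a throughput of 100 images per hour corresponds to an average of 36 seconds per image.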
Need Help?
Common starting points:
Installation issues? → Troubleshooting
Want to contribute? → Contributing
Need API docs? → API Reference
Detailed usage? → User Guide