Entomological Label Information Extraction

Processing Pipeline Overview

AI-powered text extraction from insect specimen labels 🦋

Extract and digitize text from museum specimen labels automatically using computer vision and OCR. Perfect for museum digitization, research data preparation, and biodiversity informatics.

Note

💡 New to the project? Start with the Quick Start Guide for a 5-minute setup!

Quick Navigation

5-minute setup guide

🚀 Get Started

Install, configure, and run your first processing job.

Quick Start Guide

Complete documentation

📖 User Guide

Learn all features and workflows in detail.

User Guide

Setup instructions

βš™οΈ Installation

Step-by-step installation for all platforms.

Installation

Technical docs

🔧 API Reference

Complete API documentation for developers.

API Reference

Main Documentation

Key Features

✨ What makes this special:

  • Gemini Pipeline (recommended): Cloud-based detection, classification, OCR, and handwritten text recognition (HTR) via the Google Gemini API

  • Smart Detection: Automatically finds labels in specimen photos

  • AI Classification: Distinguishes handwritten, printed, and empty labels

  • Triple OCR Support: Gemini API (recommended), Tesseract (free/offline), or Google Vision

  • Entity Recognition: Extracts structured entities (species, collectors, dates, localities) with GBIF validation and OSM geocoding

  • Darwin Core Export: Outputs standardised Darwin Core records (JSON and CSV)

  • Easy to Use: Streamlit web interface + command line + Docker options

  • Museum Ready: Designed specifically for scientific specimens

  • Open Source: MIT license, fully extensible
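As a rough illustration of what the Darwin Core export listed above might produce, the sketch below writes one record to JSON and CSV using only the Python standard library. The field names are standard Darwin Core terms, but which terms the project actually emits is an assumption, and the helper names `to_json` and `to_csv` and the sample values are hypothetical rather than part of this project's API.

```python
import csv
import io
import json

# Standard Darwin Core terms; the exact set the project exports is an assumption.
DWC_FIELDS = [
    "scientificName",
    "recordedBy",
    "eventDate",
    "locality",
    "decimalLatitude",
    "decimalLongitude",
]


def to_json(records):
    """Serialise a list of record dicts as a JSON array."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialise records as CSV with one column per Darwin Core term."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=DWC_FIELDS,
        extrasaction="ignore",  # silently skip keys that are not listed terms
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up example record, for illustration only.
record = {
    "scientificName": "Apis mellifera",
    "recordedBy": "J. Smith",
    "eventDate": "1987-06-14",
    "locality": "Berlin, Germany",
    "decimalLatitude": 52.52,
    "decimalLongitude": 13.405,
}

json_text = to_json([record])
csv_text = to_csv([record])
```

Keeping the column order in a single list means the JSON and CSV outputs stay aligned if terms are added later.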

Supported Workflows

Option 1: Gemini Pipeline (Recommended) 🚀

Specimen photos or pre-cropped labels → Gemini detection + classification + OCR/HTR → Entity recognition → GBIF/OSM enrichment → Darwin Core export

Option 2: Multi-Label Images (MLI) 📷

Full specimen photos → Detect labels (Detectron2) → Crop → Classify → Tesseract OCR → Structured output

Option 3: Single-Label Images (SLI) 🏷️

Pre-cropped labels → Classify → OCR → Clean text → Structured output
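The "Clean text" and "Structured output" stages of the single-label workflow can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `clean_ocr_text` and `process_label` and the output dictionary layout are hypothetical, and the classifier and OCR engine are assumed to have run already, so their results are passed in as plain strings.

```python
import re
import unicodedata


def clean_ocr_text(raw):
    """Normalise raw OCR output: unify Unicode forms, drop control
    characters, collapse runs of whitespace, and remove blank lines."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep newlines and tabs (handled as whitespace below); drop other
    # characters in the Unicode "C" (control/format) categories.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def process_label(label_class, raw_text):
    """Build a structured record for one pre-cropped label.

    label_class is assumed to be one of "handwritten", "printed", or "empty".
    """
    if label_class == "empty":
        # Nothing to clean on an empty label.
        return {"class": "empty", "text": ""}
    return {"class": label_class, "text": clean_ocr_text(raw_text)}


result = process_label("printed", "  Apis   mellifera \n\n\x0cLeg.  J. Smith\t1987 ")
```

Here `result["text"]` holds two tidy lines, with the stray form feed and the extra spaces removed.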

Performance Stats

| Metric | Performance |
|---|---|
| Detection Accuracy | 90%+ F1-score |
| Classification Accuracy | 95%+ overall |
| OCR Character Error Rate | <5% on quality images |
| Processing Speed | 100+ images/hour |
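For readers unfamiliar with the OCR metric above, character error rate (CER) is commonly defined as the edit distance between the OCR output and a reference transcription, divided by the length of the reference. The sketch below follows that common definition; how this project computes its reported figure is not stated here, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings, counting insertions,
    deletions, and substitutions as one edit each."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One substituted character ("l" read as "1") in a 14-character reference.
cer = character_error_rate("Apis mellifera", "Apis mel1ifera")
```

In this example a single misread character in a 14-character label gives a CER of 1/14, about 7%, which shows how quickly short labels can exceed a 5% threshold.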

Need Help?

🆘 Common starting points: