Changelog
All notable changes to the Entomological Label Information Extraction project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[2.0.0] - 2025-03-13
Added
Gemini Pipeline: New recommended pipeline using Google Gemini API for detection, classification, OCR, and handwritten text recognition (HTR)
Entity Recognition: Automated extraction of structured entities (scientific names, collectors, dates, localities) from OCR text
GBIF Validation: Scientific name validation against the GBIF Backbone Taxonomy
OSM Geocoding: Locality geocoding via OpenStreetMap Nominatim
Darwin Core Export: Standardised Darwin Core records in JSON and CSV formats
Streamlit OCR Correction: Edit transcribed text directly in the browser and re-run entity recognition
Streamlit Entity Viewer: Browse extracted entities per label with GBIF/OSM enrichment
Docker Gemini Profile: Lightweight docker-compose --profile gemini for cloud-based processing
Gemini Requirements: Minimal pipelines/requirements/gemini.txt for the Gemini Docker stage
Gemini Pipeline Script: tools/pipelines/run_gemini_pipeline_conda.sh for command-line execution
New CLI scripts: scripts/processing/gemini_classify.py, scripts/processing/gemini_ocr.py, scripts/processing/entity_recognition.py
New modules: label_processing/gemini_processor.py, label_processing/entity_recognition.py
Unit tests for gemini_processor.py and entity_recognition.py
Comprehensive Sphinx documentation with Read the Docs integration
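The enrichment and export entries above can be illustrated with a minimal sketch. It is not the project's implementation: the entity keys, the ENTITY_TO_DWC mapping, and all function names are assumptions made for illustration. Only the GBIF species match endpoint, the Nominatim search endpoint, and the Darwin Core term names are taken from the public services and standard. The helpers build request URLs without sending anything, then serialise records to JSON and CSV.

```python
import csv
import io
import json
from urllib.parse import urlencode

# Public endpoints of the two services named in the entries above.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"
NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"

# Hypothetical mapping from extracted-entity keys to Darwin Core terms.
ENTITY_TO_DWC = {
    "scientific_name": "scientificName",
    "collector": "recordedBy",
    "date": "eventDate",
    "locality": "locality",
}


def gbif_match_url(name: str) -> str:
    """Build a GBIF Backbone name-match request URL (no request is sent)."""
    return f"{GBIF_MATCH_URL}?{urlencode({'name': name})}"


def nominatim_search_url(locality: str) -> str:
    """Build a Nominatim free-text search URL asking for one JSON result."""
    params = {"q": locality, "format": "json", "limit": 1}
    return f"{NOMINATIM_SEARCH_URL}?{urlencode(params)}"


def to_darwin_core(entities: dict) -> dict:
    """Rename known entity keys to Darwin Core terms, dropping empty values."""
    return {
        dwc_term: entities[key]
        for key, dwc_term in ENTITY_TO_DWC.items()
        if entities.get(key)
    }


def records_to_json(records: list[dict]) -> str:
    """Serialise records as a JSON array, keeping non-ASCII characters."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def records_to_csv(records: list[dict]) -> str:
    """Serialise records as CSV with one column per mapped Darwin Core term."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(ENTITY_TO_DWC.values()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # missing terms are written as empty cells
    return buffer.getvalue()
```

When sending the Nominatim request for real, its usage policy asks clients to identify themselves with a User-Agent header and to keep request rates low.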
Changed
Conda environment renamed from entomological-label to ELIE
Removed all hardcoded API keys from scripts
Warning suppressions centralised in label_processing/__init__.py
Improved Streamlit interface with pipeline selection, progress tracking, and export features
Updated documentation structure for v2.0
Version bumped to 2.0.0 in pyproject.toml and Sphinx conf.py
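With hardcoded keys removed, scripts need another source for credentials. One common approach is an environment variable, sketched below. The variable name GEMINI_API_KEY and the helper load_api_key are assumptions for illustration; the project may read its key differently.

```python
import os

# Hypothetical variable name; the project may use a different one.
API_KEY_ENV_VAR = "GEMINI_API_KEY"


def load_api_key(env_var: str = API_KEY_ENV_VAR) -> str:
    """Return the API key from the environment, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the pipeline."
        )
    return key
```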
Fixed
Fixed check_text → check_nuri_format call in ocr_vision.py
Fixed broken imports in test modules
Fixed redundant exception handling in utils.py
Updated stale docstrings in text_recognition.py
[1.0.0] - 2024-01-01
Added
Initial release of the Entomological Label Information Extraction system
Multi-label image processing pipeline (MLI)
Single-label image processing pipeline (SLI)
GUI interface for easy processing
Docker containerization support
Comprehensive evaluation framework
Support for Tesseract and Google Vision OCR
Label detection using Faster R-CNN
Label classification (empty/handwritten/printed/identifier)
Rotation correction for printed labels
Text post-processing and cleaning
Batch processing capabilities
Configuration management system
Extensive logging and monitoring
Core Features
Label Detection: Faster R-CNN model for detecting labels in specimen images
Classification: CNN-based classifier for label types
OCR Integration: Support for multiple OCR engines
Post-processing: Text cleaning and structuring
Evaluation: Comprehensive metrics and analysis tools
Docker Support: Containerized processing pipelines
GUI Interface: User-friendly graphical interface
Performance
Detection accuracy: >90% F1-score on benchmark datasets
Classification accuracy: >95% on label type classification
OCR performance: <5% character error rate on high-quality images
Processing speed: 100+ images per hour on standard hardware
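For readers unfamiliar with the two metrics quoted above, the following sketch shows their standard definitions: F1 from true positive, false positive, and false negative counts, and character error rate as Levenshtein distance divided by reference length. The function names are illustrative, and the project's evaluation framework may compute or aggregate these differently.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance computed row by row with O(len(hypothesis)) memory."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_char != hyp_char),  # substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```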
Documentation
Complete API documentation
User guides and tutorials
Installation instructions
Configuration examples
Evaluation methodologies
Troubleshooting guides
Development
Full test coverage for core functionality
Continuous integration setup
Code quality tools (Black, isort, flake8)
Pre-commit hooks for code consistency
Development environment setup
Security
Secure handling of API keys and credentials
Input validation and sanitization
Error handling and logging
Compatibility
Python: 3.10, 3.11, 3.12
Operating Systems: Windows 10+, macOS 10.14+, Linux (Ubuntu 18.04+)
Hardware: CPU and GPU processing support
Docker: Multi-platform container support
Known Issues
High memory usage with very large images (>50MP)
Google Vision API rate limiting may affect batch processing
Some European characters may be misrecognized in Tesseract OCR
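A common way to soften the rate-limiting issue in batch runs is to retry throttled calls with exponential backoff. The sketch below is a generic pattern, not something the project is known to ship. RateLimitError is a hypothetical stand-in for whatever exception the OCR client raises on throttling, and call_with_backoff is an illustrative name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the error an OCR client raises when throttled."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, waiting base_delay * 2**attempt seconds after each throttled try."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2**attempt)
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so the retry schedule can be checked in tests without actually waiting.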
Future Releases
Planned Features
Version 2.1.0
Enhanced multi-language support for entity recognition
Batch Gemini API processing with rate-limit management
Advanced clustering analysis integration (ELIE-clustering)
Extended format support (TIFF, WebP)
RESTful API for remote processing
Contributing
We welcome contributions! Please see our Contributing guide for details on:
Code style and standards
Testing requirements
Documentation guidelines
Pull request process
Community guidelines
License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgments
Special thanks to:
Contributors and maintainers
Beta testers and early adopters
Museum partners providing test data
Open source community for tools and libraries
Research institutions supporting development