Advanced Configuration
This document covers advanced configuration options for the Entomological Label Information Extraction pipeline.
Environment Variables
The pipeline supports several environment variables to customize paths and behavior:
Project Path Overrides
# Override the project root directory
export ENTOMOLOGICAL_PROJECT_ROOT="/path/to/project"
# Override the models directory location
export ENTOMOLOGICAL_MODELS_DIR="/path/to/models"
# Override the detection model path specifically
export ENTOMOLOGICAL_DETECTION_MODEL_PATH="/path/to/custom/detection_model.pth"
Use Cases:
Running the pipeline from a non-standard location
Sharing models across multiple project instances
Using models stored on network drives or external storage
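As a rough illustration of how such overrides are usually resolved, the sketch below lets each environment variable take precedence over a default derived from the project root. The function name `resolve_paths` and the default file name `detection_model.pth` are assumptions made for this example; the pipeline's own configuration module may resolve paths differently.

```python
import os
from pathlib import Path


def resolve_paths(default_root: Path) -> dict:
    """Resolve project paths, letting environment variables win over defaults."""
    env = os.environ
    root = Path(env.get("ENTOMOLOGICAL_PROJECT_ROOT", default_root))
    models_dir = Path(env.get("ENTOMOLOGICAL_MODELS_DIR", root / "models"))
    # Hypothetical default file name; the real default may differ.
    detection_model = Path(
        env.get(
            "ENTOMOLOGICAL_DETECTION_MODEL_PATH",
            models_dir / "detection_model.pth",
        )
    )
    return {
        "root": root,
        "models_dir": models_dir,
        "detection_model": detection_model,
    }
```

Calling `resolve_paths(Path.cwd())` returns the three resolved locations. In this sketch the more specific variable wins: setting only `ENTOMOLOGICAL_MODELS_DIR` moves the detection model's default location with it, while `ENTOMOLOGICAL_DETECTION_MODEL_PATH` overrides that single file.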
Pipeline Path Overrides
# Override input/output directories for pipelines
export INPUT_DIR="/path/to/input"
export OUTPUT_DIR="/path/to/output"
Use Cases:
Processing images from external drives
Writing outputs to specific locations
Batch processing with custom directory structures
Model Caching
Detection Model Caching
The detection script (scripts/processing/detection.py) caches the loaded model so that repeated runs start faster.
How it works:
First load: Model is loaded from disk (~10-30 seconds)
Cache created: Model state is saved to ~/.entomological_cache/
Subsequent loads: Model loads from cache (~2-5 seconds)
Cache location:
~/.entomological_cache/
└── model_<hash>.pkl
Cache validation:
Automatically detects model file changes
Uses MD5 hash of model file for validation
Invalidates cache if model is updated
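The validation steps above can be sketched as follows. This is a minimal illustration, not the code in detection.py: it assumes the `<hash>` in the cache file name is the MD5 digest of the model file, and the helper names `file_md5`, `cache_path_for`, and `has_valid_cache` are hypothetical.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".entomological_cache"


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model files are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(model_path: Path, cache_dir: Path = CACHE_DIR) -> Path:
    """Derive the cache file name from the model file's content hash (assumed scheme)."""
    return cache_dir / f"model_{file_md5(model_path)}.pkl"


def has_valid_cache(model_path: Path, cache_dir: Path = CACHE_DIR) -> bool:
    """A cache entry counts as valid only if one exists for the current file content."""
    return cache_path_for(model_path, cache_dir).exists()
```

Because the key depends on the file's content, replacing the model file changes the expected cache name, so a stale entry is simply never looked up again. Old entries remain on disk until the cache directory is cleared.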
Disable caching:
# In detection.py, modify:
predictor = OptimizedPredictLabel(
    path_to_model=model_path,
    classes=["label"],
    threshold=THRESHOLD,
    use_cache=False,  # Disable caching
)
Clear cache manually:
rm -rf ~/.entomological_cache/
Rotation Model Configuration
Model Search Paths
The rotation script searches for models in the following order:
models/rotation_model.h5 (primary)
models/label_rotation_model.h5 (alternative)
models/rotation_classifier.h5 (alternative)
Missing Model Handling
If no rotation model is found, the pipeline will:
Print an error message listing the searched paths
Suggest downloading or placing the model
Exit with error code 1
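The search order and the missing-model behavior can be sketched together as below. The candidate paths come from the list in this section; the function names and the wording of the messages are assumptions for illustration, not the rotation script's actual code.

```python
import sys
from pathlib import Path

# Search order as documented: primary first, then the alternatives.
CANDIDATES = (
    "models/rotation_model.h5",
    "models/label_rotation_model.h5",
    "models/rotation_classifier.h5",
)


def find_rotation_model(project_root: Path, candidates=CANDIDATES):
    """Return the first candidate that exists under project_root, or None."""
    for relative in candidates:
        path = project_root / relative
        if path.is_file():
            return path
    return None


def require_rotation_model(project_root: Path) -> Path:
    """Return the model path, or report the searched paths and exit with code 1."""
    model = find_rotation_model(project_root)
    if model is None:
        print("Error: no rotation model found. Searched:", file=sys.stderr)
        for relative in CANDIDATES:
            print(f"  {project_root / relative}", file=sys.stderr)
        print("Download a model or place one in models/.", file=sys.stderr)
        sys.exit(1)
    return model
```

A script would call `require_rotation_model(Path.cwd())` once at startup, so a missing model fails early rather than partway through a batch.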
Downloading the rotation model:
# Download from your model repository
wget https://your-repo.com/models/rotation_model.h5 -O models/rotation_model.h5
# Or train your own rotation model
# See: docs/MODEL_TRAINING.md
Conda Environment Customization
Using Custom Environment Name
If you want to use a different conda environment name:
Edit environment.yml:
name: your-custom-name # Change this line
Update pipeline scripts:
# In tools/pipelines/run_mli_pipeline_conda.sh
# and tools/pipelines/run_sli_pipeline_conda.sh
conda activate your-custom-name
Create environment:
conda env create -f environment.yml
Docker Configuration
Custom Port Mapping
By default, Docker containers use standard ports. To customize:
# In docker-compose.yml, add port mappings:
services:
  segmentation:
    ports:
      - "8080:8080" # host:container
Custom Volume Mounts
Mount additional directories:
services:
  segmentation:
    volumes:
      - ${PWD}/data:/app/data
      - /external/storage:/app/external # Additional mount
Memory and CPU Limits
Adjust resource limits based on your hardware:
deploy:
  resources:
    limits:
      memory: 8G # Increase for larger images
      cpus: '6.0' # Increase for faster processing
    reservations:
      memory: 4G
      cpus: '2.0'
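HPC/Apptainer Configuration
Environment Variables in HPC
When using Apptainer/Singularity on HPC, prefix a variable with APPTAINERENV_ to pass it into the container, where it appears without the prefix: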
# In your SLURM script:
export APPTAINERENV_ENTOMOLOGICAL_PROJECT_ROOT=/scratch/username/project
export APPTAINERENV_ENTOMOLOGICAL_MODELS_DIR=/scratch/shared/models
apptainer run --bind /scratch/data:/app/data elie.sif mli
Parallel Processing
For HPC batch processing:
# Process multiple images in parallel across nodes
srun -n 10 --cpus-per-task=4 apptainer run elie.sif mli
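Note that `srun -n 10` starts ten copies of the same command. The source does not say whether the pipeline splits the input among them, so the sketch below shows one common way a task could pick its own share using the `SLURM_PROCID` and `SLURM_NTASKS` variables that Slurm sets for each task. The function names and the list of image extensions are assumptions for illustration.

```python
import os
from pathlib import Path

# Assumed set of image extensions; adjust to the formats actually used.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def shard(items, rank: int, world_size: int):
    """Round-robin split: task `rank` takes every `world_size`-th item."""
    return items[rank::world_size]


def images_for_this_task(input_dir: Path):
    """Select the images this Slurm task should process."""
    # Outside Slurm the defaults make a single task process everything.
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    world_size = int(os.environ.get("SLURM_NTASKS", "1"))
    # Sorting gives every task the same ordering, so the shards do not overlap.
    images = sorted(
        p for p in input_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    return shard(images, rank, world_size)
```

With ten tasks, task 0 would take images 0, 10, 20, and so on, while task 1 takes images 1, 11, 21. Each image is processed exactly once as long as all tasks see the same directory contents.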
Tesseract Configuration
Custom Tesseract Path
If Tesseract is installed in a non-standard location:
export TESSERACT_CMD="/custom/path/to/tesseract"
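A script could honor this variable roughly as sketched below. The helper name `resolve_tesseract_cmd` and the fallback to a `PATH` lookup are assumptions for this example; how tesseract.py actually reads `TESSERACT_CMD` is not described here.

```python
import os
import shutil


def resolve_tesseract_cmd(default: str = "tesseract") -> str:
    """Prefer TESSERACT_CMD, then a binary found on PATH, then the bare name."""
    override = os.environ.get("TESSERACT_CMD")
    if override:
        return override
    # shutil.which returns None when the binary is not on PATH.
    return shutil.which(default) or default
```

If the OCR script uses pytesseract (an assumption), the returned value could be assigned to `pytesseract.pytesseract.tesseract_cmd` before any OCR call.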
Language Packs
Install additional language packs for OCR:
# Ubuntu/Debian
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-deu # German
# macOS
brew install tesseract-lang
Then specify the languages in the OCR command, joined with "+":
python scripts/processing/tesseract.py -d input -o output -l eng+fra
Performance Tuning
Multiprocessing
Enable parallel OCR processing:
python scripts/processing/tesseract.py \
    -d input \
    -o output \
    -multi # Enable multiprocessing
Batch Size for Detection
Adjust detection batch size based on available memory:
python scripts/processing/detection.py \
    -j input \
    -o output \
    --batch-size 4 # Reduce if out of memory
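To see why a smaller batch size lowers memory use, the sketch below splits a list of inputs into fixed-size batches. Only one batch needs to be held in memory at a time. The helper `batches` is hypothetical and only illustrates the idea; detection.py may batch its inputs differently.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Ten inputs with a batch size of 4 give batches of 4, 4, and 2 items.
sizes = [len(batch) for batch in batches(list(range(10)), 4)]
print(sizes)  # [4, 4, 2]
```

Halving the batch size roughly halves the number of images resident in memory per step, at the cost of more steps.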
GPU Configuration
If you have a GPU available:
# In detection.py or other PyTorch scripts
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Troubleshooting
Cache Issues
If you experience issues with cached models:
# Clear all caches
rm -rf ~/.entomological_cache/
find . -type d -name "__pycache__" -exec rm -rf {} +
find . -type d -name "*.egg-info" -exec rm -rf {} +
Path Resolution Issues
Check that paths are correctly resolved:
# Run config validation
python label_processing/config.py
# Expected output shows all paths
Permission Issues
Ensure proper permissions:
# Make scripts executable
chmod +x tools/pipelines/*.sh
# Fix model file permissions
chmod 644 models/*.h5 models/*.pth
See Also
README.md - Main documentation
pipelines/README.md - Docker-specific docs
pipelines/HPC_QUICKSTART.md - HPC-specific docs