label_processing Package

The label_processing package contains the core image processing functionality for the Entomological Label Information Extraction system.

Package Contents

config

Configuration module for entomological label information extraction.

detect_empty_labels

Empty Label Detection Module

label_detection

Label Detection Module (Detectron2 / Detecto)

label_rotation

Label Rotation Module (TensorFlow)

ocr_vision

tensorflow_classifier

text_recognition

utils

Utility functions for the entomological label processing pipeline.

Modules

Configuration

Configuration module for entomological label information extraction. Handles platform-specific paths and environment variables.

class label_processing.config.PathConfig[source]

Bases: object

Centralized path configuration for cross-platform compatibility.

get_model_path(model_type)[source]

Get path for a specific model type.

Parameters:

model_type (str) – Type of model (‘detection’, ‘identifier’, ‘handwritten_printed’, ‘multi_single’)

Returns:

Path to the model file

Return type:

Path

Raises:

ValueError – If model type is not recognized

get_class_names(model_type)[source]

Get class names for a specific model type.

Parameters:

model_type (str) – Type of model (‘identifier’, ‘handwritten_printed’, ‘multi_single’)

Returns:

List of class names

Return type:

list

ensure_directories()[source]

Create necessary directories if they don’t exist.

validate_paths()[source]

Validate that all required paths exist.

Returns:

Dictionary mapping path names to existence status

Return type:

Dict[str, bool]

get_temp_dir()[source]

Get a temporary directory for the current platform.

Returns:

Platform-appropriate temporary directory

Return type:

Path

__str__()[source]

String representation of configuration.

Return type:

str

label_processing.config.get_project_root()[source]

Get the project root directory.

Return type:

Path

label_processing.config.get_model_path(model_type)[source]

Get path for a specific model.

Parameters:

model_type (str)

Return type:

Path

label_processing.config.get_models_dir()[source]

Get the models directory.

Return type:

Path

label_processing.config.get_output_dir()[source]

Get the output directory.

Return type:

Path

label_processing.config.validate_setup()[source]

Validate the current setup.

Returns:

True if setup is valid, False otherwise

Return type:

bool
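
The lookup-and-validate pattern documented for PathConfig can be sketched as below: a model lookup that raises ValueError for unknown types, and a validation method that returns a name-to-existence mapping. This is a minimal illustration, not the package's implementation. The class name PathConfigSketch, the directory layout, and the model file names are assumptions made for the example.

```python
from pathlib import Path
from typing import Dict


class PathConfigSketch:
    """Hypothetical stand-in for PathConfig; all file names are invented."""

    # Assumed mapping from model type to file name (not taken from the package).
    _MODEL_FILES = {
        "detection": "detection_model.pth",
        "identifier": "identifier_model",
        "handwritten_printed": "handwritten_printed_model",
        "multi_single": "multi_single_model",
    }

    def __init__(self, project_root: Path) -> None:
        self.models_dir = project_root / "models"
        self.output_dir = project_root / "output"

    def get_model_path(self, model_type: str) -> Path:
        """Return the model path or raise ValueError for unknown types."""
        try:
            return self.models_dir / self._MODEL_FILES[model_type]
        except KeyError:
            raise ValueError(f"Unknown model type: {model_type!r}") from None

    def ensure_directories(self) -> None:
        """Create the managed directories if they do not exist yet."""
        for directory in (self.models_dir, self.output_dir):
            directory.mkdir(parents=True, exist_ok=True)

    def validate_paths(self) -> Dict[str, bool]:
        """Map each managed path name to whether it currently exists."""
        return {
            "models_dir": self.models_dir.exists(),
            "output_dir": self.output_dir.exists(),
        }
```

Returning a dictionary from validate_paths, rather than a single boolean, lets a caller report exactly which path is missing.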

Empty Label Detection

Empty Label Detection Module

Classifies label images as empty or non-empty based on the proportion of dark pixels within a cropped region. Used as the first filtering step in the traditional pipeline.

label_processing.detect_empty_labels.detect_dark_pixels(image, crop_box, threshold=100)[source]

Detect the proportion of dark pixels in an image.

Parameters:
  • image (Image) – Input image.

  • crop_box (tuple) – (left, upper, right, lower) coordinates for image cropping.

  • threshold (int) – Threshold for classifying dark pixels. Defaults to 100.

Returns:

Proportion of dark pixels.

Return type:

float
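
A dependency-free sketch of the dark-pixel proportion described above. It represents a grayscale image as a nested list of 0-255 integers instead of a PIL image. It assumes that a pixel counts as dark when its value is strictly below the threshold, and that the right and lower crop coordinates are exclusive. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def dark_pixel_fraction(
    pixels: Sequence[Sequence[int]],
    crop_box: Tuple[int, int, int, int],
    threshold: int = 100,
) -> float:
    """Return the share of pixels inside crop_box that are darker than threshold."""
    left, upper, right, lower = crop_box
    # Slice rows first, then columns, to obtain the cropped region.
    cropped = [row[left:right] for row in pixels[upper:lower]]
    total = sum(len(row) for row in cropped)
    if total == 0:
        return 0.0  # An empty crop contains no dark pixels.
    dark = sum(1 for row in cropped for value in row if value < threshold)
    return dark / total
```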

label_processing.detect_empty_labels.is_empty(image, crop_margin, threshold)[source]

Determines if an image is empty based on a given threshold and crop margin.

Parameters:
  • image (PIL.Image.Image) – PIL Image object.

  • crop_margin (float) – Proportion of the image size to crop from the borders.

  • threshold (float) – Proportion of dark pixels below which the image is considered empty.

Returns:

Whether the image is empty.

Return type:

bool
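
The margin-based decision can be sketched as follows, again on a nested list of grayscale values rather than a PIL image. How the margin is converted into a crop region, the fixed dark level of 100, and the strict comparison against the threshold are all assumptions. The function name is hypothetical.

```python
from typing import Sequence


def is_empty_sketch(
    pixels: Sequence[Sequence[int]],
    crop_margin: float,
    threshold: float,
    dark_level: int = 100,
) -> bool:
    """Treat an image as empty when too few inner pixels are dark."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    # Convert the proportional margin into whole pixels per side.
    dx, dy = int(width * crop_margin), int(height * crop_margin)
    inner = [row[dx:width - dx] for row in pixels[dy:height - dy]]
    total = sum(len(row) for row in inner)
    if total == 0:
        return True  # Nothing left after cropping: consider it empty.
    dark = sum(1 for row in inner for value in row if value < dark_level)
    return dark / total < threshold
```

Cropping the borders first keeps dark edges, such as shadows or paper rims, from being mistaken for writing.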

label_processing.detect_empty_labels.find_empty_labels(input_folder, output_folder, threshold=0.01, crop_margin=0.1)[source]

Find and copy empty and non-empty labels to respective folders (keeps originals in input).

Parameters:
  • input_folder (str) – Path to the directory containing input images.

  • output_folder (str) – Path to the directory where filtered images will be stored.

  • threshold (float) – Threshold for classifying empty labels. Defaults to 0.01.

  • crop_margin (float) – Margin for cropping images. Defaults to 0.1.

Returns:

None

Return type:

None

Label Detection

Label Detection Module (Detectron2 / Detecto)

Detects and crops individual labels from full specimen photographs using a trained Faster R-CNN object-detection model. Used by the traditional MLI pipeline; the Gemini pipeline uses gemini_processor.detect_and_classify instead.

label_processing.label_detection.is_image_file(path)[source]
Return type:

bool

class label_processing.label_detection.PredictLabel(path_to_model, classes, jpg_path=None, threshold=0.8)[source]

Bases: object

Class for predicting labels using a trained object detection model.

Parameters:
path_to_model

Path to the trained model file.

Type:

str

classes

List of classes used in the model.

Type:

list

jpg_path

Path to a specific JPG file for prediction.

Type:

str|Path|None

threshold

Threshold value for scores. Defaults to 0.8.

Type:

float

model

Trained object detection model.

Type:

detecto.core.Model

property jpg_path

Property for JPG path.

Type:

str|Path|None

retrieve_model()[source]

Retrieve the trained object detection model using Detecto’s Model.load. Includes cross-platform compatibility fixes and integrity verification.

Return type:

detecto.core.Model

class_prediction(jpg_path=None)[source]

Predict labels for a given JPG file.

Parameters:

jpg_path (Path) – Path to the JPG file.

Returns:

Pandas DataFrame with prediction results.

Return type:

pd.DataFrame

label_processing.label_detection.prediction_parallel(jpg_dir, predictor, n_processes)[source]

Perform predictions for all JPG files in a directory with parallel processing.

Parameters:
  • jpg_dir (Path|str) – Path to JPG files for prediction.

  • predictor (PredictLabel) – Prediction instance.

  • n_processes (int) – Number of processes for parallel execution.

Returns:

Pandas DataFrame containing the predictions.

Return type:

pd.DataFrame

label_processing.label_detection.clean_predictions(jpg_dir, dataframe, threshold, out_dir=None)[source]

Filter predictions based on a threshold and save the results to a CSV file.

Parameters:
  • jpg_dir (Path) – Path to the directory with JPG files.

  • dataframe (pd.DataFrame) – Pandas DataFrame with predictions.

  • threshold (float) – Threshold value for scores.

  • out_dir (str) – Output directory for saving the CSV file.

Returns:

Pandas DataFrame with filtered results.

Return type:

pd.DataFrame
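
A minimal sketch of threshold filtering followed by CSV export, using plain dictionaries and the standard csv module instead of pandas. The column names "filename" and "score" and the inclusive comparison are assumptions. The helper names are hypothetical.

```python
import csv
import io
from typing import Any, Dict, List


def filter_predictions(rows: List[Dict[str, Any]], threshold: float) -> List[Dict[str, Any]]:
    """Keep rows whose score reaches the threshold (>= is an assumption)."""
    return [row for row in rows if float(row["score"]) >= threshold]


def to_csv_text(rows: List[Dict[str, Any]]) -> str:
    """Serialize rows as CSV text; returns an empty string when no rows remain."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


predictions = [
    {"filename": "a.jpg", "score": 0.95},
    {"filename": "b.jpg", "score": 0.42},
    {"filename": "c.jpg", "score": 0.80},
]
kept = filter_predictions(predictions, threshold=0.8)
csv_text = to_csv_text(kept)
```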

label_processing.label_detection.crop_picture(img_raw, path, filename, **coordinates)[source]

Crop the picture using the given coordinates.

Parameters:
  • img_raw (numpy.ndarray) – Input JPG converted to a numpy matrix by cv2.

  • path (str) – Path where the picture should be saved.

  • filename (str) – Name of the picture.

  • coordinates – Coordinates for cropping.

Return type:

None

label_processing.label_detection.create_crops(jpg_dir, dataframe, out_dir=<current working directory>)[source]

Creates crops from the original pictures in a directory, using the CSV produced by applying the model.

Parameters:
  • jpg_dir (Path) – Path to the directory with JPGs.

  • dataframe (str) – Path to the CSV file.

  • out_dir (Path) – Path to the target directory to save the cropped JPGs.

Return type:

None

Label Rotation

Label Rotation Module (TensorFlow)

Predicts and corrects the orientation of label images using a trained TensorFlow classification model that outputs one of four angle classes (0°, 90°, 180°, 270°). Used by the traditional pipeline; the Gemini pipeline determines rotation angles via the Gemini API instead.

label_processing.label_rotation.load_image(image_path)[source]

Load an image from a file path.

Parameters:

image_path (str) – Path to the image file.

Returns:

Loaded image.

Return type:

np.ndarray

label_processing.label_rotation.rotate_image(image, angle)[source]

Rotate an image based on a given angle.

Parameters:
  • image (np.ndarray) – Input image.

  • angle (int) – Angle of rotation in multiples of 90 degrees.

Returns:

Rotated image.

Return type:

np.ndarray
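
Rotation by multiples of 90 degrees needs no interpolation, because it only rearranges pixels. The sketch below shows this on a nested list. The counter-clockwise direction and the quarter_turns parameter are assumptions; the documentation above does not state the direction for this function.

```python
from typing import List, Sequence


def rotate_quarter_turns(grid: Sequence[Sequence[int]], quarter_turns: int) -> List[List[int]]:
    """Rotate a 2-D grid counter-clockwise by quarter_turns * 90 degrees."""
    rotated = [list(row) for row in grid]
    for _ in range(quarter_turns % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        rotated = [list(row) for row in zip(*rotated)][::-1]
    return rotated
```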

label_processing.label_rotation.save_image(image, output_path)[source]

Save an image to a file path.

Parameters:
  • image (np.ndarray) – Image to save.

  • output_path (str) – Path to save the image.

Returns:

True if the image is saved, False otherwise.

Return type:

bool

label_processing.label_rotation.rotate_single_image(image_path, angle, output_dir)[source]

Rotate a single image based on a given angle and save the rotated image.

Parameters:
  • image_path (str) – Path to the input image file.

  • angle (int) – Angle of rotation in multiples of 90 degrees.

  • output_dir (str) – Directory to save the rotated image.

Returns:

True if the image is rotated, False otherwise.

Return type:

bool

label_processing.label_rotation.get_image_paths(input_image_dir)[source]

Get a list of image paths in the input directory.

Parameters:

input_image_dir (str) – Directory containing input images.

Returns:

List of image paths.

Return type:

list

label_processing.label_rotation.load_images(image_paths)[source]

Load images from a list of image paths.

Parameters:

image_paths (list) – List of image paths.

Returns:

Loaded images.

Return type:

np.ndarray

label_processing.label_rotation.get_predicted_angles(model, images)[source]

Predict angles for a list of images using a trained model.

Parameters:
  • model (tf.keras.Model) – Trained model.

  • images (np.ndarray) – List of images.

Returns:

List of predicted angles.

Return type:

list
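
A classifier with four angle classes typically yields one probability per class, and the predicted class is the index of the largest value. The sketch below shows that step on plain lists. The mapping from class index to degrees is an assumption, as are the names.

```python
from typing import List, Sequence

# Assumed ordering of the four angle classes.
ANGLE_DEGREES = (0, 90, 180, 270)


def predicted_angle_classes(probabilities: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the most probable class for each image."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.9, 0.05, 0.03, 0.02],
    [0.0, 0.1, 0.2, 0.7],
]
classes = predicted_angle_classes(scores)
degrees = [ANGLE_DEGREES[index] for index in classes]
```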

label_processing.label_rotation.rotate_images(image_paths, predicted_angles, output_image_dir)[source]

Rotate images based on their predicted angles and save them to the output directory.

Parameters:
  • image_paths (list) – List of image paths.

  • predicted_angles (list) – List of predicted angles.

  • output_image_dir (str) – Directory to save rotated images.

Returns:

None

Return type:

None

label_processing.label_rotation.debug_save_by_angle(image_paths, predicted_angles, output_base_dir)[source]

Copy images into angle-named subdirectories for visual debugging.

Parameters:
  • image_paths (List[str]) – List of source image paths.

  • predicted_angles (List[int]) – Predicted angle class per image (0-3).

  • output_base_dir (str) – Base directory for angle subdirectories.

Return type:

None

label_processing.label_rotation.predict_angles(input_image_dir, output_image_dir, model_path, debug=False)[source]

Load a trained model, predict angles for input images, and rotate images accordingly.

Parameters:
  • input_image_dir (str) – Directory containing input images.

  • output_image_dir (str) – Directory to save rotated images.

  • model_path (str) – Path to the trained model.

  • debug (bool) – If True, saves images by predicted angles for debugging.

Returns:

None

Return type:

None

label_processing.label_rotation.rotate_image_pil(image_path, angle_deg, output_path)[source]

Rotate an image using PIL and save the result.

Parameters:
  • image_path (str) – Path to the input image.

  • angle_deg (float) – Counter-clockwise rotation angle in degrees.

  • output_path (str) – Path to save the rotated image.

Return type:

None

OCR Vision

class label_processing.ocr_vision.VisionApi(path, image, credentials, encoding)[source]

Bases: object

Class for interacting with the Google Cloud Vision API for OCR tasks on images.

Parameters:
static read_image(path, credentials, encoding='utf8')[source]

Read an image file and return an instance of the VisionApi class.

Parameters:
  • path (str) – Path to the image file.

  • credentials (str) – Path to the credentials JSON file.

  • encoding (str, optional) – Encoding for the result (‘ascii’ or ‘utf8’). Defaults to ‘utf8’.

Returns:

Instance of the VisionApi class.

Return type:

VisionApi

process_string(result_raw)[source]

Process the Google Vision OCR output, replacing newlines with spaces and encoding as specified.

Parameters:

result_raw (str) – Raw output string directly from Google Vision.

Returns:

Processed string.

Return type:

str
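
The string clean-up described for process_string can be sketched as follows. Dropping characters that cannot be represented in ASCII is an assumption. The documentation only says that the result is encoded as specified.

```python
def process_string_sketch(result_raw: str, encoding: str = "utf8") -> str:
    """Replace newlines with spaces and optionally reduce the text to ASCII."""
    processed = result_raw.replace("\n", " ")
    if encoding == "ascii":
        # Assumption: unrepresentable characters are silently dropped.
        processed = processed.encode("ascii", "ignore").decode("ascii")
    return processed
```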

vision_ocr()[source]

Perform the actual API call, handle errors, and return the processed transcription.

Raises:

Exception – Raises an exception if the API does not respond.

Returns:

Dictionary with the filename and the transcript.

Return type:

dict[str, str]

TensorFlow Classifier

label_processing.tensorflow_classifier.get_model(path_to_model)[source]

Load a trained Keras Sequential image classifier model with cross-platform compatibility.

Parameters:

path_to_model (str) – Path to the model file.

Returns:

Trained Keras Sequential image classifier model.

Return type:

tf.keras.Sequential

label_processing.tensorflow_classifier.class_prediction(model, class_names, jpg_dir, out_dir=None, batch_size=32, max_images=10000)[source]

Create a dataframe with predicted classes for each picture with memory-safe batch processing.

Parameters:
  • model (tf.keras.Sequential) – Trained Keras Sequential image classifier model.

  • class_names (list) – Model’s predicted classes.

  • jpg_dir (str) – Path to the directory containing the original jpgs.

  • out_dir (str) – Path where the CSV file will be stored.

  • batch_size (int) – Number of images to process in each batch (default: 32)

  • max_images (int) – Maximum number of images to process (default: 10000)

Returns:

Pandas DataFrame with the predicted results.

Return type:

pd.DataFrame
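
Memory-safe batch processing usually means capping the number of inputs and feeding them to the model in fixed-size chunks. The generator below sketches that idea with the documented defaults. The function name and the decision to reject non-positive batch sizes are assumptions.

```python
from typing import Iterator, List, Sequence


def iter_batches(
    paths: Sequence[str], batch_size: int = 32, max_images: int = 10000
) -> Iterator[List[str]]:
    """Yield at most max_images paths in chunks of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    limited = list(paths)[:max_images]
    for start in range(0, len(limited), batch_size):
        yield limited[start:start + batch_size]
```

Only one batch of images needs to be held in memory at a time, so peak memory use depends on batch_size and not on the size of the directory.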

label_processing.tensorflow_classifier.create_dirs(dataframe, path)[source]

Create separate directories for every class.

Parameters:
  • dataframe (pd.DataFrame) – DataFrame containing the classes as a column.

  • path (str) – Path of the chosen directory.

Return type:

None

label_processing.tensorflow_classifier.make_file_name(label_id, pic_class)[source]

Create a fitting filename.

Parameters:
  • label_id (str) – String containing the label id.

  • pic_class (str) – Class of the label.

Returns:

The created filename.

Return type:

str

label_processing.tensorflow_classifier.rename_picture(img_raw, path, filename, pic_class)[source]

Rename the pictures using the predicted class.

Parameters:
  • img_raw (numpy.ndarray) – Input jpg converted to a numpy matrix by cv2.

  • path (str) – Path where the picture should be saved.

  • filename (str) – Name of the picture.

  • pic_class (str) – Class of the label.

Return type:

None

label_processing.tensorflow_classifier.filter_pictures(jpg_dir, dataframe, out_dir=<current working directory>)[source]

Create new folders for each class of the newly named classified pictures.

Parameters:
  • jpg_dir (str) – Path to directory with jpgs.

  • dataframe (pd.DataFrame) – Pandas DataFrame with class predictions.

  • out_dir (Path) – Path to the target directory to save the classified JPGs.

Return type:

None

Text Recognition

label_processing.text_recognition.find_tesseract()[source]

Searches for the tesseract executable and raises an error if it is not found.

Return type:

None

class label_processing.text_recognition.ImageProcessor(image, path, blocksize=None, c_value=None)[source]

Bases: object

A class for image preprocessing and other image actions.

Parameters:
  • image (np.ndarray)

  • path (str)

  • blocksize (int)

  • c_value (int)

property blocksize: int
property c_value: int
property image: ndarray
property path: str
copy_this()[source]

Creates a copy of the current ImageProcessor instance.

Returns:

A copy of the current ImageProcessor instance.

Return type:

ImageProcessor

static read_image(path)[source]

Read an image from the specified path and return an ImageProcessor instance.

Parameters:

path (str) – The path to a JPG file.

Returns:

An instance of the ImageProcessor class.

Return type:

ImageProcessor

get_grayscale()[source]

Convert the image to grayscale.

Returns:

An instance representing the grayscale image.

Return type:

ImageProcessor

blur(ksize=(5, 5))[source]

Apply Gaussian blur to the image.

Parameters:

ksize (Tuple[int, int], optional) – The kernel size for blurring. Defaults to (5, 5).

Returns:

An instance representing the blurred image.

Return type:

ImageProcessor

remove_noise()[source]

Remove noise from the image using median blur.

Returns:

An instance representing the noise-reduced image.

Return type:

ImageProcessor

apply_clahe(clip_limit=2.0, tile_grid_size=(8, 8))[source]

Apply Contrast Limited Adaptive Histogram Equalization (CLAHE).

CLAHE improves contrast in images with uneven illumination or low contrast, which is common in aged specimen labels or images with inconsistent lighting.

Parameters:
  • clip_limit (float, optional) – Threshold for contrast limiting. Higher values give more contrast. Defaults to 2.0.

  • tile_grid_size (tuple[int, int], optional) – Size of grid for histogram equalization. Defaults to (8, 8).

Returns:

An instance of the ImageProcessor class with CLAHE applied.

Return type:

ImageProcessor

normalize_illumination()[source]

Normalize image illumination using morphological operations.

This method corrects uneven lighting by estimating and removing the background illumination, useful for images with shadows or uneven flash lighting.

Returns:

An instance of the ImageProcessor class with normalized illumination.

Return type:

ImageProcessor

thresholding(thresh_mode)[source]

Perform thresholding on the image.

Parameters:

thresh_mode (Threshmode) – The thresholding mode to use (OTSU, ADAPTIVE_MEAN, or ADAPTIVE_GAUSSIAN).

Returns:

An instance representing the thresholded image.

Return type:

ImageProcessor
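
The OTSU mode refers to Otsu's method, which picks the global threshold that maximizes the between-class variance of the grayscale histogram. The pure-Python sketch below illustrates the idea on a flat list of 0-255 values. It is not the package's code, which presumably delegates this step to an image library. The function names are hypothetical.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the threshold t maximizing between-class variance (classes: <= t and > t)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_bg = 0
    sum_bg = 0
    best_threshold, best_variance = 0, -1.0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # No background pixels yet.
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # All pixels are background; nothing left to separate.
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map values above the threshold to 255 and all others to 0."""
    return [255 if value > threshold else 0 for value in pixels]
```

The adaptive modes differ in that they compute a separate threshold for each pixel neighbourhood instead of one global value.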

dilate()[source]

Dilate the image using a 5x5 kernel.

Returns:

An instance representing the dilated image.

Return type:

ImageProcessor

erode()[source]

Erode the image using a 5x5 kernel.

Returns:

An instance representing the eroded image.

Return type:

ImageProcessor

get_skew_angle()[source]

Calculate and return the skew angle of the image.

Returns:

The skew angle in degrees or None if it couldn’t be determined.

Return type:

Optional[np.float64]

deskew(angle)[source]

Rotate the image to deskew it.

Parameters:

angle (Optional[np.float64]) – The skew angle to use for deskewing.

Returns:

An instance representing the deskewed image.

Return type:

ImageProcessor

preprocessing(thresh_mode, use_clahe=False, normalize_illum=False, clahe_clip_limit=2.0, clahe_tile_grid_size=(8, 8))[source]

Perform a series of preprocessing steps on the image.

Parameters:
  • thresh_mode (Threshmode) – The thresholding mode to use (OTSU, ADAPTIVE_MEAN, or ADAPTIVE_GAUSSIAN).

  • use_clahe (bool, optional) – Apply CLAHE for contrast enhancement. Useful for low-contrast or faded labels. Defaults to False.

  • normalize_illum (bool, optional) – Apply illumination normalization to correct uneven lighting. Useful for images with shadows or hotspots. Defaults to False.

  • clahe_clip_limit (float, optional) – CLAHE contrast limiting threshold. Defaults to 2.0.

  • clahe_tile_grid_size (tuple[int, int], optional) – CLAHE grid size. Defaults to (8, 8).

Returns:

An instance of the ImageProcessor class representing the preprocessed image.

Return type:

ImageProcessor

read_qr_code()[source]

Tries to identify if a picture has a QR-code and then reads and returns it.

Returns:

Decoded QR-code text as a str or None if there is no QR-code found.

Return type:

Optional[str]

save_image(dir_path, appendix=None)[source]

Save the image to a specified directory with an optional appendix.

Parameters:
  • dir_path (str | Path) – The directory path where the image will be saved.

  • appendix (str, optional) – An optional string to append to the image filename. Defaults to None.

Return type:

None

class label_processing.text_recognition.Threshmode(value)[source]

Bases: Enum

Different possibilities for thresholding.

Parameters:

Enum (int)

OTSU = 1
ADAPTIVE_MEAN = 2
ADAPTIVE_GAUSSIAN = 3
classmethod eval(threshmode)[source]
Parameters:

threshmode (int)

Return type:

Enum
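
A minimal sketch of an enumeration with the three documented members and an eval classmethod that maps an integer to a member. The real behaviour for out-of-range values is not documented. Raising ValueError here is an assumption that follows from the standard Enum lookup.

```python
from enum import Enum


class ThreshmodeSketch(Enum):
    """Hypothetical mirror of Threshmode with the documented values."""

    OTSU = 1
    ADAPTIVE_MEAN = 2
    ADAPTIVE_GAUSSIAN = 3

    @classmethod
    def eval(cls, threshmode: int) -> "ThreshmodeSketch":
        # Enum lookup by value; raises ValueError for unknown integers.
        return cls(threshmode)
```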

class label_processing.text_recognition.Tesseract(languages='eng+deu+fra+ita+spa+por', config='--psm 6 --oem 3', image=None)[source]

Bases: object

Parameters:

image (Optional[ImageProcessor])

property image: ImageProcessor
image_to_string()[source]

Run Tesseract OCR on the image with the configured languages and options.

Returns:

A dictionary containing the image ID (filename), OCR-processed text, and confidence score.

Return type:

dict[str, str | float]

OCR preprocessing summary

The text_recognition.ImageProcessor applies the following steps prior to Tesseract OCR:
  • grayscale conversion

  • Gaussian/median denoising

  • binarization via Otsu or adaptive mean/Gaussian thresholding (block size and C configurable)

  • skew estimation within ±10° and deskewing

  • optional morphological cleaning (dilation/erosion)

Google Vision OCR is invoked on the rotated ROI without thresholding; word-level bounding boxes are captured via ocr_vision.

Utilities

Utility functions for the entomological label processing pipeline.

Provides image validation, filename generation, JSON/CSV I/O, NURI format checking, and model integrity verification helpers used across all pipeline variants.

label_processing.utils.validate_image_integrity(filepath, max_size_mb=25, max_dimensions=(8000, 8000))[source]

Validate image file integrity with strict memory safety limits.

Parameters:
  • filepath (str) – path to image file

  • max_size_mb (int) – maximum file size in MB (default: 25MB)

  • max_dimensions (tuple) – maximum width/height in pixels (default: 8000x8000)

Returns:

True if image is valid and safe to process, False otherwise

Return type:

bool
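
The size and dimension limits can be sketched as a pure check on values that are already known. Treating 1 MB as 1024 × 1024 bytes and rejecting non-positive dimensions are assumptions. The documented function additionally opens the file to validate it, which this sketch leaves out.

```python
from typing import Tuple


def within_limits(
    size_bytes: int,
    dimensions: Tuple[int, int],
    max_size_mb: int = 25,
    max_dimensions: Tuple[int, int] = (8000, 8000),
) -> bool:
    """Return True when file size and pixel dimensions stay within the limits."""
    width, height = dimensions
    max_width, max_height = max_dimensions
    if size_bytes > max_size_mb * 1024 * 1024:
        return False
    return 0 < width <= max_width and 0 < height <= max_height
```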

label_processing.utils.check_dir(directory)[source]

Checks if the directory contains valid jpg files with integrity validation.

Parameters:

directory (str) – path to directory

Raises:
  • FileNotFoundError – raised if no valid jpg files are found in the directory

  • ValueError – raised if corrupted image files are detected

Return type:

None

label_processing.utils.generate_filename(original_path, appendix, extension=None)[source]

Gets the path to a file or directory as an input and returns it with an appendix added to the end.

Parameters:
  • original_path (str) – original path to file or directory

  • appendix (str) – what needs to be appended

  • extension (Optional[str]) – either no extension (for directories) or a file extension as a string

Returns:

new file or directory name

Return type:

str

label_processing.utils.save_json(data, filename, path)[source]

Saves a JSON file in a human-readable format.

Parameters:
  • data (list[dict]) – output of the OCR

  • filename (str) – name for the json file

  • path (str) – path where the json should be saved

Return type:

None
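
Writing human-readable JSON with the standard library can be sketched as follows. The indentation of 4 spaces, UTF-8 output, and ensure_ascii=False are assumptions about what human-readable means here. Unlike the documented function, the sketch returns the written path for convenience.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_json_sketch(data: List[Dict[str, str]], filename: str, path: str) -> str:
    """Write data as indented UTF-8 JSON and return the resulting file path."""
    target = os.path.join(path, filename)
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4, ensure_ascii=False)
    return target


# Round trip through a temporary directory.
records = [{"ID": "label_001.jpg", "text": "Coll. Müller 1902"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    written = save_json_sketch(records, "ocr.json", tmp_dir)
    with open(written, encoding="utf-8") as handle:
        raw_text = handle.read()
restored = json.loads(raw_text)
```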

label_processing.utils.check_nuri_format(transcript)[source]

Check whether the OCR transcription text contains a NURI pattern.

Parameters:

transcript (str) – text field from OCR output

Returns:

True if NURI pattern found, False otherwise

Return type:

bool

label_processing.utils.replace_nuri(transcript)[source]

Correct NURI format in OCR transcription JSON output.

Parameters:

transcript (dict[str, str]) – JSON transcript with “ID” and “text” fields.

Returns:

JSON transcript with corrected NURI formats in “text” field.

Return type:

dict[str, str]

label_processing.utils.load_dataframe(filepath_csv)[source]

Loads the CSV file using Pandas.

Parameters:

filepath_csv (str) – path to the CSV file

Returns:

The CSV as a Pandas DataFrame

Return type:

pd.DataFrame

label_processing.utils.load_jpg(filepath)[source]

Loads the jpg files using the OpenCV module.

Parameters:

filepath (str) – path to jpg files

Returns:

OpenCV image object

Return type:

np.ndarray

label_processing.utils.load_json(file)[source]

Load JSON data from a file and deserialize it.

Parameters:

file (str) – The name of the file containing JSON data.

Returns:

The JSON data as a dictionary

Return type:

dict

label_processing.utils.read_vocabulary(file)[source]

Read a CSV file containing vocabulary and convert it to a dictionary.

Parameters:

file (str) – The name of the CSV file containing vocabulary data.

Returns:

A dictionary where keys and values are taken from the CSV data.

Return type:

dict
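
Turning a vocabulary CSV into a dictionary can be sketched with the standard csv module. The two-column layout without a header row (key first, value second) is an assumption. The helper takes any iterable of lines, so an open file handle works as well as a list.

```python
import csv
from typing import Dict, Iterable


def vocabulary_from_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Build a dictionary from CSV rows, using column 0 as key and column 1 as value."""
    vocabulary: Dict[str, str] = {}
    for row in csv.reader(lines):
        if len(row) >= 2:  # Skip blank or incomplete rows.
            vocabulary[row[0].strip()] = row[1].strip()
    return vocabulary
```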

label_processing.utils.verify_model_integrity(model_path, checksums_file=None, require_checksum=True)[source]

SECURITY: Mandatory model file integrity verification using SHA256 checksums.

Parameters:
  • model_path (str) – path to model file

  • checksums_file (str) – path to checksums file (auto-detected if None)

  • require_checksum (bool) – if True, requires checksum file to exist (default: True)

Returns:

True if model integrity is verified, False otherwise

Return type:

bool

Raises:

SecurityError – If model integrity cannot be verified and require_checksum=True
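
SHA256-based integrity verification can be sketched with hashlib as shown below. The checksum store is a plain dictionary keyed by file name, which is an assumption. SecurityErrorSketch stands in for the documented SecurityError, and returning False for a failed check when require_checksum is False is also an assumption.

```python
import hashlib
import hmac
import os
import tempfile
from typing import Dict


class SecurityErrorSketch(Exception):
    """Stand-in for the SecurityError named in the documentation."""


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sketch(model_path: str, checksums: Dict[str, str], require_checksum: bool = True) -> bool:
    """Compare the file's SHA256 digest with the expected value."""
    expected = checksums.get(os.path.basename(model_path))
    verified = expected is not None and hmac.compare_digest(
        sha256_of_file(model_path), expected.lower()
    )
    if not verified and require_checksum:
        raise SecurityErrorSketch(f"Integrity check failed for {model_path}")
    return verified


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = os.path.join(tmp_dir, "model.bin")
    with open(model_file, "wb") as handle:
        handle.write(b"weights")
    good = {"model.bin": sha256_of_file(model_file)}
    ok = verify_sketch(model_file, good)
    mismatch = verify_sketch(model_file, {"model.bin": "0" * 64}, require_checksum=False)
```

Verifying the digest before loading matters because serialized model files can execute code when deserialized.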