scripts Package

The scripts package contains standalone utilities, evaluation scripts, and processing tools.

Package Contents

`health_check`	Health check script for Entomological Label Information Extraction. Validates system requirements and provides diagnostic information.
`evaluation`
`postprocessing`
`processing`

Modules

Health Check

Health check script for Entomological Label Information Extraction. Validates system requirements and provides diagnostic information.

scripts.health_check.check_python_version()[source]: Check Python version and provide recommendations.

scripts.health_check.check_docker()[source]: Check Docker installation and status.

scripts.health_check.check_project_structure()[source]: Check if we’re in the correct project directory.

scripts.health_check.check_system_resources()[source]: Check available system resources.

scripts.health_check.check_dependencies()[source]: Check for optional dependencies.

scripts.health_check.main()[source]: Run comprehensive health check.
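
As an illustration of the kind of check listed above, here is a minimal sketch using only the standard library. The function names `python_version_ok` and `docker_on_path`, the default minimum version, and the return shapes are all assumptions made for this example; they are not taken from the actual `health_check` module.

```python
import shutil
import sys


def python_version_ok(minimum=(3, 9)):
    """Return (ok, message) for the running interpreter.

    The default minimum version is a hypothetical threshold.
    """
    current = sys.version_info[:2]
    ok = current >= minimum
    message = (
        f"Python {current[0]}.{current[1]} "
        f"(minimum {minimum[0]}.{minimum[1]})"
    )
    return ok, message


def docker_on_path():
    """Return True if a `docker` executable is found on PATH.

    This only detects an installation; it does not verify that the
    Docker daemon is running.
    """
    return shutil.which("docker") is not None


if __name__ == "__main__":
    ok, message = python_version_ok()
    print("OK  " if ok else "WARN", message)
    print("OK  " if docker_on_path() else "WARN", "docker on PATH")
```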

Evaluation Scripts

The evaluation subpackage contains comprehensive evaluation and analysis tools:

scripts.evaluation.analysis_eval.parse_arguments()[source]

Parse command-line arguments using argparse.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.evaluation.analysis_eval.evaluate_labels(empty_folder, not_empty_folder)[source]

Evaluate predicted labels against ground truth labels.

Parameters:

empty_folder (str) – Path to the directory containing images of labels predicted as empty.
not_empty_folder (str) – Path to the directory containing images of labels predicted as not empty.

Return type:

None

scripts.evaluation.analysis_eval.main()[source]: Main function to execute label evaluation.

scripts.evaluation.classifiers_eval.parse_arguments()[source]

Parse command-line arguments and return the parsed arguments.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.evaluation.classifiers_eval.main()[source]: Main function to evaluate classifier accuracy and generate reports.

scripts.evaluation.cluster_eval.parse_arguments()[source]

Parse command-line arguments and return the parsed arguments.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.evaluation.cluster_eval.is_word(token)[source]

Checks whether a token is a valid word (not punctuation or too short).

Parameters:: token (str) – The token to check.
Returns:: True if the token is a valid word, False otherwise.
Return type:: bool
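
A word filter of this kind could look roughly like the sketch below. The name `is_word_sketch` and the `min_length=2` threshold are assumptions for illustration; the real function may use a different cutoff or punctuation test.

```python
import string


def is_word_sketch(token, min_length=2):
    """Return True for tokens that are long enough and not pure punctuation.

    `min_length=2` is an assumed threshold, not the script's actual value.
    """
    if len(token) < min_length:
        return False
    # Reject tokens made up entirely of ASCII punctuation characters.
    return not all(ch in string.punctuation for ch in token)
```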

scripts.evaluation.cluster_eval.tokenize_text(labels, ground_truth)[source]

Tokenizes and lowercases text fields from labels.

Parameters:

labels (List[Dict[str, str]] | Dict[str, tuple[str, str]]) – Labels to tokenize.
ground_truth (bool) – Whether the labels are ground truth data.

Returns:

Tokenized labels with IDs.

Return type:

List[Dict[str, Union[str, List[str]]]]
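
The two accepted input shapes can be handled along these lines. This is a sketch under stated assumptions: ground truth is taken to be a list of `{"ID", "text"}` entries, predictions a dict of ID to `(cluster_id, transcript)`, and the output key `"tokens"` is a hypothetical name. Splitting on whitespace stands in for whatever tokenizer the script really uses.

```python
def tokenize_labels_sketch(labels, ground_truth):
    """Lowercase and whitespace-split the text of each label.

    Assumes ground truth is a list of {"ID": ..., "text": ...} dicts and
    predictions are a dict mapping ID -> (cluster_id, transcript).
    """
    if ground_truth:
        pairs = [(entry["ID"], entry["text"]) for entry in labels]
    else:
        pairs = [(label_id, value[1]) for label_id, value in labels.items()]
    # "tokens" is an assumed key name for the tokenized text.
    return [
        {"ID": label_id, "tokens": text.lower().split()}
        for label_id, text in pairs
    ]
```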

scripts.evaluation.cluster_eval.build_word_vectors(labels, ground_truth)[source]

Builds a Word2Vec model from the tokenized labels.

Parameters:

labels (List[Dict[str, str]] | Dict[str, tuple[str, str]]) – Labels to build vectors from.
ground_truth (bool) – Whether the labels are ground truth data.

Returns:

A tuple containing the trained Word2Vec model and the tokenized labels.

Return type:

tuple

scripts.evaluation.cluster_eval.build_mean_label_vector(model, labels)[source]

Computes the mean vector for each label using the Word2Vec model. Also tracks labels that have no valid tokens (and thus no vector).

Parameters:

model (gensim.models.Word2Vec) – The trained Word2Vec model.
labels (List[Dict[str, List[str]]]) – Tokenized labels with IDs.

Returns:

A tuple containing a dictionary of mean vectors and a list of skipped IDs.

Return type:

tuple
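
The averaging step can be sketched without gensim by using a plain dict as the token-to-vector lookup. Here `vectors` stands in for the trained model, and the `"tokens"` key and the function name are assumptions made for this example.

```python
def mean_label_vectors_sketch(vectors, labels):
    """Average the word vectors of each label's tokens.

    `vectors` is a plain dict of token -> list of floats, standing in for
    a trained Word2Vec model's lookup table.
    """
    means, skipped = {}, []
    for label in labels:
        known = [vectors[t] for t in label["tokens"] if t in vectors]
        if not known:
            # No token of this label has a vector, so record its ID.
            skipped.append(label["ID"])
            continue
        # Component-wise mean across all known token vectors.
        means[label["ID"]] = [sum(col) / len(known) for col in zip(*known)]
    return means, skipped
```

Labels whose tokens are all missing from the lookup end up in the skipped list rather than receiving a zero vector, which matches the description of tracking labels with no vector.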

scripts.evaluation.cluster_eval.load_json(path)[source]

Loads the ground truth JSON file.

Parameters:: path (str) – Path to the JSON file.
Returns:: List of entries with “ID” and “text” fields.
Return type:: List[Dict[str, str]]

scripts.evaluation.cluster_eval.load_cluster_csv(path)[source]

Loads cluster assignments from a CSV file.

Parameters:: path (str) – Path to the CSV file.
Returns:: Dictionary mapping label IDs to their cluster ID and transcript. Skips entries with missing “Transcript” or “Cluster_ID”.
Return type:: Dict[str, List[str]]
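
A loader with this skipping behaviour might be written with `csv.DictReader` as below. The column names “Cluster_ID” and “Transcript” come from the description; the ID column name `"ID"` and the function name are assumptions.

```python
import csv


def load_cluster_csv_sketch(path, id_column="ID"):
    """Map each label ID to [cluster_id, transcript], skipping incomplete rows.

    The "ID" column name is an assumption for this sketch.
    """
    clusters = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            cluster_id = (row.get("Cluster_ID") or "").strip()
            transcript = (row.get("Transcript") or "").strip()
            # Skip rows where either required field is missing or empty.
            if not cluster_id or not transcript:
                continue
            clusters[row[id_column]] = [cluster_id, transcript]
    return clusters
```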

scripts.evaluation.cluster_eval.plot_tsne(label_vectors, clusters, out_path, verbose, skipped_ids)[source]

Generates and saves a t-SNE scatter plot with cluster coloring and hover text. Skipped labels (those without vectors) are included as a separate “No vector” cluster.

Parameters:

label_vectors (Dict[str, np.ndarray]) – Dictionary of label IDs to their mean vectors.
clusters (Dict[str, List[str]]) – Dictionary mapping label IDs to their cluster ID and transcript.
out_path (str) – Path to save the t-SNE plot HTML file.
verbose (bool) – Whether to print verbose output.
skipped_ids (List[str]) – List of label IDs that had no valid tokens and thus no vector.

Returns:

The generated t-SNE plot.

Return type:

plotly.graph_objects.Figure

scripts.evaluation.cluster_eval.main(args)[source]

Main entry point for clustering visualization. Loads data, trains embeddings, computes vectors, runs t-SNE, and saves plot.

Parameters:: args (argparse.Namespace) – Parsed command-line arguments.
Returns:: None

scripts.evaluation.detection_eval.parse_arguments()[source]

Parse command-line arguments and return the parsed arguments.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.evaluation.detection_eval.main()[source]: Main function to evaluate IOU scores and generate visualizations.
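
For reference, the IOU (intersection over union) score of two axis-aligned boxes is the area of their overlap divided by the area of their union. The sketch below assumes boxes are given as `(xmin, ymin, xmax, ymax)` tuples; the coordinate layout used by the evaluation script itself is not documented here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0
```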

scripts.evaluation.ocr_eval.parse_arguments()[source]

Parse command-line arguments and return the parsed arguments.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.evaluation.ocr_eval.main()[source]: Main function to evaluate OCR predictions and save results.

scripts.evaluation.redundancy.parse_arguments()[source]

Parse command-line arguments and return the parsed arguments.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.evaluation.redundancy.main()[source]: Main function to evaluate redundancy in a dataset and save results.

scripts.evaluation.rotation_eval.parse_arguments()[source]

Parse command-line arguments and return the parsed arguments.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.evaluation.rotation_eval.load_images(input_image_dir)[source]

Load images from the given directory and extract ground truth labels.

Parameters:: input_image_dir (str) – Directory containing images.
Returns:: (Loaded images as numpy array, Ground truth labels as numpy array, List of filenames)
Return type:: tuple

scripts.evaluation.rotation_eval.rotate_image(img_path, angle)[source]

Rotate the image by the given angle and save it back to the same path.

Parameters:

img_path (str) – Path to the image file.
angle (int) – Rotation angle index (0, 1, 2, 3 corresponding to 0, 90, 180, 270 degrees).

Return type:

None
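
The angle index maps to degrees as index × 90. The sketch below shows that mapping and applies it to a small nested-list grid in place of an image file. The helper names are hypothetical, and the counterclockwise direction is an assumption; the script's actual rotation direction is not stated here.

```python
def angle_index_to_degrees(angle):
    """Convert a rotation index (0, 1, 2, 3) to degrees (0, 90, 180, 270)."""
    if angle not in (0, 1, 2, 3):
        raise ValueError(f"angle index must be 0-3, got {angle!r}")
    return angle * 90


def rotate_grid(grid, angle):
    """Rotate a 2D list by `angle` quarter turns (counterclockwise assumed)."""
    angle_index_to_degrees(angle)  # validates the index
    result = [list(row) for row in grid]
    for _ in range(angle):
        # Transpose, then reverse row order: one counterclockwise quarter turn.
        result = [list(row) for row in zip(*result)][::-1]
    return result
```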

scripts.evaluation.rotation_eval.evaluate_rotation_model(input_image_dir, output_folder_path)[source]

Load model, predict rotations, and evaluate performance.

Parameters:

input_image_dir (str) – Directory containing images.
output_folder_path (str) – Path to save evaluation results.

Return type:

None

scripts.evaluation.rotation_eval.main()[source]: Main function to execute rotation model evaluation.

Post-processing Scripts

The postprocessing subpackage provides tools for result consolidation and processing:

Consolidate Pipeline Results Script

Creates a single JSON file that links all per-label results across the pipeline (detection → classification → rotation → OCR → post‑processing).

Supports both the traditional (TensorFlow-based) pipeline and the Gemini pipeline. Output is a flat list of per-label entries, each containing: source_image, label_filename, label_index, category, bounding-box coordinates, rotation_angle, and ocr (method, text, confidence).
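
To make the output shape concrete, the sketch below builds one such per-label entry and serializes the flat list. The top-level field names follow the description above; the `bbox` key and its `xmin`/`ymin`/`xmax`/`ymax` layout are assumptions, and every value is a placeholder rather than real pipeline output.

```python
import json

# One per-label entry; all values are illustrative placeholders.
entry = {
    "source_image": "example_image.jpg",
    "label_filename": "example_image_label_0.jpg",
    "label_index": 0,
    "category": "example_category",
    # Key name and coordinate layout are assumed for this sketch.
    "bbox": {"xmin": 0, "ymin": 0, "xmax": 100, "ymax": 50},
    "rotation_angle": 0,
    "ocr": {"method": "example_method", "text": "example text", "confidence": 0.5},
}

# The consolidated file is described as a flat list of such entries.
consolidated = [entry]
print(json.dumps(consolidated, indent=2))
```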

scripts.postprocessing.consolidate_results.parse_arguments()[source]

Parse command-line arguments.

Return type:: argparse.Namespace

scripts.postprocessing.consolidate_results.consolidate_results(output_dir)[source]

Auto-detect pipeline type and consolidate all results.

Parameters:: output_dir (str)
Return type:: List[Dict[str, Any]]

scripts.postprocessing.consolidate_results.main()[source]: Main entry point.

scripts.postprocessing.process.parse_arguments()[source]

Parse command-line arguments using argparse.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.postprocessing.process.process_ocr_output(ocr_output, outdir)[source]

Process OCR output to identify Nuri labels, empty labels, and correct plausible labels.

Parameters:

ocr_output (str) – Path to the OCR output JSON file.
outdir (str) – Directory to save processed files.

Return type:

None

scripts.postprocessing.process.main()[source]: Main function to parse arguments and execute OCR processing.

Processing Scripts

The processing subpackage contains core processing utilities:

scripts.processing.analysis.parse_arguments()[source]

Parse command-line arguments using argparse.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.processing.analysis.validate_directories(input_dir, output_dir)[source]

Validate that the input directory exists and create the output directory if needed.

Parameters:

input_dir (str) – Path to the input directory.
output_dir (str) – Path to the output directory.

Return type:

None
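
The validate-then-create behaviour described above can be sketched with `pathlib`. Raising `FileNotFoundError` for a missing input directory is an assumption; the actual script may report the problem differently, for example by exiting.

```python
from pathlib import Path


def validate_directories_sketch(input_dir, output_dir):
    """Raise if the input directory is missing; create the output directory."""
    if not Path(input_dir).is_dir():
        # Assumed error type for this sketch.
        raise FileNotFoundError(f"Input directory not found: {input_dir}")
    # Create the output directory (and parents) only if it does not exist yet.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
```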

scripts.processing.analysis.main()[source]: Main execution function. Parses command-line arguments, validates directories, processes images, and prints the execution duration.

scripts.processing.classifiers.parse_arguments()[source]

Parse command-line arguments for the classification script.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.processing.classifiers.resolve_default_model_path(model_int)[source]

Get the default model path based on model number using centralized configuration.

Parameters:: model_int (int) – Model number (1-3)
Returns:: Path to the default model
Return type:: Path

scripts.processing.classifiers.get_class_names_by_model(model_int)[source]

Return default class names for the selected model number using centralized configuration.

Parameters:: model_int (int) – Model number (1-3)
Returns:: Class labels
Return type:: list[str]

scripts.processing.classifiers.load_class_names_from_file(path)[source]

Load class names from a text file (one per line).

Parameters:: path (str) – Path to the class names file.
Returns:: List of class names.
Return type:: list[str]
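
Reading a one-name-per-line file could be done as in the sketch below. Stripping whitespace and skipping blank lines are assumptions about how the real loader treats such input.

```python
def load_class_names_sketch(path):
    """Read class names from a text file, one name per line.

    Blank lines are skipped and surrounding whitespace is stripped;
    both behaviours are assumptions for this sketch.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```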

scripts.processing.classifiers.main()[source]

Main function to execute classification using a TensorFlow model.

Return type:: None

class scripts.processing.detection.OptimizedPredictLabel(path_to_model, classes, threshold=0.8, use_cache=True)[source]

Bases: object

Optimized version of PredictLabel with caching and streamlined loading.

Parameters:

path_to_model (str)
classes (list)
threshold (float)
use_cache (bool)

load_model_optimized()[source]

Load model with optimized strategy.

Return type:: detecto.core.Model

class_prediction(jpg_path)[source]

Predict labels for a given JPG file.

Parameters:: jpg_path (Path) – Path to the JPG file.
Returns:: Prediction results
Return type:: pd.DataFrame

scripts.processing.detection.parse_arguments()[source]

Parse command-line arguments using argparse.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

scripts.processing.detection.clear_model_cache()[source]: Clear all cached models.
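
One common way to implement this kind of model caching is a module-level dictionary keyed by model path, with a function that empties it. The sketch below is a generic illustration, not the module's actual code: `_MODEL_CACHE`, `get_model_cached`, `clear_cache_sketch`, and the `loader` callable are all hypothetical names.

```python
_MODEL_CACHE = {}


def get_model_cached(path_to_model, loader, use_cache=True):
    """Load a model via `loader`, reusing a cached instance when allowed.

    `loader` is any callable that takes the model path and returns a model.
    """
    if use_cache and path_to_model in _MODEL_CACHE:
        return _MODEL_CACHE[path_to_model]
    model = loader(path_to_model)
    if use_cache:
        _MODEL_CACHE[path_to_model] = model
    return model


def clear_cache_sketch():
    """Drop every cached model so the next request reloads from disk."""
    _MODEL_CACHE.clear()
```

With caching enabled, repeated requests for the same path invoke the loader only once until the cache is cleared.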

scripts.processing.detection.setup_device(device_arg)[source]

Setup optimal device for inference.

Parameters:: device_arg (str) – Device argument from command line
Returns:: Best available device
Return type:: str
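
Device selection of this kind often follows the pattern sketched below: honour an explicit request, otherwise prefer a GPU when one is available. The `"auto"` sentinel and the function name `pick_device` are assumptions, and the sketch falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device(device_arg="auto"):
    """Return the device string to use for inference.

    "auto" is an assumed sentinel meaning "choose the best available device".
    """
    if device_arg != "auto":
        # An explicit request such as "cpu" or "cuda" is returned unchanged.
        return device_arg
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```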

scripts.processing.detection.main()[source]: Main execution function with performance optimizations.