scripts Package

The scripts package contains standalone utilities, evaluation scripts, and processing tools.

Package Contents

health_check

Health check script for Entomological Label Information Extraction. Validates system requirements and provides diagnostic information.

evaluation

postprocessing

processing

Modules

Health Check

Health check script for Entomological Label Information Extraction. Validates system requirements and provides diagnostic information.

scripts.health_check.check_python_version()

Check Python version and provide recommendations.
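As a rough illustration of this kind of check, the sketch below compares the running interpreter against a minimum version. The threshold `MIN_VERSION` and the optional arguments are assumptions made for the example; the documented function takes no arguments and its actual threshold is not stated here.

```python
import sys

# Hypothetical minimum; the real script's threshold is not documented here.
MIN_VERSION = (3, 8)


def check_python_version(version_info=None, minimum=MIN_VERSION):
    """Return (ok, message) for the given or the running interpreter version."""
    info = sys.version_info if version_info is None else version_info
    current = (info[0], info[1])
    if current >= minimum:
        return True, f"Python {current[0]}.{current[1]} is supported"
    return False, (
        f"Python {current[0]}.{current[1]} found; "
        f"{minimum[0]}.{minimum[1]} or newer is recommended"
    )


if __name__ == "__main__":
    ok, message = check_python_version()
    print(("OK: " if ok else "WARNING: ") + message)
```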

scripts.health_check.check_docker()

Check Docker installation and status.
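One possible way to perform such a check with the standard library alone is sketched below. The returned dictionary and the `timeout` parameter are assumptions for illustration, not the script's documented behaviour.

```python
import shutil
import subprocess


def check_docker(timeout=10):
    """Return a dict describing whether Docker is installed and running."""
    status = {"installed": False, "running": False}
    docker = shutil.which("docker")  # None when the CLI is not on PATH
    if docker is None:
        return status
    status["installed"] = True
    try:
        # `docker info` exits non-zero when the daemon is unreachable.
        result = subprocess.run(
            [docker, "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        status["running"] = result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        pass  # treat a failed or hanging call as "not running"
    return status
```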

scripts.health_check.check_project_structure()

Check whether the script is being run from the correct project directory.

scripts.health_check.check_system_resources()

Check available system resources.
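A minimal sketch of a resource check using only the standard library follows. Which resources the real script inspects is not documented here; the CPU count, the free disk space, and the `min_free_gb` threshold are assumptions chosen for the example.

```python
import os
import shutil


def check_system_resources(path=".", min_free_gb=1.0):
    """Report CPU count and free disk space for the given path."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / (1024 ** 3)
    return {
        "cpu_count": os.cpu_count() or 1,  # cpu_count() may return None
        "disk_free_gb": round(free_gb, 2),
        "disk_ok": free_gb >= min_free_gb,
    }
```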

scripts.health_check.check_dependencies()

Check for optional dependencies.

scripts.health_check.main()

Run comprehensive health check.

Evaluation Scripts

The evaluation subpackage contains comprehensive evaluation and analysis tools:

scripts.evaluation.analysis_eval.parse_arguments()

Parse command-line arguments using argparse.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.evaluation.analysis_eval.evaluate_labels(empty_folder, not_empty_folder)

Evaluate predicted labels against ground truth labels.

Parameters:
  • empty_folder (str) – Path to the directory containing images predicted as empty labels.

  • not_empty_folder (str) – Path to the directory containing images predicted as non-empty labels.

Return type:

None

scripts.evaluation.analysis_eval.main()

Main function to execute label evaluation.

scripts.evaluation.classifiers_eval.parse_arguments()

Parse command-line arguments and return the parsed arguments.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.evaluation.classifiers_eval.main()

Main function to evaluate classifier accuracy and generate reports.

scripts.evaluation.cluster_eval.parse_arguments()

Parse command-line arguments and return the parsed arguments.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.evaluation.cluster_eval.is_word(token)

Checks whether a token is a valid word (not punctuation or too short).

Parameters:

token (str) – The token to check.

Returns:

True if the token is a valid word, False otherwise.

Return type:

bool
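The filter described for is_word could look roughly like the sketch below. The minimum length of 2 and the use of `string.punctuation` are assumptions; the actual thresholds used by the module are not documented here.

```python
import string

MIN_LENGTH = 2  # assumed threshold for "too short"


def is_word(token, min_length=MIN_LENGTH):
    """Return True for tokens that are long enough and not pure punctuation."""
    stripped = token.strip()
    if len(stripped) < min_length:
        return False
    # Reject tokens that consist solely of punctuation characters.
    return not all(ch in string.punctuation for ch in stripped)
```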

scripts.evaluation.cluster_eval.tokenize_text(labels, ground_truth)

Tokenizes and lowercases text fields from labels.

Parameters:
  • labels (List[Dict[str, str]] or Dict[str, tuple[str, str]]) – Labels to tokenize.

  • ground_truth (bool) – Whether the labels are ground truth data.

Returns:

Tokenized labels with IDs.

Return type:

List[Dict[str, Union[str, List[str]]]]

scripts.evaluation.cluster_eval.build_word_vectors(labels, ground_truth)

Builds a Word2Vec model from the tokenized labels.

Parameters:
  • labels (List[Dict[str, str]] or Dict[str, tuple[str, str]]) – Labels to build vectors from.

  • ground_truth (bool) – Whether the labels are ground truth data.

Returns:

A tuple containing the trained Word2Vec model and the tokenized labels.

Return type:

tuple

scripts.evaluation.cluster_eval.build_mean_label_vector(model, labels)

Computes the mean vector for each label using the Word2Vec model. Also tracks labels that have no valid tokens (and thus no vector).

Parameters:
  • model (gensim.models.Word2Vec) – The trained Word2Vec model.

  • labels (List[Dict[str, List[str]]]) – Tokenized labels with IDs.

Returns:

A tuple containing a dictionary of mean vectors and a list of skipped IDs.

Return type:

tuple
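The averaging step can be sketched without gensim by letting a plain dictionary stand in for the trained model's word vectors. The `"ID"` and `"tokens"` keys and the `vectors` argument are assumptions for illustration; the real function receives a gensim Word2Vec model.

```python
def build_mean_label_vector(vectors, labels):
    """Average the vectors of each label's known tokens.

    vectors: dict mapping token -> list of floats (stand-in for a Word2Vec model).
    labels:  list of {"ID": str, "tokens": list of str} (assumed layout).
    """
    mean_vectors = {}
    skipped_ids = []
    for label in labels:
        known = [vectors[t] for t in label["tokens"] if t in vectors]
        if not known:
            # No token has a vector, so the label cannot be embedded.
            skipped_ids.append(label["ID"])
            continue
        dim = len(known[0])
        mean_vectors[label["ID"]] = [
            sum(vec[i] for vec in known) / len(known) for i in range(dim)
        ]
    return mean_vectors, skipped_ids
```

Labels that end up in the skipped list are the ones plot_tsne later groups under the separate "No vector" cluster.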

scripts.evaluation.cluster_eval.load_json(path)

Loads the ground truth JSON file.

Parameters:

path (str) – Path to the JSON file.

Returns:

List of entries with “ID” and “text” fields.

Return type:

List[Dict[str, str]]

scripts.evaluation.cluster_eval.load_cluster_csv(path)

Loads cluster assignments from a CSV file.

Parameters:

path (str) – Path to the CSV file.

Returns:

Dictionary mapping label IDs to their cluster ID and transcript. Skips entries with missing “Transcript” or “Cluster_ID”.

Return type:

Dict[str, List[str]]
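The skipping rule can be illustrated with `csv.DictReader`. This sketch reads from an open file handle rather than a path so it runs without touching disk, and it assumes the label ID lives in a column named "ID"; only "Transcript" and "Cluster_ID" are named in the documentation above.

```python
import csv
import io


def load_cluster_rows(handle, id_column="ID"):
    """Map label IDs to [cluster_id, transcript], skipping incomplete rows."""
    clusters = {}
    for row in csv.DictReader(handle):
        cluster_id = (row.get("Cluster_ID") or "").strip()
        transcript = (row.get("Transcript") or "").strip()
        if not cluster_id or not transcript:
            continue  # missing "Cluster_ID" or "Transcript"
        clusters[row[id_column]] = [cluster_id, transcript]
    return clusters


# Made-up sample data: only the first row is complete.
sample = io.StringIO(
    "ID,Cluster_ID,Transcript\n"
    "label_1,0,Berlin Museum\n"
    "label_2,,missing cluster\n"
    "label_3,1,\n"
)
result = load_cluster_rows(sample)
```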

scripts.evaluation.cluster_eval.plot_tsne(label_vectors, clusters, out_path, verbose, skipped_ids)

Generates and saves a t-SNE scatter plot with cluster coloring and hover text. Also includes skipped labels (no vectors) as a separate “No vector” cluster.

Parameters:
  • label_vectors (Dict[str, np.ndarray]) – Dictionary of label IDs to their mean vectors.

  • clusters (Dict[str, List[str]]) – Dictionary mapping label IDs to their cluster ID and transcript.

  • out_path (str) – Path to save the t-SNE plot HTML file.

  • verbose (bool) – Whether to print verbose output.

  • skipped_ids (List[str]) – List of label IDs that had no valid tokens and thus no vector.

Returns:

The generated t-SNE plot.

Return type:

plotly.graph_objects.Figure

scripts.evaluation.cluster_eval.main(args)

Main entry point for clustering visualization. Loads data, trains embeddings, computes vectors, runs t-SNE, and saves the plot.

Parameters:

args (argparse.Namespace) – Parsed command-line arguments.

Returns:

None

scripts.evaluation.detection_eval.parse_arguments()

Parse command-line arguments and return the parsed arguments.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.evaluation.detection_eval.main()

Main function to evaluate IOU scores and generate visualizations.
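For readers unfamiliar with the metric, intersection over union (IoU) for two axis-aligned boxes can be computed as in the sketch below. The `(xmin, ymin, xmax, ymax)` box layout is an assumption; the format the script reads from its input files is not documented here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A predicted box that matches the ground truth exactly scores 1.0, and boxes that do not overlap score 0.0.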

scripts.evaluation.ocr_eval.parse_arguments()

Parse command-line arguments and return the parsed arguments.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.evaluation.ocr_eval.main()

Main function to evaluate OCR predictions and save results.

scripts.evaluation.redundancy.parse_arguments()

Parse command-line arguments and return the parsed arguments.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.evaluation.redundancy.main()

Main function to evaluate redundancy in a dataset and save results.

scripts.evaluation.rotation_eval.parse_arguments()

Parse command-line arguments and return the parsed arguments.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.evaluation.rotation_eval.load_images(input_image_dir)

Load images from the given directory and extract ground truth labels.

Parameters:

input_image_dir (str) – Directory containing images.

Returns:

A tuple of (loaded images as a NumPy array, ground truth labels as a NumPy array, list of filenames).

Return type:

tuple

scripts.evaluation.rotation_eval.rotate_image(img_path, angle)

Rotate the image by the given angle and save it back to the same path.

Parameters:
  • img_path (str) – Path to the image file.

  • angle (int) – Rotation angle index (0, 1, 2, 3 corresponding to 0, 90, 180, 270 degrees).

Return type:

None
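The angle index convention (0, 1, 2, 3 for 0, 90, 180, 270 degrees) can be illustrated on a small nested list standing in for an image. The counter-clockwise direction and both helper names are assumptions; the real function operates on an image file and its rotation direction is not documented here.

```python
ANGLES = (0, 90, 180, 270)


def angle_in_degrees(index):
    """Translate an angle index (0-3) into degrees."""
    if index not in range(4):
        raise ValueError(f"angle index must be 0, 1, 2 or 3, got {index!r}")
    return ANGLES[index]


def rotate_grid(grid, index):
    """Rotate a 2D list counter-clockwise by `index` quarter turns."""
    angle_in_degrees(index)  # validates the index
    for _ in range(index):
        # Transposing and then reversing the row order is a 90 degree
        # counter-clockwise turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid
```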

scripts.evaluation.rotation_eval.evaluate_rotation_model(input_image_dir, output_folder_path)

Load model, predict rotations, and evaluate performance.

Parameters:
  • input_image_dir (str) – Directory containing images.

  • output_folder_path (str) – Path to save evaluation results.

Return type:

None

scripts.evaluation.rotation_eval.main()

Main function to execute rotation model evaluation.

Post-processing Scripts

The postprocessing subpackage provides tools for result consolidation and processing:

Consolidate Pipeline Results Script

Creates a single JSON file that links all per-label results across the pipeline (detection → classification → rotation → OCR → post‑processing).

Supports both the traditional (TensorFlow-based) pipeline and the Gemini pipeline. Output is a flat list of per-label entries, each containing: source_image, label_filename, label_index, category, bounding-box coordinates, rotation_angle, and ocr (method, text, confidence).
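To make the output shape concrete, the sketch below builds one entry with the fields listed above and serializes a flat list of such entries. Every value is invented for illustration, and the key name and layout of the bounding-box field (`bbox` with `xmin`, `ymin`, `xmax`, `ymax`) are assumptions, since the text only says "bounding-box coordinates".

```python
import json

# One hypothetical per-label entry; all values are made up.
entry = {
    "source_image": "specimen_001.jpg",
    "label_filename": "specimen_001_label_1.jpg",
    "label_index": 1,
    "category": "printed",
    "bbox": {"xmin": 120, "ymin": 80, "xmax": 540, "ymax": 260},  # assumed layout
    "rotation_angle": 90,
    "ocr": {"method": "tesseract", "text": "Berlin Museum", "confidence": 0.93},
}

# The consolidated file is described as a flat list of such entries.
consolidated = [entry]
text = json.dumps(consolidated, indent=2)
```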

scripts.postprocessing.consolidate_results.parse_arguments()

Parse command-line arguments.

Return type:

argparse.Namespace

scripts.postprocessing.consolidate_results.consolidate_results(output_dir)

Auto-detect pipeline type and consolidate all results.

Parameters:

output_dir (str)

Return type:

List[Dict[str, Any]]

scripts.postprocessing.consolidate_results.main()

Main entry point.

scripts.postprocessing.process.parse_arguments()

Parse command-line arguments using argparse.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.postprocessing.process.process_ocr_output(ocr_output, outdir)

Process OCR output to identify Nuri labels, empty labels, and correct plausible labels.

Parameters:
  • ocr_output (str) – Path to the OCR output JSON file.

  • outdir (str) – Directory to save processed files.

Return type:

None

scripts.postprocessing.process.main()

Main function to parse arguments and execute OCR processing.

Processing Scripts

The processing subpackage contains core processing utilities:

scripts.processing.analysis.parse_arguments()

Parse command-line arguments using argparse.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.processing.analysis.validate_directories(input_dir, output_dir)

Validate that the input directory exists and create the output directory if needed.

Parameters:
  • input_dir (str) – Path to the input directory.

  • output_dir (str) – Path to the output directory.

Return type:

None
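A minimal sketch of this validation follows. Raising `FileNotFoundError` for a missing input directory is an assumption; the real function may report the problem differently, for example by exiting.

```python
import os


def validate_directories(input_dir, output_dir):
    """Require an existing input directory and create the output directory."""
    if not os.path.isdir(input_dir):
        raise FileNotFoundError(f"Input directory not found: {input_dir}")
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(output_dir, exist_ok=True)
```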

scripts.processing.analysis.main()

Main execution function. Parses command-line arguments, validates directories, processes images, and prints the execution duration.

scripts.processing.classifiers.parse_arguments()

Parse command-line arguments for the classification script.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.processing.classifiers.resolve_default_model_path(model_int)

Get the default model path based on model number using centralized configuration.

Parameters:

model_int (int) – Model number (1-3)

Returns:

Path to the default model

Return type:

Path

scripts.processing.classifiers.get_class_names_by_model(model_int)

Return default class names for the selected model number using centralized configuration.

Parameters:

model_int (int) – Model number (1-3)

Returns:

Class labels

Return type:

list[str]

scripts.processing.classifiers.load_class_names_from_file(path)

Load class names from a text file (one per line).

Parameters:

path (str) – Path to the class names file.

Returns:

List of class names.

Return type:

list[str]
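Reading a one-name-per-line file might look like the sketch below. Stripping whitespace, skipping blank lines, and assuming UTF-8 encoding are choices made for the example and may differ from the actual implementation.

```python
def load_class_names_from_file(path):
    """Read class names from a text file, one per line."""
    with open(path, encoding="utf-8") as handle:
        # Strip surrounding whitespace and drop blank lines (assumed behaviour).
        return [line.strip() for line in handle if line.strip()]
```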

scripts.processing.classifiers.main()

Main function to execute classification using a TensorFlow model.

Return type:

None

class scripts.processing.detection.OptimizedPredictLabel(path_to_model, classes, threshold=0.8, use_cache=True)

Bases: object

Optimized version of PredictLabel with caching and streamlined loading.

load_model_optimized()

Load model with optimized strategy.

Return type:

detecto.core.Model

class_prediction(jpg_path)

Predict labels for a given JPG file.

Parameters:

jpg_path (Path) – Path to the JPG file.

Returns:

Prediction results

Return type:

pd.DataFrame

scripts.processing.detection.parse_arguments()

Parse command-line arguments using argparse.

Returns:

Parsed command-line arguments.

Return type:

argparse.Namespace

scripts.processing.detection.clear_model_cache()

Clear all cached models.

scripts.processing.detection.setup_device(device_arg)

Setup optimal device for inference.

Parameters:

device_arg (str) – Device argument from command line

Returns:

Best available device

Return type:

str
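Device selection of this kind is often written as in the sketch below, which assumes a PyTorch backend, an "auto" sentinel value, and a CPU fallback when PyTorch is not installed. None of these details are confirmed by the documentation above.

```python
def setup_device(device_arg="auto"):
    """Return an explicit device string, or pick one when given "auto"."""
    if device_arg != "auto":
        return device_arg  # respect an explicit choice such as "cpu"
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```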

scripts.processing.detection.main()

Main execution function with performance optimizations.