label_postprocessing Package

The label_postprocessing package provides utilities for cleaning and structuring OCR results.

Package Contents

ocr_postprocessing

Modules

OCR Post-processing

label_postprocessing.ocr_postprocessing.count_mean_token_length(tokens)

Calculates the mean length of tokens in a list.

Parameters: tokens (list) – List of tokens.
Returns: Mean token length.
Return type: float
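
As an illustration of the calculation described above, here is a minimal sketch. It is not the package's source: the name mean_token_length is hypothetical, and returning 0.0 for an empty list is an assumption, since the behaviour for empty input is not documented here.

```python
def mean_token_length(tokens):
    """Return the mean length, in characters, of the tokens in a list."""
    if not tokens:
        # Assumption: an empty list yields 0.0 rather than raising an error.
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


print(mean_token_length(["ab", "abcd"]))  # → 3.0
```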

label_postprocessing.ocr_postprocessing.is_plausible_prediction(transcript)

Checks if a transcript is a plausible prediction based on the average token length.

Parameters: transcript (str) – Input transcript.
Returns: True if the transcript is plausible, False otherwise.
Return type: bool
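
A plausibility check of this kind could look roughly like the sketch below. The tokenization (splitting on whitespace), the threshold of 2.0 characters, and the names looks_plausible and min_mean_length are all assumptions made for illustration; the package's actual cut-off is not stated in this reference.

```python
def looks_plausible(transcript, min_mean_length=2.0):
    """Return True if the mean token length reaches a minimum value."""
    # Assumption: tokens are whitespace-separated.
    tokens = transcript.split()
    if not tokens:
        return False
    mean_length = sum(len(token) for token in tokens) / len(tokens)
    # Assumption: OCR noise tends to consist of very short tokens,
    # so a low mean length marks the transcript as implausible.
    return mean_length >= min_mean_length
```

With these assumed settings, a transcript such as "Coll. Berlin 1902" passes, while a string of isolated characters such as "a . i 1 |" does not.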

label_postprocessing.ocr_postprocessing.correct_transcript(transcript)

Corrects a transcript by removing non-ASCII characters, runs of multiple non-alphanumeric characters, the pipe character, and other special symbols (such as ° and ‘). Also removes any trailing periods.

Parameters: transcript (str) – Input transcript.
Returns: Corrected transcript.
Return type: str
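
The cleaning steps listed above can be sketched with the standard library as follows. This is an illustrative approximation, not the package's implementation: the name clean_transcript is hypothetical, "multiple non-alphanumeric characters" is read here as runs of two or more such characters, and the whitespace collapsing is an added assumption.

```python
import re


def clean_transcript(transcript):
    """Apply ASCII filtering, symbol removal, and trailing-period stripping."""
    # Drop every non-ASCII character (this also removes symbols such as °).
    text = transcript.encode("ascii", errors="ignore").decode("ascii")
    # Remove the pipe character.
    text = text.replace("|", "")
    # Assumption: a run of two or more non-alphanumeric, non-space
    # characters is noise and is removed entirely.
    text = re.sub(r"[^A-Za-z0-9\s]{2,}", "", text)
    # Assumption: collapse the whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove any trailing periods.
    return text.rstrip(".").rstrip()


print(clean_transcript("Berlin| 23° -- Coll. Smith."))  # → Berlin 23 Coll. Smith
```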

label_postprocessing.ocr_postprocessing.is_nuri(transcript)

Checks if a transcript starts with "http", which indicates a Nuri.

Parameters: transcript (str) – Input transcript.
Returns: True if the transcript is a Nuri, False otherwise.
Return type: bool

label_postprocessing.ocr_postprocessing.is_empty(transcript)

Checks if a transcript is empty.

Parameters: transcript (str) – Input transcript.
Returns: True if the transcript is empty, False otherwise.
Return type: bool
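
The Nuri and empty checks are both simple string predicates. The sketch below shows one way they might be written; the names starts_with_http and is_blank are hypothetical, and treating a whitespace-only transcript as empty is an assumption that this reference does not confirm.

```python
def starts_with_http(transcript):
    """Return True if the transcript begins with the prefix "http"."""
    return transcript.startswith("http")


def is_blank(transcript):
    """Return True if the transcript has no visible characters."""
    # Assumption: whitespace-only transcripts count as empty.
    return not transcript.strip()


print(starts_with_http("http://example.org/u/abc123"))  # → True
print(is_blank("   "))  # → True
```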

label_postprocessing.ocr_postprocessing.save_transcripts(transcripts, file_name)

Saves transcripts as a CSV file.

Parameters:

transcripts (dict) – Dictionary of transcripts.
file_name (str) – Name of the output CSV file.

Return type: None
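
For orientation, writing a dictionary of transcripts to CSV could be sketched as below. The structure of the dictionary (an identifier mapped to the transcript text), the column names id and transcript, and the function name write_transcripts_csv are assumptions; the actual column layout used by save_transcripts is not documented here.

```python
import csv


def write_transcripts_csv(transcripts, file_name):
    """Write one row per transcript to a CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(file_name, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Hypothetical header row.
        writer.writerow(["id", "transcript"])
        for key, text in transcripts.items():
            writer.writerow([key, text])
```

A call such as write_transcripts_csv({"img_001.jpg": "Berlin 23 Coll. Smith"}, "transcripts.csv") would then produce a header row followed by one data row.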

label_postprocessing.ocr_postprocessing.save_json(transcripts, file_name)

Saves transcripts as a JSON file.

Parameters:

transcripts (list) – List of transcripts.
file_name (str) – Name of the output JSON file.

Return type: None
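
A JSON writer for a list of transcripts might look like the following sketch. The record shape (dictionaries with ID and text keys), the indentation setting, and the name write_transcripts_json are illustrative assumptions rather than documented behaviour of save_json.

```python
import json


def write_transcripts_json(transcripts, file_name):
    """Serialize a list of transcripts to a JSON file."""
    with open(file_name, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps any non-ASCII text readable in the file.
        json.dump(transcripts, handle, indent=4, ensure_ascii=False)
```

For example, write_transcripts_json([{"ID": "img_001.jpg", "text": "Berlin 23"}], "transcripts.json") writes a one-element JSON array.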

label_postprocessing.ocr_postprocessing.process_ocr_output(ocr_output)

Processes OCR output, sorting the transcripts into Nuri, empty, plausible, and corrected categories and saving each category.

Parameters: ocr_output (str) – OCR output file path.
Return type: None
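
To show how the four categories could fit together, here is a self-contained sketch of the categorization step only. It works on an in-memory list instead of a file, and nearly everything in it is an assumption: the record keys ID and text, the order of the checks, the 2.0 threshold, the decision to drop implausible transcripts, and the helper names. The package's real control flow may differ.

```python
import re


def _clean(text):
    # Assumed cleaning rules: ASCII only, no pipes, no runs of symbols.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.replace("|", "")
    text = re.sub(r"[^A-Za-z0-9\s]{2,}", "", text)
    return re.sub(r"\s+", " ", text).strip().rstrip(".").rstrip()


def _plausible(text, min_mean_length=2.0):
    # Assumed rule: the mean whitespace-token length must reach a threshold.
    tokens = text.split()
    return bool(tokens) and sum(map(len, tokens)) / len(tokens) >= min_mean_length


def categorize(records):
    """Sort records into nuri, empty, plausible, and corrected buckets."""
    buckets = {"nuri": [], "empty": [], "plausible": [], "corrected": []}
    for record in records:
        text = record["text"]
        if text.startswith("http"):
            buckets["nuri"].append(record)
        elif not text.strip():
            buckets["empty"].append(record)
        elif _plausible(text):
            buckets["plausible"].append(record)
            # Assumption: only plausible transcripts are corrected.
            buckets["corrected"].append({**record, "text": _clean(text)})
        # Assumption: implausible transcripts are dropped.
    return buckets
```

Each bucket could then be handed to a CSV or JSON writer to produce one output file per category.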