label_postprocessing Package

The label_postprocessing package provides utilities for cleaning and structuring OCR results.

Package Contents

ocr_postprocessing

Modules

OCR Post-processing

label_postprocessing.ocr_postprocessing.count_mean_token_length(tokens)

Calculates the mean length of tokens in a list.

Parameters:

tokens (list) – List of tokens.

Returns:

Mean token length.

Return type:

float
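
As a rough illustration of the calculation described above, the sketch below averages the character lengths of the tokens. The name `mean_token_length_sketch` and the choice to return 0.0 for an empty list are assumptions for this example and are not taken from the package source.

```python
def mean_token_length_sketch(tokens):
    """Return the average character length of the tokens (hypothetical helper)."""
    if not tokens:
        # Assumption: an empty list yields 0.0 instead of raising an error.
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


# Two tokens of length 2 and 4 give a mean of 3.0.
print(mean_token_length_sketch(["ab", "abcd"]))
```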

label_postprocessing.ocr_postprocessing.is_plausible_prediction(transcript)

Checks if a transcript is a plausible prediction based on the average token length.

Parameters:

transcript (str) – Input transcript.

Returns:

True if the transcript is plausible, False otherwise.

Return type:

bool
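
One way such a check could work is to split the transcript on whitespace and compare the mean token length against a threshold. The sketch below assumes whitespace tokenization and a hypothetical `min_mean_length` of 2.0; the tokenizer and cutoff actually used by `is_plausible_prediction` are not stated in this reference.

```python
def is_plausible_sketch(transcript, min_mean_length=2.0):
    """Hypothetical plausibility check based on the mean token length."""
    tokens = transcript.split()
    if not tokens:
        # Assumption: a transcript without tokens is treated as implausible.
        return False
    mean_length = sum(len(token) for token in tokens) / len(tokens)
    # Very short average tokens often indicate OCR noise such as "a b c d".
    return mean_length >= min_mean_length


print(is_plausible_sketch("Berlin Museum"))
print(is_plausible_sketch("a b c d"))
```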

label_postprocessing.ocr_postprocessing.correct_transcript(transcript)

Corrects a transcript by removing non-ASCII characters, repeated non-alphanumeric characters, the pipe character, and other special symbols (such as °, ‘, etc.). Also removes any trailing periods.

Parameters:

transcript (str) – Input transcript.

Returns:

Corrected transcript.

Return type:

str
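
The cleaning steps listed above could be approximated with a few string and regular-expression operations. This is only a sketch: the function name `correct_transcript_sketch`, the order of the steps, and the choice to replace punctuation runs with a single space are assumptions, and the real `correct_transcript` may behave differently.

```python
import re


def correct_transcript_sketch(transcript):
    """Hypothetical cleanup that mirrors the documented corrections."""
    # Drop every character outside the ASCII range (this also removes ° and ‘).
    text = transcript.encode("ascii", errors="ignore").decode("ascii")
    # Remove the pipe character.
    text = text.replace("|", "")
    # Assumption: a run of two or more non-alphanumeric, non-space characters
    # is replaced by a single space.
    text = re.sub(r"[^A-Za-z0-9\s]{2,}", " ", text)
    # Collapse repeated whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Strip trailing periods.
    return text.rstrip(".").strip()


print(correct_transcript_sketch("Coll. | Müller 25° --- Berlin."))
```

Note that dropping non-ASCII characters also removes letters such as "ü", so "Müller" becomes "Mller" in this sketch.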

label_postprocessing.ocr_postprocessing.is_nuri(transcript)

Checks if a transcript starts with “http”, indicating a Nuri.

Parameters:

transcript (str) – Input transcript.

Returns:

True if the transcript is a Nuri, False otherwise.

Return type:

bool
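
A prefix test of this kind can be sketched in one line. The helper name `is_nuri_sketch`, the example URL, and the decision to ignore leading whitespace are assumptions for illustration only.

```python
def is_nuri_sketch(transcript):
    """Hypothetical prefix check for transcripts that look like a Nuri."""
    # Assumption: leading whitespace is ignored and the match is case-sensitive.
    return transcript.lstrip().startswith("http")


# The URL below is a made-up placeholder, not a real identifier.
print(is_nuri_sketch("http://example.org/id/123"))
print(is_nuri_sketch("Berlin 1901"))
```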

label_postprocessing.ocr_postprocessing.is_empty(transcript)

Checks if a transcript is empty.

Parameters:

transcript (str) – Input transcript.

Returns:

True if the transcript is empty, False otherwise.

Return type:

bool
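
What counts as "empty" is not specified here. The sketch below assumes that a transcript containing only whitespace is also empty; `is_empty_sketch` is a hypothetical name.

```python
def is_empty_sketch(transcript):
    """Hypothetical emptiness check for a transcript string."""
    # Assumption: whitespace-only strings count as empty.
    return not transcript.strip()


print(is_empty_sketch("   "))
print(is_empty_sketch("Berlin"))
```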

label_postprocessing.ocr_postprocessing.save_transcripts(transcripts, file_name)

Saves transcripts as a CSV file.

Parameters:
  • transcripts (dict) – Dictionary of transcripts.

  • file_name (str) – Name of the output CSV file.

Return type:

None
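
Writing a dictionary of transcripts to CSV can be done with the standard `csv` module. The sketch below assumes the dictionary maps an identifier to its transcript text and uses the hypothetical column names `ID` and `text`; the layout produced by `save_transcripts` itself is not documented here.

```python
import csv


def save_transcripts_sketch(transcripts, file_name):
    """Hypothetical CSV writer: one row per identifier/transcript pair."""
    # newline="" lets the csv module control line endings on every platform.
    with open(file_name, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Assumption: a header row with these two column names.
        writer.writerow(["ID", "text"])
        for identifier, text in transcripts.items():
            writer.writerow([identifier, text])
```

A call such as `save_transcripts_sketch({"img1.jpg": "Berlin"}, "out.csv")` would then produce a header row followed by one data row.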

label_postprocessing.ocr_postprocessing.save_json(transcripts, file_name)

Saves transcripts as a JSON file.

Parameters:
  • transcripts (list) – List of transcripts.

  • file_name (str) – Name of the output JSON file.

Return type:

None
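
A list of transcripts can be serialized with the standard `json` module. In this sketch the indentation of 4 spaces and `ensure_ascii=False` are assumptions chosen for readability, and `save_json_sketch` is a hypothetical name.

```python
import json


def save_json_sketch(transcripts, file_name):
    """Hypothetical JSON writer for a list of transcripts."""
    with open(file_name, "w", encoding="utf-8") as handle:
        # Assumption: pretty-printed output that keeps non-ASCII characters readable.
        json.dump(transcripts, handle, indent=4, ensure_ascii=False)
```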

label_postprocessing.ocr_postprocessing.process_ocr_output(ocr_output)

Processes OCR output, categorizing transcripts as Nuri, empty, plausible, or corrected, and saving each category.

Parameters:

ocr_output (str) – OCR output file path.

Return type:

None
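
The categorization step might look roughly like the sketch below, which works on an in-memory list instead of a file. Everything specific here is an assumption: the entry layout (dictionaries with `ID` and `text` keys), the order of the checks, the 2.0 threshold, the simplified cleanup, and the decision to skip implausible transcripts. It is not the package's implementation.

```python
import re


def categorize_sketch(entries, min_mean_length=2.0):
    """Hypothetical sorting of OCR entries into the four documented categories."""
    buckets = {"nuri": [], "empty": [], "plausible": [], "corrected": []}
    for entry in entries:
        text = entry["text"]
        tokens = text.split()
        if text.startswith("http"):
            buckets["nuri"].append(entry)
        elif not tokens:
            buckets["empty"].append(entry)
        elif sum(len(t) for t in tokens) / len(tokens) >= min_mean_length:
            buckets["plausible"].append(entry)
            # Simplified cleanup: drop non-ASCII characters and pipes,
            # collapse whitespace, and strip trailing periods.
            cleaned = text.encode("ascii", errors="ignore").decode("ascii")
            cleaned = re.sub(r"\s+", " ", cleaned.replace("|", "")).strip()
            cleaned = cleaned.rstrip(".")
            buckets["corrected"].append({"ID": entry["ID"], "text": cleaned})
        # Assumption: implausible transcripts are skipped in this sketch.
    return buckets


sample = [
    {"ID": "a", "text": "http://example.org/1"},  # made-up placeholder URL
    {"ID": "b", "text": "  "},
    {"ID": "c", "text": "Berlin | 1901."},
    {"ID": "d", "text": "x y z"},
]
result = categorize_sketch(sample)
print({name: len(items) for name, items in result.items()})
```

Each bucket could then be written out with a CSV or JSON writer, one file per category.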