label_postprocessing.ocr_postprocessing

Functions

correct_transcript(transcript)

Cleans a transcript by removing non-ASCII characters, runs of multiple non-alphanumeric characters, the pipe character (|), and other special symbols (such as ° and ').
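
The cleaning steps can be illustrated with a minimal sketch. It is not the module's actual implementation: the name correct_transcript_sketch, the exact regular expression, and the final whitespace normalisation are assumptions made for illustration.

```python
import re


def correct_transcript_sketch(transcript: str) -> str:
    """Hypothetical stand-in for correct_transcript (assumed behaviour)."""
    # Drop every non-ASCII character; this also removes symbols such as the degree sign.
    text = transcript.encode("ascii", errors="ignore").decode("ascii")
    # Remove the pipe character.
    text = text.replace("|", "")
    # Assumption: "multiple non-alphanumeric characters" means a run of two or
    # more characters that are neither alphanumeric nor whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]{2,}", "", text)
    # Assumption: collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(correct_transcript_sketch("Berlin | 52° N --- Coll. Müller"))
# prints: Berlin 52 N Coll. Mller
```

Note that dropping non-ASCII characters outright also strips letters such as "ü", which is why "Müller" becomes "Mller" in this sketch.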

count_mean_token_length(tokens)

Calculates the mean length of tokens in a list.
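
As a rough sketch of the calculation (the function name and the handling of an empty list are assumptions, not documented behaviour):

```python
def mean_token_length_sketch(tokens):
    """Hypothetical stand-in for count_mean_token_length."""
    # Assumption: an empty token list yields 0.0 instead of raising an error.
    if not tokens:
        return 0.0
    # Sum of the individual token lengths divided by the number of tokens.
    return sum(len(token) for token in tokens) / len(tokens)


print(mean_token_length_sketch(["ab", "abcd"]))  # prints: 3.0
```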

is_empty(transcript)

Checks whether a transcript is empty.
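
One plausible reading of this check, shown as a sketch. Treating whitespace-only transcripts as empty is an assumption; the actual function may only test for a zero-length string.

```python
def is_empty_sketch(transcript: str) -> bool:
    """Hypothetical stand-in for is_empty."""
    # Assumption: a transcript consisting only of whitespace counts as empty.
    return not transcript.strip()


print(is_empty_sketch("   "))     # prints: True
print(is_empty_sketch("Berlin"))  # prints: False
```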

is_nuri(transcript)

Checks whether a transcript starts with "http", which indicates a Nuri.
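
The prefix test described here could look like the following sketch (the function name is hypothetical, and a case-sensitive comparison is assumed):

```python
def is_nuri_sketch(transcript: str) -> bool:
    """Hypothetical stand-in for is_nuri."""
    # Assumption: the comparison is case-sensitive and applied to the raw string.
    return transcript.startswith("http")


print(is_nuri_sketch("http://example.org/item/1"))  # prints: True
print(is_nuri_sketch("Coll. 1901"))                 # prints: False
```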

is_plausible_prediction(transcript)

Checks whether a transcript is a plausible prediction, based on its average token length.
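
A sketch of how such a plausibility check might work. The threshold value, the min_mean_length parameter, and whitespace tokenisation are all assumptions; the documentation does not state them.

```python
def is_plausible_sketch(transcript: str, min_mean_length: float = 2.0) -> bool:
    """Hypothetical stand-in for is_plausible_prediction."""
    # Assumption: tokens are obtained by splitting on whitespace.
    tokens = transcript.split()
    if not tokens:
        return False
    mean_length = sum(len(token) for token in tokens) / len(tokens)
    # Assumption: a transcript is plausible when the mean token length
    # reaches the (hypothetical) threshold.
    return mean_length >= min_mean_length


print(is_plausible_sketch("Museum Berlin"))  # prints: True
print(is_plausible_sketch("a b c"))          # prints: False
```

The idea behind a rule like this is that OCR noise tends to produce many isolated one-character tokens, which pulls the mean token length down.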

process_ocr_output(ocr_output)

Processes OCR output by sorting transcripts into Nuri, empty, plausible, and corrected categories and saving each category.
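
The categorization step might be sketched as below. Several things are assumed: that ocr_output is a list of dictionaries with "ID" and "text" keys, the order of the checks, the threshold, and that plausible transcripts are also stored in cleaned form under "corrected". Saving to disk is left out of the sketch.

```python
import re


def categorize_sketch(ocr_output, min_mean_length=2.0):
    """Hypothetical stand-in for the sorting part of process_ocr_output."""
    categories = {"nuri": [], "empty": [], "plausible": [], "corrected": []}
    for entry in ocr_output:
        text = entry["text"]  # assumed key name
        if text.startswith("http"):
            categories["nuri"].append(entry)
        elif not text.strip():
            categories["empty"].append(entry)
        else:
            tokens = text.split()
            mean_length = sum(len(t) for t in tokens) / len(tokens)
            if mean_length >= min_mean_length:
                categories["plausible"].append(entry)
                # Simplified cleaning: drop non-ASCII characters and pipes.
                cleaned = re.sub(r"[^\x00-\x7F]|\|", "", text)
                categories["corrected"].append({**entry, "text": cleaned})
    return categories


sample = [
    {"ID": "1", "text": "http://example.org/1"},
    {"ID": "2", "text": "  "},
    {"ID": "3", "text": "Berlin | 1901"},
    {"ID": "4", "text": "a b c"},
]
result = categorize_sketch(sample)
print([entry["ID"] for entry in result["plausible"]])  # prints: ['3']
```

In this sketch, entries that fail the plausibility check (such as ID "4") are not placed in any category; whether the real function discards them is not stated.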

save_json(transcripts, file_name)

Saves transcripts as a JSON file.
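
A minimal sketch of writing transcripts to JSON with the standard library. The function name, the UTF-8 encoding, and the indentation are assumptions, as is the shape of the transcripts argument.

```python
import json


def save_json_sketch(transcripts, file_name):
    """Hypothetical stand-in for save_json."""
    # Assumption: transcripts is any JSON-serialisable object,
    # for example a list of {"ID": ..., "text": ...} dictionaries.
    with open(file_name, "w", encoding="utf-8") as handle:
        json.dump(transcripts, handle, ensure_ascii=False, indent=4)
```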

save_transcripts(transcripts, file_name)

Saves transcripts as a CSV file.
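
A comparable sketch for CSV output, using the standard csv module. The column names "ID" and "text" and the fieldnames parameter are hypothetical; the actual column layout is not documented here.

```python
import csv


def save_transcripts_sketch(transcripts, file_name, fieldnames=("ID", "text")):
    """Hypothetical stand-in for save_transcripts."""
    # newline="" is what the csv module recommends when opening files for writing.
    with open(file_name, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fieldnames))
        writer.writeheader()
        # Assumption: each transcript is a dict whose keys match fieldnames.
        writer.writerows(transcripts)
```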