label_postprocessing.ocr_postprocessing
Functions
|
Performs corrections on a transcript, removing non-ASCII characters, multiple non-alphanumeric characters, the pipe character, and other special symbols (such as ° and '). |
|
Calculates the mean length of tokens in a list. |
|
Checks if a transcript is empty. |
|
Checks if a transcript starts with "http", indicating a Nuri. |
|
Checks if a transcript is a plausible prediction based on the average token length. |
|
Processes OCR output, sorting transcripts into Nuri, empty, plausible, and corrected categories and saving each category. |
|
Saves transcripts as a JSON file. |
|
Saves transcripts as a CSV file. |
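
The checks summarized above can be illustrated with a minimal sketch. The function names in this listing were lost, so every name below (`correct_transcript`, `mean_token_length`, `is_empty`, `is_nuri`, `is_plausible`, `categorize`), the cutoff `MIN_MEAN_TOKEN_LENGTH`, the reading of "multiple non-alphanumeric characters" as runs of two or more, and the `"other"` bucket are assumptions made for illustration, not the module's actual API.

```python
import re

# Hypothetical cutoff: the real module's threshold is not stated in this listing.
MIN_MEAN_TOKEN_LENGTH = 2.0


def correct_transcript(text):
    """Drop non-ASCII characters, pipes, and runs of non-alphanumerics."""
    text = text.encode("ascii", "ignore").decode("ascii")  # strips e.g. the degree sign
    text = text.replace("|", " ")
    # Assumption: "multiple" means two or more consecutive symbols.
    text = re.sub(r"[^A-Za-z0-9\s]{2,}", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def mean_token_length(tokens):
    """Mean length of the tokens in a list; 0.0 for an empty list."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def is_empty(transcript):
    """True if the transcript contains only whitespace."""
    return not transcript.strip()


def is_nuri(transcript):
    """True if the transcript starts with "http"."""
    return transcript.strip().startswith("http")


def is_plausible(transcript, min_mean=MIN_MEAN_TOKEN_LENGTH):
    """True if the average token length reaches the assumed cutoff."""
    return mean_token_length(transcript.split()) >= min_mean


def categorize(transcripts):
    """Sort transcripts into buckets; the check order is an assumption."""
    buckets = {"nuri": [], "empty": [], "plausible": [], "other": []}
    for text in transcripts:
        if is_empty(text):
            buckets["empty"].append(text)
        elif is_nuri(text):
            buckets["nuri"].append(text)
        elif is_plausible(text):
            buckets["plausible"].append(text)
        else:
            buckets["other"].append(text)
    return buckets
```

In this sketch a transcript such as `"a b c"` lands in the `"other"` bucket because its mean token length is below the assumed cutoff. Where the real module applies its correction step relative to these checks is not stated here.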