label_postprocessing Package
The label_postprocessing package provides utilities for cleaning and structuring OCR results.
Package Contents
Modules
OCR Post-processing
- label_postprocessing.ocr_postprocessing.count_mean_token_length(tokens)
Calculates the mean length of tokens in a list.
- label_postprocessing.ocr_postprocessing.is_plausible_prediction(transcript)
Checks if a transcript is a plausible prediction based on the average token length.
- label_postprocessing.ocr_postprocessing.correct_transcript(transcript)
Corrects a transcript by removing non-ASCII characters, runs of multiple non-alphanumeric characters, the pipe character, and other special symbols (such as °, among others). Also removes any trailing periods.
- label_postprocessing.ocr_postprocessing.is_nuri(transcript)
Checks if a transcript starts with "http", indicating a Nuri.
- label_postprocessing.ocr_postprocessing.is_empty(transcript)
Checks if a transcript is empty.
- label_postprocessing.ocr_postprocessing.save_transcripts(transcripts, file_name)
Saves transcripts as a CSV file.
- label_postprocessing.ocr_postprocessing.save_json(transcripts, file_name)
Saves transcripts as a JSON file.
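The behaviour described above can be sketched in plain Python. This is an illustrative approximation, not the package's own code: the function names below are hypothetical stand-ins, the plausibility threshold (`min_mean_length=2.0`) is a guess, and the CSV and JSON helpers assume that `transcripts` is a list of dictionaries with `"id"` and `"text"` keys, which this page does not specify.

```python
import csv
import json
import re


def mean_token_length(tokens):
    """Return the mean length of the tokens, or 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


def looks_plausible(transcript, min_mean_length=2.0):
    """Assumed rule: plausible if whitespace-separated tokens are long enough on average."""
    return mean_token_length(transcript.split()) >= min_mean_length


def clean_transcript(transcript):
    """Approximate the corrections listed for correct_transcript."""
    # Drop non-ASCII characters (this also removes symbols such as the degree sign).
    text = transcript.encode("ascii", errors="ignore").decode("ascii")
    # Remove the pipe character.
    text = text.replace("|", "")
    # Remove runs of two or more non-alphanumeric, non-whitespace characters.
    text = re.sub(r"[^A-Za-z0-9\s]{2,}", "", text)
    # Collapse leftover whitespace, then strip trailing periods.
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(".").strip()


def starts_with_http(transcript):
    """Mirror the is_nuri check: does the transcript start with 'http'?"""
    return transcript.strip().startswith("http")


def is_blank(transcript):
    """Mirror the is_empty check: no content besides whitespace."""
    return not transcript.strip()


def write_csv(transcripts, file_name):
    """Write a list of {'id': ..., 'text': ...} dicts as CSV (assumed schema)."""
    with open(file_name, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "text"])
        writer.writeheader()
        writer.writerows(transcripts)


def write_json(transcripts, file_name):
    """Write the transcripts as indented JSON."""
    with open(file_name, "w", encoding="utf-8") as handle:
        json.dump(transcripts, handle, indent=2)
```

Under these assumptions, `clean_transcript("Berlin | 52° N...")` yields `"Berlin 52 N"`, and `looks_plausible("a b c")` is `False` because the mean token length is 1.0. A typical pipeline would clean each transcript, discard the blank or implausible ones, handle the "http" entries separately, and then write the remainder to CSV or JSON.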