Archival scanner TIFFs: bleed-through, skew, and tables
Route `scan-tiff-ocr` (tiff_to_text.scan) handles scanner TIFFs affected by skew, gutter shadows, or bleed-through. Before OCR, decide whether the page needs deskewing or binarization. For duplex scans, keep front/back page indices explicit. When printed tables are mixed with handwriting, zone them separately and expect handwritten rows to need manual entry. Archival workflows should store both the pristine TIFF and the transcript for auditability.
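To make the binarization decision more concrete, the sketch below computes a global threshold from a 256-bin grayscale histogram using Otsu's method. This is a standard technique swapped in for illustration, not something the route is known to use. The function name `otsu_threshold` and the histogram input are assumptions.

```python
from typing import Sequence


def otsu_threshold(hist: Sequence[int]) -> int:
    """Return the gray level that best separates two pixel populations.

    `hist[i]` is the number of pixels with gray level `i` (hypothetical input;
    how the histogram is obtained from the TIFF is left open here).
    """
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram is empty")

    weighted_total = sum(level * count for level, count in enumerate(hist))
    weight_low = 0      # pixel count at or below the candidate level
    sum_low = 0         # sum of gray levels at or below the candidate level
    best_level = 0
    best_variance = -1.0

    for level, count in enumerate(hist):
        weight_low += count
        sum_low += level * count
        if weight_low == 0:
            continue                      # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break                         # every pixel is in the low class
        mean_low = sum_low / weight_low
        mean_high = (weighted_total - sum_low) / weight_high
        # Between-class variance (up to a constant factor).
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_level = level

    return best_level
```

Pixels above the returned level would be treated as paper and the rest as ink. A single global threshold may not be enough for heavy bleed-through or gutter shadows, so treat the result as a starting point for review, not a final setting.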
TIFF OCR tips for scanned archives
- Inside `scan-tiff-ocr`, fix orientation and skew first, then run recognition in the reading order you expect archivists to follow.
- OCR table headers, body rows, and footnotes in separate crops to reduce column merge errors.
- File transcripts alongside thumbnails, DPI metadata, and operator IDs in the records system.
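The tip about OCRing table headers, body rows, and footnotes in separate crops can be sketched as a helper that turns zone boundaries into crop boxes. The function name, its parameters, and the pixel coordinates are hypothetical; boxes use the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, although the sketch itself does not depend on any imaging library.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def table_zone_boxes(left: int, right: int, top: int,
                     header_bottom: int, body_bottom: int,
                     bottom: int) -> Dict[str, Box]:
    """Split one table region into header, body, and footnote crop boxes.

    All boundaries are assumed to come from an operator or a layout step
    that is outside this sketch.
    """
    if left >= right:
        raise ValueError("left must be smaller than right")
    if not (top < header_bottom < body_bottom < bottom):
        raise ValueError("zone boundaries must increase from top to bottom")

    return {
        "header": (left, top, right, header_bottom),
        "body": (left, header_bottom, right, body_bottom),
        "footnotes": (left, body_bottom, right, bottom),
    }
```

Each box can then be cropped and recognized on its own, so a wide header cell is less likely to be merged into the body columns.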
TIFF-to-text FAQ (scan)
How can teams reduce column drift when OCRing tables in `scan-tiff-ocr`?
Crop header, body, and totals separately; reduce shadows; manually verify currency columns.
Table columns collapse—what should change first?
Deskew and reduce shadows first, then crop header, body, and totals into separate zones before OCR.
Duplex scans flip page order. How can this be prevented?
Record physical front/back in the control sheet; embed volume-page codes in export names.
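A minimal sketch of embedding volume and page codes in export names follows. It assumes the scanner emits images interleaved as front, back, front, back for consecutive sheets; scanners that deliver all fronts and then all backs would need a different mapping. The function names and the file-name pattern are hypothetical.

```python
from typing import Tuple


def duplex_position(scan_index: int) -> Tuple[int, str]:
    """Map a zero-based scan sequence index to (sheet number, side).

    Assumes interleaved output: index 0 is sheet 1 front, index 1 is
    sheet 1 back, index 2 is sheet 2 front, and so on.
    """
    if scan_index < 0:
        raise ValueError("scan_index must be non-negative")
    sheet = scan_index // 2 + 1
    side = "front" if scan_index % 2 == 0 else "back"
    return sheet, side


def export_name(volume: int, scan_index: int) -> str:
    """Build a transcript file name that carries volume, sheet, and side."""
    sheet, side = duplex_position(scan_index)
    return f"vol{volume:03d}_sheet{sheet:04d}_{side}.txt"
```

With this pattern, `export_name(3, 0)` gives `vol003_sheet0001_front.txt`, and sorting the names alphabetically keeps each front ahead of its back because `back` sorts before `front` only within the same sheet code; if strict front-first ordering matters, a side code such as `a`/`b` could be used instead.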
What evidence do auditors expect?
The pristine TIFF, the transcript, the operator ID, and a timestamp. Encrypt them when policy demands it.
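One possible shape for such an audit record is sketched below, using only the Python standard library. The field names and the `audit_record` function are assumptions, and the SHA-256 digests are an added illustration of how the TIFF and transcript could be tied together; encryption is left to whatever the local policy prescribes.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def audit_record(tiff_bytes: bytes, transcript: str, operator_id: str,
                 timestamp: Optional[str] = None) -> Dict[str, str]:
    """Describe one scan and its transcript in an auditable record.

    The digests let a reviewer confirm that the stored TIFF and transcript
    are the ones the record refers to.
    """
    if timestamp is None:
        # ISO 8601 in UTC, generated at record creation time.
        timestamp = datetime.now(timezone.utc).isoformat()
    return {
        "tiff_sha256": hashlib.sha256(tiff_bytes).hexdigest(),
        "transcript_sha256": hashlib.sha256(
            transcript.encode("utf-8")).hexdigest(),
        "operator_id": operator_id,
        "timestamp": timestamp,
    }


def to_json(record: Dict[str, str]) -> str:
    """Serialize the record with stable key order for storage or diffing."""
    return json.dumps(record, sort_keys=True)
```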
Handwritten margin notes mix with print. Can one OCR pass handle both?
No. Zone handwriting separately and expect manual capture for critical ink annotations.