When pipelines fight invisible encoding drift
Global products routinely extract Chinese rulings, engineering glyphs, and emoji footnotes, then watch Linux CI pass while Windows Notepad shows mojibake because nobody declared a charset. UTF-8, with or without a BOM depending on what picky APIs expect, is the modern lowest common denominator, yet legacy CSV consumers can still surprise you. PDFs may hide private-use glyphs or compatibility forms that need remapping or normalization (NFKC) before joins, unless security requirements forbid altering text whose hashes must stay stable. Ai2Done locks output to UTF-8 while exposing progress; pilot it on pages saturated with rare scripts and check hex views for bursts of U+FFFD, the replacement character. Write encoding contracts into integration docs, set the HTTP Content-Type charset honestly, and forbid third-party tools from transcoding zipped archives behind your back.
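As a rough illustration of the checks described above, the following Python sketch audits decoded text for U+FFFD and private-use code points, and encodes it as UTF-8 with an optional BOM and optional NFKC normalization. The function names (`audit_text`, `to_utf8`, `content_type`) are hypothetical helpers invented for this example, not part of Ai2Done or any other tool; only the standard library is used.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD, emitted when a decoder could not map bytes


def audit_text(text: str) -> dict:
    """Count suspicious code points in already-decoded text (hypothetical helper)."""
    return {
        "replacement_chars": text.count(REPLACEMENT),
        # General category "Co" marks private-use code points.
        "private_use": sum(1 for ch in text if unicodedata.category(ch) == "Co"),
    }


def to_utf8(text: str, *, bom: bool = False, normalize: bool = False) -> bytes:
    """Encode text as UTF-8, optionally NFKC-normalized and BOM-prefixed."""
    if normalize:
        # NFKC folds compatibility forms (ligatures, fullwidth letters).
        # It changes the bytes, so skip it when hashes must stay stable.
        text = unicodedata.normalize("NFKC", text)
    # The "utf-8-sig" codec prepends the three-byte BOM when encoding.
    return text.encode("utf-8-sig" if bom else "utf-8")


def content_type(media_type: str = "text/csv") -> str:
    """Build a Content-Type header value that declares the charset explicitly."""
    return f"{media_type}; charset=utf-8"
```

Note that NFKC leaves private-use code points untouched, so glyphs flagged under `private_use` still need an explicit mapping agreed with whoever produced the source document.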
UTF-8 exports in three steps
- Survey downstream BOM/newline requirements.
- Export UTF-8 pilot pages heavy on symbols.
- Validate via Unicode tooling; publish charset contracts for partners.
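
The three steps above could be backed by a small check run against each pilot export. The sketch below is one possible shape, assuming the export is available as raw bytes; `check_export` and its report keys are hypothetical names chosen for illustration, and the result is meant to be compared against whatever BOM and newline requirements the downstream survey turned up.

```python
import codecs


def check_export(data: bytes) -> dict:
    """Report BOM presence, UTF-8 validity, newline style, and U+FFFD count."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        text = body.decode("utf-8")  # strict mode: invalid bytes raise
        valid = True
    except UnicodeDecodeError:
        text = ""
        valid = False

    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf  # bare LF not preceded by CR
    if crlf and lf:
        newline = "mixed"
    elif crlf:
        newline = "crlf"
    elif lf:
        newline = "lf"
    else:
        newline = "none"

    return {
        "valid_utf8": valid,
        "has_bom": has_bom,
        "newline": newline,
        "replacement_chars": text.count("\ufffd"),
    }
```

For example, `check_export(open("pilot.csv", "rb").read())` (with `pilot.csv` standing in for a real pilot file) returns a dictionary that can be asserted against the published charset contract in CI.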