We had to ETL .csv data that must have originated in SQLServer.
The UTF-16 fact about Windows was apparently unknown to my predecessor, who wrote some nasty C binary to copy the data, knock the upper byte off of each character as it was read, and save the now-ASCII text to a new file for the MySQL load.
The encoding='utf-16' argument was all that was needed.
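Roughly what the replacement looked like in Python (a minimal sketch; the file names are made up, and I'm assuming the usual UTF-16 text export from SQL Server):

    import csv

    # Tell open() the real encoding instead of stripping bytes in C,
    # then re-emit plain UTF-8 rows for the MySQL load.
    with open("sqlserver_export.csv", encoding="utf-16", newline="") as src, \
         open("mysql_load.csv", "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row)  # UTF-8 output, ready for LOAD DATA INFILE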
I've had to fix this before. A co-worker handling data from a third-party supplier had gone "Oh, this input data is mangled with stray zero bytes, I'll fix that", which of course destroys any non-ASCII input. Eventually I was told that the import sometimes failed. I investigated, realised the "mangled" input was just UTF-16 encoded, conditionally removed the "strip zero bytes" hack, told the decoder it was UTF-16, and it just worked.
The "maybe strip null bytes" code lived for years "just in case" after I fixed that because people couldn't believe that's all that was ever "wrong" with the data.
For want of a nail...