Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Why do you think I wanted to add hashes and encryption at the block level? :)

I’ve had to do similar things in the past and it’s a great side-feature of the format. It’s a horrible feeling when you find a corrupted FASTQ file that was compressed with normal gzip. At least with bgzip corrupted files, you can find and start recovery from the next block.



Even if it doesn't use block-based compression, if there isn't a huge range of corrupted bytes, corruption offsets are usually identifiable, as you will quickly end up with invalid length-distance pairs and similar errors. Although, errors might be reported a few bytes after the actual corruption.

I was motivated some years ago to try recovering from these errors [1] when I was handling a DEFLATE compressed JSON file, where there seemed to be a single corrupted byte every dozen or so bytes in the stream. It looked like something you could recover from. If you output decompressed bytes as the stream was parsed, you could clearly see a prefix of the original JSON being recovered up to the first corruption.

In that case the decompressed payload was plaintext, but even with a binary format, something like kaitai-struct might give you an invalid offset to work from.

For these localized corruptions, it's possible to just bruteforce one or two bytes along this range, and reliably fix the DEFLATE stream. Not really doable once we are talking about a sequence of four or more corrupted bytes.

[1]: https://github.com/nevesnunes/deflate-frolicking




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: