PDF/A-3, PDF for Long-Term Preservation, Use of ISO 32000-1... (2020) (loc.gov)
57 points by gabrielsroka on March 21, 2023 | 19 comments


Full title is "PDF/A-3, PDF for Long-term Preservation, Use of ISO 32000-1, With Embedded Files"

> the new publishing framework, known as "V3", for RFCs from the IETF (Internet Engineering Task Force). V3 uses an XML document as the master format from which plain text, HTML, and PDF versions are derived. The PDF is a PDF/A-3u document with the XML master embedded. The first RFC published in the new format was RFC 8650 [0], published in November 2019. For more background on this choice, see RFC 7995: PDF Format for RFCs (December 2016) [1] and additional Useful References below.

[NOTE, I changed the links below slightly to point to the actual new HTML format]

[0] https://www.rfc-editor.org/rfc/rfc8650.html

[1] https://www.rfc-editor.org/rfc/rfc7995.html


background info if useful:

PDF/A is a specification that limits what features of PDF are allowed. The purpose is to not allow features that may be problematic for archiving.

Initially PDF/A was really strict and prohibited things like transparency (since it affected reproducibility when printing), embedded files, etc.

Then people requested less restricted versions to allow more archiving use cases.

But even the newer, less restrictive versions have a better-defined and more verifiable specification than the main PDF specification.


The big restriction is that the classic PostScript typefaces (Times, Helvetica, Zapf Dingbats) can no longer be assumed to be available in the viewer: the PDF file must bundle any fonts it uses.

The pdfsizeopt package will make any PDF smaller, and I think it deletes letters/characters from the included font that are not used.

https://github.com/pts/pdfsizeopt


Reading the descriptions of A-3 and A-4, to me it sounds like PDF/A jumped the shark and for archival purposes the old A-2 might still be the best variant.

In general, embedding files in PDF is a kinda neat capability, like the example of having a (CSV) dataset embedded in a report or something like that. But at the same time I get the feeling that it's an indication of a general shortcoming of our file handling that it makes sense to use PDF as a container format. ZIP files and such are pretty crude formats for higher-level file bundles and the UX falls short too.


I've often thought that the .a and .lib libraries for object code should be replaced with .zip files.

(What's even dumber is the .a and .lib come in multiple pointlessly different formats.)
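
To make the .zip idea concrete, here is a rough Python sketch using only the standard `zipfile` module. The helper names `pack_objects` and `read_member` and the member names are made up for illustration, and the "object files" are placeholder bytes. It only shows the container side; a real static library also carries a symbol index for the linker, which this sketch does not attempt.

```python
import io
import zipfile


def pack_objects(members):
    """Pack a {name: bytes} mapping into an in-memory ZIP, stored uncompressed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def list_members(archive):
    """Return member names in the order they were added."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.namelist()


def read_member(archive, name):
    """Fetch one member by name via the ZIP central directory."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


# Hypothetical usage with placeholder payloads standing in for object code.
archive = pack_objects({"foo.o": b"fake-object-1", "bar.o": b"fake-object-2"})
```

Storing members uncompressed keeps each one as a contiguous byte range in the file, which is roughly the property an `ar`-style archive gives you.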


Preserving PDFs for future generations is like preserving radioactive waste for them. It's inevitable they'll end up with lots, and they won't thank us for it, but we should at least try to contain the mess.


I love the idea of the Martians having invaded Earth and wiped out humanity, but when they opened their first PDF file, all their files got crypto-locked. A modern version of War of the Worlds.


Love it. Some kind of polymorphic AI-enabled malware that can adapt to any system that analyzes it.

Alternative scenario: the malware wiped out humanity millennia ago, but is lying dormant in PDFs, just waiting for the aliens to try to interpret these digital relics of a long-dead civilization.


I'm not really getting it. Aren't RFCs written in a straightforward Wiki syntax? Then why would they be preserved using PDF? And how is XML the source format, or how would it be considered useful as the canonical or authoring format, when the existence of thousands of RFCs in plain text/light Wiki syntax clearly says otherwise?


I think the FAQ I linked to below addresses some of these https://www.rfc-editor.org/rse/format-faq/


if only there were an open source and easy to use pdf library with pdf/a support :/


It looks like the IETF has some tools for this. A quick search revealed https://github.com/ietf-tools/ietf-at


The mentioned RFC PDFs are generated with WeasyPrint, which apparently gained PDF/A support last year. https://www.courtbouillon.org/blog/00028-weasyprint-56


Embedding the input formatting directives is neat!


I usually save, along with the PDF file, a .txt version of the contents. Of course, that doesn't include the images, but at least the text is there, and as a bonus it makes the file greppable.
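
For what it's worth, a small Python sketch of how the sidecar files could be searched. It assumes the convention is `name.pdf` next to `name.txt` in the same folder; the function names `grep_sidecars` and `demo` are invented for this example, and the demo files are placeholders rather than real PDFs.

```python
import re
import tempfile
from pathlib import Path


def grep_sidecars(folder, pattern):
    """Return names of PDFs in `folder` whose sibling .txt file matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        sidecar = pdf.with_suffix(".txt")
        if not sidecar.exists():
            continue  # no text version was saved for this PDF
        text = sidecar.read_text(encoding="utf-8", errors="replace")
        if regex.search(text):
            hits.append(pdf.name)
    return hits


def demo():
    """Build a throwaway folder with two PDF/.txt pairs and search it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "report.pdf").write_bytes(b"%PDF-1.7\n")  # placeholder, not a real PDF
        (root / "report.txt").write_text("Notes on PDF/A-3 embedded files", encoding="utf-8")
        (root / "other.pdf").write_bytes(b"%PDF-1.7\n")  # placeholder, not a real PDF
        (root / "other.txt").write_text("Unrelated text", encoding="utf-8")
        return grep_sidecars(root, r"pdf/a-3")
```

The search never opens the PDFs themselves, so it works even for files a text extractor would choke on, as long as the .txt was saved at the time.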



[flagged]


> very portable C implementations

Did you mean "very portable exploitable implementations"?

Sorry, but claiming PDF is stable is absurd to say the least. Mobile phones, smartphones, and gaming consoles were usually exploited through their PDF parsers before pdf.js got embedded in web browsers.

Windows' biggest attack surface is still Outlook and PDF files.

So I'd argue that PDF has too large an attack surface, which must be reduced for better archiving purposes without side effects.


Oh wah.

Anyway, they do actually have a "what for" at the link: https://www.loc.gov/preservation/digital/formats/fdd/fdd0003...


The PDF/A versions are subsets of PDF specs that are specifically aimed at archiving. They forbid features like encryption and font linking which would affect access years or decades from now.
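
As a rough illustration, a file announces its PDF/A claim in its XMP metadata (the `pdfaid:part` property), and an encrypted PDF references an `/Encrypt` dictionary. The Python sketch below just scans raw bytes for both. The function name `quick_pdfa_hints` is made up, and this is a heuristic only: it assumes the metadata is stored uncompressed, a plain byte match can give false positives, and it checks nothing about fonts. An actual validator such as veraPDF is the right tool for real conformance checks.

```python
import re

# pdfaid:part can appear as an XML element or as an attribute in XMP.
PDFA_PART_ELEMENT = re.compile(rb"<pdfaid:part>\s*(\d)\s*</pdfaid:part>")
PDFA_PART_ATTR = re.compile(rb"pdfaid:part\s*=\s*[\"'](\d)[\"']")


def quick_pdfa_hints(data: bytes) -> dict:
    """Heuristic byte scan; reports what the file claims, not whether it conforms."""
    part = PDFA_PART_ELEMENT.search(data) or PDFA_PART_ATTR.search(data)
    return {
        "claimed_pdfa_part": int(part.group(1)) if part else None,
        "mentions_encrypt": b"/Encrypt" in data,
    }


# Synthetic snippets for illustration; these are not complete PDF files.
claims_a3 = (
    b"%PDF-1.7\n<x:xmpmeta><pdfaid:part>3</pdfaid:part>"
    b"<pdfaid:conformance>U</pdfaid:conformance></x:xmpmeta>\n"
    b"trailer << /Root 1 0 R >>\n%%EOF"
)
encrypted = b"%PDF-1.7\ntrailer << /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"
```

A file that both claims a PDF/A part and mentions `/Encrypt` would be an immediate red flag, since the two are not supposed to go together.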



