Full title is
"PDF/A-3, PDF for Long-term Preservation, Use of ISO 32000-1, With Embedded Files"
> the new publishing framework, known as "V3", for RFCs from the IETF (Internet Engineering Task Force). V3 uses an XML document as the master format from which plain text, HTML, and PDF versions are derived. The PDF is a PDF/A-3u document with the XML master embedded. The first RFC published in the new format was RFC 8650 [0], published in November 2019. For more background on this choice, see RFC 7995: PDF Format for RFCs (December 2016) [1] and additional Useful References below.
[NOTE, I changed the links below slightly to point to the actual new HTML format]
The big restriction is that the classic PostScript typefaces are not available (no Times, Helvetica, or Zapf Dingbats), and the PDF file must bundle any fonts it uses.
The pdfsizeopt package will make any PDF smaller, and I think it deletes letters/characters from the included font that are not used.
Reading the descriptions of A-3 and A-4, to me it sounds like PDF/A jumped the shark and for archival purposes the old A-2 might still be the best variant.
In general, embedding files in PDF is a kinda neat capability, like the example of having a (CSV) dataset embedded in a report or something like that. But at the same time I get the feeling that it's an indication of a general shortcoming of our file handling that it makes sense to use PDF as a container format. ZIP files and such are pretty crude formats for higher-level file bundles, and the UX falls short too.
Preserving PDFs for future generations is like preserving radioactive waste for them. It's inevitable they'll end up with lots, and they won't thank us for it, but we should at least try to contain the mess.
I love the idea of the Martians having invaded Earth and wiped out humanity, but when they opened their first PDF file, they got all their files crypto-locked. A modern version of War of the Worlds.
Love it. Some kind of polymorphic AI-enabled malware that can adapt to any system that analyzes it.
Alternative scenario: the malware wiped out humanity millennia ago, but is lying dormant in PDFs, just waiting for the aliens to try to interpret these digital relics of a long-dead civilization.
I'm not really getting it. Aren't RFCs written in a straightforward wiki-like syntax? Why would they be preserved as PDF, and why would XML be the source format, or be considered useful as the canonical or authoring format, when thousands of existing RFCs in plain text/light wiki syntax clearly say otherwise?
I usually save, along with the PDF file, a .txt version of the contents. Of course, that doesn't include the images, but at least the text is there, and as a bonus it makes the file greppable.
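For what it's worth, the "greppable" part can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each `report.pdf` has a sidecar named `report.txt` in the same folder, the text is UTF-8, and the function name `grep_sidecars` is made up for this example.

```python
import re
from pathlib import Path


def grep_sidecars(folder, pattern):
    """Return (pdf_name, line_number, line) for each match in a sidecar .txt file.

    Assumes the hypothetical naming convention report.pdf <-> report.txt.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for txt in sorted(Path(folder).glob("*.txt")):
        pdf = txt.with_suffix(".pdf")
        if not pdf.exists():
            continue  # a plain text file with no PDF next to it
        text = txt.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((pdf.name, number, line.strip()))
    return hits
```

Calling `grep_sidecars("archive", "embedded")` would list every PDF in `archive` whose text sidecar mentions "embedded", without ever opening a PDF parser.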
Did you mean "very portable exploitable implementations"?
Sorry, but claiming PDF is stable is absurd, to say the least. Mobile devices, smartphones, and gaming consoles were usually exploited through their PDF parsers before pdf.js got embedded in web browsers.
Windows' biggest attack surface is still Outlook and PDF files.
So I'd argue that PDF has too large an attack surface, which must be reduced for better archiving purposes without side effects.
The PDF/A versions are subsets of PDF specs that are specifically aimed at archiving. They forbid features like encryption and font linking which would affect access years or decades from now.
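To make that a bit more concrete, here is a rough sketch of how one might peek at what a file claims about itself. It is a heuristic, not a validator: it assumes the XMP metadata packet is stored as plain, unfiltered text so that the `pdfaid:part` and `pdfaid:conformance` properties are visible in the raw bytes, and it treats any `/Encrypt` key as a sign of encryption. The function name `pdfa_hints` is made up for this example; a real conformance check needs a dedicated PDF/A validator.

```python
import re


def pdfa_hints(data: bytes) -> dict:
    """Scan raw PDF bytes for PDF/A identification and an /Encrypt entry.

    Heuristic only: a file can claim PDF/A in its metadata and still violate it.
    """
    # XMP may write the properties as elements (<pdfaid:part>3</pdfaid:part>)
    # or as attributes (pdfaid:part="3"); accept both spellings.
    part = re.search(rb'''pdfaid:part(?:>|=["'])\s*(\d)''', data)
    conf = re.search(rb'''pdfaid:conformance(?:>|=["'])\s*([A-Za-z])''', data)
    return {
        "claims_pdfa": part is not None,
        "part": int(part.group(1)) if part else None,
        "conformance": conf.group(1).decode().upper() if conf else None,
        # PDF/A forbids encryption, so this key should be absent.
        "encrypted": re.search(rb"/Encrypt\b", data) is not None,
    }
```

Under those assumptions, a PDF/A-3u file like the RFC PDFs discussed here would be expected to report part 3, conformance "U", and no encryption.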
[0] https://www.rfc-editor.org/rfc/rfc8650.html
[1] https://www.rfc-editor.org/rfc/rfc7995.html