PDF is a bad format unless you are printing. I worked for the US Federal Government, where we had millions of stored PDFs. At one point the Federal Judiciary had one of the largest databases in the world. Why? PDFs. We pushed hard for a true digital format like HTML with a printable companion, but the powers that be wanted a 1:1 replica for Search Warrants and Judges' Orders. We can do better for sure, but as a result I was knee-deep in PDFs. It's a tiresome, painful little spec. Maybe this Go library can solve some of the many inconsistencies in the PDF world.
> We pushed hard for a true digital format like HTML with a printable companion, but the powers that be wanted a 1:1 replica for Search Warrants and Judges' Orders.
Is there a reason you can't have both? Presumably you have structured data at some point, before it's laid out on the page and saved as PDF. Why not just save that alongside the PDFs? You could also serialize it and include it in a PDF metadata field, so it can be extracted from the files even if the database is lost.
PDFs are not useful for any processing. They represent things you want to print, but not search, understand, analyse, etc.
Even those with text actually attached / extractable have no structure. "Selecting blocks of text" involves guessing which order the lines go in, depending on their location / distance from other lines.
Compare to having for example "<recipient-address>...</...>" from which you can still generate the printed version.
> "Selecting blocks of text" involves guessing which order the lines go in, depending on their location / distance from other lines.
If you create your own PDFs, you can make sure they contain both information about reading order and the mapping from glyphs back to UTF-8 text by creating an accessible PDF (aka a “tagged PDF”).
> Compare to having for example "<recipient-address>...</...>" from which you can still generate the printed version.
Generating _a_ printed version is easy; generating _the_ printed version, guaranteeing 100% reproducibility, isn't. To get the exact same layout, you have to guarantee that the same fonts are used (difficult, as OSes can update their fonts, possibly tweaking a glyph, a kerning table or anything else that can affect layout) and, basically, never fix bugs in your PDF generation flow.
That’s why many people keep both the structured source data (e.g. in json or xml) and the generated PDF.
I don't think I've seen a tagged PDF in the wild... ever. I'm sure they exist, but I'm doing a lot of stuff with PDFs in the healthcare context and this tech may as well not exist for me. To the point that most apps will support embedding a bad PDF in an HL7 file just to add metadata.
> That’s why many people keep both the structured source data (e.g. in json or xml) and the generated PDF.
Well, the PDF processing I did usually assumed I had a paper of x by y cm and a mask I'd move around: "make a 10 by 20 cm rectangle at position (100, 200); what's in that rectangle", basically.
There's structure, just position-based rather than tag-based. Ofc, if humans shit around and change it, you're fucked versioning your masks, but usually they print from form templates themselves.
As I used to say to my colleagues bemoaning this inconvenient analogue bridge: "if you can read it coherently as a human, we can parse it". We have to accept that administrations communicate via geometry, not semantics, and adapt, while we also try to convince them to give structured tagging a chance. But they need a critical mass of their documentation pipeline to be machine-read before they'll even agree to discuss it.
It's as shit a format as non-UTF strings, yet it's everywhere and we must adapt, is my point.
We can adapt to scanned documents before all documents are semantically tagged, just like we have to adapt to non-standard ASCII extensions in non-English countries, is my point.
And by "if humans shit around and change it" you mean things that regularly happen and need to be accounted for like moving the physical location to a place with the address one line longer, or getting a new partner which changes the letterhead, or adding extra information required by legal, or ...
I guess the parent is focusing on the point that PDFs can render as perfectly human-readable documents but be completely non-machine-readable at the same time.
PDF is a true digital format, in the same way a zip file is. A PDF page can be made in many, many different ways; it depends on what use you are targeting. You want a 1:1 digital replica of a page? Scan the page as a TIFF and add it to a page as an image. Or you could just add the text and the font to the page. Or, if you want to mess with people or you are a CAD application, you draw the text as thousands of little lines.
Thank you for saying that. pdftk has been a wonderful tool for me over the years, but if pdfcpu can replace it, and thus rid me of my final Java dependency, that would be wonderful.
Strong endorsement. I'm fine with pdftk except for rotating pages: it seems to use annotations rather than actually rotating the image in the PDF. I'm using some odd software that ignores these annotations, so even though I fixed the page orientation with pdftk in the source PDF, that software will still display it with the wrong orientation (and fail at OCR for that page).
I’m hoping pdfcpu does the right thing instead and actually rotates the image in the file.
This is off topic, but the term PDF just makes me cringe. I just finished uninstalling the entire Adobe Creative Cloud suite this past weekend and couldn't have felt more relieved… Adobe Acrobat accounted for 2.4 GB of space, and the entire Chromium-based CC took up close to 45 GB. SMH! You can do better, Adobe! I vividly remember installing Photoshop 5.0 back in 1998 with an approximately 90 MB installer, and now it clocks in at 1.26 GB.