Is there a .pdf tool which allows compression to a defined file size? Tools like Ghostscript can compress a .pdf to different levels of quality by using different settings, but not to a defined file size. I understand that this has to do with the compression algorithm itself, and that data can only be compressed down to a certain limit; but what if the requested file size is within that limit?
I'm asking this because a user of my problem validation platform wanted a solution for this[1]: websites requiring document upload have a file size limit, and the compressed file often ends up either above the prescribed limit or well below it, losing out on quality unnecessarily.
[1]'Reduce document file size to specific size' (I have added the link to it on my profile, since it's my own platform).
It's a bit more complicated than it sounds. Text streams are generally already compressed as well as they can be with whatever scheme is available; the bulk of the space usage is often fonts and images. A PDF itself is not compressed so much as each part of it is compressed individually.
There's not much you can do about fonts except not embedding them unless you need to, and avoiding duplicate/overlapping subsets; if you do have these, they are very tricky to untangle, and I'm not aware of any good tool to do it automatically.
For images, it depends on the format. If your PDF has JPEG (DCTDecode) images then they will have to be resampled according to the JPEG spec, and likewise for TIFF; you can change the number of colour bits, downsample the DPI (a simple gs command-line option), or change the compression scheme within the JPEG itself and then replace it. There are so many avenues to approach this that I'm not sure it's something easily achieved while still obtaining a good result across all possible PDFs.
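A rough sketch of the DPI downsampling route, assuming Ghostscript is installed and on the PATH (the resolution value is just illustrative):

    # Rewrite a PDF with its images downsampled to a lower DPI.
    # Hypothetical helper; flag values are illustrative, not tuned.
    import subprocess

    def downsample_pdf(src, dst, dpi=100):
        subprocess.run([
            "gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH", "-dQUIET",
            "-dDownsampleColorImages=true", f"-dColorImageResolution={dpi}",
            "-dDownsampleGrayImages=true", f"-dGrayImageResolution={dpi}",
            "-dDownsampleMonoImages=true", f"-dMonoImageResolution={dpi}",
            f"-sOutputFile={dst}", src,
        ], check=True)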
Within a problem domain though, like PDFs that are just pages of scanned images, you could probably iterate and downsample until you hit your target size.
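A minimal sketch of that iterate-and-downsample loop, again assuming Ghostscript is on the PATH; the resolution steps and function name are illustrative:

    # Try progressively lower image resolutions until the output fits
    # under the target size; returns the DPI that worked, or None.
    import os
    import subprocess

    def compress_to_target(src, dst, target_bytes, dpis=(300, 200, 150, 100, 72, 50)):
        for dpi in dpis:
            subprocess.run([
                "gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH", "-dQUIET",
                "-dDownsampleColorImages=true", f"-dColorImageResolution={dpi}",
                "-dDownsampleGrayImages=true", f"-dGrayImageResolution={dpi}",
                f"-sOutputFile={dst}", src,
            ], check=True)
            if os.path.getsize(dst) <= target_bytes:
                return dpi      # fits under the target at this resolution
        return None             # even the lowest resolution was too big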
>Within a problem domain though, like PDFs that are just pages of scanned images, you could probably iterate and downsample until you hit your target size.
I presumed any solution for this problem would involve multiple passes through the compression routine to hit the target size. Having to deal with text, fonts, and images separately, as you said, does make that complex.
Right now, I'm just waiting to see if there's really a need gap for this. It's usually the Govt. websites which have very low limits for document upload size, like 100KB. I personally take the image out of the .pdf and compress it to minimum JPEG quality, but documents with multiple pages make that tricky.
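A rough sketch of that extract-and-recompress workflow for multi-page scans, assuming Ghostscript and img2pdf are installed; quality and DPI values are illustrative, and note this rasterises every page, so any selectable text layer is lost:

    # Render each page to a low-quality JPEG with Ghostscript, then
    # stitch the JPEGs back into a single PDF with img2pdf.
    import glob
    import subprocess

    def recompress_scanned_pdf(src, dst, dpi=100, jpeg_quality=20):
        subprocess.run([
            "gs", "-sDEVICE=jpeg", f"-r{dpi}", f"-dJPEGQ={jpeg_quality}",
            "-o", "page-%03d.jpg", src,
        ], check=True)
        pages = sorted(glob.glob("page-*.jpg"))
        subprocess.run(["img2pdf", *pages, "-o", dst], check=True)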
100KB sounds very limiting; you'll easily go above that with just the fonts in some poorly constructed PDFs.
The first rule of PDFs is to always regenerate from the source if you can; they can be modified/edited, but are better considered an append-only format because of how they are made. There are so many choices made during their construction that are hard to undo later, such as per-character placement instead of per-word or per-line with spaces, each consuming more stream data because of the extra overhead in offsets etc. (which compresses well as a text stream but still adds up).
Taking out or resampling images like you said is probably the best starting point unless you've found there is a lot of overhead (metadata/unused objects) to trim.
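If it does turn out there is overhead to trim, a lossless rewrite is a cheap first pass before touching the images. A sketch using qpdf, assuming it's installed:

    # Repack the PDF: pack objects into object streams and recompress
    # existing Flate streams, without touching image quality.
    import subprocess

    def lossless_rewrite(src, dst):
        subprocess.run([
            "qpdf", "--object-streams=generate", "--compress-streams=y",
            "--recompress-flate", src, dst,
        ], check=True)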