Using ColdFusion and Xpdf to extract PDF metadata

#coldfusion #commandline

Xpdf is an open source project that includes a PDF viewer, but it also ships a collection of command line tools for Linux, Windows and Mac that can perform some helpful functions:

xpdf: PDF viewer
pdftotext: converts PDF to text
pdftops: converts PDF to PostScript
pdftoppm: converts PDF pages to netpbm (PPM/PGM/PBM) image files
pdftopng: converts PDF pages to PNG image files
pdftohtml: converts PDF to HTML
pdfinfo: extracts PDF metadata
pdfimages: extracts raw images from PDF files
pdffonts: lists fonts used in PDF files
pdfdetach: extracts attached files from PDF files

Can ColdFusion already do some of this? Of course it can, but I'm always exploring alternatives, and I occasionally need to run process-intensive operations outside the reach of CF timeouts, threads and Java heap limitations. I've also encountered cases where ColdFusion evaluates isPDFFile() as TRUE for a PDF that wasn't generated by Acrobat or CF, then decides it isn't really a PDF after all and throws a CF error when CFPDF reads the same file (using action="getInfo").
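
One way to guard against that mismatch is to treat isPDFFile() as a first check only and wrap the CFPDF call in a try/catch. This is a minimal sketch, not a tested implementation: the function name tryGetPdfInfo and the shape of the returned struct are my own assumptions, and the script-style cfpdf() call assumes a CFML engine that supports tags in cfscript.

```cfml
<cfscript>
// Hypothetical helper: returns a struct instead of letting CFPDF's error bubble up.
function tryGetPdfInfo(required string pdfPath) {
    // Declared up front so the cfpdf "name" attribute fills this local variable.
    var pdfInfo = {};

    if (!isPDFFile(arguments.pdfPath)) {
        return { ok = false, reason = "isPDFFile() returned false" };
    }

    try {
        cfpdf(action="getInfo", source=arguments.pdfPath, name="pdfInfo");
        return { ok = true, info = pdfInfo };
    } catch (any e) {
        // isPDFFile() said yes, but CFPDF could not read the file.
        return { ok = false, reason = e.message };
    }
}
</cfscript>
```

A caller can then check result.ok and fall back to a command line tool when CFPDF gives up.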

When it comes to metadata, I haven't entirely decided if I'm a purist regarding returned values. For example, CFPDF returns "created" and "modified" as strings formatted like "D:20250324103702-07'00'". That's probably consistent with how the metadata is stored in the PDF file, but it fails IMHO because it's not a valid date format and requires additional parsing to be useful. (It does appear to retain timezone info. That's nice, I guess.) CFPDF also returns boolean rotation flags and page sizes for every page as separate arrays. If you pass pages="1" in hopes of minimizing the response, a hard error is thrown because this argument is not allowed. It appears that metadata for every page is the one and only option.
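
To illustrate the extra parsing, here is a rough sketch that turns a string like "D:20250324103702-07'00'" into a real date. The helper name parsePdfDate and its return struct are hypothetical. It assumes the full 14-digit D:YYYYMMDDHHmmSS form and rejects anything shorter, and it keeps the timezone offset as a separate string rather than shifting the date.

```cfml
<cfscript>
// Hypothetical helper: converts a raw PDF date string to a CFML date.
// Assumes the full 14-digit form D:YYYYMMDDHHmmSS; shorter forms are rejected.
function parsePdfDate(required string pdfDate) {
    // Drop the leading "D:" marker if present.
    var s = reReplace(trim(arguments.pdfDate), "^D:", "");

    if (reFind("^[0-9]{14}", s) neq 1) {
        throw(message="Unsupported PDF date: #arguments.pdfDate#");
    }

    return {
        // Wall-clock time exactly as written in the file (no timezone shift applied).
        date = createDateTime(
            mid(s, 1, 4), mid(s, 5, 2), mid(s, 7, 2),
            mid(s, 9, 2), mid(s, 11, 2), mid(s, 13, 2)
        ),
        // Whatever follows the timestamp with the quotes removed; empty if absent.
        utcOffset = len(s) gt 14 ? replace(right(s, len(s) - 14), "'", "", "all") : ""
    };
}

// Usage with the sample value from above.
created = parsePdfDate("D:20250324103702-07'00'");
writeOutput(dateTimeFormat(created.date, "yyyy-mm-dd HH:nn:ss") & " " & created.utcOffset);
</cfscript>
```

Returning the offset separately is a deliberate choice here: it avoids guessing how the caller wants timezones handled.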

Recently, when using CFPDF to personalize an existing single-page cover PDF by adding a watermark, I needed to know both the dimensions and the rotation of the preexisting PDF so I could generate a PDF (using WKHTMLTOPDF) with the correct watermark placement. I decided to use Xpdf's pdfinfo.exe to extract this information, primarily so that the output would be consistent regardless of which CFML platform or version is used. It's entirely possible that a future CFPDF action="getInfo" will be updated to return different data in the name of progress/modernity. I also wanted dates to be dates, numeric values to be numeric, booleans to be booleans, "rotation" to be calculated, and the width/height to be converted to inches. (The "points" unit is nice, but I prefer to use "in" with WKHTMLTOPDF for CSS absolute positioning of elements and for defining the width/height of the output PDF.)
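
The approach described above could look roughly like the sketch below. Everything named here is an assumption for illustration: the helper parsePdfInfo, the keys of the struct it returns, and the file paths in the commented-out cfexecute call. It also assumes pdfinfo prints "Key: value" lines and a page size line of the form "612 x 792 pts (letter) (rotated 0 degrees)"; check the output of your pdfinfo version before relying on it. Date fields are left as strings in this sketch.

```cfml
<cfscript>
// Hypothetical helper: parses pdfinfo's plain-text "Key:   value" output
// into a struct with typed values (numbers, booleans, inches).
function parsePdfInfo(required string raw) {
    var info = {};
    // Strip carriage returns so Windows and Unix output split the same way.
    var lines = listToArray(replace(arguments.raw, chr(13), "", "all"), chr(10));

    for (var line in lines) {
        var pos = find(":", line);
        if (pos lt 2 or pos eq len(line)) {
            continue; // no key, or nothing after the colon
        }
        // Split on the first colon only; values such as times contain colons too.
        info[trim(left(line, pos - 1))] = trim(right(line, len(line) - pos));
    }

    var pageSize  = structKeyExists(info, "Page size") ? info["Page size"] : "";
    var encrypted = structKeyExists(info, "Encrypted") ? info["Encrypted"] : "";
    var result = {
        pages     = structKeyExists(info, "Pages") ? val(info["Pages"]) : 0,
        encrypted = left(encrypted, 3) eq "yes",
        widthPts  = 0, heightPts = 0, widthIn = 0, heightIn = 0, rotation = 0
    };

    // Width and height in points; 72 points equal one inch.
    var size = reFind("([0-9.]+) x ([0-9.]+) pts", pageSize, 1, true);
    if (size.pos[1] gt 0) {
        result.widthPts  = val(mid(pageSize, size.pos[2], size.len[2]));
        result.heightPts = val(mid(pageSize, size.pos[3], size.len[3]));
        result.widthIn   = result.widthPts / 72;
        result.heightIn  = result.heightPts / 72;
    }

    // Rotation in degrees, when the page size line reports it.
    var rot = reFind("rotated ([0-9]+) degrees", pageSize, 1, true);
    if (rot.pos[1] gt 0) {
        result.rotation = val(mid(pageSize, rot.pos[2], rot.len[2]));
    }
    return result;
}

// Running the tool (paths are placeholders, adjust for your server):
// cfexecute(name="C:\xpdf\pdfinfo.exe", arguments='"C:\docs\cover.pdf"', variable="raw", timeout=30);
// info = parsePdfInfo(raw);

// Self-contained demo using sample text in the assumed output format.
sample = "Pages:          1" & chr(10)
       & "Encrypted:      no" & chr(10)
       & "Page size:      612 x 792 pts (letter) (rotated 0 degrees)";
writeDump(parsePdfInfo(sample));
</cfscript>
```

Keeping the parser separate from the cfexecute call means it can be exercised with sample text and reused unchanged if the executable's location differs between servers.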