I want to write JavaScript code to extract all images from a PDF file, ideally as JPG or some other image format. There is already JavaScript code for reading a PDF file, for example in the PDF viewer pdf.js:

window.addEventListener('change', function webViewerChange(evt) {
  var files = evt.target.files;
  if (!files || files.length === 0)
    return;

  // Read the local file into a Uint8Array.
  var fileReader = new FileReader();
  fileReader.onload = function webViewerChangeFileReaderOnload(evt) {
    var buffer = evt.target.result;
    var uint8Array = new Uint8Array(buffer);
    PDFView.open(uint8Array, 0);
  };

  var file = files[0];
  fileReader.readAsArrayBuffer(file);
  PDFView.setTitleUsingUrl(file.name);
  ........

Can this code be used to extract images from a PDF file?

3 Answers

27

If you open a page with pdf.js, for example

PDFJS.getDocument({url: <pdf file>}).then(function (doc) {
    doc.getPage(1).then(function (page) {
        window.page = page;
    })
})

you can then use getOperatorList to search for paintJpegXObject objects and grab the resources.

window.objs = []
page.getOperatorList().then(function (ops) {
    for (var i=0; i < ops.fnArray.length; i++) {
        if (ops.fnArray[i] == PDFJS.OPS.paintJpegXObject) {
            window.objs.push(ops.argsArray[i][0])
        }
    }
})

Now window.objs will hold a list of the resource names from that page that you need to fetch.

console.log(window.objs.map(function (a) { return page.objs.get(a); }))

should print to the console a bunch of <img /> objects with data-uri src= attributes. These can be directly inserted into the page, or you can do more scripting to get at the raw data.

It only works for embedded JPEG objects, but it's a start!
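The two steps above can be pulled into small helpers. The following is only a sketch: the function names collectImageNames and getObject are made up for illustration, the check for paintImageXObject assumes your pdf.js build defines that operator alongside paintJpegXObject, and the Promise wrapper assumes page.objs.get accepts a callback as its second argument.

```javascript
// Hypothetical helper: collect the names of image XObjects painted on a page.
// `ops` is the value resolved by page.getOperatorList();
// `OPS` is the operator table (PDFJS.OPS in the snippets above).
function collectImageNames(ops, OPS) {
  var names = [];
  for (var i = 0; i < ops.fnArray.length; i++) {
    var fn = ops.fnArray[i];
    if (fn === OPS.paintJpegXObject || fn === OPS.paintImageXObject) {
      var name = ops.argsArray[i][0];
      // The same image can be painted more than once; keep each name once.
      if (names.indexOf(name) === -1) {
        names.push(name);
      }
    }
  }
  return names;
}

// Hypothetical helper: wrap the callback form of objs.get() in a Promise,
// so the object is only used once it has finished loading (assumption:
// get(name, callback) invokes the callback with the resolved object).
function getObject(objs, name) {
  return new Promise(function (resolve) {
    objs.get(name, resolve);
  });
}

// Possible usage with a loaded page (not executed here):
// page.getOperatorList().then(function (ops) {
//   var names = collectImageNames(ops, PDFJS.OPS);
//   return Promise.all(names.map(function (n) {
//     return getObject(page.objs, n);
//   }));
// }).then(function (images) { console.log(images); });
```

Passing the operator table in as a parameter keeps the helper independent of whichever global name your pdf.js version exposes.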

8 Comments

Typos. Change to i < ops.fnArray.length and PDFJS.OPS.paintJpegXObject
@JasonSiefken After extracting an image and modifying it (for example, resizing it), how can I insert it back into the file to replace the existing image? Thanks.
If you call page.objs.get() before the image is loaded, you get an error. To be safe, pass a callback as the second parameter to get() instead of relying on a return value. Working example: codepen.io/Sphinxxxx/pen/MxwGQZ
In addition to the paintJpegXObject comparison, you can also check for paintImageXObject. This worked in my case, probably because the PDF contained PNG objects.
What about paintImageXObject for .png and other types? It now returns a Uint8ClampedArray, and converting that Uint8ClampedArray to an image is a new challenge :)
Hi Umair. Did you manage to convert Uint8ClampedArray -> image?
1

Here is a link to a working example that gets images from a PDF and adds an alpha channel to the Uint8ClampedArray so that it can be displayed. It renders the images in a canvas.

Example in codepen: https://codepen.io/allandiego/pen/RwVGbyj
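The alpha-channel step can be sketched roughly as follows. This is not the code from the codepen; it is a minimal illustration that assumes the decoded image arrives as tightly packed RGB data (3 bytes per pixel, no row padding). The function name rgbToRgba is hypothetical.

```javascript
// Hypothetical helper: expand packed RGB pixel data to RGBA by appending
// a fully opaque alpha byte to every pixel.
// Assumption: `rgb` holds exactly width * height * 3 bytes.
function rgbToRgba(rgb, width, height) {
  var pixelCount = width * height;
  if (rgb.length !== pixelCount * 3) {
    throw new Error('Expected ' + (pixelCount * 3) + ' bytes of RGB data, got ' + rgb.length);
  }
  var rgba = new Uint8ClampedArray(pixelCount * 4);
  for (var i = 0, j = 0; i < rgb.length; i += 3, j += 4) {
    rgba[j] = rgb[i];         // red
    rgba[j + 1] = rgb[i + 1]; // green
    rgba[j + 2] = rgb[i + 2]; // blue
    rgba[j + 3] = 255;        // alpha: fully opaque
  }
  return rgba;
}
```

In a browser, the result can be wrapped with new ImageData(rgba, width, height) and drawn onto a canvas with putImageData. If the data is already 4 bytes per pixel, this conversion is unnecessary.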

Getting a data URL from the canvas so the image can be displayed in an img tag:

const canvas = document.createElement('canvas');
canvas.width = imageWidth;
canvas.height = imageHeight;
const ctx = canvas.getContext('2d');
ctx!.putImageData(imageData, 0, 0);
const dataURL = canvas.toDataURL();

1 Comment

This doesn't extract anything; it only converts a canvas to an image.
0

In case anyone else stumbles upon this and doesn't want to implement the various cases him/herself, I finally found a library that does everything for me - pdf-img-convert. It uses pdf.js under the hood.

npm install pdf-img-convert

And use it like this:

import { writeFileSync } from "fs";
import { convert } from "pdf-img-convert";

const outputImages = await convert("/path/to/pdf.pdf");
const imagePaths = outputImages.map((image, i) => {
  const path = "output" + i + ".png";
  writeFileSync(path, image);
  return path;
});
