Extract images from PDF file with JavaScript

Question

I want to write JavaScript code that extracts all images from a PDF file, saving them as JPG or some other image format. There is already JavaScript code for reading a PDF file, for example in the PDF viewer pdf.js:

window.addEventListener('change', function webViewerChange(evt) {
  var files = evt.target.files;
  if (!files || files.length === 0)
    return;

  // Read the local file into a Uint8Array.
  var fileReader = new FileReader();
  fileReader.onload = function webViewerChangeFileReaderOnload(evt) {
    var buffer = evt.target.result;
    var uint8Array = new Uint8Array(buffer);
    PDFView.open(uint8Array, 0);
  };

  var file = files[0];
  fileReader.readAsArrayBuffer(file);
  PDFView.setTitleUsingUrl(file.name);
  ........

Can this code be used to extract images from a PDF file?

This should work, but the right code is now here: github.com/mozilla/pdf.js/blob/gh-pages/build/pdf.js#L1112 (function loadJpegStream).
– user753676, Commented Oct 23, 2013 at 11:50

Jason Siefken · Accepted Answer · 2018-03-01 05:52:31Z

27

If you open a page with pdf.js, for example

PDFJS.getDocument({url: <pdf file>}).then(function (doc) {
    doc.getPage(1).then(function (page) {
        window.page = page;
    })
})

you can then use getOperatorList to search for paintJpegXObject objects and grab the resources.

window.objs = []
page.getOperatorList().then(function (ops) {
    for (var i=0; i < ops.fnArray.length; i++) {
        if (ops.fnArray[i] == PDFJS.OPS.paintJpegXObject) {
            window.objs.push(ops.argsArray[i][0])
        }
    }
})

Now window.objs will have a list of the resources from that page that you need to fetch.

console.log(window.objs.map(function (a) { return page.objs.get(a) }))

should print to the console a bunch of <img /> objects with data-uri src= attributes. These can be directly inserted into the page, or you can do more scripting to get at the raw data.

It only works for embedded JPEG objects, but it's a start!
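The filtering step can be pulled out into a small helper. This is a minimal sketch, not part of the original answer: the name collectImageNames is hypothetical, and it assumes the operator table is passed in (PDFJS.OPS in older builds). It also checks paintImageXObject, which a comment below reports as necessary for non-JPEG images, and it tolerates builds where one of the two operator codes is undefined.

```javascript
// Collect the object names of every image painted on a page.
// `ops` is the value resolved by page.getOperatorList();
// `OPS` is the pdf.js operator table (assumed to be PDFJS.OPS here).
function collectImageNames(ops, OPS) {
    // Skip operator codes that this build of pdf.js does not define.
    var wanted = [OPS.paintJpegXObject, OPS.paintImageXObject].filter(function (code) {
        return code !== undefined;
    });
    var names = [];
    for (var i = 0; i < ops.fnArray.length; i++) {
        if (wanted.indexOf(ops.fnArray[i]) !== -1) {
            var name = ops.argsArray[i][0];
            // The same image can be painted several times; keep each name once.
            if (names.indexOf(name) === -1) {
                names.push(name);
            }
        }
    }
    return names;
}

// Usage with a loaded page (requires pdf.js in the browser):
// page.getOperatorList().then(function (ops) {
//     window.objs = collectImageNames(ops, PDFJS.OPS);
// });
```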

edited Mar 1, 2018 at 5:52

answered Oct 4, 2016 at 14:50


8 Comments

svenyonson Over a year ago

Typos. Change to i < ops.fnArray.length and PDFJS.OPS.paintJpegXObject

Houy Narun Over a year ago

@JasonSiefken After extracting an image and doing some operation on it, such as resizing, how can I insert it back into the file to replace the existing image? Thanks

Sphinxxx Over a year ago

If you call page.objs.get() before the image is loaded, you get an error. To be safe, pass a callback as the second parameter to get() instead of relying on a return value. Working example: codepen.io/Sphinxxxx/pen/MxwGQZ
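The callback form described in this comment can be wrapped in a Promise so that it composes with the rest of the pdf.js calls. A minimal sketch, assuming page.objs.get accepts a callback as its second argument as the comment states; the helper name getPageObject is hypothetical.

```javascript
// Resolve with the page object once pdf.js has finished loading it,
// instead of reading the return value of get() too early.
function getPageObject(page, name) {
    return new Promise(function (resolve) {
        page.objs.get(name, resolve);
    });
}

// Usage, with `names` being the list collected from the operator list:
// Promise.all(names.map(function (n) { return getPageObject(page, n); }))
//     .then(function (images) { console.log(images); });
```

Note that the promise stays pending if the object is never loaded, so the operator list should be fetched for that page first.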

Davor Over a year ago

Together with paintJpegXObject comparison, you can also check for paintImageXObject. This worked in my case, probably because a pdf contained png objects.

Umair Ahmed Over a year ago

What about paintImageXObject for .png and other types? It gives a Uint8ClampedArray, and now converting the Uint8ClampedArray to an image is a new challenge :)

Otabek Eshpulatov 2 days ago

Hi Umair. Did you manage to convert Uint8ClampedArray -> image?


kubanm3 · Accepted Answer · 2022-02-02 08:20:52Z

Here is a link to a working example that gets images from a PDF and adds an alpha channel to the Uint8ClampedArray so that it can be displayed. It displays the images in a canvas.

Example in codepen: https://codepen.io/allandiego/pen/RwVGbyj
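The alpha-channel step can be sketched as a standalone function. This is an illustration, not the code from the linked pen: the name rgbToRgba is hypothetical, and it assumes the extracted pixel data is packed RGB with 3 bytes per pixel. Grayscale data, or data that is already RGBA, would need different handling.

```javascript
// Expand packed RGB pixel data (3 bytes per pixel) to RGBA (4 bytes per pixel)
// by appending a fully opaque alpha byte to each pixel.
function rgbToRgba(rgb, width, height) {
    var pixels = width * height;
    if (rgb.length !== pixels * 3) {
        throw new Error("expected " + (pixels * 3) + " bytes of RGB data, got " + rgb.length);
    }
    var rgba = new Uint8ClampedArray(pixels * 4);
    for (var i = 0, j = 0; i < rgb.length; i += 3, j += 4) {
        rgba[j] = rgb[i];         // red
        rgba[j + 1] = rgb[i + 1]; // green
        rgba[j + 2] = rgb[i + 2]; // blue
        rgba[j + 3] = 255;        // alpha: fully opaque
    }
    return rgba;
}

// In a browser, the result can be wrapped for putImageData:
// var imageData = new ImageData(rgbToRgba(img.data, img.width, img.height), img.width, img.height);
```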

Getting a data URL from the canvas so that the image can be displayed in an img tag:

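// imageWidth, imageHeight, and imageData (an ImageData object) come from the extracted image.
const canvas = document.createElement('canvas');
canvas.width = imageWidth;
canvas.height = imageHeight;
const ctx = canvas.getContext('2d');
ctx.putImageData(imageData, 0, 0);
const dataURL = canvas.toDataURL(); // PNG by default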

This doesn't extract anything; it only converts a canvas to an image.

Marek Lisý · Accepted Answer · 2023-06-30 13:17:48Z

In case anyone else stumbles upon this and doesn't want to implement the various cases him/herself, I finally found a library that does everything for me - pdf-img-convert. It uses pdf.js under the hood.

npm install pdf-img-convert

And use like this:

import { writeFileSync } from "fs";
import { convert } from "pdf-img-convert";

const outputImages = await convert("/path/to/pdf.pdf");
const imagePaths = outputImages.map((image, i) => {
  const path = "output" + i + ".png";
  writeFileSync(path, image);
  return path;
});
