80

I wonder if it is possible to get the text inside a PDF file using only JavaScript. If so, can anyone show me how?

I know there are server-side libraries for Java, C#, and so on, but I would prefer not to use a server. Thanks.

12 Answers

97

Because pdf.js has matured over the years, I would like to give a new answer: this can be done locally, without involving any server or external service. Newer versions of pdf.js provide page.getTextContent(), which returns the text content of a page. I've done it successfully with the following code.

  1. What you get in each step is a promise. You need to chain .then( function(){...} ) calls to proceed to the next step:

     1. PDFJS.getDocument( data ).then( function(pdf) {

     2. pdf.getPage(i).then( function(page){

     3. page.getTextContent().then( function(textContent){

  2. What you finally get is an array of text blocks, textContent.bidiTexts[], each holding its string in a str field. You concatenate them to get the text of one page. The blocks' coordinates are used to judge whether a newline or a space needs to be inserted. (This may not be totally robust, but in my tests it seems OK.)

  3. The input parameter data needs to be either a URL or ArrayBuffer data. I used the readAsArrayBuffer(file) function of the FileReader API to get the data.

Note: As other users have pointed out, the library has since been updated and this code no longer works as written. According to the comment by async5 below, you need to replace textContent.bidiTexts with textContent.items.

    function Pdf2TextClass(){
     var self = this;
     this.complete = 0;

    /**
     *
     * @param data ArrayBuffer of the pdf file content
     * @param callbackPageDone To inform the progress each time
     *        when a page is finished. The callback function's input parameters are:
     *        1) number of pages done;
     *        2) total number of pages in file.
     * @param callbackAllDone The input parameter of callback function is 
     *        the result of extracted text from pdf file.
     *
     */
     this.pdfToText = function(data, callbackPageDone, callbackAllDone){
     console.assert( data  instanceof ArrayBuffer  || typeof data == 'string' );
     PDFJS.getDocument( data ).then( function(pdf) {
     var div = document.getElementById('viewer');
    
     var total = pdf.numPages;
     callbackPageDone( 0, total );        
     var layers = {};        
     for (var i = 1; i <= total; i++){
        pdf.getPage(i).then( function(page){
        var n = page.pageNumber;
        page.getTextContent().then( function(textContent){
          if( null != textContent.bidiTexts ){
            var page_text = "";
            var last_block = null;
            for( var k = 0; k < textContent.bidiTexts.length; k++ ){
                var block = textContent.bidiTexts[k];
                if( last_block != null && last_block.str[last_block.str.length-1] != ' '){
                    if( block.x < last_block.x )
                        page_text += "\r\n"; 
                    else if ( last_block.y != block.y && ( last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null ))
                        page_text += ' ';
                }
                page_text += block.str;
                last_block = block;
            }

            textContent != null && console.log("page " + n + " finished."); //" content: \n" + page_text);
            layers[n] =  page_text + "\n\n";
          }
          ++ self.complete;
          callbackPageDone( self.complete, total );
          if (self.complete == total){
            window.setTimeout(function(){
              var full_text = "";
              var num_pages = Object.keys(layers).length;
              for( var j = 1; j <= num_pages; j++)
                  full_text += layers[j] ;
              callbackAllDone(full_text);
            }, 1000);              
          }
        }); // end  of page.getTextContent().then
      }); // end of page.then
    } // of for
      });
     }; // end of pdfToText()
    }; // end of class
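
For newer pdf.js builds, where each entry of textContent.items carries its position in a transform array rather than in x and y fields, the same line-break idea can be sketched as a small pure function. This is only an illustration under stated assumptions: it reads transform[4] as the horizontal and transform[5] as the vertical offset of an item, treats a change in the vertical offset as a new line, and runs on hand-written mock items. The name itemsToText and the tolerance parameter are made up for this example.

```javascript
// Hypothetical helper: joins pdf.js-style text items into lines.
// Assumes each item looks like { str: string, transform: [a, b, c, d, x, y] }.
function itemsToText(items, tolerance = 1) {
  let text = '';
  let last = null;
  for (const item of items) {
    if (last !== null) {
      // Items whose vertical offsets differ by more than the tolerance
      // are treated as belonging to different lines.
      const sameLine =
        Math.abs(item.transform[5] - last.transform[5]) <= tolerance;
      if (!sameLine) {
        text += '\n';
      } else if (!last.str.endsWith(' ') && !item.str.startsWith(' ')) {
        // Same line and no space at the boundary: add a separator.
        text += ' ';
      }
    }
    text += item.str;
    last = item;
  }
  return text;
}

// Mock items standing in for textContent.items
const sampleItems = [
  { str: 'Hello', transform: [1, 0, 0, 1, 10, 700] },
  { str: 'world', transform: [1, 0, 0, 1, 50, 700] },
  { str: 'Second line', transform: [1, 0, 0, 1, 10, 680] },
];
console.log(itemsToText(sampleItems));
```

Real documents with columns, rotated text, or superscripts will likely need a more careful rule than a single vertical tolerance.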

8 Comments

Ancient question but excellent answer. You have any idea how to get the textLayer to not render characters in individual divs but to render them as whole words? I'm getting quite a big performance hit from trying to use the text layer overlap with the divs absolute positioned as there are so many of them. If you'd prefer this as a separate actual StackOverflow question I'll make one.
@gm2008 I have been trying to extract text from a PDF using your function. However, I am unable to extract the text. The full_text returns an empty string at the end. Can you please help.
I couldn't get this to work either (API has changed). Added my own example below.
replace textContent.bidiTexts with textContent.items
14

I couldn't get gm2008's example to work (the internal data structure of pdf.js has apparently changed), so I wrote my own fully promise-based solution that doesn't use any DOM elements, query selectors, or canvas, using the updated pdf.js from the example at Mozilla.

It takes a file path as input, since I'm using it with node-webkit. You need to have the cmaps downloaded and reachable at the configured path, and you need pdf.js and pdf.worker.js to get this working.

    /**
     * Extract text from PDFs with PDF.js
     * Uses the demo pdf.js from https://mozilla.github.io/pdf.js/getting_started/
     */
    this.pdfToText = function(data) {

        PDFJS.workerSrc = 'js/vendor/pdf.worker.js';
        PDFJS.cMapUrl = 'js/vendor/pdfjs/cmaps/';
        PDFJS.cMapPacked = true;

        return PDFJS.getDocument(data).then(function(pdf) {
            var pages = [];
            for (var i = 0; i < pdf.numPages; i++) {
                pages.push(i);
            }
            return Promise.all(pages.map(function(pageNumber) {
                return pdf.getPage(pageNumber + 1).then(function(page) {
                    return page.getTextContent().then(function(textContent) {
                        return textContent.items.map(function(item) {
                            return item.str;
                        }).join(' ');
                    });
                });
            })).then(function(pages) {
                return pages.join("\r\n");
            });
        });
    }

usage:

 self.pdfToText(files[0].path).then(function(result) {
      console.log("PDF done!", result);
 })
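
The reason this version needs neither a bookkeeping object nor a timer is that Promise.all resolves with its results in the order of the input array, whichever promise settles first. A minimal sketch of that behaviour, with a made-up fakePageText function standing in for the getPage and getTextContent chain (no pdf.js involved, and the delays are arbitrary):

```javascript
// Stand-in for pdf.getPage(n).then(page => page.getTextContent()):
// later pages resolve sooner, to mimic out-of-order completion.
function fakePageText(pageNumber, total) {
  return new Promise(function (resolve) {
    setTimeout(function () {
      resolve('text of page ' + pageNumber);
    }, (total - pageNumber) * 10);
  });
}

function collectPages(total) {
  const tasks = [];
  for (let n = 1; n <= total; n++) {
    tasks.push(fakePageText(n, total));
  }
  // Results come back in the order of `tasks`, not in completion order.
  return Promise.all(tasks);
}

collectPages(3).then(function (pages) {
  console.log(pages.join('\r\n'));
});
```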

2 Comments

"PDFJS.getDocument(...).then is not a function"
9

Just leaving here a full working sample.

<html>
    <head>
        <script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
    </head>
    <body>
        <input id="pdffile" name="pdffile" type="file" />
        <button id="btn" onclick="convert()">Process</button>
        <div id="result"></div>
    </body>
</html>

<script>

    function convert() {
        var fr=new FileReader();
        var pdff = new Pdf2TextClass();
        fr.onload=function(){
            pdff.pdfToText(fr.result, null, (text) => { document.getElementById('result').innerText += text; });
        }
        fr.readAsDataURL(document.getElementById('pdffile').files[0])
        
    }

    function Pdf2TextClass() {
        var self = this;
        this.complete = 0;

        this.pdfToText = function (data, callbackPageDone, callbackAllDone) {
            console.assert(data instanceof ArrayBuffer || typeof data == 'string');
            var loadingTask = pdfjsLib.getDocument(data);
            loadingTask.promise.then(function (pdf) {


                var total = pdf.numPages; // public property; avoids the private _pdfInfo field
                //callbackPageDone( 0, total );        
                var layers = {};
                for (var i = 1; i <= total; i++) {
                    pdf.getPage(i).then(function (page) {
                        var n = page.pageNumber;
                        page.getTextContent().then(function (textContent) {

                            //console.log(textContent.items[0]);0
                            if (null != textContent.items) {
                                var page_text = "";
                                var last_block = null;
                                for (var k = 0; k < textContent.items.length; k++) {
                                    var block = textContent.items[k];
                                    if (last_block != null && last_block.str[last_block.str.length - 1] != ' ') {
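                                        // Current pdf.js text items expose their position through `transform`
                                        // (index 4: horizontal offset, index 5: vertical offset), not `x`/`y`.
                                        if (block.transform[4] < last_block.transform[4])
                                            page_text += "\r\n";
                                        else if (last_block.transform[5] != block.transform[5] && (last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null))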
                                            page_text += ' ';
                                    }
                                    page_text += block.str;
                                    last_block = block;
                                }

                                textContent != null && console.log("page " + n + " finished."); //" content: \n" + page_text);
                                layers[n] = page_text + "\n\n";
                            }
                            ++self.complete;
                            //callbackPageDone( self.complete, total );
                            if (self.complete == total) {
                                window.setTimeout(function () {
                                    var full_text = "";
                                    var num_pages = Object.keys(layers).length;
                                    for (var j = 1; j <= num_pages; j++)
                                        full_text += layers[j];
                                    callbackAllDone(full_text);
                                }, 1000);
                            }
                        }); // end  of page.getTextContent().then
                    }); // end of page.then
                } // of for
            });
        }; // end of pdfToText()
    }; // end of class

</script>
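
Recent pdfjs-dist builds expect the worker script location to be set before getDocument is called, and they warn when it is missing. A minimal sketch of that setup follows, assuming the pdfjsLib global created by the script tag above. GlobalWorkerOptions.workerSrc is the pdf.js setting; the helper name useWorker is hypothetical, and the worker URL is an assumption that has to match the build you actually load.

```javascript
// Hypothetical helper: points a pdf.js library object at its worker script.
function useWorker(lib, workerUrl) {
  lib.GlobalWorkerOptions.workerSrc = workerUrl;
  return lib;
}

// In the browser, pdfjsLib is defined by the <script> tag. The guard keeps
// this snippet harmless where the library has not been loaded.
if (typeof pdfjsLib !== 'undefined') {
  // Assumed location of the worker file; adjust to your pdf.js build.
  useWorker(pdfjsLib, 'https://npmcdn.com/pdfjs-dist/build/pdf.worker.js');
}
```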

1 Comment

> Deprecated API usage: No "GlobalWorkerOptions.workerSrc" specified.
7

Here's some JavaScript code that does what you want using Pdf.js from http://hublog.hubmed.org/archives/001948.html:

var input = document.getElementById("input");  
var processor = document.getElementById("processor");  
var output = document.getElementById("output");  

// listen for messages from the processor  
window.addEventListener("message", function(event){  
  if (event.source != processor.contentWindow) return;  

  switch (event.data){  
    // "ready" = the processor is ready, so fetch the PDF file  
    case "ready":  
      var xhr = new XMLHttpRequest;  
      xhr.open('GET', input.getAttribute("src"), true);  
      xhr.responseType = "arraybuffer";  
      xhr.onload = function(event) {  
        processor.contentWindow.postMessage(this.response, "*");  
      };  
      xhr.send();  
    break;  

    // anything else = the processor has returned the text of the PDF  
    default:  
      output.textContent = event.data.replace(/\s+/g, " ");  
    break;  
  }  
}, true);

...and here's an example:

http://git.macropus.org/2011/11/pdftotext/example/

2 Comments

While those links may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
hi, i'm trying this, but this still requires a file be uploaded to the server. how can i process files locally, client-side?
4

Note: This code assumes you're using nodejs. That means you're parsing a local file instead of one from a web page since the original question doesn't explicitly ask about parsing pdfs on a web page.

@gm2008's answer was a great starting point (please read it and its comments for more info), but needed some updates (08/19) and had some unused code. I also like examples that are more full. There's more refactoring and tweaking that could be done (e.g. with await), but for now it's as close to that original answer as it could be.

As before, this uses Mozilla's PDFjs library. The npmjs package is at https://www.npmjs.com/package/pdfjs-dist.

In my experience, this doesn't do well in finding where to put spaces, but that's a problem for another time.

[Edit: I believe the update to the use of .transform has restored the whitespace as it originally behaved.]

// This file is called myPDFfileToText.js and is in the root folder
let PDFJS = require('pdfjs-dist');

let pathToPDF = 'path/to/myPDFfileToText.pdf';

let toText = new Pdf2TextObj(); // `new` is required so that `this` is defined inside the function
let onPageDone = function() {}; // don't want to do anything between pages
let onFinish = function(fullText) { console.log(fullText) };
toText.pdfToText(pathToPDF, onPageDone, onFinish);

function Pdf2TextObj() {
    let self = this;
    this.complete = 0;

    /**
     *
     * @param path Path to the pdf file.
     * @param callbackPageDone To inform the progress each time
     *        when a page is finished. The callback function's input parameters are:
     *        1) number of pages done.
     *        2) total number of pages in file.
     *        3) the `page` object itself or null.
     * @param callbackAllDone Called after all text has been collected. Input parameters:
     *        1) full text of parsed pdf.
     *
     */
    this.pdfToText = function(path, callbackPageDone, callbackAllDone) {
        // console.assert(typeof path == 'string');
        PDFJS.getDocument(path).promise.then(function(pdf) {

            let total = pdf.numPages;
            callbackPageDone(0, total, null);

            let pages = {};
            // For some (pdf?) reason these don't all come in consecutive
            // order. That's why they're stored as an object and then
            // processed one final time at the end.
            for (let pagei = 1; pagei <= total; pagei++) {
                pdf.getPage(pagei).then(function(page) {
                    let pageNumber = page.pageNumber;
                    page.getTextContent().then(function(textContent) {
                        if (null != textContent.items) {
                            let page_text = "";
                            let last_item = null;
                            for (let itemsi = 0; itemsi < textContent.items.length; itemsi++) {
                                let item = textContent.items[itemsi];
                                // I think to add whitespace properly would be more complex and
                                // would require two loops.
                                if (last_item != null && last_item.str[last_item.str.length - 1] != ' ') {
                                    let itemX = item.transform[5]
                                    let lastItemX = last_item.transform[5]
                                    let itemY = item.transform[4]
                                    let lastItemY = last_item.transform[4]
                                    if (itemX < lastItemX)
                                        page_text += "\r\n";
                                    else if (itemY != lastItemY && (last_item.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null))
                                        page_text += ' ';
                                } // ends if may need to add whitespace

                                page_text += item.str;
                                last_item = item;
                            } // ends for every item of text

                            textContent != null && console.log("page " + pageNumber + " finished.") // " content: \n" + page_text);
                            pages[pageNumber] = page_text + "\n\n";
                        } // ends if has items

                        ++self.complete;

                        callbackPageDone(self.complete, total, page);


                        // If all done, put pages in order and combine all
                        // text, then pass that to the callback
                        if (self.complete == total) {
                            // Using `setTimeout()` isn't a stable way of making sure 
                            // the process has finished. Watch out for missed pages.
                            // A future version might do this with promises.
                            setTimeout(function() {
                                let full_text = "";
                                let num_pages = Object.keys(pages).length;
                                for (let pageNum = 1; pageNum <= num_pages; pageNum++)
                                    full_text += pages[pageNum];
                                callbackAllDone(full_text);
                            }, 1000);
                        }
                    }); // ends page.getTextContent().then
                }); // ends page.then
            } // ends for every page
        });
    }; // Ends pdfToText()

    return self;
}; // Ends object factory

Run in the terminal:

node myPDFfileToText.js
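
On the timer: each page's text is stored before the completion counter is incremented, so by the time the counter reaches the total every page should already be in the pages object, and the final assembly could run immediately. The assembly step can be sketched as a pure function. The name joinPagesInOrder is hypothetical, and filling a missing page with an empty string is a choice made for this example, not something the answer does.

```javascript
// Hypothetical helper: builds the full text from a { pageNumber: text } map.
function joinPagesInOrder(pagesByNumber, total) {
  let fullText = '';
  for (let n = 1; n <= total; n++) {
    // A page whose text was never stored contributes an empty string
    // instead of the literal "undefined".
    fullText += pagesByNumber[n] !== undefined ? pagesByNumber[n] : '';
  }
  return fullText;
}

// Pages stored out of order, as happens when promises settle unevenly
const stored = {};
stored[2] = 'second\n\n';
stored[1] = 'first\n\n';
console.log(joinPagesInOrder(stored, 2));
```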

1 Comment

"Cannot set property 'complete' of undefined"
2

Updated 02/2021

<script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
    <script>
    
function Pdf2TextClass(){
    var self = this;
    this.complete = 0;

    this.pdfToText = function(data, callbackPageDone, callbackAllDone){
    console.assert( data  instanceof ArrayBuffer  || typeof data == 'string' );
    var loadingTask = pdfjsLib.getDocument(data);
    loadingTask.promise.then(function(pdf) {


    var total = pdf.numPages; // public property; avoids the private _pdfInfo field
    //callbackPageDone( 0, total );        
    var layers = {};        
    for (var i = 1; i <= total; i++){
       pdf.getPage(i).then( function(page){
       var n = page.pageNumber;
       page.getTextContent().then( function(textContent){
       
       //console.log(textContent.items[0]);0
         if( null != textContent.items ){
           var page_text = "";
           var last_block = null;
           for( var k = 0; k < textContent.items.length; k++ ){
               var block = textContent.items[k];
               if( last_block != null && last_block.str[last_block.str.length-1] != ' '){
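                   // Current pdf.js text items expose their position through `transform`
                   // (index 4: horizontal offset, index 5: vertical offset), not `x`/`y`.
                   if( block.transform[4] < last_block.transform[4] )
                       page_text += "\r\n"; 
                   else if ( last_block.transform[5] != block.transform[5] && ( last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null ))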
                       page_text += ' ';
               }
               page_text += block.str;
               last_block = block;
           }

           textContent != null && console.log("page " + n + " finished."); //" content: \n" + page_text);
           layers[n] =  page_text + "\n\n";
         }
         ++ self.complete;
         //callbackPageDone( self.complete, total );
         if (self.complete == total){
           window.setTimeout(function(){
             var full_text = "";
             var num_pages = Object.keys(layers).length;
             for( var j = 1; j <= num_pages; j++)
                 full_text += layers[j] ;
             console.log(full_text);
           }, 1000);              
         }
       }); // end  of page.getTextContent().then
     }); // end of page.then
   } // of for
 });
}; // end of pdfToText()
}; // end of class
var pdff = new Pdf2TextClass();
pdff.pdfToText('PDF_URL');
    </script>

Comments

1

@SchizoDuckie's solution, made shorter:

import { getDocument as loadPdf } from 'pdfjs-dist';

...

async function pdfToTxt(file: File): Promise<string> {

  const pdf = await loadPdf(await file.arrayBuffer()).promise;

  return Promise.all([...Array(pdf.numPages).keys()]
    .map(async num => (await (await pdf.getPage(num + 1)).getTextContent())
      .items.map(item => (<any>item).str).join(' ')))
    .then(pages => pages.join('\n'));

}

4 Comments

I needed to ' import "pdfjs-dist/build/pdf.worker.entry" ' to work. github.com/mozilla/pdf.js/issues/10478#issuecomment-1560704162. Thank you Tom for your solution.
I get the error "Attempted import error: 'getDocument' is not exported from 'pdfjs-dist' (imported as 'loadPdf').". I am using next.js app router and this is in a api route. The d.ts file for the library shows getDocument being exported. I don't know why it's not working. Found related issue: github.com/vercel/next.js/issues/58313
@tom any chance you can provide an example of how you are reading and passing the pdf file into the pdftoTxt() function?
There is a type="file" <input> field on the page. In its onchange event you can access its files property, which is of FileList type. Calling item(0) on this file list gives you the file, which you can pass to the function.
0
npm install pdf-parse

required file:

/node_modules/pdf-parse/lib/pdf.js/v2.0.550/build/pdf.js

that loads:

pdf.worker.js

usage:

var pdf = await pdfjsLib.getDocument({ data: new Uint8Array(buffer) }).promise;
var numPages = pdf.numPages;
var texts = [];

for (let i = 1; i <= numPages; i++) {
    let page = await pdf.getPage(i);
    let textContent = await page.getTextContent();
    let textItems = textContent.items;
    let pageText = textItems.map(item => item.str).join(" ").replace(/\s+/g," ");
    texts.push(pageText);
}

console.log(texts); 

buffer can come from:

var input = document.querySelector("input[type=file]");
input.onchange = function () {
    var file = this.files[0];
    var reader = new FileReader();
    reader.onload = async () => {
        var buffer = reader.result;
        // use buffer
    }

    if (file && file.type == "application/pdf") {
        reader.readAsArrayBuffer(file);
    } 

}
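
The join(" ") followed by replace(/\s+/g, " ") step in the usage code is easy to try without a PDF. A small sketch on mock items follows. The helper name pageTextFromItems is made up, and the final trim() is an addition for this example that the snippet above does not perform.

```javascript
// Hypothetical helper: flattens text items into one whitespace-normalized string.
function pageTextFromItems(items) {
  return items
    .map(function (item) { return item.str; })
    .join(' ')
    .replace(/\s+/g, ' ') // collapse runs of spaces, tabs, and newlines
    .trim();              // drop leading and trailing whitespace
}

// Mock items standing in for textContent.items
const mockItems = [{ str: 'Total:  ' }, { str: '' }, { str: ' 42\n' }];
console.log(pageTextFromItems(mockItems));
```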

Comments

0

I used pdf.js-extract, and this function helped me extract text from an array of files:

static getText = async (files) => {
    try {
      const PDFExtract = require("pdf.js-extract").PDFExtract;
      const pdfExtract = new PDFExtract();
      const options = {};

      const texts = await Promise.all(
        files.map(async (file) => {
          try {
            const data = await pdfExtract.extract(file.path, options);
            return data.pages
              .flatMap((page) => page.content.map((item) => item.str))
              .join(" ");
          } catch (error) {
            console.error(
              `Error extracting PDF text from ${file.path}:`,
              error
            );
            return ""; // Return empty string for failed extractions
          }
        })
      );

      return texts.join(" ");
    } catch (error) {
      console.error("Error in getText function:", error);
      throw error;
    }
  };

Comments

-1

The Google APIs have shifted considerably in the last few years, rendering a lot of older posts on this subject maddeningly obsolete for Google Apps Script (GAS), and confusing all the AI chatbots in the process. Below is a full working example as of October 2024. Just sign into your Google account, go to https://script.new, paste the code below, configure it, and click Run (you'll need to add the advanced Drive service and authorize it). The code below gets you the PDF content as text; to go further and extract keywords to a Google spreadsheet, see here with more documentation. Either way, you may need to process in batches to avoid timeouts if you have a lot of content to parse.

function convertPDFsInFolderToText(folderId) {
  const folder = DriveApp.getFolderById(folderId);
  const files = folder.getFiles();
  let allTextContent = "";

  while (files.hasNext()) {
    const pdfFile = files.next();
    try {
      const textContent = convertSinglePDFToText(pdfFile.getId());
      allTextContent += `\n=== Start of ${pdfFile.getName()} ===\n`;
      allTextContent += textContent;
      allTextContent += `\n=== End of ${pdfFile.getName()} ===\n`;
    } catch (error) {
      Logger.log(`Failed to process file ${pdfFile.getName()}: ${error.message}`);
    }
  }
  return allTextContent;
}

// Function to convert a single PDF to text using OCR with detailed logging
function convertSinglePDFToText(fileId) {
  Logger.log(`Starting conversion for file ID: ${fileId}`);

  try {
    const pdfFile = DriveApp.getFileById(fileId);
    const fileBlob = pdfFile.getBlob();

    const fileMetadata = {
      name: pdfFile.getName().replace(/\.pdf$/, ''),
      mimeType: 'application/vnd.google-apps.document'
    };

    const options = {
      ocr: true,
      ocrLanguage: "en",
      fields: 'id, name'
    };

    Logger.log(`Converting PDF to Google Docs using OCR...`);
    const response = Drive.Files.create(fileMetadata, fileBlob, options);
    const { id } = response;
    Utilities.sleep(10000); // Wait 10 seconds for OCR processing

    const doc = DocumentApp.openById(id);
    const textContent = doc.getBody().getText();

    DriveApp.getFileById(id).setTrashed(true);  // Optionally delete the temporary Google Document

    Logger.log(`Text successfully extracted from Google Document: ${textContent.length} characters`);
    return textContent;

  } catch (error) {
    Logger.log(`Error during PDF-to-text conversion: ${error.message}`);
    throw error;
  }
}

Comments

-2

For all the people who actually want to use it on a node server:

/**
 * Created by velten on 25.04.16.
 */
"use strict";
let pdfUrl = "http://example.com/example.pdf";
let request = require('request');
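let PDFParser = require('pdf2json');
// pdf2json exports a constructor; pipe() needs a parser instance, not the class itself
let pdfParser = new PDFParser();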

let pdfPipe = request({url: pdfUrl, encoding:null}).pipe(pdfParser);

pdfPipe.on("pdfParser_dataError", err => console.error(err) );
pdfPipe.on("pdfParser_dataReady", pdf => {
    //optionally:
    //let pdf = pdfParser.getMergedTextBlocksIfNeeded();

    let count1 = 0;
    //get text on a particular page
    for (let page of pdf.formImage.Pages) {
        count1 += page.Texts.length;
    }

    console.log(count1);
    pdfParser.destroy();
});
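
The loop above only counts text blocks. To get at the text itself, a sketch like the following could walk the same Pages and Texts structure. It assumes that each entry in Texts has an R array of runs whose T field holds URI-encoded text; please check this against the output of the pdf2json version you have installed. The helper name pagesToText is hypothetical and the input is a hand-written mock, not real parser output.

```javascript
// Hypothetical helper: turns a pdf2json-style Pages array into plain text.
// Assumed shape: Pages[].Texts[].R[].T, with T URI-encoded.
function pagesToText(pages) {
  return pages.map(function (page) {
    return page.Texts.map(function (block) {
      return block.R.map(function (run) {
        // decodeURIComponent throws on malformed input such as a lone '%'
        return decodeURIComponent(run.T);
      }).join('');
    }).join(' ');
  }).join('\n');
}

// Hand-written mock of the assumed structure
const mockPages = [
  { Texts: [{ R: [{ T: 'Hello' }] }, { R: [{ T: 'big%20world' }] }] },
  { Texts: [{ R: [{ T: 'Page%202' }] }] },
];
console.log(pagesToText(mockPages));
```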

2 Comments

"dest.on is not a function"
@BartusZak foo.bar is also not a function ;)
-3

It is possible, but:

  • you would have to use the server anyway; there's no way you can get the content of a file on the user's computer without transferring it to the server and back
  • I don't think anyone has written such a library yet

So if you have some free time you can learn the PDF format and write such a library yourself, or you can just use a server-side library, of course.

Comments
