How can I convert a scanned PDF with OCRed text to one without OCRed text?

Question

I have a scanned PDF file, with low-quality OCRed text.

I would like to have a PDF file without the OCRed text.

How can I convert a scanned PDF with OCRed text into one without OCRed text?

I am looking for ways to recover the original scanned PDF file, as it was before OCR, as closely as possible, without changing the width and height of each page in pixels and without changing the pixels per inch of each page.

Would rasterizing the file again help? Would rasterizing again lose image quality?

Several attempts:

Using print to file in Evince, which I think uses cups-pdf, doesn't remove the OCRed text.
The following gs command doesn't remove the OCRed text either (I think I haven't found out how to use gs properly):
```
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
   -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf toc.pdf
```

Your example PDF file is a 404 error.
– Nathaniel M. Beaver, Commented Nov 8, 2019 at 22:06

Community · Accepted Answer · 2017-05-23 12:40:02Z

Here is how I would remove the OCR-ed text should I have to...

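First, you need to know that OCR-ed text in a PDF is not a layer, but text drawn in a special text rendering mode. The official PDF specification defines eight such modes: 0 (fill), 1 (stroke), 2 (fill, then stroke), 3 (neither fill nor stroke, i.e. invisible), and 4 to 7, which repeat these four while also adding the text to the clipping path.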

For more background, please see these answers of mine on StackOverflow:

Now for the procedure I envisage:

0. Make a backup of your original PDF file

'nuff said...

1. Use `qpdf` to un-compress most of the PDF objects

qpdf is a beautiful command line tool to transform most PDFs into a form that makes it easier to manipulate through a text editor (or through sed):

qpdf                       \
  --qdf                    \
  --object-streams=disable \
    input.pdf              \
    editable.pdf

2. Search for spots where PDF code contains `3 Tr`

Every spot in editable.pdf that contains 'invisible' (i.e. neither filled nor stroked) text is marked by an initial definition of

3 Tr

Change these to now read

1 Tr

This should make the previously hidden text visible. Glyphs will appear in thick outlines, overlaying the original scanned page images.

It will look very ugly.

Save the edited PDF.
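The replacement in this step could also be scripted instead of done by hand. The following is a minimal sketch and not part of the original procedure: `show_hidden_text` is a hypothetical helper name, and it assumes that the QDF file separates operators by whitespace and that your `sed` supports `-E`. Since `3 Tr` and `1 Tr` have the same length, the byte count stays unchanged.

```shell
# Hypothetical helper: switch text rendering mode 3 (invisible) to 1 (stroke).
# Reads PDF source on stdin and writes the edited source to stdout.
show_hidden_text() {
  # LC_ALL=C makes sed treat the input as raw bytes.
  # "3 Tr" is matched only as a whole token, so "13 Tr" is left alone.
  LC_ALL=C sed -E 's/(^|[[:space:]])3 Tr([[:space:]]|$)/\11 Tr\2/g'
}

# Usage (file names are examples):
#   show_hidden_text < editable.pdf > visible.pdf
```

A binary stream that happens to contain the same byte sequence would be altered as well, which is one more reason to keep the backup from step 0.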

3. Change `Tj` and `TJ` text stroking operators to 'no-ops'

Whenever a text string is prepared for being rendered, the actual operator that is responsible for doing so is named Tj or TJ.

Look out for all of these. Replace Tj with tJ and TJ with tj. This turns them into 'no-ops': they have no meaning at all in the PDF source code; no PDF viewer or processor will "understand" them. (Be careful not to change the number of bytes when replacing anything in PDF source code, because otherwise you may corrupt the file.)

Save the PDF file.
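For a long document, the renaming could be done with sed rather than a text editor. This is only a sketch under assumptions: `disable_text_ops` is a hypothetical helper name, the operators are assumed to be preceded by whitespace (or to start a line) in the QDF file, and `sed -E` must be available. Swapping the case of both letters keeps every replacement at two bytes.

```shell
# Hypothetical helper: rename the text-showing operators so that they
# become unknown tokens (Tj -> tJ, TJ -> tj). Byte count is preserved.
disable_text_ops() {
  LC_ALL=C sed -E \
    -e 's/(^|[[:space:]])Tj([[:space:]]|$)/\1tJ\2/g' \
    -e 's/(^|[[:space:]])TJ([[:space:]]|$)/\1tj\2/g'
}

# Usage (file names are examples):
#   disable_text_ops < editable.pdf > editable-noops.pdf
```

An operator written directly after a closing delimiter, such as `(text)Tj`, would not be caught by this pattern, so check the result as described in the next step.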

4. Check how the PDF file looks now

The PDF should now look "clean" again. The renamed text operators do not have any meaning any more for the PDF viewer, nor for any PDF interpreter.

5. Use Ghostscript to create the final PDF

This command should achieve what you want:

gs                        \
  -o final.pdf            \
  -sDEVICE=pdfwrite       \
  -dPDFSETTINGS=/prepress \
   editable.pdf

This final step uses editable.pdf as input and outputs final.pdf. The output will have all traces of text removed. The input still contained the text, albeit in an "unusable" form because of the operator renaming. Since Ghostscript does not "understand" the renamed operators, it simply skips them by default.

Thanks. (1) What does the final step actually do? Does it undo some previous step? (2) Can your way remove OCRed text from any pdf file? (3) can a pdf file processed by your way be OCRed again by Adobe Acrobat (which can't OCR a pdf file which has been OCRed)?
– Tim, Commented Jan 28, 2015 at 23:18
@Tim: (2) This method can be used to remove OCRed text from any PDF file. -- (3) Yes, the resulting PDF file can be OCRed again by Adobe Acrobat.
– Kurt Pfeifle, Commented Jan 28, 2015 at 23:33
Thanks. (1) which option to gs will remove the unusable text?
– Tim, Commented Jan 28, 2015 at 23:44
@Tim: GS requires no special option to remove the unusable text. Since it doesn't understand it, it simply skips these sections.
– Kurt Pfeifle, Commented Jan 29, 2015 at 0:17
(1) Does "The output will have removed all traces of text. The input still had the text, it was just unusable" mean the command by gs removes the unusable text? (3) Is it good to keep unusable text in a pdf file? If not, how can I remove the unusable text?
– Tim, Commented Jan 29, 2015 at 0:23

Anthon · Accepted Answer · 2014-12-07 13:07:06Z

5

There are multiple ways to get rid of the OCRed text in the file.

Export the scanned images from the PDF and recombine them. You can use pdfimages for the extraction (from the poppler-utils package) and convert (from imagemagick) to convert them back:
```
pdfimages toc.pdf toctmp
convert toctmp*.pbm newtoc.pdf
```
Print to PDF (with PDF support from cups-pdf)

PDF is a horrible format for scanned images, but it is often used because it can include multiple pages in one file. The storage format inside it, however, is often JPEG, which is inappropriate for scans. Recovering the original images (there is no such thing as the original scanned PDF file) from the PDF is probably not possible, because making the PDF from the scanned images is most often the quality-reducing step after scanning. You can try to get the images out of the PDF with pdfimages (or pdftoppm), but OCR software that works on images in PDFs already knows how to get the best (only) quality images out of them, so it is unlikely that you can improve on that.

The problem probably lies with your scanning software, not with the OCR software. If you still have the original material, scan it once more to a multipage TIFF (LZW compressed); that gives much better OCR results than anything that was converted to a PDF containing JPEG images.

edited Dec 7, 2014 at 13:07

answered Dec 7, 2014 at 12:43

Anthon


Thanks. (1) I use the print to file in Evince, which I think uses cups-pdf, it doesn't remove OCRed text. (2) But can the ways of yours recover the original scanned pdf file before OCR as much as possible, without changing the width and height of each page in pixels, and without changing the PPI or DPI (pixels per inch) of each page?
– Tim, Commented Dec 7, 2014 at 12:46
@Tim Evince probably tries to be smart and include the OCRed text; I did not know it could do so.
– Anthon, Commented Dec 7, 2014 at 12:51
I don't have a scanner, and am only given the scanned pdf file with OCRed text. (see the link to the file in my first sentence, and my attempt at the end of my post)
– Tim, Commented Dec 7, 2014 at 12:56
@Tim Assuming that you want to OCR the scans again, can't you use the PDF as input and tell the software to redo the scan (and try harder)?
– Anthon, Commented Dec 7, 2014 at 12:59
(1) The OCR software I have doesn't remove existing OCRed text, and re-OCR will result in both old and new OCRed text. (2) In Windows, Adobe Pro doesn't re-OCR a pdf file with OCRed text.
– Tim, Commented Dec 7, 2014 at 13:01


Kurt Pfeifle · Accepted Answer · 2015-01-28 23:28:44Z

When I tried to access the link to your sample scanned file earlier, it didn't work for me. However, meanwhile I downloaded it, and had a closer look.

1. Using `pdfimages -list` to investigate the embedded images

If you run a recent (!) version of the Poppler variant of pdfimages, you'll have the -list parameter available. This parameter prints a useful list of images contained in your PDF file. The most recent versions also will tell you some additional info (like image resolution and compression ratio), which were not so easily available before.

Unfortunately, your PDF file contains some syntax errors, which give this garbled output:

kp@mbp:#175536> pdfimages -l 1 -list toc.pdf
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image  2000  2650  icc     1   1  jbig2  no       51 0   300   300 12.4K 1.9%

So let's redirect <stderr> output to /dev/null and try again:

kp@mbp:#175536> pdfimages -list toc.pdf 2>/dev/null
page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
   1   0 image  2000  2650  icc     1   1  jbig2  no       51 0   300   300 12.4K 1.9%
   2   1 image  2012  2659  icc     1   1  jbig2  no      616 0   300   301 16.1K 2.5%
   3   2 image  2014  2661  icc     1   1  jbig2  no      696 0   301   300 16.0K 2.4%
   4   3 image  2000  2650  icc     1   1  jbig2  no      778 0   300   300 16.2K 2.5%
   5   4 image  2000  2650  icc     1   1  jbig2  no      855 0   300   300 16.2K 2.5%
   6   5 image  2000  2650  icc     1   1  jbig2  no      938 0   300   300 15.7K 2.4%
   7   6 image  2000  2650  icc     1   1  jbig2  no     1026 0   300   300 15.5K 2.4%
   8   7 image  2022  2667  icc     1   1  jbig2  no     1103 0   300   300 15.7K 2.4%
   9   8 image  2000  2650  icc     1   1  jbig2  no     1190 0   300   300 15.5K 2.4%
  10   9 image  2011  2658  icc     1   1  jbig2  no     1271 0   300   301 15.7K 2.4%
  11  10 image  2000  2650  icc     1   1  jbig2  no     1347 0   300   300 15.7K 2.4%
  12  11 image  2010  2657  icc     1   1  jbig2  no     1429 0   300   300 15.5K 2.4%
  13  12 image  2000  2650  icc     1   1  jbig2  no     1504 0   300   300 16.8K 2.6%
  14  13 image  2000  2650  icc     1   1  jbig2  no     1589 0   300   300 15.4K 2.4%
  15  14 image  2000  2650  icc     1   1  jbig2  no     1666 0   300   300 17.6K 2.7%
  16  15 image  2010  2657  icc     1   1  jbig2  no     1740 0   300   300 18.7K 2.9%
  17  16 image  2006  2654  icc     1   1  jbig2  no     1823 0   300   301 17.7K 2.7%
  18  17 image  2007  2656  icc     1   1  jbig2  no     1905 0   300   300 16.9K 2.6%
  19  18 image  2000  2650  icc     1   1  jbig2  no     1983 0   300   300 16.7K 2.6%
  20  19 image  2000  2650  icc     1   1  jbig2  no     2065 0   300   300 17.4K 2.7%
  21  20 image  2000  2650  icc     1   1  jbig2  no     2148 0   300   300 17.4K 2.7%
  22  21 image  2011  2658  icc     1   1  jbig2  no     2229 0   300   301 17.2K 2.6%
  23  22 image  2006  2654  icc     1   1  jbig2  no     2305 0   300   301 17.5K 2.7%
  24  23 image  2000  2650  icc     1   1  jbig2  no     2377 0   300   300 14.5K 2.2%

This output means:

24 images (numbered 0--23) on 24 pages (each page 1 image).
All images have very similar dimensions (width/height) and a resolution of 300 PPI.
All images use the same compression method, JBIG2.

These results give me the confidence to suggest a different method to remove the OCR-ed text from your PDF:

Extract all images.
Create a new PDF from these images.
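The conclusions drawn from the listing above can also be checked mechanically, which helps with longer documents. This is a rough sketch, not something the answer itself used: `summarize_image_list` is a hypothetical helper name, and it assumes the column layout shown above (type in column 3, encoding in column 9, x-ppi and y-ppi in columns 13 and 14).

```shell
# Hypothetical helper: condense `pdfimages -list` output (read from stdin)
# into one line: image count, page count, encodings, PPI range.
summarize_image_list() {
  awk '
    function track(v) {          # remember the smallest and largest PPI seen
      v += 0
      if (n == 0 || v < min) min = v
      if (n == 0 || v > max) max = v
      n++
    }
    $3 == "image" {              # skips the header and separator lines
      images++
      if (!($1 in pages)) { pages[$1] = 1; npages++ }
      if (!($9 in encs))  { encs[$9] = 1; enclist = enclist (enclist == "" ? "" : ",") $9 }
      track($13); track($14)
    }
    END {
      printf "images=%d pages=%d enc=%s ppi=%d-%d\n", images, npages, enclist, min, max
    }
  '
}

# Usage:
#   pdfimages -list toc.pdf 2>/dev/null | summarize_image_list
```

If the image count equals the page count and only one encoding is reported, the extract-and-rebuild approach is likely to apply.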

2. Extract all images from PDF

If you have one of the most recent Poppler versions of pdfimages, you are able to extract the images in the JBIG2 compression:

pdfimages -jbig2 toc.pdf toc--

The resulting image files will be named toc---000.jb2e, toc---001.jb2e, ... (suffix .jb2e). Each of these files should be accompanied by another one named toc---000.jb2g, toc---001.jb2g, ... (suffix .jb2g).
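To verify that every extracted .jb2e file really has its .jb2g companion, a small check like the following could be used. It is a sketch only: `check_jbig2_pairs` is a hypothetical helper name, and it takes the same file name prefix that was passed to pdfimages.

```shell
# Hypothetical helper: report .jb2e files that lack a matching .jb2g file.
# Returns 0 if all pairs are complete, 1 otherwise.
check_jbig2_pairs() {
  prefix=$1
  missing=0
  for e in "$prefix"*.jb2e; do
    [ -e "$e" ] || continue        # the glob matched nothing
    g="${e%.jb2e}.jb2g"            # swap the suffix
    if [ ! -e "$g" ]; then
      echo "missing: $g"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage:
#   check_jbig2_pairs toc--
```

A missing companion may simply mean that the image has no separate globals segment, so treat the report as a hint rather than an error.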

If you do not get .jb2e images as a result, but .pbm instead, you'll have to use ImageMagick's convert to create JPEGs:

for i in toc--*.pbm; do
  convert "$i" "${i%.pbm}.jpg"
done

However, the JPEG images will be much bigger than the JBIG2 ones. (I tried it: JPEGs are in total 15 MByte, PBMs are in total 15 MBytes, JBIG2 are in total 436 kBytes for the 24 images!)

3. Create a new PDF from the extracted images

If you were unlucky and had to convert to JPEG, you can now convert these to a PDF:

convert toc--*.jpg -density 300 out.pdf

Voilà, you now have a 15 MByte PDF file without the OCR-ed text, where before you had a 1.6 MByte PDF file with OCR-ed text! (But you will not have lost much of the previous quality...)

Since my own pdfimages is compiled from sources, I from time to time suffer from a bug with it. Right now it does not correctly extract images as JBIG2 files. That's why I cannot create a PDF from them either. But this PDF's size would be similar to the original toc.pdf's size....

Just FYI, one could convert pbm files to png (or run a Poppler version of pdfimages with -png), then use agl/jbig2enc (generates jbig2 with globals), then use pdf.py (in that project) to create a pdf. I know this works if the pdf is made up exclusively of jbig2 images, one per page.
– labreuer, Commented Nov 1, 2016 at 1:28
@labreuer: Just FYI, going the PNG route does not offer any advantages IMHO. If it does, please explain to me: which? Because PNG typically is larger than JPEG, so the disadvantages I clearly outlined (file size of new PDF sans OCR) would be even worse...
– Kurt Pfeifle, Commented Nov 1, 2016 at 13:13
False in this case: when I switched your code to png, the resultant pdf I got was 1.97MB. You'd probably be right if we weren't dealing with bitonal images of text; png compresses those quite well. But it's also irrelevant, because I was only using png as an intermediary to jbig2. I knew I could do this, because your pdfimages -list results showed that all the images were jbig2.
– labreuer, Commented Nov 1, 2016 at 17:06
@labreuer: Interesting. Thanks for checking this. I'll investigate this some more (and probably update my answer, giving you credit) once I find the time to do it.
– Kurt Pfeifle, Commented Nov 1, 2016 at 17:08
Sure; it was quite easy, as I'm in the process of recompressing various PDFs into lossy jbig2 with globals using iTextSharp. When I applied that to the above pdf generated from pngs, I got a 264KB pdf. (For others who happen along, I may open source the resultant C# project at some point in the future. I have to decide how deeply to dive into understanding pdfs.)
– labreuer, Commented Nov 1, 2016 at 17:51

hife · Accepted Answer · 2023-11-21 20:16:23Z

While

gs -o output.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf

is now a simple solution, there can be an issue with embedded JBIG2 images: gs re-encodes them, and they can increase in size.
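Whether the size increase matters for a given file is easy to measure after running gs. The helper below is a small sketch for that purpose; `size_change` is a hypothetical name, and it only compares byte counts of two existing files.

```shell
# Hypothetical helper: print the size of two files and their ratio in percent.
size_change() {
  before=$(wc -c < "$1") || return 1
  after=$(wc -c < "$2") || return 1
  # Arithmetic expansion also strips the padding some wc versions add.
  before=$((before))
  after=$((after))
  [ "$before" -gt 0 ] || return 1
  echo "$before -> $after bytes ($((after * 100 / before))%)"
}

# Usage:
#   size_change input.pdf output.pdf
```

A result well above 100% suggests that one of the options below may be the better choice for that file.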

Below are two options without ghostscript and without bursting and recompiling the entire pdf (posted here before).

In particular, they preserve bookmarks and page numbers.

1) A command line solution

following the current top reply, but avoiding the ghostscript step at the end:

Back up your pdf.
Decompress your pdf with qpdf (or pdftk)
```
qpdf --qdf --object-streams=disable input.pdf editable.pdf
```
This creates a pdf file in qdf mode, readable in text editors (that can handle large files).
Remove all lines ending with Tj or TJ in a text editor or via sed:
```
sed '/T[Jj]$/d' ./editable.pdf > editable-no-text.pdf
```
Those are the pdf commands that render text strings.

This will leave behind further placement commands like Tm and Td that are related to positioning on the page and Tr that determines the display style of the text. These do not contain any text themselves and don't take up as much space. You may remove them as well via:
```
sed '/T[Jjdmr]$/d' ./editable.pdf > editable-no-text.pdf
```
I have not had any negative side effects, but check the result before proceeding.
Check that editable-no-text.pdf looks like it's supposed to.
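One way to check the result without opening a viewer is to count the text-showing lines before and after the sed step. This is a minimal sketch; `count_text_ops` is a hypothetical helper name, and it uses the same pattern as the sed command above.

```shell
# Hypothetical helper: count lines that end with Tj or TJ in a QDF file.
count_text_ops() {
  # -a: treat the file as text even though it may contain binary image data
  # -c: print the number of matching lines
  # grep exits non-zero when the count is 0, hence the "|| true"
  LC_ALL=C grep -ac 'T[Jj]$' "$1" || true
}

# Usage:
#   count_text_ops editable.pdf          # some positive number
#   count_text_ops editable-no-text.pdf  # should print 0
```

Deleting lines changes stream lengths inside the QDF file. qpdf ships a `fix-qdf` tool intended for repairing hand-edited QDF files; whether it is needed here is not stated in this answer, so watch for warnings in the recompression step.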

Recompress your pdf:

qpdf --compress-streams=y --object-streams=generate editable-no-text.pdf final.pdf

2) A GUI solution

I used this before discovering the above. It is simpler, but more work with longer pdf files. I also assume it is safer, but you should have backups anyway.

Use Master PDF Editor (use version 4 from the end of that page, as the current version 5 has a lot of locked functions).

You can set it to select only text objects and then just select everything with Ctrl+A and remove with Del. Unfortunately, you have to do this for every page, so I would just cycle through Ctrl+A, Del, Page down.

While this is not properly scriptable, you could probably bodge it using xdotool.

Eduard Florinescu · Accepted Answer · 2018-02-08 00:08:30Z

0

Best way I found for quality and multilayered pdfs is to use inkscape and img2pdf. I made this quick bash script:

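#!/bin/bash
# Usage: pass the PDF file as the first argument.
mkdir "${1}_temp"
cp "$1" "${1}_temp"/to_do.pdf
cd "${1}_temp" || exit 1
# Split the PDF into one file per page
pdftk to_do.pdf burst output pg_%04d.pdf
# Render each page to PNG at 300 DPI (-z and --export-png are Inkscape 0.9x options)
ls ./pg*.pdf | xargs -I {} inkscape {} -z --export-dpi=300 --export-area-drawing --export-png={}.png
rm *.pdf
# Convert each PNG to JPEG
ls ./p*.png | xargs -I {} convert {} -quality 100 -density 300 {}.jpg
rm *.png
# Wrap each JPEG in a single-page PDF
ls -1 ./*jpg | xargs -I {} img2pdf {} -o {}.pdf
rm *.jpg
# Merge the pages again
pdftk *.pdf cat output combined.pdf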

edited Feb 8, 2018 at 0:08

answered Feb 8, 2018 at 0:02

Eduard Florinescu
