When I first tried to open the link to your sample scanned file, it didn't work for me. In the meantime I managed to download it and took a closer look.
1. Using pdfimages -list to investigate the embedded images
If you run a recent (!) version of the Poppler variant of pdfimages, you'll have the -list parameter available. This parameter prints a useful list of the images contained in your PDF file. The most recent versions also report some additional info (such as image resolution and compression ratio) that was not so easily available before.
Unfortunately, your PDF file contains some syntax errors, which give this garbled output:
kp@mbp:#175536> pdfimages -l 1 -list toc.pdf
Syntax Warning: Couldn't link the profiles
Syntax Warning: Can't create transform
Syntax Warning: Couldn't link the profiles
Syntax Warning: Can't create transform
Syntax Warning: Couldn't link the profiles
Syntax Warning: Can't create transform
Syntax Warning: Couldn't link the profiles
Syntax Warning: Can't create transform
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
1 0 image 2000 2650 icc 1 1 jbig2 no 51 0 300 300 12.4K 1.9%
So let's redirect <stderr> output to /dev/null and try again:
kp@mbp:#175536> pdfimages -list toc.pdf 2>/dev/null
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
1 0 image 2000 2650 icc 1 1 jbig2 no 51 0 300 300 12.4K 1.9%
2 1 image 2012 2659 icc 1 1 jbig2 no 616 0 300 301 16.1K 2.5%
3 2 image 2014 2661 icc 1 1 jbig2 no 696 0 301 300 16.0K 2.4%
4 3 image 2000 2650 icc 1 1 jbig2 no 778 0 300 300 16.2K 2.5%
5 4 image 2000 2650 icc 1 1 jbig2 no 855 0 300 300 16.2K 2.5%
6 5 image 2000 2650 icc 1 1 jbig2 no 938 0 300 300 15.7K 2.4%
7 6 image 2000 2650 icc 1 1 jbig2 no 1026 0 300 300 15.5K 2.4%
8 7 image 2022 2667 icc 1 1 jbig2 no 1103 0 300 300 15.7K 2.4%
9 8 image 2000 2650 icc 1 1 jbig2 no 1190 0 300 300 15.5K 2.4%
10 9 image 2011 2658 icc 1 1 jbig2 no 1271 0 300 301 15.7K 2.4%
11 10 image 2000 2650 icc 1 1 jbig2 no 1347 0 300 300 15.7K 2.4%
12 11 image 2010 2657 icc 1 1 jbig2 no 1429 0 300 300 15.5K 2.4%
13 12 image 2000 2650 icc 1 1 jbig2 no 1504 0 300 300 16.8K 2.6%
14 13 image 2000 2650 icc 1 1 jbig2 no 1589 0 300 300 15.4K 2.4%
15 14 image 2000 2650 icc 1 1 jbig2 no 1666 0 300 300 17.6K 2.7%
16 15 image 2010 2657 icc 1 1 jbig2 no 1740 0 300 300 18.7K 2.9%
17 16 image 2006 2654 icc 1 1 jbig2 no 1823 0 300 301 17.7K 2.7%
18 17 image 2007 2656 icc 1 1 jbig2 no 1905 0 300 300 16.9K 2.6%
19 18 image 2000 2650 icc 1 1 jbig2 no 1983 0 300 300 16.7K 2.6%
20 19 image 2000 2650 icc 1 1 jbig2 no 2065 0 300 300 17.4K 2.7%
21 20 image 2000 2650 icc 1 1 jbig2 no 2148 0 300 300 17.4K 2.7%
22 21 image 2011 2658 icc 1 1 jbig2 no 2229 0 300 301 17.2K 2.6%
23 22 image 2006 2654 icc 1 1 jbig2 no 2305 0 300 301 17.5K 2.7%
24 23 image 2000 2650 icc 1 1 jbig2 no 2377 0 300 300 14.5K 2.2%
This output means:
- 24 images (numbered 0--23) on 24 pages (one image per page).
- All images have very similar dimensions (width/height) and a resolution of about 300 PPI (300 or 301 in each direction).
- All images use the same compression method, JBIG2.
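If you don't want to read such a table by eye, the same three observations can be pulled out of the -list output with a little awk. This is only a sketch: the function name summarize_image_list is my own invention, and the field positions (enc in field 9, x-ppi and y-ppi in fields 13 and 14) are taken from the output shown above, where the "objectID" heading covers two fields. Other pdfimages versions may lay out the columns differently.

```shell
# Summarize `pdfimages -list` output read from stdin.
# Assumed field layout (from the listing above): 1=page, 9=enc, 13=x-ppi, 14=y-ppi.
# Header, separator, and warning lines are skipped because field 1 is not a number.
summarize_image_list() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 16 {
        images++
        if (!($1 in pages)) { pages[$1] = 1; npages++ }
        if (!($9 in encs)) {
            encs[$9] = 1
            enclist = enclist (enclist == "" ? "" : ",") $9
        }
        # Track the smallest and largest PPI value seen in either direction.
        for (c = 13; c <= 14; c++) {
            v = $c + 0
            if (lo == "" || v < lo) lo = v
            if (hi == "" || v > hi) hi = v
        }
    }
    END {
        printf "images=%d pages=%d enc=%s ppi=%d-%d\n", images, npages, enclist, lo, hi
    }'
}
```

You would feed it the listing, for example: pdfimages -list toc.pdf 2>/dev/null | summarize_image_list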
These results give me confidence to suggest a different method to remove the OCR-ed text from your PDF:
- Extract all images.
- Create a new PDF from these images.
2. Extract all images from PDF
If you have one of the most recent Poppler versions of pdfimages, you are able to extract the images in the JBIG2 compression:
pdfimages -jbig2 toc.pdf toc--
The resulting image files will carry the file names toc---000.jb2e, toc---001.jb2e, ... (suffix .jb2e). Each of these files should have another one with it, named toc---000.jb2g, toc---001.jb2g, ... (suffix .jb2g).
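A quick way to check that pairing is to loop over the .jb2e files and report any that have no .jb2g next to them. This is a rough sketch that relies only on the naming scheme described above; the function name missing_globals is hypothetical, and I'm assuming both kinds of file land in the current directory.

```shell
# Print every <prefix>*.jb2e file that has no matching .jb2g companion.
# The argument is the output prefix that was passed to pdfimages (e.g. "toc--").
missing_globals() {
    prefix=$1
    for e in "$prefix"*.jb2e; do
        [ -e "$e" ] || continue            # the glob matched nothing
        g="${e%.jb2e}.jb2g"                # same base name, other suffix
        [ -e "$g" ] || printf '%s\n' "$e"
    done
}
```

Running missing_globals toc-- after the extraction should print nothing if every image got its companion file.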
If you do not get .jb2e images as a result, but .pbm instead, you'll have to use ImageMagick's convert to create JPEGs:
for i in toc--*.pbm; do
    # Quote the names so that paths with spaces survive; swap .pbm for .jpg
    convert "$i" "${i%.pbm}.jpg"
done
However, the JPEG images will be much bigger than the JBIG2 ones. (I tried it: for the 24 images, the JPEGs total 15 MByte, the PBMs total 15 MByte, and the JBIG2 files total 436 kByte!)
3. Create a new PDF from the extracted images
If you were unlucky and had to convert to JPEG, you can now convert these to a PDF:
convert toc--*.jpg -density 300 out.pdf
Voilà! You now have a 15 MByte PDF file without the OCR-ed text, where before you had a 1.6 MByte PDF file with OCR-ed text. (But you won't have lost much of the previous quality...)
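To convince yourself that the text layer really is gone, you can compare what Poppler's pdftotext extracts from the old and the new file. The sketch below assumes pdftotext is installed (it ships alongside pdfimages); the helper name count_text_chars is my own. It counts the non-whitespace characters arriving on stdin, so the form feeds that pdftotext puts between pages don't count as text.

```shell
# Count the non-whitespace characters read from stdin.
# Spaces, newlines, and form feeds (page separators) are stripped first.
count_text_chars() {
    echo $(( $(tr -d '[:space:]' | wc -c) ))
}

# Intended use ("-" makes pdftotext write to stdout):
#   pdftotext toc.pdf - | count_text_chars    # original file: should be well above 0
#   pdftotext out.pdf - | count_text_chars    # image-only file: should be 0
```

If the second number is not 0, some text objects survived and the images were not the only content that ended up in out.pdf.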
Since my own pdfimages is compiled from source, I occasionally run into bugs with it. Right now it does not correctly extract images as JBIG2 files, which is why I cannot create a PDF from them either. But the size of such a PDF would be similar to that of the original toc.pdf.