Timeline for How can I convert a scanned PDF with OCRed text to one without OCRed text?
Current License: CC BY-SA 3.0
6 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| Nov 1, 2016 at 17:51 | comment | added | labreuer | Sure; it was quite easy, as I'm in the process of recompressing various PDFs into lossy jbig2 with globals using iTextSharp. When I applied that to the above pdf generated from pngs, I got a 264KB pdf. (For others who happen along, I may open source the resultant C# project at some point in the future. I have to decide how deeply to dive into understanding pdfs.) | |
| Nov 1, 2016 at 17:08 | comment | added | Kurt Pfeifle | @labreuer: Interesting. Thanks for checking this. I'll investigate this some more (and probably update my answer, giving you credit) once I find the time to do it. | |
| Nov 1, 2016 at 17:06 | comment | added | labreuer | False in this case: when I switched your code to png, the resultant pdf I got was 1.97MB. You'd probably be right if we weren't dealing with bitonal images of text; png compresses those quite well. But it's also irrelevant, because I was only using png as an intermediary to jbig2. I knew I could do this, because your `pdfimages -list` results showed that all the images were jbig2. | |
| Nov 1, 2016 at 13:13 | comment | added | Kurt Pfeifle | @labreuer: Just FYI, going the PNG route does not offer any advantages IMHO. If it does, please explain to me: which? Because PNG typically is larger than JPEG, so the disadvantages I clearly outlined (file size of new PDF sans OCR) would be even worse... | |
| Nov 1, 2016 at 1:28 | comment | added | labreuer | Just FYI, one could convert pbm files to png (or run a Poppler version of `pdfimages` with `-png`), then use agl/jbig2enc (generates jbig2 with globals), then use pdf.py (in that project) to create a pdf. I know this works if the pdf is made up exclusively of jbig2 images, one per page. | |
| Jan 28, 2015 at 23:28 | history | answered | Kurt Pfeifle | CC BY-SA 3.0 | |
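
The route labreuer describes in the comments (Poppler `pdfimages -png`, then agl/jbig2enc, then that project's pdf.py) could be sketched roughly as below. This is only a sketch under assumptions, not a tested recipe: the function names, the `page` prefix and the working-directory layout are made up for illustration; the `jbig2 -s -p -v` flags and the default `output` basename follow the jbig2enc README as I remember it; and running pdf.py with the current Python interpreter is a guess that may not hold for older, Python 2 copies of that script. As the comment says, it is only known to work when the PDF consists exclusively of one bitonal image per page.

```python
import subprocess
import sys
from pathlib import Path


def extract_cmd(src_pdf, prefix):
    """Poppler's pdfimages: write each embedded image as <prefix>-NNN.png."""
    return ["pdfimages", "-png", str(src_pdf), str(prefix)]


def encode_cmd(png_files):
    """jbig2enc: -s = symbol mode (shared globals), -p = PDF-ready output, -v = verbose."""
    return ["jbig2", "-s", "-p", "-v", *map(str, png_files)]


def assemble_cmd(pdf_py, basename="output"):
    """pdf.py from jbig2enc prints the wrapped PDF to stdout.

    Assumption: the script runs under the current interpreter.
    """
    return [sys.executable, str(pdf_py), basename]


def run_pipeline(src_pdf, out_pdf, workdir, pdf_py):
    """Hypothetical driver: extract PNGs, encode to JBIG2, wrap into a new PDF."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # Step 1: dump the page images (no OCR text layer comes along).
    subprocess.run(extract_cmd(Path(src_pdf).resolve(), "page"),
                   cwd=workdir, check=True)

    pngs = sorted(p.name for p in workdir.glob("page-*.png"))
    if not pngs:
        raise RuntimeError("pdfimages produced no PNG files")

    # Step 2: encode all pages together so they share one symbol dictionary.
    subprocess.run(encode_cmd(pngs), cwd=workdir, check=True)

    # Step 3: pdf.py writes the PDF to stdout, so redirect it into out_pdf.
    with open(out_pdf, "wb") as fh:
        subprocess.run(assemble_cmd(Path(pdf_py).resolve()),
                       cwd=workdir, stdout=fh, check=True)


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage (hypothetical): python strip_ocr.py in.pdf out.pdf /path/to/pdf.py
    run_pipeline(sys.argv[1], sys.argv[2], "jbig2_work", sys.argv[3])
```

Encoding all pages in a single `jbig2` call is what gives the "globals" mentioned in the comments: one symbol dictionary shared across pages, which is presumably where most of the size saving comes from.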