Timeline for How can I convert a scanned PDF with OCRed text to one without OCRed text?

Current License: CC BY-SA 3.0

6 events

when	what		by	license	comment
Nov 1, 2016 at 17:51	comment	added	labreuer		Sure; it was quite easy, as I'm in the process of recompressing various PDFs into lossy jbig2 with globals using iTextSharp. When I applied that to the above PDF generated from PNGs, I got a 264 KB PDF. (For others who happen along, I may open source the resultant C# project at some point in the future. I have to decide how deeply to dive into understanding PDFs.)
Nov 1, 2016 at 17:08	comment	added	Kurt Pfeifle		@labreuer: Interesting. Thanks for checking this. I'll investigate this some more (and probably update my answer, giving you credit) once I find the time to do it.
Nov 1, 2016 at 17:06	comment	added	labreuer		False in this case: when I switched your code to png, the resultant pdf I got was 1.97MB. You'd probably be right if we weren't dealing with bitonal images of text; png compresses those quite well. But it's also irrelevant, because I was only using png as an intermediary to jbig2. I knew I could do this, because your `pdfimages -list` results showed that all the images were jbig2.
Nov 1, 2016 at 13:13	comment	added	Kurt Pfeifle		@labreuer: Just FYI, going the PNG route does not offer any advantages IMHO. If it does, please explain which. PNG is typically larger than JPEG, so the disadvantage I clearly outlined (file size of the new PDF sans OCR) would be even worse...
Nov 1, 2016 at 1:28	comment	added	labreuer		Just FYI, one could convert `pbm` files to `png` (or run a Poppler version of `pdfimages` with `-png`), then use agl/jbig2enc (generates jbig2 with globals), then use `pdf.py` (in that project) to create a pdf. I know this works if the pdf is made up exclusively of jbig2 images, one per page.
Jan 28, 2015 at 23:28	history	answered	Kurt Pfeifle	CC BY-SA 3.0
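
For others who happen along: the route labreuer describes in the Nov 1, 2016 at 1:28 comment (Poppler's `pdfimages -png`, then agl/jbig2enc, then that project's `pdf.py`) could be scripted roughly as below. This is only a sketch, not the commenter's actual code. It assumes `pdfimages` and `jbig2` are on the PATH, that `pdf.py` from the jbig2enc project sits in the current directory and runs under the `python` on the PATH, and that the PDF consists exclusively of bitonal images, one per page. The function names, the `page` prefix and the `output` basename are made up for illustration.

```python
import glob
import subprocess


def extract_cmd(pdf_path, prefix):
    # Poppler's pdfimages: write each embedded image as <prefix>-NNN.png.
    return ["pdfimages", "-png", pdf_path, prefix]


def encode_cmd(pages, basename):
    # jbig2enc: -s = symbol mode (shared symbol dictionary, i.e. "globals"),
    # -p = emit files suitable for embedding in a PDF,
    # -b = basename for the generated <basename>.sym / <basename>.NNNN files.
    return ["jbig2", "-s", "-p", "-b", basename, *pages]


def assemble_cmd(basename):
    # pdf.py ships with jbig2enc and writes the finished PDF to stdout.
    # Assumption: it is in the current directory.
    return ["python", "pdf.py", basename]


def rebuild(pdf_path, out_path, prefix="page", basename="output"):
    """Re-encode a scanned PDF as JBIG2 images only, dropping the OCR text layer."""
    subprocess.run(extract_cmd(pdf_path, prefix), check=True)
    pages = sorted(glob.glob(f"{prefix}-*.png"))
    if not pages:
        raise RuntimeError("pdfimages extracted no PNG files")
    subprocess.run(encode_cmd(pages, basename), check=True)
    with open(out_path, "wb") as out:
        subprocess.run(assemble_cmd(basename), stdout=out, check=True)
```

Calling `rebuild("scan.pdf", "scan-no-ocr.pdf")` would run the three steps in order. Because only the page images are carried over, the OCR text layer is left behind, and per the comments above the bitonal JBIG2 output can end up much smaller than a PNG- or JPEG-based rebuild.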