Store OCR'ed data in pdf #26

DanielSwain · 2019-04-16T13:32:24Z

I have OCR'ed my first set of documents with the fallback to Tesseract, and it worked very well. To be most useful, the OCR'ed text should be saved not only to the database but also back into the PDF. That way a user can press Ctrl+F to find text within the document when viewing it. Have you thought about implementing this functionality?

DanielSwain · 2019-04-17T00:06:27Z

I see that near the bottom of this page of the Tesseract docs it says:

You can also create a searchable pdf directly from tesseract (versions >= 3.03):

The Tesseract parser is launched in the pdf_parser.py code, so it looks like a change there would allow searchable PDFs to be saved.
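
To make the idea concrete, here is a rough sketch of what such a call might look like. It is not taken from pdf_parser.py: the function names and file paths are hypothetical, and it only assumes the standard Tesseract command line, where passing `pdf` as the final argument makes Tesseract write `<outputbase>.pdf` with a searchable text layer. Tesseract reads images rather than PDFs, so a scanned PDF would first have to be converted to images (not shown).

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Return the argv list for `tesseract <image> <outputbase> -l <lang> pdf`.

    Hypothetical helper for illustration; not part of the existing parser.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on an image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_pdf_command(image_path, output_base, lang)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    # Tesseract appends the ".pdf" extension to the output base itself.
    return Path(f"{output_base}.pdf")
```

Something like `make_searchable_pdf("scan.tiff", "scan_searchable")` would then leave `scan_searchable.pdf` next to the extracted text, and that file could be stored in place of the original upload.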

DanielSwain mentioned this issue Apr 17, 2019

Textract dependency issue; Wagtail version dependency #22

Open
