Add duplicate (hidden) text layer in pdf for additional search

Question

My problem:

I have a PDF with a lot of Roman characters with complex diacritics (e.g. ṣ, ś, ṝ, ǎ). To make searching the PDF easier, I would like to add an additional layer, as is done with OCR, in which the same text is present without the diacritics.

When using full-text search engines, I can index several terms at the same position (vector); I would like to get the same effect here.

I have read a lot about adding an OCR layer to scanned images, but I just want to duplicate the text layer, pass it through a script that strips the diacritics (fairly straightforward), and then add it back as a hidden but searchable layer.

Does anyone have any suggestions? (Solutions involving any platform, language, library, or toolchain would be helpful!)

Thanks:)

Edit: please let me know if the question is unclear.

+4

search pdf

simon Oct 27 '10 at 9:45

2 answers

I wrote something similar in C#: it adds searchable text to OCR'd images and converts them to PDF. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image, and it worked pretty well.

In your case, QuickPDF lets you extract the text strings along with their bounding boxes and font details. You can then normalize the text, create invisible text objects using the existing fonts and position information, and save the result to a new file.

This basically gives you the same PDF file you have now, but with both the original and the normalized text available when you search or extract text.
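The normalization step itself needs no PDF library. As a rough sketch (not the code either poster used; the function name `strip_diacritics` is made up for illustration), Python's standard `unicodedata` module can decompose each character and drop the combining marks:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (e.g. 'ṣ' becomes 's')."""
    # NFD splits each precomposed character into a base letter
    # followed by its combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (general category "Mn"), keep the rest.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose what is left so untouched characters stay in NFC form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("ṣ ś ṝ ǎ"))  # prints: s s r a
```

Note that this only handles characters that Unicode decomposes into a base letter plus marks. Letters such as ø or đ have no such decomposition and would need an explicit mapping table.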

QuickPDF is a commercial library. If your own solution works well for you, there is no need to buy a commercial library. The nice thing, however, is that it requires only one SDK, so it would be worth a look if you had more than a few PDFs to convert.

+1

Andrew Cash Nov 25 '10 at 9:36

simon · Accepted Answer · 2010-11-25T07:44:06+0000

Well, I have a (slightly ugly and hacky) solution, so I decided to share it.

I use PDFMiner to extract the text along with its coordinates. Then I use ReportLab to write the normalized version of the text to a new PDF as hidden text, in exactly the same positions. To get the positioning right, I found that I needed to use the exact same font, so I used a combination of FontForge and MuPDF to extract the required font from the original PDF.

Finally, having created the new PDF, I use pdftk to combine it with the original.
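For anyone wanting to try the same route, here is a minimal sketch of the last two steps. It is not the original script, and it rests on several assumptions: the helper names `write_hidden_layer` and `pdftk_merge_command` are invented, the (x, y, font size, text) tuples are assumed to come from an earlier extraction step and to be already normalized, Helvetica stands in for the font extracted from the original, and pdftk's `multistamp` operation is one plausible way to do the merge (the post does not say which operation was used). Text rendering mode 3 draws text with neither fill nor stroke, which is what makes it invisible yet still present in the text layer.

```python
def write_hidden_layer(path, page_size, items, font_name="Helvetica"):
    """Write a one-page PDF that contains only invisible text.

    items: iterable of (x, y, font_size, text) tuples in PDF points,
    with the origin at the bottom-left corner of the page.
    """
    # Third-party dependency, imported lazily: pip install reportlab
    from reportlab.pdfgen.canvas import Canvas

    canvas = Canvas(path, pagesize=page_size)
    for x, y, font_size, text in items:
        text_obj = canvas.beginText(x, y)
        text_obj.setFont(font_name, font_size)
        # Mode 3 = neither fill nor stroke: invisible on the page.
        text_obj.setTextRenderMode(3)
        text_obj.textOut(text)
        canvas.drawText(text_obj)
    canvas.showPage()
    canvas.save()


def pdftk_merge_command(original, overlay, output):
    """Build the pdftk call that stamps each overlay page onto the
    matching page of the original."""
    return ["pdftk", original, "multistamp", overlay, "output", output]


# Example use (needs reportlab and pdftk installed; file names are
# placeholders):
#   import subprocess
#   write_hidden_layer("overlay.pdf", (595.27, 841.89),
#                      [(72, 720, 12, "samskrta")])
#   subprocess.run(pdftk_merge_command("original.pdf", "overlay.pdf",
#                                      "searchable.pdf"), check=True)
```

A multi-page document would need one overlay page per original page, so the loop would call `showPage()` after each page's items rather than once at the end.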

This works very well, but has the disadvantage that copying text from the PDF copies the normalized text as well. That is acceptable for my current purposes, and I see no way around it. The PDF specification does not actually support what I am trying to do, so I don't think I can do better than this hacky solution.
