My problem:
I have a PDF with a lot of Roman characters with complex diacritics (e.g. ṣ, ś, ṇ, ǎ, etc.). To simplify searching in the PDF, I would like to add an additional layer, as is done with OCR, where the same text is present without diacritics.
With full-text search engines, I can index several terms at the same position (vector). I would like to get the same effect here.
I have read a lot about adding an OCR layer to scanned images, but I just want to duplicate the existing text layer, pass it through a script that strips the diacritics (fairly straightforward), and then add the result back as a hidden but searchable layer.
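To make the "strip the diacritics" step concrete, here is roughly what I have in mind. This is only a minimal sketch using Python's standard `unicodedata` module; the function name `strip_diacritics` is my own, and it covers only the text transformation, not reading from or writing to the PDF.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (e.g. 'ṣ' becomes 's')."""
    # NFD splits each precomposed character into a base letter
    # followed by its combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, keeping only the base characters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ["Kṛṣṇa", "śāstra", "ǎ"]:
        print(word, "->", strip_diacritics(word))
```

Letters that have no Unicode decomposition (such as ł or ø) would pass through unchanged, so those would need an explicit mapping table if they occur in the text.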
Anyone have any suggestions? (Solutions that include any platform, language, library, or toolchain will be helpful!)
Thanks:)
Edit: please let me know if the question is unclear.