PDF is, at its heart, a print format, and so it records text as a series of visual glyphs rather than as actual text. It was never originally intended as a digital archive format, and that still shows in many documents. With complex scripts such as Arabic or the Indic scripts, which require glyph substitution, ligatures, and reordering, you basically often get a mess. What you usually get are the glyph IDs used in the embedded fonts, which bear no resemblance to Unicode or to any actual text encoding (fonts contain glyphs, some of which can be mapped to Unicode code points, while others are needed only for font-internal use, such as contextual glyph variants or ligatures). You can see the same thing with PDFs produced by LaTeX, especially with non-ASCII characters and math.
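To make the glyph-ID problem concrete, here is a toy model in Python. It does not parse a real PDF: the glyph numbers, the `GLYPH_TO_UNICODE` table, and the `extract_text` helper are all invented for illustration. It only shows why a glyph with no Unicode mapping, such as a font-internal ligature, comes out as garbage.

```python
# Toy model only: the glyph IDs and the mapping table are invented for illustration.
# A real font assigns its own glyph IDs; a mapping back to Unicode is optional.

# Hypothetical partial glyph-ID -> Unicode table, as a generator might embed it.
GLYPH_TO_UNICODE = {
    17: "o",
    58: "c",
    61: "e",
    # Glyph 203 is an "ffi" ligature used only inside the font: no mapping given.
}


def extract_text(glyph_ids, mapping):
    """Map glyph IDs to text, marking unmapped glyphs with U+FFFD."""
    return "".join(mapping.get(gid, "\ufffd") for gid in glyph_ids)


# The word "office" drawn as o + ffi-ligature + c + e.
page_glyphs = [17, 203, 58, 61]

# Without a mapping for the ligature, the extracted word has a hole in it.
print(extract_text(page_glyphs, GLYPH_TO_UNICODE))

# If the generator also records what glyph 203 stands for, extraction works.
full_mapping = {**GLYPH_TO_UNICODE, 203: "ffi"}
print(extract_text(page_glyphs, full_mapping))
```

The second call recovers the word only because the extra mapping entry was supplied, which is exactly the part a generating application may or may not bother to write.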
PDF also has facilities to embed the text as text alongside the visual representation, but only at the discretion of the generating application. I have heard that Word tries very hard to retain this information when creating PDFs, but many PDF generators do not (it usually works for Latin text, which is probably why almost no one bothers).
I think your best option, if the PDF does not have the plain text available, is to render the PDF as an image and run OCR on it.
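One way you might decide whether the embedded text is usable before falling back to OCR is a rough sanity check on whatever your extraction tool returns. The sketch below is only a heuristic of my own: the function name `looks_like_real_text` and the 0.8 threshold are assumptions, and the OCR step itself is not shown.

```python
import unicodedata


def looks_like_real_text(text, threshold=0.8):
    """Return True if most non-space characters look like ordinary text.

    The 0.8 threshold is an arbitrary assumption, not a recommended value.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False  # nothing was extracted at all

    def is_plausible(c):
        # Reject the replacement character plus private-use (Co),
        # unassigned (Cn), and control (Cc) code points.
        return c != "\ufffd" and unicodedata.category(c) not in ("Co", "Cn", "Cc")

    good = sum(1 for c in chars if is_plausible(c))
    return good / len(chars) >= threshold
```

If the check fails for a page, that page is a candidate for OCR. Note that it cannot catch text that maps to valid but wrong characters, so it is a first filter at best.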
Joey