I inspected your file with a special focus on your sample “मतद | र”, which is extracted as “मतदरर” from the topmost line of the document pages.
In a nutshell:
The document itself contains the information that, for example, the glyphs "मतद | र" in the top line represent the text "मतदरर". You should ask the source of your document for a version in which the font information is not misleading. If that is not possible, you should go for OCR.
More details:
The top line of the first page is generated by the following operations in the page content stream:
/9 280 Tf
(-12"!%$"234%56*5) Tj
The first line selects the font named /9 at a size of 280 (an operation at the beginning of the page scales everything by a factor of 0.05, so the effective size is the 14 units you observe in the file).
The second line prints glyphs. These glyphs are referenced by the string between the parentheses, using the custom encoding of that font.
When a program tries to extract the text, it has to derive the actual characters from these glyph references using information from the font.
The font /9 on the first page of your PDF is defined by the following objects:
242 0 obj<<
/Type/Font/Name/9/BaseFont 243 0 R/FirstChar 33/LastChar 94
/Subtype/TrueType/ToUnicode 244 0 R/FontDescriptor 247 0 R/Widths 248 0 R>>
endobj
243 0 obj/CDAC-GISTSurekh-Bold+0
endobj
247 0 obj<<
/Type/FontDescriptor/FontFile2 245 0 R/FontBBox 246 0 R/FontName 243 0 R
/Flags 4/MissingWidth 946/StemV 0/StemH 0/CapHeight 500/XHeight 0
/Ascent 1050/Descent -400/Leading 0/MaxWidth 1892/AvgWidth 946/ItalicAngle 0>>
endobj
Thus, there is no /Encoding element, but at least there is a reference to a /ToUnicode map. A program that extracts text therefore has to rely on this /ToUnicode mapping.
The stream referenced by /ToUnicode contains the following mappings that are relevant when extracting the text of (-12"!%$"234%56*5):
<21> <21> <0930>
<22> <22> <0930>
<24> <24> <091c>
<25> <25> <0020>
<2a> <2a> <0031>
<2d> <2d> <092e>
<31> <31> <0924>
<32> <32> <0926>
<33> <33> <0926>
<34> <34> <002c>
<35> <35> <0032>
<36> <36> <0030>
(Already here you can see that several character codes are mapped to the same Unicode code point ...)
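The duplicate targets can also be found mechanically. The following is only a minimal Python sketch, assuming the mappings are read as triples of first code, last code and Unicode target, exactly as listed above. The names TO_UNICODE_EXCERPT, parse_ranges and find_collisions are made up for illustration and are not part of iTextSharp or any PDF library.

```python
import re
from collections import defaultdict

# The mappings quoted above: <first code> <last code> <Unicode target>.
TO_UNICODE_EXCERPT = """
<21> <21> <0930>
<22> <22> <0930>
<24> <24> <091c>
<25> <25> <0020>
<2a> <2a> <0031>
<2d> <2d> <092e>
<31> <31> <0924>
<32> <32> <0926>
<33> <33> <0926>
<34> <34> <002c>
<35> <35> <0032>
<36> <36> <0030>
"""

HEX_TRIPLE = re.compile(
    r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>"
)

def parse_ranges(text):
    """Return {character code: Unicode code point} for the listed ranges."""
    mapping = {}
    for first, last, target in HEX_TRIPLE.findall(text):
        start, end, base = int(first, 16), int(last, 16), int(target, 16)
        # A range maps consecutive codes to consecutive code points.
        for offset in range(end - start + 1):
            mapping[start + offset] = base + offset
    return mapping

def find_collisions(mapping):
    """Group codes by target and keep only targets reached by several codes."""
    by_target = defaultdict(list)
    for code, target in sorted(mapping.items()):
        by_target[target].append(code)
    return {t: codes for t, codes in by_target.items() if len(codes) > 1}

collisions = find_collisions(parse_ranges(TO_UNICODE_EXCERPT))
for target, codes in collisions.items():
    print(f"U+{target:04X} {chr(target)} <- {[hex(c) for c in codes]}")
```

For this excerpt the sketch reports two targets that are each reached by two different character codes, which is exactly the kind of ambiguity that makes the extracted text differ from the glyphs drawn.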
Thus, text extraction should result in:
- = 0x2d -> 0x092e = म
1 = 0x31 -> 0x0924 = त
2 = 0x32 -> 0x0926 = द
" = 0x22 -> 0x0930 = र instead of |
! = 0x21 -> 0x0930 = र
% = 0x25 -> 0x0020 = 
$ = 0x24 -> 0x091c = ज
" = 0x22 -> 0x0930 = र
2 = 0x32 -> 0x0926 = द
3 = 0x33 -> 0x0926 = द
4 = 0x34 -> 0x002c = ,
% = 0x25 -> 0x0020 = 
5 = 0x35 -> 0x0032 = 2
6 = 0x36 -> 0x0030 = 0
* = 0x2a -> 0x0031 = 1
5 = 0x35 -> 0x0032 = 2
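The same lookup can be reproduced in a few lines of Python. This is only a sketch of the principle, not how iTextSharp or Adobe Reader implement it. It assumes a simple font with single-byte character codes (plausible here given /Subtype/TrueType and /FirstChar 33 /LastChar 94), and the names TO_UNICODE and decode are hypothetical.

```python
# Character code -> Unicode code point, copied from the /ToUnicode excerpt.
TO_UNICODE = {
    0x21: 0x0930, 0x22: 0x0930, 0x24: 0x091C, 0x25: 0x0020,
    0x2A: 0x0031, 0x2D: 0x092E, 0x31: 0x0924, 0x32: 0x0926,
    0x33: 0x0926, 0x34: 0x002C, 0x35: 0x0032, 0x36: 0x0030,
}

def decode(string_operand: bytes, to_unicode: dict) -> str:
    """Map every single-byte character code through the ToUnicode table.

    Codes without an entry become U+FFFD (replacement character).
    """
    return "".join(chr(to_unicode.get(code, 0xFFFD)) for code in string_operand)

# The string operand of the Tj operation in the top line of page 1.
title = decode(b'-12"!%$"234%56*5', TO_UNICODE)
print(title)
```

The decoded string starts with "मतदरर", matching what the extractors return, so the result follows from the font's own mapping rather than from a fault in the extraction code.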
Thus, the text that iTextSharp (as well as Adobe Reader!) extracts from the top line on the first page of the document is exactly what the document, in its font information, claims to be correct.
As the cause is misleading mapping information in the font definition, it is not surprising that there are misinterpretations throughout the document.