Parsing a PDF (Devanagari script) with PDFMiner gives incorrect output

I am trying to parse a PDF file containing a list of Indian voters, written in Hindi (Devanagari script).

The PDF displays all text correctly, but when I dump it to text with PDFMiner, the output characters differ from the ones shown in the PDF.

For example, the word as displayed (correct) is सामान्य

but the extracted word is सपमपनद
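
To make the mismatch concrete, the two words can be compared code point by code point. This is only a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of PDFMiner.

```python
import unicodedata


def describe(word):
    """Return a list of (code point, Unicode name) pairs, one per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in word
    ]


expected = "सामान्य"   # the word as displayed in the PDF
extracted = "सपमपनद"   # the word as it appears in the text dump

# Print both words character by character so they can be compared by eye.
for label, word in (("expected", expected), ("extracted", extracted)):
    print(label)
    for code, name in describe(word):
        print(f"  {code}  {name}")
```

In this one word, both occurrences of the vowel sign ा come out as the letter प, and the trailing ्य comes out as a single द. That looks like a consistent glyph-to-character substitution rather than random corruption, though a single word is not enough to be sure.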

I want to know why this happens and how to parse this type of PDF file.

I have also included a sample PDF file:

http://164.100.180.82/Rollpdf/AC276/S24A276P001.pdf

1 answer

, , .

, ToUnicode Devanagari script, , Unicode. , , , , .


Unicode, ( ). , Devanagari script, , , U + f020 U + f062 "uniF020" ..
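
The code points U+F020 and U+F062 and the glyph name "uniF020" mentioned here can be related with a small helper. The sketch below assumes the common `uniXXXX` glyph-naming convention (four hex digits after `uni`) and checks whether the resulting code point lies in the Basic Multilingual Plane's Private Use Area (U+E000 to U+F8FF). The function names are hypothetical, and "uniF062" is an assumed name derived from the code point, not one quoted from the PDF.

```python
import re

# "uni" followed by exactly four hex digits, e.g. "uniF020".
# Lowercase hex digits are accepted here for leniency.
_UNI_NAME = re.compile(r"^uni([0-9A-Fa-f]{4})$")


def glyph_name_to_codepoint(name):
    """Return the code point encoded in a 'uniXXXX' glyph name, or None."""
    match = _UNI_NAME.match(name)
    return int(match.group(1), 16) if match else None


def is_private_use(codepoint):
    """True if the code point is in the BMP Private Use Area."""
    return 0xE000 <= codepoint <= 0xF8FF


for name in ("uniF020", "uniF062"):  # second name is an assumption
    cp = glyph_name_to_codepoint(name)
    status = "private use" if is_private_use(cp) else "not private use"
    print(f"{name} -> U+{cp:04X} ({status})")
```

Private-use code points carry no standard meaning, so glyph names of this kind say nothing by themselves about which Devanagari character a glyph represents.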

Compact UnicodeBmp

, .. , .

, , , , .


, : PDF Devanagari script , , PDF , Unicode, . 5 .

, , ( toUnicode), , .


, python. , , pdfminer ( - ) ToUnicode, .
