Parsing a PDF (Devanagari script) using PDFMiner gives incorrect output

Question

I am trying to parse a PDF file containing a list of Indian voters, which is in Hindi (Devanagari script).

The PDF displays all text correctly, but when I dump it to text format using PDFMiner, the output contains characters that differ from those in the original PDF.

For example, the displayed (correct) word is सामान्य

but the output word is सपमपनद
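
To make the mismatch concrete, the two words can be compared code point by code point. This is only a minimal sketch using the Python standard library; the helper name `describe` is made up for illustration, and the two strings are simply the words quoted above, not fresh PDFMiner output.

```python
import unicodedata

expected = "सामान्य"   # word as displayed in the PDF viewer
extracted = "सपमपनद"   # word as it appears in the text dump


def describe(word):
    """Return (code point, Unicode name) pairs for each character in word.

    `describe` is a hypothetical helper for this illustration only.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in word
    ]


# Print both words side by side as lists of code points and names.
for label, word in (("expected", expected), ("extracted", extracted)):
    print(f"{label}: {len(word)} code points")
    for code_point, name in describe(word):
        print(f"  {code_point}  {name}")
```

In this particular pair, the vowel sign AA (U+093E) appears to come out as the letter PA (U+092A) both times it occurs, which hints at a consistent glyph-to-character substitution rather than random corruption.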

I want to know why this is happening and how to parse this type of PDF file.

Here is an example PDF file:

http://164.100.180.82/Rollpdf/AC276/S24A276P001.pdf

python parsing pdf pdfminer hindi

Rohit Aug 7 '15 at 11:15

1 answer

mkl · Accepted Answer · 2015-08-10T15:08:07+0000

, , .

, ToUnicode Devanagari script, , Unicode. , , , , .

Unicode, ( ). , Devanagari script, , , U + f020 U + f062 "uniF020" ..

, .. , .

, , , , .

, : PDF Devanagari script , , PDF , Unicode, . 5 .

, , ( toUnicode), , .

, python. , , pdfminer ( - ) ToUnicode, .
