I want to scrape a Hindi (Indian language) PDF file using Python

Question

I wrote Python code that scrapes all the text from a PDF file. The problem is that once the text is extracted, the Hindi words lose their grammar. How can I fix this? My code is below.

# Written for Python 3 with pdfminer.six.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of every page in the PDF at `path`."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        # An empty page-number set means "process every page".
        for page in PDFPage.get_pages(fp, set(), password="", caching=True,
                                      check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt("S24A276P001.pdf"))

and here is a screenshot of the PDF.

python pdf ocr pdf-scraping pdfminer

Abhinav mishra Mar 14 '16 at 18:50

1 answer

Abhinav mishra · Accepted Answer · 2016-03-21T20:09:32+0000

The best way to solve the problem is to use the textract module for Python: download the Hindi language data from its GitHub repository and write the extracted text to a .txt file. This solved my problem.
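As a rough sketch of that approach, something like the following might work. It assumes textract is installed with its Tesseract OCR backend and that Tesseract's Hindi language data (language code hin) is available on the system. The helper names extract_pdf_bytes and save_text and the output file name are made up for illustration; they are not part of textract.

```python
from pathlib import Path


def extract_pdf_bytes(pdf_path, language="hin"):
    """Run OCR on a PDF through textract and return the raw UTF-8 bytes."""
    # Imported lazily so save_text() below works without textract installed.
    import textract  # third-party: pip install textract

    # method="tesseract" asks textract to OCR the pages instead of reading
    # the embedded text layer. language="hin" is an assumption: it requires
    # Tesseract's Hindi language data to be installed.
    return textract.process(pdf_path, method="tesseract", language=language)


def save_text(raw, out_path):
    """Decode UTF-8 bytes and write them to a text file; return the text."""
    text = raw.decode("utf-8")
    Path(out_path).write_text(text, encoding="utf-8")
    return text


# Usage (input name from the question; the output name is hypothetical):
#   save_text(extract_pdf_bytes("S24A276P001.pdf"), "S24A276P001.txt")
```

OCR reads the rendered page image rather than the PDF's internal font encoding, which may be why it keeps the Hindi text intact where direct text extraction did not.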
