PDF scraper: how to automate the creation of txt files for each scraped PDF file in Python?

Here's what I want to do: a program that takes a list of PDF files as its input and produces one .txt file for each file in the list.

For example, given listA = ["file1.pdf", "file2.pdf", "file3.pdf"], I want Python to create three txt files (one for each pdf file), for example "file1.txt", "file2.txt" and "file3.txt".

I have the conversion part working smoothly, thanks to code someone else shared. The only change I made is to maxpages, which I set to 1 instead of 0 so that only the first page is extracted. As I said, this part of my code works fine. Here is the code:

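# Imports were not shown in the original post; these are the usual pdfminer ones (assumed).
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO  # Python 2; on Python 3 this would be: from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # the original used the Python 2 builtin file()
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    #maxpages = 0
    maxpages = 1  # only the first page
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
    retstr.close()
    return text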

The thing is, I can't get Python to produce what I described in the second paragraph. I tried the following code:

def save(lst):
    i = 0

    while i < len(lst):
        txtfile = "enegep"+str(i)+".txt" #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lst[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
        i += 1

I ran this save function with a list of two PDF files as input, but it generated only one txt file and kept running for several minutes without generating a second one. What is the best approach to achieve my goal?

1 answer

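Your code gets stuck in an infinite loop if i is never updated inside the while loop, so make sure i += 1 is indented as part of the loop body: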

def save(lst):
    i = 0   # set to 0 once, so it must change inside the loop
    while i < len(lst):
        txtfile = "enegep"+str(i)+".txt" #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lista[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
        i += 1 # you need to increment i inside the loop

A better option is to simply use range:

def save(lst):
    for i in range(len(lst)): 
        txtfile = "enegep{}.txt".format(i) #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lista[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)

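Note that you only ever pass lista[0], so the same first file is converted on every iteration.

Assuming lst is actually lista, iterate over the list itself with enumerate: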

def save(lst):
    for i, ele in enumerate(lst):
        txtfile = "enegep{}.txt".format(i) #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(ele)
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
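
If you would rather get "file1.txt" for "file1.pdf", as in the example in the question, one possible variation is to build each output name from the PDF's own name instead of a counter. This is only a sketch: save_as_txt, the convert parameter and out_dir are names made up here for illustration, and convert is assumed to be any function that takes a PDF path and returns its text (for instance convert_pdf_to_txt from the question).

```python
import os


def save_as_txt(pdf_paths, convert, out_dir="."):
    """Write one .txt file per PDF, named after the PDF; return the paths written."""
    written = []
    for pdf_path in pdf_paths:
        # "reports/file1.pdf" -> "file1"
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        txt_path = os.path.join(out_dir, stem + ".txt")
        # convert is hypothetical here: any callable mapping a PDF path to its text
        text = convert(pdf_path)
        with open(txt_path, "w") as textfile:
            textfile.write(text)
        written.append(txt_path)
    return written
```

You would call it as save_as_txt(["file1.pdf", "file2.pdf"], convert_pdf_to_txt). Passing the converter in as an argument also makes it easy to try the loop with a dummy function before running it on real PDFs.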