Scraping data from PDF to CSV? Python vs PHP?

I have a bunch of reports that I collect manually every day, and it takes forever, so I'm thinking about automating the whole process. I need to extract data from: (1) HTML, (2) CSV / XLS, (3) PDF. So far I have only scraped data from CSV / HTML with PHP, and I'm wondering whether there are any reliable libraries or techniques for extracting tabular data from PDF in PHP.

I have also just started learning Python, and it looks like this could be done with PDFMiner in combination with Scrapy. Would that be better? Or are there other options?

Please let me know. Thanks!

2 answers

Beautiful Soup is another good scraping alternative, and PDFMiner is the best PDF parser for Python I've found. I mainly use pdf2txt.py and then reformat the output if necessary.
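A rough sketch of the "extract, then reformat" step in Python, assuming the maintained pdfminer.six fork is installed (its `extract_text` helper does roughly what pdf2txt.py does). The function names `text_to_csv` and `pdf_to_csv` are made up for illustration, and splitting on whitespace is a simplification that will break cells containing spaces.

```python
import csv
import io


def text_to_csv(text, min_fields=2):
    """Convert whitespace-separated text lines to CSV.

    Lines with fewer than min_fields tokens (titles, blank lines) are skipped.
    Assumption: no table cell contains a space.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= min_fields:
            writer.writerow(fields)
    return out.getvalue()


def pdf_to_csv(pdf_path, csv_path):
    """Extract text from a PDF with pdfminer.six and write it out as CSV."""
    # Imported lazily so text_to_csv can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    with open(csv_path, "w", newline="") as f:
        f.write(text_to_csv(extract_text(pdf_path)))
```

How much reformatting is needed depends heavily on the PDF, so expect to tune the filtering rule per report.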


If you have command-line access to a Linux server, try the pdftotext command:

$ pdftotext file.pdf 

This writes the extracted text to file.txt. If you're lucky, you'll get something you can work with. Depending on how the tables in the PDF were originally laid out, the extracted text may come out oddly arranged. Good luck.
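If the table columns come out jumbled, pdftotext's `-layout` option, which tries to preserve the original physical layout, may help. Below is a minimal sketch of calling it from Python and splitting each line into columns, assuming pdftotext is on the PATH and that columns are separated by at least two spaces. The helper names `split_columns` and `pdftotext_rows` are hypothetical.

```python
import re
import subprocess


def split_columns(line):
    """Split one layout-preserved line on runs of two or more spaces."""
    return re.split(r"\s{2,}", line.strip())


def pdftotext_rows(pdf_path):
    """Run `pdftotext -layout` and return the non-blank lines as column lists."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return [split_columns(line) for line in result.stdout.splitlines() if line.strip()]
```

Splitting on two or more spaces keeps single-spaced values such as "Widget A" together, but it will misfire on tables whose columns are only one space apart.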
