Scraping data from PDF to CSV? Python vs PHP?

Question

I collect a bunch of reports manually every day, and it takes forever, so I am thinking about automating the whole process. I need to extract data from: (1) HTML, (2) CSV / XLS, (3) PDF. So far I have only scraped data from CSV and HTML with PHP, and I am wondering whether there are any reliable libraries or methods for extracting tabular data from PDFs in PHP?

I have also just started learning Python, and it looks like it would be worth trying PDFMiner in combination with Scrapy. Would that be better? Or are there other options?

Please let me know. Thanks!

+4

python php pdf screen-scraping

tr3online Sep 09 '11 at 2:30

2 answers

If you have command-line access to a Linux server, try the pdftotext command:

$ pdftotext file.pdf

If you're lucky, you'll get something you can work with. Depending on the PDF, the text may come out looking strange because of how the tables were originally formatted, in my experience at least. Good luck.
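As a rough sketch of how that output could be turned into CSV from Python: the snippet below runs pdftotext with its -layout option (which tries to keep columns aligned) and splits each line on runs of two or more spaces. The function names are made up for illustration, and the two-space rule is only an assumption about how the columns are separated; tables with empty cells or single-space gaps will need something smarter.

```python
import csv
import io
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext -layout and return what it prints to stdout.

    Assumes the pdftotext binary (poppler-utils or xpdf) is on PATH.
    The trailing "-" tells pdftotext to write to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def rows_from_layout_text(text):
    """Split each non-empty line into cells on runs of 2+ whitespace chars."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize a list of rows to a CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Something like rows_to_csv(rows_from_layout_text(pdf_to_layout_text("file.pdf"))) would then give CSV text to save or post-process, provided the PDF's tables survive the text conversion.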

+2

Adam Sep 09 '11 at 2:38

Stedy · Accepted Answer · 2011-09-09T02:36:03+0000

Beautiful Soup is another good option for the scraping side, and PDFMiner is the best PDF parser for Python that I have found. I mainly use pdf2txt.py and then reformat the output if necessary.
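A minimal sketch of the "extract, then reformat" step, under a few assumptions: it uses extract_text from pdfminer.six (the maintained fork, not necessarily the PDFMiner release this answer was written against), and it assumes the table comes out with one cell per line, which is common but not guaranteed. The helper names and the fixed column count are hypothetical.

```python
import csv


def extract_pdf_text(pdf_path):
    """Extract text with pdfminer.six (assumed installed: pip install pdfminer.six)."""
    # Imported lazily so the reformatting helper below works without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def group_cells(text, columns):
    """Treat every non-empty line as one cell and group cells into rows."""
    cells = [line.strip() for line in text.splitlines() if line.strip()]
    return [cells[i:i + columns] for i in range(0, len(cells), columns)]


def pdf_to_csv(pdf_path, csv_path, columns):
    """Extract text from a PDF and write it out as CSV rows of `columns` cells."""
    rows = group_cells(extract_pdf_text(pdf_path), columns)
    with open(csv_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

If the cells do not come out one per line, the grouping step is the part to replace; the extraction call stays the same.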
