Data Scraper from PDF and Excel

Question

I am doing a little data scraping. There are three types of files from which I scrape data.

1- HTML
2- PDF
3- Excel (xls)

For HTML I'm comfortable; I use the HTML Agility Pack for this.

For PDF and Excel, I would appreciate suggestions from anyone.

Thanks in advance.

+4

excel pdf screen-scraping

Sakhawat ali Jun 30 '10 at 9:02

4 answers

I can recommend Cogniview PDF2XL, a low-cost commercial product, for extracting data from tables in PDF files into Excel. We have used it with great success.

+1

Govert Jul 9 '10 at 15:15

HTML Agility is a library, and it's good to use. But why do you need separate tools for each kind of data extraction? Use Automation Anywhere to retrieve data from any source. As far as I know, it will work for all three sources that you mentioned. Google it.

0

Bob Jan 31 '12 at 9:08

You can use UiPath to achieve this. It can scrape with 100% accuracy from PDF, Excel, HTML, Java, Windows, .NET, WPF, and legacy applications. It also works with virtualized environments, but only through OCR scraping.

It can be used from code (SDK), but you can also create visual automation (workflows) using UiPath Studio. Here is a tutorial on extracting web data

Note: I work at UiPath, so I know that it can handle this task. You should also try other visual automation tools such as Automation Anywhere, WinAutomation, and Jacada; use them side by side and choose the one that suits you best.

0

mbadit Nov 05 '14 at 11:01

renick · Accepted Answer · 2010-06-30T09:08:28+0000

Regarding Excel: if you are in a Microsoft environment, you can either use Office automation or read the file through OLEDB. In a Java environment, consider Apache POI.
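
For the Java route, a minimal sketch of reading an .xls file with Apache POI might look like the following. It assumes POI's HSSF classes are on the classpath; the class name `XlsScraper` and the method `readFirstSheet` are made up for illustration and are not part of POI.

```java
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Apache POI.
public class XlsScraper {

    /** Returns the cells of the first sheet as display text, one list per row. */
    public static List<List<String>> readFirstSheet(InputStream in) throws IOException {
        // DataFormatter renders each cell roughly as Excel would display it.
        DataFormatter formatter = new DataFormatter();
        List<List<String>> rows = new ArrayList<>();
        // HSSFWorkbook handles the binary .xls format.
        try (Workbook workbook = new HSSFWorkbook(in)) {
            Sheet sheet = workbook.getSheetAt(0);
            // Iteration only visits rows and cells that exist in the file,
            // so blank gaps are skipped rather than returned as empty strings.
            for (Row row : sheet) {
                List<String> cells = new ArrayList<>();
                for (Cell cell : row) {
                    cells.add(formatter.formatCellValue(cell));
                }
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the .xls file to read.
        try (InputStream in = new FileInputStream(args[0])) {
            for (List<String> row : readFirstSheet(in)) {
                System.out.println(String.join("\t", row));
            }
        }
    }
}
```

Because no formula evaluator is passed in, formula cells come back as their formula text rather than their computed value.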

EDIT: Regarding PDF in Java, try Apache PDFBox. It may also work in .NET using IKVM.
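
As a rough sketch of the PDFBox suggestion, the snippet below pulls the plain text out of a PDF. It assumes the PDFBox 2.x API, where `PDDocument.load(File)` is available (later major versions load documents differently); the class name `PdfTextScraper` and the method `extractText` are made up for illustration.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

// Hypothetical helper class, not part of Apache PDFBox.
public class PdfTextScraper {

    /** Extracts the text of all pages of the given PDF as a single string. */
    public static String extractText(File pdf) throws IOException {
        // Assumption: PDFBox 2.x, which provides PDDocument.load(File).
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Sorting by position tends to keep table rows on one line.
            stripper.setSortByPosition(true);
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF file to read.
        System.out.println(extractText(new File(args[0])));
    }
}
```

The result is unstructured text, so any table layout still has to be recovered by splitting lines and columns yourself. Scanned PDFs without a text layer yield no text this way.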
