Data Scraper from PDF and Excel

I am doing some data scraping. There are three types of files from which I scrape data.

1- HTML
2- PDF
3- Excel (xls)

For HTML I'm comfortable; I use HTML Agility Pack for this.

For PDF and Excel, I need suggestions from anyone.

Thanks in advance.

+4
4 answers

Regarding Excel: if you are in a Microsoft environment, you can either use Office automation or OLEDB. In a Java environment, consider Apache POI.
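For the Java route, a minimal sketch of reading an .xls file with Apache POI might look like the following. It assumes POI 4.x or later is on the classpath; the class name `XlsReadSketch` and the method `readFirstSheet` are made up for illustration, and only the first sheet is read.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Hypothetical helper class, not part of POI itself.
class XlsReadSketch {

    // Returns the first sheet as rows of display strings.
    // Only cells that physically exist in the file are visited,
    // so blank cells in a row are skipped rather than padded.
    static List<List<String>> readFirstSheet(File xls) throws IOException {
        DataFormatter formatter = new DataFormatter();
        List<List<String>> rows = new ArrayList<>();
        // Open read-only with no password so closing never writes to the file.
        try (Workbook workbook = WorkbookFactory.create(xls, null, true)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                List<String> cells = new ArrayList<>();
                for (Cell cell : row) {
                    // Formats the cell roughly as Excel would display it.
                    cells.add(formatter.formatCellValue(cell));
                }
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: XlsReadSketch <file.xls>");
            return;
        }
        for (List<String> row : readFirstSheet(new File(args[0]))) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

`WorkbookFactory` picks the right implementation from the file contents, so the same sketch should also cope with .xlsx if the poi-ooxml module is available.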

EDIT: For PDF in Java, try Apache PDFBox. It may also work in .NET via IKVM.
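As a rough illustration of the PDFBox suggestion, the sketch below pulls the plain text out of a PDF. It assumes PDFBox 2.x (in 3.x the document is opened through `Loader.loadPDF` instead of `PDDocument.load`); the class name `PdfTextSketch` and the method `extractText` are hypothetical. Text extraction returns a flat string, so any table structure would still have to be rebuilt by your own parsing.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class, not part of PDFBox itself.
class PdfTextSketch {

    // Returns the text of all pages in reading order as PDFBox infers it.
    // Scanned (image-only) PDFs yield little or no text without OCR.
    static String extractText(File pdf) throws IOException {
        // PDFBox 2.x API; assumed version, see the note above.
        try (PDDocument document = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PdfTextSketch <file.pdf>");
            return;
        }
        System.out.print(extractText(new File(args[0])));
    }
}
```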

+4

I can recommend Cogniview PDF2XL, a low-cost commercial product, for extracting data from tables in PDF files into Excel. We have used it with great success.

+1

HTML Agility Pack is a library, and it's good to use. But why use separate tools for different data extraction tasks? Use Automation Anywhere to retrieve data from any source. As far as I know, it works with all three sources you listed. Google it.

0

You can use UiPath to achieve this. It can scrape PDF, Excel, HTML, Java, Windows, .NET, WPF, and legacy applications with 100% accuracy. It also works with virtualized environments, but only through OCR scraping.

It can be used from code (SDK), but you can also create visual automations (workflows) using UiPath Studio. Here is a tutorial on extracting web data.

Note: I work at UiPath, so I know that it can handle this task. You should also try other visual automation tools such as Automation Anywhere, WinAutomation, and Jacada; use them side by side and choose the one that suits you best.

0

Source: https://habr.com/ru/post/1314304/

