Is there any language that is just “perfect” for web scraping?

I have used three languages for web scraping (Ruby, PHP and Python) and, frankly, none of them is really suited to the task.

Ruby has an excellent Mechanize library and an excellent XML parsing library, but its spreadsheet support is very poor.

PHP has excellent spreadsheet and HTML parsing libraries, but it has no equivalent of WWW::Mechanize.

Python has a very poor Mechanize library. I had a lot of problems with it and still have not solved them. Its spreadsheet library is only more or less decent, as it cannot create XLSX files.

Is there something that is just perfect for web scraping?

PS: I work on the Windows platform.

+7
python ruby php web-scraping
4 answers

Check out Python + Scrapy, it is pretty good:

http://scrapy.org/

+2

Why not just use the XML Spreadsheet format? It is very simple to generate, and doing so is likely to be trivial with any kind of class-based system.
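As a rough illustration of how little is needed, here is a minimal sketch that writes rows out as an XML Spreadsheet document. It assumes the format meant here is the Excel 2003 "XML Spreadsheet" dialect; the function name `rows_to_spreadsheet_xml` and the sample data are made up for the example.

```python
from xml.sax.saxutils import escape, quoteattr


def rows_to_spreadsheet_xml(rows, sheet_name="Sheet1"):
    """Render a list of rows as an XML Spreadsheet 2003 document (a string)."""
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<?mso-application progid="Excel.Sheet"?>',
        '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"',
        '          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">',
        f' <Worksheet ss:Name={quoteattr(sheet_name)}>',
        '  <Table>',
    ]
    for row in rows:
        out.append('   <Row>')
        for value in row:
            # Numbers are typed as Number; everything else is written as String.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            cell_type = "Number" if is_number else "String"
            out.append(
                f'    <Cell><Data ss:Type="{cell_type}">{escape(str(value))}</Data></Cell>'
            )
        out.append('   </Row>')
    out += ['  </Table>', ' </Worksheet>', '</Workbook>']
    return "\n".join(out)


# Hypothetical scraped data, just to show the call.
document = rows_to_spreadsheet_xml([["name", "price"], ["A & B", 3.5]])
```

Saving `document` to a file with an `.xml` extension should be enough for Excel to open it as a workbook, though that part is worth checking against the Excel version you actually have.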

Also, for Python, have you tried BeautifulSoup for parsing? Urllib + BeautifulSoup makes a pretty powerful combo.

+1

The short answer is no.

The problem is that HTML is a large family of formats, and only the more recent variants are consistent (and based on XML). If you intend to use PHP, I would recommend the DOM parser, as it can handle a lot of HTML that does not qualify as well-formed XML.
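The same idea carries over to the other languages mentioned in the question. As a sketch only, and swapping PHP's DOM parser for Python's standard-library `html.parser` (which is also tolerant of markup that is not well-formed XML), this pulls table cells out of HTML with unclosed tags. The class `TableExtractor`, the helper `extract_rows` and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row also ends any cell that was never closed.
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])  # cell without an enclosing <tr>
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_rows(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# No closing </td> or </tr> anywhere: not well-formed XML, but still parseable.
sloppy = "<table><tr><td>Apple<td>1.20<tr><td>Pear<br><td>0.80</table>"
rows = extract_rows(sloppy)
```

A strict XML parser would reject `sloppy` outright, which is the practical reason to reach for an HTML-aware parser first.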

Reading between the lines of your message, you seem to be:

1) capturing content from the web, with a requirement to manage complex interactions

2) parsing the data into a consistent, machine-readable format

3) writing the data to a spreadsheet

These are, of course, three separate problems: if no single language meets all three requirements, why not use the best tool for each job and just worry about a suitable intermediate format / medium for the data?
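To make the intermediate-format idea concrete, here is a minimal sketch that uses CSV as the hand-off between the parsing stage and the spreadsheet stage. CSV is only one possible choice, the function names are made up for the example, and an in-memory buffer stands in for the file that two separate programs would share.

```python
import csv
import io


def write_intermediate(rows, stream):
    """Parsing stage: dump the extracted rows as CSV, the hand-off format."""
    csv.writer(stream, lineterminator="\n").writerows(rows)


def read_intermediate(stream):
    """Spreadsheet stage: read the rows back; any language with a CSV reader can do this."""
    return [row for row in csv.reader(stream)]


# In practice this would be a file on disk written by one tool and read by another.
buffer = io.StringIO()
write_intermediate([["name", "price"], ["Pear, green", "0.80"]], buffer)

buffer.seek(0)
rows = read_intermediate(buffer)
```

Because the two stages only share the CSV, the scraper could be written in Ruby with Mechanize and the spreadsheet writer in PHP, and neither would need to know about the other.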


+1

Python + Beautiful Soup for web scraping, and since you're on Windows, you can use win32com to automate Excel and create your XLSX files.

0
