How to do truly multi-threaded web mining with IE / .NET / C#?

I want to pull large amounts of data from the Internet using the IE browser. However, spawning many, many instances of IE through WatiN crashes the system. Is there a better way to do this? Note: I can't just issue WebRequests; I really need a browser because I have to interact with JS-driven behavior on the site.

+4
5 answers

Have you tried the commercial version of iMacros yet? It's a bit like WatiN, but more focused on web automation / web scraping. Basically, they added special code to deal with all the various browser annoyances. Their sample code includes multithreaded C#/VB.NET examples for use with IE and Firefox. We use it with Ruby ;)

We have no problem running many instances per server. While I can't name our company, I know that AlertFox uses the same approach for web monitoring.

+3

I mine a lot of pages using WatiN, actually 30+ at this moment. Of course it takes a lot of resources, about 2.5 GB of RAM, but doing the same with WebRequest would be almost impossible; I can't imagine finishing such a job that way in a reasonable amount of time. With WatiN it takes a few hours.
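
For illustration, a minimal sketch of driving IE through WatiN (the v2 API); the URL and the element locators here are placeholders, not from the original post:

```csharp
using System;
using WatiN.Core;

class Scraper
{
    [STAThread] // WatiN talks to the IE COM object and needs an STA thread
    static void Main()
    {
        using (var browser = new IE("http://example.com/search"))
        {
            // Interact with JS-driven elements just like a real user would.
            browser.TextField(Find.ByName("q")).TypeText("my query");
            browser.Button(Find.ByValue("Search")).Click();
            browser.WaitForComplete();

            // Grab the rendered DOM, after the scripts have run.
            string html = browser.Html;
            Console.WriteLine(html.Length);
        }
    }
}
```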

I don't know if this helps you, but I use the WebBrowser control to do this, with each instance in a separate process. More importantly for you, though: I once tried to reduce memory usage by doing it all in one process. Instead of separate processes you can create separate AppDomains and make them share the same DLLs (especially Microsoft.mshtml.dll) rather than loading the same DLL separately for each new AppDomain. I can't remember exactly how to do it now, but it's not hard to Google. I remember that everything worked fine and RAM usage dropped significantly, so I think it's worth a try.
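
For illustration, a rough sketch of that AppDomain approach; the Worker class, its Run method, and the URLs are hypothetical placeholders for your own mining code:

```csharp
using System;

public class Worker : MarshalByRefObject
{
    public void Run(string url)
    {
        // ... create the WebBrowser / mshtml objects and mine `url` here ...
    }
}

class Host
{
    // MultiDomain asks the CLR to load assemblies domain-neutral, so the
    // same DLLs (Microsoft.mshtml.dll etc.) are shared across AppDomains.
    [LoaderOptimization(LoaderOptimization.MultiDomain)]
    static void Main()
    {
        for (int i = 0; i < 10; i++)
        {
            AppDomain domain = AppDomain.CreateDomain("miner-" + i);
            var worker = (Worker)domain.CreateInstanceAndUnwrap(
                typeof(Worker).Assembly.FullName,
                typeof(Worker).FullName);
            worker.Run("http://example.com/page" + i);
            // Call AppDomain.Unload(domain) when the job is done.
        }
    }
}
```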

+2

How about running multiple instances of the WebBrowser control (it's IE under the hood anyway) in a single .NET application and processing the data-mining jobs asynchronously?
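
For illustration, one common pattern is to give each WebBrowser control its own STA thread and message loop; a sketch, with placeholder URLs and scraping logic:

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

class MultiBrowser
{
    static void Mine(string url)
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // The page, including JS-generated content, is ready here.
            Console.WriteLine(url + ": " + browser.DocumentText.Length + " bytes");
            Application.ExitThread(); // end this thread's message loop
        };
        browser.Navigate(url);
        Application.Run(); // WebBrowser needs a message pump to work
    }

    static void Main()
    {
        string[] urls = { "http://example.com/a", "http://example.com/b" };
        foreach (var url in urls)
        {
            var t = new Thread(() => Mine(url));
            t.SetApartmentState(ApartmentState.STA); // WebBrowser is COM/IE
            t.Start();
        }
    }
}
```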

If performance is a problem, splitting the task up and pushing it to the cloud can also help.

+1

The best way would be to create one process per web-browser instance, because the web browser is not managed code, it's COM, and there are cases where unmanaged exceptions cannot be handled in managed code, so the application will simply crash.

It is best to create a process host that spawns the worker processes, and you can use named pipes, sockets, or WCF to communicate between the processes if you need to.
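
A rough sketch of the host side using named pipes; the worker executable name and the pipe names are made up, and the worker is assumed to connect back with a NamedPipeClientStream using the pipe name it receives:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Pipes;

class ProcessHost
{
    static void Main()
    {
        for (int i = 0; i < 4; i++)
        {
            string pipeName = "miner-pipe-" + i;
            var server = new NamedPipeServerStream(pipeName, PipeDirection.InOut);

            // Each worker hosts exactly one browser; if IE takes it down,
            // only that process dies and the host can restart it.
            Process.Start("MinerWorker.exe", pipeName);

            server.WaitForConnection();
            using (var writer = new StreamWriter(server) { AutoFlush = true })
            {
                writer.WriteLine("http://example.com/job/" + i); // hand out a job
            }
        }
    }
}
```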

It would also be good to create a small embedded SQL database and queue your jobs in it: each mining process grabs a new job from the queue and writes its results back, and the database can be used to synchronize everything.
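
For illustration, a sketch of atomically claiming a job, assuming SQLite through the System.Data.SQLite provider and a made-up jobs table (id, url, status); neither is specified in the answer:

```csharp
using System;
using System.Data.SQLite;

class JobQueue
{
    // Claim one pending job inside a transaction so that concurrent miners
    // never pick up the same row. Returns the job URL, or null if none left.
    public static string ClaimJob(SQLiteConnection conn)
    {
        using (var tx = conn.BeginTransaction())
        {
            long id;
            string url;
            using (var select = new SQLiteCommand(
                "SELECT id, url FROM jobs WHERE status = 'pending' LIMIT 1",
                conn, tx))
            using (var reader = select.ExecuteReader())
            {
                if (!reader.Read()) return null; // queue is empty; tx rolls back
                id = reader.GetInt64(0);
                url = reader.GetString(1);
            }
            using (var update = new SQLiteCommand(
                "UPDATE jobs SET status = 'running' WHERE id = @id", conn, tx))
            {
                update.Parameters.AddWithValue("@id", id);
                update.ExecuteNonQuery();
            }
            tx.Commit();
            return url;
        }
    }
}
```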

+1

I had a project in which I scraped about 45 million page requests (with form submissions) on an ongoing basis, running about 20 simultaneous clients, and my network pipe was the bottleneck.

I settled on Selenium Remote Control after experimenting with writing my own WebClient-based scraper, with WatiN/Watir, and with the Microsoft UI Automation API.

Selenium RC lets you choose a browser; I used Firefox. Setting up the initial scraping scripts took about an hour of experimentation and tuning. Selenium was much faster than writing native code and much more robust for little investment. Great tool.
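
For illustration, a minimal sketch of the old Selenium RC client API from C#; the site URL and locators are placeholders, and it assumes a Selenium RC server is already running on localhost:4444:

```csharp
using System;
using Selenium;

class RcScraper
{
    static void Main()
    {
        ISelenium selenium = new DefaultSelenium(
            "localhost", 4444, "*firefox", "http://example.com");
        selenium.Start();                         // launches the browser
        selenium.Open("/search");
        selenium.Type("name=q", "my query");      // fill the form
        selenium.Click("css=input[type=submit]");
        selenium.WaitForPageToLoad("30000");      // timeout in milliseconds

        string html = selenium.GetHtmlSource();   // JS-rendered page source
        Console.WriteLine(html.Length);
        selenium.Stop();
    }
}
```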

To scale the process I tried several different approaches, but in the end what worked best was to run each instance of Selenium RC in its own virtual machine and then spawn as many of those as the workstation could support. An equivalent number of SRC instances running directly on the host, rather than in VMs, inevitably stalled once I got above about 10 instances. The VMs required more overhead and setup time before scraping, but they would then run for several days without interruption.

Another consideration: tune your Firefox settings so that pages load as little as possible, and turn off everything non-essential (phishing checks, cookies if your scrape doesn't require them, images, plus ad-blocking and Flash-blocking extensions, etc.).

+1
