I want to scrape many sites and search across all of them; which system should I use?

I need to scrape about 5,000 websites that all carry the same kind of information, so the data will be loosely structured, for example: item_id, name, description, date ....

Additional information found on the page should be searchable.

My thinking is that I don't need a relational database or complex logical queries; I just need to search the data by keyword. Someone types "green yellow", and the search returns all items that contain both words. Since there could be many millions of items, I was wondering which technology would suit this best: ideally something scalable, or perhaps a cloud-hosted solution?
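For concreteness, here is a minimal sketch of that "all keywords must match" behavior against Solr (the engine recommended in the answers below), in TypeScript for Node 18+. The core name `items`, the default field `text`, and the host are assumptions for illustration; `q.op=AND` is the standard Solr parameter that requires every term to match.

```typescript
// Query Solr for items containing ALL of the given keywords.
// Assumptions: a core named "items" with a searchable "text" field.
async function searchAll(keywords: string): Promise<unknown[]> {
  const params = new URLSearchParams({
    q: keywords,   // e.g. "green yellow"
    "q.op": "AND", // require every term to match, not just any
    df: "text",    // default search field (assumed schema)
    rows: "20",
  });
  const res = await fetch(`http://localhost:8983/solr/items/select?${params}`);
  const body: any = await res.json();
  return body.response.docs;
}

searchAll("green yellow").then((docs) => console.log(docs));
```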

For scraping, I was thinking of Node.js, since I can pair it with jQuery-style DOM handling, which deals with HTML structure nicely. For storage I'm still a little lost, but I have some experience with Lucene, so I could feed the scraped data straight into Lucene.
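In Node, the jQuery-style DOM work is usually done with cheerio, which exposes a jQuery-like API over parsed HTML. Below is a minimal sketch of one scrape-and-index step, again in TypeScript; the CSS selectors and the Solr core name `items` are hypothetical placeholders (real selectors depend on each site's markup), and it posts to Solr's standard `/update/json/docs` endpoint rather than embedding Lucene directly.

```typescript
import * as cheerio from "cheerio";

// Scrape one page and index the extracted fields into Solr.
// The selectors below are hypothetical; adapt them per site.
async function scrapeAndIndex(url: string): Promise<void> {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);

  const doc = {
    item_id: $("[data-item-id]").attr("data-item-id"), // assumed attribute
    name: $("h1").first().text().trim(),
    description: $(".description").text().trim(),      // assumed class
    text: $("body").text(), // full page text, so extra info stays searchable
  };

  // POST one JSON document; commit=true makes it searchable immediately.
  await fetch("http://localhost:8983/solr/items/update/json/docs?commit=true", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc),
  });
}

scrapeAndIndex("https://example.com/item/1"); // hypothetical URL
```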

What do you think? Any advice from people who have done something similar would be great! Thanks.

3 answers

Apache Nutch is actually perfect for this. It uses Lucene/Solr as its search-engine component.

Also check out Lucidworks Solr, which has a built-in web crawler along with a pretty neat GUI.

http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise


Solr is absolutely perfect for this task.

