I want to scrape many sites and search across all of them; which system should I use?

I need to scrape about 5,000 websites that all carry the same kind of information, so the data will be loosely structured, for example: item_id, name, description, date ....

Additional information found on the page should be searchable.

My thinking is that I don't need a relational database or complex logical queries; I just need to search the data by keyword. Someone types "green yellow", and the search returns all items that contain both words. Since there could be many millions of items, I was wondering which technology would suit this best: ideally something scalable, or perhaps a cloud-hosted solution?
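For concreteness, here is a minimal sketch of that "all keywords must match" behavior against Solr (the engine recommended in the answers below), in TypeScript for Node 18+. The core name `items`, the default field `text`, and the host are assumptions for illustration; `q.op=AND` is the standard Solr parameter that requires every term to match.

```typescript
// Query Solr for items containing ALL of the given keywords.
// Assumptions: a core named "items" with a searchable "text" field.
async function searchAll(keywords: string): Promise<unknown[]> {
  const params = new URLSearchParams({
    q: keywords,   // e.g. "green yellow"
    "q.op": "AND", // require every term to match, not just any
    df: "text",    // default search field (assumed schema)
    rows: "20",
  });
  const res = await fetch(`http://localhost:8983/solr/items/select?${params}`);
  const body: any = await res.json();
  return body.response.docs;
}

searchAll("green yellow").then((docs) => console.log(docs));
```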

For scraping, I was thinking of Node.js, since I can pair it with jQuery-style DOM handling, which deals with HTML structure nicely. For storage I'm still a little lost, but I have some experience with Lucene, so I could feed the scraped data straight into Lucene.
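In Node, the jQuery-style DOM work is usually done with cheerio, which exposes a jQuery-like API over parsed HTML. Below is a minimal sketch of one scrape-and-index step, again in TypeScript; the CSS selectors and the Solr core name `items` are hypothetical placeholders (real selectors depend on each site's markup), and it posts to Solr's standard `/update/json/docs` endpoint rather than embedding Lucene directly.

```typescript
import * as cheerio from "cheerio";

// Scrape one page and index the extracted fields into Solr.
// The selectors below are hypothetical; adapt them per site.
async function scrapeAndIndex(url: string): Promise<void> {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);

  const doc = {
    item_id: $("[data-item-id]").attr("data-item-id"), // assumed attribute
    name: $("h1").first().text().trim(),
    description: $(".description").text().trim(),      // assumed class
    text: $("body").text(), // full page text, so extra info stays searchable
  };

  // POST one JSON document; commit=true makes it searchable immediately.
  await fetch("http://localhost:8983/solr/items/update/json/docs?commit=true", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc),
  });
}

scrapeAndIndex("https://example.com/item/1"); // hypothetical URL
```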

What do you think? Any advice from people who have done something similar would be great! Thanks.

3 answers

Apache Nutch is actually perfect for this. It uses Lucene/Solr as its search-engine component.

Also check out Lucidworks Solr, which has a built-in web crawler along with a pretty neat GUI.

http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise


Solr is absolutely perfect for this task.

