I need to scrape about 5,000 websites that all contain information on a particular topic. The data will therefore be somewhat structured, for example item_id, name, description, date, ...
Additional information found on the page should be searchable.
My thinking is that I do not need a relational database or logical queries; I just need to search the data by keyword. For example, someone could simply type "green yellow" and the search would return all items that contain both words. Given that the number of items could reach many millions, I was wondering which technology would be best suited to this. Ideally something scalable, or maybe there are solutions in the cloud?
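To make the search requirement concrete, here is a minimal sketch of the behaviour I have in mind: an in-memory inverted index where a query matches only the items that contain every word. It is plain Python for illustration only; the names `KeywordIndex`, `add` and `search` are made up for this example, and a real engine such as Lucene would do the same thing on disk with proper analysis and ranking.

```python
import re
from collections import defaultdict


class KeywordIndex:
    """Tiny in-memory inverted index: token -> set of item ids (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, item_id, *fields):
        # Index every token of every text field under the item's id.
        for field in fields:
            for token in self.tokenize(field):
                self.postings[token].add(item_id)

    def search(self, query):
        # AND semantics: an item must contain all query tokens.
        tokens = self.tokenize(query)
        if not tokens:
            return set()
        result = None
        for token in tokens:
            ids = self.postings.get(token, set())
            result = set(ids) if result is None else result & ids
        return result


index = KeywordIndex()
index.add(1, "Green parrot", "A green and yellow bird")
index.add(2, "Yellow taxi", "A yellow car")
index.add(3, "Green apple", "A green fruit")
matches = index.search("green yellow")
```

In this toy data only item 1 contains both "green" and "yellow", so it is the only match. What I am unsure about is which technology does this well for millions of items.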
For scraping, I was thinking about Node.js, since I can combine it with jQuery, which handles DOM and HTML structures well. For storage, I'm still a little lost, but I have some experience with Lucene, so I could index the scraped data directly in Lucene.
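Roughly, each scraped page should end up as one record like the one produced below. This is only a sketch using Python's standard-library `html.parser` rather than Node.js and jQuery, and it assumes (hypothetically) that each field sits in an element whose class name equals the field name; the real sites will certainly need per-site selectors.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ItemFieldParser(HTMLParser):
    """Collects the text of elements whose class matches a wanted field name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.record = {}
        self._current = None  # field currently being captured, if any
        self._depth = 0       # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._current = name
                self._depth = 1
                self.record.setdefault(name, "")
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data


def extract_item(html, fields=("item_id", "name", "description", "date")):
    parser = ItemFieldParser(fields)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return {key: " ".join(value.split()) for key, value in parser.record.items()}


page = (
    '<div class="item"><span class="item_id">42</span>'
    '<h1 class="name">Green <b>parrot</b></h1>'
    '<p class="description">A green and <br>yellow bird</p>'
    '<span class="date">2012-01-05</span></div>'
)
record = extract_item(page)
```

The resulting dictionary (item_id, name, description, date) is what I would then hand to the index.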
What do you think? Any advice from people who have done something similar would be great! Thanks.