I would not use any of the tools you mentioned.
You need to draw a high-level diagram first (I like pencil and paper).
I would develop a system in which different modules each do one job, and I would design it so that you can run many instances of each module in parallel.
I would consider using multiple queues, one per stage (see the sketch after this list):
- URLs to crawl
- Crawled pages from the internet
- Information extracted based on templates and business rules
- Parsed results
- Normalized and filtered results
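The queue layer itself can be very simple. Here is a minimal sketch assuming Redis lists as the queues (any message broker would do); the queue names and helper functions are just illustrative, not part of any particular library:

```python
# Minimal queue layer: one named Redis list per pipeline stage.
import json
import redis

QUEUES = [
    "urls_to_crawl",
    "crawled_pages",
    "extracted_info",
    "parsed_results",
    "normalized_results",
]

broker = redis.Redis(host="localhost", port=6379)

def push(queue: str, item: dict) -> None:
    """Append a work item to the tail of a queue."""
    broker.rpush(queue, json.dumps(item))

def pop(queue: str, timeout: int = 5):
    """Block until an item is available at the head of a queue, or time out."""
    msg = broker.blpop(queue, timeout=timeout)
    return json.loads(msg[1]) if msg else None
```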
You would have simple programs (possibly command-line tools without a UI) that read data from one queue and push data into one or more other queues (the crawler, for example, feeds both the "URLs to crawl" and the "Crawled pages from the internet" queues). You would have:
- Web crawler
- Data extractor
- Parser
- Normalizer and filter
Each program corresponds to one of the queues, and you could run many copies of these programs on separate PCs, which is how you scale (see the worker sketch below).
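Each worker is just an endless read-process-write loop. Here is a rough sketch of the crawler worker, reusing the push/pop helpers from the sketch above; the link extraction is deliberately naive and the function name is my own, not from any library:

```python
# One crawler worker: pop a URL, fetch it, feed the next stage and itself.
# Many copies of this script can run on separate machines against the same broker.
import re
import requests

def crawler_worker() -> None:
    while True:
        job = pop("urls_to_crawl")          # assumes pop() from the sketch above
        if job is None:
            continue                        # queue was empty, poll again
        url = job["url"]
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                        # skip unreachable URLs
        # Feed the next stage with the fetched page ...
        push("crawled_pages", {"url": url, "html": html})
        # ... and feed ourselves with any links found on that page.
        for link in re.findall(r'href="(https?://[^"#]+)"', html):
            push("urls_to_crawl", {"url": link})

if __name__ == "__main__":
    crawler_worker()
```

The extractor, parser, and normalizer/filter follow the same pattern: pop from the previous stage's queue, do their one job, push to the next queue.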
The last queue is consumed by another program that writes everything to the database for actual use.
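That final program could look something like this (SQLite is used only to keep the sketch self-contained; in practice you would point it at your real database and schema):

```python
# Final consumer: drain the last queue and persist the results.
import json
import sqlite3

def db_writer() -> None:
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, data TEXT)")
    while True:
        item = pop("normalized_results")    # assumes pop() from the first sketch
        if item is None:
            continue
        conn.execute(
            "INSERT INTO results (url, data) VALUES (?, ?)",
            (item.get("url"), json.dumps(item)),
        )
        conn.commit()

if __name__ == "__main__":
    db_writer()
```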
Osama al-maadeed