I looked at http://ricksarassociates.com/ and I can't find any partners listed at all, so I wonder whether you would really benefit from this, or whether you are better off looking for some other approach.
I've done this kind of data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying you are not allowed to send unsolicited messages to individuals, but you are allowed to send email to the company itself - so it is the same problem from another angle.
I wish I knew the mathematics and algorithms by heart, because I'm sure there is a fascinating solution hidden somewhere in artificial intelligence and machine learning. The only solution I can see, though, is to create a set of rules, and that rule set will probably get quite complicated over time. Maybe you can apply some Bayesian filtering - it works very well for email.
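If you go the Bayesian route, here is a rough sketch of what I mean, assuming Python with scikit-learn and that you hand-label a handful of pages yourself first (the example texts and labels below are made up for illustration):

```python
# Minimal sketch: classify scraped page text with naive Bayes,
# the same idea as Bayesian spam filtering for email.
# Assumes you already have a small hand-labelled training set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "our partners and associates john smith jane doe",  # hypothetical example
    "contact us opening hours directions parking",
]
train_labels = ["partner_page", "other"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Later, run every crawled page through the model:
print(model.predict(["meet the team our senior partners"]))
# -> ['partner_page'] with this toy training data
```

In practice you would train it on a few hundred labelled pages, not two, but the structure stays the same.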
But, to be a little more productive here: one thing I do know is that you can start by creating a crawler environment and building a dataset. Have a database for URLs so you can add more at any time, and run the crawl over what you already have, so that you can test by querying your own 100% local copy. This will save you a lot of time compared to scraping the live sites while you are still setting things up. A sketch of what I mean follows below.
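Something like this, a minimal sketch assuming Python with requests and SQLite (the table and column names are just made up for illustration):

```python
# Minimal sketch: a URL table in SQLite plus the raw HTML stored next to it,
# so you can re-run experiments against your own copy instead of hitting
# the live sites every time. Names here are illustrative only.
import sqlite3
import requests

db = sqlite3.connect("crawl.db")
db.execute("""CREATE TABLE IF NOT EXISTS pages (
    url        TEXT PRIMARY KEY,
    fetched_at TEXT,
    status     INTEGER,
    html       TEXT)""")

def add_url(url):
    # Seed the queue; ignore duplicates so you can add URLs at any time.
    db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    db.commit()

def crawl_pending():
    # Fetch everything that has not been downloaded yet.
    todo = [row[0] for row in db.execute(
        "SELECT url FROM pages WHERE html IS NULL")]
    for url in todo:
        try:
            r = requests.get(url, timeout=10)
            db.execute("UPDATE pages SET fetched_at = datetime('now'), "
                       "status = ?, html = ? WHERE url = ?",
                       (r.status_code, r.text, url))
        except requests.RequestException:
            # Mark failures so they don't block the rest of the queue.
            db.execute("UPDATE pages SET status = -1 WHERE url = ?", (url,))
        db.commit()

add_url("http://ricksarassociates.com/")
crawl_pending()
```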
I made my own search engine a few years ago, scraping all the .no domains, but I only needed an index file. It took about a week to scrape, and I think it came to 8 GB of data for that single file alone, and I had to use several servers and sort out problems caused by the amount of DNS traffic it generated. There are many problems to take care of. What I'm saying is: if you crawl at large scale, store the raw data as you receive it, so you can work more efficiently with the parsing later.
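By that I mean keeping the fetching and the parsing as separate passes: when your extraction rules change, you just re-run the parser over the local copy instead of crawling again. A rough sketch, assuming the hypothetical "pages" table from the crawler example above and BeautifulSoup, with the title extraction standing in for whatever rules you end up writing:

```python
# Minimal sketch: a second pass that only reads locally stored HTML.
# Assumes the "pages" table from the crawler sketch; grabbing the <title>
# is just a placeholder for your own extraction rules.
import sqlite3
from bs4 import BeautifulSoup

db = sqlite3.connect("crawl.db")

for url, html in db.execute(
        "SELECT url, html FROM pages WHERE html IS NOT NULL"):
    soup = BeautifulSoup(html, "html.parser")
    title = ""
    if soup.title and soup.title.string:
        title = soup.title.string.strip()
    print(url, "->", title)
```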
Good luck, and post a message here if you get it working. I don't think this is possible without a clever algorithm or AI, though - people build websites however they like and pull their templates out of their ass, so there are no rules to follow. As a result, you will get bad data.
Do you have funding for this? If so, it gets easier. Then you can simply crawl each site, build a profile for it, and pay someone cheap to manually review the parsed data and weed out the errors. That is probably how most people do it - unless someone has already done it and is selling the database or offering it as a web service, which would be worth checking first.