If I have a collection of random sites, how do I get specific information from each?

Let's say I have a collection of websites for accountants, for example:

http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us

I want to go over each one and extract the names and email addresses of the partners. How can I approach this problem at a high level?

Suppose I already know how to crawl every site (and all of its subpages) and parse HTML elements - I'm using Oga.

What I'm struggling with is how to understand data that is presented in a variety of ways. For example, an email address for a company (or partner) can be found in one of the following ways:

  • On the About Us page, under the partner's name.
  • On the About Us page, as a shared public email address.
  • On the Team page, under the partner's name.
  • On the Contacts page, as a shared public email address.
  • On the partner's own page, under their name.

Or it could be presented in some other way entirely.

One way to approach emails is to simply look for mailto: anchor tags and filter from there.
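As a rough sketch of that idea, you could pull addresses out of mailto: links with a plain regex before deciding how to filter them. The sample HTML below is invented, and in practice you could do the same lookup with Oga's node search instead:

```ruby
# Collect addresses from mailto: links by scanning the raw HTML.
# A real parser (e.g. Oga) would be more robust; this is the minimal version.
require "cgi"

def mailto_addresses(html)
  html.scan(/href\s*=\s*["']mailto:([^"'?]+)/i)
      .flatten
      .map { |addr| CGI.unescape(addr.strip).downcase }
      .uniq
end

sample = <<~HTML
  <a href="mailto:jane.doe@example.com">Email Jane</a>
  <a href="MAILTO:info@example.com?subject=Hi">Contact us</a>
HTML

mailto_addresses(sample)
# => ["jane.doe@example.com", "info@example.com"]
```

From there the filtering step remains: deciding which of these addresses belongs to a partner.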

The obvious disadvantage is that there is no guarantee the email belongs to a partner rather than some other employee.

Another, less obvious problem is finding partner names from markup alone. At first I thought I could just pull out all the heading tags and the text inside them, but I came across several sites that put partner names in span tags.

I know that SO usually deals with specific programming problems, but I'm not sure how to approach this or where to ask about it. Is there another StackExchange site where this question would be a better fit?

Any high-level advice you could give would be great.

html architecture web-crawler web-scraping
4 answers

The links you provided are mostly US sites, so I assume you are focusing on English names. In that case, instead of parsing specific HTML tags, I would simply search the entire web page for names. (There are free databases of first and last names.) This may also work if you are doing this for companies in Europe, but it will be a problem for companies from some countries. Take China, for example: while there is a relatively fixed set of surnames, basically any combination of Chinese characters can be used as a given name, so this solution will not work on a Chinese site.
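A minimal sketch of that whole-page search, assuming a tiny stand-in NAMES list where a real first-name database (for example the US Census name lists) would go; the page text is invented:

```ruby
# Scan the visible text for a known first name followed by the next
# word, and treat the pair as a candidate "First Last".
NAMES = %w[james john mary patricia rick richard].freeze

def find_names(text)
  words = text.downcase.scan(/[a-z]+/)
  hits = []
  words.each_cons(2) do |first, last|
    hits << "#{first.capitalize} #{last.capitalize}" if NAMES.include?(first)
  end
  hits.uniq
end

page_text = "Our partners Rick Sarris and Mary Jones founded the firm."
find_names(page_text)
# => ["Rick Sarris", "Mary Jones"]
```

A real version would also want a surname list or a capitalization check on the second word to cut down on false positives.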

Finding an email address in a web page is easy, since it has a fixed format: (username)@(domain name), with no spaces. Again, I would not look at the HTML tags but treat the page as a plain string, so that you can find the email regardless of whether it sits in a mailto: link or in plain text. Then, to determine what kind of address it is:

  • Is there only one email on the page?
    • Yes -> catch-all email.
    • No -> Is a name found on that page as well?
      • No -> catch-all email (a site can have more than one catch-all address, perhaps for different purposes such as info and employment).
      • Yes -> the email should be attached to the name found right before it. It is normal for the name to appear before the email, so it should be safe to assume that the name appearing first belongs to a more senior member, e.g. a chairman or partner.
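The decision logic above could be sketched like this. The email regex, the 30-character proximity window, and the sample data are my own assumptions, not part of the answer:

```ruby
# Extract emails from the page treated as a plain string, then attach
# each one to a known name appearing shortly before it; otherwise
# classify it as a catch-all address.
EMAIL_RE  = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/
PROXIMITY = 30  # max chars between a name and "its" email; arbitrary choice

def find_emails(text)
  text.scan(EMAIL_RE).map(&:downcase).uniq
end

def classify_emails(text, names)
  lower = text.downcase
  find_emails(text).map do |email|
    pos   = lower.index(email)
    owner = names.find do |name|
      idx = text.index(name)
      idx && idx < pos && (pos - (idx + name.length)) <= PROXIMITY
    end
    if owner
      { email: email, kind: :personal, name: owner }
    else
      { email: email, kind: :catch_all }
    end
  end
end

text = "Rick Sarris, partner: rick@example.com. " \
       "General inquiries: info@example.com."
classify_emails(text, ["Rick Sarris"])
# rick@example.com -> :personal ("Rick Sarris"); info@example.com -> :catch_all
```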

I looked at http://ricksarassociates.com/ and I can't find any partners listed there at all, so in my opinion you will hardly benefit from that one; you may be better off looking for some other test sites.

I have done this kind of data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying that you are not allowed to send unsolicited messages to individuals, but you are allowed to email the company; so it is the same problem from another angle.

I wish I knew the mathematics and algorithms by heart, because I'm sure there is a fascinating solution hidden in artificial intelligence and machine learning. But as far as I can see, the only solution is to create a set of rules, which will probably get quite complicated over time. Maybe you can apply some Bayesian filtering - it works very well for email.

But - to be a little more productive here - one thing I do know is important: start by creating a crawler environment and building a dataset. Keep a database of URLs so you can add more at any time, and run the crawl over what you already have, so that you can test your parsing by querying your own 100% local copy of the data. This will save you a lot of time compared to scraping live sites while you are still tuning things.
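One way to set up that "crawl once, query locally" environment, sketched as a disk cache keyed by a digest of the URL. The stubbed download method stands in for a real HTTP fetch, which is why this sketch runs offline:

```ruby
# Cache each fetched page on disk so parser tweaks re-run against
# the local copy instead of hitting live sites every time.
require "digest"
require "fileutils"

class PageCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  def fetch(url)
    path = File.join(@dir, Digest::SHA256.hexdigest(url))
    return File.read(path) if File.exist?(path)  # cache hit
    html = download(url)
    File.write(path, html)
    html
  end

  private

  # Stub so the sketch runs offline; replace with e.g. Net::HTTP.get.
  def download(url)
    "<html><body>stub page for #{url}</body></html>"
  end
end

cache = PageCache.new("crawl_cache")
cache.fetch("http://www.taxestaxestaxes.com")  # fetched and stored
cache.fetch("http://www.taxestaxestaxes.com")  # served from disk
```

A URL table in a small database would sit in front of this, so you can keep adding sites and re-run the crawl over everything you already have.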

I made my own search engine a few years ago, scraping all the .no domains, but I only needed the index file of each. It still took a week to scrape, and I think it came to 8 GB of data for that single file per domain alone, and I had to use several DNS servers to make it work because of the heavy DNS traffic. There are many problems to take care of. I guess what I am saying is: if you crawl at a large scale, store the raw data as you go so you can work more efficiently on the parsing later.

Good luck, and post an update if you get it working. I don't think this is possible without clever algorithms or AI, though - people create websites the way they like and pull their templates out of their ass, so there are no rules to follow. As a result, you will end up with noisy data.

Do you have funding for this? If so, it gets easier. You can simply crawl each site and build a profile for it, then pay someone cheap to manually review the parsed data and remove the errors. That is probably how most people do it - unless someone has already done it and the database is being sold or exposed through a web service you can query.


I have made similar scrapers for these kinds of pages, and it varies greatly from site to site. If you are trying to make one crawler that automatically finds this information on any site, it will be difficult. At a high level, though, it looks something like this.

  • For each site you are checking, look for element patterns. Divs often have labels, ids, and classes that let you capture information easily. You may find that many divs share a specific class name; check for that first.
  • It is often better to collect too much data from a given page and then whittle it down on your side. You could, for example, look for information by type (links) or by regular expression (emails) to find formatted text. Names and job titles are harder to find this way, but on many pages they sit in a fixed position relative to other well-formatted elements.
  • Names are often prefixed with honorifics (Mrs., Mr., Dr.) or suffixed with credentials (JD, MD, etc.). You can build a bank of these and check for them on any page you end up on.
  • Finally, if you really want to make this a general-purpose process, you can add heuristics that improve your methods based on the information you expect. Names, for example, mostly come from a fairly small list: if it is worth your time, you can check extracted text against a list of the most common names.
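The honorific and credential heuristic from the list above might look roughly like this; the title and credential lists are tiny invented samples, as is the text:

```ruby
# Treat "Title First Last" or "First Last, CREDENTIAL" as a likely name.
def likely_names(text)
  titled = text.scan(/(?:Mr|Mrs|Ms|Dr)\.?\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)/)
  credentialed = text.scan(/([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+),\s*(?:CPA|JD|MD|PhD)\b/)
  (titled + credentialed).flatten.uniq
end

text = "Meet Dr. Alice Brown and our partner Sam Green, CPA."
likely_names(text)
# => ["Alice Brown", "Sam Green"]
```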

Given what you mentioned in your original question, it seems you would benefit most from a general regular-expression-based finder, which you can then refine as you learn more about the sites you are dealing with.

