The crawler does not extract the same data when run as it did in training

When I train my crawler to scrape a Yelp page, it picks up all the information without any extra work, but when I run the crawler, the address is not recognized and is not recorded.

+7
3 answers

Getting company data from Yelp

In this case, we want to get addresses for companies in San Francisco from www.yelp.com.

Site analysis

We can get a list of companies starting with the letter "A" on this page:

http://www.yelp.com/sm/san-francisco-ca-us/a/1 

This catalog page tells us that for "A" there are 42 pages of results with up to 80 results per page.

That's good news.

Create API

Now I'm going to create an API to retrieve data from the first page, and then use Bulk Extract to pass a list of URLs to all 42 pages.

Using Magic, I can create an API in just a few clicks:

  • Go to Magic.import.io
  • Paste in the Yelp page URL (link above)
  • Click Extract Data
  • Click Get API
  • Click "Copy this to My Data"

Now we have an API!

(Note that if you need more control over what to include or exclude from the API, you can use Extractor)

Create URL List

To create a list of URLs that will allow us to receive data from pages 1 to 42, I am going to use an external service located at:

http://texttool.blogspot.co.uk/

Find the "Generate list of numbers" tool and use it to create the list of URLs:

http://www.yelp.com/sm/san-francisco-ca-us/a/1
http://www.yelp.com/sm/san-francisco-ca-us/a/2
http://www.yelp.com/sm/san-francisco-ca-us/a/3
http://www.yelp.com/sm/san-francisco-ca-us/a/4
http://www.yelp.com/sm/san-francisco-ca-us/a/5
http://www.yelp.com/sm/san-francisco-ca-us/a/6
http://www.yelp.com/sm/san-francisco-ca-us/a/7
http://www.yelp.com/sm/san-francisco-ca-us/a/8
http://www.yelp.com/sm/san-francisco-ca-us/a/9
http://www.yelp.com/sm/san-francisco-ca-us/a/10
http://www.yelp.com/sm/san-francisco-ca-us/a/11
http://www.yelp.com/sm/san-francisco-ca-us/a/12
http://www.yelp.com/sm/san-francisco-ca-us/a/13
http://www.yelp.com/sm/san-francisco-ca-us/a/14
http://www.yelp.com/sm/san-francisco-ca-us/a/15
http://www.yelp.com/sm/san-francisco-ca-us/a/16
http://www.yelp.com/sm/san-francisco-ca-us/a/17
http://www.yelp.com/sm/san-francisco-ca-us/a/18
http://www.yelp.com/sm/san-francisco-ca-us/a/19
http://www.yelp.com/sm/san-francisco-ca-us/a/20
http://www.yelp.com/sm/san-francisco-ca-us/a/21
http://www.yelp.com/sm/san-francisco-ca-us/a/22
http://www.yelp.com/sm/san-francisco-ca-us/a/23
http://www.yelp.com/sm/san-francisco-ca-us/a/24
http://www.yelp.com/sm/san-francisco-ca-us/a/25
http://www.yelp.com/sm/san-francisco-ca-us/a/26
http://www.yelp.com/sm/san-francisco-ca-us/a/27
http://www.yelp.com/sm/san-francisco-ca-us/a/28
http://www.yelp.com/sm/san-francisco-ca-us/a/29
http://www.yelp.com/sm/san-francisco-ca-us/a/30
http://www.yelp.com/sm/san-francisco-ca-us/a/31
http://www.yelp.com/sm/san-francisco-ca-us/a/32
http://www.yelp.com/sm/san-francisco-ca-us/a/33
http://www.yelp.com/sm/san-francisco-ca-us/a/34
http://www.yelp.com/sm/san-francisco-ca-us/a/35
http://www.yelp.com/sm/san-francisco-ca-us/a/36
http://www.yelp.com/sm/san-francisco-ca-us/a/37
http://www.yelp.com/sm/san-francisco-ca-us/a/38
http://www.yelp.com/sm/san-francisco-ca-us/a/39
http://www.yelp.com/sm/san-francisco-ca-us/a/40
http://www.yelp.com/sm/san-francisco-ca-us/a/41
http://www.yelp.com/sm/san-francisco-ca-us/a/42
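
If you would rather not depend on an external text tool, the same list can be produced with a few lines of Python. This is only a sketch: the base URL and the page count of 42 come from the catalog page above, while the names BASE_URL, PAGE_COUNT and build_page_urls are made up for illustration.

```python
# Base URL and page count are taken from the catalog page for the letter "A".
BASE_URL = "http://www.yelp.com/sm/san-francisco-ca-us/a/"
PAGE_COUNT = 42


def build_page_urls(base_url=BASE_URL, page_count=PAGE_COUNT):
    """Return one URL per results page, numbered from 1 to page_count."""
    return [f"{base_url}{page}" for page in range(1, page_count + 1)]


if __name__ == "__main__":
    # Print one URL per line, ready to paste into Bulk Extract.
    print("\n".join(build_page_urls()))
```

The output is one URL per line, which is the layout the list above uses.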

Bulk extraction

Now you can use Bulk Extract to retrieve data from all of these URLs in one go.

To do this:

  • Go to the Configuration tab of your Yelp API.
  • Select Bulk Extract from the drop-down list.
  • Paste in a list of 42 URLs
  • Click Run Queries

Note: some requests may fail. Click the "X URLs failed" icon to retry the failed requests.
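
The retry step can also be pictured in code. The sketch below is not import.io's implementation; it only illustrates the idea of re-running the URLs that failed. The function run_with_retries and its fetch argument are hypothetical: fetch stands for whatever callable you use to request a single URL, and it is assumed to raise an exception on failure.

```python
def run_with_retries(urls, fetch, max_rounds=3):
    """Fetch every URL, re-trying only the ones that failed.

    Returns (results, still_failed): a dict mapping URL to fetched data,
    and the list of URLs that still failed after max_rounds attempts.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_rounds):
        failed = []
        for url in pending:
            try:
                results[url] = fetch(url)
            except Exception:
                # Remember the failure and try this URL again next round.
                failed.append(url)
        pending = failed
        if not pending:
            break
    return results, pending
```

Anything left in the second return value after the last round still needs attention, much like URLs that keep showing up under the "X URLs failed" icon.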

Export

Now you can export this data as a spreadsheet, HTML, or JSON.

Further reading

http://support.import.io/knowledgebase/articles/669784-getting-company-data-from-yelp

+8

You should use XPath to select whatever you want on Yelp. I have done this before; on Yelp, XPath is more accurate than manual training.
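
To give an idea of what selecting by XPath looks like, here is a minimal Python sketch using the standard library's ElementTree, which supports a subset of XPath. The markup in SAMPLE is invented for illustration and is not Yelp's real page structure, so the class name "business", the address element and the placeholder addresses are all assumptions. Real pages are usually not well-formed XML and would need an HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a results page.
SAMPLE = """
<div class="results">
  <div class="business">
    <a class="name">Example Cafe</a>
    <address>123 Example St, San Francisco, CA</address>
  </div>
  <div class="business">
    <a class="name">Sample Bakery</a>
    <address>456 Sample Ave, San Francisco, CA</address>
  </div>
</div>
"""


def extract_addresses(markup):
    """Return the text of every address nested in a 'business' block."""
    root = ET.fromstring(markup)
    # XPath: any <div class="business"> below the root, then its <address> child.
    nodes = root.findall(".//div[@class='business']/address")
    return [(node.text or "").strip() for node in nodes]


if __name__ == "__main__":
    for address in extract_addresses(SAMPLE):
        print(address)
```

The point of the expression is that it names the element by its position and attributes, so it does not depend on what the trainer happened to recognize.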

+1

I have had better luck with http://datascramblr.com, which does everything for you automatically for Yelp.

0