The spider does not scrape the right number of items

I have been studying Scrapy for the past few days and am having trouble getting all the list items on the page.

The page has a structure similar to this:

<ol class="list-results">
    <li class="SomeClass i">
        <ul>
            <li class="name">Name1</li>
        </ul>
    </li>
    <li class="SomeClass 0">
        <ul>
            <li class="name">Name2</li>
        </ul>
    </li>
    <li class="SomeClass i">
        <ul>
            <li class="name">Name3/li>
        </ul>
    </li>
</ol>

In the Scrapy parse callback, I get all the list items like this:

from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    all_elements = sel.css('.SomeClass')
    print len(all_elements)

I know that on the test page there are about 300 list items with this class, but printing len(all_elements) gives me only 61.

I also tried an XPath that matches the class token exactly:

sel.xpath("//*[contains(concat(' ', @class, ' '), ' SomeClass ')]")

And yet I still get 61 elements instead of the 300 there should be.

I also wrap the extraction in try/except clauses in case a single element raises an exception.
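Simplified, the loop looks roughly like this (the yielded field name is just for illustration):

def parse(self, response):
    sel = Selector(response)
    for element in sel.css('.SomeClass'):
        try:
            # the inner li.name holds the value I am after
            name = element.css('li.name::text').extract()[0]
            yield {'name': name}
        except IndexError:
            # extract() returned an empty list for this element
            continue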

Here is the page I am testing against: https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=

Any hints would be appreciated. Thanks in advance!

The HTML of that page is broken, and Scrapy (which uses lxml under the hood) fails to parse all of it, so part of the tree is silently lost. For example, there is an unclosed div inside this li:

<li class="unit"><span>Unit:</span> 
    <div class="unit-block"> Language Program                  
</li>

BeautifulSoup handles broken HTML much better. If you want to stay with Scrapy, you can re-parse the HTML of each response with BeautifulSoup.

A quick demonstration in the scrapy shell:

$ scrapy shell "https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter="
In [1]: len(response.css('li.student'))
Out[1]: 55

In [2]: from bs4 import BeautifulSoup

In [3]: soup = BeautifulSoup(response.body)

In [4]: len(soup.select('li.student'))
Out[4]: 281

You can still use a CrawlSpider with a LinkExtractor to follow the links, and do the actual extraction with BeautifulSoup inside your callbacks.
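For example, a minimal sketch of such a spider (the li.name selector inside each result is an assumption based on the markup shown in the question):

from bs4 import BeautifulSoup
from scrapy import Spider


class PeopleSpider(Spider):
    name = 'msu_people'
    start_urls = [
        'https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=',
    ]

    def parse(self, response):
        # re-parse the broken HTML with BeautifulSoup instead of
        # relying on Scrapy's lxml-based selectors
        soup = BeautifulSoup(response.body, 'html.parser')
        for student in soup.select('li.student'):
            # 'li.name' is assumed from the structure in the question
            name = student.select_one('li.name')
            if name is not None:
                yield {'name': name.get_text(strip=True)}

If the built-in parser still mangles the markup, install html5lib and pass it explicitly: BeautifulSoup(response.body, 'html5lib') recovers broken HTML the way a browser would.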
