Scrapy - Follow RSS Links

I was wondering whether anyone has tried to fetch and follow the links in an RSS feed using SgmlLinkExtractor / CrawlSpider. I cannot get it to work ...

I use the following rule:

    rules = (
        Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
             follow=True,
             callback='parse_article'),
    )

(bearing in mind that in RSS the URLs are the text content of the link tag).

I am not sure how to tell SgmlLinkExtractor to retrieve the text() of the link element instead of looking for attributes ...

Any help is appreciated, thanks in advance

python web-crawler scrapy
4 answers

CrawlSpider rules will not work here. You will probably need to subclass BaseSpider and implement your own link extraction in the spider callback. For example:

    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy.selector import XmlXPathSelector


    class MySpider(BaseSpider):
        name = 'myspider'

        def parse(self, response):
            # In RSS the URL is the text of each <link> element, not an attribute.
            xxs = XmlXPathSelector(response)
            links = xxs.select("//link/text()").extract()
            return [Request(x, callback=self.parse_link) for x in links]

        def parse_link(self, response):
            # Handle each linked page here.
            pass

You can also try XPath in the shell by running, for example:

    scrapy shell http://blog.scrapy.org/rss.xml

Then enter the following in the shell:

    >>> xxs.select("//link/text()").extract()
    [u'http://blog.scrapy.org',
     u'http://blog.scrapy.org/new-bugfix-release-0101',
     u'http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release']
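
To see why an attribute-based link extractor finds nothing in a feed, the same `//link/text()` lookup can be reproduced offline. The following is only a sketch using Python's standard library rather than Scrapy; the sample feed and the `extract_links` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for this illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://www.example.com/</link>
    <item>
      <title>First post</title>
      <link>http://www.example.com/first-post</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://www.example.com/second-post</link>
    </item>
  </channel>
</rss>"""


def extract_links(xml_text):
    """Return the text of every <link> element below the root, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall(".//link") if el.text]


links = extract_links(SAMPLE_RSS)

# The <link> elements carry no attributes at all, so an extractor that
# only inspects attributes (such as href) has nothing to pick up.
link_attributes = [el.attrib for el in ET.fromstring(SAMPLE_RSS).findall(".//link")]

print(links)
print(link_attributes)
```

Note that the channel-level link comes back along with the item links, just as in the shell output above, so a spider may want to restrict the query to the item nodes.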

XMLFeedSpider can now be used for this.


I did this using CrawlSpider:

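    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector, XmlXPathSelector


    class MySpider(CrawlSpider):
        domain_name = "xml.example.com"

        def parse(self, response):
            # Walk the feed's <item> nodes and request each item's link.
            xxs = XmlXPathSelector(response)
            items = xxs.select('//channel/item')
            for i in items:
                urli = i.select('link/text()').extract()
                request = Request(url=urli[0], callback=self.parse1)
                yield request

        def parse1(self, response):
            hxs = HtmlXPathSelector(response)
            # ...
            yield MyItem()  # MyItem is the project's own item class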

but I'm not sure this is the right approach ...


An XML example based on Scrapy's XMLFeedSpider:

    from scrapy.spiders import XMLFeedSpider
    # from myproject.items import TestItem  # not needed once a dict is used


    class MySpider(XMLFeedSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/feed.xml']
        iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
        itertag = 'item'

        def parse_node(self, response, node):
            self.logger.info('Hi, this is a <%s> node!: %s',
                             self.itertag, ''.join(node.extract()))

            # item = TestItem()
            item = {}  # use a plain dict to avoid the "class not found" error
            item['id'] = node.xpath('@id').extract()
            item['name'] = node.xpath('name').extract()
            item['description'] = node.xpath('description').extract()
            return item