Scrapy crawler in Python cannot follow links?

I wrote a crawler in Python using Scrapy. Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
#from scrapy.item import Item
from a11ypi.items import AYpiItem

class AYpiSpider(CrawlSpider):
    name = "AYpi"
    allowed_domains = ["a11y.in"]
    start_urls = ["http://a11y.in/a11ypi/idea/firesafety.html"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item'))

    def parse_item(self, response):
        #filename = response.url.split("/")[-1]
        #open(filename,'wb').write(response.body)
        #testing code ^ (the above)
        hxs = HtmlXPathSelector(response)
        item = AYpiItem()
        item["foruri"] = hxs.select("//@foruri").extract()
        item["thisurl"] = response.url
        item["thisid"] = hxs.select("//@foruri/../@id").extract()
        item["rec"] = hxs.select("//@foruri/../@rec").extract()
        return item

But instead of following the links, an error is thrown:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/commands/crawl.py", line 45, in run
    q.append_spider_name(name, **opts.spargs)
--- <exception caught here> ---
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/queue.py", line 89, in append_spider_name
    spider = self._spiders.create(name, **spider_kwargs)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/spidermanager.py", line 36, in create
    return self._spiders[spider_name](**spider_kwargs)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 38, in __init__
    self._compile_rules()
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 82, in _compile_rules
    self._rules = [copy.copy(r) for r in self.rules]
exceptions.TypeError: 'Rule' object is not iterable

Can someone explain what is going on? This is based on the material in the documentation, and I left the allow field empty, which should make the extractor follow all links by default. So why the error? And what can I do to make my crawler faster?

python scrapy
1 answer

From what I can see, your rules attribute is not iterable. It looks like you meant to make rules a tuple; you should read up on tuples in the Python documentation.

To fix your problem, change this line:

  rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item'))

To:

  rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),)

Notice the comma at the end? It turns the expression into a one-element tuple, which CrawlSpider can iterate over; without it, the parentheses are just grouping, so rules is a single Rule object and the list comprehension in _compile_rules fails with the TypeError above.
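
If it helps, here is a minimal sketch (not part of the original spider) showing why the trailing comma matters; a plain string stands in for the Rule object purely for illustration:

    # Parentheses alone do not create a tuple; the trailing comma does.
    rule = "Rule(...)"   # stand-in value, not a real scrapy Rule

    without_comma = (rule)   # still just the object itself -> not a sequence of rules
    with_comma = (rule,)     # a one-element tuple -> iterable

    print(type(without_comma))  # <type 'str'> -- in the spider this would be the Rule class
    print(type(with_comma))     # <type 'tuple'>

Since CrawlSpider does `[copy.copy(r) for r in self.rules]` when compiling its rules, anything that is not a sequence of Rule objects raises exactly the TypeError shown in the traceback.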

