So, I'm trying to export data scraped from a website with Scrapy in a specific format when I export it to XML.
Here is what I would like my XML to look like:
<?xml version="1.0" encoding="UTF-8"?>
<data>
    <row>
        <field1><![CDATA[Data Here]]></field1>
        <field2><![CDATA[Data Here]]></field2>
    </row>
</data>
I run the scrape using the command:
$ scrapy crawl my_scrap -o items.xml -t xml
The current output I get looks like this:
<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <field1><value>Data Here</value></field1>
        <field2><value>Data Here</value></field2>
    </item>
</items>
As you can see, this adds <value> elements, and I am not able to rename the root node or the item nodes. I know that I need to use XmlItemExporter, but I'm not sure how to implement it in my project.
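From what I can tell from the exporter docs, the root and item element names can be overridden with keyword arguments, and the extra <value> elements appear when a field value is a list (e.g. the result of .extract()) rather than a single string. A minimal standalone sketch of my understanding (RowItem is a hypothetical item, just for this example):

from scrapy.item import Item, Field
from scrapy.contrib.exporter import XmlItemExporter

class RowItem(Item):
    # hypothetical item, just for illustration
    field1 = Field()
    field2 = Field()

f = open('items.xml', 'w+b')
# root_element and item_element are keyword arguments per the docs
exporter = XmlItemExporter(f, root_element='data', item_element='row')
exporter.start_exporting()
# a single string serializes as plain text; a list value (e.g. from
# .extract()) is what gets wrapped in <value> elements
exporter.export_item(RowItem(field1='Data Here', field2='Data Here'))
exporter.finish_exporting()
f.close()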
I tried adding it to pipelines.py as shown here, but I always get the error:
AttributeError: 'CrawlerProcess' object has no attribute 'signals'
Does anyone know of any examples of reformatting data when exporting to XML using XmlItemExporter?
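For the CDATA wrapping in my target format, my best guess is a small XmlItemExporter subclass. This is an untested sketch that leans on the private _export_xml_field() hook and the exporter's self.xg XMLGenerator attribute in this Scrapy version, so treat it as an assumption, not a known-good fix:

from scrapy.contrib.exporter import XmlItemExporter

class CDataXmlItemExporter(XmlItemExporter):
    """Untested sketch: wrap scalar field values in CDATA sections."""

    def __init__(self, file, **kwargs):
        super(CDataXmlItemExporter, self).__init__(file, **kwargs)
        self._file = file  # keep a handle for raw (unescaped) writes

    def _export_xml_field(self, name, serialized_value):
        if isinstance(serialized_value, basestring):
            if isinstance(serialized_value, unicode):
                serialized_value = serialized_value.encode('utf-8')
            self.xg.startElement(name, {})
            # XMLGenerator has no CDATA API, so write straight to the
            # file; CDATA content must not be XML-escaped
            self._file.write('<![CDATA[%s]]>' % serialized_value)
            self.xg.endElement(name)
        else:
            # lists and dicts: fall back to the stock behaviour
            super(CDataXmlItemExporter, self)._export_xml_field(
                name, serialized_value)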
Edit:
Here is my pipeline using XmlItemExporter in my pipelines.py module:
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter


class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
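For completeness, the pipeline is enabled in settings.py; in my Scrapy version ITEM_PIPELINES is a list (newer releases take a dict mapping pipeline paths to order values):

# settings.py (project name taken from my paths in the traceback below)
ITEM_PIPELINES = [
    'self_opportunity.pipelines.XmlExportPipeline',
]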
Edit (showing my changes and the traceback):
I changed the spider_opened function:
def spider_opened(self, spider):
    file = open('%s_products.xml' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = XmlItemExporter(file, 'data', 'row')
    self.exporter.start_exporting()
The traceback I get:
Traceback (most recent call last):
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/core/engine.py", line 265, in <lambda>
    spider=spider, reason=reason, spider_stats=self.crawler.stats.get_stats()))
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
    return signal.send_catch_log_deferred(*a, **kw)
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
    *arguments, **named)
--- <exception caught here> ---
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 134, in maybeDeferred
    result = f(*args, **kw)
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
    return receiver(*arguments, **named)
  File "/root/self_opportunity/self_opportunity/pipelines.py", line 28, in spider_closed
    self.exporter.finish_exporting()
exceptions.AttributeError: 'XmlExportPipeline' object has no attribute 'exporter'
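If I read the exporter source correctly, XmlItemExporter.__init__ takes (file, **kwargs) and pops root_element and item_element from kwargs by name, so passing 'data' and 'row' positionally would raise a TypeError inside spider_opened. Scrapy's signal handling swallows that error, self.exporter is never set, and spider_closed then fails with the AttributeError shown above. A sketch of the keyword form (untested):

def spider_opened(self, spider):
    file = open('%s_products.xml' % spider.name, 'w+b')
    self.files[spider] = file
    # keyword arguments, since __init__ only accepts file plus **kwargs
    self.exporter = XmlItemExporter(file, root_element='data',
                                    item_element='row')
    self.exporter.start_exporting()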