Formatting Scrapy output to XML

I'm using Scrapy to scrape data from a website, and I need the data in a specific format when I export it to XML.

Here is what I would like my XML to look like:

    <?xml version="1.0" encoding="UTF-8"?>
    <data>
      <row>
        <field1><![CDATA[Data Here]]></field1>
        <field2><![CDATA[Data Here]]></field2>
      </row>
    </data>

I run the scrape using the command:

 $ scrapy crawl my_scrap -o items.xml -t xml 

The output I currently get looks like this:

 <?xml version="1.0" encoding="utf-8"?> <items><item><field1><value>Data Here</value></field1><field2><value>Data Here</value></field2></item> 

As you can see, this adds <value> elements, and I cannot rename the root node or the item nodes. I know that I need to use XmlItemExporter, but I'm not sure how to implement this in my project.

I tried adding it to pipelines.py as shown here, but I always get the error:

AttributeError: 'CrawlerProcess' object has no attribute 'signals'

Does anyone have examples of reformatting data when exporting to XML using XmlItemExporter?

Edit:

Here is the XmlItemExporter pipeline in my pipelines.py module:

    from scrapy import signals
    from scrapy.contrib.exporter import XmlItemExporter

    class XmlExportPipeline(object):

        def __init__(self):
            self.files = {}

        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            file = open('%s_products.xml' % spider.name, 'w+b')
            self.files[spider] = file
            self.exporter = XmlItemExporter(file)
            self.exporter.start_exporting()

        def spider_closed(self, spider):
            self.exporter.finish_exporting()
            file = self.files.pop(spider)
            file.close()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

Edit (Show Changes and Traceback):

I changed the spider_opened function:

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, 'data', 'row')
        self.exporter.start_exporting()

The traceback I get:

    Traceback (most recent call last):
      File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/core/engine.py", line 265, in <lambda>
        spider=spider, reason=reason, spider_stats=self.crawler.stats.get_stats()))
      File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
        return signal.send_catch_log_deferred(*a, **kw)
      File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
        *arguments, **named)
    --- <exception caught here> ---
      File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 134, in maybeDeferred
        result = f(*args, **kw)
      File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
        return receiver(*arguments, **named)
      File "/root/self_opportunity/self_opportunity/pipelines.py", line 28, in spider_closed
        self.exporter.finish_exporting()
    exceptions.AttributeError: 'XmlExportPipeline' object has no attribute 'exporter'
1 answer

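You can make XmlItemExporter do most of what you want by specifying the element names you need. Pass them as keyword arguments: the documented parameter order is item_element first and root_element second, so a positional 'data', 'row' would not map to the root and row names you intend.

    XmlItemExporter(file, root_element='data', item_element='row')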

See the documentation.

The <value> elements appear in your fields because those fields do not hold scalar values. If XmlItemExporter sees a scalar value, it simply writes <fieldname>data</fieldname>, but if it sees an iterable value, it serializes it as <fieldname><value>data1</value><value>data2</value></fieldname>. The solution is to stop putting non-scalar values into your item fields.
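One way to do that without touching every spider is a small item pipeline that collapses list values into a single string before the item is exported. This is a minimal sketch, not part of Scrapy: the class name FlattenFieldsPipeline is hypothetical, and it assumes your field values are strings or lists of strings that can sensibly be joined with a space.

```python
class FlattenFieldsPipeline(object):
    """Hypothetical pipeline: turn list/tuple field values into one string."""

    def process_item(self, item, spider):
        # Copy the keys first so we can reassign values while looping.
        for name in list(item.keys()):
            value = item[name]
            if isinstance(value, (list, tuple)):
                # Assumption: the pieces are text fragments; strip each one
                # and join them so the exporter sees a scalar value.
                item[name] = " ".join(str(v).strip() for v in value)
        return item
```

It would need to be listed in ITEM_PIPELINES with a lower order number than the export pipeline so that it runs first. If your fields only ever hold one match, assigning the first extracted value in the spider instead of the whole list achieves the same thing.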

If you do not want to do that, subclass XmlItemExporter and override its _export_xml_field method to do what you want when the field value is iterable. This is the code for XmlItemExporter so you can see the implementation.
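To make the target of such an override concrete, here is a standalone sketch in plain Python of the serialization you are after: one element per field, iterable values joined into a single string, and the text wrapped in CDATA. It does not use Scrapy and does not reproduce the signature of _export_xml_field; the function names cdata and render_rows are made up for illustration, and joining with a space is an assumption about your data.

```python
def cdata(text):
    # A CDATA section cannot contain "]]>", so split that sequence
    # across two adjacent CDATA sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def render_rows(rows, root="data", row="row"):
    """Render rows (lists of (field name, value) pairs) as an XML string."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<%s>" % root]
    for fields in rows:
        lines.append("  <%s>" % row)
        for name, value in fields:
            if isinstance(value, (list, tuple)):
                # Collapse iterable values instead of emitting <value> children.
                value = " ".join(str(v) for v in value)
            lines.append("    <%s>%s</%s>" % (name, cdata(str(value)), name))
        lines.append("  </%s>" % row)
    lines.append("</%s>" % root)
    return "\n".join(lines)


xml = render_rows([[("field1", ["Data Here"]), ("field2", "Data Here")]])
```

In an actual subclass, the branch that handles iterable values is where the join would go; check the method's signature in the Scrapy version you have installed before overriding it, since it is a private method.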
