I am new to Python and Scrapy and I am following the dmoz tutorial. As a minor variation on the tutorial's suggested start URL, I chose the Japanese category from the dmoz example site, and noticed that the feed export I end up with shows Unicode escape codes rather than actual Japanese characters.
It seems I need to use TextResponse somehow, but I'm not sure how to get my spider to use that object instead of the base Response object.
- How do I change my code to show Japanese characters in my output?
- How do I get rid of the square brackets, single quotes, and 'u' that wrap my output values?
Ultimately, I want the output to say
オンラインショップ (these are Japanese characters)
instead of the current output
[u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'] (unicode)
If you look at my screenshot, this corresponds to cell C7, one of the text names.
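To make the difference between the two outputs concrete, here is a minimal sketch of what seems to be going on, assuming extract() returns a list that holds a single unicode string. The `extracted` value is a hand-written stand-in copied from the output above, not real spider output, and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Hand-written stand-in for the value stored in item['title'] (assumption:
# extract() returns a list containing one unicode string).
extracted = [u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7']

# Writing the list itself writes its repr: the square brackets and quotes,
# plus (on Python 2) the u prefix and the \u escape codes.
as_list = repr(extracted)

# Joining the list gives one plain string without the wrapping.
title = u''.join(extracted)

# Encoding to UTF-8 gives the bytes that a UTF-8 aware viewer
# displays as Japanese characters.
title_utf8 = title.encode('utf-8')
```

If that assumption holds, the brackets, quotes and 'u' come from exporting the list rather than the string inside it, which is separate from the question of how the file is encoded.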
Here is my spider (the same as in the tutorial, except for a different start_urls value):
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    from dmoz.items import DmozItem


    class DmozSpider(BaseSpider):
        name = "dmoz.org"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/World/Japanese/"
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//ul/li')
            items = []
            for site in sites:
                item = DmozItem()
                item['title'] = site.select('a/text()').extract()
                item['link'] = site.select('a/@href').extract()
                item['desc'] = site.select('text()').extract()
                items.append(item)
            return items
settings.py:
    FEED_URI = 'items.csv'
    FEED_FORMAT = 'csv'
Output screenshot: http://i55.tinypic.com/eplwlj.png (sorry, I do not have enough SO points to post images)
fortuneRice