Feed exports with international Unicode characters (e.g. Japanese characters)

I am new to Python and Scrapy and I am following the dmoz tutorial. As a minor variation on the tutorial's suggested starting URL, I selected the Japanese category from the dmoz example site, and noticed that the feed export I end up with shows Unicode escape sequences rather than the actual Japanese characters.

It seems I need to use TextResponse somehow, but I'm not sure how to get my spider to use that object instead of the base Response object.
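A quick way to check which response class the spider is actually receiving (a sketch; Scrapy normally constructs an HtmlResponse, which is a TextResponse subclass, for HTML pages, so the encoding should already be detected):

    def parse(self, response):
        # Scrapy picks the response class from the Content-Type header,
        # so HTML pages arrive as HtmlResponse, a TextResponse subclass
        # that already carries a detected encoding.
        self.log("response class: %s" % type(response).__name__)
        self.log("detected encoding: %s" % response.encoding)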

  • How do I change my code so the output shows the Japanese characters themselves?
  • How do I get rid of the square brackets, single quotes, and u prefix that wrap my output values?

Ultimately, I want output like

オンラインショップ (these are Japanese characters)

instead of the current output

[u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'] (unicode)

If you look at my screenshot, this corresponds to cell C7, one of the extracted title values.
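For what it's worth, here is a minimal plain-Python illustration (Python 2; this is an assumption about what the CSV exporter is doing, not something confirmed from Scrapy's source) of where that formatting comes from: the extracted field is a list of unicode strings, and stringifying a list produces its repr, brackets, u prefix, and \u escapes included.

    # Illustrative only: a list holding one unicode string,
    # as extract() would return it.
    titles = [u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7']
    # Coercing the list to str gives its repr, with \u escapes:
    print str(titles)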

Here is my spider (the same as in the tutorial, except for a different start_urls):

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from dmoz.items import DmozItem

    class DmozSpider(BaseSpider):
        name = "dmoz.org"
        allowed_domains = ["dmoz.org"]
        start_urls = ["http://www.dmoz.org/World/Japanese/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//ul/li')
            items = []
            for site in sites:
                item = DmozItem()
                item['title'] = site.select('a/text()').extract()
                item['link'] = site.select('a/@href').extract()
                item['desc'] = site.select('text()').extract()
                items.append(item)
            return items

settings.py:

    FEED_URI = 'items.csv'
    FEED_FORMAT = 'csv'

output screenshot: http://i55.tinypic.com/eplwlj.png (sorry, I do not have enough SO reputation to post images)

1 answer

When you scrape text from a page, it is stored as Unicode.

What you want to do is encode it to something like UTF-8.

    unicode_string.encode('utf-8')
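For example, in a Python 2 session (the printed result assumes a UTF-8 terminal):

    >>> s = u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'
    >>> print s.encode('utf-8')
    オンラインショップ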

In addition, when you extract text with the selector it is returned as a list, even when there is only one result, so you need to take the first element.
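Applied to the spider above, both fixes together would look something like this (a sketch; the empty-list guards are mine, since extract() can return an empty list for rows without a matching node):

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            # extract() returns a list; take the first match and encode
            # the unicode string to UTF-8 bytes so the CSV feed shows
            # the characters themselves rather than \u escapes.
            title = site.select('a/text()').extract()
            item['title'] = title[0].encode('utf-8') if title else ''
            item['link'] = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            item['desc'] = desc[0].encode('utf-8') if desc else ''
            items.append(item)
        return items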

