Correctly scraping and displaying Japanese characters using Python Django BeautifulSoup and Curl

Question

I am trying to scrape a page in Japanese using Python, curl and BeautifulSoup. I then save the text in a MySQL database that uses utf-8 encoding, and display the resulting data using Django.

Here is an example URL:

https://www.cisco.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=930026&CurrentPage=180

I have a function that I use to extract HTML as a string:

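    from StringIO import StringIO
    from pycurl import Curl

    def get_html(url):
        c = Curl()
        storage = StringIO()
        c.setopt(c.URL, str(url))
        # Reuse one cookie file for reading and writing cookies
        cookie_file = 'cookie.txt'
        c.setopt(c.COOKIEFILE, cookie_file)
        c.setopt(c.COOKIEJAR, cookie_file)
        # Collect the response body in memory
        c.setopt(c.WRITEFUNCTION, storage.write)
        c.perform()
        c.close()
        return storage.getvalue()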

Then I pass it to BeautifulSoup:

    html = get_html(str(scheduled_import.url))
    soup = BeautifulSoup(html)

The result is parsed and stored in the database. I then use Django to output the data as JSON. Here is the view I am using:

    def get_jobs(request):
        jobs = Job.objects.all().only(*fields)
        joblist = []
        for job in jobs:
            job_dict = {}
            for field in fields:
                job_dict[field] = getattr(job, field)
            joblist.append(job_dict)
        return HttpResponse(dumps(joblist), mimetype='application/javascript')

As a result, escaped bytes are displayed on the page, for example:

\xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88

\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9
\xe3\x82\xb7\xe3\x82\xb9\xe3\x82\xb3\xe3\x82\xb7\xe3\x82\xb9\xe3\x83\x86\xe3\x83\xa0\xe3\x82\xba\xe3\x81\xae\xe3\x82\xb3\xe3\x83\xa9\xe3\x83\x9c\xe3\x83\xac\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe4\xba\x8b\xe6\xa5\xad\xe9\x83\xa8\xe3\x81\xa7\xe3\x81\xaf\xe3\x80\x81\xe4\xba\xba\xe3\x82\x92\xe4\xb8\xad\xe5\xbf\x83\xe3\x81\xa8\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x83\x9f\xe3\x83\xa5\xe3\x83\x8b\xe3\x82\xb1\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe3\x81\xab\xe3\x82\x88\xe3\x82\x8a\xe3\

Instead of Japanese.

I have spent all day on this: I converted my database to utf-8 and tried decoding the text from iso-8859-1 and encoding it to utf-8.

Basically, I have no idea what I'm doing, and I will be grateful for any help or suggestions I can get, so I can avoid another day trying to figure it out.

python django utf-8 beautifulsoup iso-8859-1

Ryan rogers Sep 13 '12 at 0:38

1 answer

Torsten engelbrecht · Answer 1 · 2012-09-13T02:11:42+0000

The examples you provided are the escaped representation of a byte string: the raw UTF-8 bytes of the Japanese text rather than the characters themselves. You need to convert this to a Python unicode string. Typically, the string encode and decode methods are enough to do this. If you are not sure which encoding is the correct one, just experiment in the Python console.
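One way to run that experiment is sketched below. It is written for Python 3, where the Python 2 byte string from the question corresponds to a bytes object. The helper name try_decodings and the list of candidate encodings are illustrative assumptions, not part of the original code; the sample bytes are copied from the second line of the output in the question.

```python
# Bytes copied from the escaped output shown in the question.
raw = b"\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9"


def try_decodings(data, candidates=("utf-8", "shift_jis", "euc_jp", "iso-8859-1")):
    """Decode `data` with each candidate codec (hypothetical helper).

    Returns a dict mapping codec name to the decoded text, or to None
    when the bytes are not valid in that codec.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


results = try_decodings(raw)
for name, text in results.items():
    # Only the right codec yields readable Japanese; the others fail
    # or produce unrelated characters.
    print(name, repr(text))
```

Note that iso-8859-1 never raises an error because every byte value maps to some character, so a decode that "works" is not proof that the encoding is right; the text still has to read correctly.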

Try my_new_string = my_string.decode('utf-8') to get a Python unicode string. It should then display correctly in Django templates, can be saved to the database, and so on. To check, you can also run print my_new_string and see that it prints Japanese characters.
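As a rough illustration of how the decoded text behaves once it reaches a JSON view like the one in the question, here is a minimal Python 3 sketch using only the standard json module. The field name "title" and the variable names are assumptions made for the example, and the Django response itself is left out.

```python
import json

# UTF-8 bytes as they might come back from the scraper (sample taken
# from the question's output).
raw_title = b"\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9"

# Decode once, as early as possible, and work with text from then on.
title = raw_title.decode("utf-8")

# "title" is a hypothetical field name used only for this example.
joblist = [{"title": title}]

# Default behaviour: non-ASCII characters become \uXXXX escapes, which
# any JSON parser turns back into the original characters.
payload_escaped = json.dumps(joblist)

# ensure_ascii=False keeps the characters readable in the payload; the
# response must then be sent as UTF-8.
payload_readable = json.dumps(joblist, ensure_ascii=False)

print(payload_escaped)
print(payload_readable)
```

Both payloads parse back to the same data, so the choice mainly affects how readable the raw response is.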
