I am trying to scrape a page in Japanese using Python, curl, and BeautifulSoup. I then save the text in a MySQL database that uses utf-8 encoding and display the resulting data using Django.
Here is an example URL:
https://www.cisco.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=930026&CurrentPage=180
I have a function that I use to extract HTML as a string:
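
    # Assumed imports (Python 2 with pycurl)
    from StringIO import StringIO
    from pycurl import Curl

    def get_html(url):
        c = Curl()
        storage = StringIO()
        c.setopt(c.URL, str(url))
        # Use one cookie file for both reading and writing cookies.
        cookie_file = 'cookie.txt'
        c.setopt(c.COOKIEFILE, cookie_file)
        c.setopt(c.COOKIEJAR, cookie_file)
        # Collect the response body in memory.
        c.setopt(c.WRITEFUNCTION, storage.write)
        c.perform()
        c.close()
        return storage.getvalue()
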
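One thing worth noting about this function: it returns the raw bytes of the response body, not decoded text. Here is a minimal sketch of decoding such bytes explicitly, assuming Python 3 and that the charset is taken from the response's Content-Type header. `decode_body` and its parameters are hypothetical names for illustration, not part of pycurl:

```python
from email.message import Message


def decode_body(raw, content_type=None, default="utf-8"):
    """Decode response bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    charset = default
    if content_type:
        # Let the stdlib header parser pull out the charset parameter.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset() or default
    # Undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(charset, errors="replace")


raw = "職務内容".encode("utf-8")
text = decode_body(raw, "text/html; charset=UTF-8")
print(text)  # 職務内容
```

An unknown charset name would still raise `LookupError` here; the sketch does not handle that case.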
Then I pass it to BeautifulSoup:

    html = get_html(str(scheduled_import.url))
    soup = BeautifulSoup(html)

The parsed text is then stored in the database, and I use Django to output the data as JSON. Here is the view I am using:

    def get_jobs(request):
        jobs = Job.objects.all().only(*fields)
        joblist = []
        for job in jobs:
            job_dict = {}
            for field in fields:
                job_dict[field] = getattr(job, field)
            joblist.append(job_dict)
        return HttpResponse(dumps(joblist), mimetype='application/javascript')

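To see what a JSON serializer does with Japanese text on its own, here is a small sketch, assuming Python 3 and the standard `json` module. The `joblist` contents are made up for illustration. With text input, `dumps` emits `\uXXXX` escapes by default, which is still valid JSON and decodes back to the original characters:

```python
import json

# Made-up sample data standing in for the real job records.
joblist = [{"title": "職務内容"}]

escaped = json.dumps(joblist)                       # ASCII-only output
readable = json.dumps(joblist, ensure_ascii=False)  # keeps the characters

print(escaped)   # [{"title": "\u8077\u52d9\u5185\u5bb9"}]
print(readable)  # [{"title": "職務内容"}]

# Both forms load back to the same data.
same = json.loads(escaped) == json.loads(readable) == joblist
```

Neither form contains `\x..` sequences, so seeing those in the output may mean the stored values are the printed representation of a byte string rather than text. That is a guess and has not been checked against this code.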
As a result, the page displays escaped byte sequences, for example:
\xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88
\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9
\xe3\x82\xb7\xe3\x82\xb9\xe3\x82\xb3\xe3\x82\xb7\xe3\x82\xb9\xe3\x83\x86\xe3\x83\xa0\xe3\x82\xba\xe3\x81\xae\xe3\x82\xb3\xe3\x83\xa9\xe3\x83\x9c\xe3\x83\xac\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe4\xba\x8b\xe6\xa5\xad\xe9\x83\xa8\xe3\x81\xa7\xe3\x81\xaf\xe3\x80\x81\xe4\xba\xba\xe3\x82\x92\xe4\xb8\xad\xe5\xbf\x83\xe3\x81\xa8\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x83\x9f\xe3\x83\xa5\xe3\x83\x8b\xe3\x82\xb1\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe3\x81\xab\xe3\x82\x88\xe3\x82\x8a\xe3\
instead of the Japanese text.
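For what it's worth, those sequences look like UTF-8 bytes shown in escaped form. A minimal sketch, assuming Python 3, that decodes the second sample line back into text:

```python
# The second sample line, written as a bytes literal.
raw = b"\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9"

text = raw.decode("utf-8")
print(text)       # 職務内容

# repr() of the bytes gives the escaped form, like what the page shows.
shown = repr(raw)
print(shown)
```

So the bytes themselves appear to be intact; the open question is where in the pipeline they get turned into their escaped representation instead of being decoded.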
I spent all day trying to figure this out: I converted my database to utf-8, and I tried decoding the text from iso-8859-1 and re-encoding it as utf-8.
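A short sketch of why that iso-8859-1 round trip may not help, assuming Python 3 and that the fetched bytes were already UTF-8 (which the sample output suggests but does not prove). Decoding UTF-8 bytes as iso-8859-1 maps every byte to a separate character, and re-encoding that as UTF-8 yields different, longer bytes:

```python
original = "職務内容"
raw = original.encode("utf-8")  # what a UTF-8 page would send

# Misread the bytes as iso-8859-1, then encode the result as UTF-8.
mangled = raw.decode("iso-8859-1").encode("utf-8")

print(len(raw), len(mangled))               # 12 24
print(mangled.decode("utf-8") == original)  # False

# The damage is reversible by undoing the two steps in reverse order.
recovered = mangled.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(recovered == original)                # True
```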
Basically, I have no idea what I'm doing, and I will be grateful for any help or suggestions I can get, so I can avoid another day trying to figure it out.