Character \u260e

Question

While scraping pages, I got the \u260e character in a Unicode string. My output is "The Last Resort, â˜Ž +977 1 4700525", so â˜Ž appears where ☎ should be.

How do I get the telephone sign (☎) back? The output should be "The Last Resort, ☎ +977 1 4700525".

Krish

+1

python unicode

Elisa Sep 01 '11 at 6:40

2 answers

You can print such characters on the results page by using HTML entities with the character's code.

For example: http://www.danshort.com/HTMLentities/index.php?w=dingb

Or use the string's encode method to convert it to the desired encoding.
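A minimal sketch of both suggestions, assuming Python 3 (where every str is Unicode). The sample text comes from the question, and the variable names are only illustrative. The standard `xmlcharrefreplace` error handler makes encode emit a numeric character reference for every character the target encoding cannot represent.

```python
# Sketch, assuming Python 3. Variable names are illustrative.
text = "The Last Resort, \u260e +977 1 4700525"

# Option 1: encode to ASCII and let the built-in error handler turn every
# non-ASCII character into a numeric character reference.
html_safe = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(html_safe)  # The Last Resort, &#9742; +977 1 4700525

# Option 2: encode to UTF-8 and make sure whoever reads the bytes is told
# that they are UTF-8.
utf8_bytes = text.encode("utf-8")
```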

+1

Dhruvpathak Sep 01 '11 at 6:42

Ray Toal · Accepted Answer · 2011-09-01T07:28:38+0000

When you scraped the site, Python recognized the "☎" character and stored it in a string.

This character has the code point U+260E. However, when characters are stored, they are stored as sequences of one or more bytes. Which bytes depends on the encoding used. In your case, UTF-8 was probably used.

The UTF-8 encoding of this character is E2 98 8E (see http://www.fileformat.info/info/unicode/char/260e/index.htm ).
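The byte sequence can be checked directly in the interpreter. This is a small sketch assuming Python 3; the variable names are only illustrative.

```python
# Sketch, assuming Python 3.
phone = "\u260e"              # BLACK TELEPHONE, code point U+260E

print(hex(ord(phone)))        # 0x260e
utf8 = phone.encode("utf-8")  # one character becomes three bytes
print(utf8)                   # b'\xe2\x98\x8e'
print(len(phone), len(utf8))  # 1 3
```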

So now you have a sequence of bytes representing your character. What are you going to do with it? You are going to output it somewhere. But whatever receives those bytes has to convert them back into characters, and for that it needs to know the encoding. Let's say the encoding it assumes is Windows-1252 (see http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT ).

E2 is â
98 is ˜
8E is Ž

This is what you see. You must write your Python string out as UTF-8. Or, if you are writing HTML, follow Dhruvpathak's suggestion and use HTML character entity references, in this case

&#x260e;

or

 &#9742;

I suspect that you did not specify an encoding when writing your string out and that Windows-1252 was the default. Or perhaps your browser is configured to display Windows-1252 by default.
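The garbling described above can be reproduced, and undone, in a few lines. This is a sketch assuming Python 3, where `cp1252` is Python's name for the Windows-1252 codec; the file name is hypothetical.

```python
import os
import tempfile

phone = "\u260e"                   # the character from the question
raw = phone.encode("utf-8")        # b'\xe2\x98\x8e'

# Decoding UTF-8 bytes with the wrong codec yields three unrelated characters.
garbled = raw.decode("cp1252")
print(ascii(garbled))              # '\xe2\u02dc\u017d'

# Nothing was lost in this case, so the mistake can be reversed.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == phone)           # True

# Writing with an explicit encoding avoids depending on the platform default.
path = os.path.join(tempfile.mkdtemp(), "hotel.txt")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    f.write("The Last Resort, " + phone + " +977 1 4700525")
with open(path, "rb") as f:
    written = f.read()
print(b"\xe2\x98\x8e" in written)  # True
```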

An interesting aspect of sending data as HTML is that you can send a UTF-8 byte stream, declare UTF-8 as the charset in the HTTP Content-Type header, and put a meta tag in your HTML document stating that the page is encoded in UTF-8, and the end user may still see garbled data if their browser lets them override the encoding sent by the server.

If you use character entity references, the browser will always display the character correctly.

However, it can be inconvenient to use these entity references everywhere. Most people these days do not manually configure their browser to override the encoding sent by the server.
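As a rough sketch of sending consistent signals, assuming Python 3: the page below is encoded as UTF-8, declares that in a meta tag, and is paired with a matching Content-Type header. The page content and the `headers` dictionary are illustrative; how the header and bytes are actually sent depends on your web framework or server.

```python
# Sketch, assuming Python 3. The page and header dictionary are illustrative.
body_text = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    '<head><meta charset="utf-8"><title>Hotels</title></head>\n'
    "<body><p>The Last Resort, \u260e +977 1 4700525</p></body>\n"
    "</html>\n"
)

# The bytes on the wire must really be in the encoding that is declared.
body_bytes = body_text.encode("utf-8")

headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Content-Length": str(len(body_bytes)),  # length in bytes, not characters
}
```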

ADDITION

So, let's say you have a Unicode string, and you want to produce a regular (non-Unicode) string (type str) containing HTML character entity references. Here is a complete example script that illustrates a direct, though not necessarily the most Pythonic, way of doing this:

    # -*- coding: utf-8 -*-
    # The coding line above is needed on Python 2 because the source contains non-ASCII text.

    def to_character_entity_reference_string(s):
        # Replace every character with its decimal numeric character reference.
        return "".join(["&#" + str(ord(c)) + ";" for c in s])

    print(to_character_entity_reference_string(u'काठमाण्डु'))

If you run this script, you will get the output

 &#2325;&#2366;&#2336;&#2350;&#2366;&#2339;&#2381;&#2337;&#2369;

You can put this output in a file and open it in a web browser, and you will see काठमाण्डु as expected.

You can create variations on this script so that characters with code points below 128 are kept as they are and everything else becomes a character entity reference. You can also look into Python's encode and decode methods. Again, character entity references protect you against people who manually change their browser settings to override your encodings, which is nice but can be considered overkill. End users who fiddle with these settings can be said to get what they deserve, so the general consensus is that you should just encode everything in UTF-8, period. Still, it is good to know about character entity references.
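One such variation might look like the sketch below, assuming Python 3.4 or later (for `html.unescape`). The function name is hypothetical. Note that it leaves `&`, `<` and `>` untouched, so text containing markup characters would still need ordinary HTML escaping.

```python
import html

def to_mixed_entity_string(s):
    """Hypothetical helper: keep ASCII as is, reference everything else."""
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in s)

original = "The Last Resort, \u260e +977 1 4700525"
encoded = to_mixed_entity_string(original)
print(encoded)              # The Last Resort, &#9742; +977 1 4700525

# html.unescape resolves the references again, which makes a handy round-trip check.
decoded = html.unescape(encoded)
print(decoded == original)  # True
```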
