How to replace "★ ✿ •" with their codes?

Question


I am working on a Python web parser, and it now runs into special characters like ★ ✿ • and others. Sometimes I get them as UTF-8: "â¿", and sometimes as unicode: u"\xe2\x80\xa2". I found a table of them, but the only thing I can do is:

 set = []
 set.append([u"\xe2\x80\xa2", "&#8226;"])
 set.append(["&#226;&#156;&#191;", "&#10047;"])
 for i in set:
     s = s.replace(i[0], i[1])

I wrote these pairs by hand, because I could not find a table that maps the values on the left to the ones on the right.

Can you help me?


python encoding html-parsing character-encoding

scythargon Feb 05 '13 at 2:19


1 answer

icktoofay · Accepted Answer · 2013-02-05T02:41:55+0000

Given a Unicode string containing a single character:

 symbol = u'★'

It can be converted to an HTML numeric character reference as follows:

 html = '&#{};'.format(ord(symbol))

To convert back, extract the number by stripping the leading &# and the trailing ; then convert it to an integer and pass it to chr (Python 3) or unichr (Python 2).
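
A minimal sketch of that reverse step, assuming Python 3 and a well-formed decimal reference. The helper name decode_decimal_ref is hypothetical, not part of any library:

```python
def decode_decimal_ref(ref):
    """Turn a decimal reference such as '&#9733;' into its character."""
    if not (ref.startswith('&#') and ref.endswith(';')):
        raise ValueError('not a numeric character reference: %r' % ref)
    number = int(ref[2:-1])   # drop '&#' and ';', then parse as base 10
    return chr(number)        # on Python 2, use unichr() instead

# Round trip using the conversion shown above.
symbol = u'★'
html = '&#{};'.format(ord(symbol))
assert decode_decimal_ref(html) == symbol
```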

If you need to process input that did not come from the conversion above, you may also have to handle hexadecimal references, which look like &#xZZZ; where ZZZ is a run of hexadecimal digits. To detect them, check whether the part after &# starts with x; if so, parse the remainder with radix 16.
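
One way to handle both forms in a whole string is a regular expression with a substitution callback. This is only a sketch, assuming Python 3; the names NUMERIC_REF and decode_numeric_refs are made up for illustration, and malformed references are simply left as they are:

```python
import re

# Matches '&#9733;' (decimal) as well as '&#x2605;' or '&#X2605;' (hexadecimal).
NUMERIC_REF = re.compile(r'&#([xX]?)([0-9a-fA-F]+);')

def decode_numeric_refs(text):
    """Replace every numeric character reference in text with its character."""
    def substitute(match):
        is_hex, digits = match.group(1), match.group(2)
        try:
            # Radix 16 when the reference starts with x, radix 10 otherwise.
            return chr(int(digits, 16 if is_hex else 10))
        except (ValueError, OverflowError):
            return match.group(0)   # leave malformed references untouched
    return NUMERIC_REF.sub(substitute, text)

print(decode_numeric_refs('&#x2605; &#10047; &#8226;'))
```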

In addition, you may need to deal with named entities. See the last two paragraphs for this.

If you want Python to deal with the encoding of an entire string, you can use this:

 text = u"I like symb★ls!"
 html = text.encode('ascii', errors='xmlcharrefreplace').decode('ascii')

Unfortunately, there is no equivalent for decoding, and this also does not escape potentially dangerous HTML characters such as < (which may or may not be what you want). If you need to decode, consider using a proper HTML parser, which can also deal with named entities such as &clubs; (♣).
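
If Python 3.4 or later is available (an assumption, since the text above covers Python 2 as well), the standard library's html module may cover both gaps: html.escape handles the markup characters before the xmlcharrefreplace step, and html.unescape decodes decimal, hexadecimal, and named references. A sketch:

```python
import html

text = u"I like symb★ls & <b>tags</b>!"

# Escape &, < and > first, then turn the remaining non-ASCII characters
# into numeric character references.
safe = html.escape(text).encode('ascii', errors='xmlcharrefreplace').decode('ascii')

# html.unescape reverses numeric and named references alike.
decoded = html.unescape(safe)
assert decoded == text
assert html.unescape('&clubs;') == u'♣'
```

The order matters here: escaping after the encode step would turn the & of each &#...; reference into &amp; and break it.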

If you want to deal with named entities and don't want to use a real HTML parser, there is a machine-readable list of entities (it can be read with Python's json module).
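
As a stand-in for that JSON list, the sketch below uses the name2codepoint table that ships with Python, which only covers the HTML 4 entity names. It assumes Python 3; the names NAMED_REF and decode_named_refs are hypothetical:

```python
import re
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import name2codepoint

# An entity name between '&' and ';', e.g. '&clubs;'.
NAMED_REF = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')

def decode_named_refs(text):
    """Replace known named entities in text; unknown names are left as-is."""
    def substitute(match):
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            return match.group(0)
        return chr(codepoint)   # on Python 2, use unichr() instead
    return NAMED_REF.sub(substitute, text)

print(decode_named_refs('&clubs; &bull; &nosuchname;'))
```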
