How to remove unicode?

Question

While scraping web pages, after getting rid of all the HTML tags, I was left with the black telephone character \u260e (☎) in unicode. But unlike this answer, I do want to get rid of it too.

I used the following regular expressions in Scrapy to remove html tags:

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)

Then I tried to match \u260e, and I think I ran into a backslash escaping problem. I tried these patterns without success:

 pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
 pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
 pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked, and I still have \u260e in the output. How can I make it disappear?

+7

python python-2.7 regex scrapy

rafa May 06 '13 at 15:16


3 answers

If your string is already unicode, there are two easy ways. The second removes every character that is not in string.printable, not only ☎.

 >>> import string
 >>> foo = u"Lorum ☎ Ipsum"
 >>> foo.replace(u'☎', '')
 u'Lorum  Ipsum'
 >>> "".join(s for s in foo if s in string.printable)
 u'Lorum  Ipsum'

See "Remove non-ASCII characters but leave periods and spaces" for more information about string.printable, and "The best way to remove multiple spaces in a string in Python" if you do not want to be left with multiple spaces.
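The two steps mentioned above (filtering by string.printable, then collapsing the leftover spaces) can be combined. This is only a minimal sketch: the helper names `strip_non_printable` and `collapse_spaces` are made up for illustration and are not from the answer.

```python
# -*- coding: utf-8 -*-
import re
import string


def strip_non_printable(text):
    """Keep only the characters found in string.printable (ASCII only)."""
    return u"".join(ch for ch in text if ch in string.printable)


def collapse_spaces(text):
    """Replace each run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", u" ", text).strip()


foo = u"Lorum \u260e Ipsum"
filtered = strip_non_printable(foo)   # the phone is gone, two spaces remain
cleaned = collapse_spaces(filtered)   # the double space becomes a single one
```

Note that the filter also drops accented letters and any other non-ASCII text, so it only fits cases where plain ASCII output is acceptable.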

+4

timss May 06 '13 at 15:27


You can try BeautifulSoup as described here, with something like

 soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
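The decode step in that line can also be tried on its own, without BeautifulSoup. The following is a rough sketch that assumes the scraped body arrives as UTF-8 encoded bytes; the sample value `raw` is invented for illustration.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical scraped bytes; \xe2\x98\x8e is the UTF-8 encoding of U+260E.
raw = b"<p>call \xe2\x98\x8e now</p>"

# Decode to unicode first; 'ignore' silently drops bytes that are not valid UTF-8.
text = raw.decode("utf-8", "ignore")

# With both the pattern and the text in unicode, the phone character can match.
pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL | re.M)
cleaned = pattern.sub(u"", text)
```

Decoding before matching keeps the pattern and the text in the same type, which is the condition the accepted answer relies on.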

+1

octoback May 06 '13 at 15:29


Rubens · Accepted Answer · 2013-05-06T15:24:37+0000

Using Python 2.7.3, the following works fine for me:

 import re
 pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL | re.M)
 s = u"bla ble \u260e blo"
 re.sub(pattern, "", s)

Output:

 u'bla ble  blo'

As pointed out by @Zack, this works because the pattern is now a unicode string: the literal has already been converted, so the sequence of characters \u260e is now the (probably two) bytes used to write that little black telephone ☎ itself.

Once both the string being searched and the regular expression contain the black phone itself, and not the sequence of characters \u260e, they match.
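The difference between the character and the sequence of characters can be checked directly. This is a small illustrative sketch, not part of the original answer; the names `phone` and `escaped_text` are invented here.

```python
# -*- coding: utf-8 -*-
import re

phone = u"\u260e"          # one character: the black telephone itself
escaped_text = u"\\u260e"  # six characters: backslash, u, 2, 6, 0, e

s = u"bla ble \u260e blo"

# The single character occurs in s, so the substitution removes it.
removed = re.sub(phone, u"", s)

# The six-character text does not occur in s. re.escape makes the regex
# search for those literal characters, so the search finds nothing.
literal_match = re.search(re.escape(escaped_text), s)
```

Here `removed` no longer contains the phone, while `literal_match` is None, since the text never contained a backslash followed by u260e.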
