How to remove unicode?

While scraping web pages and after stripping all the HTML tags, I ended up with the black telephone character \u260e in Unicode (☎). But unlike this answer, I do want to get rid of it too.

I used the following regular expressions in Scrapy to remove html tags:

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M) 

Then I tried to match \u260e, and I think I got caught by the backslash plague. I tried these patterns unsuccessfully:

 pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
 pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
 pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked, and I still have \u260e in the output. How can I make it disappear?

3 answers

Using Python 2.7.3, the following works fine for me:

 import re
 pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL|re.M)
 s = u"bla ble \u260e blo"
 re.sub(pattern, "", s)

Output:

 u'bla ble  blo'

As pointed out by @Zack, this works because the pattern is now a unicode string: the escape has already been converted, so the sequence of characters \u260e is now the single character ☎ (probably two bytes internally) rather than six separate characters.

Once both the string to be searched and the regular expression contain the black phone itself, and not the sequence of characters \u260e, they match.
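The difference between the character and the escape sequence can be checked directly. The following is a minimal sketch, assuming Python 3 (where every `str` literal is already Unicode, so the `u` prefix is not needed); the variable names are illustrative only.

```python
import re

text = "bla ble \u260e blo"

# "\u260e" is a single character; "\\u260e" is six characters
# (backslash, u, 2, 6, 0, e).
one_char = "\u260e"
six_chars = "\\u260e"
print(len(one_char), len(six_chars))

# A pattern holding the real character removes it from the text.
pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL | re.M)
cleaned = pattern.sub("", text)
print(repr(cleaned))

# A pattern that looks for a literal backslash followed by "u260e" finds
# nothing, because the text contains the phone character itself.
literal_match = re.search("\\\\u260e", text)
print(literal_match)
```

Note that removing the symbol leaves the two spaces that surrounded it, so a follow-up whitespace cleanup may be wanted.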


If your string is already unicode, there are two easy ways. The second obviously affects more than just ☎: it drops every character that is not in string.printable.

 >>> import string
 >>> foo = u"Lorum ☎ Ipsum"
 >>> foo.replace(u'☎', '')
 u'Lorum  Ipsum'
 >>> "".join(s for s in foo if s in string.printable)
 u'Lorum  Ipsum'
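To make the difference between the two approaches concrete, here is a small sketch, assuming Python 3. The helper names `strip_symbol` and `keep_printable_ascii` are made up for this illustration and are not part of any library.

```python
import string


def strip_symbol(text, symbol="\u260e"):
    """Remove only the given symbol (by default the black telephone)."""
    return text.replace(symbol, "")


def keep_printable_ascii(text):
    """Keep only characters listed in string.printable (printable ASCII)."""
    return "".join(ch for ch in text if ch in string.printable)


# "Caf\u00e9" is "Café": the accented letter is valid text, but not ASCII.
sample = "Caf\u00e9 \u260e 555"

only_symbol_removed = strip_symbol(sample)
ascii_only = keep_printable_ascii(sample)

print(repr(only_symbol_removed))
print(repr(ascii_only))
```

The targeted `replace` keeps the accented letter, while the `string.printable` filter removes it along with the phone symbol, which matters when the scraped text is not plain English.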

You can try with BeautifulSoup as described here, with something like

 soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
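It may help to see what the `decode('utf-8', 'ignore')` step does on its own. The sketch below assumes Python 3 and uses only the standard library (BeautifulSoup itself is left out); the byte string is invented for the example. The `'ignore'` handler drops bytes that are not valid UTF-8, but ☎ is valid UTF-8, so it survives decoding and would still need to be removed afterwards.

```python
# Raw bytes as they might arrive from a response body (hypothetical data):
# b"\xe2\x98\x8e" is the UTF-8 encoding of U+260E, b"\xff" is not valid UTF-8.
raw = b"tel \xe2\x98\x8e \xff 555"

# errors="ignore" silently discards the undecodable byte, nothing else.
decoded = raw.decode("utf-8", "ignore")
print(repr(decoded))

# The phone symbol is still present after decoding.
still_has_phone = "\u260e" in decoded
print(still_has_phone)
```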