Remove all forms of URLs from a given string in Python

Question


I am new to Python and wonder whether there is a better way to match all forms of URLs that can appear in a string. Searching on Google turns up many solutions that extract domains, replace them with links, and so on, but none that removes them from the string. I have included a few examples below for reference. Thanks!

    str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
    URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)
    print '==' + URLless_string + '=='

Error Log:

    C:\Python27>python test.py
      File "test.py", line 7
    SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

+6

python regex

Prem minister Dec 29 '12 at 11:07


2 answers

There is an error in the code (actually two):

1. You must put a backslash before the penultimate single quote to escape it:

    URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

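2. You should not use str as a variable name. It is not a reserved keyword, but it shadows the built-in str type. The re.sub call also passes thestring, which is never defined, so name the variable thestring or something similar.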

For instance:

    # -*- coding: utf-8 -*-
    # The encoding line is needed because the pattern contains non-ASCII characters.
    import re

    thestring = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
    # Replace every match of the URL pattern with the empty string.
    URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)
    print URLless_string

with the result:

this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, and and and etc.
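As a side check on the second point, the short sketch below (Python 3 syntax, standard library only) suggests that str is an ordinary built-in name rather than a reserved word, and shows what goes wrong once it is rebound. The function name shadowing_demo is hypothetical and exists only for this illustration.

```python
import keyword

# `str` is not in the keyword list, so assigning to it is legal Python.
print(keyword.iskeyword("str"))
print(keyword.iskeyword("for"))


def shadowing_demo():
    """Rebind `str` locally and report what happens when it is called."""
    str = "some text"  # hides the built-in type inside this function
    try:
        str(42)  # now calls a string object, not the type
    except TypeError as exc:
        return "TypeError: %s" % exc
    return "no error"


print(shadowing_demo())
```

The assignment itself never fails, which is why this kind of mistake usually surfaces later, at the first call to str() after the rebinding.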

+6

doru Dec 29 '12 at 11:59


kerim · Accepted Answer · 2012-12-29T11:25:11+0000

Include the encoding line at the top of the source file (the regular expression line contains non-ASCII characters, such as »), for example:

    # -*- coding: utf-8 -*-
    import re
    ...
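To see which characters trigger the encoding requirement, something like the following sketch may help (Python 3 syntax). The helper find_non_ascii and the sample string tail are hypothetical; tail is only a short excerpt modelled on the end of the character class, not the full pattern.

```python
# -*- coding: utf-8 -*-

def find_non_ascii(text):
    """Return (index, character) pairs for every non-ASCII character in text."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


# Illustrative excerpt only: a few punctuation marks followed by the guillemets.
tail = "{};:'\".,<>?«»"

for index, ch in find_non_ascii(tail):
    print(index, repr(ch), hex(ord(ch)))
```

The declared encoding has to match the encoding the file is actually saved in. The '\xab' byte in the error message is how « is stored in Latin-1 or cp1252, so a file saved that way would need to be re-saved as UTF-8 (or declared with its real encoding) for the utf-8 line to work.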

Also, enclose the regular expression string in triple single (or double) quotes, ''' or """, instead of single quotes, since the string already contains quote characters (' and ").

    r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
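Putting both suggestions together, a minimal sketch of a complete script might look like this (Python 3 syntax). The names URL_PATTERN, remove_urls, and sample are hypothetical, and the pattern assumes that the last four characters of the final character class are the typographic quotes “ ” ‘ ’.

```python
# -*- coding: utf-8 -*-
import re

# Triple-quoted raw string: the ' and " inside the pattern need no escaping.
# Assumption: the class ends with the typographic quotes “ ” ‘ ’.
URL_PATTERN = re.compile(
    r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
)


def remove_urls(text):
    """Return text with every match of URL_PATTERN deleted."""
    return URL_PATTERN.sub('', text)


# Shortened version of the sample sentence from the question.
sample = 'for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
print('==' + remove_urls(sample) + '==')
```

Compiling the pattern once keeps the long expression out of the call site. Note that the spaces on either side of each removed URL are left in place, so the cleaned text contains doubled spaces unless whitespace is collapsed in a separate step.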
