Detect URLs in a string and wrap them with an "<a href ..." tag
I am trying to write something that seems like it should be easy enough, but for some reason I am having a hard time getting my head around it.
I want to write a Python function that, when passed a string, returns that string with HTML anchor tags wrapped around the URLs.
    unencoded_string = "This is a link - http://google.com"

    def encode_string_with_links(unencoded_string):
        # some sort of regex magic occurs
        return encoded_string

    print(encode_string_with_links(unencoded_string))
    # desired output:
    # This is a link - <a href="http://google.com">http://google.com</a>

Thanks!
The "regex magic" you need is just sub (which performs a substitution):
    def encode_string_with_links(unencoded_string):
        return URL_REGEX.sub(r'<a href="\1">\1</a>', unencoded_string)

URL_REGEX could be something like:
    import re

    URL_REGEX = re.compile(r'''((?:mailto:|ftp://|http://)[^ <>'"{}|\\^`[\]]*)''')

This is a fairly common regular expression for URLs: it allows the mailto, http and ftp schemes, and after the scheme it simply continues until it hits an "unsafe" character (except for the percent sign, which you want to allow for escapes). You can make it stricter if you need to. For example, you could require that percent signs be followed by a valid hexadecimal escape code, allow only one pound sign (for a fragment), or enforce the order between query parameters and fragments. That should be enough to get you started.
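As a rough illustration of the "stricter" idea, here is a minimal sketch in which a percent sign only counts as part of the URL when it is followed by two hexadecimal digits. The names STRICT_URL_REGEX and encode_string_with_links_strict, and the example.com URL, are made up for this example; the scheme list is the same as above.

```python
import re

# Hypothetical stricter variant of URL_REGEX: a percent sign is accepted only
# as part of a %XX escape (two hex digits); a bare "%" ends the match.
STRICT_URL_REGEX = re.compile(
    r'''((?:mailto:|ftp://|http://)(?:%[0-9A-Fa-f]{2}|[^ <>'"{}|\\^`[\]%])*)'''
)

def encode_string_with_links_strict(unencoded_string):
    # \1 refers to the whole matched URL (the single capturing group).
    return STRICT_URL_REGEX.sub(r'<a href="\1">\1</a>', unencoded_string)

if __name__ == '__main__':
    print(encode_string_with_links_strict("see http://example.com/a%20b now"))
```

With this pattern, a string such as "http://example.com/100%zz" is linked only up to "http://example.com/100", because "%zz" is not a valid escape. Whether that is the behavior you want depends on your input.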
A solution found by googling:
    #---------- find_urls.py ----------#
    # Functions to identify and extract URLs and email addresses

    import re

    def fix_urls(text):
        # Flags are passed to re.compile instead of inline (?x)/(?xm), because
        # Python 3.11+ rejects inline global flags that are not at the very
        # start of the pattern.
        pat_url = re.compile(r'''
            (                   # start of match group: identify URLs within text
            (http|ftp|gopher)   # make sure we find a resource type
            ://                 # ...needs to be followed by colon-slash-slash
            (\w+[:.]?){2,}      # at least two domain groups, e.g. (gnosis.)(cx)
            (/?|                # could be just the domain name (maybe w/ slash)
            [^ \n\r"]+          # or stuff up to a space, newline, return, or quote
            [\w/])              # resource name ends in alphanumeric or slash
            (?=[\s\.,>)'"\]])   # assert: followed by whitespace or clause ending
            )                   # end of match group
            ''', re.VERBOSE)
        pat_email = re.compile(r'''
            (?=^.{11}           # mail header matcher
            (?<!Message-ID:|    # rule out Message-ID as best possible
            In-Reply-To))       # ...and also In-Reply-To
            (.*?)(              # must grab up to the email to allow the prior lookbehind
            ([A-Za-z0-9-]+\.)?  # maybe an initial part: DAVID.mertz@gnosis.cx
            [A-Za-z0-9-]+       # definitely some local user: MERTZ@gnosis.cx
            @                   # ...needs an at sign in the middle
            (\w+\.?){2,}        # at least two domain groups, e.g. (gnosis.)(cx)
            (?=[\s\.,>)'"\]])   # assert: followed by whitespace or clause ending
            )                   # end of match group
            ''', re.VERBOSE | re.MULTILINE)

        # findall returns one tuple of groups per match; url[0] is the whole URL
        for url in re.findall(pat_url, text):
            text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url": url[0]})

        # email[1] is the address itself (email[0] is the text before it)
        for email in re.findall(pat_email, text):
            text = text.replace(email[1], '<a href="mailto:%(email)s">%(email)s</a>' % {"email": email[1]})

        return text

    if __name__ == '__main__':
        print(fix_urls("test http://google.com asdasdasd some more text"))

EDIT: Adjusted to suit your needs.
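One caveat with the findall-then-replace loop above: str.replace rewrites every occurrence of the URL, so if the same URL appears twice in the text it gets wrapped on each iteration and you end up with nested tags. A possible way around this, sketched under the assumption that a much simpler URL pattern is good enough, is to do the substitution in a single pass with sub and a callback. URL_PATTERN and link_urls_once are hypothetical names for this example, not part of find_urls.py.

```python
import re

# Simplified, hypothetical URL pattern used only for this sketch: a scheme,
# then anything up to whitespace/angle bracket/quote, ending in a word
# character or slash (so trailing punctuation is left outside the link).
URL_PATTERN = re.compile(r'(?:http|ftp|gopher)://[^\s<>"]+[\w/]')

def link_urls_once(text):
    """Wrap each URL occurrence in an anchor tag in one pass over the text."""
    def wrap(match):
        url = match.group(0)
        return '<a href="%s">%s</a>' % (url, url)
    # sub visits each match exactly once, so repeated URLs are not re-wrapped.
    return URL_PATTERN.sub(wrap, text)

if __name__ == '__main__':
    print(link_urls_once("a http://google.com b http://google.com."))
```

This does not handle email addresses; the same single-pass idea could be applied to them with a second pattern.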