I wrote a regular expression that automatically detects URLs in free text entered by users. This is not as easy a task as it might seem at first glance. Jeff Atwood writes about this in his post.
His regular expression works, but requires extra code after detection.
I managed to write a regex that does everything in one go. Here's what it looks like (I split it into separate lines to make it clearer what it does):
    1  (?<outer>\()?
    2  (?<scheme>http(?<secure>s)?://)?
    3  (?<url>
    4      (?(scheme)
    5          (?:www\.)?
    6          |
    7          www\.
    8      )
    9      [a-z0-9]
    10     (?(outer)
    11         [-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))
    12         |
    13         [-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+
    14     )
    15 )
    16 (?<ending>(?(outer)\)))
As you can see, I use named capture groups (used later in Regex.Replace()), and I also included some local characters (čšžćđ), which allow localized URLs to be parsed as well. You can easily omit them if you want.
Anyway. Here is what it does (referring to line numbers):
- 1 - detects whether the URL starts with an opening parenthesis (i.e. is enclosed in parentheses) and stores it in the "outer" named capture group
- 2 - checks whether it starts with a URL scheme, and also detects whether the scheme is secure (https) or not
- 3 - starts parsing the URL itself (stores it in the "url" named capture group)
- 4-8 - an if expression that says: if "scheme" was present, then the www. part is optional; otherwise www. is mandatory for a string to be a link (so this regex detects all strings that start with either http or www)
- 9 - the first character after http:// or www. must be either a letter or a number (this could be extended to cover even more links, but I decided not to, because I can't think of a link that would start with some obscure character)
- 10-14 - an if expression that says: if "outer" (the opening parenthesis) was present, capture everything up to the last closing parenthesis; otherwise capture everything
- 15 - closes the named capture group for the URL
- 16 - if there was an opening parenthesis, captures the closing parenthesis as well and stores it in the "ending" named capture group
The first and last lines used to contain \s* as well, so that the user could also write an opening parenthesis and put a space after it before pasting the link.
Anyway. My code that replaces links with actual HTML anchor elements looks something like this:
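To make the line-by-line description above easier to check, here is a small sketch that runs the pattern against a few sample inputs and prints the named groups. The class name UrlPatternDemo, the Describe helper and the sample strings are made up for illustration; only the pattern itself comes from this post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPatternDemo
{
    // The pattern from this post, split into pieces that mirror the numbered lines.
    public const string Pattern =
        @"(?<outer>\()?" +
        @"(?<scheme>http(?<secure>s)?://)?" +
        @"(?<url>" +
            @"(?(scheme)(?:www\.)?|www\.)" +
            @"[a-z0-9]" +
            @"(?(outer)" +
                @"[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))" +
                @"|" +
                @"[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+" +
            @")" +
        @")" +
        @"(?<ending>(?(outer)\)))";

    public static readonly Regex UrlRegex = new Regex(
        Pattern, RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

    // Hypothetical helper: formats the named groups of the first match.
    public static string Describe(string input)
    {
        Match m = UrlRegex.Match(input);
        if (!m.Success) return "no match";
        return string.Format(
            "outer=[{0}] secure=[{1}] url=[{2}] ending=[{3}]",
            m.Groups["outer"].Value,
            m.Groups["secure"].Value,
            m.Groups["url"].Value,
            m.Groups["ending"].Value);
    }

    public static void Main()
    {
        // Parenthesized link: the closing parenthesis is kept out of the URL.
        Console.WriteLine(Describe("see (http://example.com/page) now"));
        // outer=[(] secure=[] url=[example.com/page] ending=[)]

        // Scheme present, so www. is optional but still accepted.
        Console.WriteLine(Describe("https://www.example.com/a?b=1"));
        // outer=[] secure=[s] url=[www.example.com/a?b=1] ending=[]

        // Neither http nor www, so nothing is detected.
        Console.WriteLine(Describe("plain example.com text"));
        // no match
    }
}
```

Note that the "url" group never includes the scheme, which is why the replacement string has to put http and the optional s back in front of it.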
    value = Regex.Replace(
        value,
        @"(?<outer>\()?(?<scheme>http(?<secure>s)?://)?(?<url>(?(scheme)(?:www\.)?|www\.)[a-z0-9](?(outer)[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))|[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+))(?<ending>(?(outer)\)))",
        "${outer}<a href=\"http${secure}://${url}\">http${secure}://${url}</a>${ending}",
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
As you can see, I use the named capture groups to replace the link with an anchor tag:
    "${outer}<a href=\"http${secure}://${url}\">http${secure}://${url}</a>${ending}"
I could also omit the http(s) part from the visible text of the anchor to make the links friendlier, but for now I decided not to.
Question
I would like my links to be shortened as well. So when a user pastes a very long link (for example, a link copied from Google Maps, which usually generates long links), I would like to shorten the visible part of the anchor tag. The link itself would still work, but the visible part of the anchor tag would be cut down to a certain number of characters. I could also append an ellipsis at the end, if at all possible (and make things even more perfect).
Does the Regex.Replace() method support some kind of replacement notation for this, so that I can still use a single call? Something similar to what the string.Format() method does when you want to format values in a string (decimal numbers, dates, etc.).
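One possible direction, offered as a sketch and not a confirmed answer: as far as I know, substitution strings only support plain references such as ${name} and have no format specifiers, but Regex.Replace() has an overload that takes a MatchEvaluator delegate, which would still be a single call. The Shorten helper, the UrlShortener class and the limit of 30 characters are hypothetical choices for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlShortener
{
    // The pattern from this post, unchanged.
    const string Pattern =
        @"(?<outer>\()?(?<scheme>http(?<secure>s)?://)?(?<url>(?(scheme)(?:www\.)?|www\.)[a-z0-9](?(outer)[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))|[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+))(?<ending>(?(outer)\)))";

    // Hypothetical limit for the visible part of the link.
    const int MaxVisibleLength = 30;

    // Hypothetical helper: cuts the text and appends an ellipsis when it is too long.
    static string Shorten(string text, int maxLength)
    {
        if (text.Length <= maxLength) return text;
        return text.Substring(0, maxLength) + "...";
    }

    public static string Linkify(string value)
    {
        // The lambda is converted to a MatchEvaluator and runs once per match.
        return Regex.Replace(
            value,
            Pattern,
            m =>
            {
                string full = "http" + m.Groups["secure"].Value + "://" + m.Groups["url"].Value;
                return m.Groups["outer"].Value
                    + "<a href=\"" + full + "\">" + Shorten(full, MaxVisibleLength) + "</a>"
                    + m.Groups["ending"].Value;
            },
            RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Linkify("short: www.example.com"));
        Console.WriteLine(Linkify("long: www.example.com/a/very/long/path/that/keeps/going"));
    }
}
```

In this sketch the href always keeps the full URL and only the visible text is cut. Like the original replacement string, it does not HTML-encode the URL.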