Extended Regular Expression: Automatically Detect and Replace URLs with Anchor Tags

Question

I wrote a regular expression that automatically detects URLs in free text entered by users. This is not as easy as it might seem at first glance. Jeff Atwood writes about this in his post.

His regular expression works, but requires extra code after detection.

I managed to write a regex that does everything in one go. Here's what it looks like (I split it into separate lines to make it clearer what it does):

  1  (?<outer>\()?
  2  (?<scheme>http(?<secure>s)?://)?
  3  (?<url>
  4      (?(scheme)
  5          (?:www\.)?
  6          |
  7          www\.
  8      )
  9      [a-z0-9]
 10      (?(outer)
 11          [-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))
 12          |
 13          [-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+
 14      )
 15  )
 16  (?<ending>(?(outer)\)))

As you can see, I use named capture groups (used later in Regex.Replace()), and I also included some local characters (čšžćđ), which allow the regex to parse localized URLs as well. You can easily omit them if you want.

Anyway. Here is what it does (referring to line numbers):

1 - detects whether the URL starts with an opening parenthesis (i.e. is enclosed in parentheses) and stores it in the "outer" named capture group
2 - checks whether it starts with a URL scheme, and also detects whether the scheme is the secure (SSL) one
3 - starts parsing the URL itself (stored in the "url" named capture group)
4-8 - an if expression that says: if "scheme" was present, the www. part is optional; otherwise www. is mandatory for the string to count as a link (so this regular expression detects all strings that start with either http or www)
9 - the first character after http:// or www. must be either a letter or a digit (this could be extended to cover even more links, but I decided against it because I can't think of a link that starts with some obscure character)
10-14 - an if expression that says: if "outer" (an opening parenthesis) was present, capture everything up to the last closing parenthesis; otherwise capture everything
15 - closes the named capture group for the URL
16 - if there was an opening parenthesis, captures the closing parenthesis as well and stores it in the "ending" named capture group

The first and last lines used to contain \s* as well, so that the user could write an opening parenthesis and put a space inside it before pasting the link.
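To check which named groups the pattern fills for a given input, a minimal C# sketch along these lines can help. The class name UrlPatternDemo and the Describe helper are made up for illustration; the pattern is the one above, joined into a single line (without \s*).

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPatternDemo
{
    // The pattern from the question, joined into one line.
    private const string Pattern =
        @"(?<outer>\()?(?<scheme>http(?<secure>s)?://)?" +
        @"(?<url>(?(scheme)(?:www\.)?|www\.)[a-z0-9]" +
        @"(?(outer)[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))|[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+))" +
        @"(?<ending>(?(outer)\)))";

    private static readonly Regex UrlRegex = new Regex(
        Pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Hypothetical helper: reports the named groups of the first match.
    // Groups that did not take part in the match have an empty Value.
    public static string Describe(string input)
    {
        Match m = UrlRegex.Match(input);
        if (!m.Success)
        {
            return "no match";
        }
        return string.Format(
            "outer='{0}' scheme='{1}' secure='{2}' url='{3}' ending='{4}'",
            m.Groups["outer"].Value,
            m.Groups["scheme"].Value,
            m.Groups["secure"].Value,
            m.Groups["url"].Value,
            m.Groups["ending"].Value);
    }
}
```

For example, UrlPatternDemo.Describe("(http://example.com/a)") reports the parentheses in outer and ending and leaves them out of url, while a bare "www.example.com" leaves outer, scheme, secure and ending empty.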

Anyway. My code that replaces the links with actual HTML anchor elements looks like this:

 value = Regex.Replace(
     value,
     @"(?<outer>\()?(?<scheme>http(?<secure>s)?://)?(?<url>(?(scheme)(?:www\.)?|www\.)[a-z0-9](?(outer)[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))|[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+))(?<ending>(?(outer)\)))",
     "${outer}<a href=\"http${secure}://${url}\">http${secure}://${url}</a>${ending}",
     RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

As you can see, I use the named capture groups to replace the link with an anchor tag:

 "${outer}<a href=\"http${secure}://${url}\">http${secure}://${url}</a>${ending}"

I could also omit the http(s) part from the anchor's display text to make the links friendlier, but for now I decided not to.
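As a rough sketch of that friendlier variant, only the replacement string needs to change: the href keeps the scheme, while the visible text uses ${url} alone. The class name FriendlyLinks and its Linkify method are hypothetical names used for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FriendlyLinks
{
    // Same pattern as in the question, joined into one line.
    private const string Pattern =
        @"(?<outer>\()?(?<scheme>http(?<secure>s)?://)?" +
        @"(?<url>(?(scheme)(?:www\.)?|www\.)[a-z0-9]" +
        @"(?(outer)[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))|[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+))" +
        @"(?<ending>(?(outer)\)))";

    // The href still carries http(s)://, the display text does not.
    private const string Replacement =
        "${outer}<a href=\"http${secure}://${url}\">${url}</a>${ending}";

    public static string Linkify(string value)
    {
        return Regex.Replace(
            value,
            Pattern,
            Replacement,
            RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
    }
}
```

With this replacement, "www.test.org" would be shown as www.test.org while still linking to http://www.test.org.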

Question

I would also like my links to be shortened when they are replaced. When a user pastes a very long link (for example from Google Maps, which usually generates long links), I would like to shorten the visible part of the anchor tag. The link itself would still work, but the visible text would be cut down to a certain number of characters. I could also append an ellipsis at the end where possible (to make things even more perfect).

Does the Regex.Replace() method support some substitution notation for this, so that I can still use a single call? Something similar to what string.Format() does when you want to format values in a string (decimal numbers, dates, etc.).

+7

c# regex replace

Robert Koritnik May 05 '10 at 6:28

2 answers

You will need to use the Regex.Replace overload that takes a MatchEvaluator, a delegate that builds the replacement text for each match.

See here: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchevaluator.aspx

Technically, this is possible using regular expressions alone, by doing what Kobi suggests. I'm not sure I would want to ask anyone (including myself in a few months) to maintain that regular expression, though.
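A minimal sketch of the MatchEvaluator approach, assuming the pattern from the question. The class name UrlShortener, the Linkify method and the maxVisible parameter are made-up names for illustration; the sketch cuts the visible text after maxVisible characters and appends "..." only when something was actually cut. Like the original replacement string, it does not HTML-encode the URL.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlShortener
{
    // Pattern from the question, joined into one line.
    private static readonly Regex UrlRegex = new Regex(
        @"(?<outer>\()?(?<scheme>http(?<secure>s)?://)?" +
        @"(?<url>(?(scheme)(?:www\.)?|www\.)[a-z0-9]" +
        @"(?(outer)[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+(?=\))|[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+))" +
        @"(?<ending>(?(outer)\)))",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // maxVisible is a hypothetical parameter: the maximum number of
    // characters of the full URL shown as the link text.
    public static string Linkify(string text, int maxVisible)
    {
        // The lambda is converted to a MatchEvaluator delegate and is
        // called once per match to build that match's replacement.
        return UrlRegex.Replace(text, m =>
        {
            string full = "http" + m.Groups["secure"].Value + "://" + m.Groups["url"].Value;
            string visible = full.Length > maxVisible
                ? full.Substring(0, maxVisible) + "..."
                : full;
            return m.Groups["outer"].Value
                + "<a href=\"" + full + "\">" + visible + "</a>"
                + m.Groups["ending"].Value;
        });
    }
}
```

Because the shortening happens in ordinary code, the length limit and the ellipsis rule can be changed without touching the regular expression.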

+1

Thorarin May 05 '10 at 6:35

Kobi · Accepted Answer · 2010-05-05T06:37:20+0000

You can split ${url} into two capture groups: a head (urlhead, still named url in the example below) with the number of characters you want to display, and urltail with the rest. Here is an example with 10 characters. I simplified it a little by removing the conditional; the final (?<ending>(?(outer)(?=\)))) should take care of that, because the regex backtracks and leaves out the last ) when necessary:

 (?<outer>(?<=\())?
 (?<scheme>http(?<secure>s)?://)?
 (?<url>
     (?(scheme)
         (?:www\.)?
         |
         www\.
     )
     [a-z0-9]
     [-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]{1,10}
 )
 (?<urltail>[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+)
 (?<ending>(?(outer)(?=\))))

Note that I also changed outer and ending to lookarounds, so they are not captured and do not have to be put back by the replacement. The replacement string in this case looks like this:

 <a href=\"http${secure}://${url}${urltail}\">http${secure}://${url}</a>
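Put together in C#, this could look roughly like the sketch below. The class name HeadTailLinks and its Linkify method are hypothetical; the pattern is the one above joined into a single line, and only the case without surrounding parentheses is exercised here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadTailLinks
{
    // The head/tail pattern from this answer, joined into one line.
    private const string Pattern =
        @"(?<outer>(?<=\())?(?<scheme>http(?<secure>s)?://)?" +
        @"(?<url>(?(scheme)(?:www\.)?|www\.)[a-z0-9]" +
        @"[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]{1,10})" +
        @"(?<urltail>[-a-z0-9/+&@#/%?=~_()|!:,.;čšžćđ]+)" +
        @"(?<ending>(?(outer)(?=\))))";

    // href gets head + tail, the visible text gets the head only.
    private const string Replacement =
        "<a href=\"http${secure}://${url}${urltail}\">http${secure}://${url}</a>";

    public static string Linkify(string value)
    {
        return Regex.Replace(
            value,
            Pattern,
            Replacement,
            RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
    }
}
```

One consequence of urltail using + is that it always takes at least one character, so even a short link has its visible text cut by at least one character. Whether that is acceptable, or whether short links should stay intact, is a design choice this sketch does not settle.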
