Using find / replace and regex to replace keywords with URLs in a string

I have a list of keywords (one word or a couple of words) that I want to replace with some URLs.

For example:

  • London will be replaced by <a href="http://www.mysite/london-events/london">London</a>

  • Football events in London with <a href="http://www.mysite/footbal-events/london"> Football events in London</a>

  • London football events with <a href="http://www.mysite/footbal-events/london"> London football events</a>

  • Football events London with <a href="http://www.mysite/footbal-events/london"> Football events London</a>

  • Party sites in London with <a href="http://www.mysite/party-sites/london"> party sites in London</a>

  • London party sites with <a href="http://www.mysite/party-sites/london"> London party sites</a>

I put the key/value pairs above into a dictionary, with the keyword as the key and the link as the value, and ran the replacement with the code shown further down.

The content is as follows:

London is a great city and football events in London, but party sites in London are good too. London football events are great along with London sites. Enjoy london!

Code to replace key / values:

    private static string ParsedContents(some arguments list here...)
    {
        Dictionary<string, string> keyWords = GetKeywordsAndEntityWithURL(some arguments list here...);
        StringBuilder parsedContents = new StringBuilder(contents);

        foreach (var keyWord in keyWords)
        {
            string replacedString = Regex.Replace(
                parsedContents.ToString(),
                "\\b" + keyWord.Key + "\\b",
                keyWord.Value,
                RegexOptions.IgnoreCase);

            parsedContents.Remove(0, parsedContents.Length);
            parsedContents.Append(replacedString);
        }

        // Return the parsed contents as a string.
        return parsedContents.ToString();
    }

When I run my code, only "London" is replaced by '<a href="http://www.mysite/london-events/london">London</a>' and all the other phrases remain unchanged. If I remove "London" from the keywords, everything else is replaced correctly.

Could you please help me match the whole phrases?

Note: the content and URLs above are made up.

thanks

+4
7 answers

What if you first replace all the longer phrases with their links, but write a stand-in word such as "Lxondon" wherever "London" would appear in those links? Once every phrase containing London has been replaced by its link, you can replace the remaining occurrences of London with the London link. Finally, you replace "Lxondon" with "London" throughout the text.

This is not a pretty way to do it, but I think it will work.

+1

Since some of the phrases you want to link contain other phrases you want to link, and the links themselves also contain these phrases, you need to do this in two phases if you want to avoid complex regular expressions:

Phase 1: Replace each phrase with a unique placeholder that will not match anything else:

  • You will need to replace longer phrases before shorter phrases to make sure that you are not replacing only part of a phrase (for example, the "London" inside "London football events").
  • You can store the phrases and URLs to be linked in a SortedDictionary and provide an IComparer<string> that sorts the strings by length and then alphabetically. Note that strings of the same length must still compare as different, or you cannot store them both in the dictionary.
  • As each phrase is replaced, you generate the link that will eventually replace it and build a dictionary that maps placeholders to links.
  • If you use string.Replace to replace the phrases, you will need to treat phrases that differ only in case as different phrases, i.e. "Party sites in London" is different from "party sites in London", and each must have its own placeholder.

Phase 2: Replace all placeholders with the generated links.

Here is a class that does this:

    using System;
    using System.Collections.Generic;

    class TextLinker : IComparer<string>
    {
        private SortedDictionary<string, string> phrasesToUrls;

        public TextLinker()
        {
            // Pass self as IComparer to sort dictionary using Compare method.
            phrasesToUrls = new SortedDictionary<string, string>(this);
        }

        public void AddLink(string phrase, string URL)
        {
            phrasesToUrls.Add(phrase, URL);
        }

        public string Link(string text)
        {
            // Phase 1: replace phrases to be linked with unique placeholders.
            Dictionary<string, string> placeholdersToLinks =
                new Dictionary<string, string>();

            foreach (KeyValuePair<string, string> pair in phrasesToUrls)
            {
                // Replace phrases with placeholders.
                string placeholder = Guid.NewGuid().ToString();
                text = text.Replace(pair.Key, placeholder);

                // Create dictionary of links by placeholder.
                string link = string.Format(
                    "<a href=\"{0}\">{1}</a>", pair.Value, pair.Key);
                placeholdersToLinks.Add(placeholder, link);
            }

            // Phase 2: replace unique placeholders with links.
            foreach (KeyValuePair<string, string> pair in placeholdersToLinks)
            {
                text = text.Replace(pair.Key, pair.Value);
            }

            return text;
        }

        public int Compare(string x, string y)
        {
            // Longer phrases sort first.
            if (x.Length > y.Length) return -1;
            if (x.Length < y.Length) return +1;

            // Equal length strings still need to be differentiated, otherwise
            // they will be treated as the same key by the dictionary.
            return x.CompareTo(y);
        }
    }

And here is an example of its use:

    string input =
        "London is a great city and have football events " +
        "in London but party sites in London are also good. London " +
        "football events are great along with London party sites. " +
        "Enjoy London!";

    TextLinker linker = new TextLinker();
    linker.AddLink(
        "Football events in London", "http://www.mysite/footbal-events/london");
    linker.AddLink(
        "football events in London", "http://www.mysite/footbal-events/london");
    linker.AddLink(
        "London football events", "http://www.mysite/footbal-events/london");
    linker.AddLink(
        "London", "http://www.mysite/london-events/london");
    linker.AddLink(
        "Party sites in London", "http://www.mysite/party-sites/london");
    linker.AddLink(
        "party sites in London", "http://www.mysite/party-sites/london");
    linker.AddLink(
        "London party sites", "http://www.mysite/party-sites/london");

    string output = linker.Link(input);

You can also overload the AddLink method to automatically generate alternative capitalizations of each phrase.
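As a rough sketch of what such an overload could build on, the helper below returns a phrase together with its first-letter upper-case and lower-case variants. The names PhraseVariants and FirstLetterVariants are made up for this illustration and are not part of the class above; it only varies the first character, which is an assumption about which capitalizations matter.

```csharp
using System;
using System.Collections.Generic;

static class PhraseVariants
{
    // Hypothetical helper: returns the phrase as given plus copies whose
    // first character is upper-cased and lower-cased, without duplicates.
    public static ISet<string> FirstLetterVariants(string phrase)
    {
        var variants = new HashSet<string>(StringComparer.Ordinal) { phrase };
        if (phrase.Length > 0)
        {
            string rest = phrase.Substring(1);
            variants.Add(char.ToUpperInvariant(phrase[0]) + rest);
            variants.Add(char.ToLowerInvariant(phrase[0]) + rest);
        }
        return variants;
    }
}
```

An AddLink overload could then loop over the returned set and register each variant with the same URL.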

+2

If you replace London first, your other search phrases will no longer be present in the text.

Football events in London

is now

Football events in <a href="http://www.mysite/london-events/london">London</a>

0

To echo the other answers, you need to do the longest, most complex string replacements first, e.g.

Football events in London

London

If you do London first, as in your example, and replace it with Kent, every occurrence of "Football events in London" becomes "Football events in Kent" and no longer matches its regular expression.

PS: You might want to turn this into an extension method on string if you use it often.

0

What if you use recursion? That is, every time a match is found, you replace it with the text from the dictionary and repeat the process, but only on the parts of the text that have not been matched yet.
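A minimal sketch of that idea, under a few assumptions: phrases are tried longest first, matching ignores case, and no word-boundary check is done (so "London" would also match inside "Londoners"). The class name RecursiveLinker and the parameter phraseToUrl are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RecursiveLinker
{
    // Links the first occurrence of the longest phrase that appears in the
    // text, then recurses into the unmatched text on either side of it.
    // The generated link itself is never scanned again.
    public static string Link(string text, IDictionary<string, string> phraseToUrl)
    {
        var phrases = phraseToUrl.Keys
            .Where(k => k.Length > 0)            // empty keys would recurse forever
            .OrderByDescending(k => k.Length);   // longest phrase first

        foreach (string phrase in phrases)
        {
            int index = text.IndexOf(phrase, StringComparison.OrdinalIgnoreCase);
            if (index < 0) continue;

            string before = text.Substring(0, index);
            string matched = text.Substring(index, phrase.Length);
            string after = text.Substring(index + phrase.Length);

            return Link(before, phraseToUrl)
                + "<a href=\"" + phraseToUrl[phrase] + "\">" + matched + "</a>"
                + Link(after, phraseToUrl);
        }

        return text; // no phrase occurs in this fragment
    }
}
```

Because each recursive call only sees text outside the match, a shorter phrase can never be replaced inside a link that was already generated.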

0

As others have pointed out:

  • If you replace “London” before “Football events in London”, your search for “Football events in London” will not match “Football events in <a href="http://etc.">London</a>”.
  • If you replace “Football events in London” before “London”, you will then replace London inside the existing link for football events in London, which gives you a link within a link.
  • A Dictionary is not ordered, so you cannot guarantee that you get the order you want if you simply foreach over it.
  • If your search phrases are ALSO contained in your URLs, your code will find and replace them there too. This matters all the more because you made your regular expression case insensitive.
  • A leading space inside the text of your tags? That is a sign that you are doing something wrong elsewhere and compensating for it with a hack.

The moral of the story: plain find and replace (even with Regex) is not going to cut it, I'm afraid.

There are probably smarter ways to do this, but off the top of my head, here is some pseudo-code to look into:

    while (!input.EOS)
        matched = false
        for (key in keys, longest to shortest)
            if (input.indexOf(key) == 0)
                output += link for key
                input = remainder of input after key
                matched = true
                break
        if (!matched)
            move first word from input to output

You will have to play around with it a little, especially because of whitespace issues (how and where will you match spaces and non-word characters?). Here is another tip to get you started: ^\s*(.+?)\s*\b
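One way the pseudo-code could look in C#, as a sketch rather than a finished solution. It advances one character at a time instead of one word at a time, and it treats any position not preceded or followed by a letter or digit as a word boundary; both choices are assumptions made for this example, as are the names GreedyLinker and MatchesAt.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class GreedyLinker
{
    public static string Link(string input, IDictionary<string, string> phraseToUrl)
    {
        // Longest keys first, so a long phrase wins over a phrase it contains.
        List<string> keys = phraseToUrl.Keys
            .Where(k => k.Length > 0)
            .OrderByDescending(k => k.Length)
            .ToList();

        var output = new StringBuilder(input.Length);
        int pos = 0;
        while (pos < input.Length)
        {
            // Only try to match where a word can start.
            bool atWordStart = pos == 0 || !char.IsLetterOrDigit(input[pos - 1]);
            string hit = atWordStart
                ? keys.FirstOrDefault(k => MatchesAt(input, pos, k))
                : null;

            if (hit != null)
            {
                // Emit the link, keeping the capitalization found in the input.
                output.Append("<a href=\"").Append(phraseToUrl[hit]).Append("\">")
                      .Append(input, pos, hit.Length).Append("</a>");
                pos += hit.Length;
            }
            else
            {
                // Nothing matched here: copy one character and move on.
                output.Append(input[pos]);
                pos++;
            }
        }
        return output.ToString();
    }

    // True if key occurs at pos (ignoring case) and ends on a word boundary.
    private static bool MatchesAt(string input, int pos, string key)
    {
        int end = pos + key.Length;
        if (end > input.Length) return false;
        if (string.Compare(input, pos, key, 0, key.Length,
                StringComparison.OrdinalIgnoreCase) != 0) return false;
        return end == input.Length || !char.IsLetterOrDigit(input[end]);
    }
}
```

Since the input is consumed from left to right and generated links go straight to the output, nothing that has already been linked is ever scanned again.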

0

One thing you can do is the following:

Combine the keys (from longest to shortest) into one regular expression like this (assuming the dictionary is an IDictionary<string, string>):

    var pattern = string.Join(
        "|",
        dictionary.Keys.OrderByDescending(k => k.Length).Select(Regex.Escape).ToArray()
    );
    var regex = new Regex("(" + pattern + ")", RegexOptions.ExplicitCapture);

Note the use of Regex.Escape in the Select call: we do not want special regular expression characters in a key to mess things up.

A quick test showed that the .NET regex engine tries the alternatives in the order they appear in the pattern. This means that, with the right ordering, a longer key is matched first, and the regular expression then moves on past it, looking for new matches.

You can then iterate over the matches and build a new string from the old one as you go, instead of scanning the input string multiple times. Together, these two techniques eliminate both problems: premature matches and duplicate matches.

    string input = "..."; // This is your input string.
    int last = 0;
    var output = new StringBuilder(input.Length);

    foreach (Match match in regex.Matches(input))
    {
        // Append the text between the previous match and this one.
        output.Append(input.Substring(last, match.Index - last));
        output.AppendFormat(
            "<a href=\"{1}\">{0}</a>",
            match.Value,
            dictionary[match.Value]
        );
        // Move the index to the end of this match.
        last = match.Index + match.Length;
    }

    // Append whatever follows the last match.
    output.Append(input.Substring(last));

Error checking is not included. Also, the regular expression itself would probably benefit from \b anchors, in the form \b(...)\b. But that is untested, so I will leave it there.
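For what it is worth, here is one way the pieces above might be put together with the \b anchors and case-insensitive matching added. Treat it as a sketch under assumptions: RegexLinker and phraseToUrl are names invented here, keys are assumed to start and end with word characters (otherwise \b will not behave as intended), and no two keys may differ only by case, because they are copied into a case-insensitive dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexLinker
{
    public static string Link(string input, IDictionary<string, string> phraseToUrl)
    {
        if (phraseToUrl.Count == 0) return input;

        // Case-insensitive lookup, so a match such as "london" finds the
        // entry stored under "London". Throws if keys differ only by case.
        var urls = new Dictionary<string, string>(
            phraseToUrl, StringComparer.OrdinalIgnoreCase);

        // Longest keys first: the engine tries alternatives left to right.
        string alternation = string.Join(
            "|", urls.Keys.OrderByDescending(k => k.Length).Select(Regex.Escape));

        var regex = new Regex(
            @"\b(?:" + alternation + @")\b", RegexOptions.IgnoreCase);

        // A single pass over the input: each match is replaced by its link,
        // and the generated link text is never rescanned.
        return regex.Replace(
            input,
            m => "<a href=\"" + urls[m.Value] + "\">" + m.Value + "</a>");
    }
}
```

Regex.Replace with a match evaluator takes care of copying the text between matches and after the last one, so the manual index bookkeeping is not needed in this variant.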

0
