Javascript string word wrap for matches in another string

If I have a specific line input, I want to compare with another line and match the matches of the input line with another line using the maximum possible match. What is the best way to combine matches in a tag? This is a non-trivial question.

Basically, I want to combine the input string with another string using the span tag to display the matched parts of the target found in the input string.

  • Match from start of first line of input (largest possible match)
  • Partial matching of the search word should be highlighted (see "barge", "barge" in the example)
  • Special character breaks must match the entered "fred / dred", it will be two words.
  • The input line will change depending on what the user enters.
  • Match the input line from the beginning as a priority
  • Matches every word where it occurs

if the user enters a line with several words, I want to rotate their matches gradually, starting from the beginning, where they appear in the second line. They may or may not have spaces at the beginning / end of the entered line. I would like most to be wrapped.

Example input lines:

"Brown cats cannot be white cats" "blue pigs " "large, charged/marged barge pigs" 

I would like them to be wrapped like this:

 "<span class='wrapper'>Brown cats cannot be white cats</span>" 

in the destination line where there are matches, even partial ones, but with the greatest possible match.

Example line for wrapping:

 "Hi bill, brown cats cannot be white cats and cows are not blue pigs, blue melons are large but not batteries charged barges with white cats carry coal" 

The final line for each input example:

 "Hi bill, <span class='wrapper'>brown cats cannot be white cats</span> and cows are not blue pigs, blue melons are large but not batteries charged barges with <span class='wrapper'>white cats</span> carry coal" "Hi bill, brown cats cannot be white cats and cows are not <span class='wrapper'>blue pigs</span>, blue melons are large but not batteries charged barges with white cats carry coal" "Hi bill, brown cats cannot be white cats and cows are not blue <span class='wrapper'>pigs</span>, blue melons are large but not batteries <span class='wrapper'>charged</span> <span class='wrapper'>barge</span>s with white cats carry coal" 

Possible matches for: "Brown cats cannot be white cats"

 "Brown cats cannot be white cats" "Brown cats cannot be white" "Brown cats cannot be" "Brown cats cannot" "Brown cats" "Brown" "Brown" "cats" "cannot" "be" "white" "cats" 

If I just wrapped every matching word that I could do:

 function replaceWords(wordsy, text) { var re = '(' + wordsy + ')(?![^<]*(?:<\/script|>))', regExp = new RegExp(re, 'ig'), sTag = "<span class='wrapper'>", eTag = "</span>"; return text.replace(regExp, sTag + '$&' + eTag); }; var matchstring = "Brown cats cannot be white cats"; var wrapstring = "Hi bill, brown cats cannot be white cats and cows are not blue pigs, blue melons are large but not batteries charged barges with white cats carry coal"; var words = myValue.split(" "); var i = words.length; while (i--) { wrapstring = replaceWords(words[i], wrapstring ); }; 

This does not satisfy the โ€œgreatest matchโ€ requirement. I would like to get the maximum possible match for any part of the match string that occurs in wrapstring.

Solutions using pure javascript or jquery or combos are acceptable.

EDIT: Some suggested KMP, here is an example of KMP jsfiddle.net/y5yJY/2 , but this does not match its current form, which meets all the criteria and makes one match.

+4
source share
3 answers

I have an interesting solution that should work like your original specification. This was not stress testing, I'm not sure if you want to process a lot of text or not, and it performs quite a lot of regular expressions. Not necessarily the simplest or easiest solution, but it works as intended.

Features and limitations:

  • It handles the most odd cases in the match string, such as repeating words, very similar or repeating phrases, etc.

  • At the moment, you cannot reliably have [ and ] characters in the source string, as they are used internally. If this is a problem, you should swap them for any other character or combination of characters before matching.

  • For a string matching words N , 2*N + 5 , string replacements are performed using regular expressions of varying complexity.

  • It matches words and phrases case insensitive, ignoring any characters other than words. At the same time, it stores as a result the words of the mixed word and characters other than the word.

How it works:

  • First, he searches for each word separately and attaches their index in the line of correspondence to them in square brackets: word[2] . If a word appears several times, it adds all the indexes: word[3][2][1] .

  • He then finds and marks words that are not border boundaries by looking at the indices of the surrounding words. At a separate stage, he removes the indices from these words. At the end, one[1] two[2] three[3] will become one[1] []two three[3] .

  • All that remains is to make some assumptions in a certain order and wrap the words / phrases. Look at the code to see all the replacements that have been made.

It is important that after the first step we never match the words directly, from now on the word is simply called any number of word characters before [index] or any number of word characters after [] . This ensures that we return the correct words / phrases.

Check out this demo . I added a hover effect so you can see which words are grouped and combined.

And here is the crazy code, enjoy!

 var matchstring = 'Brown cats cannot be white cats'; var wrapstring = 'Hi bill, brown cats cannot be white cats and cows are not blue pigs, blue melons are large but not batteries charged barges with white cats carry coal, and the word "cannot" should match '; // Pre-process matchstring to make it a flat list of words // separated by single spaces. matchstring = matchstring.replace(/\W+/g,' '); var wrapStart = '<span class="wrapped">'; var wrapEnd = '</span>'; var matcharray = matchstring.split(' '); var i, reg; // Mark all matched words with indices // one -> one[1] for (i = 0; i < matcharray.length; i++) { reg = new RegExp('\\b' + matcharray[i] + '\\b', 'ig'); wrapstring = wrapstring.replace(reg, '$&[' + i + ']'); } // Mark all inner words // one[1] two[2] three[3] -> one[1] []two[2] three[3] for (i = 1; i < matcharray.length; i++) { reg = new RegExp('\\b(\\w+)([\\]\\d\\[]*\\[' + (i - 1) + '\\][\\]\\d\\[]*)(\\W+)(\\w+)([\\]\\d\\[]*\\[' + i + '\\][\\]\\d\\[]*)(?=\\W+\\w+[\\[\\d\\]]*\\[' + (i + 1) + '\\])', 'ig'); wrapstring = wrapstring.replace(reg, '$1$2$3[]$4$5'); } // Remove indices from inner words // one[1] []two[2] three[3] -> one[1] []two three[3] wrapstring = wrapstring.replace(/\[\](\w+)[\[\d\]]*/g, '[]$1'); // Start tags // one[1] []two three[3] -> {one []two three[3] wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\](\W+)\[\]/g, wrapStart + '$1$2[]'); // End tags // {one []two three[3] -> {one []two three} wrapstring = wrapstring.replace(/\[\](\w+\W+\w+)\[[\[\d\]]+\]/g, '$1' + wrapEnd); // Wrap double words // one[1] two[2] -> {one two} wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\](\W+\w+)\[[\[\d\]]*\]/g, wrapStart + '$1$2' + wrapEnd); // Orphan words // unmatched matched[1] unmatched -> unmatched {matched} unmatched wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\]/g, wrapStart + '$1' + wrapEnd); // Remove left-over tags // []word -> word wrapstring = wrapstring.replace(/\[\]/g, ''); alert(wrapstring); 

Partial Word Matching

As mentioned earlier, after the first step, words are processed only by their added indexes. This means that if we want to make some smart matches instead of whole words, we just need to change the regular expression in the first for loop. This is the part of the code that we will play with in this section:

 reg = new RegExp('\\b' + matcharray[i] + '\\b', 'ig'); 

\b in regular expression means matching word boundaries, that is, the beginning or end of a sequence of word characters. Therefore, the above expression \bword\b gives only whole words, since word should be surrounded by word boundaries. But that should not be so.

If we want to combine all the words in the text starting with our keyword, we can change the line above as follows:

 reg = new RegExp('\\b' + matcharray[i] + '\\w*\\b', 'ig'); 

This results in the regular expression \bword\w*\b . It matches all word character characters, followed by 0 or more additional word characters ( \w* ) surrounded by word boundaries. Note that backslashes must be escaped inside javascript strings ( \\ means single \ ).

Depending on the requirements, we can easily create additional combinations of regular expressions:

  • \bword\w*\b matches words starting with a keyword.
  • \b\w*word\b matches words ending with a keyword.
  • \b\w*word\w*\b matches words containing a keyword.
  • \b(\w*word|word\w*)\b matches words that end or begin with a keyword.

You can even say that you only want to combine minor word changes. For example, \b\w{0,2}word\w{0,2}\b will match only a word if it has no more than two letter prefix and / or suffix. Thus, danger will match endanger , cat will match cats , but can will not match cannot , as it will be 3 extra letters.

Combining complex multiple forms and irregular verbs is not easy, you could create a massive dictionary of irregular words on your server and pre-process the word, so if the user enters foot , using the regular expression \b(foot|feet)\b will match both forms. A simpler solution would be to take care of regular words only. For most words, matching \bword(s|es|)\b will be enough to catch plural numbers; it matches word , words and wordes . For words like fly regular expression \bfl(y|ies)\b will do the job. For words of type index regular expression \bind(ex|exes|ices)\b will correspond to the most common forms.

Since I'm not really a language expert, this time I will just leave it.

Input wildcards

Similar to the above, it is very easy to add wildcard support to the input string. Say we want the user to enter ? to indicate any character. If the input is ?red , do we just need to replace ? to \w in our regular expression. For example, \b\wred\b will match fred and dred .

As above, you can also use multiple wildcards, replace them with \w+ for one or more characters or \w* for zero or more characters. \bf\w+d\b will match fed and feed , \w* will also match fd .

+2
source

How about this: (describes only an algorithm not written in code)

Imagine that you have two lines written on two sheets of paper. Place two sheets so that one on top of the other. Move the top sheet to the left so that its last letter is on top of the first letter of the bottom sheet. Now the coincidence of two matching letters? If so, you have a match of length 1. Write this as the longest match. Then move the top sheet one character to the right. Now the two letters overlap. Do they match? If so, you have a maximum match of size 2. Continue to move the top sheet 1 character to the right and each time find the largest part of the matching, overlapping characters. Always track your biggest match. Continue moving until your top sheet is too far to the right so that its first character matches the other character of the last sheet.

I don't know how easy it is to implement in javascript, but as an algorithm, I think it is sound.

PS- for a bit where you need to find "the largest section of matching, overlapping characters", you can do something like this:

 /* Note: str1 and str2 are the two overlapping portions of the strings */ var largestMatch = 0; var currMatch = 0; for (var i = 0; i < str1.length; i++) { if (str1[i] == str2[i]) currMatch++; else currMatch = 0; largestMatch = Math.max(largestMatch, currMatch); } // largestMatch is the size of the largest section of matched characters 
+1
source

Here is what I did to solve this problem: (looking for improvements, since this is not perfect) (this is wrapped in a jQuery ready-made document) as here: http://jsfiddle.net/KvM47/

 function findStringLimit(searchChar, searchCharIndex, searchedString) { return searchedString.substring(0, searchedString.lastIndexOf(searchChar, searchCharIndex)); }; function replaceWords(wordsy, text) { var re = '(' + wordsy + ')(?![^<]*(?:<\/script|>))', regExp = new RegExp(re, 'ig'), sTag = "<span class='wrappedWord'>", eTag = "</span>"; return text.replace(regExp, sTag + '$&' + eTag); }; var longstring = $('#mystring'); var htmlString =longstring .html(); // instance html myValue = "Brown cats cannot be white cats"; myValue = myValue.replace(/^\s+|\s+$/g, "");//trim whitespace at each end var words = myValue.split(" "); var allPhrases = []; allPhrases.push(myValue); var i = words.length; while (i--) { allPhrases.push(findStringLimit(" ", allPhrases[(words.length - i) - 1].length, allPhrases[(words.length - i) - 1])); }; var i = allPhrases.length; while (i--) { if (allPhrases[i] != "") words = words.concat(allPhrases[i]); }; var i = words.length; while (i--) { htmlString = replaceWords(words[i], htmlString); }; longstring.html(htmlString); 

What to improve:

  • use other characters to separate words, not just spaces.
  • make it more efficient
  • The best detection of "pieces" of strings (two or more words together) in the strings "search" and "match" and their processing.
+1
source

All Articles