I have an interesting solution that should work like your original specification. This was not stress testing, I'm not sure if you want to process a lot of text or not, and it performs quite a lot of regular expressions. Not necessarily the simplest or easiest solution, but it works as intended.
Features and limitations:
It handles the most odd cases in the match string, such as repeating words, very similar or repeating phrases, etc.
At the moment, you cannot reliably have [ and ] characters in the source string, as they are used internally. If this is a problem, you should swap them for any other character or combination of characters before matching.
For a string matching words N , 2*N + 5 , string replacements are performed using regular expressions of varying complexity.
It matches words and phrases case insensitive, ignoring any characters other than words. At the same time, it stores as a result the words of the mixed word and characters other than the word.
How it works:
First, he searches for each word separately and attaches their index in the line of correspondence to them in square brackets: word[2] . If a word appears several times, it adds all the indexes: word[3][2][1] .
He then finds and marks words that are not border boundaries by looking at the indices of the surrounding words. At a separate stage, he removes the indices from these words. At the end, one[1] two[2] three[3] will become one[1] []two three[3] .
All that remains is to make some assumptions in a certain order and wrap the words / phrases. Look at the code to see all the replacements that have been made.
It is important that after the first step we never match the words directly, from now on the word is simply called any number of word characters before [index] or any number of word characters after [] . This ensures that we return the correct words / phrases.
Check out this demo . I added a hover effect so you can see which words are grouped and combined.
And here is the crazy code, enjoy!
var matchstring = 'Brown cats cannot be white cats'; var wrapstring = 'Hi bill, brown cats cannot be white cats and cows are not blue pigs, blue melons are large but not batteries charged barges with white cats carry coal, and the word "cannot" should match '; // Pre-process matchstring to make it a flat list of words // separated by single spaces. matchstring = matchstring.replace(/\W+/g,' '); var wrapStart = '<span class="wrapped">'; var wrapEnd = '</span>'; var matcharray = matchstring.split(' '); var i, reg; // Mark all matched words with indices // one -> one[1] for (i = 0; i < matcharray.length; i++) { reg = new RegExp('\\b' + matcharray[i] + '\\b', 'ig'); wrapstring = wrapstring.replace(reg, '$&[' + i + ']'); } // Mark all inner words // one[1] two[2] three[3] -> one[1] []two[2] three[3] for (i = 1; i < matcharray.length; i++) { reg = new RegExp('\\b(\\w+)([\\]\\d\\[]*\\[' + (i - 1) + '\\][\\]\\d\\[]*)(\\W+)(\\w+)([\\]\\d\\[]*\\[' + i + '\\][\\]\\d\\[]*)(?=\\W+\\w+[\\[\\d\\]]*\\[' + (i + 1) + '\\])', 'ig'); wrapstring = wrapstring.replace(reg, '$1$2$3[]$4$5'); } // Remove indices from inner words // one[1] []two[2] three[3] -> one[1] []two three[3] wrapstring = wrapstring.replace(/\[\](\w+)[\[\d\]]*/g, '[]$1'); // Start tags // one[1] []two three[3] -> {one []two three[3] wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\](\W+)\[\]/g, wrapStart + '$1$2[]'); // End tags // {one []two three[3] -> {one []two three} wrapstring = wrapstring.replace(/\[\](\w+\W+\w+)\[[\[\d\]]+\]/g, '$1' + wrapEnd); // Wrap double words // one[1] two[2] -> {one two} wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\](\W+\w+)\[[\[\d\]]*\]/g, wrapStart + '$1$2' + wrapEnd); // Orphan words // unmatched matched[1] unmatched -> unmatched {matched} unmatched wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\]/g, wrapStart + '$1' + wrapEnd); // Remove left-over tags // []word -> word wrapstring = wrapstring.replace(/\[\]/g, ''); alert(wrapstring);
Partial Word Matching
As mentioned earlier, after the first step, words are processed only by their added indexes. This means that if we want to make some smart matches instead of whole words, we just need to change the regular expression in the first for loop. This is the part of the code that we will play with in this section:
reg = new RegExp('\\b' + matcharray[i] + '\\b', 'ig');
\b in regular expression means matching word boundaries, that is, the beginning or end of a sequence of word characters. Therefore, the above expression \bword\b gives only whole words, since word should be surrounded by word boundaries. But that should not be so.
If we want to combine all the words in the text starting with our keyword, we can change the line above as follows:
reg = new RegExp('\\b' + matcharray[i] + '\\w*\\b', 'ig');
This results in the regular expression \bword\w*\b . It matches all word character characters, followed by 0 or more additional word characters ( \w* ) surrounded by word boundaries. Note that backslashes must be escaped inside javascript strings ( \\ means single \ ).
Depending on the requirements, we can easily create additional combinations of regular expressions:
\bword\w*\b matches words starting with a keyword.\b\w*word\b matches words ending with a keyword.\b\w*word\w*\b matches words containing a keyword.\b(\w*word|word\w*)\b matches words that end or begin with a keyword.
You can even say that you only want to combine minor word changes. For example, \b\w{0,2}word\w{0,2}\b will match only a word if it has no more than two letter prefix and / or suffix. Thus, danger will match endanger , cat will match cats , but can will not match cannot , as it will be 3 extra letters.
Combining complex multiple forms and irregular verbs is not easy, you could create a massive dictionary of irregular words on your server and pre-process the word, so if the user enters foot , using the regular expression \b(foot|feet)\b will match both forms. A simpler solution would be to take care of regular words only. For most words, matching \bword(s|es|)\b will be enough to catch plural numbers; it matches word , words and wordes . For words like fly regular expression \bfl(y|ies)\b will do the job. For words of type index regular expression \bind(ex|exes|ices)\b will correspond to the most common forms.
Since I'm not really a language expert, this time I will just leave it.
Input wildcards
Similar to the above, it is very easy to add wildcard support to the input string. Say we want the user to enter ? to indicate any character. If the input is ?red , do we just need to replace ? to \w in our regular expression. For example, \b\wred\b will match fred and dred .
As above, you can also use multiple wildcards, replace them with \w+ for one or more characters or \w* for zero or more characters. \bf\w+d\b will match fed and feed , \w* will also match fd .