Regular Expression (Javascript) - Take a Scrambled Word and Find a Stitched Match

I have a list of all the words in the English dictionary (270,000 words) stored in a variable called theList . I have a scrambled word word that I want to decrypt by matching it with a list of words. Initially, I thought the following code would do the trick, but it doesn't work so well.

 var theList; // Contains all the words in the English dictionary. var word = "iexospensr"; // The word I want to unscramble. var matches = word.match(new RegExp("^["+word+"]{"+word.length+"}$", "gim")); 

I would expect "EXPRESSION" as a result without decryption, but instead I get much more results (see below).

EERINESSES,EXPRESSERS,EXPRESSION,IRONNESSES,ISOSPORIES,NONPERSONS,NONPROSSES,NOSINESSES,OPENNESSES,OPPRESSION,OPPRESSORS,ORNERINESS,PENSIEROSO,PENSIONEER,PENSIONERS,PEPPERONIS,PERSIENNES,PERSONISES,PIPINESSES,PIXINESSES,POORNESSES,PORINESSES,POSSESSION,POSSESSORS,PREEXPOSES,PREPOSSESS,PREPPINESS,PRESENSION,PRIORESSES,PRISSINESS,PROPENSION,PROPERNESS,REINSPIRES,REPRESSERS,REPRESSION,REPRESSORS,RESERPINES,RESPONSERS,RESPONSORS,RIPENESSES,ROPINESSES,ROSINESSES,SERENENESS,SEXINESSES,SIXPENNIES,SNIPPINESS,SORENESSES,SPINNERIES

Perhaps if I could find a way to tell the regular expression to examine each letter in the word string only once, regardless of the order of the letters. Thus, the end result will be an array of combinations of these letters, not permutations (which I have now).

Any help would be appreciated.


EDIT: I think the way is: 1. find all combinations of the scrambled word 2. match them with a list of words for validation

If you have a better solution (in terms of performance), this will help.


The best solution to this problem is to reorder the anagram alphabetically and the entire list of words and match the word with each item in the list.

Here is the code:

  var textList; // the entire dictionary var list = textList.match(/^.*$/gim); var sortedList = []; list.forEach(function(element, index, array) { sortedList[index] = element.split("").sort().join(""); }); function unscramble(word) { word = word.toUpperCase().split("").sort().join(""); var matches = []; for (var i = 0; i < list.length; i++) { if (word.indexOf(sortedList[i]) >= 0) { if (!matches[list[i].length]) matches[list[i].length] = []; matches[list[i].length].push(list[i]); } } return matches; } 
+4
source share
8 answers

Do not use regular expressions for this, there are simpler ways if you split the dictionary into words instead of making a big big line:

  • A scrambled word is determined by the frequency of occurrence of letters:

     //WARNING, untested code alphabet = 'qwertyuiopasdfghjklzxcvbnm'; function empty_frequences(){ var freqs = {}; var i=; for(i=0; i<alphabet.length; i++){ freqs[alphabet[i]] = 0; } return freqs; } function frequences(str){ var freqs = empty_frequences(); var i; for(i=0; i<str.length; i++){ freqs[str[i]] += 1; } } 
  • Use this fact to find all matches in the dictionary.

     function matcher(word){ //returns a function that matchs against this word var word_freqs = frequences(word); function do_the_match(word2){ var freqs2 = frequences(word2); var i, c; for(i=0; i<alphabet.length; i++){ c = alphabet[i] if(freqs[c] > freqs2[c]){return false;} //change > to != to allow only strict anagrams } return true; } return do_the_match; } function main(word, dict){ var mf = matcher(word); var i, matcheds = []; for(i=0; i<dict.length; i++){ if(mf(dict[i])){ matcheds.push(dict[i]); } } return matcheds; } 
+1
source

I think a better approach would not use regular expressions. Instead, it will check each member of the list against your scrambled word, skipping the characters of the word and looking to see if that character exists in the word in the list. Each time he finds a symbol, he can mark that symbol as "already used."

Here is something to indicate the position of the symbol as "used":

 function checkUsed(o, which) { if (o[which] != null) { o[which] = 1; return false; } return true; } var usedMap = []; if (checkUsed(usedMap, 5) == false) { ... } 
+2
source

Here is an idea for you. Building initial search data will be slow, but matching should be simple. However, you only need to build the dictionary once and load it! Recounting every time is a waste of time.

  • I assume that you only use the Latin alphabet (that is, what is written in English), everything is case-insensitive, and you do not use numbers ... etc. therefore you only have the characters AZ.

  • For each word in your dictionary, create a "hash" based on counting each occurrence of letters. The hash array will contain 26 positions. Each position will count the time during which a specific character occurs for that position. (for example, A is in the first position / index of the array 0, Z at the 26th / index 25)
    To cheat a little, you can save the results as a couple of lines. Few, if any, words have 9 repetitions of one letter, so one โ€œnumberโ€ per letter should work fine. For example: "the" becomes "00001001000000000001000000"; "hat" becomes "10000001000000000001000000"; "this" becomes "10000001000000000002000000".

  • Download the pre-computed dictionary. Use the hash value as a key in a key-value pair and as a collection value. Add each word with the same key to the end of the collection for that key.

  • Run the same hashing algorithm on the scrambled word and find the key. Print the collection referenced by the key.

EDIT 1: If building the dictionary up is not viable, use a variation on this where you create an associative array / dictionary with the letter as the key and the number of times it finds as the value. Before calculating this, compare the lengths, if the strings have different lengths, then do not bother to compare, because you know that they do not correspond to each other. After calculating these arrays for the source (scrambled) and the target (possible match), compare the keys and values โ€‹โ€‹in your associative array.

EDIT 2: To a large extent on the same lines as above, sort the characters inside the line for both source and destination lines.

+1
source

Just for fun:

 > var words = 'exceptional extraordinary retinas retains retsina antsier nastier retrains starfish'; > words.match(/\b([aeinrst])(?!\1)([aeinrst])(?!\1|\2)([aeinrst])(?!\1|\2|\3)([aeinrst])(?!\1|\2|\3|\4)([aeinrst])(?!\1|\2|\3|\4|\5)([aeinrst])(?!\1|\2|\3|\4|\5|\6)([aeinrst])\b/ig) [ 'retinas', 'retains', 'retsina', 'antsier', 'nastier' ] 

Please note that I cannot figure out how to make the above method work if you have two identical letters, for example. I can not match "boo" :)

+1
source

If the search should be quick, and building up at the beginning does not present a big problem, then using Trie is the most effective solution I know about. I could explain this, but the WP article is actually very good and gives code samples.

A histogram solution is probably best if you are interested in matching two given lines.

+1
source

I don't know if regex is the best tool for this job. The regular expression that you create will be

 "^[iexospensr]{10}$" 

which matches any 10-letter word made up of any letter in the character class [iexospensr] .

Perhaps if I could find a way to tell the regular expression to consider each letter in a string word only once, regardless of the order of the letters.

You can do this with word.length various regular expressions, but some of your letters are repeated. You would come closer if you sorted the letters in a scrambled word, and then searched for words that have the correct number of repetitions of each letter. For example, two e, two s, one x, etc.

0
source

Regular expressions, although powerful, are not the solution for everything.

In some cases, like this, itโ€™s just better to create your own solution: Start by deleting all the words that do not match the required length, then start comparing the letters.

Depending on the length of your dictionary, you can create various optimizations.

0
source

I should have seen this Q & A for a long time. I am doing this and I want to share my solution to the problem.

Solution: Step 1: Alphabetically sorting the scrambled word (Note: or even the scrambled page of the book, for that matter)

Step 2: Create your WORD or PAGE list with an extra column for the sorted word (Note: you can use this column if you want)

Step 3: Do your respective process. This should find the scrambled word from the search list.

I did some research on finding an arbitrary scrambled no. words per page and creating a list containing these scrambled words, taking into account the scrambled letters.

0
source

All Articles