Scrabble wildcard word search

I had a problem, and it seems that I have problems, but I did not have the opportunity to find a working solution for me.

Im currently creating a mobile web application using C #, MySQL, HTML5 and Javascript. The application will be used to help users find possible words to play during games such as Scrabble.

Ive got the problem: How to get the correct words from a MySQL database containing a dictionary from user letter input?

More details: - Users can enter any number of letters, as well as use wildcards (representing any letter). - If the user enters "TEST", the result cannot contain words with more than 1 E and S and words with more than 2 T, the result with "TESTER" in it will be bad. - The result cannot contain words with more letters than entered.

UPDATE: Trie seems to be the solution to my problem suggested by Eric Lippert here .
The problem is that I'm new to both C # and MySQL, so here are some of the following questions:

  • How to create Trie from my MySQL dictionary? (400k + words)
  • How to save Trie for quick and future access?
  • How do I access Trie and extract words from it using C #?

Thank you for help!

+6
source share
3 answers

How to get the right words from a MySQL database containing a dictionary from user letter input?

Not. A relational database table is not a suitable data structure to solve this problem as efficiently as you need.

Instead, you create a trie data structure from a dictionary (or, if you're really a buff, you create a dawg - oriented acyclic word graph that is a kind of compressed trie.)

Once you have trie / dawg, it becomes very inexpensive to test every word in a dictionary against a given rack, because you can "cut" entire huge branches of the dictionary that cannot be combined in a rack.

Let's look at a small example. Suppose you have the dictionary "OP, OPS, OPT, OPTS, POT, POTS, SOP, SOPS, STOP, STOPS". From this, you create this trie: (Nodes with $ are those marked as "the word may end here").

^root^ / | \ OPS | | / \ P$ OOT / \ | | | T$ S$ T$ P$ O | | | | S$ S$ S$ P$ | S$ 

and you have an OPS rack - what are you doing?

First you say: โ€œCan I go down the O branch?โ€ Yes you can. So now the problem is the match of โ€œPSโ€ with branch O. Can you go down the subbrand? Yes. Does he have a word end marker? Yes, so the OP is a coincidence. Now the problem is that "S" matches the OP branch. Can you go down the T branch? Not. Can you go down the S branch? Yes. You now have an empty rack and you must map it to the OPS branch. Does he have a word end marker? Yes! So that is OPS. Now go back to the root.

Can you go down the P branch? Yes. Now the problem is to map the OS to branch P. Go down the PO branch and map to S - this will not work. Rollback to the root.

And again, you see how this happens. In the end, we go down the SOP branch and find the end of the word in SOP, so "SOP" corresponds to this stance. We do not go down the ST branch because we do not have T.

We tried all possible words in the dictionary and found that OP, OPS and SOP all match. But we never had to research OPTS, POTS, STOP or STOPS because we didn't have T.

Do you see how this data structure makes it very efficient? Once you have determined that you do not have letters on the rack to start a word, you do not need to examine vocabulary words starting from that beginning. If you have software but no T, you do not need to investigate POTSHERD or POTATO, POTASH or POTLATCH or POTABLE; all these expensive and fruitless searches go away very quickly.

Adaptation of the system to work with "wild" plates is quite simple; if you have OPS ?, then just run the search algorithm 26 times, on OPSA, OPSB, OPSC ... It should be fast enough to make it 26 times cheap (or do it 26 x 26 times if you have two spaces.)

This is the main algorithm used by professional Scrabble AI programs, although, of course, they also have to deal with things such as the position of the board, rack management, etc., which complicates the algorithms somewhat. This simple version of the algorithm will be fast enough to generate all possible words on the rack.

Do not forget that, of course, you only need to calculate trie / dawg once if the dictionary does not change over time. This can take a long time to create a trie from the dictionary, so you can do it once, and then figure out how to store the trie on disk in a form suitable for quick rebuild from disk.

You can optimize memory usage by building DAWG from trie. Note that there are many repetitions, because in English many words end the same way, just as many words begin the same way. Three do a great job of sharing nodes at the beginning, but with the lousy job of sharing them at the end. For example, you may notice that the "S $ without children" pattern is extremely common and turns trie into:

  ^root^ / | \ OPS | | / \ P$ OOT / \ | | | T$ | T$ P$ O | \ | | | \ \| / P$ \ |/ | \ | / \ | / \ | / \| / |/ | S$ 

Saving a whole bunch of nodes. And then you can notice that now two words end with OP $ -S $, and two words end with T $ -S $, so you can compress it further:

  ^root^ / | \ OPS | | / \ P$ O \ T / \| \ | | | \| | | O | T$ | \ | P$ \ | / \| / | / |/ S$ 

And now we have a minimal DAWG for this dictionary.

Further reading:

http://dl.acm.org/citation.cfm?id=42420

http://archive.msdn.microsoft.com/dawg1

http://www.gtoal.com/wordgames/scrabble.html

+23
source

Here is how I could solve the problem (assuming, of course, that you have control over the database and can change tables / add tables or even control the initial load of the database).

My solution would be to use 2 tables -> one table will just be a list of all possible letter combinations from your dictionary with the letters of the components sorted alphabetically. (IE TEST will be ESTT, TESTER will be ERSTT, DAD will be ADD).

The second table will have each word and a key reference for the table.

Table One - LetterInWord

 Index Letters 1 ESTT 2 ESTTER 3 EST 4 ADD 5 APST 

In the first table, you insert the letters of the words in alphabetical order - the test becomes estt

Table Two - Words

 Index LetterInWordIndex Word 1 1 TEST 2 2 TESTER 3 3 SET 4 4 ADD 5 4 DAD 6 5 SPAT 7 5 PAST 

In table 2, you insert a word with the corresponding word and a link to the index.

This will be a one-to-many relationship โ†’ One entry in the LetterInWord table can have multiple entries in the word table

Searching for a non-discriminatory card: Say my introductory letters: SETT Sort them alphabetically.

Then in the search you select all the "Letters" from LetterInWord, where Letters = value and join on table Words - your output in one query is a list of all words containing only letters

Now for wildcards: Let's say my input letters are EST * Remember the length - 4 Highlight the wildcards - you will get EST (make sure you sort it alphabetically) Now find all cases where the letters contain EST and the letters Length <= 4 connected in the Words table

This will return TEST, REST, SET, etc.

I'm not sure if this is the most efficient method, but it works. I used it in the past to search for words from dictionaries, and it has reasonable performance with minimal complexity.

0
source

It will be very difficult to do if you have a dictionary. If you have the opportunity to create a new table or new columns, I would:

Create a table with a column for the word plus 26 columns (one for each letter) Run the stored proc / backend process, which takes into account the occurrences of each letter in the word and puts them in the corresponding column.

Then (ignoring wildcards) you can do

Select a word from the dictionary, where tcount <= 2 and ecount <= 1 and scount <= 1

for wildcards you could do and lengths <= number_of_letters

In fact, always use the length clause, because then you can index it for better performance.

During the request, everything else will be exceptionally slow

-one
source

All Articles