Anagram permutations and spell checking

Hello,

Trying to write an anagram game based on the (UK) TV show Countdown. From @Mark’s scramwords thread http://twolivesleft.com/Codea/Talk/discussion/1506/scramwords-a-word-game/p1 I initially implemented the dictionary as 26 long lookup strings. This worked great for checking individual words, but I wanted to use it for finding the longest word from 9 (semi) random letters. Creating all the permutations and checking them against the strings was slow.

I then implemented @Andrew_Stacey’s key in a table lookup approach http://twolivesleft.com/Codea/Talk/discussion/comment/16821#Comment_16821
Much quicker, but still taking around 7 seconds to find the longest possible word (I also abandon further lookups of a word length once a word of that length has been found, e.g. if I find “count” as a five letter word in “dowcounnt” then I won’t check the spelling of any further 5 letter combos).

Any suggestions? My inter web research has led me to think a trie structure may be the way forward. Anyone tried this?

@West - something like this?

https://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm

@Ignatz - yes or, https://en.wikipedia.org/wiki/Trie or https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton or https://en.wikipedia.org/wiki/GADDAG

Have you tried implementing it or something similar? I’m not really up to speed on them

@West - nope, I haven’t

I tend to think a trie could be slow too, because it requires hacking through all combinations of the 9 letters. It is surely much faster to work backwards from the longest words to the shortest words. The trick is to minimise the amount of letter matching.

A very fast way seems to be to create a 26-bit pattern for each word in the dictionary and OR this with the bit pattern of the 9 letters. That is, once we get a bit library in Codea.
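
Until that bit library turns up, the same idea can be faked with a 26-entry table of flags per word instead of an integer bitmask. A minimal sketch (helper names are made up, words assumed upper case); note it only records which letters occur, not how many, so it can only act as a quick pre-filter before a full letter-count check:

function LetterSet(word) --one flag per distinct letter, A=1 .. Z=26
    local set = {}
    for i = 1, #word do
        set[string.byte(word, i) - string.byte("A") + 1] = true
    end
    return set
end

function SetContained(wordSet, rackSet) --every letter the word uses must appear somewhere in the rack
    for k in pairs(wordSet) do
        if not rackSet[k] then return false end
    end
    return true
end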

As we can’t do this, an alternative might be to pre-sort the letters in each dictionary word (requiring you to hold two versions of each word), putting the least used letters like Z,X etc at the front. Then you start matching these letters to your 9 letters (sorted the same way) and because you have put the least popular letters first, you are likely to fail quickly and minimise the number of comparisons required to reject candidate words. You would sort your dictionary words by length and then work backwards toward the shorter words, until you found your first match.
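
A rough sketch of that pre-sorting idea; the rarity order below is an assumption (roughly least to most common), so adjust it to your own letter frequencies:

RARITY = "ZQXJKVWYBFGPMHDCULTONRAISE" --assumed order, rarest letters first
local rank = {}
for i = 1, #RARITY do rank[string.sub(RARITY, i, i)] = i end

function SortByRarity(word) --store this alongside each dictionary word
    local t = {}
    for i = 1, #word do t[i] = string.sub(word, i, i) end
    table.sort(t, function(a, b) return rank[a] < rank[b] end)
    return table.concat(t)
end

function ContainedIn(sortedWord, sortedRack) --both strings already rarity-sorted
    local rack = sortedRack
    for i = 1, #sortedWord do
        local c = string.sub(sortedWord, i, i)
        local pos = string.find(rack, c, 1, true)
        if not pos then return false end --rare letter missing, so we reject early
        rack = string.sub(rack, 1, pos - 1)..string.sub(rack, pos + 1) --use that letter up
    end
    return true
end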

Alternatively, a crude version of the bit mapping would be to pre-count the number of letters of each dictionary word that fall into (say) 3 groups of letters, such as eta, oinsr, hldcumfpgwybvkxjqz, chosen to have approximately equal frequency. This gives you a 3 digit number for each word, which is like a crude hash, and which could be tested against the equivalent number for the 9 letters. The challenge is to quickly test whether each of the digits for the 9 letters is at least as great as the corresponding digit in the dictionary word - in which case you do a full letter by letter comparison. The idea is to discard as many non-matches as quickly as possible without having to do the full comparison.
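
A sketch of that crude grouping, using the three groups suggested above (helper names are made up):

GROUPS = { "ETA", "OINSR", "HLDCUMFPGWYBVKXJQZ" } --roughly equal total frequency

function GroupCounts(word) --how many of the word's letters fall in each group
    local counts = {0, 0, 0}
    for i = 1, #word do
        local c = string.sub(word, i, i)
        for g = 1, 3 do
            if string.find(GROUPS[g], c, 1, true) then
                counts[g] = counts[g] + 1
                break
            end
        end
    end
    return counts
end

function PassesPrefilter(wordCounts, rackCounts) --only do the full letter-by-letter comparison if this passes
    for g = 1, 3 do
        if wordCounts[g] > rackCounts[g] then return false end
    end
    return true
end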

@West - actually, this effort I made a while back works very fast. It only searches down to 4 letters, but is quick. See near the bottom of the thread. The important part is the function GetContaining.

http://twolivesleft.com/Codea/Talk/discussion/2454/word-list-for-anyone-designing-word-games/p1

@Ignatz - thanks I’ll have a look tonight.

@West - if you run the code, once it’s loaded the words, put nine letters into the text box at top left and press the 9 letter word puzzle button to get a list of all words contained in the letters you chose.

One way to do it is to assign a prime number to every letter of the alphabet. A=2, B=3, C=5, … This would make Z=101, since that’s the 26th prime number. Since every number has a unique prime factorisation, every set of letters (counting repeats) gives a product shared only by its anagrams.

Technically the largest number you would need would be 101^9, which would be for ZZZZZZZZZ. This requires 60 bits to represent as an unsigned integer, so it wouldn’t fit exactly in Lua, which I think uses doubles to represent numbers. However, if you re-order the letters by frequency you can probably make it so that no valid word needs more than 53 bits, making it expressible as a double with no loss of accuracy (and even this step is probably not necessary since the alphabet is fairly front-loaded).

This method obviously does require you to precompute products for every 9-letter word in your dictionary.
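
A sketch of the prime-product scheme, assuming upper-case words and products that stay within the 53 bits a double holds exactly (function names are made up):

PRIMES = {2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101} --A=2 .. Z=101

function LetterProduct(word)
    local p = 1
    for i = 1, #word do
        p = p * PRIMES[string.byte(word, i) - string.byte("A") + 1]
    end
    return p
end

function IsAnagram(a, b) --equal products means the same letters, by unique factorisation
    return LetterProduct(a) == LetterProduct(b)
end

function FitsRack(word, rackProduct) --a word's letters fit inside the rack when its product divides the rack's product
    return rackProduct % LetterProduct(word) == 0
end

The divisibility test falls straight out of unique factorisation, so if you also precompute products for the shorter words, the same numbers answer both "is this an anagram" and "does this fit in my nine letters".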

@dabuek - funnily enough, I thought of that too and calculated how big a number you would need; I thought it was too big for Lua. But it’s worth a bit of work to check.

interesting discussion, interesting links. Thank you guys!

@West I tried just a normal double “for” loop to compare a set of letters to the letters in each word. I compared 9 letters to the 300,249 words in my dictionary and it took 13 seconds. It took 8 seconds to compare 5 letters, and 5 seconds for 3 letters. I don’t know how this compares to what you’re doing. How many words are in your dictionary, and how many letters are being compared?

EDIT: Made changes and now the 9 letter compare takes 3-5 seconds.

For any given 9 letter input, I check up to 986410 possible permutations (based on the java implementation here http://stackoverflow.com/questions/361/generate-list-of-all-possible-permutations-of-a-string)
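
For reference, that figure looks like the number of ordered selections of every length from 9 letters, including the empty one: 1 + 9 + 9*8 + 9*8*7 + ... + 9!, which a quick loop confirms:

local total, perms = 1, 1 --the initial 1 counts the empty selection
for k = 1, 9 do
    perms = perms * (9 - k + 1) --ordered selections of length k, i.e. 9!/(9-k)!
    total = total + perms
end
print(total) --986410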

I only check the spelling until a word of that length is found. Not sure of the exact dictionary size - it’s based on TWL06 so about 180k words. Lookup time to find a word of each length 2-9 is about 7s.

@West, @dave1707 - I only have 100000 words of 4-9 letters, but it takes about 1 second to find all matches. If I restricted it to the first match it would be much quicker.

@Ignatz wow! That’s fast! The puzzle seems to have a constraint that one particular letter must be included though - doesn’t this reduce the lookup significantly?

@West - no, that doesn’t affect speed at all, I test that afterwards

Here is my function, adapted for general use. It takes a second or two to get down to 4 letter words.

--finds and prints the longest word in table WordList whose letters are contained in string Letters
--assumes words are sorted by length from shortest to longest
--requires precalculation of a table called LengthChange which stores the starting index of
--each change in word length, eg if the first 5 letter word is the 25,000th word, then
--LengthChange[5]=25000
--this is of course so we don't waste time searching dictionary words longer than the
--number of letters we have been given

function DoContaining(WordList,Letters)
    local str=string.upper(Letters) --my dictionary is upper case
    local u=string.len(str)
    --set starting point of search
    local j
    if u==0 then print("too few letters") return end
    if u>=9 then j=#WordList else j=LengthChange[u+1] end
    local x
    for i=j,1,-1 do --work backward
        local txt1=str
        local txt2=WordList[i]
        for k=1,string.len(txt2) do --try to remove each of the word's letters from our letters
            txt1,x=string.gsub(txt1,string.sub(txt2,k,k),"",1)
            if x==0 then break end --letter not available, reject this word
        end
        if x>0 then print(WordList[i]) return end --all letters found
    end
    print( "No words found")
end
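
For anyone trying this, a quick sketch of the LengthChange precalculation the function assumes (WordList already sorted shortest to longest), plus an example call:

LengthChange = {}
for i = 1, #WordList do
    local len = string.len(WordList[i])
    if LengthChange[len] == nil then
        LengthChange[len] = i --index of the first word of this length
    end
end

DoContaining(WordList, "dowcounnt") --prints the longest word whose letters are in DOWCOUNNT

Note the function indexes LengthChange[u+1], so when you pass fewer than 9 letters it assumes the list contains at least one word of the next length up; if that length is missing, LengthChange[u+1] is nil and you would need a fallback.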

@West, @Ignatz, @dave1707, where do you get your dictionaries? I’m also working on a word game (basically a more efficient version of Anagrams with some extra functionality), and a dictionary would be very helpful.

EDIT: Ignatz, I looked at yours, but it takes a long time to initialize when it’s included in the code. Are all of them this slow?

@JakAttak - I got mine years ago, I believe it consists of the Scrabble dictionary

There are a few seconds of initialisation at the start because the words are highly compressed (to about the same size as if you zipped them), enabling you to include them directly in the code. You could write code that only did this once and then stored the words locally.

@Ignatz, I stored the table in project data (that took about 30 min… 8-X ) and now I just load it up with loadstring. This is faster, but still not quite as fast as I would prefer (about three to four seconds total).
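
If it helps, one way to skip the loadstring step entirely is to cache a single newline-separated string and split it on load; a sketch assuming Codea's saveProjectData/readProjectData:

local raw = readProjectData("words") --nil on the very first run
if raw == nil then
    raw = table.concat(WordList, "\n") --WordList built the slow way, once
    saveProjectData("words", raw)
end
WordList = {}
for w in string.gmatch(raw, "[^\n]+") do
    WordList[#WordList + 1] = w
end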

I wonder how fast Dave and Mark’s solutions are, as they claim dictionaries with many more words.

@JakAttak - I would be surprised if theirs is any faster, because loading from local storage is about as fast as you will get.

A lot of programs have a few second delay at the beginning, so I wouldn’t be too concerned about that.

@JakAttak I used TWL06 (google to find downloads) but there are other lists too. I converted to 26 files and read them all in but it was a real pain with the word wrap bug.

@Ignatz, there is a typo in the DoContaining you posted above - the call to if u=0 should be if u==0

Also, I’m struggling to adapt it to run without the constraint that the anagram must contain the first letter