String Algorithm Question - Beginning of a Word

Question

String Algorithm Question - Beginning of a Word

I have a problem, and I'm not too sure how to solve it without reducing the path of inefficiency. Let's say I have a list of words:

Apple
Ape
Arc
Abraid
Bridge
Braide
Bray
Boolean

What I want to do is process this list and find out that each word begins with a certain depth, for example.

a - Apple, Ape, Arc, Abraid
ab - Abraid
ar -Arc
ap - Apple Ape
b - Bridge, Braide, Bray, Boolean
br - Bridge, Brad, Bray
bo - Boolean

Any ideas?

+4

string algorithm

Matthew h Aug 7 '10 at 20:27

source share

5 answers

You can use the Trie structure.

  (root) / a - b - r - a - i - d / \ \ pre / \ \ pec / l / e

Just find the node you want and get all its descendants, for example if I want ap- :

  (root) / a - b - r - a - i - d / \ \ [p] re / \ \ pec / l / e

+11

quantumSoup Aug 7 '10 at 20:29

source share

I don’t know what you’re thinking about when you say the “route of inefficiency”, but a rather obvious solution comes to mind (perhaps the one you are thinking about). Trie looks like a framework for this kind of problem, but it is expensive in terms of memory (there is a lot of duplication), and I'm not sure if this will speed things up in your case. Perhaps memory usage would pay off if the information was received many times, but your answer tells you that you want to generate the output file once and save it. Thus, in your case, Trie will only be generated for passing once. I do not think this makes sense.

My suggestion is to simply sort the list of words in lexical order and then go through the list in order of max. start length.

 create a dictionary with keys being strings and values being lists of strings for(i = 1 to maxBeginnigLength) { for(every word in your sorted list) { if(the word length is no less than i) { add the word to the list in the dictionary at a key being the beginning of the word of length i } } } store contents of the dictionary to the file

+2

Maciej hehl Aug 7 '10 at 21:11

source share

Using this implementation of PHP trie , you get about 50%. He got some things that you don’t need, and he doesn’t have a “prefix search” method, but you can write it yourself quite easily.

 $trie = new Trie(); $trie->add('Apple', 'Apple'); $trie->add('Ape', 'Ape'); $trie->add('Arc', 'Arc'); $trie->add('Abraid', 'Abraid'); $trie->add('Bridge', 'Bridge'); $trie->add('Braide', 'Braide'); $trie->add('Bray', 'Bray'); $trie->add('Boolean', 'Boolean');

He creates the following structure:

 Trie Object ( [A] => Trie Object ( [p] => Trie Object ( [ple] => Trie Object [e] => Trie Object ) [rc] => Trie Object [braid] => Trie Object ) [B] => Trie Object ( [r] => Trie Object ( [idge] => Trie Object [a] => Trie Object ( [ide] => Trie Object [y] => Trie Object ) ) [oolean] => Trie Object ) )

+1

John kugelman Aug 7 '10 at 20:53

source share

If the words were in the database (Access, SQL), and you wanted to get all the words starting with "br", you can use:

 Table Name: mytable Field Name: mywords "Select * from mytable where mywords like 'br*'" - For Access - or "Select * from mytable where mywords like 'br%'" - For SQL

0

Claudio Aug 7 '10 at 20:57

source share

Jim dennis · Accepted Answer · 2010-08-07T22:26:47+0000

Perhaps you are looking for something like:

#!/usr/bin/env python def match_prefix(pfx,seq): '''return subset of seq that starts with pfx''' results = list() for i in seq: if i.startswith(pfx): results.append(i) return results def extract_prefixes(lngth,seq): '''return all prefixes in seq of the length specified''' results = dict() lngth += 1 for i in seq: if i[0:lngth] not in results: results[i[0:lngth]] = True return sorted(results.keys()) def gen_prefix_indexed_list(depth,seq): '''return a dictionary of all words matching each prefix up to depth keyed on these prefixes''' results = dict() for each in range(depth): for prefix in extract_prefixes(each, seq): results[prefix] = match_prefix(prefix, seq) return results if __name__ == '__main__': words='''Apple Ape Arc Abraid Bridge Braide Bray Boolean'''.split() test = gen_prefix_indexed_list(2, words) for each in sorted(test.keys()): print "%s:\t\t" % each, print ' '.join(test[each])

That is, you want to generate all the prefixes that are present in the list of words between one and some number that you specify (2 in this example). Then you want to create an index of all the words corresponding to each of these prefixes.

I am sure there are more elegant ways to do this. For a quick and easy-to-explain approach, I just built this from a simple upstream functional decomposition of an apparent specification. Of the final values of the result are lists, each of which corresponds to a given prefix, then we start with a function to filter out such matches from our inputs. If the keys of the final result are all the prefixes between 1 and some N that appear on our input, then we have a function to extract them. Then our specification. this is a very simple nested loop around this.

Of course, this socket loop can be a problem. Such things are usually equated with an efficiency of O (n ^ 2). As shown, this will iterate over the original list C * N * N times (C is a constant number representing prefixes of length 1, 2, etc., and N is the length of the list).

If this decomposition provides the desired semantics, we can look at increasing efficiency. An obvious approach would be lazy to generate dictionary keys, since we repeat once over the list ... for each word, for each prefix length, generate a key ... attach this word to the list / value stored on this key. and move on to the next word.

There is still a nested loop ... but this is a short loop for each key / prefix length. This alternative design has the advantage of allowing us to iterate over word lists from any iterative, not just a list in memory. Thus, we could iterate over the lines of the file, the results obtained from the database query, etc. --- Without imposing memory overhead on saving the entire source list of words in memory.

Of course, we still keep the dictionary in mind. However, we can also change this, separate the logic from input and storage. When we add each input to different prefix / key values, we don’t care whether they are lists in a dictionary or strings in a set of files, or values that are retrieved from (and discarded back to) DBM or other key / value storage (for example, some type CouchDB or another "clustered / noSQL database".

The implementation of this is left as an exercise for the reader.

String Algorithm Question - Beginning of a Word

More articles: