Returning a substring of strings of 10,000 ascii strings

Question

Returning a substring of strings of 10,000 ascii strings

My college is gaining momentum, so I began to prepare for the interview to get the assignment, and I came across this interview question while I was preparing for the interview

You have a set of 10,000 ascii lines (loaded from a file)
The string is entered from stdin.
Write a pseudocode that returns (to stdout) a subset of the strings in (1) that contain the same individual characters (regardless of order), like the input in (2). Optimize your time.
Suppose you need to call this function repeatedly. Initializing an array of strings once and storing in memory in order. Please avoid solutions that need to loop through all 10,000 lines.

Can someone provide me with a common pseudo-code / algorithm how to solve this problem? I scratch my head, thinking about a solution. I am mostly familiar with Java.

+6

java substring algorithm

AKIWEB Nov 16 '12 at 22:27

source share

3 answers

They say that they are optimized for time, so I think that we will safely abuse the space as much as we want.

In this case, you can perform an initial pass along 10,000-lines and build a mapping from each of the unique characters present in 10,000 to their index (and not to the set of their indices). So you can ask a question display that contains the character "x"? Call this mapping M> (order: O (nm) when n is the number of rows and m is their maximum length)

To optimize the time again, you can reduce the input string stdin to unique characters and put them in the Q queue. (Order O (p), p is the length of the input string)

Start a new disjoint set, say S. Then let S = Q.extractNextItem.

Now you can iterate over the remaining unique characters and find which sets contain all of them.

So far (Q is not empty) (loops O (p)) {

S = S intersects Q.extractNextItem (close to O (1) depending on your implementation of disjoint sets)

}

voila, return S.

Total time: O (mn + p + p * 1) = O (mn + p)

(Early in the morning here, I hope the time analysis was right)

0

dakotapearl Nov 16 '12 at 10:54

source share

As the Bohemian says, the trie tree is definitely the way to go!

This is similar to how address book searches work on the phone. Start punching the numbers, and then filter the address book based on the number, as well as any of the three (or even more if you use international characters) letters that will represent the number.

0

Jay pete Nov 16 '12 at 23:09

source share

Bohemian · Accepted Answer · 2012-11-16T22:48:36+0000

Here is the O (1) algorithm!

Initialization:

For each line, sort the characters by removing duplicates - for example, "trees" become "erst"
load the sorted word into the trie tree using sorted characters, adding a link to the original word in the list of words stored in each node passed

Search:

sort the input line in the same way as initialization for source lines
follow the original trie line with the characters at the end of the node, return all the words to which they refer.

Returning a substring of strings of 10,000 ascii strings

More articles: