Data structure for matching patterns on big data

Main problem

I have a finite vocabulary containing, for example, 10 characters [A-J]. What these symbols mean is irrelevant: they could be DNA bases, phonemes, words, etc.

An element is a sequence of characters. In this problem, all elements have the same length (say, 6). For example:

ACBADJ 

I have a large (5M) table containing counts of all elements of length 6 taken from some known data. For example:

 ACBADJ 55
 BCBICF 923
 ABCDEG 478

Given a new sequence with one unknown symbol, my task is to guess that symbol. In the following example, the missing character is marked with ?:

 BCB?CF

A simple way to guess ? is to scan my table and find the element with the highest count that matches the pattern BCB?CF.
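For concreteness, here is a minimal Java sketch of that brute-force baseline (class and method names are made up for illustration); the whole point of the question is to beat this O(n) scan:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BruteForceGuess {

        /** Scan every element and return the wildcard character taken from the
         *  matching element with the highest count (null if nothing matches). */
        static Character guess(Map<String, Long> counts, String pattern) {
            int gap = pattern.indexOf('?');
            Character best = null;
            long bestCount = -1;
            for (Map.Entry<String, Long> e : counts.entrySet()) {   // O(n) over all ~5M elements
                if (matches(e.getKey(), pattern, gap) && e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    best = e.getKey().charAt(gap);
                }
            }
            return best;
        }

        /** True if elem equals pattern everywhere except at the '?' position. */
        static boolean matches(String elem, String pattern, int gap) {
            for (int i = 0; i < pattern.length(); i++) {
                if (i != gap && elem.charAt(i) != pattern.charAt(i)) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            Map<String, Long> counts = new LinkedHashMap<>();
            counts.put("ACBADJ", 55L);
            counts.put("BCBICF", 923L);
            counts.put("ABCDEG", 478L);
            System.out.println(guess(counts, "BCB?CF"));            // prints I
        }
    }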

Questions

  • What is a good data structure for storing my element frequency table so that I can trade space and time efficiently enough? I prefer to use less memory as long as the computation at query time stays reasonable. (I will have many such tables, so the 5M figure is only an approximation.)

  • What are some implementation details that can make a big difference in processing speed?

Things I thought of:

  • Make a string out of each sequence and use regular expressions. Caveats: (1) O(n) scanning is not acceptable; (2) regular expressions are slow; (3) strings (at least in Java) are memory-heavy.

  • Let Lucene handle the indexing. Disable tf-idf scoring and use phrase search. Potentially use the count values for scoring so that Lucene also takes care of the sorting.

  • Use a prefix trie and a suffix trie to index each element.

  • Use a database (probably in-memory), with all the data in a single column or in separate columns, and let it handle the search.


Update

  • In my actual application, I will work with sequences of length 5, 6, 7, 8, 9 and 10, stored separately. I simplified the problem by limiting it to a fixed length; hence the constraint / preference for a solution that uses less memory.
  • My vocabulary size may be less than 20.
+7
5 answers

Based on the comment that there will be only one unknown, you can do the following:

Put your data in a hash table. When you need to look up a pattern, generate all possible substitutions for the wildcard; since your vocabulary is limited, this means at most 20 candidate patterns. It sounds like a hack, but if you think through the consequences of the other methods, it is hard to beat: a hash table lookup is O(1), and 20 lookups are still O(1).

This method is not recommended if the number of wildcards can grow, although it may still work reasonably well for 2 or 3.
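A small Java sketch of this idea, assuming a single ? and the 10-character vocabulary from the question (all names here are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class WildcardExpansionLookup {

        private static final String VOCAB = "ABCDEFGHIJ";    // the small, fixed vocabulary
        private final Map<String, Long> counts = new HashMap<>();

        public void add(String element, long count) {
            counts.put(element, count);
        }

        /** Substitute every possible symbol for '?' and keep the one with the largest count. */
        public Character guess(String pattern) {
            int gap = pattern.indexOf('?');
            Character best = null;
            long bestCount = -1;
            for (char c : VOCAB.toCharArray()) {              // at most |vocabulary| O(1) lookups
                String candidate = pattern.substring(0, gap) + c + pattern.substring(gap + 1);
                Long n = counts.get(candidate);
                if (n != null && n > bestCount) {
                    bestCount = n;
                    best = c;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            WildcardExpansionLookup table = new WildcardExpansionLookup();
            table.add("ACBADJ", 55);
            table.add("BCBICF", 923);
            table.add("ABCDEG", 478);
            System.out.println(table.guess("BCB?CF"));        // prints I
        }
    }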

A double-array trie would also work and may reduce the space needed to store your strings, but lookup performance will suffer.

+2

The trie solution seems to be the best: with the counts stored in the leaves, you can easily design a function that returns all possible strings matching a pattern with one missing character in O(log n) time, and then you simply iterate over this small set of strings, looking for the maximum count. If you use characters from A to Z, there will be no more than 26 such strings, so iterating over them costs very little.
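A rough Java sketch of that descent, assuming one wildcard per pattern and counts stored at the leaves (the class is illustrative, not a reference implementation):

    import java.util.HashMap;
    import java.util.Map;

    public class CountTrie {

        private final Map<Character, CountTrie> children = new HashMap<>();
        private long count = -1;                              // count stored at the leaf of each element

        public void add(String element, long n) {
            CountTrie node = this;
            for (int i = 0; i < element.length(); i++) {
                node = node.children.computeIfAbsent(element.charAt(i), k -> new CountTrie());
            }
            node.count = n;
        }

        /** Descend along the known prefix, branch once at '?', then finish along the known suffix. */
        public Character guess(String pattern) {
            int gap = pattern.indexOf('?');
            CountTrie node = walk(this, pattern, 0, gap);
            if (node == null) return null;

            Character best = null;
            long bestCount = -1;
            for (Map.Entry<Character, CountTrie> e : node.children.entrySet()) {  // <= |vocabulary| branches
                CountTrie leaf = walk(e.getValue(), pattern, gap + 1, pattern.length());
                if (leaf != null && leaf.count > bestCount) {
                    bestCount = leaf.count;
                    best = e.getKey();
                }
            }
            return best;
        }

        private static CountTrie walk(CountTrie node, String pattern, int from, int to) {
            for (int i = from; i < to && node != null; i++) {
                node = node.children.get(pattern.charAt(i));
            }
            return node;
        }

        public static void main(String[] args) {
            CountTrie trie = new CountTrie();
            trie.add("ACBADJ", 55);
            trie.add("BCBICF", 923);
            trie.add("ABCDEG", 478);
            System.out.println(trie.guess("BCB?CF"));         // prints I
        }
    }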

AFAIK, Lucene uses this mechanism internally for its wildcard search, so you can concatenate your characters, index them with KeywordAnalyzer (to exclude stemming), and then search for "ACB?DJ". The only limitation here is that Lucene cannot handle queries with a leading "?", but you can get around that by adding one extra character at the beginning (just a trick to get past Lucene's checks) or by keeping a second index of reversed strings (which will also improve performance for queries whose first character is the wildcard).
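For reference, a self-contained sketch of that Lucene approach, written against the Lucene 8.x/9.x API with an in-memory index; the field names and the choice of ByteBuffersDirectory are assumptions, and exact class names differ slightly between Lucene versions:

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneWildcardSketch {

        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();        // in-memory index
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));
            index(writer, "ACBADJ", 55);
            index(writer, "BCBICF", 923);
            index(writer, "ABCDEG", 478);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            // '?' matches exactly one character in Lucene wildcard syntax
            WildcardQuery q = new WildcardQuery(new Term("seq", "BCB?CF"));
            for (ScoreDoc sd : searcher.search(q, 20).scoreDocs) {
                Document d = searcher.doc(sd.doc);
                System.out.println(d.get("seq") + " -> " + d.get("count"));
            }
        }

        private static void index(IndexWriter w, String seq, long count) throws Exception {
            Document d = new Document();
            d.add(new StringField("seq", seq, Field.Store.YES));   // indexed verbatim, not analyzed
            d.add(new StoredField("count", Long.toString(count))); // count kept as a stored value
            w.addDocument(d);
        }
    }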

And finally, if you need to compute the occurrence counts first anyway, you can use machine learning schemes such as decision trees to handle all the work. There have been cases where decision trees were used to compress a database and speed up search, so you can do the same. Use the strings as instances, the character positions as attributes, and the characters as attribute values. Then run an algorithm such as C4.5 (you can use the Weka implementation, called J48) with minimal pruning, and let it classify: the algorithm will do the rest!
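A heavily simplified Weka sketch of that idea follows, assuming the 10-character vocabulary and that we always predict the character in the last position from the first five (a real setup would need one tree per possible gap position); treat the exact options and structure as assumptions, not a recipe:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import weka.classifiers.trees.J48;
    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instance;
    import weka.core.Instances;

    public class J48Sketch {

        public static void main(String[] args) throws Exception {
            List<String> alphabet = Arrays.asList("A","B","C","D","E","F","G","H","I","J");
            ArrayList<Attribute> attrs = new ArrayList<>();
            for (int i = 0; i < 6; i++) attrs.add(new Attribute("pos" + i, alphabet));

            Instances data = new Instances("elements", attrs, 0);
            data.setClassIndex(5);                        // predict the character at position 5

            addElement(data, "ACBADJ", 55);
            addElement(data, "BCBICF", 923);
            addElement(data, "ABCDEG", 478);

            J48 tree = new J48();
            tree.setUnpruned(true);                       // minimal pruning, as suggested above
            tree.setMinNumObj(1);
            tree.buildClassifier(data);

            Instance query = new DenseInstance(6);        // values start out missing
            query.setDataset(data);
            String known = "BCBIC";                       // "BCBIC?" -> guess the last character
            for (int i = 0; i < 5; i++) {
                query.setValue(data.attribute(i), String.valueOf(known.charAt(i)));
            }
            int predicted = (int) tree.classifyInstance(query);
            System.out.println("guessed symbol: " + data.classAttribute().value(predicted));
        }

        // Use the element's count as the instance weight so frequent elements dominate.
        private static void addElement(Instances data, String element, double count) {
            Instance inst = new DenseInstance(6);
            inst.setDataset(data);
            for (int i = 0; i < 6; i++) {
                inst.setValue(data.attribute(i), String.valueOf(element.charAt(i)));
            }
            inst.setWeight(count);
            data.add(inst);
        }
    }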

+3

To uniquely characterize a new sequence, two pieces of information are needed: the string of five known characters and the position of the unknown character. If your alphabet has 10 characters, there can be no more than 10^5 = 100,000 unique five-character strings.

Depending on your memory resources, this may be small enough to fit into a hash table whose entries provide a lookup structure for finding the best (position, symbol) combination. For example:

 "BCBCF" --> { 0: <heap of symbols partially ordered by frequency>, ... }

This should give a fairly efficient lookup for a new sequence: concatenate the known characters, look the resulting string up in the hash table, take the position of the unknown character, and return the symbol at the top of the corresponding heap.

If you can guarantee that the lookup structure is stable (no new data arrives) before any search is performed, you can squeeze out more efficiency by replacing each indexed heap with just the single character that was at the top of it. (A heap structure is only needed during the search phase if new information can arrive and change the symbol frequencies.)
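A possible Java sketch of this layout, using a plain per-position count map in place of the heaps (equivalent once the table is static, as noted above); all names are illustrative:

    import java.util.*;

    public class KnownFiveLookup {

        // key: the 5 known characters concatenated; value: for each gap position,
        // symbol -> total count, so the most frequent symbol can be picked.
        private final Map<String, Map<Integer, Map<Character, Long>>> index = new HashMap<>();

        /** Add one 6-character element with its count, expanding all 6 possible gap positions. */
        public void add(String element, long count) {
            for (int gap = 0; gap < element.length(); gap++) {
                String known = element.substring(0, gap) + element.substring(gap + 1);
                char symbol = element.charAt(gap);
                index.computeIfAbsent(known, k -> new HashMap<>())
                     .computeIfAbsent(gap, g -> new HashMap<>())
                     .merge(symbol, count, Long::sum);
            }
        }

        /** Guess the symbol for a pattern such as "BCB?CF" with exactly one '?'. */
        public Character guess(String pattern) {
            int gap = pattern.indexOf('?');
            String known = pattern.substring(0, gap) + pattern.substring(gap + 1);
            Map<Character, Long> counts = index
                    .getOrDefault(known, Collections.emptyMap())
                    .getOrDefault(gap, Collections.emptyMap());
            return counts.entrySet().stream()
                         .max(Map.Entry.comparingByValue())
                         .map(Map.Entry::getKey)
                         .orElse(null);
        }

        public static void main(String[] args) {
            KnownFiveLookup table = new KnownFiveLookup();
            table.add("ACBADJ", 55);
            table.add("BCBICF", 923);
            table.add("ABCDEG", 478);
            System.out.println(table.guess("BCB?CF"));       // prints I
        }
    }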

+1

I was among those who "missed the obvious" here.

Just use any fast key/value lookup that is available to you, and look up all of your possible values. That is a small set, so it does not take much time. Anything that stops short of storing your data 6 times over will be slower than this.

If a large vocabulary were possible, then my previous answer below would be appropriate.


Here is my old (and bad) answer.

I would put them in a database with multiple concatenated indexes. How many is up to you.

At a minimum I would have 2: an index on (col1, col2, col3, col4, col5, col6) and one on (col4, col5, col6, col1, col2, col3). This means that no matter which column is missing, there is a way to get at your data while scanning at most 1/1000 of the records. If you want, you can instead index (col1, col2, col3, col4, col5, col6), (col3, col4, col5, col6, col1, col2) and (col5, col6, col1, col2, col3, col4) to limit the search to 1/10000 of the records. That uses about 50% more index space but is 10 times faster. (Warning: I cannot guarantee that MySQL will correctly figure out which index to use. I would hope that other databases get it right, but I have not checked.)

If you do not want to use a database, you can use balanced binary trees in the same way as the indexes suggested above. For any given search, select the tree that puts the missing element as deep as possible, do a range search, and filter the returned data down to the rows of interest. This is, in fact, what a good database should be doing internally with those indexes.
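If you skip the database, the same idea can be sketched with two in-memory sorted maps (Java's TreeMap is a red-black tree); the rotation by 3 positions below mirrors the two indexes suggested above, and the class is only an illustration:

    import java.util.*;

    public class RotatedIndexSketch {

        // Two sorted maps (balanced trees): one keyed by the original string, one keyed
        // by the string rotated by 3 positions, mirroring the indexes
        // (col1..col6) and (col4, col5, col6, col1, col2, col3).
        private final TreeMap<String, Long> byOriginal = new TreeMap<>();
        private final TreeMap<String, Long> byRotated = new TreeMap<>();

        private static String rotate(String s, int k) {
            return s.substring(k) + s.substring(0, k);
        }

        public void add(String element, long count) {
            byOriginal.put(element, count);
            byRotated.put(rotate(element, 3), count);
        }

        /** Find the best symbol for a pattern with one '?' using a prefix range scan. */
        public Character guess(String pattern) {
            int gap = pattern.indexOf('?');
            // Pick the tree in which the wildcard sits as deep (late) as possible.
            TreeMap<String, Long> tree = gap >= 3 ? byOriginal : byRotated;
            String p = gap >= 3 ? pattern : rotate(pattern, 3);
            int g = p.indexOf('?');
            String prefix = p.substring(0, g);

            char best = 0;
            long bestCount = -1;
            // Range scan over all keys starting with the known prefix, then filter on the suffix.
            for (Map.Entry<String, Long> e :
                    tree.subMap(prefix, prefix + Character.MAX_VALUE).entrySet()) {
                String key = e.getKey();
                if (key.regionMatches(g + 1, p, g + 1, p.length() - g - 1) && e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    best = key.charAt(g);
                }
            }
            return bestCount < 0 ? null : best;
        }

        public static void main(String[] args) {
            RotatedIndexSketch idx = new RotatedIndexSketch();
            idx.add("ACBADJ", 55);
            idx.add("BCBICF", 923);
            idx.add("ABCDEG", 478);
            System.out.println(idx.guess("BCB?CF"));   // prints I
            System.out.println(idx.guess("?CBADJ"));   // prints A
        }
    }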

0

A db would be a simple solution, but another option is a tree in which each node branches on one character and each leaf contains an array of possible results and their counts. Matching one string then takes only 5 steps down the tree. Building the tree takes N * C time, where N is the number of elements and C is the number of characters in each element. A wildcard is just a node in the tree that consumes one character of the input without narrowing down the possible results.

0
