How to find duplicate subsequences of numbers in a large string of numbers?

Question

Can someone help me solve my problem?

The problem is this:

Assumption 1: We have an unknown number of substrings (s1, s2, s3, ...), each of which is a sequence of 100 randomly selected numbers (integers from 20,000,000 to 80,000,000). We know neither the numbers that make up these substrings nor how many substrings there are. What matters here is the order of the numbers within a substring, rather than any relationship between them.

Assumption 2: We have a very long string containing millions of numbers. This long string is built by repeating the substrings described in Assumption 1. We call this string "S".

To simplify, consider the example below, where each substring contains four numbers instead of 100, and each number is from 20 to 80 instead of from 20,000,000 to 80,000,000. Given the string "S", our algorithm should find the substrings s1, s2 and s3.

S= 71,59,32,51,45,22,53,25,66,72,71,26,32,28,45,72,59,51,53,66,59,51,53,66,59,51,53,66,22,59,51,25,72,32,26,53,28,66,45,72,71,32,45,72,71,32,45,72, ... .

The result of this algorithm is as follows:

 S1= 59,51,53,66
 S2= 22,25,26,28
 S3= 71,32,45,72

NOTE. If we are lucky, substrings can come in the string "without" and not be repeated one after another.

I want the algorithm to find how many substrings (s1, s2, s3, ...) there are, and also to find the substrings themselves (s1, s2, s3, ...) that make up the string "S".

Thank you very much.

+5

java algorithm sequence

user3588552 Apr 28 '16 at 4:22


2 answers

Vishalz · Answer 1 · 2016-04-28T14:01:23+0000

Hope this works:

    import java.util.*;

    public class ComputeSubSequence {
        public static void main(String[] args) {
            String rootString = "59,22,51,25,53,66,26,28,59,51,22,53,25,66,71,26,32,28,45,59,72,51,71,53,66,32,45,72,22,25,26,59,51,28,71,53,32,66,45,72";
            int sizeOfSubString = 4;
            List<String> rootList = new ArrayList<String>(Arrays.asList(rootString.split("\\s*,\\s*")));
            Set<String> setValue = new LinkedHashSet<String>();
            Set<Integer> setValueNew = new LinkedHashSet<Integer>();
            Map<Integer, String> map = new LinkedHashMap<Integer, String>();

            // Record how often each number occurs and collect the distinct frequencies.
            for (String string : rootList) {
                String frequency = String.valueOf(Collections.frequency(rootList, string));
                map.put(Integer.valueOf(string), frequency);
                setValue.add(frequency);
            }

            // Group the numbers by frequency, keeping first-seen order within each group.
            // equals (not contains) so that a frequency of "3" does not also match "13".
            for (String string : setValue) {
                for (Map.Entry<Integer, String> entry : map.entrySet()) {
                    if (entry.getValue().equals(string)) {
                        setValueNew.add(entry.getKey());
                    }
                }
            }

            // Print the grouped numbers in chunks of sizeOfSubString.
            List<Integer> listOfNames = new ArrayList<Integer>(setValueNew);
            int count = 1;
            for (int i = sizeOfSubString; i <= listOfNames.size(); i += sizeOfSubString) {
                System.out.println("S" + count + "=" + listOfNames.subList(i - sizeOfSubString, i).toString().replace("]", "").replace("[", ""));
                count++;
            }
        }
    }

Cort ammon · Answer 2 · 2016-04-30T05:38:21+0000

Look at the Knuth-Morris-Pratt or Boyer-Moore algorithms. It is hard to say exactly what you are asking for without more details, but these are known to be very fast search algorithms. For Boyer-Moore:

As a rule, the algorithm becomes faster as the pattern being searched for becomes longer.
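The remark about longer patterns can be made concrete with a minimal sketch. It uses Boyer-Moore-Horspool, a simplified variant of Boyer-Moore swapped in here for brevity, and applies it to a sequence of integers rather than characters. The class name `HorspoolIntSearch`, the `comparisons` counter, and the sample data (the first 20 numbers of S from the question) are assumptions made for illustration; a hash map holds the shift table because the question's values range up to 80,000,000.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not taken from the answer: Boyer-Moore-Horspool on int arrays.
final class HorspoolIntSearch {

    // Number of element comparisons made by the most recent indexOf call.
    static long comparisons;

    // Returns the index of the first contiguous occurrence of pattern in text, or -1.
    static int indexOf(int[] text, int[] pattern) {
        comparisons = 0;
        int n = text.length;
        int m = pattern.length;
        if (m == 0) {
            return 0;
        }

        // For every value in pattern[0..m-2], remember how far its last
        // occurrence is from the end of the pattern. Values that do not
        // occur there allow a shift by the full pattern length m.
        Map<Integer, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern[i], m - 1 - i);
        }

        int pos = 0;
        while (pos + m <= n) {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0) {
                comparisons++;
                if (text[pos + j] != pattern[j]) {
                    break;
                }
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Shift according to the value under the window's last position.
            pos += shift.getOrDefault(text[pos + m - 1], m);
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] s = {71, 59, 32, 51, 45, 22, 53, 25, 66, 72,
                   71, 26, 32, 28, 45, 72, 59, 51, 53, 66};
        int[] s1 = {59, 51, 53, 66};
        System.out.println(indexOf(s, s1) + " " + comparisons); // prints "16 9"
    }
}
```

In this run the match at index 16 is found after 9 comparisons against a 20-number text, because a mismatch on a value that is absent from the pattern lets the window jump ahead by the full pattern length. With 100-number patterns, as in the question, a single jump can cover up to 100 positions.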

I know that Stack Exchange usually prefers answers that contain the content itself rather than links, but these algorithms are complex enough that links serve them better. The key to their efficiency is the recognition that any failed match provides additional information about other candidate matches, which must also fail. This allows them to work in linear and, for Boyer-Moore, often sublinear time: the search needs at most O(n) comparisons with Knuth-Morris-Pratt, and Boyer-Moore can frequently finish without comparing every character in the string. They do this by recognizing that when a match fails, more information is available than just "that one match failed": the failure also says a lot about which nearby matches can or cannot occur. This allows them to skip testing characters that they can prove can never be part of a match.
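The idea that a failed match carries information about nearby matches can be sketched with Knuth-Morris-Pratt over an integer array. This is a sketch under assumptions, not the answerer's code: the class `KmpIntSearch`, its method names, and the sample slice of S are hypothetical, and it only finds contiguous repetitions of an already known candidate such as s1, not interleaved ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the answer: Knuth-Morris-Pratt on int arrays.
final class KmpIntSearch {

    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] buildFailureTable(int[] pattern) {
        int[] fail = new int[pattern.length];
        int k = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) {
                k = fail[k - 1];
            }
            if (pattern[i] == pattern[k]) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Returns the start index of every (possibly overlapping) occurrence.
    static List<Integer> findAll(int[] text, int[] pattern) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.length == 0) {
            return hits;
        }
        int[] fail = buildFailureTable(pattern);
        int k = 0; // number of pattern elements matched so far
        for (int i = 0; i < text.length; i++) {
            // On a mismatch, fall back using the table instead of
            // re-reading earlier text elements.
            while (k > 0 && text[i] != pattern[k]) {
                k = fail[k - 1];
            }
            if (text[i] == pattern[k]) {
                k++;
            }
            if (k == pattern.length) {
                hits.add(i - k + 1);
                k = fail[k - 1];
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A contiguous slice of S from the question.
        int[] s = {45, 72, 59, 51, 53, 66, 59, 51, 53, 66, 59, 51, 53, 66, 22};
        int[] s1 = {59, 51, 53, 66};
        System.out.println(findAll(s, s1)); // prints [2, 6, 10]
    }
}
```

Each text element is read once and the match position in the pattern only falls back through the precomputed table, which is where the linear bound comes from.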
