Most efficient way to find substring C of string B in string A in LINQ

Question

Most efficient way to find substring C of string B in string A in LINQ

Having 2 lines of type:

string a = "ATTAGACCTGCCGGAA"; string b = "GCCGGAATAC";

I would just like to remove the part that is common on both lines and then concatenate it. I have to say that I need to remove only the left matching part so that I get

input

  ATTAGACCTGCCGGAA GCCGGAATAC

Exit

 ATTAGACCTGCCGGAATAC

At first I thought of using a pattern, and then seacrh for it, however this is not possible, since I do not know the pattern in advance (the length of the matching characters is variable)

Then I thought about finding the whole line b in a , and if there were no succes, remove the char in line a (the last one, since I want to keep the left-most unsurpassed line), and then loop until I have no more characters in b how

 string a = "ATTAGACCTGCCGGAA"; string b = "GCCGGAATAC"; int times = b.Length; string wantedString = string.Empty; string auxString = b; while (times > 0) { if (!a.Contains(auxString)) { //save last char and then delete it from auxString wantedString += auxString[auxString.Length - 1]; auxString = auxString.TrimEnd(auxString[auxString.Length - 1]); } else break; times--; } //reverse string char[] reversedToAppend = wantedString.ToCharArray(); Array.Reverse(reversedToAppend); string toAppend = new string(reversedToAppend);

so the answer will simply be a + toAppend ;
Is there a way to make this more efficient? (maybe in LINQ?)

Edit

As @lavin c correctly points out, it can happen anywhere in a , being the prefix b. for example, if a=AAT and b=AAG , the code should return AATG . the reason is that the common line starting on the left is c=AA . We will remove this from b and then get a=AAT with the result G

 AAT AAG

as a result

 AATG

Another example might be:

 a=ATTTGGGCCGCGCGCGAAAACCCCGCG b= AACCCCGCGCGCA

here

 c= AACCCCGCG

therefore the result should be

 result = ATTTGGGCCGCGCGCGAAAACCCCGCGCGCA

+8

string-matching c # algorithm

cMinor Jul 16 '15 at 21:12

source share

5 answers

lavin · Answer 1 · 2015-07-17T04:41:18+0000

(all arrays and strings are based on 0 in this answer)

First, I want to indicate that the problem with the OP is confused. Assuming c is a common part of a and b , the OP input and output example assumes that c must be a suffix a and prefix b at the same time. I see that some of the answers above have accepted this understanding of the problem.

However, the original implementation provided by OP suggests that c can occur anywhere in a , being the prefix of b , because you are using a.Contains(auxString) . This means that for a=AAT and b=AAG your code will return AATG . However, other people's answers will return AATAAG .

So, there are two possible interpretations of your problem. Please clarify.

Secondly, assuming that the size of the first row a is N and the second row b is M , unlike the O(N*M) solution provided in the original solution and the existing answers, O(N+M) can be achieved using any of the following: KMP, Suffix Array, Suffix Tree, Z-algorithm.

I will briefly describe how to use the Z-algorithm to solve this problem here, since it seems to be much less mentioned in stackoverflow compared to others.

For details of the Z-algorithm, see http://www.cs.umd.edu/class/fall2011/cmsc858s/Lec02-zalg.pdf

Basically, for a string S length L it computes an array Z length L , in which Z[i] is equal to the longest common prefix S and S[i:] ( S[i:] means the substring S , starting at position i ).

For this task, we combine the rows a and b in d=b+a ( b before a ) and computes the array Z joined row d . Using this Z array, we can easily determine the longest prefix b , which also occurs in a .

For a possible interpretation of one of the problems in which c must be the suffix a and the prefix b :

 max_prefix = 0 for i in range(M, N+M): if Z[i] == N+M - i: if Z[i] > max_prefix: max_prefix = Z[i]

and the answer will be as follows:

 a+b[max_prefix:]

For a possible interpretation of two problems in which c must be the prefix b and can be anywhere in a :

 max_prefix = 0 for i in range(M, N+M): if Z[i] > max_prefix: max_prefix = Z[i]

again the answer will be:

 a+b[max_prefix:]

The difference in these two cases is as follows:

  if Z[i] == N+Mi:

To understand this line, remember that Z[i] is the longest common prefix for the lines d and d[i:] , then:

Note that d=b+a
We list i from M to M+N-1 , which is the range of a to d . So d[i:] is equal to a[iM:] . And the length a[iM:] is N-(iM)=N+Mi
Since d starts with b , checking if Z[i] N+Mi is equal, whether a[iM:] checks the prefix b . If they are really equal, then we found the common string c , which is the prefix b , as well as the suffix a .
Without this line, we only know that we found line c , which is the prefix b , and occurs in a , starting at position i , and the end of a not guaranteed to be reached.

Enigmativity · Answer 2 · 2015-07-17T06:24:16+0000

This works to find the first point at which b overlaps the tail of a :

 string a = "ATTAGACCTGCCGGAA"; string b = "GCCGGAATAC"; var index = ( from n in Enumerable.Range(0, a.Length) where a.Skip(n).SequenceEqual(b.Take(a.Length - n)) select n ) .DefaultIfEmpty(-1) .First();

In this example, it returns 9 .

Final result:

 var output = a + b.Substring(a.Length - index);

What is rated as:

 ATTAGACCTGCCGGAATAC

All this suggests that overlap occurs at the end of a and at the beginning of b .

Orece · Answer 3 · 2015-07-16T23:01:49+0000

Linq really does not help you.

If n and m are the length of the left and right messages, it looks like you will have a solution O (nm) ...

A fist compresses your messages.

Since there are only 4 possible letters, you can encode them into 2 bits.

What is it, 4 letters bytes. (instead of 2 bytes by letter).

In one 32-bit comparison, you check 16 letters instead of 2.

Then (enter the mystical drunken late thinking), do two parallel and incremental FFTs, reading the data from the ends you want to combine (from the end for the left message and the beginning for the correct one), when the FFT matches, you are likely. Check it out.

The actual implementation of this will be more likely:

Read the data from the ends you want to combine (from the end for the left message and the beginning for the correct one), and while you read the “letters” of the two messages:
- Make a sum of the data. L[n-1]+L[n-2]+L[n-3]+L[n-4]+.. and R[0]+R[1]+R[2]+R[3]+..
- Build an alternative amount. L[n-1]-L[n-2]+L[n-3]-L[n-4]+.. and R[0]-R[1]+R[2]-R[3]+..
- Build a 2-alternative amount. L[n-1]+L[n-2]-L[n-3]-L[n-4]+.. and R[0]+R[1]-R[2]-R[3]+..
- and a few more (4,8,16 alternative amounts).

When you have a match. Check this.

If real DNA gives a lot of false positive matches, write an article about it.

[EDIT]

The amount will match. OK. But the alternative amount will correspond only in absolute value.

If messages ... 4 5 6 and 5 6 7 ...

The sum of the first two values will be 5 + 6 = 11 in both cases.

But the alternative amount will be -5 + 6 = 1 and 5 - 6 = -1.

For the 2.4 ... alternative amount, you will have a problem ...

You need another operation when the order does not matter. Like multiplication and xor.

toplel32 · Answer 4 · 2015-07-16T21:38:29+0000

Not sure if I understand this question. I guess the following: Take 2 lines, A and B , if there is a match C , then D = A + (B - C) .

 class Program { static void Main(string[] args) { Test test = new Test(); string a = "ATTAGACCTGCCGGAA"; string b = "GCCGGAATAC"; string match = test.Match(a, b); // GCCGGAA if (match != null) { string c = a + b.Remove(b.IndexOf(match), match.Length); // ATTAGACCTGCCGGAATAC Console.WriteLine(c); } } } class Test { public string Match(string a, string b) { if (a == null) { throw new ArgumentNullException("a"); } if (b == null) { throw new ArgumentNullException("b"); } string best = null; for (int i = 0; i < b.Length; i++) { string match = Match(a, b, i); if (match != null && (best == null || match.Length > best.Length)) { best = match; } } return best; } private string Match(string a, string b, int offset) { string best = null; for (int i = offset; i < b.Length; i++) { string s = b.Substring(offset, (i - offset) + 1); int index = a.IndexOf(s); if (index != -1) { best = s; } } return best; } }

If you want a better version, modify Test to return the index.

James curran · Answer 5 · 2015-07-16T22:02:13+0000

Here is mine. I think this is the most concise, and I see no way to make it more efficient.

 string Merge(string a, string b) { // Assumes A is always longer. for(int i =b.Length; i >0; --i) { if (a.EndsWith(b.Substring(0,i))) return a + b.Substring(i); } return null; }

Most efficient way to find substring C of string B in string A in LINQ

input

Exit

Edit

More articles: