Faster algorithm for finding a unique element between two arrays?

EDIT: For anyone new to this question, I have posted an answer explaining what was going on. The accepted answer is the one I consider the best answer to my question as originally posed, but for more information, see my answer.

NOTE: This problem was originally given in pseudocode and used lists. I have adapted it to Java and arrays. So while I would love to see any solutions that use Java-specific tricks (or tricks in any language!), just remember that the original problem is language-independent.

Problem

Let's say there are two unsorted integer arrays a and b, with possible repetition of elements. They are identical (with respect to the elements they contain) except that one of the arrays has one additional element. As an example:

 int[] a = {6, 5, 6, 3, 4, 2};
 int[] b = {5, 7, 6, 6, 2, 3, 4};

Devise an algorithm that takes these two arrays as input and outputs the single unique integer (in the above case, 7).

Solution (so far)

I came up with this:

 public static int getUniqueElement(int[] a, int[] b) {
     int ret = 0;
     for (int i = 0; i < a.length; i++) {
         ret ^= a[i];
     }
     for (int i = 0; i < b.length; i++) {
         ret ^= b[i];
     }
     return ret;
 }

The "official" solution presented in class:

 public static int getUniqueElement(int[] a, int[] b) {
     int ret = 0;
     for (int i = 0; i < a.length; i++) {
         ret += a[i];
     }
     for (int i = 0; i < b.length; i++) {
         ret -= b[i];
     }
     return Math.abs(ret);
 }

So both conceptually do the same kind of thing. And given that a has length m and b has length n, both solutions have O(m + n) running time.

Question

Later I was talking with my teacher, and he hinted that there was an even faster way to do it. Honestly, I don't see how; to find out which element is unique, it seems you would have to look at every element at least once. That's at least O(m + n)... right?

So is there a faster way? And if so, what is it?

+58
java arrays algorithm
Oct 05 '13 at 23:44
9 answers

This is probably the fastest it can be done in Java, using HotLick's suggestion in the comments. It makes the assumption that b.length == a.length + 1, so that b is the larger array with the extra "unique" element.

 public static int getUniqueElement(int[] a, int[] b) {
     int ret = 0;
     int i;
     for (i = 0; i < a.length; i++) {
         ret = ret ^ a[i] ^ b[i];
     }
     return ret ^ b[i];
 }

Even if that assumption cannot be made, you can easily expand this to cover the case where either a or b may be the larger array containing the unique element. It is still O(m + n), though; only the loop/assignment overhead is reduced.
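For completeness, here is a sketch (in Python, with names of my own choosing) of that generalized version, which does not assume which array is the larger one: XOR the paired prefixes, then fold in the longer array's tail.

```python
def get_unique_element(a, b):
    # Ensure a is the shorter (or equal-length) array; the extra
    # element, if any, then lives in b's tail.
    if len(a) > len(b):
        a, b = b, a
    ret = 0
    for x, y in zip(a, b):   # paired part, same as the Java loop
        ret = ret ^ x ^ y
    for y in b[len(a):]:     # leftover tail of the longer array
        ret ^= y
    return ret
```

Either calling order works: the swap at the top normalizes the arguments before the loops run.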

Edit:

Due to the details of the language implementation, this is (surprisingly) the fastest way to do this in CPython.

 def getUniqueElement1(A, B):
     ret = 0
     for a in A:
         ret = ret ^ a
     for b in B:
         ret = ret ^ b
     return ret

I tested this with the timeit module and found some interesting results. It turns out that the longhand ret = ret ^ a is actually faster in Python than the shorthand ret ^= a. Also, iterating over the elements of a list is much faster than iterating over indexes and doing subscript operations. That is why this code is much faster than my previous attempt, where I tried to copy the Java.
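A minimal timeit sketch of the kind of comparison described above (the helper names are mine, not the answerer's exact benchmark; absolute timings vary by machine and interpreter):

```python
import timeit

def xor_longhand(A):
    ret = 0
    for a in A:
        ret = ret ^ a   # longhand form
    return ret

def xor_shorthand(A):
    ret = 0
    for a in A:
        ret ^= a        # shorthand (augmented assignment) form
    return ret

data = list(range(10000))
t_long = timeit.timeit(lambda: xor_longhand(data), number=100)
t_short = timeit.timeit(lambda: xor_shorthand(data), number=100)
print(t_long, t_short)  # both compute the same XOR; compare timings yourself
```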

I suppose the moral of the story is that there is no single right answer, because the question is bogus anyway. As the OP shows in another answer below, it turns out you can't go faster than O(m + n), and his teacher was just pulling his leg. So the problem reduces to finding the fastest way to iterate over all the elements in the two arrays and accumulate the XOR of them all. And that means it depends entirely on the language implementation: you have to do some testing and tinkering to find the true "fastest" solution in whatever implementation you use, because the overall algorithm isn't going to change.

+28
Oct 06 '13 at 2:56

Well, here we go... apologies to anyone expecting a faster solution. It turns out my teacher was having a little fun with me, and I completely missed the point of what he was saying.

I should start by explaining what I meant when I wrote:

he hinted that there is an even faster way to do this

The gist of our conversation was this: he said my XOR approach was interesting, and we talked for a while about how I had arrived at my solution. He asked me whether I thought my solution was optimal. I said I did (for the reasons mentioned in my question). Then he asked me, "Are you sure?" with a look on his face I can only describe as "smug". I hesitated, but said yes. He asked me if I could come up with a better way to do it. I was like, "You mean a faster way?", but instead of giving me a straight answer he told me to think about it. I said I would.

So I thought about it, certain that my teacher knew something I didn't. And after coming up with nothing for a day, I came here.

What my teacher actually wanted was for me to defend my solution as optimal, not to try to find a better one. As he put it: making a nice algorithm is the easy part; the hard part is proving that it works (and that it's the best). He thought it was pretty funny that I spent so much time in Find-A-Better-Way Land instead of working out a simple O(n) proof, which would have taken considerably less time (we ended up doing exactly that; see below if you're interested).

So, I guess I learned a big lesson here. I will accept Shashank Gupta's answer, because I think it actually does answer the original question, even though the question was flawed.

I'll leave you guys with a neat little Python one-liner I found while typing up the proof. It's not any more efficient, but I like it:

 def getUniqueElement(a, b):
     return reduce(lambda x, y: x ^ y, a + b)  # Python 3: from functools import reduce

Very informal "proof"

Let's start with the original two arrays from the question, a and b:

 int[] a = {6, 5, 6, 3, 4, 2};
 int[] b = {5, 7, 6, 6, 2, 3, 4};

We'll say the shorter array has length n; the longer array then has length n + 1. The first step toward proving linear complexity is to concatenate the arrays into a third array (call it c):

 int[] c = {6, 5, 6, 3, 4, 2, 5, 7, 6, 6, 2, 3, 4}; 

whose length is 2n + 1. Why do this? Well, now we have an entirely new problem: finding the element that occurs an odd number of times in c (it follows from the setup that "occurs an odd number of times" and "unique" mean the same thing here). This is actually a fairly popular interview question, and apparently my teacher got the idea for his problem from it, so now my question has some practical relevance. Hooray!
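The reduced problem can be sketched directly: concatenate and XOR everything, since values occurring an even number of times cancel out and only the odd-count element survives.

```python
def odd_one_out(a, b):
    c = a + b        # concatenated array of length 2n + 1
    ret = 0
    for x in c:
        ret ^= x     # even-count values cancel in pairs
    return ret
```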

Suppose there is an algorithm faster than O(n), such as O(log n). That means it only gets to access some of the elements of c. For example, an O(log n) algorithm might only have to check log(13) ~ 4 elements in our example array to determine the unique element. Our question is: is that possible?

First, let's see if we can get away with removing any of the elements (by "removing", I mean not having to access them). How about if we remove 2 elements, so that our algorithm only checks a subarray of c with length 2n - 1? That is still linear complexity, but if we can do it, maybe we can improve upon it even further.

So, we choose two elements of c completely at random to remove. There are actually a few things that can happen here, which I'll summarize into the following cases:

 // Case 1: Remove two identical elements
 {6, 5, 6, 3, 4, 2, 5, 7, 2, 3, 4};

 // Case 2: Remove the unique element and one other element
 {6, 6, 3, 4, 2, 5, 6, 6, 2, 3, 4};

 // Case 3: Remove two different elements, neither of which is unique
 {6, 5, 6, 4, 2, 5, 7, 6, 6, 3, 4};

What does our subarray look like in each case? In the first case, 7 is still the unique element. In the second case there is a new unique element, 5. And in the third case there are 3 unique elements... yeah, it's a total mess in there.

Now the question becomes: can we determine the unique element of c just by looking at this subarray? In the first case we see that 7 is the unique element of the subarray, but we can't be sure it is also the unique element of c; the two removed elements could just as well have been 7 and 1. A similar argument applies to the second case. And in case 3, with 3 unique elements, we have no way of telling which two are not unique in c.
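To make Case 1 concrete, here is a small sketch: two different valid full arrays that agree on the same 2n - 1 subarray yet have different unique elements, so no algorithm reading only that subarray can tell them apart.

```python
def unique_of(c):
    ret = 0
    for x in c:
        ret ^= x
    return ret

sub = [6, 5, 6, 3, 4, 2, 5, 7, 2, 3, 4]  # the Case 1 subarray
c1 = sub + [6, 6]  # the removed pair was {6, 6}: unique element is 7
c2 = sub + [7, 1]  # the removed pair was {7, 1}: unique element is 1
print(unique_of(c1), unique_of(c2))  # -> 7 1
```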

It becomes clear that even with 2n - 1 accesses, there is simply not enough information to solve the problem. And so the optimal solution is a linear one.

Of course, a real proof would use induction rather than proof-by-example, but I'll leave that to someone else :)

+14
Oct 08 '13 at 1:26

You could store a count of each value in a collection such as an array or hash map, in O(n). Then you check the values of the other collection and stop as soon as you hit a value that has no remaining match. On average this means you only scan about half of the second array.
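A sketch of that idea (the helper name is mine), using Python's Counter as the hash map: count the shorter array's elements in O(n), then scan the longer array and stop at the first value whose count is exhausted.

```python
from collections import Counter

def unique_by_counting(a, b):
    # Ensure b is the longer array, which contains the extra element.
    if len(a) > len(b):
        a, b = b, a
    counts = Counter(a)        # O(n) preprocessing pass
    for x in b:                # stops early, on average about halfway
        if counts[x] == 0:
            return x
        counts[x] -= 1
    raise ValueError("arrays contain the same elements")
```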

+7
Oct 05 '13 at

This is a bit faster:

 public static int getUniqueElement(int[] a, int[] b) {
     int ret = 0;
     int i;
     for (i = 0; i < a.length; i++) {
         ret += (a[i] - b[i]);
     }
     return Math.abs(ret - b[i]);
 }

This is O(m), but the order doesn't tell the whole story. The loop part of the "official" solution executes about 3*m + 3*n operations, while the slightly faster solution above executes 4*m.

(counting the loop's "i++" and "i < a.length" as one operation each).

Al.

+3
Oct. 06 '13 at 0:50

Assuming only one element was added and the arrays were otherwise identical to start with, you can get this down to O(log n) (log base 2).

The rationale is that any sorted array can be binary searched in O(log n). Except in this case you are not searching for a value in an ordered array; you are searching for the first element that doesn't match. Here, a[n] == b[n] means you are too low, and a[n] != b[n] means you may be too high, unless a[n-1] == b[n-1].

The rest is basic binary search: check the middle element, determine which half must contain the answer, and search that half.
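A sketch of that binary search (names are mine), under this answer's assumption that b was formed by inserting one element into a with the relative order otherwise preserved:

```python
def find_inserted(a, b):
    # Assumes b was formed by inserting one element into a,
    # with the original order otherwise preserved.
    lo, hi = 0, len(a)        # answer index lies in [0, len(a)]
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] == b[mid]:
            lo = mid + 1      # first mismatch must be to the right
        else:
            hi = mid          # first mismatch is at or before mid
    return b[lo]              # b[lo] is the inserted element
```

Note that this only works under the order-preservation assumption, which the question's "unsorted" premise does not actually guarantee.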

+1
Oct. 06 '13 at 4:41

Let's say that there are two unsorted integer arrays a and b, with the possibility of repeating elements. They are identical (with respect to the contained elements) except one of the arrays has an additional element.

You may notice that I emphasized two points in the original question, and I am adding the extra assumption that the values are non-zero.

In C # you can do this:

 int[, , , , ,] a = new int[6, 5, 6, 3, 4, 2];
 int[, , , , , ,] b = new int[5, 7, 6, 6, 2, 3, 4];
 Console.WriteLine(b.Length / a.Length);

See? Regardless of where the additional element is, you can always find it by simply dividing the lengths.

With these declarations, we are not storing the series of integers as values in the arrays, but as their dimensions.

In C#, the Length of a multi-dimensional array is the product of all its dimensions. The two series of integers are identical except for one extra value, so the two products differ by exactly that factor, regardless of the order of the integers. Dividing the longer array's Length by the shorter's therefore yields the extra integer.

This solution only works for this particular case, as quoted from your question. You might want to port it to Java.

It is just a trick, as I figured the question itself was a trick. We would definitely not consider it a solution for production.
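The same length trick can be sketched as a plain product (the helper name is mine), since a C# multi-dimensional array's Length is the product of its dimensions. As above, this assumes all values are non-zero; Python's arbitrary-precision ints also sidestep the overflow a real C# int would face.

```python
def unique_by_product(a, b):
    # The multisets are identical except for one extra non-zero value,
    # so the two products differ by exactly that factor.
    prod_a = 1
    for x in a:
        prod_a *= x
    prod_b = 1
    for x in b:
        prod_b *= x
    if len(a) > len(b):
        prod_a, prod_b = prod_b, prod_a   # make prod_b the larger product
    return prod_b // prod_a               # exact division by construction
```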

+1
Oct. 06 '13 at 8:37

Beware: it is incorrect to use the notation O(n + m). There is only one size parameter, which is n (asymptotically, n and n + 1 are the same). You should just say O(n). [For m > n + 1, the problem is different and more complicated.]

As pointed out by others, this is optimal, since you must read all the values.

All you can do is reduce the asymptotic constant. There is little room for improvement, since the obvious solutions are already very efficient. The single loop in (10) is probably hard to beat; at best it could be improved (slightly) by avoiding the branch.

If your goal is pure performance, you should turn to non-portable solutions such as vectorization (using AVX instructions, 8 ints at a time) and parallelization across multiple cores or a GPGPU. In good old dirty C on a 64-bit processor, you could map the data onto an array of 64-bit ints and XOR the elements two pairs at a time ;)

+1
Oct 09 '13 at 7:14

I think this looks like the Nuts and Bolts matching problem.

You might achieve O(n log n) that way. I am not sure whether that is less than O(n + m) in this case.

0
Oct 06 '13 at 4:49

There is no faster algorithm. The ones presented in the question are O(n). Any arithmetic "trick" to solve this requires reading each element of both arrays at least once, so we stay at O(n) (or worse).

Any search strategy with a complexity in a proper subset of O(n) (such as O(log n)) would require sorted arrays or some other pre-built sorted structure (binary tree, hash index). All general sorting algorithms known to mankind are at least O(n log n) on average (Quicksort, Heapsort), which is worse than O(n).

Therefore, from a mathematical point of view, no faster algorithm exists. There may be some code optimizations, but they won't matter much, since the running time will grow linearly with the length of the arrays.

0
Oct 06 '13 at


