C++ How to merge sorted vectors into a sorted vector / pop the least element from all of them?

I have a collection of about a hundred sorted vector<int>s. Most of the vectors contain only a small number of integers, but some contain a large number (> 10K) of them (so the vectors do not necessarily have the same size).

What I would like to do, essentially, is to repeatedly pop the smallest integer contained in any of these sorted vectors, i.e. iterate over all of their contents in sorted order.

One way to do this is to merge all of these sorted vectors into one sorted vector and simply iterate over it. Thus,

Question 1: What is the fastest way to merge sorted vectors into a sorted vector?

On the other hand, I am sure there are faster / smarter ways to do this without merging and re-sorting everything, perhaps by iteratively popping the smallest integer from this collection of sorted vectors without merging them first. Hence:

Question 2: What is the fastest / best way to pop the smallest element from a group of sorted vector<int>s?


Update: Based on the answers below and the comments on the question, I implemented an approach that keeps a priority queue of iterators into the sorted vectors. I'm not sure whether it is performance-efficient, but it does seem very memory-efficient. I consider the question still open, since I'm not sure we have established the fastest way yet.

    #include <iostream>
    #include <queue>
    #include <utility>
    #include <vector>

    using namespace std;

    // compare iterator pairs by the integers they currently point to
    struct cmp_seeds {
        bool operator () (const pair< vector<int>::iterator, vector<int>::iterator > p1,
                          const pair< vector<int>::iterator, vector<int>::iterator > p2) const {
            return *(p1.first) > *(p2.first);
        }
    };

    int pq_heapsort_trial() {
        /* Set up the sorted vectors */
        int a1[] = { 2, 10, 100 };
        int a2[] = { 5, 15, 90, 200 };
        int a3[] = { 12 };
        vector<int> v1(a1, a1 + sizeof(a1) / sizeof(int));
        vector<int> v2(a2, a2 + sizeof(a2) / sizeof(int));
        vector<int> v3(a3, a3 + sizeof(a3) / sizeof(int));

        vector< vector<int>* > sorted_vectors;
        sorted_vectors.push_back(&v1);
        sorted_vectors.push_back(&v2);
        sorted_vectors.push_back(&v3);
        /* the above simulates the "for" I have in my own code that gives me sorted vectors */

        pair< vector<int>::iterator, vector<int>::iterator > c_lead;
        cmp_seeds mycompare;
        priority_queue< pair< vector<int>::iterator, vector<int>::iterator >,
                        vector< pair< vector<int>::iterator, vector<int>::iterator > >,
                        cmp_seeds > cluster_feeder(mycompare);

        for (vector< vector<int>* >::iterator k = sorted_vectors.begin(); k != sorted_vectors.end(); ++k) {
            cluster_feeder.push(make_pair((*k)->begin(), (*k)->end()));
        }

        while (cluster_feeder.empty() != true) {
            c_lead = cluster_feeder.top();
            cluster_feeder.pop();
            // sorted output
            cout << *(c_lead.first) << endl;
            c_lead.first++;
            if (c_lead.first != c_lead.second) {
                cluster_feeder.push(c_lead);
            }
        }
        return 0;
    }
+8
c++ sorting mergesort vector
3 answers

One option is to use a std::priority_queue to maintain a heap of iterators, where the iterators bubble up through the heap depending on the values they point at.

You might also consider using repeated std::inplace_merge. That would involve concatenating all the data into one big vector, remembering the offsets at which each individual sorted block begins and ends, and then passing those into inplace_merge. This would probably be faster than the heap solution, although I think in principle the complexity is equivalent.

Update: I've implemented the second algorithm I just described, doing a repeated merge. The code is on ideone.

This works by first concatenating all the sorted lists together into one long list. With three source lists, this means there are four "offsets": four points in the full list between which the elements are sorted. The algorithm then takes three of these offsets at a time, merging the two corresponding adjacent sorted ranges into a single sorted range, and then remembers two of those three offsets to be used in new_offsets.

This repeats in a loop, with pairs of adjacent sorted ranges being merged together, until only one sorted range remains.

Ultimately, I think an even better algorithm would first merge the shortest pairs of adjacent ranges.

    // http://stackoverflow.com/questions/9013485/c-how-to-merge-sorted-vectors-into-a-sorted-vector-pop-the-least-element-fro/9048857#9048857
    #include <iostream>
    #include <vector>
    #include <algorithm>
    #include <cassert>

    using namespace std;

    template<typename T, size_t N>
    vector<T> array_to_vector( T(*array)[N] ) {
        // Yes, this works. By passing in the *address* of
        // the array, all the type information, including the
        // length of the array, is known at compile time.
        vector<T> v( *array, &((*array)[N]) );
        return v;
    }

    void merge_sort_many_vectors() {
        /* Set up the sorted vectors */
        int a1[] = { 2, 10, 100 };
        int a2[] = { 5, 15, 90, 200 };
        int a3[] = { 12 };
        vector<int> v1 = array_to_vector(&a1);
        vector<int> v2 = array_to_vector(&a2);
        vector<int> v3 = array_to_vector(&a3);

        vector<int>    full_vector;
        vector<size_t> offsets;

        offsets.push_back(0);
        full_vector.insert(full_vector.end(), v1.begin(), v1.end());
        offsets.push_back(full_vector.size());
        full_vector.insert(full_vector.end(), v2.begin(), v2.end());
        offsets.push_back(full_vector.size());
        full_vector.insert(full_vector.end(), v3.begin(), v3.end());
        offsets.push_back(full_vector.size());

        assert(full_vector.size() == v1.size() + v2.size() + v3.size());

        cout << "before:\t";
        for (vector<int>::const_iterator v = full_vector.begin(); v != full_vector.end(); ++v) {
            cout << ", " << *v;
        }
        cout << endl;

        while (offsets.size() > 2) {
            assert(offsets.back() == full_vector.size());
            assert(offsets.front() == 0);
            vector<size_t> new_offsets;
            size_t x = 0;
            while (x + 2 < offsets.size()) {
                // merge (offsets[x],offsets[x+1]) and (offsets[x+1],offsets[x+2])
                inplace_merge( &full_vector.at(offsets.at(x))
                             , &full_vector.at(offsets.at(x+1))
                             , &(full_vector[offsets.at(x+2)]) // this *might* be at the end
                             );
                // now they are sorted, we just put offsets[x] and offsets[x+2] into the new offsets.
                // offsets[x+1] is not relevant any more
                new_offsets.push_back(offsets.at(x));
                new_offsets.push_back(offsets.at(x+2));
                x += 2;
            }
            // if the number of offsets was odd, there might be a dangling offset
            // which we must remember to include in the new_offsets
            if (x + 2 == offsets.size()) {
                new_offsets.push_back(offsets.at(x+1));
            }
            // assert(new_offsets.front() == 0);
            assert(new_offsets.back() == full_vector.size());
            offsets.swap(new_offsets);
        }

        cout << "after: \t";
        for (vector<int>::const_iterator v = full_vector.begin(); v != full_vector.end(); ++v) {
            cout << ", " << *v;
        }
        cout << endl;
    }

    int main() {
        merge_sort_many_vectors();
    }
+4

The first thing that comes to mind is to make a heap containing an iterator into each vector, ordered by the value the iterator currently points to. (Each entry would of course also need to hold the end iterator of its vector.)

The current item is at the root of the heap; to advance, you either pop it (if its vector is exhausted) or increase its key by advancing the iterator (which can be done by popping, incrementing, then pushing again).

I believe this should have asymptotic complexity O(E log M), where E is the total number of elements and M is the number of vectors.

If you're willing to destroy the vectors as you go, you can make a heap of pointers to the vectors instead; you can also treat each vector itself as a heap, to avoid the performance penalty of erasing from the front of a vector. (Or you could copy everything into deques first.)
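A minimal sketch of that destructive variant (plain C++; names like heapify_all and pop_smallest are mine, not from any library): each vector is made into a min-heap with std::make_heap, so removing its smallest element costs O(log n) instead of the O(n) of erasing from the front. For brevity it finds the overall smallest with a linear scan rather than the outer heap of pointers described above.

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <vector>

    // Turn every vector into a min-heap (smallest element at the front).
    void heapify_all(std::vector<std::vector<int> >& vs) {
        for (std::size_t i = 0; i < vs.size(); ++i)
            std::make_heap(vs[i].begin(), vs[i].end(), std::greater<int>());
    }

    // Remove and return the globally smallest element. For brevity this scans
    // the vectors linearly; the answer above would keep an outer heap of
    // pointers to the vectors instead. Assumes at least one vector is non-empty.
    int pop_smallest(std::vector<std::vector<int> >& vs) {
        std::vector<int>* best = 0;
        for (std::size_t i = 0; i < vs.size(); ++i)
            if (!vs[i].empty() && (best == 0 || vs[i].front() < best->front()))
                best = &vs[i];
        int value = best->front();
        std::pop_heap(best->begin(), best->end(), std::greater<int>()); // O(log n) removal
        best->pop_back();
        return value;
    }

    int main() {
        std::vector<std::vector<int> > vs;
        int a1[] = { 2, 10, 100 };     vs.push_back(std::vector<int>(a1, a1 + 3));
        int a2[] = { 5, 15, 90, 200 }; vs.push_back(std::vector<int>(a2, a2 + 4));
        int a3[] = { 12 };             vs.push_back(std::vector<int>(a3, a3 + 1));
        heapify_all(vs);
        std::size_t total = 0;
        for (std::size_t i = 0; i < vs.size(); ++i) total += vs[i].size();
        for (std::size_t n = 0; n < total; ++n)
            std::cout << pop_smallest(vs) << '\n';  // prints 2 5 10 12 15 90 100 200
    }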


Merging them all together, a pair at a time, has the same asymptotic complexity if you are careful about the order. If you arrange all the vectors in a fully balanced binary tree and merge in pairs as you move up the tree, then each element is copied about log M times, which also gives an O(E log M) algorithm.
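As a rough illustration of that balanced pairwise merging (merge_in_rounds is my own name, and this sketch copies into fresh vectors rather than merging in place):

    #include <algorithm>
    #include <iterator>
    #include <vector>

    // Merge adjacent pairs of sorted vectors, round after round, until one
    // vector remains. Each element takes part in about log2(M) merges for M
    // input vectors, giving O(E log M) copies overall.
    std::vector<int> merge_in_rounds(std::vector<std::vector<int> > lists) {
        while (lists.size() > 1) {
            std::vector<std::vector<int> > next;
            for (std::size_t i = 0; i + 1 < lists.size(); i += 2) {
                std::vector<int> merged;
                merged.reserve(lists[i].size() + lists[i + 1].size());
                std::merge(lists[i].begin(), lists[i].end(),
                           lists[i + 1].begin(), lists[i + 1].end(),
                           std::back_inserter(merged));
                next.push_back(merged);
            }
            if (lists.size() % 2 == 1)      // the odd list out waits for the next round
                next.push_back(lists.back());
            lists.swap(next);
        }
        return lists.empty() ? std::vector<int>() : lists.front();
    }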

For extra actual efficiency, instead of the tree you want to repeatedly merge the two smallest vectors until only one is left. (Again, putting pointers to the vectors on a heap is the way to go, but this time ordered by length.)

(In fact, you want to order by a "copy cost" rather than by length; an extra thing worth optimizing for certain value types.)
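A rough sketch of that shortest-first strategy, ordering simply by size as a stand-in for the copy cost (again, the names are mine, and the copies forced by priority_queue::top would be worth avoiding in a real implementation):

    #include <algorithm>
    #include <iterator>
    #include <queue>
    #include <vector>

    // Comparator that puts the shortest vector on top of the priority queue.
    struct longer_than {
        bool operator()(const std::vector<int>& a, const std::vector<int>& b) const {
            return a.size() > b.size();
        }
    };

    // Repeatedly merge the two cheapest (shortest) vectors until one remains.
    std::vector<int> merge_shortest_first(const std::vector<std::vector<int> >& lists) {
        std::priority_queue<std::vector<int>, std::vector<std::vector<int> >, longer_than>
            pq(lists.begin(), lists.end());
        if (pq.empty())
            return std::vector<int>();
        while (pq.size() > 1) {
            std::vector<int> a = pq.top(); pq.pop();   // shortest
            std::vector<int> b = pq.top(); pq.pop();   // second shortest
            std::vector<int> merged;
            merged.reserve(a.size() + b.size());
            std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(merged));
            pq.push(merged);
        }
        return pq.top();
    }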


If I had to guess, the fastest way would be to use the second idea, but with N-ary merging instead of pairwise merging, for some suitable N (which I'd guess is either a small constant or roughly the square root of the number of vectors), performing each N-ary merge with the first algorithm above to enumerate the contents of N vectors at once.
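Purely as a sketch of that guess (every name here is mine, and the choice of N is arbitrary): the vectors are merged N at a time, each group being merged with the heap-of-iterator-pairs technique from the first paragraph, until a single vector remains.

    #include <algorithm>
    #include <cmath>
    #include <queue>
    #include <utility>
    #include <vector>

    typedef std::pair<std::vector<int>::const_iterator,
                      std::vector<int>::const_iterator> range;

    struct range_greater {
        bool operator()(const range& a, const range& b) const {
            return *a.first > *b.first;
        }
    };

    // Merge one group of sorted vectors with a small heap of (current, end) pairs.
    std::vector<int> merge_group(const std::vector<const std::vector<int>*>& group) {
        std::priority_queue<range, std::vector<range>, range_greater> heap;
        std::size_t total = 0;
        for (std::size_t i = 0; i < group.size(); ++i) {
            if (!group[i]->empty())
                heap.push(std::make_pair(group[i]->begin(), group[i]->end()));
            total += group[i]->size();
        }
        std::vector<int> out;
        out.reserve(total);
        while (!heap.empty()) {
            range r = heap.top(); heap.pop();
            out.push_back(*r.first);
            if (++r.first != r.second)
                heap.push(r);
        }
        return out;
    }

    // Repeatedly merge the vectors N at a time until only one remains.
    std::vector<int> n_ary_merge(std::vector<std::vector<int> > lists) {
        if (lists.empty()) return std::vector<int>();
        std::size_t n = static_cast<std::size_t>(std::sqrt(static_cast<double>(lists.size())));
        if (n < 2) n = 2;
        while (lists.size() > 1) {
            std::vector<std::vector<int> > next;
            for (std::size_t i = 0; i < lists.size(); i += n) {
                std::vector<const std::vector<int>*> group;
                for (std::size_t j = i; j < lists.size() && j < i + n; ++j)
                    group.push_back(&lists[j]);
                next.push_back(merge_group(group));
            }
            lists.swap(next);
        }
        return lists.front();
    }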

+2

I took the algorithm given here and ran with it a little, converting it to templates. I coded this version in VS2010 and used a lambda function instead of the functor. I don't know whether it is in any sense "better" than the previous version, but perhaps it will be useful to someone?

    #include <queue>
    #include <vector>

    namespace priority_queue_sort {

    using std::priority_queue;
    using std::pair;
    using std::make_pair;
    using std::vector;

    template<typename T>
    void value_vectors(const vector< vector<T>* >& input_sorted_vectors, vector<T>& output_vector) {
        typedef typename vector<T>::iterator iter;
        typedef pair<iter, iter>             iter_pair;

        static auto greater_than_lambda = [](const iter_pair& p1, const iter_pair& p2) -> bool {
            return *(p1.first) > *(p2.first);
        };

        priority_queue<iter_pair, std::vector<iter_pair>, decltype(greater_than_lambda)>
            cluster_feeder(greater_than_lambda);

        size_t total_size(0);
        for (auto k = input_sorted_vectors.begin(); k != input_sorted_vectors.end(); ++k) {
            cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ) );
            total_size += (*k)->size();
        }
        output_vector.resize(total_size);
        total_size = 0;

        iter_pair c_lead;
        while (cluster_feeder.empty() != true) {
            c_lead = cluster_feeder.top();
            cluster_feeder.pop();
            output_vector[total_size++] = *(c_lead.first);
            c_lead.first++;
            if (c_lead.first != c_lead.second)
                cluster_feeder.push(c_lead);
        }
    }

    template<typename U, typename V>
    void pair_vectors(const vector< vector< pair<U, V> >* >& input_sorted_vectors, vector< pair<U, V> >& output_vector) {
        typedef typename vector< pair<U, V> >::iterator iter;
        typedef pair<iter, iter>                        iter_pair;

        static auto greater_than_lambda = [](const iter_pair& p1, const iter_pair& p2) -> bool {
            return *(p1.first) > *(p2.first);
        };

        priority_queue<iter_pair, std::vector<iter_pair>, decltype(greater_than_lambda)>
            cluster_feeder(greater_than_lambda);

        size_t total_size(0);
        for (auto k = input_sorted_vectors.begin(); k != input_sorted_vectors.end(); ++k) {
            cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ) );
            total_size += (*k)->size();
        }
        output_vector.resize(total_size);
        total_size = 0;

        iter_pair c_lead;
        while (cluster_feeder.empty() != true) {
            c_lead = cluster_feeder.top();
            cluster_feeder.pop();
            output_vector[total_size++] = *(c_lead.first);
            c_lead.first++;
            if (c_lead.first != c_lead.second)
                cluster_feeder.push(c_lead);
        }
    }

    } // namespace priority_queue_sort

priority_queue_sort::value_vectors handles vectors containing plain values, whereas priority_queue_sort::pair_vectors handles vectors of data pairs, ordered according to the first element of each pair. Hope someone can use this someday :-)
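For what it's worth, a minimal usage sketch, assuming the priority_queue_sort code above is in scope; the data is just the example arrays from the question:

    #include <iostream>
    // ... plus the priority_queue_sort code above

    int main() {
        int a1[] = { 2, 10, 100 };
        int a2[] = { 5, 15, 90, 200 };
        int a3[] = { 12 };
        std::vector<int> v1(a1, a1 + 3), v2(a2, a2 + 4), v3(a3, a3 + 1);

        std::vector<std::vector<int>*> inputs;
        inputs.push_back(&v1);
        inputs.push_back(&v2);
        inputs.push_back(&v3);

        std::vector<int> merged;
        priority_queue_sort::value_vectors(inputs, merged);

        for (std::size_t i = 0; i < merged.size(); ++i)
            std::cout << merged[i] << '\n';   // prints 2 5 10 12 15 90 100 200
    }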

0
