Check for duplicates in a large row vector

Question

I am trying to find duplicate string instances in a vector of roughly 2.5 million rows.

I am currently using something like:

    std::vector<string> concatVec;        // Holds all of the concatenated strings containing columns C,D,E,J and U.
    std::vector<string> dupecheckVec;     // Holds all of the unique instances of concatenated columns
    std::vector<unsigned int> linenoVec;  // Holds the line numbers of the unique instances only

    // Copy first element across, it cannot be a duplicate yet
    dupecheckVec.push_back(concatVec[0]);
    linenoVec.push_back(0);

    // Copy across and do the dupecheck
    for (unsigned int i = 1; i < concatVec.size(); i++)
    {
        bool exists = false;
        for (unsigned int x = 0; x < dupecheckVec.size(); x++)
        {
            if (concatVec[i] == dupecheckVec[x])
            {
                exists = true;
            }
        }
        if (exists == false)
        {
            dupecheckVec.push_back(concatVec[i]);
            linenoVec.push_back(i);
        }
        else
        {
            exists = false;
        }
    }

This works fine for small files, but the running time obviously becomes very long as the file grows, because of the nested loop and the growing number of entries in dupecheckVec.

What would be a less terrible way to do this for a large file?

+6

c++

rbj Mar 30 '11 at 13:47

4 answers

You could sort the vector, which is O(n log n); any equal elements are then adjacent, so you only need to compare each element with the next one, which is O(n). Your naive solution, by contrast, is O(n^2).
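A minimal sketch of this sort-then-scan idea, assuming a C++11 compiler. The function name findDuplicates is made up for illustration and is not from the question; the vector is taken by value so the caller's data is not reordered.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns every string that occurs more than once in
// `lines`, each listed once, in sorted order. The parameter is a copy, so
// the caller's vector keeps its original order.
std::vector<std::string> findDuplicates(std::vector<std::string> lines) {
    std::sort(lines.begin(), lines.end());  // O(n log n)

    std::vector<std::string> duplicates;
    for (std::size_t i = 1; i < lines.size(); ++i) {  // single O(n) pass
        // After sorting, equal strings sit next to each other. The second
        // condition avoids reporting the same string twice.
        if (lines[i] == lines[i - 1] &&
            (duplicates.empty() || duplicates.back() != lines[i])) {
            duplicates.push_back(lines[i]);
        }
    }
    return duplicates;
}
```

Copying 2.5 million strings costs memory; if reordering the original is acceptable, sorting in place avoids the copy.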

+5

Puppy Mar 30 '11 at 13:50

You can use a hash table with the strings as keys and integers (counts) as values. Iterate over the list of rows and increment the count for each row by 1. Finally, iterate over the hash table and keep the rows whose count is 1.
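The counting approach might look like the sketch below, assuming C++11's std::unordered_map as the hash table. The name rowsWithoutDuplicates is hypothetical. One deliberate deviation: the second pass walks the input rather than the hash table, because hash table iteration order is unspecified and walking the input keeps the rows in their original order.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns the rows that occur exactly once,
// in the order in which they appear in `rows`.
std::vector<std::string> rowsWithoutDuplicates(const std::vector<std::string>& rows) {
    std::unordered_map<std::string, unsigned> counts;
    for (const std::string& row : rows) {
        ++counts[row];  // operator[] starts a missing count at 0
    }

    std::vector<std::string> result;
    for (const std::string& row : rows) {
        if (counts.at(row) == 1) {  // every row was counted above
            result.push_back(row);
        }
    }
    return result;
}
```

Both passes are O(n) on average, at the cost of storing one copy of every distinct row in the table.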

[UPDATE] Another solution:

Use a hash table with the row as the key and the row's index position in the vector/array as the value.
For each row in the vector:
- If the row is already in the hash table, [optionally: delete the entry and] continue.
- Otherwise, put the index position of the current row into the hash table, using the row as the key, and continue.
Iterate over the hash table and use the indices to retrieve the unique rows.

This solution gives you the indices of all rows with the duplicates filtered out. If you only want the rows that have no duplicates, you have to delete the hash table entry when the row is already present in the hash table.
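The steps above, without the optional delete, could be sketched as follows. This assumes C++11 and std::unordered_map; firstOccurrences is an illustrative name, not something from the answer. The result corresponds to what linenoVec holds in the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns the index of the first occurrence of every
// distinct row, in ascending order.
std::vector<std::size_t> firstOccurrences(const std::vector<std::string>& rows) {
    std::unordered_map<std::string, std::size_t> firstIndex;
    firstIndex.reserve(rows.size());  // avoid rehashing while filling

    for (std::size_t i = 0; i < rows.size(); ++i) {
        // emplace does nothing if the key is already present, so only the
        // first index seen for each row is stored.
        firstIndex.emplace(rows[i], i);
    }

    std::vector<std::size_t> indices;
    indices.reserve(firstIndex.size());
    for (const auto& entry : firstIndex) {
        indices.push_back(entry.second);
    }
    // Hash table iteration order is unspecified, so restore line order.
    std::sort(indices.begin(), indices.end());
    return indices;
}
```

One caveat about the optional delete: a row that occurs three times would be inserted, deleted, and then inserted again, so for "rows without any duplicate" a count per row is the safer bookkeeping.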

+4

das_weezul Mar 30 '11 at 13:53

Use std::unique; see this.

0

jonsca Mar 30 '11 at 13:50

Mike Seymour · Accepted Answer · 2011-03-30T13:57:49+0000

If you don't mind reordering the vector, then this should do it in O(n*log(n)) time:

    std::sort(vector.begin(), vector.end());
    vector.erase(std::unique(vector.begin(), vector.end()), vector.end());

To preserve the order, you could instead use a vector of (line-number, string*) pairs: sort by string, uniquify using a comparator that compares the string contents, and finally sort by line number, along these lines:

    struct pair { int line; std::string const * string; };

    struct OrderByLine {
        bool operator()(pair const & x, pair const & y) const { return x.line < y.line; }
    };

    struct OrderByString {
        bool operator()(pair const & x, pair const & y) const { return *x.string < *y.string; }
    };

    struct StringEquals {
        bool operator()(pair const & x, pair const & y) const { return *x.string == *y.string; }
    };

    std::sort(vector.begin(), vector.end(), OrderByString());
    vector.erase(std::unique(vector.begin(), vector.end(), StringEquals()), vector.end());
    std::sort(vector.begin(), vector.end(), OrderByLine());
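For completeness, here is a self-contained sketch of the same order-preserving idea, assuming C++11. The names Entry and uniqueLineNumbers are illustrative only. It swaps in std::stable_sort for the first sort: std::sort does not guarantee the relative order of equal elements, so std::unique could otherwise keep a later occurrence rather than the first.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: a line number plus a pointer to the string,
// so no string is copied.
struct Entry {
    std::size_t line;         // index in the original vector
    const std::string* text;  // points into the original vector
};

// Hypothetical helper: returns the line numbers of the first occurrence
// of each distinct string, in ascending order.
std::vector<std::size_t> uniqueLineNumbers(const std::vector<std::string>& lines) {
    std::vector<Entry> entries;
    entries.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i) {
        entries.push_back(Entry{i, &lines[i]});
    }

    // Stable sort by content: equal strings stay in line order, so the
    // first entry of each run is the earliest occurrence.
    std::stable_sort(entries.begin(), entries.end(),
        [](const Entry& a, const Entry& b) { return *a.text < *b.text; });

    // std::unique keeps the first entry of each run of equal strings.
    entries.erase(
        std::unique(entries.begin(), entries.end(),
            [](const Entry& a, const Entry& b) { return *a.text == *b.text; }),
        entries.end());

    // Restore the original line order.
    std::sort(entries.begin(), entries.end(),
        [](const Entry& a, const Entry& b) { return a.line < b.line; });

    std::vector<std::size_t> result;
    result.reserve(entries.size());
    for (const Entry& e : entries) {
        result.push_back(e.line);
    }
    return result;
}
```

The pointers stay valid only while the original vector is neither resized nor destroyed, which holds here because it is passed by const reference and used only inside the function.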
