Building a unique key for each line in Java

We have a requirement to read and write over 10 million lines per file, and the file must not contain duplicates. Since each line is flushed to the file as soon as it is read, we do not keep the lines in memory.

We cannot use hashCode() because collisions could cause a line to be skipped as a duplicate. Two other approaches I found in my search:

1. Use a message digest algorithm such as MD5, but it may be too expensive to calculate and store.

2. Use a checksum algorithm. (I'm not sure whether this produces a unique key for a string - can someone confirm?)

Is there any other approach? Thanks.

+6
java hashcode key checksum
6 answers

If you can accept a microscopic risk of collisions, you can use a hash function such as MD5, as you suggest, and rely on the hashes.
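
A minimal sketch of that idea, keeping only the 16-byte MD5 digests in memory rather than the lines themselves (class and method names are illustrative):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Base64;
    import java.util.HashSet;
    import java.util.Set;

    public class HashDedup {
        // Digests are stored as Base64 strings because byte[] has no value-based equals/hashCode.
        private final Set<String> seenDigests = new HashSet<>();
        private final MessageDigest md5;

        public HashDedup() throws NoSuchAlgorithmException {
            md5 = MessageDigest.getInstance("MD5");
        }

        /** Returns true if the line has not been seen before (modulo MD5 collisions). */
        public boolean isNew(String line) {
            byte[] digest = md5.digest(line.getBytes(StandardCharsets.UTF_8));
            return seenDigests.add(Base64.getEncoder().encodeToString(digest));
        }
    }

At roughly 24 bytes per stored digest this is far cheaper than keeping 10 million full lines, at the cost of the (tiny) collision risk discussed above.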

Another alternative, possibly requiring a large amount of memory, is to store the lines already encountered in a trie (a special kind of tree).


Update: Another alternative would be to use a Bloom filter. It still relies on hashing, but it can be tuned for an arbitrarily small probability of false positives.
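
A sketch of the Bloom filter variant, assuming Guava is on the classpath (the expected-insertions and false-positive figures are illustrative):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class BloomDedup {
        // Sized for ~10 million lines with a 0.1% false-positive rate.
        private final BloomFilter<CharSequence> seen =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

        /** Returns true if the line should be written (may very rarely drop a unique line). */
        public boolean isNew(String line) {
            if (seen.mightContain(line)) {
                return false;   // probably a duplicate (or a rare false positive)
            }
            seen.put(line);
            return true;
        }
    }

Note the trade-off: a Bloom filter never reports a duplicate as new, but a false positive means a genuinely new line would be skipped, so tune the error rate to what you can tolerate.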

+7

Storing 10 million lines in memory is indeed a lot, so I understand the reason for writing them to a file immediately rather than keeping them in, say, a TreeSet<String>. But where would you store the 10 million unique numeric keys you want to compare against? If the keys must be unique and numeric (digits have a much smaller base/radix than letters), you cannot make the key shorter than the string itself, so you will not save any memory. At best you could compress the data, for example with GZIP, but that would only add a lot of overhead. MD5 is also unsuitable, since two different strings can produce the same hash.

I really do not see a better solution than to use a decent RDBMS (SQL database) in which you declare the column UNIQUE and handle the constraint violation accordingly. RDBMSs are highly optimized for this kind of task.

If you really cannot use a database, you would need to re-read the file for an existing occurrence before every write/flush. Perhaps not very fast, but certainly memory efficient.
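
A rough sketch of the database approach with plain JDBC. It assumes an embedded database such as H2 on the classpath; the JDBC URL, table, and column names are illustrative. Some drivers throw a plain SQLException for constraint violations, so checking the SQL state (class 23) is more portable than the specific exception type used here.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.SQLIntegrityConstraintViolationException;
    import java.sql.Statement;

    public class DbDedup {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./lines")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS lines (line VARCHAR(1000) UNIQUE)");
                }
                try (PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO lines (line) VALUES (?)")) {
                    String line = "example line";          // one line read from the input
                    insert.setString(1, line);
                    try {
                        insert.executeUpdate();            // accepted: the line is new
                    } catch (SQLIntegrityConstraintViolationException dup) {
                        // duplicate: the UNIQUE constraint rejected it, so skip the line
                    }
                }
            }
        }
    }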

+6

It is not possible to create a function that produces a unique key for a string that is shorter than the string itself.
There are, however, data structures that can solve your problem. A B-tree may fit if the data is large enough. There may be more efficient options depending on the nature of your input.

+1

Reliably removing duplicates is no easier than sorting the file. As another answer indicates, there is no guaranteed way to detect duplicates precisely without keeping a full copy of each line in memory, which seems to be exactly what you are trying to avoid.

You could keep an in-memory or on-disk hash index and use it to fetch the actual lines from the file for comparison, but this essentially duplicates what a database would do for you.

An alternative is to post-process the file after it is complete. The UNIX sort command is pretty good with large files (see: How could the UNIX sort command sort a very large file?), so I would expect the standard UNIX command-line approach to work reasonably well:

  sort my-file-of-strings.txt | uniq > my-filtered-file-of-strings.txt 

(Note that the file must be sorted before being piped to uniq to remove duplicates.)

If you do not have these tools available (or equivalents), you can always implement some kind of external merge sort yourself.
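
For the final step, the "uniq" pass is easy to reproduce in Java: once the file has been sorted (externally or otherwise), duplicates are adjacent, so a single streaming pass that remembers only the previous line removes them. A minimal sketch, with illustrative file names:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class UniqPass {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = Files.newBufferedReader(Paths.get("sorted.txt"));
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("deduped.txt"))) {
                String previous = null;
                String line;
                while ((line = in.readLine()) != null) {
                    if (!line.equals(previous)) {   // duplicates are adjacent in a sorted file
                        out.write(line);
                        out.newLine();
                    }
                    previous = line;
                }
            }
        }
    }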

+1

If the strings come from a fixed pool of N possible strings, you can use minimal perfect hashing to create an array indexed 0...N-1. A zero in the slot determined by the perfect hash function means the string has not been seen yet.
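
A sketch of that idea using one bit per possible string. It assumes a precomputed minimal perfect hash function for the known pool; perfectHash below is a hypothetical stand-in that would come from a perfect-hash generator or library.

    import java.util.BitSet;

    public class PerfectHashDedup {
        private final BitSet seen;

        public PerfectHashDedup(int n) {
            seen = new BitSet(n);   // one bit per possible string
        }

        /** Hypothetical minimal perfect hash: maps each possible string to 0..N-1. */
        private int perfectHash(String s) {
            throw new UnsupportedOperationException("supplied by a perfect-hash generator");
        }

        public boolean isNew(String line) {
            int slot = perfectHash(line);
            if (seen.get(slot)) {
                return false;   // already written
            }
            seen.set(slot);
            return true;        // first occurrence
        }
    }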

Otherwise, short of using a lot of memory or the solutions already proposed, the only strictly correct approach is to re-read the file before deciding whether to write a line to it.

You could do this about as efficiently as possible by memory-mapping parts of the file.

0

I believe the best solution is, as someone has already suggested, to use a database.

If for some reason you cannot use a database, you can still use the hash code. Of course there will be collisions. Just add code so that when a duplicate hash code is detected, your program checks the file to determine whether it is a genuine duplicate or just a collision.
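
A minimal sketch of that hybrid approach: keep only the 32-bit hash codes in memory and fall back to re-scanning the output file when a hash code has been seen before, to tell a real duplicate from a collision. The output file name is illustrative.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.stream.Stream;

    public class HashCodeDedup {
        private final Set<Integer> seenHashes = new HashSet<>();
        private final Path outputFile = Paths.get("output.txt");

        public boolean isNew(String line) throws IOException {
            if (seenHashes.add(line.hashCode())) {
                return true;                      // hash not seen before: definitely a new line
            }
            // Possible collision: confirm by scanning the lines already written.
            try (Stream<String> written = Files.lines(outputFile)) {
                return written.noneMatch(line::equals);
            }
        }
    }

The expensive file scan only happens on hash-code matches, which should be rare relative to 10 million lines, so most writes stay fast.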

0
