Handling large string lists in Java

I have a task where I need to go through several billion lines and check whether each of them is unique. The lines themselves cannot all fit in the PC's RAM. In addition, the number of lines is likely to be greater than Integer.MAX_VALUE.

I guess the best way to handle this amount of data is to put the hash code of each line into some kind of hash table.
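For reference, the naive version of that idea looks something like the sketch below (a minimal sketch; HashSet<Long> and the hash64 helper are illustrative placeholders, not a recommendation, since a heap of several billion boxed longs will not fit in RAM either, and equal hashes alone do not prove equal strings):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashSet;
    import java.util.Set;

    public class HashDedupSketch {
        // Illustrative 64-bit hash; String.hashCode() returns only an int.
        static long hash64(String s) {
            long h = 1125899906842597L; // arbitrary seed
            for (int i = 0; i < s.length(); i++) {
                h = 31 * h + s.charAt(i);
            }
            return h;
        }

        public static void main(String[] args) throws Exception {
            Set<Long> seen = new HashSet<>();
            try (BufferedReader r = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = r.readLine()) != null) {
                    // add() returns false if the hash was already present:
                    // either a duplicate line or a mere hash collision.
                    if (!seen.add(hash64(line))) {
                        System.out.println("Possible duplicate: " + line);
                    }
                }
            }
        }
    }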

So here are my questions:

  • What should be used instead of String.hashCode()? (Its return value is an int, but I will probably need a long.)
  • What is the fastest way / framework for working with lists of this size? Basically, I need the ability to quickly check if the list contains an element or not.
2 answers

You are overthinking the problem. All of this can be done very simply with one MySQL table, which saves the data to disk instead of keeping everything in memory. That amount of data was never meant to be handled efficiently by a stand-alone application.

    CREATE TABLE TONS_OF_STRINGS (
      unique_string VARCHAR(255) NOT NULL,
      UNIQUE (unique_string)
    );

Simply loop over the values (assuming a comma-separated list here) and try to insert each token. Each failed insert is a duplicate.

    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Scanner;

    public class DuplicateFinder {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/database", "username", "password");
            FileReader file = new FileReader("SomeGiantFile.csv");
            Scanner scan = new Scanner(file);
            scan.useDelimiter(",");
            // Prepare the statement once, outside the loop.
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO TONS_OF_STRINGS (unique_string) VALUES (?)");
            while (scan.hasNext()) {
                String token = scan.next();
                try {
                    ps.setString(1, token);
                    // The UNIQUE constraint makes the insert fail for repeats.
                    ps.executeUpdate();
                } catch (SQLException e) {
                    System.out.println("Found duplicate: " + token);
                }
            }
            scan.close();
            con.close();
            System.out.println("Well that was easy, I'm all done!");
        }
    }

Remember to clear out the table when you are done, though; that is a lot of data.
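If you want to do the cleanup from the same program, a minimal sketch (assuming the same connection URL and credentials as above):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class Cleanup {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/database", "username", "password");
                 Statement st = con.createStatement()) {
                // Drop the scratch table once the duplicate scan is complete.
                st.executeUpdate("DROP TABLE TONS_OF_STRINGS");
            }
        }
    }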


It is not enough to simply store 32- or 64-bit hash codes, because two different strings (out of several billion) can easily have the same hash code. If two strings have the same hash code, you need to compare the actual strings to make sure they are really equal.
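Collisions are easy to demonstrate even without billions of strings; with a 32-bit hash, the birthday bound makes a collision likely after only about 77,000 random strings. A minimal demonstration using Java's own String.hashCode():

    public class CollisionDemo {
        public static void main(String[] args) {
            // Two different strings with identical hash codes:
            // 'A'*31 + 'a' = 65*31 + 97 = 2112, and 'B'*31 + 'B' = 66*31 + 66 = 2112.
            System.out.println("Aa".hashCode());   // 2112
            System.out.println("BB".hashCode());   // 2112
            System.out.println("Aa".equals("BB")); // false
        }
    }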

Here is how I solved this problem:

  • Read the file / stream of lines:

    • Read each line

    • Compute the hash code of the line

    • Write the hash code and the line to a temporary file, with a suitable field separator between them

  • Use a good external sorting program to sort the temporary file, using the hash code field as the primary sort key and the string field as the secondary sort key.

  • Read the temporary file one line at a time. If two consecutive lines have the same hash code field and the same string field, you have found a duplicate line; if the hash codes match but the strings differ, it was only a collision. A sketch of the first and last steps follows below.

Note: this approach will work equally well with 32-bit or 64-bit hash codes.
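Here is a minimal sketch of the first and last steps (the external sort in the middle is assumed to be done by a separate tool such as GNU sort; the file names, the tab separator, and String.hashCode() as the hash function are illustrative choices):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    public class ExternalDedup {
        // Step 1: tag each line with its hash code so that an external sort
        // (e.g. "sort -t $'\t' -k1,1n -k2 tagged.txt > sorted.txt")
        // can bring identical strings next to each other.
        static void tagLines(String in, String out) throws Exception {
            try (BufferedReader r = new BufferedReader(new FileReader(in));
                 PrintWriter w = new PrintWriter(out)) {
                String line;
                while ((line = r.readLine()) != null) {
                    w.println(line.hashCode() + "\t" + line);
                }
            }
        }

        // Step 3: after sorting, duplicates are adjacent, so one streaming
        // pass with O(1) memory finds them.
        static void findDuplicates(String sorted) throws Exception {
            try (BufferedReader r = new BufferedReader(new FileReader(sorted))) {
                String prev = null, line;
                while ((line = r.readLine()) != null) {
                    // Equal whole records mean equal hash AND equal string.
                    if (line.equals(prev)) {
                        System.out.println("Duplicate: "
                                + line.substring(line.indexOf('\t') + 1));
                    }
                    prev = line;
                }
            }
        }
    }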

