String.intern() doesn't seem to scale very well as you add more strings. It looks like O(n) in the number of strings already in the pool.
    Random rand = new Random();
    for (int i = 0; i < 100; i++) {
        long start = System.nanoTime();
        for (int j = 0; j < 100000; j++)
            Long.toString(rand.nextLong()).intern();
        long time = System.nanoTime() - start;
        System.out.printf("Took %,d ns on average to intern() a random string%n", time / 100000);
    }
prints
    Took 1,586 ns on average to intern() a random string
    Took 3,843 ns on average to intern() a random string
    Took 7,551 ns on average to intern() a random string
    Took 13,436 ns on average to intern() a random string
    Took 20,226 ns on average to intern() a random string
    Took 27,609 ns on average to intern() a random string
    Took 35,098 ns on average to intern() a random string
    Took 42,439 ns on average to intern() a random string
    Took 50,801 ns on average to intern() a random string
    Took 20,975 ns on average to intern() a random string
    Took 4,634 ns on average to intern() a random string
    Took 10,512 ns on average to intern() a random string
    Took 16,914 ns on average to intern() a random string
    Took 23,601 ns on average to intern() a random string
    Took 30,230 ns on average to intern() a random string
    Took 36,184 ns on average to intern() a random string
    Took 43,266 ns on average to intern() a random string
Instead, I use a fixed-size array as a string pool.
    private static void testHashArray(String[] strings2, int size) {
        String[] pool = new String[size];
        int hit = 0, miss = 0;
        long start2 = System.nanoTime();
        for (String s : strings2) {
            int hash = (s.hashCode() & 0x7fffffff) % pool.length;
            String s2 = pool[hash];
            if (s.equals(s2)) {
                hit++;
            } else {
                miss++;
            }
            if (s2 != s)
                pool[hash] = s;
        }
        long time2 = System.nanoTime() - start2;
        System.out.printf("Hash size: %,d took %.3f second. Hit/miss %,d/%,d %n", size, time2 / 1e9, hit, miss);
    }

    public static void main(String... args) {
        Random rand = new Random();
        // a million unique strings.
        String[] strings = new String[1000 * 1000];
        for (int i = 0; i < strings.length; i++)
            strings[i] = String.valueOf(rand.nextLong());

        // random selection of Strings
        String[] strings2 = new String[10 * 1000 * 1000];
        int totalSize = 0;
        for (int i = 0; i < strings2.length; i++) {
            int idx = (int) Math.pow(strings.length, rand.nextFloat());
            String s = strings[idx];
            strings2[i] = s;
            totalSize += s.length() + 16; // with overhead
        }
        System.out.printf("Original size %,d%n", totalSize);

        Set<String> uniqueStrings = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        uniqueStrings.addAll(Arrays.asList(strings2));
        System.out.printf("Unique strings %,d%n", uniqueStrings.size());

        long start = System.nanoTime();
        Map<String, String> map = new HashMap<String, String>();
        for (String s : strings2)
            map.put(s, s);
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to map strings%n", time / 1e9);

        testHashArray(strings2, 10192);
        testHashArray(strings2, 101929);
        testHashArray(strings2, 1019291);
    }
prints
    Original size 353,293,201
    Unique strings 766,222
    Took 0.979 second to map strings
    Hash size: 10,192 took 0.357 second. Hit/miss 5,213,210/4,786,790
    Hash size: 101,929 took 0.309 second. Hit/miss 7,202,094/2,797,906
    Hash size: 1,019,291 took 0.254 second. Hit/miss 8,789,382/1,210,618
If intern() is slow, how about doing it in a background thread after loading? Once the server has loaded, you can replace strings with the pooled copy whenever a duplicate is detected.
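A rough sketch of that idea, assuming the loaded data is held as a List<String[]> (the loadedRows name is mine, not from the answer) and that overwriting a row element with an equal, immutable String while the server runs is acceptable:

    // Sketch only: de-duplicate already-loaded strings in a background thread.
    // "loadedRows" is a hypothetical List<String[]> holding the parsed data.
    Thread dedup = new Thread(new Runnable() {
        public void run() {
            Map<String, String> canonical = new HashMap<String, String>();
            for (String[] row : loadedRows) {
                for (int i = 0; i < row.length; i++) {
                    String existing = canonical.get(row[i]);
                    if (existing == null)
                        canonical.put(row[i], row[i]); // first occurrence becomes the canonical copy
                    else
                        row[i] = existing;             // duplicate: point at the canonical copy
                }
            }
        }
    });
    dedup.setDaemon(true);
    dedup.start();

Because the replacement is equal to the original and Strings are immutable, readers never observe a wrong value; the duplicate copies simply become eligible for garbage collection.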
Do you really need to save the 130 MB? I know it sounds good, but would that memory be put to use for anything else otherwise?
For a faster alternative to intern(), you can use a fixed-size array.
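A minimal sketch of such a pool, along the lines of testHashArray() above (the StringPool class name is mine, not a standard API):

    // Fixed-size "intern" pool: O(1) per lookup and a bounded footprint.
    public final class StringPool {
        private final String[] pool;

        public StringPool(int size) {
            pool = new String[size];
        }

        // Returns the cached instance if an equal string sits in this slot;
        // otherwise stores the argument and returns it.
        public String intern(String s) {
            int idx = (s.hashCode() & 0x7fffffff) % pool.length;
            String cached = pool[idx];
            if (s.equals(cached))
                return cached;
            pool[idx] = s;
            return s;
        }
    }

A hash collision simply overwrites the slot, so the pool never grows and never needs eviction; the trade-off is that it removes most duplicates rather than all of them, as the hit/miss figures above show.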