String.intern() doesn't seem to scale very well as you add more strings. It looks like O(n) in the number of strings already in the pool.
    Random rand = new Random();
    for (int i = 0; i < 100; i++) {
        long start = System.nanoTime();
        for (int j = 0; j < 100000; j++)
            Long.toString(rand.nextLong()).intern();
        long time = System.nanoTime() - start;
        System.out.printf("Took %,d ns on average to intern() a random string%n", time / 100000);
    }
prints
    Took 1,586 ns on average to intern() a random string
    Took 3,843 ns on average to intern() a random string
    Took 7,551 ns on average to intern() a random string
    Took 13,436 ns on average to intern() a random string
    Took 20,226 ns on average to intern() a random string
    Took 27,609 ns on average to intern() a random string
    Took 35,098 ns on average to intern() a random string
    Took 42,439 ns on average to intern() a random string
    Took 50,801 ns on average to intern() a random string
    Took 20,975 ns on average to intern() a random string
    Took 4,634 ns on average to intern() a random string
    Took 10,512 ns on average to intern() a random string
    Took 16,914 ns on average to intern() a random string
    Took 23,601 ns on average to intern() a random string
    Took 30,230 ns on average to intern() a random string
    Took 36,184 ns on average to intern() a random string
    Took 43,266 ns on average to intern() a random string
Instead, I use a fixed-size array as a string pool.
    private static void testHashArray(String[] strings2, int size) {
        String[] pool = new String[size];
        int hit = 0, miss = 0;
        long start2 = System.nanoTime();
        for (String s : strings2) {
            int hash = (s.hashCode() & 0x7fffffff) % pool.length;
            String s2 = pool[hash];
            if (s.equals(s2)) {
                hit++;
            } else {
                miss++;
            }
            if (s2 != s)
                pool[hash] = s;
        }
        long time2 = System.nanoTime() - start2;
        System.out.printf("Hash size: %,d took %.3f second. Hit/miss %,d/%,d %n", size, time2 / 1e9, hit, miss);
    }

    public static void main(String... args) {
        Random rand = new Random();
        // a million unique strings.
        String[] strings = new String[1000 * 1000];
        for (int i = 0; i < strings.length; i++)
            strings[i] = String.valueOf(rand.nextLong());

        // random selection of Strings
        String[] strings2 = new String[10 * 1000 * 1000];
        int totalSize = 0;
        for (int i = 0; i < strings2.length; i++) {
            int idx = (int) Math.pow(strings.length, rand.nextFloat());
            String s = strings[idx];
            strings2[i] = s;
            totalSize += s.length() + 16; // with overhead
        }
        System.out.printf("Original size %,d%n", totalSize);

        Set<String> uniqueStrings = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        uniqueStrings.addAll(Arrays.asList(strings2));
        System.out.printf("Unique strings %,d%n", uniqueStrings.size());

        long start = System.nanoTime();
        Map<String, String> map = new HashMap<String, String>();
        for (String s : strings2)
            map.put(s, s);
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to map strings%n", time / 1e9);

        testHashArray(strings2, 10192);
        testHashArray(strings2, 101929);
        testHashArray(strings2, 1019291);
    }
prints
    Original size 353,293,201
    Unique strings 766,222
    Took 0.979 second to map strings
    Hash size: 10,192 took 0.357 second. Hit/miss 5,213,210/4,786,790
    Hash size: 101,929 took 0.309 second. Hit/miss 7,202,094/2,797,906
    Hash size: 1,019,291 took 0.254 second. Hit/miss 8,789,382/1,210,618
If intern() is slow, how about doing it in a background thread after loading? Once the server has loaded, you can replace strings with the pooled copy whenever a duplicate is detected.
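A rough sketch of that idea, assuming the loaded data is held as a List<String[]> (the loadedRows name is mine, not from the answer) and that overwriting a row element with an equal, immutable String while the server runs is acceptable:

    // Sketch only: de-duplicate already-loaded strings in a background thread.
    // "loadedRows" is a hypothetical List<String[]> holding the parsed data.
    Thread dedup = new Thread(new Runnable() {
        public void run() {
            Map<String, String> canonical = new HashMap<String, String>();
            for (String[] row : loadedRows) {
                for (int i = 0; i < row.length; i++) {
                    String existing = canonical.get(row[i]);
                    if (existing == null)
                        canonical.put(row[i], row[i]); // first occurrence becomes the canonical copy
                    else
                        row[i] = existing;             // duplicate: point at the canonical copy
                }
            }
        }
    });
    dedup.setDaemon(true);
    dedup.start();

Because the replacement is equal to the original and Strings are immutable, readers never observe a wrong value; the duplicate copies simply become eligible for garbage collection.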
Do you really need to save the 130 MB? I know it sounds good, but would that memory be put to use for anything else otherwise?
For a faster alternative to intern(), you can use a fixed-size array.
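A minimal sketch of such a pool, along the lines of testHashArray() above (the StringPool class name is mine, not a standard API):

    // Fixed-size "intern" pool: O(1) per lookup and a bounded footprint.
    public final class StringPool {
        private final String[] pool;

        public StringPool(int size) {
            pool = new String[size];
        }

        // Returns the cached instance if an equal string sits in this slot;
        // otherwise stores the argument and returns it.
        public String intern(String s) {
            int idx = (s.hashCode() & 0x7fffffff) % pool.length;
            String cached = pool[idx];
            if (s.equals(cached))
                return cached;
            pool[idx] = s;
            return s;
        }
    }

A hash collision simply overwrites the slot, so the pool never grows and never needs eviction; the trade-off is that it removes most duplicates rather than all of them, as the hit/miss figures above show.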