Multithreaded string processing blows up with the number of threads

I am working on a multi-threaded project where we parse text from a file into a magic object, do some processing on the object, and aggregate the results. The old version of the code parsed the text in a single thread and processed the objects in a thread pool using Java's ExecutorService. We did not get the performance improvement we wanted, and it turned out that parsing takes longer than we thought relative to the processing time per object, so I tried moving the parsing into the worker threads.

This should have worked, but what actually happens is that the time per object blows up as a function of the number of threads in the pool. It is worse than linear, but not as bad as exponential.

I reduced it to a small example that (on my machine, anyway) shows the behavior. The example does not even create the magic object; it just does string manipulation. There are no interdependencies between the threads that I can see. I know that split() is not very efficient, but I can't imagine why it would be this bad in a multi-threaded context. Did I miss something?

I am running Java 7 on a 24-core machine. The lines are long, ~1 MB each. There can be dozens of elements in features and 100k+ elements in edges.

Input Example:

1    1    156    24    230    1350    id(foo):id(bar):w(house,pos):w(house,neg)    1->2:1@1.0    16->121:2@1.0,3@0.5

Command line for running with 16 threads:

$ java -Xmx10G Foo 16 myfile.txt

Code:

import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Foo implements Runnable {
String line;
int id;
public Foo(String line, int id) {
    this.line = line;
    this.id = id;
}
public void run() {
    System.out.println(System.currentTimeMillis()+" Job start "+this.id);
    // line format: tab delimited
    // x[4]             <-- four leading fields
    // graph[2]         <-- two graph fields
    // features[m]      <-- ':' delimited
    // edges[n]         <-- tab delimited, one edge per field
    String[] x = this.line.split("\t",5);       // x[4] holds the rest of the line
    String[] graph = x[4].split("\t",4);        // graph[2] = features, graph[3] = all edges
    String[] features = graph[2].split(":");
    String[] edges = graph[3].split("\t");
    // The split results are discarded on purpose: only the parsing cost is measured.
    for (String e : edges) {
        String[] ee = e.split(":",2);
        ee[0].split("->",2);
        for (String f : ee[1].split(",")) {
            f.split("@",2);
        }
    }
    System.out.println(System.currentTimeMillis()+" Job done "+this.id);
}
public static void main(String[] args) throws IOException,InterruptedException {
    // args[0] = number of threads, args[1] = input file
    System.err.println("Reading from "+args[1]+" in "+args[0]+" threads...");
    LineNumberReader reader = new LineNumberReader(new FileReader(args[1]));
    ExecutorService pool = Executors.newFixedThreadPool(Integer.parseInt(args[0]));
    for(String line; (line=reader.readLine()) != null;) {
        pool.submit(new Foo(line, reader.getLineNumber()));
    }
    reader.close();
    pool.shutdown();
    pool.awaitTermination(7,TimeUnit.DAYS);
}
}
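To put a number on the slowdown instead of reading timestamps off the console, a small harness along these lines might help. It is only a sketch: the class name `SplitTiming`, the synthetic edge data, and the job count of 64 are all made up for illustration and are not part of the original program. It times each job with System.nanoTime() and reports the mean time per job for several pool sizes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitTiming {
    /** Builds a synthetic tab-delimited edges field with n edges (format mimics the sample input). */
    static String syntheticEdges(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) sb.append('\t');
            sb.append(i).append("->").append(i + 1).append(':').append(i).append("@1.0");
        }
        return sb.toString();
    }

    /** Runs `jobs` split tasks on a pool of `threads` threads and returns the mean nanoseconds per job. */
    static double meanNanosPerJob(final String edges, int threads, int jobs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<Future<Long>>();
            for (int j = 0; j < jobs; j++) {
                results.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long t0 = System.nanoTime();
                        int parts = 0;
                        for (String e : edges.split("\t")) {
                            parts += e.split(":", 2).length;
                        }
                        long elapsed = System.nanoTime() - t0;
                        // parts is consulted only so the split work cannot be discarded as dead code
                        return parts < 0 ? -1L : elapsed;
                    }
                }));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get();
            }
            return (double) total / jobs;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String edges = syntheticEdges(100000);
        for (int threads : new int[] {1, 2, 4, 8, 16}) {
            double ms = meanNanosPerJob(edges, threads, 64) / 1e6;
            System.out.printf("%2d threads: %.1f ms/job%n", threads, ms);
        }
    }
}
```

If the per-job time stays roughly flat as the pool grows, the parsing scales; if it climbs with the thread count, the jobs are contending for something shared, such as memory allocation and garbage collection.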

:

  • . , , ArrayList<String>. , . , --, ?
  • , , .: (

:

Here is a hand-written split based on indexOf():

private String[] split(String string, char delim) {
    if (string.length() == 0) return new String[0];
    int nitems=1;
    for (int i=0; i<string.length(); i++) {
        if (string.charAt(i) == delim) nitems++;
    }
    String[] items = new String[nitems];
    int last=0;
    for (int next=last,i=0; i<items.length && next!=-1; last=next+1,i++) {
        next=string.indexOf(delim,last);
        items[i]=next<0?string.substring(last):string.substring(last,next);
    }
    return items;       
}
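As a quick sanity check on the splitter above, it can be compared with String.split() on the features field from the sample input. This is only a sketch: the class name `SplitCheck` is made up, and the method is copied as a static so the example stands alone. Note that the hand-written version keeps trailing empty items, so the matching JDK call is split(":", -1) rather than split(":"), and for an empty input it returns an empty array where the JDK returns a single empty string.

```java
import java.util.Arrays;

public class SplitCheck {
    // Copy of the hand-written splitter, made static so the harness is self-contained.
    static String[] split(String string, char delim) {
        if (string.length() == 0) return new String[0];
        // First pass: count delimiters to size the result array exactly.
        int nitems = 1;
        for (int i = 0; i < string.length(); i++) {
            if (string.charAt(i) == delim) nitems++;
        }
        String[] items = new String[nitems];
        int last = 0;
        // Second pass: cut out each item between consecutive delimiters.
        for (int next = last, i = 0; i < items.length && next != -1; last = next + 1, i++) {
            next = string.indexOf(delim, last);
            items[i] = next < 0 ? string.substring(last) : string.substring(last, next);
        }
        return items;
    }

    public static void main(String[] args) {
        String features = "id(foo):id(bar):w(house,pos):w(house,neg)";
        String[] mine = split(features, ':');
        String[] jdk = features.split(":", -1);   // limit -1 keeps trailing empty items
        System.out.println(Arrays.toString(mine));
        System.out.println("same as String.split: " + Arrays.equals(mine, jdk));
    }
}
```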

, , , . , ...


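In Java 7, String.split() uses String.substring() internally, and since update 7u6 substring() no longer shares the original String's backing array: every piece it returns is a new String with its own copy of the characters.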

, split() a String , ( , ) . , , ( Java 8).

, , "", String.split() ( ) .
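Since this answer points at String.split() and String.substring(), one option worth trying is to walk the line by index and not create substrings at all. The following is a minimal sketch under assumptions, not a drop-in replacement: the class name `IndexScan` and both method names are hypothetical, and it only counts edges and their comma-separated entries, which stands in for whatever the real parser needs to extract.

```java
public class IndexScan {
    /** Counts delim-separated items in s[from, to) by scanning characters, without allocating. */
    static int countItems(String s, int from, int to, char delim) {
        if (from >= to) return 0;
        int count = 1;
        for (int i = from; i < to; i++) {
            if (s.charAt(i) == delim) count++;
        }
        return count;
    }

    /**
     * Scans a tab-delimited edges field such as "1->2:1@1.0\t16->121:2@1.0,3@0.5".
     * Returns {number of edges, total number of comma-separated entries after the ':'}.
     */
    static int[] scanEdges(String edges) {
        int nEdges = 0;
        int nEntries = 0;
        int start = 0;
        while (start <= edges.length()) {
            int end = edges.indexOf('\t', start);
            if (end < 0) end = edges.length();
            if (end > start) {                      // skip empty fields
                nEdges++;
                int colon = edges.indexOf(':', start);
                if (colon >= 0 && colon < end) {
                    nEntries += countItems(edges, colon + 1, end, ',');
                }
            }
            start = end + 1;
        }
        return new int[] {nEdges, nEntries};
    }

    public static void main(String[] args) {
        int[] counts = scanEdges("1->2:1@1.0\t16->121:2@1.0,3@0.5");
        System.out.println(counts[0] + " edges, " + counts[1] + " entries");
    }
}
```

The scan creates no intermediate String or array objects, so each job puts far less pressure on the allocator and the garbage collector than the nested split() calls do. Whether that removes the slowdown on a given machine would still have to be measured.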
