High level Java optimization

There are many questions, answers, and opinions about how to do low-level Java optimization of for, while, and do-while loops, and whether it is even necessary.

My question is more about high-level, design-level optimization. Suppose I have to do the following:

for a given string input, count the occurrences of each letter in the string.

This is not a serious problem when the string is only a couple of sentences, but what if instead we want to count the occurrences of each word in a file of 900,000 words? Building loops just seems to waste time.

So, what is the high-level design pattern that can be applied to this kind of problem?

I guess my main point is that I tend to use loops to solve many of my problems, and I would like to break the habit of relying on loops.

Thanks in advance,

Sam

P.S. If possible, could you produce some pseudo-code for solving the 900,000-word file problem? I understand code better than I understand English, which I believe is true for most visitors to this site.

+5
6 answers

The word-count problem is one of the most widely covered problems in the Big Data world; it is something of a Hello World for frameworks like Hadoop. You can find ample information on the web about it.

I'll give you a few thoughts on it anyway.

First of all, 900,000 words may well be small enough to build a hash map for, so don't dismiss the obvious in-memory approach. You said pseudo-code is fine, so:

h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
  h[w] = (w in h) ? h[w] + 1 : 1
}
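
Spelled out in real Java, that pseudo-code might look like the sketch below. This is only a minimal sketch that assumes the whole file fits comfortably in memory; the class name WordCount and the use of a command-line argument for the file path are illustrative, not part of the original answer.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Tokenize the file on whitespace and accumulate one count per distinct word.
        try (Scanner in = new Scanner(Files.newBufferedReader(Paths.get(args[0])))) {
            while (in.hasNext()) {
                counts.merge(in.next().toLowerCase(), 1, Integer::sum);
            }
        }
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}

Run it as java WordCount words.txt; it prints one word and its count per line.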

Once your data set gets too large for an in-memory hash map, you can do the counting like this instead:

Tokenize into words, writing each word to a single line in a file
Use the Unix sort command on that file to produce a sorted file
Count as you traverse the sorted file

These three steps fit naturally into a Unix pipeline; let the operating system do the work for you here.
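
The counting step of that pipeline stays trivial because identical words are adjacent in the sorted stream. Here is a minimal sketch of such a run counter, reading the already-sorted words from standard input; the class name is illustrative.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class SortedRunCounter {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String previous = null;
        int count = 0;
        String line;
        // Identical words are adjacent in sorted input, so a single running counter suffices.
        while ((line = in.readLine()) != null) {
            if (line.equals(previous)) {
                count++;
            } else {
                if (previous != null) System.out.println(previous + "\t" + count);
                previous = line;
                count = 1;
            }
        }
        if (previous != null) System.out.println(previous + "\t" + count);
    }
}

It would sit at the very end of the pipeline, reading the output of sort.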

Then, as you get even more data, you will want to bring in map-reduce frameworks such as Hadoop to do the word counting across a cluster of machines.
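
To make the map-reduce shape of the problem concrete without pulling in Hadoop itself, here is a minimal single-process sketch of the three phases (map, shuffle/group, reduce); the class and method names are purely illustrative, and a real framework would run each phase distributed across machines.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {
    // Map phase: turn one chunk of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+"))
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // Reduce phase: sum all the 1s that were grouped under one word.
    static int reduce(List<Integer> ones) {
        int sum = 0;
        for (int one : ones) sum += one;
        return sum;
    }

    public static void main(String[] args) {
        String[] chunks = { "hey moe woo", "woo woo nyuk nyuk why" };

        // Shuffle phase: group every emitted (word, 1) pair by its word.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String chunk : chunks)   // each chunk could be mapped on a different machine
            for (Map.Entry<String, Integer> pair : map(chunk))
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());

        // Reduce each group; different words could be reduced on different machines.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet())
            System.out.println(entry.getKey() + "\t" + reduce(entry.getValue()));
    }
}

The point of the split is that the map calls are independent of each other and the reduce calls are independent per word, which is exactly what makes the work easy to distribute.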

Now, I have heard that once data sets get extremely large, doing things in a distributed environment stops paying off, because the time spent shipping the data around overwhelms the time spent counting, but that is a different scale of problem altogether.

Now for some Java. Here is a super-simple tokenizer:

import java.util.Scanner;
public class WordGenerator {
    /**
     * Tokenizes standard input into words, writing each word to standard output,
     * one per line.  Because it reads from standard input and writes to standard
     * output, it can easily be used in a pipeline combined with sort, uniq, and
     * any other such application.
     */
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    } 
}

You can use it like this:

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator

hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.

You can then combine it with sort and uniq like this:

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq

hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo

Now, if you want to keep only letters and throw away punctuation, digits, and other characters, change the Scanner definition line to (this also needs an import of java.util.regex.Pattern):

Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));

echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq

hey
moe
nyuk
soitenly
why
woo

There is endless fun to be had this way; I hope it helps. :) (If you want the actual counts rather than just the distinct words, use uniq -c, and pipe through sort -rn to rank by frequency.)

+10

Your solution is going to be O(n) no matter what; AFAIK you cannot do better than that, because every character has to be examined at least once. What matters is the data structure: use a HashMap from each character to its count and fill it in a single pass over the string.

Pseudo-code (which is actually valid Java):

HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++)
{
    char c = str.charAt(i);
    if (map.containsKey(c)) map.put(c, map.get(c) + 1);
    else map.put(c, 1);
}
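
On Java 8 and later, the if/else inside that loop can be collapsed into a single Map.merge call that does the same increment-or-insert:

map.merge(c, 1, Integer::sum); // adds 1 to the existing count, or stores 1 if c is not yet a key
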
+3

You have to touch every character at least once, so some loop is unavoidable. IMO, the best you can do is a single pass that keeps running counts in a map (or, for characters, a plain array of counters).

+1

IMHO, counting the characters in 900,000 words is a trivial amount of work for a modern machine. ;)

Use an int[] to hold the counts: there are only 65,536 possible char values.

StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
    words.append(Long.toString(rand.nextLong(), 36)).append(' ');
String text = words.toString();

long start = System.nanoTime();
int[] charCount = new int[Character.MAX_VALUE + 1]; // one counter for every possible char value
for (int i = 0; i < text.length(); i++)
    charCount[text.charAt(i)]++;
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000/1000, text.length());

Took 111 ms to count 139,715,647 characters

That is about 11 ns per word, so 900,000 words would take on the order of 10 ms.

If you need it to be faster still, you can split the text across several threads and combine their counts at the end.

public static void main(String... args) throws InterruptedException, ExecutionException {
    StringBuilder words = new StringBuilder();
    Random rand = new Random();
    for (int i = 0; i < 10 * 1000 * 1000; i++)
        words.append(Long.toString(rand.nextLong(), 36)).append(' ');
    final String text = words.toString();

    long start = System.nanoTime();
    // start a thread pool to generate 4 tasks to count sections of the text.
    final int nThreads = 4;
    ExecutorService es = Executors.newFixedThreadPool(nThreads);
    List<Future<int[]>> results = new ArrayList<Future<int[]>>();
    int blockSize = (text.length() + nThreads - 1) / nThreads;
    for (int i = 0; i < nThreads; i++) {
        final int min = i * blockSize;
        final int max = Math.min(min + blockSize, text.length());
        results.add(es.submit(new Callable<int[]>() {
            @Override
            public int[] call() throws Exception {
                int[] charCount = new int[Character.MAX_VALUE + 1];
                for (int j = min; j < max; j++)
                    charCount[text.charAt(j)]++;
                return charCount;
            }
        }));
    }
    es.shutdown();
    // combine the results.
    int[] charCount = new int[Character.MAX_VALUE + 1];
    for (Future<int[]> resultFuture : results) {
        int[] result = resultFuture.get();
        for (int i = 0, resultLength = result.length; i < resultLength; i++) {
            charCount[i] += result[i];
        }
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
}

Took 45 ms to count 139,715,537 characters

With four threads this is roughly two and a half times faster than the single-threaded version.

+1

To complement the other answers: think of the job as a pipeline of stages rather than as one big loop. You read the input, split it into words, and accumulate the counts; each stage can then be tuned or scaled on its own, which is essentially what frameworks like Hadoop do for you.

SEDA is a pattern for this kind of staged, event-driven processing on a single machine, while map-reduce is a pattern for spreading the same staged work across a cluster.

If you only need to scale within a single JVM, look for a library that exposes this style of API.
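
To make the staged idea concrete, here is a minimal sketch of a two-stage pipeline in plain Java, with a tokenizer stage feeding a counting stage through a blocking queue. The class name, the hard-coded input, and the single thread per stage are all simplifications; a real SEDA-style system would add a thread pool and back-pressure per stage.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StagedWordCount {
    private static final String POISON_PILL = ""; // empty string marks the end of the input

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> words = new ArrayBlockingQueue<>(1024);

        // Stage 1: tokenize the input and push each word onto the queue.
        Thread tokenizer = new Thread(() -> {
            try {
                for (String line : new String[] { "Hey Moe! Woo", "woo woo nyuk-nyuk why soitenly. Hey." })
                    for (String word : line.toLowerCase().split("\\s+"))
                        words.put(word);
                words.put(POISON_PILL); // signal that there is no more input
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: drain the queue and accumulate the counts.
        Map<String, Integer> counts = new HashMap<>();
        Thread counter = new Thread(() -> {
            try {
                for (String word = words.take(); !word.equals(POISON_PILL); word = words.take())
                    counts.merge(word, 1, Integer::sum);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        tokenizer.start();
        counter.start();
        tokenizer.join();
        counter.join();
        System.out.println(counts);
    }
}

The value of this shape is that each stage can be scaled or replaced independently, which is the same property map-reduce exploits across machines.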

0
