The word count problem is one of the most common problems in the Big Data world; it is a kind of "hello world" for frameworks like Hadoop. There is plenty of information about it on the Internet.
For a file whose distinct words fit in memory, the simplest approach is a hash map from each word to its count:
h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
    h[w] = w in h ? h[w] + 1 : 1
}
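The counting loop above could be written in runnable Java roughly as follows. This is only a minimal sketch: the class name WordCount and the helper countWords are made up for illustration, and it assumes words are separated by whitespace.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCount {
    // Counts how often each whitespace-separated token occurs in the text
    // (hypothetical helper, lower-casing every token before counting).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner input = new Scanner(text)) {
            while (input.hasNext()) {
                // merge() stores 1 for a new word, otherwise adds 1 to the old count
                counts.merge(input.next().toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("woo Woo woo why");
        System.out.println(counts.get("woo")); // prints 3
        System.out.println(counts.get("why")); // prints 1
    }
}
```

Map.merge keeps the "insert 1 or increment" logic in a single call, which avoids the easy mistake of assigning the result of a post-increment back to the same entry.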
If the words do not all fit in memory, another option is:
Tokenize into words, writing each word on its own line in a file
Use the Unix sort command to produce a sorted file
Count runs of identical words as you traverse the sorted file
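The last of these steps could look roughly like this in Java. It is a sketch under assumptions: the names SortedCounter and countRuns are hypothetical, the input is taken to be already sorted with one word per line, and the result format ("count word") is chosen here only for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Consumer;

public class SortedCounter {
    // Walks already-sorted lines and emits "count word" once per run of equal lines.
    // Only the current word and its count are kept in memory.
    static void countRuns(Iterator<String> sortedLines, Consumer<String> out) {
        String previous = null;
        int count = 0;
        while (sortedLines.hasNext()) {
            String line = sortedLines.next();
            if (line.equals(previous)) {
                count++;
            } else {
                if (previous != null) {
                    out.accept(count + " " + previous);
                }
                previous = line;
                count = 1;
            }
        }
        if (previous != null) {
            out.accept(count + " " + previous); // flush the final run
        }
    }

    public static void main(String[] args) {
        // Sample data; for real input one could pass
        // new BufferedReader(new InputStreamReader(System.in)).lines().iterator()
        Iterator<String> sample = Arrays.asList("hey", "hey", "why", "woo", "woo", "woo").iterator();
        countRuns(sample, System.out::println);
        // prints:
        // 2 hey
        // 1 why
        // 3 woo
    }
}
```

Because the input is sorted, equal words are adjacent, so a single pass with one counter is enough no matter how many distinct words there are.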
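The sorting and counting steps map directly onto standard Unix tools.
At larger scale, the same problem is usually handled with map-reduce, for example in Hadoop.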
The tokenizing step can be written in Java:
import java.util.Scanner;

public class WordGenerator {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    }
}
Running it on a sample input:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator
hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.
Piping the output through sort and uniq leaves one line per distinct word:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq
hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo
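Punctuation is still attached to the words, so "hey" and "hey." are treated as different words. Treating non-letter characters as delimiters fixes this: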
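// needs: import java.util.regex.Pattern; the "+" avoids empty tokens between adjacent non-letters
Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}+"));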
echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq
hey
moe
nyuk
soitenly
why
woo
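Whether the delimiter pattern matches a single non-letter or a whole run of them matters when several non-letters appear in a row, as in "Moe! Woo" or "why#2soitenly". The java.util.Scanner documentation notes that a pattern matching one delimiter character at a time may return empty tokens. The sketch below lets the two patterns be compared side by side; the class name DelimiterDemo and the helper tokens are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class DelimiterDemo {
    // Splits text with the given delimiter regex and returns the lower-cased tokens.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner input = new Scanner(text).useDelimiter(Pattern.compile(delimiterRegex))) {
            while (input.hasNext()) {
                result.add(input.next().toLowerCase());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A run of non-letters ("#2") is consumed as one delimiter.
        System.out.println(tokens("why#2soitenly", "\\P{L}+")); // prints [why, soitenly]
        // One non-letter at a time: empty tokens may show up between "#" and "2".
        System.out.println(tokens("why#2soitenly", "\\P{L}"));
    }
}
```

With the run-based pattern every token contains at least one letter, so no blank lines reach sort and uniq.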