Large-scale analysis of word co-occurrence across documents

I have about 1000 files. Each of them contains about 20,000 documents. I also have a list of about 1,000,000 words.

I want to calculate how many times each word co-occurs with every other word, so the result is a sparse matrix of size 1M x 1M.

To speed up the calculation, I work on each file separately by following these steps:

1- Each core on my machine processes one file and produces an output file in the following format:

WordId1 WordId2 Frequency 

2- After all files have been processed, I merge the 1000 output files into a single file.

This is my current approach, but it takes a very long time. I assume there must be a much more efficient way to do this, so your comments are welcome.
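For reference, here is a minimal Java sketch of the per-file step as described above; it is not the asker's actual code. readDocuments is a hypothetical placeholder for whatever tokenization turns each document into word ids, and packing the two ids of a pair into one long is just one convenient key encoding.

    import java.io.*;
    import java.util.*;

    // Sketch of the per-file counting step: count co-occurring word-id pairs in one file
    // and write "WordId1 WordId2 Frequency" lines to the output file.
    public class PerFileCounts {

        static long pack(int w1, int w2) {                // always store (smaller, larger)
            int a = Math.min(w1, w2), b = Math.max(w1, w2);
            return ((long) a << 32) | (b & 0xFFFFFFFFL);
        }

        static void countFile(File in, File out) throws IOException {
            Map<Long, Integer> counts = new HashMap<>();
            for (int[] doc : readDocuments(in)) {         // ~20,000 documents per file
                for (int i = 0; i < doc.length; i++)
                    for (int j = i + 1; j < doc.length; j++)
                        counts.merge(pack(doc[i], doc[j]), 1, Integer::sum);
            }
            try (PrintWriter w = new PrintWriter(new BufferedWriter(new FileWriter(out)))) {
                for (Map.Entry<Long, Integer> e : counts.entrySet()) {
                    long k = e.getKey();
                    w.println((k >>> 32) + " " + (int) k + " " + e.getValue());  // WordId1 WordId2 Frequency
                }
            }
        }

        static List<int[]> readDocuments(File f) {        // placeholder: plug in your own parser
            return Collections.emptyList();
        }
    }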

+6
3 answers

I think you can get reasonable performance by carefully handling the pieces. The problematic part is memory. With enough memory, you could avoid writing intermediate files and merging them altogether.

When processing a single document, you can convert it to a BitSet in which each bit is set if the corresponding word is present.
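As a concrete illustration of the BitSet idea (a sketch, assuming the document has already been turned into an array of word ids by your own tokenizer):

    import java.util.BitSet;

    // One bit per word id; a bit is set if that word occurs in the document.
    class DocumentBits {
        static BitSet toBitSet(int[] docWordIds, int vocabularySize) {
            BitSet present = new BitSet(vocabularySize);   // about 1,000,000 here
            for (int wordId : docWordIds) {
                present.set(wordId);
            }
            return present;
        }
    }

    // Iterating the set bits then walks the distinct words of the document:
    //   for (int w = bits.nextSetBit(0); w >= 0; w = bits.nextSetBit(w + 1)) { ... }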

Your relation is symmetric, so I hope you only store (a, b, count) with a < b.

Counting requires something like a Multiset<Pair<String, String>>, but there are more memory-efficient structures. Your words are numbered, so each one can be represented as an int, and a pair can be represented as a long. So maybe something like a LongIntHashMap. You need concurrency, so you could either use atomics for the entries, or split the map into N parts (after some hashing, with N bigger than the number of cores) and synchronize on each part. Or simply build something on top of an AtomicIntegerArray.

You did not say whether there is a chance that your result fits into memory, but if it does, this can lead to a tremendous speedup.

Explanation, as requested

The Strings are numbered from 0 to one million, which fits into an int. Two such numbers together fit into a long, which can be used as a key for a TLongIntHashMap. For each document, you determine all the relevant String pairs, compute the corresponding long for each pair, and increment its value in the TLongIntHashMap.
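A minimal sketch of that scheme, assuming GNU Trove's TLongIntHashMap is on the classpath (the class and its methods are Trove's; everything else below is illustrative):

    import gnu.trove.map.hash.TLongIntHashMap;

    // A pair of word ids (a, b) with a < b is packed into a single long key.
    class PairCounter {
        private final TLongIntHashMap counts = new TLongIntHashMap();

        private static long key(int a, int b) {          // caller guarantees a < b
            return ((long) a << 32) | (b & 0xFFFFFFFFL);
        }

        void seen(int a, int b) {
            counts.adjustOrPutValue(key(a, b), 1, 1);    // add 1, or put 1 if the key is new
        }

        int count(int a, int b) {
            return counts.get(key(a, b));                // 0 if the pair was never seen
        }
    }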

Here, locking is needed only around the increment. Since such a lock hurts concurrency, I suggested using multiple maps, each with its own lock. The increments could also be grouped so that multiple operations are performed under a single lock acquisition.

A better solution might be to use one TIntIntHashMap per word. Imagine you put all the words (represented as ints) found in a document into a Set. Then you can loop over them as follows:

    for (int w1 : words) {
        getLock(w1).lock();
        TIntIntHashMap map = getMap(w1);
        for (int w2 : words) {
            if (isLess(w1, w2)) map.increment(w2);   // count the pair (w1, w2) once
        }
        getLock(w1).unlock();
    }

Here, isLess is an arbitrary antisymmetric, irreflexive relation used to avoid storing both (a, b) and (b, a). While simply using w1 < w2 would work, it would lead to rather unbalanced maps (getMap(0) would probably be large, while getMap(1000000) would be empty). Something like ((w1 - w2) ^ ((w1 + w2) << 31)) < 0 should do.
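One possible shape for the getLock, getMap and isLess helpers used above; the lock striping factor and the eager allocation of the per-word maps are assumptions, not something the answer prescribes.

    import java.util.concurrent.locks.ReentrantLock;
    import gnu.trove.map.hash.TIntIntHashMap;

    // Lock striping with 256 stripes keeps the number of lock objects small; the per-word
    // maps are allocated eagerly here but could be created lazily to save memory.
    class StripedPairCounts {
        private static final int STRIPES = 256;
        private final ReentrantLock[] locks = new ReentrantLock[STRIPES];
        private final TIntIntHashMap[] maps;

        StripedPairCounts(int vocabularySize) {
            for (int i = 0; i < STRIPES; i++) locks[i] = new ReentrantLock();
            maps = new TIntIntHashMap[vocabularySize];
            for (int i = 0; i < vocabularySize; i++) maps[i] = new TIntIntHashMap();
        }

        ReentrantLock getLock(int w1) { return locks[w1 % STRIPES]; }

        TIntIntHashMap getMap(int w1) { return maps[w1]; }

        // Antisymmetric, irreflexive relation that spreads pairs roughly evenly over the maps.
        static boolean isLess(int w1, int w2) {
            return ((w1 - w2) ^ ((w1 + w2) << 31)) < 0;
        }
    }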

+2

I have done statistics like this before; I divided the work into two steps.

step 1: multi-threaded counting: compute the partition id of each pair and write the pair directly to the corresponding partition file (partition_id = md5(pair) mod partition_count; the partitioning is the key point). I used a hash_map for the in-memory counts: when its size grows beyond a threshold, the map data is flushed to a file, which saves a lot of disk space, and I put the output files on a different disk, which speeds up the process.

step 2: multi-threaded merging: merge the per-partition counts produced by step 1 using a map (this runs in memory; if you do not have enough memory, choose a larger partition count).

notes: this is essentially a simple MapReduce job: step 1 is the map phase and step 2 is the reduce phase. The key part is the partitioning, which corresponds to the partition step that runs before the reduce phase in Hadoop.
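A small Java sketch of step 1's partition computation; hashing the pair as the text "w1 w2" is an assumption for the example, and MD5 is used only because the answer mentions it (any stable hash of the pair works).

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Hash the pair, then reduce the hash to a partition id in [0, partitionCount).
    class Partitioner {
        static int partitionOf(int w1, int w2, int partitionCount) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                        .digest((w1 + " " + w2).getBytes(StandardCharsets.UTF_8));
                int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                      | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
                return Math.floorMod(h, partitionCount);  // each pair lands in exactly one partition
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("MD5 is always available", e);
            }
        }
    }

Because all occurrences of the same pair land in the same partition file, each partition can later be merged independently and entirely in memory.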

+2

Here you hit the fundamental laws of complexity. You are trying to process a huge number of documents against a massive number of words and to produce a massive amount of data.

It will always be slow.

Some things that can speed it up:

  • Forget the list of a million words. Instead, just accept any word as you find it in the text; you can always filter the results later. If you do need to filter against the list, make sure the list is in an appropriate data structure (for example, a HashSet) that allows fast membership checks.

  • This workload is more likely to be IO-bound than CPU-bound, so try running it on fast SSDs, or, if the files are small enough, set up a RAM disk and run from that. Do some monitoring to determine where the bottlenecks are.

  • Processing each set of files is, as you have already determined, highly parallel, so you could spread it not only across several cores but also across several machines.

Something to try (the overhead of a database might slow things down): instead of merging at the end, simply accumulate the results for one document in memory while processing it. Once that document is finished, perform one insert into a database. The database then lets you query the results dynamically, using sum() etc. to find the totals for each word combination. This actually gives you a more flexible and useful result than a flat file, and it avoids a separate merge step.
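For illustration, a rough JDBC sketch of the per-document insert and the aggregate query; the cooccurrence table, its columns, and the packed-long keys are assumptions for the example, not part of the answer.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Map;

    // Assumes a table like: CREATE TABLE cooccurrence (word1 INT, word2 INT, cnt INT);
    class CooccurrenceStore {
        // One batched insert per processed document; docCounts maps a packed word pair to its count.
        static void insertDocumentCounts(Connection conn, Map<Long, Integer> docCounts)
                throws SQLException {
            String sql = "INSERT INTO cooccurrence (word1, word2, cnt) VALUES (?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (Map.Entry<Long, Integer> e : docCounts.entrySet()) {
                    long key = e.getKey();                  // two word ids packed into one long
                    ps.setInt(1, (int) (key >>> 32));
                    ps.setInt(2, (int) key);
                    ps.setInt(3, e.getValue());
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }

    // Totals per pair can then be queried with:
    //   SELECT word1, word2, SUM(cnt) FROM cooccurrence GROUP BY word1, word2;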

+1
