Calculating token counters on a huge dataset

I need to go through a huge amount of text (> 2 TB, a complete Wikipedia dump) and keep two counters for each token I see (each counter is incremented depending on the current event). The only operation I need on these counters is incrementing. In a second phase, I have to compute two floats based on these counters and store them.

The process must follow these steps:

  • Go through the huge amount of text and increment the two counters of each token found, depending on the current event.
  • Go through all the tokens and, for each of them, compute two additional floats based on these counters.
  • Allow queries (retrieving the values for any given token).

Requirements and other information:

  • It should scale to O(10^8) tokens.
  • The final result must be queryable very fast!
  • While going through the texts, only the two counters are incremented. This is a one-time processing step, so there will be no queries during it; only value updates.
  • No need for a dynamic / updatable schema.

I have tried CouchDB and MongoDB without very good results.

What do you think is the best approach to this problem?

Thanks!

EDIT 1: It was suggested that I try a Patricia trie and check whether all the keys fit in memory (I suspect they do not). A possible solution could be a custom Patricia trie with an extra operator to increment the value of a key in one step.

EDIT 2: Clarified what I mean by "huge": > 2 TB of text. Added further explanations.

EDIT 3: Unique token estimate. As Mike Dunlavey suggested, I made a quick estimate of the number of unique tokens. In the first 830 MB of the dataset, the number of unique tokens grows linearly up to 52134. Unless the number of unique tokens grows more slowly once more data has been processed (which is likely), there should be O(10^8) unique tokens.

EDIT 4: Java and Python solutions are preferred, but any other language is fine too.

EDIT 5: Tokens will typically contain only printable ASCII characters, but they may contain any printable Unicode character. I will try the same process both keeping lower and upper case intact, and with everything lowercased.

+4

6 answers

Strategy, not a solution:

There is no getting around scanning the input data in a single process; that is, I do not see how to parallelise the initial operation unless the file is on a parallel I/O system, and even then I think it could be difficult to process a 7z file in parallel.

However, what you can do is implement a process that reads the input and writes chunks of it across your file system, preferably onto enough different disks that the processes you are about to start do not all queue for the same read/write heads.

As soon as the first chunk has been written, you start a process on another core (you do have multi-core, possibly even a cluster or a network of workstations?) to start digesting that chunk. This process writes its partial results to file(s).

As soon as the second chunk has been written, you start a process on another core ...

... you get the picture.

Once all the input has been processed, you then devise tasks to merge the outputs of the chunk-processing tasks. You would do this in some kind of cascade (e.g. if you had 32 chunks and 16 processors, perhaps each merges 2 chunks, then 8 of them each merge 2 of the merged chunks, and so on).
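A minimal sketch of a single merge step, under the assumption that each chunk task writes a text file of "token count1 count2" lines sorted by token (the file names and format here are only illustrative):

    import heapq

    def merge_counts(path_a, path_b, path_out):
        """Merge two sorted 'token count1 count2' files, summing the counters
        of tokens that appear in both inputs."""
        def rows(path):
            with open(path) as f:
                for line in f:
                    token, c1, c2 = line.split()
                    yield token, int(c1), int(c2)

        with open(path_out, "w") as out:
            prev, sum1, sum2 = None, 0, 0
            # heapq.merge keeps the combined stream sorted by token
            for token, c1, c2 in heapq.merge(rows(path_a), rows(path_b)):
                if prev is not None and token != prev:
                    out.write("%s %d %d\n" % (prev, sum1, sum2))
                    sum1, sum2 = 0, 0
                prev = token
                sum1 += c1
                sum2 += c2
            if prev is not None:
                out.write("%s %d %d\n" % (prev, sum1, sum2))

Pairwise merges like this can then be cascaded exactly as described above (32 files, then 16, then 8, and so on).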

My best guess is that you would be fine with flat files for this; I am not sure whether the extra power of a database is worth the extra cost (in terms of performance and programming complexity). You might want to write the final results to a database to support queries.

EDIT: well, if all your queries are of the form "get me the counters for token XXX", then you could get away with a binary search through a single sorted text file. I am not suggesting you should do it that way, but it may point towards a solution. Forgetting for the moment that tokens could begin with any character (which is just a matter of the alphabet), you could have 26 files, one for tokens beginning with A, one for tokens beginning with B, and so on.

Or you could build an index into the main file, with entries such as A (offset 0 from the start of the file), B (offset 12456 from the start), and so on.
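For illustration, a sketch of building such a first-character offset index over one big sorted "token count1 count2" file and using it to narrow a lookup (file names are placeholders; tokens are assumed to be byte strings, e.g. b"stack"):

    def build_letter_index(path):
        """Map each initial byte to the offset of the first line starting with it."""
        index = {}
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                if line[:1] not in index:
                    index[line[:1]] = offset
                offset += len(line)
        return index

    def lookup(path, index, token):
        """Scan only the region of the sorted file that shares the token's first byte."""
        with open(path, "rb") as f:
            f.seek(index.get(token[:1], 0))
            for line in f:
                fields = line.split()
                if fields[0] == token:
                    return int(fields[1]), int(fields[2])  # the two counters
                if fields[0][:1] != token[:1]:
                    break  # we have left the region for this initial character
        return None  # token not found

Within each region you could of course binary-search instead of scanning linearly.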

Personally, I would play around with the one-file-per-initial-letter approach until I had a working solution, then figure out whether it is good enough. But I have access to large clusters with plenty of disk and RAM to spare, and your platform may dictate another, possibly more sophisticated, approach.

+1

As I understand it, you only want to count tokens. The first solution could be just to use a hash map in memory. 52-100k tokens (with English words averaging about 5.1 characters) plus 4 bytes per token to hold the count is not that much data. You can easily keep the map in memory on a developer machine.
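A minimal sketch of that in-memory approach, assuming a hypothetical tokens() generator over the corpus and keeping the asker's two counters per token (the final float formula below is only a placeholder):

    from collections import defaultdict

    counters = defaultdict(lambda: [0, 0])  # token -> [count1, count2]

    def process(token, event):
        # increment whichever of the two counters the current event calls for
        counters[token][0 if event == "event1" else 1] += 1

    # phase 1: a single pass over the corpus
    # for token, event in tokens("enwiki-dump.xml"):   # hypothetical tokenizer
    #     process(token, event)

    # phase 2: derive the two floats from the counters (placeholder formula)
    results = {}
    for token, (c1, c2) in counters.items():
        total = float(c1 + c2)
        results[token] = (c1 / total, c2 / total)

Whether this fits in RAM depends entirely on the number of unique tokens (52k is trivial; the O(10^8) estimated in EDIT 3 is not).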

The second solution is to use apache lucene for storing new tokens (unless you have on the order of 1M entries, you do not need to partition the index) plus a counter value that I would store in a database such as sqlite (because updating the lucene index is not a good idea).

To speed up the process, for both solutions, I would simply split your dataset into k*100 datasets and run them separately on different machines (or in parallel), and then merge their results. The results of your counting you can sum up without any problems.
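For the merge step, a sketch of summing per-chunk results, assuming each worker returns a plain dict of token -> (count1, count2):

    def combine(partial_results):
        """Sum the per-token counter pairs produced by independent workers."""
        totals = {}
        for partial in partial_results:
            for token, (c1, c2) in partial.items():
                t1, t2 = totals.get(token, (0, 0))
                totals[token] = (t1 + c1, t2 + c2)
        return totals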

Your use case is a classic example from the apache hadoop tutorials, but deploying it would probably be overkill.

+1

If you have plenty of memory, you could just use plain redis to store the counters (10^8 unique tokens with two counters each would take around 12 GB, I think).

If you do not have that much memory, you can still use redis, but with a small hashing strategy and vm_enabled so that it fits in memory:

You would have tokens bucketed by their first and second letters (aa, ab, ac ... zz) as the hash name, the actual word + counter identifier as the hash field, and the count as the value. It would look like this:

    hash ab
      - absence_c1 5
      - absence_c2 2
      - abandon_c1 2
      - abandon_c2 10
    hash st
      - stack_c1 10
      - stack_c2 14

But with this approach, since redis cannot "incr" on hash fields, you would get the previous value, increment it and set it back, like this (pseudo-code):

    var last = redis("hget st stack_c1")
    var actual = last + 1
    redis("hset st stack_c1 " + actual)
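A sketch of that read-modify-write pattern using the redis-py client, following the key and field names from the example above; note that Redis 2.0 and later also provide HINCRBY, which does the hash-field increment atomically in a single call:

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost

    def incr_counter(token, counter):           # counter is "c1" or "c2"
        bucket = token[:2]                      # hash name: first two letters, e.g. "st"
        field = "%s_%s" % (token, counter)      # hash field, e.g. "stack_c1"
        last = int(r.hget(bucket, field) or 0)  # read ...
        r.hset(bucket, field, last + 1)         # ... increment and write back (not atomic)
        # atomic alternative on Redis >= 2.0:
        # r.hincrby(bucket, field, 1)

    incr_counter("stack", "c1")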

Using this hashing scheme and with vm enabled, redis will keep memory usage low and still be fairly fast. I was able to store 2 million tokens, 15 characters each, using less than 100 MB of RAM and almost 4 GB of disk.

+1

High-level solution:

  • Parse the input, writing "[token] +X +Y" lines out to N sharded output files (each of these sharded output files is small enough to be processed in memory).
  • [For each file] read it into memory and write out a sorted file of "[token] [count1] [count2] ..." lines.
  • At query time, do a binary search in the appropriate file.

More detail: here is the Python pseudo-code for step 1)

    NUM_SHARDS = 1000  # big enough to make each shard file fit in memory

    # 'input_stream' is assumed to be an iterable of tokens from the corpus
    output_files = [open("file" + str(n), "w") for n in xrange(NUM_SHARDS)]

    for token in input_stream:
        shard_id = hash(token) % NUM_SHARDS
        output_files[shard_id].write(token + " +0 +1\n")
        # TODO: output the correct +X and +Y as needed

Here is the Python pseudo-code for step 2)

    input_files = [open("file" + str(n)) for n in xrange(NUM_SHARDS)]

    for shard_file in input_files:
        counts = {}  # key: token, value: { "count1": 0, "count2": 0 }

        # read the shard file and populate 'counts'
        for line in shard_file:
            (token, count1, count2) = line.split()
            # make sure we have an entry for this token
            counts.setdefault(token, { "count1": 0, "count2": 0 })
            counts[token]["count1"] += int(count1)
            counts[token]["count2"] += int(count2)

        # TODO: compute those floats, and stuff them inside 'counts' as well

        # now write 'counts' out to a file, in sorted order
        output_file = open(shard_file.name + ".index", "w")
        for token, token_counts in sorted(counts.items()):
            output_file.write(token + " " + str(token_counts["count1"]) +
                              " " + str(token_counts["count2"]) + "\n")
            # TODO: also write out those floats on the same line
        output_file.close()

The following is the Python code for step 3):

    import os

    # print out the line in 'file' whose first token is 'token'
    # begin/end always point to the start of a line
    def binary_search(token, file, begin, end):
        # if we're close, just do brute force
        if end - begin < 10000:
            file.seek(begin)
            while file.tell() < end:
                line = file.readline()
                cur_token = line.strip().split(" ")[0]
                if cur_token == token:
                    print line
                    return True
            return False  # not found

        # if we're not close, pivot based on a line near the middle
        file.seek((begin + end) / 2)
        partial_line = file.readline()  # ignore the first (fractional) line
        line = file.readline()
        cur_token = line.strip().split(" ")[0]
        if cur_token == token:
            print line
            return True
        elif cur_token < token:
            return binary_search(token, file, file.tell(), end)
        else:  # cur_token > token
            return binary_search(token, file, begin, file.tell() - len(line))

    # assume 'token' contains the token you want to find
    shard_id = hash(token) % NUM_SHARDS
    filename = "file" + str(shard_id) + ".index"
    binary_search(token, open(filename), 0, os.path.getsize(filename))
+1

Well, if MongoDB and CouchDB are not working for you, then you basically have one problem: not enough horsepower.

Look at the laundry list:

It should scale to O(10^8) tokens.

How much RAM do you have? You are talking about hundreds of millions of tokens, and you are talking about streaming in a 7zip file. If you want the "increments" to be fast, you need to be able to keep the entire data structure in memory, or the whole thing will go very slowly.

The final result must be queryable very fast!

How fast? Microseconds, milliseconds, hundreds of milliseconds? If you want to query 500M records on a machine with 8 GB of RAM, you are pretty much hosed. The data just will not fit, no matter which database you use.

Dataset > 2 TB

OK, let's say that your computer can average about 50 MB/s of sustained throughput and that your process can actually decompress the data at that rate. At that pace, you are talking about 11+ hours of processing time just to stream the data (did you want this done over the weekend?)
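The back-of-the-envelope arithmetic behind that figure, assuming the full 2 TB is read at a sustained 50 MB/s:

    dataset_mb = 2 * 1000 * 1000          # 2 TB expressed in MB
    throughput_mb_s = 50.0                # assumed sustained read rate
    hours = dataset_mb / throughput_mb_s / 3600
    # hours == 11.1...  ->  over 11 hours just to stream the data once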

Sustaining 50 MB/s of throughput for 11 hours is no small feat for a real drive. And if you try to write anything to disk while that is happening (or the OS starts swapping), things will degrade quickly.

Looking at it from the database point of view, MongoDB can handle both the front-end updating and the back-end querying. But it needs to flush to disk regularly, and that will significantly extend your 11-hour run time.

That total runtime is only going to get worse and worse unless you can handle the entire database in memory and the entire stream in memory.

My point ...

quite simply, you need more horsepower.

If you are not running this operation with 24 GB+ of RAM, then everything you do is going to feel slow. If you do not have 24 GB+ of RAM, then your final dataset is not going to be "lightning fast", at best it will be "200 ms fast". You cannot index 500M rows and expect to find a record quickly unless you can keep the index in RAM.

If you are not running this operation with big, fast hard drives, then the operation is going to seem slow. I mean, you are talking about hours and hours of high-throughput, sustained reads (and probably writes).

I know that you want help, and I know that you have put a bounty on this question, but it is really hard to fix the following problem:

I have tried CouchDB and MongoDB without very good results.

when it sounds like you didn't put together the right gear to solve the problem.

+1

Do you have to use a database, rather than just reading a text file?

A simple compiled language of the C variety can run a simple parser in a fraction of the time it takes to read the file, so it should basically be "I/O bound". It would be a program similar to unix wc, the word-count utility.

It sounds like the math is trivial and should not even be noticeable.

EDIT: OK, I did not understand that you want to build a dictionary of unique tokens and count each one. In that case, a trie-based or hash-based dictionary should suffice. The storage size will depend on the typical token length and on the number of distinct tokens. It could be something like the unix sort | uniq idiom.

0
