How can I group a large data set

I have a simple text file containing two columns, both integers:

    1 5
    1 12
    2 5
    2 341
    2 12

etc.

I need to group the data set by the second value, so the output would be:

    5 1 2
    12 1 2
    341 2

Now the problem is that the file is very large, about 34 GB in size. I tried to write a Python script to group the pairs into a dictionary whose values are arrays of integers, but it takes far too long. (I assume that allocating the array('i') objects and extending them on every append takes a lot of time.)

Now I plan to write a Pig script that I intend to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance).

    data = LOAD 'Net.txt';
    gdata = GROUP data BY $1;
    -- I know it will lead to 5 (1,5) (2,5) but that's okay for this snippet
    STORE gdata INTO 'res.txt';

I wanted to know if there was an easier way to do this.

Update: keeping such a large file in memory is out of the question. For the Python solution, my plan was to do 4 runs: in the first run only rows whose second-column value is between 1 and 10 million are considered, in the next run values between 10 million and 20 million, and so on. But it turned out to be very slow.

The Pig/Hadoop solution is interesting because it keeps everything on disk [well, most of it].

For context, this data set holds the connections of ~45 million Twitter users, and the format means that the user given by the first number follows the user given by the second number.

The solution I used:

    import array

    class AdjDict(dict):
        """ A special dictionary class to hold an adjacency list """
        def __missing__(self, key):
            """ __missing__ is overridden so that when a key is not found,
            an integer array is initialized for it """
            self.__setitem__(key, array.array('i'))
            return self[key]

    Adj = AdjDict()

    for line in open("net.txt"):
        entry = line.strip().split('\t')
        node = int(entry[1])
        follower = int(entry[0])
        if node < 10 ** 6:
            Adj[node].append(follower)

    # Code for writing the Adj matrix to the file:
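The write-out code was cut off above; here is a minimal sketch of how it could look, assuming the output uses the same "key followed by its followers" layout as the example at the top and a hypothetical file name res_adj.txt:

    # Hypothetical write-out step (truncated in the original): one line per node,
    # node id first, then every follower collected for it.
    with open("res_adj.txt", "w") as out:
        for node, followers in Adj.items():
            out.write("%d %s\n" % (node, " ".join(str(f) for f in followers)))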
+4
5 answers

Assuming you have ~17 characters per line (a number I chose at random to make the math easier), you have about 2 billion entries in this file. Unless you have a lot of physical memory on a 64-bit system, you will be thrashing your page file trying to hold all of this in memory in a single dict. And that is just to read it in as a data structure - presumably after this structure is built, you actually plan to do something with it.

With such a simple data format, I would think you'd be better off doing something in C instead of Python. Parsing this data should not be difficult, and you will have far less per-value overhead. At a minimum, it takes 8 GB just to hold 2 billion 4-byte integers (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they fit in a byte or a short, you can use smaller integer variables, which will be worth the trouble for a data set of this size).
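To make the size argument concrete, here is a rough back-of-the-envelope check (a sketch only; the 2-billion figure is the estimate above, and the typecodes are just the standard array-module options for 4-, 2- and 1-byte signed integers):

    from array import array

    entries = 34 * 10 ** 9 // 17            # ~2 billion lines, per the estimate above
    for typecode in ('i', 'h', 'b'):        # 4-byte, 2-byte and 1-byte signed ints
        bytes_needed = entries * array(typecode).itemsize
        print("%s: %.1f GB just for the raw values" % (typecode, bytes_needed / 1e9))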

+2

If I had to solve this on my current hardware, I would probably write a few small programs:

The first would work on the file in 500-megabyte chunks, swapping the two columns and writing the result to new files. (You will get 70 or more of them.) (This doesn't need much memory.)
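A minimal sketch of that first step, assuming tab-separated input like the question's net.txt; the line count per chunk and the chunk file names are placeholders (roughly 500 MB at ~17 bytes per line):

    # Split the input into chunks, swapping the two columns so the group key comes first.
    lines_per_chunk = 30 * 10 ** 6        # ~500 MB at ~17 bytes/line (assumption)
    chunk, out = 0, open("chunk_0.txt", "w")
    for i, line in enumerate(open("net.txt")):
        follower, node = line.split()
        out.write("%s\t%s\n" % (node, follower))
        if (i + 1) % lines_per_chunk == 0:
            out.close()
            chunk += 1
            out = open("chunk_%d.txt" % chunk, "w")
    out.close()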

Then I would call the OS-supplied sort(1) on each small file. (This may take a few gigs of memory.)
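Driving sort(1) from Python could look roughly like this (a sketch; it sorts each chunk numerically on both columns so the later merge and grouping see a consistent order):

    import glob
    import subprocess

    # Sort each chunk file in place, numerically by the first column, then the second.
    for fname in glob.glob("chunk_*.txt"):
        subprocess.check_call(["sort", "-n", "-k1,1", "-k2,2", "-o", fname, fname])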

Then I would write a merge-sort program that merges the lines from all 70-odd subfiles. (This doesn't need much memory.)
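This step does not have to be written from scratch; Python's heapq.merge does a lazy k-way merge over already-sorted inputs. A sketch, assuming the sorted chunk files from the previous steps:

    import glob
    import heapq

    def keyed(fname):
        """Yield (node, follower) integer pairs from one sorted chunk file."""
        for line in open(fname):
            node, follower = line.split()
            yield int(node), int(follower)

    # Lazily merge all sorted chunks into a single sorted stream of pairs.
    merged = heapq.merge(*(keyed(f) for f in glob.glob("chunk_*.txt")))
    # `merged` can now be written to one big sorted file or fed straight to the
    # grouping pass described below.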

Then I would write a program that runs through the big sorted list; you will have a bunch of lines like:

    5 1
    5 2
    12 1
    12 2

and it will need to produce:

    5 1 2
    12 1 2

(This doesn't need much memory.)
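Because the input is already sorted on the first column, itertools.groupby can do this last step in a single streaming pass. A sketch, assuming the merged, sorted pairs sit in a hypothetical merged.txt:

    from itertools import groupby

    def pairs(fname):
        """Yield (key, value) string pairs from a sorted two-column file."""
        for line in open(fname):
            key, value = line.split()
            yield key, value

    # One output line per key: the key followed by every value seen for it.
    with open("grouped.txt", "w") as out:
        for key, group in groupby(pairs("merged.txt"), key=lambda p: p[0]):
            out.write("%s %s\n" % (key, " ".join(v for _, v in group)))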

By breaking it into smaller pieces, hopefully you can keep the RSS down to something that fits on a reasonable machine - it will mean more disk I/O, but on anything short of astonishing hardware, swap usage would kill any attempt to handle this in one big program.

+1

Perhaps you could make multiple passes over the file.

Each pass handles a range of keys; for example, if you picked a range size of 100:

1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
...etc.

For your sample, the first pass would output

    5 1 2
    12 1 2

and the 4th pass will give

    341 2

Choose a range size such that the dict you create fits in your RAM.
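A sketch of one such pass, assuming tab-separated input like the question's net.txt; the range boundaries are placeholders and the second column is treated as the key, as in the question:

    from collections import defaultdict

    def one_pass(lo, hi, infile="net.txt"):
        """Group only the keys in [lo, hi); run again with the next range for the next pass."""
        groups = defaultdict(list)
        for line in open(infile):
            follower, node = line.split()
            node = int(node)
            if lo <= node < hi:
                groups[node].append(follower)
        return groups

    # e.g. 1st pass: one_pass(0, 100), 2nd pass: one_pass(100, 200), and so on.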

I wouldn't bother with multiprocessing to try to speed this up with several cores; unless you have a very fast hard drive, this should be IO-bound, and you would just end up thrashing the disk.

+1

If you are working with a 34 GB file, I assume that the hard drive, both in terms of storage and access time, is not a problem. How about reading the pairs sequentially and, when you find a pair (x, y), opening the file "x", appending "y", and closing the file "x"? In the end you would have one file per Twitter user ID, each containing all the users it is connected to. You can then merge all of those files if you want the result in the specified output format.
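A minimal sketch of that idea (the file-name prefix is hypothetical); note that opening and closing a file for every input line is simple but produces an enormous number of tiny writes:

    # Read the pairs sequentially; for each pair (x, y), append y to the file named x.
    # (To match the question's grouping - all first-column values per second-column
    #  key - use the second column as the file name instead.)
    for line in open("net.txt"):
        x, y = line.split()
        with open("pair_%s.txt" % x, "a") as f:
            f.write(y + "\n")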


That being said, I really think that (a) for such a large data set, exact resolution is not appropriate, and (b) there is probably a better way to measure connectivity, so perhaps you'd like to tell us about your end goal.

Indeed, you have a very large graph, and many efficient techniques have been developed to study the shape and properties of huge graphs. Most of these techniques are designed to work as streaming, online algorithms.

For example, a technique called triangle counting, combined with probabilistic cardinality estimation algorithms, provides information about the cliques contained in your graph efficiently and quickly. For a better idea of the triangle counting aspect and how it relates to graphs, see for example this (randomly selected) article.

+1

I had a similar requirement; you just need one more Pig statement to remove the redundancies like 5 (1,5) (2,5).

    a = LOAD 'edgelist' USING PigStorage('\t') AS (user:int, following:int);
    b = GROUP a BY user;
    x = FOREACH b GENERATE group.user, a.following;
    STORE x INTO 'following-list';
+1
