I have a simple text file containing two columns, both integers:

1 5
1 12
2 5
2 341
2 12

etc.
I need to group the data set by the second value, so the output would be:

5 1 2
12 1 2
341 2
Now the problem is that the file is very large, about 34 GB in size. I tried to write a Python script to group the rows into a dictionary that maps each second-column value to an array of integers, but it takes too long (I assume that allocating the array('i') objects and extending them on every append costs a lot of time).
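Roughly, the grouping I tried has this shape (a simplified sketch using plain lists; the actual code with array('i') is at the end of this post):

groups = {}                                  # second column -> list of first-column values
for line in open("net.txt"):
    follower, node = line.split()
    groups.setdefault(int(node), []).append(int(follower))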
Now I plan to write a Pig script, which I intend to run on a pseudo-distributed Hadoop setup (an Amazon EC2 High-Memory Large instance).
data = load 'Net.txt';
gdata = Group data by $1;   -- I know it will lead to 5, {(1,5),(2,5)}, but that's okay for this snippet
store gdata into 'res.txt';
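I plan to test the script locally first with something like pig -x local group.pig (the script file name is just a placeholder) before running it on the pseudo-distributed setup.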
I wanted to know if there was an easier way to do this.
Update: keeping such a large file in memory is out of the question. For the Python solution, I planned to do 4 passes over the file: in the first pass only rows whose second-column value lies between 1 and 10 million are considered, in the next pass 10 million to 20 million, and so on. But it turned out to be very slow.
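To make the plan concrete, each pass looks roughly like this (the range bounds and file handling are illustrative, not the exact code I ran):

import array
from collections import defaultdict

CHUNK = 10 ** 7                              # 10 million IDs per pass

for pass_no in range(4):
    lo, hi = pass_no * CHUNK, (pass_no + 1) * CHUNK
    adj = defaultdict(lambda: array.array('i'))
    with open("net.txt") as f:
        for line in f:
            follower, node = map(int, line.split())
            if lo <= node < hi:              # only keep this pass's ID range
                adj[node].append(follower)
    # write adj out to disk here, then free it before the next pass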
The Pig/Hadoop solution is interesting because it keeps everything on disk [well, most of it].
For a better understanding: this data set holds the connections of ~45 million Twitter users, and the format of each line means that the user given by the first number follows the user given by the second number.
The solution I used:
import array

class AdjDict(dict):
    """ A special dictionary class to hold an adjacency list """
    def __missing__(self, key):
        """ Missing is changed such that when a key is not found
            an integer array is initialized for it """
        self.__setitem__(key, array.array('i'))
        return self[key]

Adj = AdjDict()

for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)
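After each pass I still have to write the grouped lists to disk before moving on to the next ID range; something along these lines (the file name and output format are just placeholders):

with open("adj_part_0.txt", "w") as out:     # placeholder name for this pass's output
    for node in sorted(Adj):
        out.write("%d\t%s\n" % (node, " ".join(str(f) for f in Adj[node])))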