Read a tab-delimited file into a dictionary with the first column as the key and the rest as values

I have a tab-delimited file with 1 billion rows like these (suppose 200 columns instead of 3):

abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232

I want to create a dictionary where the value in the first column is the key and the rest are the values. I did it this way, but it is computationally expensive:

import io

dictionary = {}
with io.open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        k, v = kv[0], kv[1:]
        dictionary[k] = list(map(float, v))

How else can I get the desired dictionary? In fact, a numpy array would be more appropriate than a float list for a value.

+8
python dictionary numpy pandas csv
source share
5 answers

You can use pandas to load a df, then build a new df as desired, and then call to_dict:

In [99]:
t = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None)
df = pd.DataFrame(columns=df[0], data=df.iloc[:, 1:].values)
df.to_dict()
Out[99]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}

EDIT

A more dynamic method and one that would reduce the need to create a temporary df:

In [121]:
t = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
# determine the number of cols, we'll use this in usecols
col_len = pd.read_csv(io.StringIO(t), sep='\s+', nrows=1).shape[1]
col_len
# read the first col, we'll use this in names
cols = pd.read_csv(io.StringIO(t), sep='\s+', usecols=[0], header=None)[0].values
# now read and construct the df using the determined usecols and names from above
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None,
                 usecols=list(range(1, col_len)), names=cols)
df.to_dict()
Out[121]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}

Further update

Actually you do not need to read the column count first; it can be obtained implicitly from the length of the first column:

In [128]:
t = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
cols = pd.read_csv(io.StringIO(t), sep='\s+', usecols=[0], header=None)[0].values
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None,
                 usecols=list(range(1, len(cols) + 1)), names=cols)
df.to_dict()
Out[128]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
+4

You can use the csv module to read the file so you do not have to split the lines yourself, then use np.array to convert the float values to a numpy array:

import csv
import numpy as np

dictionary = {}
with open('bigfile.csv', 'r', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter='\t')
    for row in spamreader:
        k, v = row[0], row[1:]  # in Python 3 you can also write: k, *v = row
        dictionary[k] = np.array(v, dtype=float)
+2

You can use the numpy.genfromtxt() function if you specify the number of columns:

import numpy as np

a = np.genfromtxt('bigfile.csv', dtype=str, usecols=(0,))
b = np.genfromtxt('bigfile.csv', dtype=float, delimiter='\t',
                  usecols=range(1, 4))  # <- enter # of cols here
d = dict(zip(a, b.tolist()))  # if you want a numpy array, just remove .tolist()
print(d)

Output:

{'abc': [-0.123, 0.6524, 0.325],
 'bar': [0.23123, -0.123124, -0.1232],
 'foo': [-0.9808, 0.874, -0.2341]}

Note: to find the number of columns programmatically, you could do:

with open('bigfile.csv', 'r') as f:
    num_cols = len(f.readline().split())

And then use num_cols for the usecols parameter.
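Putting those pieces together, a minimal sketch (just an illustration, assuming the same tab-delimited 'bigfile.csv' as above) could look like:

import numpy as np

# count the columns from the first line of the file
with open('bigfile.csv', 'r') as f:
    num_cols = len(f.readline().split())

# first column as string keys, the remaining columns as floats
a = np.genfromtxt('bigfile.csv', dtype=str, delimiter='\t', usecols=(0,))
b = np.genfromtxt('bigfile.csv', dtype=float, delimiter='\t',
                  usecols=list(range(1, num_cols)))

d = dict(zip(a, b))  # values are numpy array rows; use b.tolist() for plain lists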

+2

One way is to use Pandas. Assuming you do df = pd.read_csv(file) and df looks like this:

In [220]: df
Out[220]:
     k       a1        a2      a3
0  abc -0.12300  0.652400  0.3250
1  foo -0.98080  0.874000 -0.2341
2  bar  0.23123 -0.123124 -0.1232

I added dummy column names; you have the flexibility to change these when reading the csv file.

Then you can do the following.

In [221]: df.set_index('k').T.to_dict('list')
Out[221]:
{'abc': [-0.12300000000000001, 0.65239999999999998, 0.32500000000000001],
 'bar': [0.23123000000000002, -0.123124, -0.1232],
 'foo': [-0.98080000000000001, 0.87400000000000011, -0.2341]}
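Since the question mentions that numpy arrays would be preferable to lists as values, a small variation of the same idea (just a sketch, reusing the df from above) would be:

keyed = df.set_index('k')
# map each key to its row as a numpy array instead of a list
d = dict(zip(keyed.index, keyed.values))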
0

Sorry, this is not an answer, but too long for a comment.

You say you have 1 billion rows with 200 float columns. That implies a minimum memory footprint of

10^9 * 200 * 8 = 1.6 * 10^12 bytes

This gives more than 1.5 TB, not counting the overhead for a dict.

Of course, you could use numpy arrays instead of lists of floats, but each array is small (200 elements), so I doubt very much that the gain would be significant.

IMHO, with this much data the loading phase should not be your main concern, no matter how you process the data afterwards, and if you really need a dict of a billion records of 200 float values each, your current implementation is correct, whether the values are lists or numpy arrays.

You could get a significant gain in further processing if you had all the data in one numpy array and used numpy for part of the processing, but without knowing more, that is just a guess.
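For illustration only, a rough sketch of that single-array idea (assuming the tab-delimited 'bigfile' from the question; the names keys, values and index are made up here):

import pandas as pd

# one pass with pandas: first column as keys, the rest as one big float matrix
df = pd.read_csv('bigfile', sep='\t', header=None)
keys = df.iloc[:, 0].values      # 1-D array of row keys
values = df.iloc[:, 1:].values   # single (n_rows, n_cols - 1) float array
index = {k: i for i, k in enumerate(keys)}

# looking up a key is then just a view into the big array
row = values[index['abc']]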

0
