How can I efficiently load such ASCII files using Python?

I have large ASCII files formatted as follows:

    x y z num_line index
    1 float
    2 float
    ...
    num_line float
    x2 y2 z2 num_line2 index2
    1 float
    2 float
    ...
    num_line2 float
    ...

The number of blocks can reach thousands, and the number of rows in each block can reach hundreds.

Here is an example of what I get:

    0.0 0.0 0.0 4 0
    1 0.5
    2 0.9
    3 0.4
    4 0.1
    0.0 0.0 1.0 4 1
    1 0.2
    2 0.2
    3 0.4
    4 0.9
    0.0 1.0 2.0 5 2
    1 0.7
    2 0.6
    3 0.9
    4 0.2
    5 0.7

And what I want from this (as a numpy matrix):

    0.5  0.2  0.7
    0.9  0.2  0.6
    0.4  0.4  0.9
    0.1  0.9  0.2
    nan  nan  0.7

Of course I can use:

    from numpy import zeros, array

    my_mat = []
    with open("myfile", "r") as f_in:
        niter = int(f_in.readline().split()[3])
        while niter:
            curr_vect = zeros(niter)
            for i in xrange(niter):
                curr_vect[i] = float(f_in.readline().split()[1])
            my_mat.append(curr_vect)
            line = f_in.readline()
            if line:  # readline() returns '' at end of file
                niter = int(line.split()[3])
            else:
                niter = False
    my_mat = array(my_mat)

The problem is that it is not very efficient and too complicated for what it does. I already know about numpy's loadtxt and genfromtxt, but they don't seem to apply here.

I am looking for something faster and easier to read. Any ideas?

EDIT:

Please forgive me, my question was not complete, and some of you may have wasted time because of it. Here is a real example of such a block:

    3.571428571429E-02 3.571428571429E-02-3.571428571429E-02 1 35
    1 -0.493775207966779
    2 0.370269037864060
    3 0.382332033744703
    4 0.382332033744703
    5 0.575515346181205
    6 0.575515346181216
    7 0.575562530624028
    8 0.639458035564442
    9 0.948445367602052
    10 0.948445367602052
    11 0.975303238888803
    12 1.20634795229899
    13 1.21972845646758
    14 1.21972845646759
    15 1.52659950368213
    16 2.07381346028515
    17 2.07629743909555
    18 2.07629743909555
    19 2.15941179949552
    20 2.15941179949552
    21 2.30814240005132
    22 2.30814240005133
    23 2.31322868361483
    24 2.53625115348660
    25 2.55301153157825
    26 2.55301153157826
    27 2.97152031842301
    28 2.98866790318661
    29 2.98866790318662
    30 3.24757159459268
    31 3.27186643004142
    32 3.27186643004143
    33 3.37632477135298
    34 3.37632477135299
    35 3.55393884607834
3 answers
    import numpy as np
    from itertools import groupby, izip_longest

    def f1(fname):
        with open(fname) as f:
            return np.matrix(list(izip_longest(
                *(map(lambda x: float(x[1]), v)
                  for k, v in groupby(map(str.split, f), key=lambda x: len(x) == 2)
                  if k),
                fillvalue=np.nan)))

    f1('testfile')

which returns:

    matrix([[ 0.5,  0.2,  0.7],
            [ 0.9,  0.2,  0.6],
            [ 0.4,  0.4,  0.9],
            [ 0.1,  0.9,  0.2],
            [ nan,  nan,  0.7]])

EDIT:

In terms of performance, I tested it against @TheodrosZelleke's np.genfromtxt solution, and it seems to be about five times faster.
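For reference, a comparison like this can be reproduced with the standard timeit module. This is only a sketch: it assumes both approaches are wrapped in functions (f1 above, and a hypothetical f2 wrapping the genfromtxt-based answer) and that 'testfile' contains the sample data.

    import timeit

    # time 100 runs of each parser; f1 is defined above, f2 is assumed to wrap
    # the genfromtxt-based solution from the other answer
    print timeit.timeit("f1('testfile')", setup="from __main__ import f1", number=100)
    print timeit.timeit("f2('testfile')", setup="from __main__ import f2", number=100)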


You can use numpy.genfromtxt:

  • read single columns, by using the line break \n as the delimiter
  • provide a custom converter function

Example:

    import numpy as np
    from StringIO import StringIO

    # your data from above as a string
    raw = '''0.0 0.0 0.0 4 0
    1 0.5
    2 0.9
    3 0.4
    4 0.1
    0.0 0.0 1.0 4 1
    1 0.2
    2 0.2
    3 0.4
    4 0.9
    0.0 1.0 2.0 5 2
    1 0.7
    2 0.6
    3 0.9
    4 0.2
    5 0.7
    '''

Here is the converter:

    def custom_converter(line):
        token = line.split()
        if len(token) == 2:
            return float(token[1])
        else:
            return np.NaN
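To illustrate, the converter maps a header line to NaN and a data line to its value, e.g. with lines taken from the example above:

    print custom_converter("0.0 0.0 0.0 4 0")  # header line -> nan
    print custom_converter("1 0.5")            # data line   -> 0.5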

Load the data:

    data = np.genfromtxt(StringIO(raw), delimiter='\n',
                         converters={0: custom_converter})
    print data

which prints:

 [ nan 0.5 0.9 0.4 0.1 nan 0.2 0.2 0.4 0.9 nan 0.7 0.6 0.9 0.2 0.7] 

Now you create the final data structure:

    # the NaN entries mark the header line of each block
    delims, = np.where(np.isnan(data))
    nblocks = delims.size
    delims = delims.tolist()
    delims.append(data.size)
    # longest block; the -1 removes the header entry counted by diff
    max_block = np.max(np.diff(delims)) - 1

    final_data = np.empty([max_block, nblocks]) + np.NaN

    # copy each block into its own column
    low = delims[0] + 1
    for i, up in enumerate(delims[1:]):
        final_data[0: up - low, i] = data[low:up]
        low = up + 1

    print final_data

which prints

    [[ 0.5  0.2  0.7]
     [ 0.9  0.2  0.6]
     [ 0.4  0.4  0.9]
     [ 0.1  0.9  0.2]
     [ nan  nan  0.7]]

This would be much faster if you could make sure that each block has the same number of lines (even if that means padding them with zeros): then you could simply reshape the array read with loadtxt (a sketch of that shortcut is shown after the result below). Without that guarantee, here is an example that might still be a little faster:

    import numpy as np

    data = np.loadtxt("myfile", usecols=(0, 1), unpack=True)

    nx = np.sum(data[0] == 0)   # number of blocks (header lines start with 0.0)
    ny = np.max(data[0])        # length of the longest block

    my_mat = np.empty((nx, ny), dtype='d')
    my_mat[:] = np.nan          # if you really want NaNs for the missing entries

    # length of each block, read off the last row index before the counter resets
    tr_ind = data[0, list(np.nonzero(np.diff(data[0]) < 0)[0]) + [-1]].astype('i')
    # the float values, with the header lines (first column == 0) dropped
    buf = np.squeeze(data[1, np.nonzero(data[0])])

    idx = 0
    for i in range(nx):
        my_mat[i, :tr_ind[i]] = buf[idx: idx + tr_ind[i]]
        idx += tr_ind[i]

And you can check the result:

    >>> my_mat.T
    array([[ 0.5,  0.2,  0.7],
           [ 0.9,  0.2,  0.6],
           [ 0.4,  0.4,  0.9],
           [ 0.1,  0.9,  0.2],
           [ nan,  nan,  0.7]])
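For completeness, the "same number of lines per block" shortcut mentioned at the top of this answer would look roughly like the sketch below. It only applies if every block really has the same number of data lines; nrows here is a hypothetical fixed block length, so this does not work on the ragged file from the question.

    import numpy as np

    nrows = 4  # hypothetical: every block has exactly nrows data lines
    col = np.loadtxt("myfile", usecols=(1,))    # second column of every line
    blocks = col.reshape(-1, nrows + 1)[:, 1:]  # drop the header entry of each block
    my_mat = blocks.T                           # shape (nrows, number of blocks)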

UPDATE: as TheodrosZelleke pointed out, the solution above fails if x2 (the first column) is nonzero. I did not notice that at first. Here is an update to work around it:

    # this gives a conversion warning because the number of columns varies
    blk_sizes = np.genfromtxt("myfile", invalid_raise=False, usecols=(-2,))
    nx = blk_sizes.size        # number of blocks
    ny = np.max(blk_sizes)     # length of the longest block

    data = np.loadtxt("myfile", usecols=(1,))

    my_mat = np.empty((nx, ny), dtype='d')
    my_mat[:] = np.nan

    idx = 1
    for i in range(nx):
        my_mat[i, :blk_sizes[i]] = data[idx: idx + blk_sizes[i]]
        idx += blk_sizes[i] + 1   # skip the header entry of the next block

(And then take my_mat.T.)

