Reading a large formatted text file using NumPy

Question

I volunteered to help someone convert a finite element mesh from one format to another (i-deas *.unv to Alberta). I have used NumPy for some additional processing of the mesh, but I am having trouble reading the raw text file into NumPy arrays. I have tried genfromtxt and loadtxt with no success so far.

Some information:

1) All groups are delimited by a header and a footer flag, "-1", on its own line.

2) The NODE group has the header "2411" on its own line. I only want to read alternate lines from this group, skipping each line of integers and reading each line with three Fortran double-precision numbers.

3) The ELEMENT connectivity group has the header "2412" on its own line. All data are integers, and only the first 4 columns need to be read. There will be some empty slots in the NumPy array because the 2- and 3-node elements have fewer values.

4) The "2477" NODE groups I think I can deal with myself, using regular expressions to find which lines to read.

5) The real data file will have about 1 million lines of text, so I would really like the reading to be vectorized if possible (or whatever it is that NumPy does to read things quickly).

Sorry if I have given too much information, and thanks.

The lines below are examples of parts of the *.unv text file format.

            -1
          2411
        146303         1         1        11
       6.9849462399269246D-001   8.0008842847097805D-002   6.6360238055630028D-001
        146304         1         1        11
       4.1854795755893875D-001   9.1256034628308313D-001   3.5725496189239300D-002
        146305         1         1        11
       7.5541258490349616D-001   3.7870257739063029D-001   2.0504544370783115D-001
        146306         1         1        11
       2.7637569971086767D-001   9.2829777518336010D-001   1.3757239038663285D-001
            -1
            -1
          2412
             9        21         1         0         7         2
             0         0         0
             1         9
            10        21         1         0         7         2
             0         0         0
             9        10
          1550        91         6         0         7         3
           761      3685      2027
          1551        91         6         0         7         3
           761      2380      2067
         39720       111         1         0         7         4
         71854     59536     40323     73014
         39721       111         1         0         7         4
         45520     48908    133818    145014
            -1
            -1
          2477
             1         0         0         0         0         0         0      3022
    PERMANENT GROUP1
             7         2         0         0         7         3         0         0
             7         8         0         0         7         7         0         0
             7       147         0         0         7       148         0         0
             2         0         0         0         0         0         0      2915
    PERMANENT GROUP2
             7         1         0         0         7         5         0         0
             7         4         0         0         7         6         0         0
             7         9         0         0         7        11         0         0
            -1

python-2.7 numpy

Tim Feb 18 '13 at 10:42

1 answer

Bálint Aradi · Accepted Answer · 2013-02-19T09:04:18+0000

The NumPy functions genfromtxt and loadtxt would be quite hard to apply to the entire file, since your data has a very special structure (which changes depending on which section you are in). Therefore, I would suggest the following strategy:

Read the file line by line and try to determine, by analyzing each line, which section you are in.
If you are in a section that has only a little data (or where, for example, you need to read alternating lines and so cannot read continuously), read it line by line and process the lines.

When you reach a section with a lot of data (for example, the one with the "real data"), use NumPy's fromfile function to read it in, for example:

    mydata = np.fromfile(fp, sep=" ", dtype=int, count=number_of_elements)
    # Reshape it to the desired shape, as fromfile returns a 1D array.
    mydata.shape = (100000, 3)

This way you combine the flexibility of line-by-line processing with the ability to read and convert large chunks of data quickly.
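As a rough illustration of this strategy applied to the file layout in the question, here is a minimal sketch written for Python 3. The helper name iter_groups and the SAMPLE text are made up for the example, and it assumes every group is wrapped in lines containing only -1, with the group header on the first line after the opening -1. Since float() does not accept the Fortran "D" exponent marker, the sketch replaces it with "E" before the conversion.

```python
import io
import numpy as np

# Small excerpt in the layout shown in the question (illustrative only).
SAMPLE = """\
    -1
  2411
    146303         1         1        11
   6.9849462399269246D-001   8.0008842847097805D-002   6.6360238055630028D-001
    146304         1         1        11
   4.1854795755893875D-001   9.1256034628308313D-001   3.5725496189239300D-002
    -1
    -1
  2412
      1550        91         6         0         7         3
       761      3685      2027
    -1
"""

def iter_groups(lines):
    """Yield (header, body_lines) for every group delimited by '-1' lines."""
    inside = False
    header, body = None, []
    for line in lines:
        token = line.strip()
        if token == "-1":
            # A '-1' line either opens a group or closes the current one.
            if inside and header is not None:
                yield header, body
            inside = not inside
            header, body = None, []
        elif inside:
            if header is None:
                header = token          # first line after the opening -1
            else:
                body.append(line)

groups = dict(iter_groups(io.StringIO(SAMPLE)))

# In group 2411 every second line holds the three coordinates.
coord_text = " ".join(groups["2411"][1::2]).replace("D", "E")
coords = np.array(coord_text.split(), dtype=float).reshape(-1, 3)
print(coords.shape)
```

Only the section detection runs line by line here; the numeric conversion is a single call on the joined text, which is where most of the time should go for a large file.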

UPDATE: The point is that you open the file, read it line by line, and when you reach a place with a lot of data, you pass the file object to fromfile.

Below is a simplified example:

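    import numpy as np

    fp = open("test.dat", "r")
    # The first line holds the number of integers that follow.
    line = fp.readline()
    ndata = int(line.strip())
    # Let NumPy read the next ndata whitespace-separated integers.
    data = np.fromfile(fp, count=ndata, sep=" ", dtype=int)
    fp.close()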

This reads the data from a file test.dat with content such as:

    10
    1 2 3 4 5 6 7 8 9 10

The first line is read explicitly with fp.readline() and processed (to determine the number of integers to read); np.fromfile() then reads the corresponding chunk of data and stores it in the 1D array data.

UPDATE2: Alternatively, you can read the whole text into a buffer, determine the start and end positions of the large chunk of data, and convert it directly via np.fromstring:

    fp = open("test.dat", "r")
    txt = fp.read()
    fp.close()
    # Now determine the start and end positions (startpos, endpos)
    # ..
    # Pass that portion of the text to the fromstring function.
    data = np.fromstring(txt[startpos:endpos], dtype=int, sep=" ")
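One possible way to find startpos and endpos for the "2411" group from the question is sketched below. The helper slice_group and the SAMPLE text are hypothetical, and the sketch assumes that the group body ends at the next line containing only -1 and that each node occupies exactly seven numbers (four integers, then three coordinates). It converts with np.array on the split text rather than np.fromstring, after replacing the Fortran "D" exponent marker with "E".

```python
import re
import numpy as np

# Small excerpt in the layout shown in the question (illustrative only).
SAMPLE = """\
    -1
  2411
    146303         1         1        11
   6.9849462399269246D-001   8.0008842847097805D-002   6.6360238055630028D-001
    146304         1         1        11
   4.1854795755893875D-001   9.1256034628308313D-001   3.5725496189239300D-002
    -1
"""

FOOTER = re.compile(r"^[ \t\r]*-1[ \t\r]*$", re.MULTILINE)

def slice_group(txt, header):
    """Return (startpos, endpos) of the body of the first group `header`."""
    pattern = r"^[ \t\r]*-1[ \t\r]*\n[ \t\r]*%s[ \t\r]*\n" % re.escape(header)
    head = re.search(pattern, txt, re.MULTILINE)
    if head is None:
        raise ValueError("group %s not found" % header)
    startpos = head.end()
    foot = FOOTER.search(txt, startpos)
    if foot is None:
        raise ValueError("group %s has no closing -1 line" % header)
    return startpos, foot.start()

startpos, endpos = slice_group(SAMPLE, "2411")

# Convert the whole block at once, then split it into columns.
block = SAMPLE[startpos:endpos].replace("D", "E")
values = np.array(block.split(), dtype=float).reshape(-1, 7)
labels = values[:, 0].astype(int)   # first integer of each node record
coords = values[:, 4:7]             # the three coordinates
print(labels, coords.shape)
```

Parsing the integers as floats and casting back is exact for values of this size, and it keeps the whole block in one conversion call.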

Or, if the data is easy to describe with a single regular expression, you can use np.fromregex() directly on the file.
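A small sketch of the np.fromregex() route, again on a made-up excerpt: it picks out every line that consists of exactly four unsigned integers, which in group 2411 are the lines preceding the coordinates. The field names in NODE_DTYPE are placeholders chosen for the example, not names taken from the file format.

```python
import io
import re
import numpy as np

# Small excerpt in the layout shown in the question (illustrative only).
SAMPLE = """\
    -1
  2411
    146303         1         1        11
   6.9849462399269246D-001   8.0008842847097805D-002   6.6360238055630028D-001
    146304         1         1        11
   4.1854795755893875D-001   9.1256034628308313D-001   3.5725496189239300D-002
    -1
"""

# One match per line that holds exactly four unsigned integers.
FOUR_INTS = re.compile(
    r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", re.MULTILINE
)

# fromregex needs a structured dtype; these field names are placeholders.
NODE_DTYPE = [("label", np.int64), ("a", np.int64),
              ("b", np.int64), ("c", np.int64)]

records = np.fromregex(io.StringIO(SAMPLE), FOUR_INTS, NODE_DTYPE)
labels = records["label"]
print(labels)
```

On a complete file this pattern would also match four-integer lines in other groups (for example 4-node element lines in group 2412), so it probably needs to be combined with slicing out the group first.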
