Reading a large formatted text file using NumPy

I volunteered to help someone convert a finite element mesh from one format to another (I-DEAS *.unv to Alberta). I am using NumPy for some additional mesh processing, but I am having trouble reading the raw text file into NumPy arrays. I tried genfromtxt and loadtxt without success.

Some information:

1) All groups are delimited by a header line and a footer line, each containing only "-1".

2) The NODE group has the header "2411" on its own line. I only want to read alternate lines from this group: skip each line of integers and read each line with three Fortran double-precision numbers.

3) The ELEMENT connectivity group has the header "2412" on its own line. All data are integers, and only the first 4 columns need to be read. Some slots in the NumPy array will be empty because 2- and 3-node elements have fewer values.

4) The "2477" NODE groups I think I can handle myself, using regular expressions to find which lines to read.

5) The real data file will have about 1 million lines of text, so I would really like the reading to be vectorized if possible (or to use whatever NumPy offers for fast reading).

Sorry if I gave too much information and thanks.

The lines below are examples of parts of the *.unv text file format.

    -1
    2411
    146303 1 1 11
    6.9849462399269246D-001 8.0008842847097805D-002 6.6360238055630028D-001
    146304 1 1 11
    4.1854795755893875D-001 9.1256034628308313D-001 3.5725496189239300D-002
    146305 1 1 11
    7.5541258490349616D-001 3.7870257739063029D-001 2.0504544370783115D-001
    146306 1 1 11
    2.7637569971086767D-001 9.2829777518336010D-001 1.3757239038663285D-001
    -1
    -1
    2412
    9 21 1 0 7 2
    0 0 0
    1 9
    10 21 1 0 7 2
    0 0 0
    9 10
    1550 91 6 0 7 3
    761 3685 2027
    1551 91 6 0 7 3
    761 2380 2067
    39720 111 1 0 7 4
    71854 59536 40323 73014
    39721 111 1 0 7 4
    45520 48908 133818 145014
    -1
    -1
    2477
    1 0 0 0 0 0 0 3022
    PERMANENT GROUP1
    7 2 0 0 7 3 0 0
    7 8 0 0 7 7 0 0
    7 147 0 0 7 148 0 0
    2 0 0 0 0 0 0 2915
    PERMANENT GROUP2
    7 1 0 0 7 5 0 0
    7 4 0 0 7 6 0 0
    7 9 0 0 7 11 0 0
    -1
1 answer

NumPy's genfromtxt and loadtxt would be quite hard to use on the whole file, since your data has a very particular structure (which changes depending on which section you are in). Therefore, I would suggest the following strategy:

  • Read the file line by line and work out which section each line belongs to.

  • If you are in a section that holds only a little data (or where, for example, you need to read alternate lines and so cannot read one contiguous block), read it line by line and process the lines yourself.

  • When you reach a section with a lot of data (for example, the one with the "real data"), use NumPy's fromfile function to read it in, for example:

     mydata = np.fromfile(fp, sep=" ", dtype=int, count=number_of_elements)
     # Reshape to the desired shape, since fromfile returns a 1D array.
     mydata.shape = (100000, 3)

This way you combine the flexibility of line-by-line processing with the ability to read and convert large chunks of data quickly.
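The line-by-line part of this strategy could be sketched roughly as follows. This is only an illustration under some assumptions: the sample is an abridged copy of the data in the question, the helper names split_datasets and node_coordinates are made up for this example, and the Fortran "D" exponent marker is replaced with "E" so that Python's float() can parse the values.

```python
import numpy as np

# Abridged copy of the sample from the question.
SAMPLE = """\
    -1
  2411
 146303 1 1 11
 6.9849462399269246D-001 8.0008842847097805D-002 6.6360238055630028D-001
 146304 1 1 11
 4.1854795755893875D-001 9.1256034628308313D-001 3.5725496189239300D-002
    -1
    -1
  2412
 1550 91 6 0 7 3
 761 3685 2027
 39720 111 1 0 7 4
 71854 59536 40323 73014
    -1
"""


def split_datasets(lines):
    """Group lines by dataset id (hypothetical helper).

    A line containing only "-1" opens a dataset, the next line is its id,
    and the following "-1" line closes it.
    """
    datasets = {}
    current = None     # id of the dataset being read, None between datasets
    expect_id = False  # True right after an opening "-1"
    for raw in lines:
        line = raw.strip()
        if line == "-1":
            if current is None and not expect_id:
                expect_id = True   # opening delimiter
            else:
                current = None     # closing delimiter
                expect_id = False
        elif expect_id:
            current = line
            datasets.setdefault(current, [])
            expect_id = False
        elif current is not None:
            datasets[current].append(line)
    return datasets


def node_coordinates(node_lines):
    """Keep every second line of a 2411 block and parse its three floats."""
    coord_lines = node_lines[1::2]
    values = [float(tok.replace("D", "E"))
              for line in coord_lines for tok in line.split()]
    return np.array(values).reshape(-1, 3)


datasets = split_datasets(SAMPLE.splitlines())
coords = node_coordinates(datasets["2411"])
print(coords.shape)
```

For a file with around a million lines, the pure-Python loop in node_coordinates is the part worth replacing with a bulk NumPy conversion once the section boundaries are known.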

UPDATE: The idea is that you open the file and read it line by line, and when you reach a place with a lot of data, you pass the open file object to fromfile.

Below is a simplified example:

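    import numpy as np

    fp = open("test.dat", "r")
    # The first line holds the number of integers that follow.
    line = fp.readline()
    ndata = int(line.strip())
    # Read the next ndata whitespace-separated integers in one call.
    data = np.fromfile(fp, count=ndata, sep=" ", dtype=int)
    fp.close()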

This reads the data from a test.dat file with contents such as:

    10
    1 2 3 4 5 6 7 8 9 10

The first line is read explicitly with fp.readline() and processed (to determine how many integers to read); np.fromfile() then reads the corresponding chunk of data and stores it in the 1D array data.

UPDATE2: Alternatively, you can read the whole text into a buffer, determine the start and end positions of a large chunk of data, and convert it directly with np.fromstring:

    fp = open("test.dat", "r")
    txt = fp.read()
    fp.close()
    # Now determine the start and end positions (startpos, endpos)
    # ..
    # Pass that portion of the text to the fromstring function.
    data = np.fromstring(txt[startpos:endpos], dtype=int, sep=" ")
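Applied to the NODE block from the question, the buffer approach might look like the sketch below. It rests on several assumptions that are not part of the original answer: the block is located with two simple regular expressions, the Fortran "D" exponent is replaced with "E" inside the block only, and each node is assumed to contribute exactly seven numbers (four integers, then three coordinates), so the flat array can be reshaped and sliced.

```python
import re

import numpy as np

# Abridged copy of the NODE dataset from the question.
txt = """\
    -1
  2411
    146303         1         1        11
   6.9849462399269246D-001   8.0008842847097805D-002   6.6360238055630028D-001
    146304         1         1        11
   4.1854795755893875D-001   9.1256034628308313D-001   3.5725496189239300D-002
    -1
"""

# Start right after the "2411" header line.
header = re.search(r"^[ \t]*2411[ \t]*$", txt, flags=re.MULTILINE)
startpos = header.end()

# End at the next line that contains only "-1".
footer = re.compile(r"^[ \t]*-1[ \t]*$", flags=re.MULTILINE).search(txt, startpos)
endpos = footer.start()

# Python and NumPy expect "E" rather than the Fortran "D" exponent marker.
block = txt[startpos:endpos].replace("D", "E").strip()

# A separator of " " matches any run of whitespace, newlines included.
values = np.fromstring(block, dtype=float, sep=" ")

# Assumed layout: 4 integers followed by 3 coordinates per node.
table = values.reshape(-1, 7)
labels = table[:, 0].astype(int)
coords = table[:, 4:7]
print(labels, coords.shape)
```

Parsing the integer lines as floats and discarding them afterwards wastes a little memory, but it keeps the whole conversion in one NumPy call instead of a per-line loop.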

Or, if the data is easy to describe with a single regular expression, you can use np.fromregex() directly on the file.
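A small sketch of the regular-expression route, again on an abridged copy of the sample. The pattern, the field names x, y, z, and the preliminary "D" to "E" substitution are assumptions made for this illustration; np.fromregex itself requires a structured dtype with one field per regex group.

```python
import io
import re

import numpy as np

# Abridged copy of the NODE dataset from the question.
txt = """\
    -1
  2411
    146303         1         1        11
   6.9849462399269246D-001   8.0008842847097805D-002   6.6360238055630028D-001
    146304         1         1        11
   4.1854795755893875D-001   9.1256034628308313D-001   3.5725496189239300D-002
    -1
"""

# Rewrite only "D" exponents that sit between a digit and a sign,
# so that text such as group names is left untouched.
fixed = re.sub(r"(\d)D([-+]\d)", r"\1E\2", txt)

# One group per coordinate; three such numbers in a row form one record.
number = r"([-+]?\d+\.\d+E[-+]?\d+)"
pattern = r"\s+".join([number] * 3)

# Hypothetical field names for the three coordinates.
xyz = np.dtype([("x", float), ("y", float), ("z", float)])

records = np.fromregex(io.StringIO(fixed), pattern, xyz)

# Convert the structured array into a plain (n, 3) float array.
coords = np.column_stack([records["x"], records["y"], records["z"]])
print(coords.shape)
```

Because the integer lines contain no decimal point or exponent, the pattern skips them without any explicit handling of alternate lines.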
