Reading and processing a large text file in Matlab

I am trying to read a large text file (several million lines) in Matlab. I originally used importdata(filename), which seemed like a concise solution. However, I need to use Matlab 7 (yes, I know it's old), and importdata is not supported there. So I tried the following:

 fid = fopen(filename, 'r');
 lno = 1;
 while ~feof(fid)
     fline = fgetl(fid);
     fdata{1, lno} = fline;
     lno = lno + 1;
 end
 fclose(fid);

But it is very slow. I assume the array is being resized at each iteration. Is there a better way to do this? The first 20 lines of the input are string data, and the rest contains 3 to 6 columns of hexadecimal values.

3 answers

You would need to reshape the data a little, but one option is to use fread. As already mentioned, though, that essentially locks you into rectangular imports. So another option is to use textscan. As I mentioned in another comment, I am not 100% sure when it was implemented; all I know is that you do not have importdata().

 fid = fopen('textfile.txt');
 Out = textscan(fid, '%s', 'delimiter', sprintf('\n'));
 fclose(fid);

With textscan you get a cell array with one cell per line, which you can then manipulate however you like. And, as I say in my comments, it does not matter whether the lines are the same length or not. Parsing the cell array afterwards is much faster. But as gnowitz mentions (and he also has a very elegant solution), you may have to watch your memory requirements.

One thing you never want to use in Matlab, if you can avoid it, is loop constructs. They are fast in C/C++ and the like, but in Matlab they are the slowest way to get where you are going.
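As a toy illustration of the point above (variable names are my own), compare a scalar loop with the equivalent vectorized call; both compute the same sum of squares, but the second runs as a single built-in operation:

```matlab
x = rand(1, 1e6);          % a million random values

% Loop version: each iteration is interpreted separately,
% which is slow in older Matlab releases
s = 0;
for k = 1:numel(x)
    s = s + x(k)^2;
end

% Vectorized version: one element-wise square and one sum
s2 = sum(x.^2);
```

The gap is largest on Matlab versions without a strong JIT, which is exactly the situation with Matlab 7.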

EDIT: I just looked it up, and it appears textscan was in fact introduced in version 7 (R14), so if that is what you have, you should make good use of it.


I see two options:

  • Instead of growing the array by 1 each time, double its size whenever you run out of room. This significantly reduces the number of reallocations needed.
  • Take a two-pass approach. The first pass simply counts the number of lines without storing them. The second pass actually fills the array, which has been preallocated to the required size.
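A minimal sketch of the first idea, the doubling strategy (the file name and starting capacity are placeholders):

```matlab
fid = fopen('textfile.txt', 'r');
capacity = 1024;               % initial guess; doubled as needed
fdata = cell(1, capacity);     % preallocated cell array
lno = 0;
while ~feof(fid)
    lno = lno + 1;
    if lno > capacity          % out of room: double the allocation
        capacity = capacity * 2;
        fdata{1, capacity} = [];   % assigning past the end grows the array in one step
    end
    fdata{1, lno} = fgetl(fid);
end
fclose(fid);
fdata = fdata(1, 1:lno);       % trim the unused tail
```

Because the size doubles each time, the total number of reallocations is logarithmic in the line count rather than linear.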

One solution is to read the entire contents of the file as a character string with FSCANF, split the string into separate cells at the points where newlines appear using MAT2CELL, remove the extra whitespace at the ends with STRTRIM, then process the string data in each cell as necessary. For example, using this sample text file 'junk.txt':

 hi
 hello
 1 2 3
 FF 00 FF
 12 A6 22 20 20 20
 FF FF FF

The following code will put each row in a cell in an array of cells cellData :

 >> fid = fopen('junk.txt','r');
 >> strData = fscanf(fid,'%c');
 >> fclose(fid);
 >> nCharPerLine = diff([0 find(strData == char(10)) numel(strData)]);
 >> cellData = strtrim(mat2cell(strData,1,nCharPerLine))

 cellData =

     'hi'
     'hello'
     '1 2 3'
     'FF 00 FF'
     '12 A6 22 20 20 20'
     'FF FF FF'

Now, if you want to convert all the hexadecimal data (lines 3 through 6 in my sample data file) from strings to numeric vectors, you can use CELLFUN and SSCANF:

 >> cellData(3:end) = cellfun(@(s) {sscanf(s,'%x',[1 inf])},cellData(3:end));
 >> cellData{3:end}  %# Display contents

 ans =
      1     2     3
 ans =
    255     0   255
 ans =
     18   166    34    32    32    32
 ans =
    255   255   255

NOTE: Since you are dealing with such large arrays, you need to keep in mind the amount of memory used by your variables. The above solution is vectorized, but can take a lot of memory. You may need to overwrite or clear large variables such as strData when creating cellData. Alternatively, you can loop over the elements of nCharPerLine and individually process each segment of the larger strData string into the vectors you need, which you can preallocate now that you know how many lines of data you have (i.e. nDataLines = numel(nCharPerLine)-nHeaderLines;).
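The loop-and-preallocate alternative described above might look something like the sketch below. The header count and the %x format are assumptions based on the sample file; adjust them for your real data:

```matlab
fid = fopen('junk.txt', 'r');
strData = fscanf(fid, '%c');
fclose(fid);
nCharPerLine = diff([0 find(strData == char(10)) numel(strData)]);

nHeaderLines = 2;                                 % 'hi' and 'hello' in the sample
nDataLines = numel(nCharPerLine) - nHeaderLines;
dataVectors = cell(1, nDataLines);                % preallocate the output
iStart = sum(nCharPerLine(1:nHeaderLines)) + 1;   % skip past the header lines
for iLine = 1:nDataLines
    iEnd = iStart + nCharPerLine(nHeaderLines + iLine) - 1;
    lineStr = strtrim(strData(iStart:iEnd));      % one line of text at a time
    dataVectors{iLine} = sscanf(lineStr, '%x', [1 inf]);
    iStart = iEnd + 1;
end
```

This only ever holds one line's substring at a time in addition to strData, so the peak memory use stays close to the size of the file itself.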

