Julia: How to modify a column of a matrix that has been saved as a binary file?

I work with large data matrices (Nrow x Ncol) that are too large to hold in memory. Instead, it is standard in my field to store the data in a binary file. Due to the nature of the work, I only need to access one column of the matrix at a time. I also need to be able to modify a column and then save the updated column back to the binary file. So far, I have figured out how to save a matrix as a binary file and how to read one column of the matrix from the binary file into memory. However, after editing the contents of a column, I cannot figure out how to save that column back to the binary file.

As an example, suppose the data file is an identity matrix of 32-bit floats that has been stored on disk.

    Nrow = 500
    Ncol = 325
    data = eye(Float32,Nrow,Ncol)
    stream_data = open("data","w")
    write(stream_data,data[:])
    close(stream_data)

Reading the entire file from disk and then reshaping it back into a matrix is straightforward:

    stream_data = open("data","r")
    data_matrix = read(stream_data,Float32,Nrow*Ncol)
    data_matrix = reshape(data_matrix,Nrow,Ncol)
    close(stream_data)

As I said, the data matrices I work with are too large to read into memory, so the code above is usually not an option. Instead, I need to work with one column at a time. Below is a solution for reading one column (for example, the 7th column) of the matrix into memory:

    icol = 7
    stream_data = open("data","r")
    position_data = 4*Nrow*(icol-1)
    seek(stream_data,position_data)
    data_col = read(stream_data,Float32,Nrow)
    close(stream_data)

Note that the factor "4" in the variable "position_data" is there because I am working with Float32, which takes 4 bytes per element. Also, I don't quite understand what the seek command does here, but it seems to give me the correct result based on the following tests:

    data == data_matrix      # true
    data[:,7] == data_col    # true
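For reference, `seek` moves the stream's position to a zero-based byte offset from the start of the file, so the next read or write begins there. Because Julia stores matrices in column-major order, column `icol` starts right after the `Nrow*(icol-1)` elements of the preceding columns. The following is a minimal sketch of that offset calculation, not the code from the question: it assumes Julia 1.x (where `read!` into a preallocated vector takes the place of `read(stream, Float32, Nrow)`), and the temporary file and small matrix size are illustrative assumptions.

```julia
# Sketch for Julia 1.x; the temporary file and small sizes are assumptions.
Nrow, Ncol = 5, 4
# Column-major layout: column j holds the values (j-1)*Nrow+1 through j*Nrow.
data = reshape(Float32.(1:Nrow*Ncol), Nrow, Ncol)

path = tempname()
open(path, "w") do io
    write(io, data)              # elements are written column by column
end

icol = 3
# Number of bytes occupied by all columns before icol.
offset = sizeof(Float32) * Nrow * (icol - 1)

data_col = Vector{Float32}(undef, Nrow)
open(path, "r") do io
    seek(io, offset)             # jump to the first element of column icol
    read!(io, data_col)          # read exactly Nrow Float32 values
end
```

Using `sizeof(Float32)` instead of a literal 4 keeps the offset correct if the element type changes.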

For the sake of this question, suppose I have determined that the loaded column (i.e. the 7th column) needs to be replaced with zeros:

 data_col = zeros(Float32,size(data_col)) 

Now the problem is figuring out how to save this column back to the binary file without affecting any of the other data. Naturally, I intend to use "write" to do this. However, I am not quite sure how to proceed. I know that I need to start by opening a stream to the data; however, I am not sure which mode to use: "w", "w+", "a", or "a+"? Here is a failed attempt using "w":

    icol = 7
    stream_data = open("data","w")
    position_data = 4*Nrow*(icol-1)
    seek(stream_data,position_data)
    write(stream_data,data_col)
    close(stream_data)

The original binary file (before my unsuccessful attempt to edit it) took up 650,000 bytes on disk. This is consistent with the fact that the matrix is 500x325 in size and Float32 numbers occupy 4 bytes each (i.e. 4 * 500 * 325 = 650,000). However, after my attempt to edit the file, I noticed that it now only takes up 14,000 bytes. Some quick mental math shows that 14,000 bytes corresponds to 7 columns of data (4 * 500 * 7 = 14,000). A quick check confirms that all of the original data has been replaced with a new 500x7 matrix whose elements are all zeros.

    stream_data = open("data","r")
    data_new_matrix = read(stream_data,Float32,Nrow*7)
    data_new_matrix = reshape(data_new_matrix,Nrow,7)
    sum(abs(data_new_matrix))    # 0.0f0

What do I need to do or change to modify only the 7th column in the binary file?

+6
2 answers

Instead of

    icol = 7
    stream_data = open("data","w")
    position_data = 4*Nrow*(icol-1)
    seek(stream_data,position_data)
    write(stream_data,data_col)
    close(stream_data)

in the OP, write

    icol = 7
    stream_data = open("data","r+")
    position_data = 4*Nrow*(icol-1)
    seek(stream_data,position_data)
    write(stream_data,data_col)
    close(stream_data)

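i.e. replace "w" with "r+" and everything will work. The "w" mode truncates the file as soon as it is opened, whereas "r+" opens it for both reading and writing without truncating it.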

The documentation for open, at http://docs.julialang.org/en/release-0.4/stdlib/io-network/#Base.open, explains the different modes. Preferably, open should not be called with the mode-string parameter, which is somewhat confusing and definitely slower.
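To see the difference between the two modes in isolation, here is a small sketch. It is not the OP's code: it assumes Julia 1.x, and the temporary file and the 5x4 matrix of ones are illustrative assumptions. It overwrites one column with "r+", checks that the file size and the other columns are unchanged, and then shows that merely opening the file with "w" empties it.

```julia
# Sketch for Julia 1.x; the temporary file and small sizes are assumptions.
Nrow, Ncol = 5, 4
data = ones(Float32, Nrow, Ncol)

path = tempname()
open(path, "w") do io
    write(io, data)
end
size_before = filesize(path)

icol = 3
open(path, "r+") do io                 # read and write, no truncation
    seek(io, sizeof(Float32) * Nrow * (icol - 1))
    write(io, zeros(Float32, Nrow))    # overwrite exactly one column
end
size_after = filesize(path)

# Read the whole file back to check that only column icol changed.
result = Matrix{Float32}(undef, Nrow, Ncol)
open(path, "r") do io
    read!(io, result)
end

# By contrast, opening with "w" truncates the file, even if nothing is written.
open(io -> nothing, path, "w")
size_truncated = filesize(path)
```

The write in "r+" mode replaces bytes in place, so the amount written must match the column length exactly; writing more would spill into the next column.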

+1

You can use SharedArrays for this:

    data=SharedArray("/some/absolute/path/to/a/file", Float32,(Nrow,Ncols))
    # do something with data
    data[:,1]=a[:,1].+1
    exit()

    # restart julia
    data=SharedArray("/some/absolute/path/to/a/file", Float32,(Nrow,Ncols))
    @show data[1,1]
    # prints 1

Keep in mind that you must handle synchronization of reads from and writes to this file yourself (if you have asynchronous workers), and that you should not change the size of the array (unless you know what you are doing).

+1
