I work with large data matrices (Nrow x Ncol) that are too large to hold in memory. Instead, it is standard in my area of work to store the data in a binary file. Due to the nature of the work, I only need to access one column of the matrix at a time. I should also be able to modify that column and then save the updated column back to the binary file. So far, I have figured out how to save the matrix as a binary file and how to read one column of the matrix from the binary file into memory. However, after editing the contents of the column, I cannot figure out how to save it back to the binary file.
As an example, suppose the data file is a 32-bit identity matrix that has been stored on disk.
    Nrow = 500
    Ncol = 325
    data = eye(Float32,Nrow,Ncol)
    stream_data = open("data","w")
    write(stream_data,data[:])
    close(stream_data)
Reading the entire file from disk and then reshaping it back into the matrix is simple:
    stream_data = open("data","r")
    data_matrix = read(stream_data,Float32,Nrow*Ncol)
    data_matrix = reshape(data_matrix,Nrow,Ncol)
    close(stream_data)
As I already said, the data matrices I work with are too large to be read into memory, so the code above is usually not feasible. Instead, I need to work with one column at a time. Below is a solution for reading one column (for example, the 7th column) of the matrix into memory:
    icol = 7
    stream_data = open("data","r")
    position_data = 4*Nrow*(icol-1)
    seek(stream_data,position_data)
    data_col = read(stream_data,Float32,Nrow)
    close(stream_data)
Please note that the coefficient "4" in the variable "position_data" is due to the fact that I am working with Float32. Also, I don't quite understand what the seek command does here, but it seems to give me the correct result based on the following tests:
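As far as I can tell, the offset is simply a byte count from the start of the file, so the hard-coded "4" could be derived from the element type instead. The following is only a minimal sketch of that arithmetic, assuming the matrix is stored column by column with no header; the helper name `column_offset` is hypothetical, and the syntax is Julia 1.x rather than the older version used elsewhere in this post.

```julia
# Hypothetical helper: byte offset of the first element of column `icol`
# in a headerless, column-major matrix with `Nrow` rows of element type T.
column_offset(::Type{T}, Nrow::Integer, icol::Integer) where {T} =
    sizeof(T) * Nrow * (icol - 1)

# Column 7 of a 500-row Float32 matrix starts after 6 full columns:
offset = column_offset(Float32, 500, 7)   # 4 * 500 * 6 = 12000 bytes
println(offset)
```

With this, the first column starts at offset 0, which matches the fact that stream positions are zero-based.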
    data == data_matrix     # true
    data[:,7] == data_col   # true
For the sake of the example, let's say I determined that the loaded column (i.e. the 7th column) needs to be replaced with zeros:
    data_col = zeros(Float32,size(data_col))
Now the problem is figuring out how to save this column back to the binary file without affecting any other data. Naturally, I intend to use "write" for this task. However, I am not quite sure how to proceed. I know that I need to start by opening the data stream, but I'm not sure which mode I should use: "w", "w+", "a" or "a+"? Here is a failed attempt using "w":
    icol = 7
    stream_data = open("data","w")
    position_data = 4*Nrow*(icol-1)
    seek(stream_data,position_data)
    write(stream_data,data_col)
    close(stream_data)
The original binary file (before my unsuccessful attempt to edit it) took 650,000 bytes on disk. This is consistent with the fact that the matrix is 500x325 in size and Float32 numbers occupy 4 bytes each (i.e. 4 * 500 * 325 = 650,000). However, after my attempt to edit the file, I noticed that it now only takes up 14,000 bytes. Some quick mental math shows that 14,000 bytes correspond to 7 columns of data (4 * 500 * 7 = 14,000). A quick check confirms that all the original data has been replaced with a new 500x7 matrix whose elements are all zeros.
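The size arithmetic above can be checked programmatically with `filesize`. The following is a minimal sketch in Julia 1.x syntax; it writes an all-zero matrix of the same shape to a temporary file (an assumption made so the check does not touch the real "data" file) and compares the size on disk with the expected byte count.

```julia
Nrow, Ncol = 500, 325

# Temporary stand-in for the real data file (same shape and element type).
path = tempname()
open(path, "w") do io
    write(io, zeros(Float32, Nrow, Ncol))
end

bytes_per_column = sizeof(Float32) * Nrow      # 4 * 500 = 2000
expected_bytes   = bytes_per_column * Ncol     # 2000 * 325 = 650000

println("on disk:  ", filesize(path), " bytes")
println("expected: ", expected_bytes, " bytes")
# Integer division of the file size by the column size gives the column count.
println("columns:  ", filesize(path) ÷ bytes_per_column)
```

Applying the same division to the 14,000-byte file gives 7 columns, which agrees with the mental math above.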
    stream_data = open("data","r")
    data_new_matrix = read(stream_data,Float32,Nrow*7)
    data_new_matrix = reshape(data_new_matrix,Nrow,7)
    close(stream_data)
    sum(abs(data_new_matrix))   # 0.0, i.e. every element is zero
What do I need to do / change to modify only the 7th column in the binary?
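One direction that looks plausible, going by the documented modes of `open`, is "r+", which opens an existing file for reading and writing without truncating it, whereas "w" truncates the file on open. The following is only a sketch under that assumption, written in Julia 1.x syntax and tested against a temporary file rather than the real data; the helper name `overwrite_column!` is hypothetical.

```julia
# Hypothetical helper: overwrite column `icol` of a headerless, column-major
# binary matrix on disk with the contents of `col`, leaving all other bytes alone.
function overwrite_column!(path::AbstractString, col::Vector{T}, icol::Integer) where {T}
    Nrow = length(col)
    open(path, "r+") do io                        # read/write, no truncation
        seek(io, sizeof(T) * Nrow * (icol - 1))   # byte offset of column icol
        write(io, col)                            # replaces exactly Nrow elements
    end
    return nothing
end

# Build a 500x325 Float32 identity-like matrix in a temporary file.
Nrow, Ncol = 500, 325
data = zeros(Float32, Nrow, Ncol)
for i in 1:min(Nrow, Ncol)
    data[i, i] = 1
end
path = tempname()
open(path, "w") do io
    write(io, data)
end

# Replace the 7th column with zeros, then read everything back to inspect it.
overwrite_column!(path, zeros(Float32, Nrow), 7)
result = Array{Float32}(undef, Nrow, Ncol)
open(path, "r") do io
    read!(io, result)
end

println(filesize(path))                  # size on disk after the edit
println(all(result[:, 7] .== 0))         # is column 7 now all zeros?
println(result[:, 6] == data[:, 6])      # is a neighbouring column untouched?
```

The read-back at the end is only there for the check; on a matrix that does not fit in memory, the same comparison would have to be done one column at a time.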