MATLAB parfor over a database with a custom function

I have a set of images (~10^7) stored in one huge binary file. I want to read and analyze them efficiently using a function that I already have. Each call to this user-defined function foo takes about 0.1 s, so processing the entire database with a simple loop that reads through it takes several days:

    ...
    for image_number=1:N
        offset_in_bytes = npoints_per_image*element_size*(image_number-1);
        fseek(fid, offset_in_bytes, 'bof');
        s=fread(fid, npoints_to_load,'ushort');
        image=reshape(s,nrows,[]);
        [outputs]=foo(image)
    end

I have optimized the foo function as much as I could (vectorized the code where possible, used the correct data classes, etc.). The only thing I haven't done yet is create a MEX version. I was thinking of using parfor for this, but I could not get it to work. Each image is independent, but the code above reads the data sequentially, so I cannot parallelize it as written. How can I make this code and database work with parfor? Thanks.

2 answers

To run this loop in parallel, both the FREAD part and the foo part must be able to run concurrently. You can check that the foo part runs in parallel by replacing the FREAD call with a dummy call to something like MATLAB's RAND.
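A minimal sketch of that check, assuming hypothetical image dimensions and a trivial stand-in for foo (both are placeholders, not the asker's real values). It times the same loop once with for and once with parfor, using RAND in place of FREAD so that file access is taken out of the picture:

```matlab
% Hypothetical sizes; substitute the real image dimensions.
nrows = 512;
ncols = 512;
N = 200;                          % small test count, not the full database

% Stand-in for the user's foo (assumption): any per-image computation.
foo = @(img) sum(img(:));

% Serial baseline.
tic;
out_serial = zeros(1, N);
for image_number = 1:N
    image = rand(nrows, ncols);   % dummy data instead of fread
    out_serial(image_number) = feval(foo, image);
end
t_serial = toc;

% Same loop body under parfor; feval avoids any ambiguity between
% calling a function handle and indexing a variable inside parfor.
tic;
out_par = zeros(1, N);
parfor image_number = 1:N
    image = rand(nrows, ncols);
    out_par(image_number) = feval(foo, image);
end
t_par = toc;

fprintf('serial: %.2f s, parfor: %.2f s\n', t_serial, t_par);
```

If the parfor timing is not clearly lower with a pool open, the bottleneck is probably not the file access, and the remarks on single-threaded workers apply.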

Please note that matlabpool workers run in single-threaded mode. This is especially important if all your workers are on the same machine. If foo was already taking advantage of MATLAB's built-in multithreading, then using a PARFOR loop may well make things slower.
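One rough way to estimate how much foo gains from multithreading is to time it with the default thread limit and again with the limit set to one, which approximates what a single worker would see. The sketch below uses maxNumCompThreads for this; the matrix size and the stand-in foo are assumptions for illustration only:

```matlab
% Stand-in for the user's foo (assumption): a multithreaded-friendly operation.
A = rand(1000);
foo = @(img) img * img;

n_default = maxNumCompThreads;    % remember the current thread limit

tic;
feval(foo, A);
t_multi = toc;

maxNumCompThreads(1);             % mimic a single-threaded worker
tic;
feval(foo, A);
t_single = toc;

maxNumCompThreads(n_default);     % restore the original limit

fprintf('default threads: %.3f s, one thread: %.3f s\n', t_multi, t_single);
```

If the two timings are close, foo gains little from multithreading and parfor has more room to help; if the single-threaded run is much slower, the parfor speedup on one machine will be correspondingly smaller.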

I suspect that, since you have only one large image file, your file system may not give you fully parallel access to it. I'm not sure how best to work around this; it almost certainly depends on your specific file system.


I assume that your images all have the same size, or sizes that you know in advance. That is, you do not need to scan all the previous images to find the offset of the current one. If that is not the case, the following will not help.

The code as posted will not work with parfor, because several workers would be trying to share a single file handle. MATLAB's parallel tools are designed to be used with clusters of multiple computers, so things like file handles are not duplicated across workers.

To make your code work with parfor, you need to open and close the file inside the loop, for example:

    parfor image_number=1:N
        fid = fopen(filename, 'r');
        offset_in_bytes = npoints_per_image*element_size*(image_number-1);
        fseek(fid, offset_in_bytes, 'bof');
        s=fread(fid, npoints_to_load,'ushort');
        image=reshape(s,nrows,[]);
        [outputs]=foo(image);
        fclose(fid);
    end

If you find that opening and closing the file on every iteration adds too much overhead, you can use a nested loop and perhaps some buffering:

    chunk_size = ceil(N/10);
    parfor i = 0:9
        fid = fopen(filename, 'r');
        %some buffering code here

        %start this iteration of the parfor loop at this image_number
        start_num = 1+(i * chunk_size);
        end_num = min(N, (i+1) * chunk_size); %and end it at this one
        for image_number=start_num:end_num
            %your code here
        end
        fclose(fid);
    end
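The chunked skeleton could be filled in along the following lines. This is only a sketch under stated assumptions: the file name images.bin, the tiny image dimensions, and the stand-in foo are all hypothetical, and a small test file is written first so the example is self-contained. Each chunk opens the file once, seeks to its first image, and then reads sequentially; results are collected in a cell array sliced by chunk index.

```matlab
% Assumptions (hypothetical values): file name, image layout, and foo.
filename = 'images.bin';
nrows = 4;
ncols = 3;
npoints_per_image = nrows * ncols;
element_size = 2;                 % bytes per 'ushort' element
N = 25;
foo = @(img) sum(img(:));         % stand-in for the real analysis

% Write a small test file: every pixel of image k has the value k.
fid = fopen(filename, 'w');
fwrite(fid, repmat(1:N, npoints_per_image, 1), 'ushort');
fclose(fid);

n_chunks = 10;
chunk_size = ceil(N / n_chunks);
chunk_results = cell(1, n_chunks);

parfor c = 1:n_chunks
    start_num = 1 + (c - 1) * chunk_size;
    end_num = min(N, c * chunk_size);
    local = zeros(1, max(0, end_num - start_num + 1));

    % Each chunk gets its own file handle and seeks once.
    fid = fopen(filename, 'r');
    fseek(fid, npoints_per_image * element_size * (start_num - 1), 'bof');
    for image_number = start_num:end_num
        s = fread(fid, npoints_per_image, 'ushort');  % sequential read
        image = reshape(s, nrows, []);
        local(image_number - start_num + 1) = feval(foo, image);
    end
    fclose(fid);

    chunk_results{c} = local;     % sliced output, one cell per chunk
end

outputs = [chunk_results{:}];     % concatenate in image order
```

With N = 25 and ten chunks, the last chunk is empty, which the max(0, ...) guard handles. In practice the number of chunks would be chosen relative to the number of workers, and the test file can be removed afterwards with delete(filename).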

As Edrick already mentioned, if your foo function relies heavily on vectorized operations, parfor may not speed it up much, since parfor workers run in single-threaded mode.
