Efficient disk access for a large number of small .mat files containing objects

I am trying to determine the best way to store a large number of small .mat files: about 9000 objects ranging in size from 2 KB to 100 KB, for a total of about half a gigabyte.

A typical use case is that I only need to load a small number (for example, 10) of the files from disk at a time.

What I tried:

Method 1. If I save each object to a separate file, I have performance problems (very slow saving, and the system stays sluggish for some time afterwards), since Windows 7 handles that many files in one folder poorly (and I think my SSD has a rough time with it too). However, the end result is fine: I can load what I need very quickly. This is with save -v6.
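For reference, a minimal sketch of this method (the objs cell array and the file naming scheme are placeholders, not from the original post):

    % Method 1 sketch: one uncompressed (-v6) .mat file per object.
    % 'objs' is a placeholder cell array holding the ~9000 objects.
    for k = 1:numel(objs)
        obj = objs{k};
        fname = sprintf('obj_%05d.mat', k);   % hypothetical naming scheme
        save(fname, 'obj', '-v6');            % -v6 writes without compression
    end

    % Later, only the handful of files needed are loaded:
    S = load('obj_00042.mat');   % S.obj holds the object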

Method 2. If I save all the objects in one .mat file and load only the variables I need, access is very slow: loading takes about three-fourths of the time required to load the entire file, with slight variation depending on the order in which the variables were saved. This is also with save -v6.
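For comparison, a sketch of this method under the same assumptions (the variable names are made up):

    % Method 2 sketch: all objects as separate variables in one .mat file.
    % Pack them as fields of a struct, then save with '-struct' so each
    % field becomes its own top-level variable in the file.
    for k = 1:numel(objs)
        S.(sprintf('obj_%05d', k)) = objs{k};
    end
    save('all_objects.mat', '-struct', 'S', '-v6');

    % A selective load names only the variables wanted, yet it still took
    % about three-fourths as long as loading the whole file.
    T = load('all_objects.mat', 'obj_00042', 'obj_00100');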

I know that I could split the files across many folders, but it feels like such a nasty hack (and it won't fix the SSD's dislike of writing many small files). Is there a better way?

Edit: The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus many small identifying properties (char and numeric).

3 answers

Five ideas to consider:

  • Try saving to an HDF5 file (see http://www.mathworks.com/help/techdoc/ref/hdf5.html); you may find that this solves all your problems. It would also be readable from many other systems (e.g. Python, Java, R). A sketch of this follows the list.
  • A variation on your method 2 is to store them in one or more files, but with compression disabled.
  • Test different data types. It is also possible that some of your objects compress or decompress inexplicably badly. I have had such problems with cell arrays and struct arrays. I eventually found a way around it, but it has been a while and I do not remember how to reproduce that particular problem; the solution was to use a different data structure.
  • @SB suggested a database. If all else fails, try that. I do not like creating external dependencies and additional interfaces, but it should work (the main problem being that if the database misbehaves or corrupts your data, you are back to square one). For this, consider SQLite, which does not require a separate server/client infrastructure; there is an interface on MATLAB Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
  • (New) Given that the objects total less than 1 GB, it may be easiest to simply copy the entire set to a RAM disk and access it there. Just remember to copy back from the RAM disk if anything is saved (or wrap save to write the objects to both places).
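A minimal sketch of the HDF5 idea from the first bullet, using MATLAB's high-level h5create/h5write/h5read functions (the file and dataset names are made up, and custom objects would first have to be decomposed into their numeric and char parts):

    % Write one object's payload as datasets under its own HDF5 group.
    data = rand(1000, 4);          % stand-in for the double matrix
    ids  = uint32(1:1000).';       % stand-in for the uint32 identifier vector
    h5create('objects.h5', '/obj42/data', size(data));
    h5write('objects.h5', '/obj42/data', data);
    h5create('objects.h5', '/obj42/ids', size(ids), 'Datatype', 'uint32');
    h5write('objects.h5', '/obj42/ids', ids);

    % Read back a single object's data without touching the rest of the file:
    data = h5read('objects.h5', '/obj42/data');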

Update: the OP mentions custom objects; these will need to be serialized before they can be stored in formats such as HDF5 or database blobs.
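As one hedged illustration (not necessarily what was meant here), MATLAB has an undocumented pair of builtins that serialize almost any value, including objects, to a byte vector; they are unsupported and may change between releases:

    % Undocumented and unsupported: serialize an object to uint8 bytes.
    bytes = getByteStreamFromArray(myObject);
    % ...and reconstruct it later:
    obj = getArrayFromByteStream(bytes);

The resulting uint8 vector is then easy to store anywhere: in a plain .mat variable, an HDF5 dataset, or a database blob.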


Try saving them as blobs in a database.

I would also try the multiple-folders approach; it may work better than you think. It could also help with organizing the files, if that is something you ever need.
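A sketch of that multiple-folders idea, bucketing the ~9000 files into subfolders so no single directory grows huge (the bucket size of 500 and the paths are arbitrary choices):

    % Spread files across subfolders of ~500 entries each.
    for k = 1:numel(objs)
        bucket = floor((k-1)/500);
        folder = fullfile('objs', sprintf('%03d', bucket));
        if ~exist(folder, 'dir')
            mkdir(folder);
        end
        obj = objs{k};
        save(fullfile(folder, sprintf('obj_%05d.mat', k)), 'obj', '-v6');
    end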


The solution I came up with is to store arrays of objects, about 100 objects per file. These files tend to be 5-6 MB, so loading them is not prohibitive, and access is just a matter of loading the correct array(s) and then subsetting them to the desired record(s). This compromise means I am not writing too many small files, while still allowing quick access to individual objects and avoiding the extra overhead of a database or serialization.
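A sketch of that compromise, assuming the objects can be concatenated into arrays (the chunk size and file naming are illustrative):

    % Save ~9000 objects as ~90 chunk files of 100 objects each.
    chunkSize = 100;
    nChunks = ceil(numel(objs)/chunkSize);
    for c = 1:nChunks
        idx = (c-1)*chunkSize + 1 : min(c*chunkSize, numel(objs));
        chunk = [objs{idx}];     % requires objects of the same class
        save(sprintf('chunk_%03d.mat', c), 'chunk', '-v6');
    end

    % To fetch object n, load its chunk and index into it:
    n = 1234;
    c = ceil(n/chunkSize);
    S = load(sprintf('chunk_%03d.mat', c), 'chunk');
    obj = S.chunk(mod(n-1, chunkSize) + 1);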

