The fastest way to retrieve / store millions of small binary objects

I am looking for a fast solution (as in high performance, not a quick hack) for saving and retrieving tens of millions of small (about 1 KB) binary objects. Each object must have a unique identifier used to retrieve it (preferably a GUID or SHA hash). Additional requirements are that it must be usable from .NET and must not require additional software installation.

I am currently using a single-table SQLite database for this, but I want to get rid of the overhead of processing simple SQL statements such as SELECT data FROM store WHERE id = ?.

I also tested persisting the objects as plain files on NTFS, but performance degrades very quickly once the store reaches half a million objects.

PS: By the way, objects never need to be deleted, and the insert rate is relatively low. In fact, every time an object changes, a new version is stored and the previous version is kept. This is actually a requirement to support time travel.

Just adding more information to this thread:

To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem? http://arxiv.org/abs/cs.DB/0701168

+6
performance database data-structures sqlite
10 answers

You may be able to reduce the NTFS performance problems by splitting the object's GUID into pieces and using them as directory names. That way, each directory contains only a limited number of subdirectories or files.

E.g. if the identifier is aaaa-bb-cc-ddddeeee, the path to the item would be c:\store\aaaa\bbcc\dddd\eeee.dat, limiting each directory to no more than 64K sub-items.
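A minimal sketch of that scheme in C# (the exact split and the .dat naming are assumptions for illustration; each level takes four hex digits of the GUID, so no directory holds more than 64K entries):

    using System;
    using System.IO;

    static class GuidStore
    {
        // Builds a nested path like c:\store\aaaa\bbcc\dddd\<rest>.dat from a GUID.
        public static string GetPath(string root, Guid id)
        {
            string hex = id.ToString("N");           // 32 hex chars, no dashes
            string dir = Path.Combine(root,
                hex.Substring(0, 4),                 // first level
                hex.Substring(4, 4),                 // second level
                hex.Substring(8, 4));                // third level
            Directory.CreateDirectory(dir);          // no-op if it already exists
            return Path.Combine(dir, hex.Substring(12) + ".dat");
        }
    }

Store and fetch then become File.WriteAllBytes(GuidStore.GetPath(root, id), data) and File.ReadAllBytes(GuidStore.GetPath(root, id)).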

+10

You only need to call the prepare function once per statement, with the parameter indicated, e.g., by ? (so SELECT data FROM store WHERE id=? is the statement you would prepare); what you then do "millions of times" is just bind the parameter to the prepared statement and call sqlite_step, and both are fast operations. It is worth benchmarking whether blob open might not be even faster. IOW, I recommend sticking with SQLite and digging into its low-level interface (from managed C++, if necessary) for maximum performance; it is really an amazing little engine, and it has often surprised me favorably with its performance!
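A minimal sketch of that bind-and-reuse pattern from C#, assuming the System.Data.SQLite provider (class, table and column names are illustrative); the command, and thus the prepared statement behind it, is created once and only the parameter value changes per lookup:

    using System.Data.SQLite;

    class BlobStore
    {
        private readonly SQLiteConnection _conn;
        private readonly SQLiteCommand _select;

        public BlobStore(string file)
        {
            _conn = new SQLiteConnection("Data Source=" + file);
            _conn.Open();

            // Prepared once, reused for every lookup.
            _select = new SQLiteCommand("SELECT data FROM store WHERE id = @id", _conn);
            _select.Parameters.Add(new SQLiteParameter("@id"));
        }

        public byte[] Get(string id)
        {
            _select.Parameters["@id"].Value = id;        // bind
            return (byte[])_select.ExecuteScalar();      // step + fetch the blob
        }
    }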

+1

I think a database is the best choice.

The whole structure of a database is tuned for exactly this kind of case, and the parsing and optimization of such a simple query is negligible.

You might be able to create a scheme where you store all the objects in one big blob directly in the file system, then open a memory-mapped view onto it and index the object IDs to offsets within the blob, but I doubt you would see much better performance than a DB, since that is essentially what it does anyway.
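For what it is worth, a rough sketch of such a scheme in C#, assuming a single blob file plus an index of GUID to (offset, length); all names here are illustrative:

    using System;
    using System.Collections.Generic;
    using System.IO.MemoryMappedFiles;

    class MappedBlobStore
    {
        private readonly MemoryMappedFile _mmf;
        private readonly Dictionary<Guid, (long Offset, int Length)> _index;

        public MappedBlobStore(string blobFile, Dictionary<Guid, (long Offset, int Length)> index)
        {
            _mmf = MemoryMappedFile.CreateFromFile(blobFile);
            _index = index;    // built separately, e.g. loaded from an index file
        }

        public byte[] Get(Guid id)
        {
            var (offset, length) = _index[id];
            var data = new byte[length];
            using (var view = _mmf.CreateViewAccessor(offset, length))
                view.ReadArray(0, data, 0, length);      // copy the object out of the mapping
            return data;
        }
    }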

0

Keep a separate index (another file) of [Guid -> file number + file offset]. Use binary search for retrieval, and move on to file n+1 whenever file n reaches a certain size. Each row in the index file is only 24 bytes (fixed size: guid + file number + offset; split the files at 4 GB), and sorting it is fast (the insert rate is low, after all).
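A sketch of what the 24-byte records and the binary-search lookup could look like in C# (record layout as described: 16-byte guid + 4-byte file number + 4-byte offset; assumes the index file is kept sorted by Guid.CompareTo):

    using System;
    using System.IO;

    static class BlobIndex
    {
        const int RecordSize = 24;    // 16 (guid) + 4 (file number) + 4 (offset)

        public static (int FileNumber, uint Offset)? Find(string indexFile, Guid id)
        {
            using var stream = File.OpenRead(indexFile);
            using var reader = new BinaryReader(stream);
            long lo = 0, hi = stream.Length / RecordSize - 1;

            while (lo <= hi)                                    // classic binary search
            {
                long mid = lo + (hi - lo) / 2;
                stream.Position = mid * RecordSize;
                var key = new Guid(reader.ReadBytes(16));

                int cmp = key.CompareTo(id);
                if (cmp == 0)
                    return (reader.ReadInt32(), reader.ReadUInt32());
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return null;                                        // not in the index
        }
    }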

Edit: You have very simple requirements that are easy to optimize for. This carefully engineered system should outperform the database, especially if you are careful about block-sized reads and asynchronous IO. Database queries will always have parsing overhead.

Edit 2: If you also need it to be safe (always a good idea), see here for a description of how the concept of a transactional file system can help you make things bulletproof.

0

Have you thought about trying an object database such as db4o? It can persist any CLR object and access them quickly using a query language (it supports LINQ!). I did not have millions of objects, but access to a few thousand was pretty fast, with no significant difference from a similar SQL query with an indexed ID field.

0

How about a plain binary file with fixed-size blocks of about 2 KB, with the first 4 bytes being the length of the object...

The location of object i would then be i * 2048 bytes; read 2048 bytes for the object and get the length of the actual object from the first 4 bytes (an unsigned int).
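A minimal sketch of that block layout in C# (the 2048-byte block size and 4-byte length prefix as described above; the helper names are made up):

    using System;
    using System.IO;

    static class BlockFile
    {
        const int BlockSize = 2048;

        // Appends one object as a fixed-size block and returns its index i.
        public static long Append(FileStream file, byte[] data)
        {
            if (data.Length > BlockSize - 4)
                throw new ArgumentException("object does not fit in one block");

            long index = file.Length / BlockSize;
            file.Position = index * BlockSize;
            file.Write(BitConverter.GetBytes((uint)data.Length), 0, 4);   // length prefix
            file.Write(data, 0, data.Length);
            file.SetLength((index + 1) * BlockSize);                      // pad to a full block
            return index;
        }

        // Reads object i: seek to i * 2048, read the block, trim to the stored length.
        public static byte[] Read(FileStream file, long index)
        {
            var block = new byte[BlockSize];
            file.Position = index * BlockSize;
            file.Read(block, 0, BlockSize);
            int length = (int)BitConverter.ToUInt32(block, 0);
            var data = new byte[length];
            Array.Copy(block, 4, data, 0, length);
            return data;
        }
    }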

0

I like Earwicker's solution. The way I have handled this is very similar.

I have done this:

Let's say your guid is 3F2504E0-4F89-11D3-9A0C-0305E82C3301.

Hash the guid down to a three-letter hash: aaa-zzz.

Suppose, for the sake of argument, that your guid hashes to "xap".

Your information will be found in the file c:\store\x\xa\xap\3F2504E04F8911D39A0C0305E82C3301.dat
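The three-letter hash itself is not spelled out above; here is one purely illustrative way to do it in C#, mapping the first bytes of the guid onto 'a'..'z' and building the nested path:

    using System;
    using System.IO;

    static class HashedStore
    {
        // Derives a three-letter bucket (aaa..zzz) from the guid and builds
        // the nested path x\xa\xap\<guid>.dat described above.
        public static string GetPath(string root, Guid id)
        {
            byte[] bytes = id.ToByteArray();
            string hash = new string(new[]
            {
                (char)('a' + bytes[0] % 26),
                (char)('a' + bytes[1] % 26),
                (char)('a' + bytes[2] % 26),
            });

            string dir = Path.Combine(root,
                hash.Substring(0, 1),      // x
                hash.Substring(0, 2),      // xa
                hash);                     // xap
            Directory.CreateDirectory(dir);
            return Path.Combine(dir, id.ToString("N").ToUpperInvariant() + ".dat");
        }
    }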

Naturally, there are many variations on this strategy. For example, xap could itself be a file holding all the binary objects appended together, with a header (or an external file) containing the pointers and offsets.

0

You can check whether HDF5 is suitable for your tasks.

0

I tend to agree with Alex: if you write your own solution, you are reinventing things that probably already exist in SQLite, but if you must...

A BTree can probably do the job here. It is the workhorse of any database, and your problem space is not that bad. 10 million objects of 1,000 bytes each is still only 10 billion bytes, so the file is manageable by the OS, and there are plenty of BTree examples out there to try.

Compared to using the file system's directory structure to essentially emulate a BTree, a real BTree will be much faster.

Another solution that might be interesting is MogileFS, a distributed, replicating file system.

0

I don't know whether SQLite supports indexes or not, but if it does, you can speed things up by creating an index on the ID field.
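SQLite does support indexes (a PRIMARY KEY column gets one automatically). A small sketch of adding one explicitly, again assuming the System.Data.SQLite provider and the store/id names from the question:

    using System.Data.SQLite;

    static class Schema
    {
        public static void EnsureIndex(SQLiteConnection conn)
        {
            // If id is not already the PRIMARY KEY, an explicit index turns the
            // "SELECT data FROM store WHERE id = ?" lookup into a B-tree search
            // instead of a table scan.
            using var cmd = new SQLiteCommand(
                "CREATE UNIQUE INDEX IF NOT EXISTS idx_store_id ON store(id)", conn);
            cmd.ExecuteNonQuery();
        }
    }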

If not, then your best bet is a B+ tree. Thanks

0
