How should I implement a huge but simple indexed string list in Delphi?

I am using Delphi 2009. I have a very simple data structure with two fields:

  • A key string that I look records up by, usually 4 to 15 characters long.
  • A data string that can be any size, from 1 character to 10,000 characters.

The difficulty is that I may have several million of these records, so together they can total up to 10 GB. Obviously I'm looking for an on-disk solution, not an in-memory one.

My program needs to retrieve these records randomly by the key field, and this part should be made as efficient as possible.

Should I use a database for such a simple structure, and if so, which database would handle this best and be easiest to implement?

Alternatively, is there a simple on-disk data structure, short of a full-blown database, that would work just as well?


Well, I needed that answer to bring me back to reality. I was hoping for something simpler than a simple database. But when the obvious answer is to use a database, I realize that I have already answered this question with my own answer to another question: The best database for small applications and tools.

My answer there was DISQLite3, for the reasons I pointed out there, and that is probably what I will go with for my implementation.


Some good answers have come in, each with something to offer. That's great. I can try several different methods to see which works best.


After more thought, I have changed the accepted answer to the GpStructuredStorage solution.

In my case, a million records totaling several gigabytes put a strain on a database structure. In particular, the B*-tree that most databases use to store the index is fast, but slows down for some operations, such as reindexing a million values.

The only thing you will find faster than a B*-tree for an index is a hash table, and that is exactly what gabr provides as an addition to the GpStructuredStorage solution. I think it is quite elegant how he segments the hash value to create a four-level directory structure.

The main reason I can go with the hash solution is that I only need random access by key; I never need to sort by key. If sorting were necessary, the speed advantage of the hash table would be lost and the database system would be the clear winner.

When I get to the implementation, I will compare this method against a database. Perhaps I will compare it with both Firebird and SQLite, which would both be worthy opponents.


Another follow-up:

I just discovered Synopse BigTable by A. Bouchez, which is designed for speed and matches the specifications of my question exactly. I will try it first when I do my implementation in a few months, and will report my results back here.


Significantly later follow-up (July 2015)

I never did try Synopse BigTable. I have stuck with my B*-tree so far. But now that I have upgraded to Delphi XE8, I plan to go with a database solution using FireDAC with SQLite.
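For what it's worth, here is a minimal sketch of what that could look like: a single two-column SQLite table with the key as primary key, accessed through FireDAC. This is my own illustration, not tested code from the project; the unit list, the KeyData table and the keydata.db file name are assumptions, so adjust them for your setup.

program KeyValueSqliteDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils,
  FireDAC.Comp.Client, FireDAC.Stan.Def, FireDAC.Stan.Async, FireDAC.DApt,
  FireDAC.Phys.SQLite, FireDAC.ConsoleUI.Wait;

// Look up one data value by key; returns '' if the key is not present.
function LookupValue(aConn: TFDConnection; const aKey: string): string;
var
  Q: TFDQuery;
begin
  Q := TFDQuery.Create(nil);
  try
    Q.Connection := aConn;
    Q.SQL.Text := 'SELECT Data FROM KeyData WHERE Key = :Key';
    Q.ParamByName('Key').AsString := aKey;
    Q.Open;
    if Q.Eof then
      Result := ''
    else
      Result := Q.Fields[0].AsString;
  finally
    Q.Free;
  end;
end;

var
  Conn: TFDConnection;
begin
  Conn := TFDConnection.Create(nil);
  try
    Conn.DriverName := 'SQLite';
    Conn.Params.Add('Database=keydata.db');
    Conn.Connected := True;
    // The primary key gives SQLite a B-tree index for the random lookups.
    Conn.ExecSQL('CREATE TABLE IF NOT EXISTS KeyData ' +
                 '(Key TEXT PRIMARY KEY, Data TEXT)');
    Writeln(LookupValue(Conn, 'SOMEKEY'));
  finally
    Conn.Free;
  end;
end.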

+4
7 answers

Synopse BigTable by A. Bouchez. See his answer to my other question about SQLite / DISQLite.

It had not even been developed when I first asked this question, but it is now a fairly mature and fully functional unit.

+1

For data over 10 GB, a database is exactly what you need. It will handle indexing for quick data retrieval (your random lookups), functionality for adding, modifying and deleting data, the actual storage, and much more if you want it.

There are dozens of posts about which databases are available for use with Delphi, including embedded and free open-source ones such as Firebird.

+5

Why all the fuss? Just use GpStructuredStorage (it has a 4 TB limit, and with a little time spent on its classes you can get past that). It will take you a few hours to get used to it, but it's worth the time. Hope this helps.

GpStructuredStorage can retrieve file names very quickly (I have tested this). Save each record as a file in GpStructuredStorage and load each name as a string into a string list; a million strings (since you mentioned TStringList) only needs a few MB of RAM, which is small. Use a TStream descendant to write the data into a file inside GpStructuredStorage. I don't have time to write a simple example today, but on Saturday or Sunday I will write a GpStructuredStorage tutorial on my blog.


[Added by gabr - I hope this won't be considered a terrible breach of netiquette. It's just that my thoughts don't fit into a comment and it would be silly to add another answer just to write them...]

While GpStructuredStorage can store a lot of data, finding it can be a slow process. What I usually do in such cases is to create a hash of the key and convert it to hexadecimal (say 00FFA784). Then I convert this hex hash into a folder structure (in this case it would be /00/FF/A7/84) and store the corresponding data in that folder, either as a file, as attributes, or as a combination of the two.
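For illustration only, a minimal sketch (my own, not gabr's actual code) of turning a key into such a four-level folder path; the FNV-style hash is just a placeholder for any reasonably uniform hash function:

program HashPathDemo;
{$APPTYPE CONSOLE}
{$OVERFLOWCHECKS OFF} // the hash relies on 32-bit wrap-around
uses
  SysUtils;

function KeyToFolderPath(const aKey: string): string;
var
  Hash: Cardinal;
  i: Integer;
  Hex: string;
begin
  // FNV-1a style hash of the key characters.
  Hash := 2166136261;
  for i := 1 to Length(aKey) do
    Hash := (Hash xor Ord(aKey[i])) * 16777619;
  Hex := IntToHex(Int64(Hash), 8);                   // e.g. '00FFA784'
  // Split the 8 hex digits into a 4-level folder path: '/00/FF/A7/84'.
  Result := '/' + Copy(Hex, 1, 2) + '/' + Copy(Hex, 3, 2) +
            '/' + Copy(Hex, 5, 2) + '/' + Copy(Hex, 7, 2);
end;

begin
  Writeln(KeyToFolderPath('MYKEY01'));
end.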

This approach speeds up data retrieval but slows down data insertion, so it is recommended only for mostly static data. For fairly dynamic data I would recommend using a database rather than GpStructuredStorage.

+4

You should analyze your data. If

  • a significant portion of the data values is larger than the default file system block size,
  • you don't need to search inside the data values with SQL (so it doesn't matter what format they are stored in), and
  • you really need random access to the entire database,

then you should check whether compressing your data values increases performance. Decompressing the data values (especially on a modern machine with multiple cores, done in a background thread) should cost only a small performance hit, but the gain from having to read fewer blocks from the hard drive (especially when they are not in the cache) could be much larger.
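A rough sketch of the idea (my own code, assuming the standard ZLib unit that ships with Delphi; requires SysUtils, Classes and ZLib in the uses clause): compress each data value into a memory stream before writing it to disk, and decompress it again after reading.

function CompressValue(const aValue: UTF8String): TMemoryStream;
var
  Z: TCompressionStream;
begin
  Result := TMemoryStream.Create;
  Z := TCompressionStream.Create(clDefault, Result);
  try
    if aValue <> '' then
      Z.Write(aValue[1], Length(aValue));
  finally
    Z.Free; // freeing flushes the remaining compressed bytes into Result
  end;
end;

function DecompressValue(aCompressed: TStream): UTF8String;
var
  Z: TDecompressionStream;
  Buf: array[0..4095] of Byte;
  N, Total: Integer;
begin
  Result := '';
  Total := 0;
  aCompressed.Position := 0;
  Z := TDecompressionStream.Create(aCompressed);
  try
    repeat
      N := Z.Read(Buf, SizeOf(Buf));   // read decompressed bytes in chunks
      if N > 0 then
      begin
        SetLength(Result, Total + N);
        Move(Buf, Result[Total + 1], N);
        Inc(Total, N);
      end;
    until N = 0;
  finally
    Z.Free;
  end;
end;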

But you need to measure; perhaps your database engine stores the data compressed anyway.

+2

Berkeley DB is exactly what you are looking for.

+1

If you mostly need to read large data sets and you have some money, just put 16 GB (500-750 EUR) into the machine and build a separate process, compiled with a 64-bit compiler (*), that you query via e.g. shared memory or another IPC method.

That way you can use an in-memory approach until 64-bit Delphi finally comes out. Since your data seems simple (a map from a char array to a char array), it can easily be exposed over IPC.
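The in-memory map itself is trivial; a minimal sketch with Generics.Collections (available since Delphi 2009), leaving out the IPC layer to the 64-bit helper process:

program InMemoryMapDemo;
{$APPTYPE CONSOLE}
uses
  Generics.Collections;

var
  Map: TDictionary<string, string>;
  Value: string;
begin
  Map := TDictionary<string, string>.Create;
  try
    Map.Add('MYKEY01', 'some data value');        // load records here
    if Map.TryGetValue('MYKEY01', Value) then     // random access by key
      Writeln(Value);
  finally
    Map.Free;
  end;
end.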

That is, of course, only if this approach has any merit for your case (for example, if it acts as a cache or so), which I cannot determine from your question.

(*) I recommend FPC, of course :-)

I did it once, up to 5 million objects, 5 GB of data.

I got permission to open up the original container types that I made for it; they are here:

http://www.stack.nl/~marcov/lightcontainers.zip (warning: very dirty code)

mghie: to answer with another cliché: there is no silver bullet.

Databases also make many other assumptions:

  • Their generalized approach makes relatively inefficient use of memory. Above all, your data set, stored with plain storage methods, has to fit into the memory ranges that are affordable, which of course are usually larger for a server (my educated guess here) than for a client.
  • Databases assume that their result sets can be reduced to small sets inside the database server with relatively straightforward processing and use of indexing.
  • They have relatively high latency.
  • They are relatively bad at some kinds of processing (e.g. multidimensional analysis / OLAP, which is why databases need to be extended for that).

This makes databases a relatively poor fit for, for example, caches, load balancers, etc. Of course, that only matters if you need the speed, but the original question sounded somewhat speed-sensitive to me.

At my last job, my role in a database-oriented company was to do everything except that, in other words to fix the problems where the standard approach couldn't crack them (or where it would have required 4-socket Oracle servers for jobs whose budget didn't justify such expenses). The solution/hack described above was a bit OLAP-like and was connected to hardware (an RFID firmware-programming device) that required a certain guaranteed response time. Two months of programming; it is still running, and the cost wouldn't even have covered a Windows + Oracle server license.

+1

Since your data is more than 3 GB, you will need to make sure the database engine you select either handles tables that large, or split the data across several tables, which I would suggest doing regardless of the maximum size a single table supports. If you split, split on a logical key break as evenly as possible, so that it is easy to determine which table to use from the first one or two characters of the key. This will significantly reduce your search time by excluding any records that could never match your query.
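Purely as an illustration of that splitting idea (my own sketch, with a made-up 'DATA_' naming scheme): derive the table name from the first character of the key, so a lookup only ever touches the one table that could hold the record.

uses
  SysUtils;

// Map a key to the table that stores it, based on its first character.
function TableNameForKey(const aKey: string): string;
begin
  if aKey = '' then
    raise Exception.Create('Empty key');
  Result := 'DATA_' + UpperCase(Copy(aKey, 1, 1));   // e.g. 'DATA_A' .. 'DATA_Z'
end;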

If you just need raw performance, and you will only ever read the data, then you would be better served by ordered index file(s) using a fixed-size record for your keys that points into your data file. You can then easily perform a binary search and avoid any database overhead. For an additional performance boost, you can preload/cache the midpoints in memory to reduce repeated reads.

A simple fixed-size record for your specifications might look like this:

type
  rIndexRec = record
    KeyStr  : String[15]; // short string, 15 chars max
    DataLoc : Integer;    // switch to Int64 if you are using gpHugeFile
  end;

To do the sort, use the TurboPower sort found in SysTools; the latest version for Delphi 2009/2010 may be available on the SongBeamer website. DataLoc will be the stream position of your data record, and writing or reading a record might look like this:

function WriteDataString(aDataString: String; aStream: TStream): Integer;
var
  aLen: Integer;
begin
  Result := aStream.Position;
  aLen := Length(aDataString);
  aStream.Write(aLen, SizeOf(aLen));
  aStream.Write(aDataString[1], aLen * SizeOf(Char));
end;

function ReadDataString(aPos: Integer; aStream: TStream): String;
var
  aLen: Integer;
begin
  if aStream.Position <> aPos then
    aStream.Seek(aPos, soFromBeginning);
  Result := '';
  aStream.Read(aLen, SizeOf(aLen));
  SetLength(Result, aLen);
  if aStream.Read(Result[1], aLen * SizeOf(Char)) <> aLen * SizeOf(Char) then
    raise Exception.Create('Unable to read entire data string');
end;

When you build your index records, DataLoc is set to the position of the record in the data file. The order in which the records are loaded does not matter, as long as the index entries end up sorted. I used this exact technique to keep a 6-billion-record database current with monthly updates, so it scales very easily.
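A sketch of the lookup side (my own code, not the answerer's): a binary search over the sorted index file of fixed-size rIndexRec records described above, returning the DataLoc to pass to ReadDataString.

function FindDataLoc(const aKey: string; aIndexStream: TStream;
  out aDataLoc: Integer): Boolean;
var
  Lo, Hi, Mid: Int64;
  Rec: rIndexRec;
begin
  Result := False;
  Lo := 0;
  Hi := (aIndexStream.Size div SizeOf(rIndexRec)) - 1;
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    aIndexStream.Position := Mid * SizeOf(rIndexRec);
    aIndexStream.ReadBuffer(Rec, SizeOf(Rec));
    if string(Rec.KeyStr) = aKey then
    begin
      aDataLoc := Rec.DataLoc;   // stream position for ReadDataString
      Result := True;
      Exit;
    end
    else if string(Rec.KeyStr) < aKey then
      Lo := Mid + 1
    else
      Hi := Mid - 1;
  end;
end;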

EDIT: Yes, the above code is limited to about 2 GB per data file, but you can get past that with gpHugeFile or by segmenting. I prefer to segment into multiple logical files of less than 2 GB each, which also takes up a little less disk space.

+1
