Tips for creating a very large hash database

Question: What solutions or advice do you have for dealing with a very large (many terabytes) database indexed by strong hashes with high redundancy?

Some kind of inverted key-value store?

Is there anything that could be done with Postgres?

If necessary, I am ready to roll my own storage.

(Requirements: it must be open source, no Java, it must work on Linux, it must be disk-based, preferably C/C++/Python.)

Details:

I need to create a very large database where each record has:

  • some arbitrary metadata (a few text fields), including the primary key
  • a single hash (a 128-bit, strong, MD5-like hash)

The number of records is what I would call quite large: somewhere between 10 and 100 billion. There is significant redundancy in the hashes: more than 40% of records share their hash with at least one other record, and some hashes appear in 100 thousand records.

The main use case is looking up records by hash and then retrieving their metadata. The secondary use case is looking up by primary key and then retrieving the metadata.
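For concreteness, a sketch of those two lookups, assuming a hypothetical single table records(id, metadata, hash) where the hash is stored as a 16-byte bytea (the names and values are illustrative, not from the question):

    -- look up by hash, then read the metadata of every matching record
    SELECT id, metadata
    FROM records
    WHERE hash = '\x0123456789abcdef0123456789abcdef';

    -- look up by primary key, then read the metadata
    SELECT metadata
    FROM records
    WHERE id = 42;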

This is a kind of analytics database, so the overall load is moderate: mostly reads, few writes, and the writes are mostly batch inserts.

The current approach is to use Postgres with a primary key index and an index on the hash column. The table is loaded in batch mode with the hash index disabled.

All indexes are btrees. The index on the hash column grows huge, as large as or larger than the table itself. On a 120 GB table, it takes about a day to recreate the index. Query performance, however, is not bad.
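A minimal sketch of that setup, using the same hypothetical names as above (table, index, and file paths are illustrative):

    -- hypothetical single-table schema for the current approach
    CREATE TABLE records (
        id       bigint PRIMARY KEY,   -- implicit btree index on the primary key
        metadata text,                 -- arbitrary text fields
        hash     bytea NOT NULL        -- 128-bit (16-byte) MD5-like hash
    );

    -- secondary btree index on the hash column (the one that grows huge)
    CREATE INDEX records_hash_idx ON records (hash);

    -- batch load with the hash index dropped, then rebuild it afterwards
    DROP INDEX records_hash_idx;
    COPY records (id, metadata, hash) FROM '/path/to/batch.csv' WITH (FORMAT csv);
    CREATE INDEX records_hash_idx ON records (hash);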

The problem is that the predicted size of the target database will exceed 4 TB, based on tests with a smaller data set of 400 GB, which is about 10% of the full target. Once loaded into Postgres, unfortunately, more than 50% of the storage is taken up by the SQL index on the hash column.

That is too much. And I feel that the hash redundancy is an opportunity to use less storage.

Note also that although I have described the problem in terms of a single table, there are several such tables that need to be created.

1 answer

You could create one table with only an identifier and the hash, and keep your other data in a second table with its own index, the metadata, and a hashId referencing the first table. That way you avoid writing the same hash into the table up to 100 thousand times.
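A minimal sketch of that layout (names are hypothetical): each distinct hash is stored and indexed exactly once in its own table, and every record carries only a compact hash_id reference, so the 16-byte hash is no longer repeated and indexed per row.

    -- each distinct hash appears exactly once; UNIQUE rejects duplicate inserts
    CREATE TABLE hashes (
        hash_id bigserial PRIMARY KEY,
        hash    bytea NOT NULL UNIQUE   -- the 128-bit hash, stored once
    );

    -- records keep the primary key, metadata, and a small reference to the hash
    CREATE TABLE records (
        id       bigint PRIMARY KEY,
        metadata text,
        hash_id  bigint NOT NULL REFERENCES hashes (hash_id)
    );

    -- index on the reference so "all records for this hash" stays fast
    CREATE INDEX records_hash_id_idx ON records (hash_id);

    -- lookup by hash: resolve the hash once, then join to the records
    SELECT r.id, r.metadata
    FROM hashes h
    JOIN records r ON r.hash_id = h.hash_id
    WHERE h.hash = '\x0123456789abcdef0123456789abcdef';

Since more than 40% of records share their hash with at least one other record, and some hashes appear in 100 thousand records, the unique index on hashes would have far fewer entries than a per-row index on the hash column, which is where the storage saving would come from.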
