Optimizing a generated string for storage in a database

I have a 64-bit integer timestamp and a String username that are to be combined into a single string, which is ultimately stored in a database column. Leaving aside why I cannot store them in separate columns with the appropriate types, my question is how to combine them to get the best performance from the underlying database. It will be SQLite, PostgreSQL or MySQL; I am not sure yet.

I assume they will use b-trees as indexes, and that it would be bad to concatenate as (timestamp-username), because the timestamp almost always increases, so the tree would often need rebalancing. username-timestamp should be much better, but each user's entries will still grow with every new record. I also thought of storing the timestamp with its bits in reverse order.
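To make the bit-reversal idea concrete, here is a minimal sketch. It assumes the timestamp is treated as an unsigned 64-bit value, and the name `reverse_bits64` is purely illustrative. It only shows the transformation; it does not say whether this helps or hurts any particular index.

```python
def reverse_bits64(value: int) -> int:
    """Reverse the bit order of an unsigned 64-bit integer."""
    if not 0 <= value < 2**64:
        raise ValueError("value must fit in an unsigned 64-bit integer")
    # Format as exactly 64 binary digits, reverse the digits, parse back.
    return int(format(value, "064b")[::-1], 2)


# Two consecutive timestamps differ only in their lowest bit here,
# so after reversal they differ in the highest bit instead.
a = reverse_bits64(1_700_000_000)
b = reverse_bits64(1_700_000_001)
```

Applying the function twice returns the original value, so the conversion is cheap in both directions.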

What else can I do? Some kind of smart XOR or something else? What would be the best scheme? The data will only ever be accessed by requesting the exact generated string, with no range queries, etc.

The only requirement is a reasonably fast conversion between the generated string and the source data in both directions.
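As an illustration of such a two-way conversion, here is one possible username-timestamp layout. The separator, the fixed-width hexadecimal timestamp, and the function names `encode` and `decode` are assumptions made for this sketch, not part of any existing scheme.

```python
from typing import Tuple

SEPARATOR = "-"
HEX_WIDTH = 16  # a 64-bit value needs at most 16 hexadecimal digits


def encode(username: str, timestamp: int) -> str:
    """Build 'username-<16 hex digits>' from the two source values."""
    if not 0 <= timestamp < 2**64:
        raise ValueError("timestamp must fit in an unsigned 64-bit integer")
    return f"{username}{SEPARATOR}{timestamp:0{HEX_WIDTH}x}"


def decode(key: str) -> Tuple[str, int]:
    """Recover (username, timestamp) from a string produced by encode()."""
    # The timestamp part has a fixed width, so slice from the right.
    # This keeps decoding unambiguous even if the username contains '-'.
    username = key[: -(HEX_WIDTH + 1)]
    timestamp = int(key[-HEX_WIDTH:], 16)
    return username, timestamp
```

For example, `encode("alice", 255)` gives `alice-00000000000000ff`, and `decode` turns it back into `("alice", 255)`.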

UPDATE: Please, guys, I am asking which string would be best to store as the primary key in the database (one of SQLite, MySQL and PostgreSQL). Perhaps the answer is that it does not matter, or that it depends on the DB engine. I do not have a particular problem with the schema I am using, nor do I need help with a caching solution. I am just asking whether there is room for improvement, and how. I would appreciate some on-topic answers.

UPDATE2: A great answer, although the question is still not settled for me: does an incrementing column make the b-tree index on that column unbalanced? (Stack Overflow)

2 answers

There is a contradiction in your question. You say that you cannot split them and store them in separate columns, but then you talk about indexing both parts separately. You cannot do that without splitting them.

As I see it, you really have two options:

  • Storing them in separate columns
  • Hashing the generated string to reduce the index size

Ideally, you should store them in two columns and create a composite index if you always look them up together in the same order. It is difficult to give precise advice without more information. As a rule, though, the order (username, timestamp) makes logical sense if you query by user, and you would reverse it if you want to query by timestamp. You can also create an index on each column if you need to search by one or the other.
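A minimal sketch of the two-column option, using Python's built-in `sqlite3` module with an in-memory database. The table name `events` and the columns `username`, `ts` and `payload` are made up for the example; the composite primary key plays the role of the composite index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        username TEXT    NOT NULL,
        ts       INTEGER NOT NULL,
        payload  TEXT,
        PRIMARY KEY (username, ts)  -- composite key, username first
    )
    """
)
conn.executemany(
    "INSERT INTO events (username, ts, payload) VALUES (?, ?, ?)",
    [("alice", 1700000000, "first"),
     ("alice", 1700000001, "second"),
     ("bob", 1700000000, "third")],
)

# Exact lookup on both parts of the key.
query = "SELECT payload FROM events WHERE username = ? AND ts = ?"
row = conn.execute(query, ("alice", 1700000001)).fetchone()

# Ask SQLite how it would execute the lookup; the last field of each
# plan row is a human-readable description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query,
                    ("alice", 1700000001)).fetchall()
plan_text = " ".join(str(r[-1]) for r in plan)
```

The query plan text is the quickest way to check that the lookup goes through the index rather than scanning the table. Other engines expose the same information through their own `EXPLAIN` output.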

Hashing the generated string

INSERT INTO table (crc_hash_column, value_column_name) values (CRC32(@generated_value), @generated_value) 

will reduce the indexed value to a 32-bit integer (only 4 bytes of index per row), which is much smaller than the space required for a VARCHAR or CHAR.

If you take this approach, you must take measures to handle collisions. Because of the birthday paradox they will happen, and they become more likely as your data set grows. Even with collisions, the additional filter on the value column will still perform better than the alternatives, given the size of the index.
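To get a feel for how quickly collisions become likely, the usual birthday approximation can be evaluated directly. This is a rough estimate that assumes hash values are spread uniformly, which is only approximately true for real data; the function name is illustrative.

```python
import math


def collision_probability(n_rows: int, hash_bits: int = 32) -> float:
    """Approximate chance that at least two of n_rows distinct values
    share the same hash, assuming uniformly distributed hashes."""
    buckets = 2 ** hash_bits
    # Birthday approximation: p = 1 - exp(-n(n-1) / (2 * buckets))
    return 1.0 - math.exp(-n_rows * (n_rows - 1) / (2 * buckets))


for n in (1_000, 10_000, 77_163, 1_000_000):
    print(f"{n:>9} rows: {collision_probability(n):.4f}")
```

Under this approximation, a 32-bit hash reaches roughly even odds of at least one collision at around 77,000 rows, which is why the query must always compare the full value as well.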

 SELECT * FROM table WHERE crc_hash_column = CRC32(@search_value) AND value_column_name = @search_value 

Using a hash costs a few more processor cycles, but CRC32 is very fast, so even though you have to rehash the value on every search, this extra work is tiny compared with the benefit when indexing large amounts of data.
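The same pattern can be tried out end to end with Python's `sqlite3` and `zlib` modules. In this sketch the hash is computed in the application with `zlib.crc32` rather than by a SQL function, and the table name `entries`, the column names and the helper functions are assumptions for the example.

```python
import sqlite3
import zlib


def crc32_of(text: str) -> int:
    """CRC-32 of the UTF-8 bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(text.encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (crc_hash INTEGER NOT NULL, value TEXT NOT NULL)"
)
# Only the small integer column is indexed, not the long string.
conn.execute("CREATE INDEX idx_entries_crc ON entries (crc_hash)")


def insert(value: str) -> None:
    conn.execute(
        "INSERT INTO entries (crc_hash, value) VALUES (?, ?)",
        (crc32_of(value), value),
    )


def find(value: str):
    # The hash narrows the search via the index; comparing the full
    # value afterwards filters out any rows that merely collide.
    return conn.execute(
        "SELECT value FROM entries WHERE crc_hash = ? AND value = ?",
        (crc32_of(value), value),
    ).fetchone()


insert("alice-00000000000000ff")
insert("bob-0000000000000100")
```

Here `find("alice-00000000000000ff")` returns the stored row, while a string that was never inserted returns `None`, even if its hash happens to match an existing one.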

In general, I would prefer the first option, but it is almost impossible to make a recommendation without knowing your use case.

You should look at both options and see whether they meet your requirements.


The fact that you say you cannot store them in separate columns (you cannot even create a new table with a 1:1 relationship, mirror the data in a materialized view using triggers, or replace the existing table with a corrected table structure?!) means that any solution will be an ugly hack.

Yes, how much the data changes and how it is structured will affect the efficiency of updates. However, the purpose of an index is to speed up lookups, and you have not given us any information about how the data is accessed or how it may change.

I also thought to put the timestamp in reverse order of bits

Why? This is more likely to accelerate index fragmentation than to reduce it.

MariaDB supports virtual columns, and indexes on virtual columns, so you can do stupid things such as throwing the rules of normalization out of the window. But if you cannot fix a trivial problem in the schema, then replacing the DBMS is probably a very practical solution.

Frankly, if developing a bad solution costs as much time and money as the right solution, and is likely to incur further costs in the future, then choosing the bad solution is a waste of time and money.


Source: https://habr.com/ru/post/1416191/

