How to manage huge operations on MySQL

I have a MySQL database with a lot of records (about 4,000,000,000 rows) that I want to process and reduce to about 1,000,000,000 rows.

Suppose I have the following tables:

  • table RawData: more than 5,000 rows per second arrive that I want to insert into RawData.

  • table ProcessedData: this table stores the processed (aggregated) rows that were inserted into RawData. Minimum number of rows: > 20,000,000.

  • table ProcessedDataDetail: here I write the details of the ProcessedData table (the data that was aggregated).

Users want to browse and search the ProcessedData table, which has to be joined with more than 8 other tables. Inserting into RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are both very slow. I use a lot of indexes; my data length is about 1 GB, but my index length is 4 GB :). (I would like to get rid of these indexes; they slow down my process.)
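For example, a typical search looks roughly like this (the column names and the third joined table are only illustrative; the real query joins even more tables):

    SELECT p.id, p.total, d.detail_value
    FROM ProcessedData AS p
    INNER JOIN ProcessedDataDetail AS d ON d.processed_data_id = p.id
    INNER JOIN SomeOtherTable AS o ON o.id = p.other_id
    -- ... more joins ...
    WHERE p.period_start BETWEEN '2011-06-01' AND '2011-06-30'
    ORDER BY p.period_start DESC
    LIMIT 100;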

How can I increase the speed of this process?

I think I need a shadow table of ProcessedData, call it ProcessedDataShadow. I would then process the rows in RawData and summarize them in ProcessedDataShadow, and then copy the results into ProcessedData and ProcessedDataDetail; roughly something like the sketch below. What do you think?
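A sketch of the idea (the column names, the hypothetical processed flag, and the aggregation are only illustrative; it assumes a unique key on (sensor_id, period_start) in the shadow table):

    -- 1. Aggregate a batch of raw rows into the shadow table.
    INSERT INTO ProcessedDataShadow (sensor_id, period_start, total, samples)
    SELECT sensor_id,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS period_start,
           SUM(value),
           COUNT(*)
    FROM RawData
    WHERE processed = 0
    GROUP BY sensor_id, period_start
    ON DUPLICATE KEY UPDATE
           total   = total + VALUES(total),
           samples = samples + VALUES(samples);

    -- 2. Periodically copy finished periods into the real table.
    INSERT INTO ProcessedData (sensor_id, period_start, total, samples)
    SELECT sensor_id, period_start, total, samples
    FROM ProcessedDataShadow
    WHERE period_start < NOW() - INTERVAL 1 HOUR;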

(I am developing the project in C++.)

Thank you in advance.

+4
2 answers

Without knowing more about your actual application, I have the following suggestions:

  • Use InnoDB if you are not already. InnoDB uses row-level locks and handles concurrent updates/inserts much better. It will be slower if you do not work concurrently, but the row locking is probably a must for you, depending on how many sources you have feeding RawData.

  • Indexes usually speed things up, but badly chosen indexes can slow things down. I do not think you want to get rid of them, but a lot of indexes can make inserting very slow. You can disable index updates while inserting batches of data so that the indexes are not updated on every single insert (see the sketch after this list).

  • If you will be selecting huge amounts of data in a way that could disturb the data collection, consider using a replicated slave database server that you use for reading only. Even if that locks rows/tables, the primary (master) database will not be affected, and the slave will catch up again as soon as it is free to do so.

  • Do you need to process the data inside the database at all? If possible, collect all the data in the application and insert only the ProcessedData.
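As a sketch of the index point: for MyISAM tables you can suspend index maintenance around a bulk load, while for InnoDB you would instead batch the inserts in a transaction and relax the per-session checks (the statements below are standard MySQL; the table name is taken from the question):

    -- MyISAM: suspend non-unique index maintenance during a bulk load.
    ALTER TABLE RawData DISABLE KEYS;
    -- ... many INSERTs or LOAD DATA INFILE here ...
    ALTER TABLE RawData ENABLE KEYS;   -- indexes are rebuilt once at the end

    -- InnoDB: DISABLE KEYS has no effect, so batch the inserts in one
    -- transaction and relax the per-session checks instead.
    SET unique_checks = 0;
    SET foreign_key_checks = 0;
    START TRANSACTION;
    -- ... batched multi-row INSERTs ...
    COMMIT;
    SET unique_checks = 1;
    SET foreign_key_checks = 1;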

+3

You have not indicated what the structure of the data is, how it is consolidated, how promptly the data needs to be available to users, or how much the consolidation process can be simplified.

However, the most immediate problem will be sinking 5,000 rows per second. You are going to need a very big, very fast machine (probably a sharded cluster).

If possible, I would recommend writing a consolidation buffer (using an in-memory hash table, not one in the DBMS) to hold the consolidated data, even if it is only partially consolidated, and then updating the ProcessedData table from that buffer rather than trying to populate it directly from RawData.
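For example, once the application has a batch of (partially) consolidated rows in its in-memory hash table, it can flush them with one multi-row statement; the column names are illustrative and this assumes a unique key on (sensor_id, period_start):

    INSERT INTO ProcessedData (sensor_id, period_start, total, samples)
    VALUES
      (17, '2011-06-01 10:00:00', 1234.5, 300),
      (17, '2011-06-01 11:00:00',  998.0, 287),
      (42, '2011-06-01 10:00:00',   55.2,  12)
    ON DUPLICATE KEY UPDATE
      total   = total + VALUES(total),
      samples = samples + VALUES(samples);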

In fact, I would probably consider splitting the raw and consolidated data onto separate servers/clusters (the MySQL MERGE engine is handy for providing a single view of the data, as sketched below).
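A minimal sketch of that single-view idea with the MERGE engine, assuming identically defined MyISAM tables on the same server (MERGE cannot span servers; the table and column names are illustrative):

    CREATE TABLE RawData_2011_05 (
      id BIGINT NOT NULL,
      sensor_id INT,
      value DOUBLE,
      created_at DATETIME,
      KEY (sensor_id)
    ) ENGINE=MyISAM;

    CREATE TABLE RawData_2011_06 LIKE RawData_2011_05;

    -- One read-only table that unions both underlying tables.
    CREATE TABLE RawData_all (
      id BIGINT NOT NULL,
      sensor_id INT,
      value DOUBLE,
      created_at DATETIME,
      KEY (sensor_id)
    ) ENGINE=MERGE UNION=(RawData_2011_05, RawData_2011_06) INSERT_METHOD=NO;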

Have you analyzed your queries to work out which indexes you really need? (Hint: this script is very useful for that.)
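Independently of any particular script, running EXPLAIN on the slow queries shows which indexes MySQL actually chooses (the table and column names below are illustrative):

    EXPLAIN
    SELECT p.id, d.detail_value
    FROM ProcessedData AS p
    INNER JOIN ProcessedDataDetail AS d ON d.processed_data_id = p.id
    WHERE p.period_start >= '2011-06-01';
    -- In the output, type = ALL or key = NULL means a full scan with no
    -- usable index; indexes that never show up in the 'key' column are
    -- candidates for removal.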

+2
