Processing very big data with MySQL

Sorry for the long post!

I have a database with ~30 tables (InnoDB engine). Only two of these tables, namely "transaction" and "shift", are quite large (the first has 1.5 million rows and "shift" has 23 thousand rows). Right now everything works fine, and I have no problem with the current database size.

However, we will build a similar database (same data types, design, ...), but much larger; for example, the "transaction" table will have about 1 billion records (about 2.3 million transactions per day), and we are wondering how we should deal with this volume of data in MySQL (for both reads and writes). I have read a lot of related posts to find out whether MySQL (and more specifically the InnoDB engine) can work well with billions of records, but I still have some questions. Some of the related posts I read are as follows:

What I have understood so far about improving performance for very large tables:

  • For InnoDB tables (my case), increase innodb_buffer_pool_size (for example, up to 80% of RAM). I also found some other MySQL performance-tuning settings here on the Percona blog (see the sketch after this list)
  • Have appropriate indexes on the table (checking queries with EXPLAIN)
  • Table partitioning
  • MySQL sharding or clustering
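
For the first two items, here is a minimal sketch of what the tuning and checking could look like (the 25 GB value, the MySQL-version note, and the sample query are illustrative assumptions, not from the original post):

  -- Check the current InnoDB buffer pool size (in bytes).
  SELECT @@innodb_buffer_pool_size;

  -- Illustrative: roughly 80% of a hypothetical 32 GB of RAM.
  -- (Resizable at runtime in MySQL 5.7.5+; in older versions set it in my.cnf and restart.)
  SET GLOBAL innodb_buffer_pool_size = 25 * 1024 * 1024 * 1024;

  -- Verify with EXPLAIN that a typical query can use an index rather than a full scan.
  EXPLAIN
  SELECT id, actual_amount, fuel_cost
  FROM   `transaction`
  WHERE  gas_station_id = 42
    AND  start_fuel_time BETWEEN 1612137600 AND 1614556800;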

Here are my questions and points of confusion:

  • Regarding partitioning, I have doubts about whether we should use it or not. On the one hand, many people suggest it to improve performance when a table is very large. On the other hand, I have read many posts saying that it does not improve query performance and does not speed up query execution (for example, here and here). In addition, I read in the MySQL Reference Manual that InnoDB foreign keys and MySQL partitioning are not compatible (and we have foreign keys)

  • As for indexes, they currently work well, but as I understand it, indexing becomes more restrictive for very large tables (as Kevin Bedell mentions in his answer here). Also, indexes speed up reads while slowing down writes (insert/update). So, for the new, similar project with this large database, should we first insert/load all the data and then create the indexes, to speed up insertion? (See the sketch after this list.)

  • If we cannot use partitioning for our large table (the transaction table), what is an alternative way to improve performance (apart from MySQL variables such as innodb_buffer_pool_size)? Should we use MySQL Cluster? (We also have many joins.)
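
On the "load first, index later" question in the second bullet above, a rough sketch of what that could look like for the transaction table (the file path is hypothetical, and whether this actually pays off depends on the MySQL version and how the data is fed):

  -- Assume the table was created with only the PRIMARY KEY and the UNIQUE KEY
  -- needed for integrity, and the bulk data arrives as a CSV file (hypothetical path).
  LOAD DATA INFILE '/tmp/transactions.csv'
  INTO TABLE `transaction`
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

  -- Add the secondary indexes afterwards, in a single ALTER so the table
  -- is rebuilt as few times as possible.
  ALTER TABLE `transaction`
    ADD INDEX start_fuel_time_idx (start_fuel_time),
    ADD INDEX fuel_terminal_idx   (fuel_terminal_id),
    ADD INDEX gas_station_id      (gas_station_id);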

EDIT

Here is the SHOW CREATE TABLE statement for our largest table, named "transaction":

  CREATE TABLE `transaction` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `terminal_transaction_id` int(11) NOT NULL,
    `fuel_terminal_id` int(11) NOT NULL,
    `fuel_terminal_serial` int(11) NOT NULL,
    `xboard_id` int(11) NOT NULL,
    `gas_station_id` int(11) NOT NULL,
    `operator_id` text NOT NULL,
    `shift_id` int(11) NOT NULL,
    `xboard_total_counter` int(11) NOT NULL,
    `fuel_type` int(11) NOT NULL,
    `start_fuel_time` int(11) NOT NULL,
    `end_fuel_time` int(11) DEFAULT NULL,
    `preset_amount` int(11) NOT NULL,
    `actual_amount` int(11) DEFAULT NULL,
    `fuel_cost` int(11) DEFAULT NULL,
    `payment_cost` int(11) DEFAULT NULL,
    `purchase_type` int(11) NOT NULL,
    `payment_ref_id` text,
    `unit_fuel_price` int(11) NOT NULL,
    `fuel_status_id` int(11) DEFAULT NULL,
    `fuel_mode_id` int(11) NOT NULL,
    `payment_result` int(11) NOT NULL,
    `card_pan` text,
    `state` int(11) DEFAULT NULL,
    `totalizer` int(11) NOT NULL DEFAULT '0',
    `shift_start_time` int(11) DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `terminal_transaction_id` (`terminal_transaction_id`,`fuel_terminal_id`,`start_fuel_time`) USING BTREE,
    KEY `start_fuel_time_idx` (`start_fuel_time`),
    KEY `fuel_terminal_idx` (`fuel_terminal_id`),
    KEY `xboard_idx` (`xboard_id`),
    KEY `gas_station_id` (`gas_station_id`) USING BTREE,
    KEY `purchase_type` (`purchase_type`) USING BTREE,
    KEY `shift_start_time` (`shift_start_time`) USING BTREE,
    KEY `fuel_type` (`fuel_type`) USING BTREE
  ) ENGINE=InnoDB AUTO_INCREMENT=1665335 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT
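
For reference, if the foreign keys were dropped and partitioning were used anyway, MySQL also requires every unique key (including the primary key) to contain the partitioning column. A hypothetical range-partitioned variant of the table above might look like the following sketch (the name transaction_partitioned and the monthly cut-off values are illustrative, not part of our design):

  -- Hypothetical sketch: partition by month of start_fuel_time (a Unix timestamp).
  -- The PRIMARY KEY is extended to include the partition column, and any
  -- FOREIGN KEY constraints would have to be dropped first.
  CREATE TABLE `transaction_partitioned` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `terminal_transaction_id` int(11) NOT NULL,
    `fuel_terminal_id` int(11) NOT NULL,
    `start_fuel_time` int(11) NOT NULL,
    -- ... remaining columns as in the original table ...
    PRIMARY KEY (`id`,`start_fuel_time`),
    UNIQUE KEY `terminal_transaction_id` (`terminal_transaction_id`,`fuel_terminal_id`,`start_fuel_time`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8
  PARTITION BY RANGE (start_fuel_time) (
    PARTITION p2021_01 VALUES LESS THAN (1612137600),  -- up to 2021-02-01
    PARTITION p2021_02 VALUES LESS THAN (1614556800),  -- up to 2021-03-01
    PARTITION pmax     VALUES LESS THAN MAXVALUE
  );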

Thank you for your time,

2 answers
  • Can MySQL reasonably execute queries over billions of rows? - MySQL can handle billions of rows. "Reasonably" depends on the queries; let's see them.

  • Is InnoDB (MySQL 5.5.8) the right choice for multi-billion-row tables? - 5.7 has some improvements, but 5.5 is pretty good, despite being nearly 6 years old and close to end of support.

  • Best data store for billions of rows - if you mean the storage engine, then InnoDB.

  • How big can a MySQL database get before performance starts to degrade? - Again, it depends on the queries. I can show you a 1K-row table that will melt down; I have worked with billion-row tables that hum along.

  • Why might MySQL be slow with large tables? - Range scans lead to I/O, which is the slow part.

  • Can MySQL handle tables with about 300 million records? - Again, yes. The limit is somewhere around a trillion rows.

  • For InnoDB tables (your case), increasing innodb_buffer_pool_size (for example, up to 80% of RAM), plus the other MySQL performance-tuning settings on the Percona blog - yes.

  • Having appropriate indexes on the table (checking queries with EXPLAIN) - well, let's see them. There are many mistakes that can be made in this critical area.

  • Table partitioning - "Partitioning is not a panacea!" I discuss this on my blog.

  • MySQL sharding - currently DIY (do it yourself).

  • MySQL clustering - currently the best answer is some Galera-based option (PXC, MariaDB 10, DIY with Oracle).

  • Partitioning does not support FOREIGN KEY or "global" UNIQUE constraints.

  • UUIDs, at the scale you are talking about, will not just slow the system down but actually kill it. Type 1 UUIDs may be a workaround.

  • Insertion speed and index-building speed - there are too many variables to give a single answer. Let's see your tentative CREATE TABLE and how you intend to feed the data.

  • Many joins - "Normalize, but don't over-normalize." In particular, do not normalize datetimes, floats, or other "continuous" values.

  • Build summary tables.

  • 2.3 million transactions per day - if that is 2.3M inserts (roughly 30/sec), there is not much of a performance problem. If it is more complex, RAID, SSDs, batching, etc. may be needed (see the sketch after this list).

  • Dealing with that amount of data - if most of the activity involves the "recent" rows, the buffer_pool will effectively cache that activity, thereby avoiding I/O. If the activity is "random", then MySQL (or anyone else) will have I/O problems.
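
As an illustration of the batching point above (all values are made up), grouping rows into multi-row INSERT statements inside one transaction is usually much cheaper than issuing 30 separate single-row inserts per second:

  -- Hypothetical: buffer ~100 rows on the application side and write them in one statement.
  START TRANSACTION;
  INSERT INTO `transaction`
      (terminal_transaction_id, fuel_terminal_id, fuel_terminal_serial, xboard_id,
       gas_station_id, operator_id, shift_id, xboard_total_counter, fuel_type,
       start_fuel_time, preset_amount, purchase_type, unit_fuel_price,
       fuel_mode_id, payment_result)
  VALUES
      (1001, 7, 94521, 3, 42, 'op-17', 88, 5512, 1, 1612137601, 50000, 2, 11500, 1, 0),
      (1002, 7, 94521, 3, 42, 'op-17', 88, 5513, 1, 1612137655, 30000, 1, 11500, 1, 0);
      -- ... more rows per batch ...
  COMMIT;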


When collecting billions of rows, it is better (when possible) to consolidate, process, and summarize the data before storing it. Keep the raw data in a file in case you need to go back to it.

Doing this will resolve most of your questions and concerns, and will speed up processing.
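
As a loose sketch of that idea, building on the "transaction" schema in the question (the summary table, the staging table transaction_staging, and the chosen grouping are all hypothetical): keep one row per gas station, fuel type, and day, and roll the raw feed up into it.

  -- Hypothetical daily summary table.
  CREATE TABLE `daily_fuel_summary` (
    `gas_station_id` int(11) NOT NULL,
    `fuel_type`      int(11) NOT NULL,
    `sale_date`      date    NOT NULL,
    `tx_count`       int(11) NOT NULL,
    `total_amount`   bigint  NOT NULL,
    `total_cost`     bigint  NOT NULL,
    PRIMARY KEY (`gas_station_id`,`fuel_type`,`sale_date`)
  ) ENGINE=InnoDB;

  -- Roll up one day of raw rows (e.g. from a staging table loaded from the raw file).
  INSERT INTO `daily_fuel_summary`
  SELECT   gas_station_id,
           fuel_type,
           DATE(FROM_UNIXTIME(start_fuel_time)) AS sale_date,
           COUNT(*),
           IFNULL(SUM(actual_amount), 0),
           IFNULL(SUM(fuel_cost), 0)
  FROM     transaction_staging
  WHERE    start_fuel_time >= UNIX_TIMESTAMP('2021-02-01')
    AND    start_fuel_time <  UNIX_TIMESTAMP('2021-02-02')
  GROUP BY gas_station_id, fuel_type, sale_date;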

