Database design advice needed

I'm a lone developer for a telecommunications company, and I'm after some tips on database design from anyone who has time to respond.

I insert ~2 million rows a day into one table; these tables are then archived and compressed monthly. Each monthly table contains ~15,000,000 rows, and that figure is increasing every month.
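To give a feel for it, the monthly archive-and-compress step could be done with something along these lines (a sketch only; `call_records` and the archive naming are made up, and MySQL's ARCHIVE engine is just one possible compression route):

    -- Start a fresh table for the new month, keeping the old one
    CREATE TABLE call_records_new LIKE call_records;
    RENAME TABLE call_records     TO call_records_2009_06,
                 call_records_new TO call_records;

    -- Compress the archived month (ARCHIVE supports no secondary
    -- indexes, so any would need dropping first)
    ALTER TABLE call_records_2009_06 ENGINE = ARCHIVE;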

For each of the inserts above, I combine the data from rows that belong together and populate another "correlated" table. This table is not currently archived, as I need to make sure I never miss an update to a correlated row. (I hope that makes sense.) In general, though, this data should remain fairly static after a few days of processing.
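For illustration, that per-insert merge could be an upsert along these lines (all names are hypothetical, the ? placeholders stand for prepared-statement parameters, and correlated.call_id is assumed to be a PRIMARY or UNIQUE key):

    -- Fold each new raw row into its matching correlated row
    INSERT INTO correlated (call_id, total_duration, row_count, last_seen)
    VALUES (?, ?, 1, NOW())
    ON DUPLICATE KEY UPDATE
        total_duration = total_duration + VALUES(total_duration),
        row_count      = row_count + 1,
        last_seen      = NOW();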

All of the above works fine. However, my company now wants to run some statistics against this data, and these tables are getting too large to return results in what would be considered a reasonable time, even with the appropriate indexes.

So, after all of the above, I think my question is fairly simple: should I write a script that groups the data from my correlated table into smaller tables, or should I store the query result sets in something like memcached? I already use MySQL's query cache, but it doesn't work well for me because I have limited control over how long the data stays cached.

The main advantages I can see in using memcached:

  • It doesn't lock my correlated table once a query has been cached.
  • Greater flexibility for sharing collected data between the backend collector and the frontend processor (i.e., user reports can be generated on the backend and the results stored in the cache under a key, which is then shared with anyone who wants to view that report's data).
  • Redundancy and scalability if we start sharing this data with a large number of customers.

The main disadvantages I can see in using something like memcached:

  • Data is not persistent if the machine is rebooted or the cache is cleared.

The main advantages I can see in using MySQL:

  • Data is persistent.
  • Fewer code changes (although adding something like memcached is trivial anyway).

The main disadvantages I can see in using MySQL:

  • I need to define a table schema every time I want to save a new set of grouped data.
  • I have to write a program that iterates over the correlated data and populates these new tables.
  • The new tables would potentially still grow, albeit more slowly, as the data fills up.

I apologize for the rather long question. In any case, writing these thoughts down has helped me, and any advice/help/experience with similar problems would be greatly appreciated.

Many thanks.

Alan

sql mysql caching memcached
4 answers

In addition to the options you discuss above, you could also consider throwing more powerful hardware at the problem, if that's an option.

This bit of your question suggests that the main issue here is the speed of the results:

"However, my company now wants to run some statistics against this data, and these tables are getting too large to return results in what would be considered a reasonable time."

In situations where speed of results matters, throwing better/additional hardware at the problem can often work out cheaper than developing new code, database structures, and so on.

Just a thought!


(Another answer from me, different enough that I'll post it separately.)

Two questions:

  • What statistics do you want to produce?
  • After rows are inserted into the database, do they ever change?

If the data doesn't change after insertion, you could build a separate "statistics" table that you modify/update as new rows are inserted, or possibly shortly afterwards.

e.g. things like:

  • When a new row relating to stat 'B' is inserted, go and increment the count in another table for stat 'B', minute 'Y'; or
  • Every hour, run a small query over the rows inserted in the last hour, generate the statistics for that hour, and store them separately; or
  • As above, but every minute, etc.

It's difficult to be more specific without knowing the details, but depending on the statistics you're after, these approaches may help; a rough sketch of the hourly variant is below.
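Something like this, run from cron once an hour (all names are hypothetical, and stats_hourly is assumed to have a unique key on (stat, stat_hour) so re-runs simply overwrite):

    -- Summarise the rows inserted during the previous full hour
    INSERT INTO stats_hourly (stat, stat_hour, row_count, total_duration)
    SELECT stat,
           DATE_FORMAT(inserted_at, '%Y-%m-%d %H:00:00') AS stat_hour,
           COUNT(*),
           SUM(duration)
    FROM correlated
    WHERE inserted_at >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
      AND inserted_at <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
    GROUP BY stat, stat_hour
    ON DUPLICATE KEY UPDATE
        row_count      = VALUES(row_count),
        total_duration = VALUES(total_duration);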


If you want to do statistical analysis on data that is a few days old, you might want to consider using something like an OLAP system.

Basically, this type of system stores intermediate aggregates in its own format, which lets you run fast SUM(), AVG(), COUNT()... queries over a large table.
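If a full OLAP product is overkill, you can approximate the idea in MySQL itself with a pre-aggregated table plus WITH ROLLUP. A sketch, with invented table and dimension names:

    -- Build the aggregate once (e.g. nightly), then report
    -- from the small table instead of the raw data
    CREATE TABLE daily_cube AS
    SELECT call_day, region, call_type,
           COUNT(*)      AS calls,
           SUM(duration) AS total_duration
    FROM correlated
    GROUP BY call_day, region, call_type;

    -- Per-region subtotals plus a grand total in one pass, OLAP-style
    SELECT region, call_type,
           SUM(calls)          AS calls,
           SUM(total_duration) AS total_duration
    FROM daily_cube
    GROUP BY region, call_type WITH ROLLUP;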

I think your question is a great example of a situation where one would be used, but maybe I only think so because it's my job. =)

Take a look.


I work for a company in a similar situation, with millions of inserts every month.

We adopted the strategy of summarizing the data into smaller tables, grouped by certain fields.

In our case, each insert fires a trigger that classifies the inserted tuple and increments the summary tables.
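In MySQL terms the trigger could look something like this (a sketch with invented names, not our exact code; summary_by_type is assumed to have a unique key on (call_type, call_day)):

    DELIMITER //
    CREATE TRIGGER correlated_ai AFTER INSERT ON correlated
    FOR EACH ROW
    BEGIN
        -- Classify the new tuple and bump the matching summary row
        INSERT INTO summary_by_type (call_type, call_day, calls, total_duration)
        VALUES (NEW.call_type, DATE(NEW.inserted_at), 1, NEW.duration)
        ON DUPLICATE KEY UPDATE
            calls          = calls + 1,
            total_duration = total_duration + NEW.duration;
    END //
    DELIMITER ;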

From time to time, we move the oldest rows into a backup table, which keeps the growth of the main table in check.
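The move itself can be as simple as an insert-then-delete pair (sketch only; the names and the 90-day cut-off are made up, and InnoDB is assumed so the transaction actually protects the pair):

    -- Fix the cut-off once so both statements agree on it
    SET @cutoff = NOW() - INTERVAL 90 DAY;

    START TRANSACTION;
    INSERT INTO correlated_backup
        SELECT * FROM correlated WHERE inserted_at < @cutoff;
    DELETE FROM correlated WHERE inserted_at < @cutoff;
    COMMIT;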

