Storing large amounts of data in a database

I am currently working on a home automation project that provides the user with the ability to view their energy consumption over a period of time. We currently request data every 15 minutes, and we expect about 2,000 users for our first big pilot.

My boss has asked us to store at least six months of data. A quick calculation puts that at about 35 million records. Although these records are small (about 500 bytes each), I still wonder whether storing this data in our database (Postgres) is the right solution.
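For reference, the back-of-the-envelope arithmetic behind that figure (assuming one reading per user every 15 minutes and roughly 180 days of retention) looks like this:

```python
# Rough record count for the pilot (assumptions: 2,000 users,
# one reading every 15 minutes, about 180 days of retention).
users = 2_000
readings_per_day = 24 * 60 // 15   # 96 readings per user per day
days = 180

records = users * readings_per_day * days
print(f"{records:,} records")      # 34,560,000 -> roughly 35 million
```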

Does anyone have good reference material and/or advice on how to handle this amount of data?

+4
6 answers

Currently, 35M records of 0.5 KB each is about 17.5 GB of data. That is fine for a database in your pilot, but you should also think about the step after the pilot. Your boss will not be happy if the pilot is a big success and you then have to tell him that you cannot add 100,000 users in the coming months without redoing everything. And what about a new feature that lets VIP users record data every minute...
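The same arithmetic, extended to the 100,000-user scenario (still assuming the question's 500 bytes per record), gives an idea of the next step:

```python
# Storage estimate, assuming ~500 bytes per record as stated in the question.
record_bytes = 500

pilot_records = 35_000_000
print(pilot_records * record_bytes / 1e9)   # 17.5 -> about 17.5 GB for the pilot

# Hypothetical next step: 100,000 users, six months of 15-minute readings.
next_records = 100_000 * 96 * 180           # 1,728,000,000 records
print(next_records * record_bytes / 1e9)    # 864.0 -> getting close to a terabyte
```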

This is a complex issue, and the choices you make will limit the evolution of your software.

Keep it simple for the pilot so you can ship the product as cheaply as possible → the database is fine. But tell your boss that you cannot open the service to the public like that, and that you will have to change things before taking on 10,000 new users per week.

One idea for the next release: use several data stores, one for your frequently updated user data, one for your query/statistics system, and so on.

You can look at RRD (round-robin database tools) for the next version.

Also keep the update rate in mind: 2,000 users sending data every 15 minutes means about 2.2 updates per second → fine; 100,000 users sending data every 5 minutes means about 333 updates per second. I'm not sure a plain database can keep up with that, and a single web-service server definitely cannot.
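Those rates are just users divided by the reporting interval; a tiny sketch:

```python
def updates_per_second(users: int, interval_minutes: int) -> float:
    """Average write rate if every user reports once per interval."""
    return users / (interval_minutes * 60)

print(updates_per_second(2_000, 15))    # ~2.2 writes/s  -> easy
print(updates_per_second(100_000, 5))   # ~333.3 writes/s -> a different problem
```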

+4

We come across tables like this all the time. Obviously structure your indexes based on usage (do you read a lot, write a lot, and so on), and from the very beginning think about partitioning the tables based on some high-level grouping of the data.
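As a rough sketch of what partitioning on a high-level grouping could look like in PostgreSQL (monthly range partitions plus an index matching per-user time-range reads; all table and column names here are invented, and declarative partitioning needs PostgreSQL 10 or later):

```python
import psycopg2

# Hypothetical schema: monthly range partitions plus an index that matches the
# dominant query (per-user, per-time-range reads). All names are invented.
DDL = """
CREATE TABLE readings (
    user_id     integer     NOT NULL,
    recorded_at timestamptz NOT NULL,
    value       real        NOT NULL
) PARTITION BY RANGE (recorded_at);

CREATE TABLE readings_2024_01 PARTITION OF readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE INDEX ON readings (user_id, recorded_at);
"""

conn = psycopg2.connect("dbname=energy")    # connection string is an assumption
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```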

In addition, you can implement some kind of archiving to keep the live table lean. Historical records are either never touched or only read for reporting, and neither of those is a reason for them to stay in the live table, in my opinion.
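A minimal sketch of that archiving idea, assuming a hypothetical readings table and a readings_archive table with the same columns, run from a scheduled job:

```python
import psycopg2

# Move rows older than six months from the live table into an archive table
# with the same columns (both table names are assumptions).
ARCHIVE_SQL = """
WITH moved AS (
    DELETE FROM readings
    WHERE recorded_at < now() - interval '6 months'
    RETURNING *
)
INSERT INTO readings_archive SELECT * FROM moved;
"""

conn = psycopg2.connect("dbname=energy")    # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute(ARCHIVE_SQL)
```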

It is worth noting that we have tables of around 100 million records and do not consider them a performance problem. Many of these performance improvements can be made with little pain later on, so you could always start with the common-sense solution and tune only when performance proves to be poor.

+4

First of all, I suggest you run a performance test: write a program that generates test records matching the number you will see after six months, insert them, and check whether query results come back fast enough. If not, look at indexing as suggested in the other answers. It is also worth testing write performance, to make sure you can actually insert the amount of data generated in 15 minutes in 15 minutes or less.
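A sketch of such a load test using synthetic data and the COPY path in psycopg2 (the readings_test table and its columns are assumptions):

```python
import io
import random
import time

import psycopg2

def fake_rows(n: int, users: int = 2_000):
    """Yield n tab-separated synthetic readings: user id, epoch seconds, value."""
    base = 1_700_000_000                      # arbitrary starting timestamp
    for i in range(n):
        yield f"{random.randrange(users)}\t{base + 900 * i}\t{random.random():.3f}\n"

conn = psycopg2.connect("dbname=energy")      # assumed connection string
with conn, conn.cursor() as cur:
    start = time.monotonic()
    buf = io.StringIO("".join(fake_rows(100_000)))
    cur.copy_from(buf, "readings_test", columns=("user_id", "epoch", "value"))
    print(f"{100_000 / (time.monotonic() - start):,.0f} rows/s")
```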

Running the test will help you avoid the mother of all problems: assumptions :-)

Also think about production scale: your pilot will have 2,000 users, but will your production environment have 4,000 users, or 200,000 users, in a year or two?

If we are talking about a really large environment, you need to think about a solution that scales by adding more nodes, instead of relying on always being able to add more CPU, disk, and memory to the same machine. You can do this in your application by tracking which of several database machines holds a particular user's data, you can use one of the PostgreSQL clustering methods, or you can go a completely different way with NoSQL, where you leave the RDBMS behind and use systems built to scale horizontally.
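One way to do the "track which database machine holds a particular user's data" part in the application is simple hash-based sharding. A sketch, with made-up DSNs and shard count:

```python
import hashlib

# Hypothetical shard map: each entry is the DSN of one PostgreSQL machine.
SHARDS = [
    "dbname=energy host=db0",
    "dbname=energy host=db1",
    "dbname=energy host=db2",
]

def shard_for(user_id: int) -> str:
    """Pick the database that holds this user's readings, deterministically."""
    digest = hashlib.sha1(str(user_id).encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

print(shard_for(42))    # every query for user 42 goes to this one machine
```

In practice you would want consistent hashing or a lookup table, so that adding a node does not force you to reshuffle every user's data.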

There are a number of such systems; I only have personal experience with Cassandra. You have to think in a completely different way from what you are used to in the RDBMS world, which takes some effort: think about how you want to access the data rather than how to store it. For your example, storing the data with the user ID as the row key and one column per sample, where the column name is the timestamp and the column value is your reading for that timestamp, would make sense. You can then query slices of those columns, for example to get the points for a graph in the web interface. Cassandra's response times are good enough for user-facing applications.
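For reference, a sketch of that data model expressed in modern CQL terms with the DataStax Python driver (the wide-row layout described above maps onto a partition key plus clustering key; the keyspace, table name, and contact point are assumptions):

```python
from datetime import datetime

from cassandra.cluster import Cluster    # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("energy")   # assumed contact point and keyspace

# Partition key = user id, clustering key = timestamp: one wide row per user,
# one column per sample, which is the layout described above.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        user_id int,
        ts      timestamp,
        value   double,
        PRIMARY KEY (user_id, ts)
    )
""")

# Slice query for the web UI graph: one user's samples in a time range.
rows = session.execute(
    "SELECT ts, value FROM readings WHERE user_id = %s AND ts >= %s AND ts < %s",
    (42, datetime(2024, 1, 1), datetime(2024, 2, 1)),
)
```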

The payoff for investing in learning and using a NoSQL system is that when you need more space, you simply add a new node. The same goes for more write performance or more read performance.

+1

With appropriate indexes to avoid slow queries, I would not expect any decent RDBMS to have trouble with a data set of that size. Plenty of people use PostgreSQL to handle far more data than this.

This is what databases are made for :)

0

Do you really need to keep the individual samples for the full period? You could implement some kind of consolidation mechanism that combines the samples into a single weekly or monthly record, and run that consolidation on a schedule.
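A minimal sketch of such a scheduled consolidation step, assuming hypothetical readings and readings_weekly tables (and a unique constraint on user_id plus week in the latter):

```python
import psycopg2

# Roll 15-minute samples up into one weekly record per user; the raw rows older
# than the cut-off can then be deleted. Run this from cron or any scheduler.
ROLLUP_SQL = """
INSERT INTO readings_weekly (user_id, week, avg_value, sample_count)
SELECT user_id,
       date_trunc('week', recorded_at) AS week,
       avg(value),
       count(*)
FROM   readings
WHERE  recorded_at < date_trunc('week', now())
GROUP  BY user_id, date_trunc('week', recorded_at)
ON CONFLICT (user_id, week) DO NOTHING;
"""

conn = psycopg2.connect("dbname=energy")    # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute(ROLLUP_SQL)
```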

Your decision should depend on the types of queries that need to be run against the database.

0

There are many ways to attack this problem; you only get good performance if you keep the number of live records to a minimum. In your case, you can use the following techniques.

  • Try to move old data into a separate table. You can use table partitioning for this, or take a different approach and store the old data in the file system, serving it directly from your application without touching the database, so the database stays free for live data. I do this in one of my projects, which already has more than 50 GB of data, and it runs very smoothly.
  • Try indexing the table columns, but be careful, as indexes slow down inserts.
  • Try batching your insert and select queries; it deals with this problem very well (see the sketch after this list). For example, suppose you receive one insert request per second: instead of hitting the database each time, collect the requests and write them in batches of 5, so you only hit the database every 5 seconds. Yes, users may have to wait a few seconds for their record to be committed, much like Gmail shows a "sending..." state when you send an email. For selects, you can periodically write the result sets to the file system and serve them directly to users without touching the database, the way many stock-market data providers do.
  • You can also use an ORM such as Hibernate; its caching layers can speed up access to your data.
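A sketch of the batching idea from the third point above: an in-memory buffer that is flushed either when it reaches a size limit or after a few seconds (the table name, thresholds, and connection string are all made up):

```python
import time

import psycopg2

class BatchedWriter:
    """Buffer incoming readings and write them to the database in batches."""

    def __init__(self, conn, max_rows: int = 5, max_age_seconds: float = 5.0):
        self.conn = conn
        self.max_rows = max_rows
        self.max_age = max_age_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, user_id, recorded_at, value):
        self.buffer.append((user_id, recorded_at, value))
        too_old = time.monotonic() - self.last_flush > self.max_age
        if len(self.buffer) >= self.max_rows or too_old:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with self.conn, self.conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO readings (user_id, recorded_at, value) VALUES (%s, %s, %s)",
                self.buffer,
            )
        self.buffer = []
        self.last_flush = time.monotonic()

writer = BatchedWriter(psycopg2.connect("dbname=energy"))    # assumed DSN and table
```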

For any follow-up questions, you can send me an email at ranjeet1985@gmail.com

0
