How to efficiently store and query a billion rows of sensor data

Situation: I started a new job and was asked to figure out what to do with our sensor data table. It contains 1.3 billion rows of sensor data. The data is quite simple: basically just the sensor identifier, a timestamp, and the sensor's value at that point in time (a double).

Currently, data is stored in a table in the MSSQL server database.

By the end of this year, I expect the number of rows to increase to 2-3 billion.

I'm looking for the best way to store and retrieve this data (queried by date). Since this is edging into "big data" territory and I have no real experience managing such large data sets, I'm asking here for any pointers.

This is not a big company, and our resources are not unlimited ;)

Additional information about our use case:

  • Data is displayed in graphs showing sensor values over time.
  • We plan to build an API so that our customers can fetch sensor data for any period of time they are interested in (... data from 2 years ago is as relevant as data from the last month).

My research so far has led me to consider the following solutions:

  • Keep the data in SQL Server

    but partition the table (it is not partitioned right now). This requires the Enterprise edition of SQL Server, which is expensive.

  • Move the data to Azure SQL Database.

    There we get the partitioning feature for much less money, but once our database grows beyond 250 GB it costs a lot more (and way too much beyond 500 GB).

  • Use multiple databases

    We could use one database per client. A few smaller databases will be cheaper than one huge database, but we have many clients and plan for more, so I don't really like the thought of managing all those databases.

  • Azure Table Storage

    This is the option I like best. We can partition the data by company/sensor/year/month, use the date/time as the row key and store the sensor value (a sketch of this layout follows after this list).

    I have not had time to test query performance yet, but from what I have read it should be fine. There is one significant drawback, though: only 1,000 items are returned per HTTP request. If we need to fetch all sensor data for a week, we have to make a lot of HTTP requests. I am not sure yet how big a problem that is for our use case.

  • Azure HDInsight (Hadoop on Azure)

    As mentioned, I have no big data experience, and I currently do not understand Hadoop well enough to know whether it fits our case (exposing sensor data for a given period of time through an API). Should I dig in and learn, or is my time better spent pursuing another alternative?
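
To make the Table Storage option above concrete, here is a rough sketch (not my actual code) of the company/sensor/year/month partition key with the timestamp as the row key, using the `azure-data-tables` Python package. The connection string, table name and the `acme`/`sensor42` identifiers are all placeholders.

```python
from datetime import datetime, timezone
from azure.data.tables import TableServiceClient

# Hypothetical connection string and table name.
service = TableServiceClient.from_connection_string("<connection-string>")
table = service.create_table_if_not_exists("SensorData")

def make_entity(company: str, sensor: str, ts: datetime, value: float) -> dict:
    """One reading per entity: partition by company/sensor/month, row key = timestamp."""
    return {
        # The partition key groups one sensor-month together, so a week or month
        # query stays inside a single partition.
        "PartitionKey": f"{company}_{sensor}_{ts:%Y%m}",
        # The row key must be unique within the partition; an ISO timestamp sorts correctly.
        "RowKey": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Value": value,
    }

entity = make_entity("acme", "sensor42", datetime.now(timezone.utc), 21.7)
table.create_entity(entity)
```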

Does anyone have experience with a similar case? What works for you? Keep in mind that price matters, and a "simple" solution may be preferable to a very complex one, even if the complex one performs a few seconds better.

UPDATE 1: To answer some of the questions in the comments below.

  • There are about 12,000 sensors that can each report a value every 15 seconds. That amounts to roughly 70 million values per day (the arithmetic is sketched after this list). In practice not all of these sensors are actually reporting, so we do not get that much data every day, but since we naturally want to grow with more clients and sensors, I really need a solution that can scale to many millions of sensor values per day.
  • Partitioning the data, as well as using multiple databases and/or multiple tables, is something I have considered, yes, but I see those as fallbacks if/when I have exhausted the other options.
  • I have read a bit more about HBase, http://opentsdb.net/ and Google's https://cloud.google.com/bigtable/ , and it seems that Hadoop could at least be a real alternative.
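
For reference, the back-of-the-envelope arithmetic behind the ~70 million figure, assuming every sensor reports on every 15-second interval:

```python
sensors = 12_000
interval_s = 15

readings_per_sensor_per_day = 24 * 60 * 60 // interval_s   # 5,760
readings_per_day = sensors * readings_per_sensor_per_day   # 69,120,000, i.e. ~70 million

# At 16 bytes per reading (4-byte id + 4-byte timestamp + 8-byte double),
# that is roughly 1.1 GB of raw data per day before indexes and overhead.
bytes_per_day = readings_per_day * 16
print(readings_per_day, bytes_per_day)
```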

UPDATE 2: Today I experimented a bit with both Azure Table Storage and HDInsight (HDI). We do not need much "query flexibility", so I think Azure Table Storage looks promising. It is a bit slow to pull data out because of the 1,000-item limit per query, as I mentioned, but in my tests I found it fast enough for our use cases.
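
To illustrate the 1,000-item page size: the Python SDK's paged iterator follows continuation tokens for you, so a week of data is still one logical query (it just turns into several HTTP round trips underneath). Again a sketch with made-up names, not my actual code:

```python
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="SensorData")

def read_range(partition_key: str, start_rk: str, end_rk: str):
    """Read all readings for one partition between two row keys (ISO timestamps)."""
    flt = (
        f"PartitionKey eq '{partition_key}' "
        f"and RowKey ge '{start_rk}' and RowKey lt '{end_rk}'"
    )
    # query_entities returns a paged iterator; the 1,000-item limit only caps each
    # underlying page, and the iterator keeps fetching pages until the range is done.
    return list(table.query_entities(query_filter=flt, results_per_page=1000))

week = read_range("acme_sensor42_201601", "2016-01-04T00:00:00Z", "2016-01-11T00:00:00Z")
print(len(week), "readings")
```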

I also stumbled upon OpenTSDB, and that is what led me to try HDI in the first place. Following the Azure tutorial ( https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-tutorial-get-started/ ) I was able to quickly store a million records and test some queries. It was much faster than Azure Table Storage. I could even pull 300,000 records in a single HTTP request (which took 30 seconds).
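
For comparison, if the HBase/HDI route ended up going through OpenTSDB, reads and writes would go over its HTTP API. A minimal sketch (endpoints as documented by OpenTSDB; the host, metric and tag names here are made up):

```python
import time
import requests

TSDB = "http://my-opentsdb-host:4242"  # hypothetical host; 4242 is the default port

# Write one data point: metric + timestamp + value + tags.
requests.post(f"{TSDB}/api/put", json={
    "metric": "sensor.value",
    "timestamp": int(time.time()),
    "value": 21.7,
    "tags": {"company": "acme", "sensor": "sensor42"},
})

# Query the last week of data for one sensor, downsampled to hourly averages.
resp = requests.post(f"{TSDB}/api/query", json={
    "start": "1w-ago",
    "queries": [{
        "aggregator": "avg",
        "metric": "sensor.value",
        "downsample": "1h-avg",
        "tags": {"sensor": "sensor42"},
    }],
})
print(resp.json())
```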

But it costs quite a bit more than Azure Table Storage, and I think I can optimize my code to improve query performance with Azure Table Storage (a finer-grained partition key and running queries in parallel, as sketched below). So right now I am leaning towards Azure Table Storage because of its simplicity, price, and "good enough" performance.
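
The two optimisations I mean (finer partition keys plus parallel queries) would look roughly like this: split the requested range into per-partition sub-queries and run them on a thread pool. The key scheme is the same assumption as above, and `read_range` is the helper from the earlier sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def month_partitions(company: str, sensor: str, months: list[str]) -> list[str]:
    """E.g. months = ["201601", "201602"] -> one partition key per month."""
    return [f"{company}_{sensor}_{m}" for m in months]

def read_parallel(partitions, start_rk, end_rk, workers: int = 8):
    # Each partition is queried independently, so the requests run in parallel
    # instead of paging sequentially through one long scan.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_range, pk, start_rk, end_rk) for pk in partitions]
        results = []
        for f in futures:
            results.extend(f.result())
    return results

data = read_parallel(
    month_partitions("acme", "sensor42", ["201601", "201602"]),
    "2016-01-01T00:00:00Z", "2016-03-01T00:00:00Z",
)
```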

I will soon present my findings to an external consultant, so I am very curious to hear his take on things.

sql-server hadoop bigdata azure-table-storage hdinsight
3 answers

So, you will have 3 billion records by the end of this year (which has only just begun). Each record is a 4-byte ID + 4-byte datetime + 8-byte double value, which is 3 * 10^9 * (4 + 4 + 8) == 48 GB.

You could easily store and process those 48 GB in an in-memory database such as Redis, Couchbase, Tarantool or Aerospike. All of them are open source, so you do not need to pay a license fee.
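
Something like this minimal Redis sketch (redis-py; host and key names are placeholders) is what I mean: one sorted set per sensor, scored by the Unix timestamp, so a date-range query is a single ZRANGEBYSCORE.

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # hypothetical server

def store(sensor_id: int, ts: int, value: float) -> None:
    # One sorted set per sensor; the member encodes timestamp:value so entries stay
    # unique, and the score (the timestamp) gives us range queries by date.
    r.zadd(f"sensor:{sensor_id}", {f"{ts}:{value}": ts})

def query(sensor_id: int, start_ts: int, end_ts: int):
    raw = r.zrangebyscore(f"sensor:{sensor_id}", start_ts, end_ts)
    return [(int(ts), float(v)) for ts, v in (m.split(":") for m in raw)]

now = int(time.time())
store(42, now, 21.7)
print(query(42, now - 3600, now))
```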

In-memory storage may add 10-30% overhead, so the 48 GB can grow to 64 GB or a little more. You should feed these databases with your real data to pick the most economical one for your case.

A single physical machine should be sufficient for the entire workload, since in-memory databases can handle 100K-1M queries/updates per second per node (the actual number depends on your specific workload pattern). For better availability I would set up two servers: a master and a slave.

In my experience, a physical server with 64 GB of RAM on board costs $2-3K. Note that you do not even need an SSD. Spinning disks should be perfectly fine, because all reads hit RAM and all writes only append to the transaction log. That is how in-memory databases work. I can elaborate on this if you have any questions.


So, I have used all of the technologies you list in one way or another. What kinds of queries do you need to serve? Depending on that, you can rule some options out. If you do not need to query in many different ways, Table Storage may work well for you: it scales well if you follow the recommendations, and it is cheap. But if you cannot just do point queries for the data you need, it may not work as well, or it may be too convoluted to be a good option.

OpenTSDB is great if you want a time-series database, though it limits you to time-series-style queries. There are many time-series databases out there, and many applications built on top of them, such as Bosun and Grafana, to name two that I use.

With the latest version of HDI, I would store the data in Parquet format (or some other columnar format), create a table over the data, and query it with Spark SQL. You do not actually need Spark; you could also use Hive. What you should avoid, though, is traditional MapReduce: that paradigm is essentially dead by now, and you should not write new code in it. On top of that, it has a steep learning curve.

I have used all of these technologies, and we use them for different parts of our system; it really depends on the read and write requirements of the application. I would look into Spark and Parquet if I were you, but that may be a lot of new tooling you do not really need.
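
A rough sketch of what I mean by Parquet + Spark SQL (PySpark; the paths, storage scheme and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-readings").getOrCreate()

# One-off conversion: read the raw data (assumed here to be CSV on the cluster's
# blob storage) and write it out as Parquet, partitioned by sensor so that
# time-range scans for one sensor touch far less data.
raw = spark.read.csv("wasb:///data/sensor_raw.csv", header=True, inferSchema=True)
raw.write.partitionBy("sensor_id").parquet("wasb:///data/sensor_parquet")

# Query the Parquet data with Spark SQL.
readings = spark.read.parquet("wasb:///data/sensor_parquet")
readings.createOrReplaceTempView("readings")
week = spark.sql("""
    SELECT sensor_id, ts, value
    FROM readings
    WHERE sensor_id = 42
      AND ts BETWEEN '2016-01-04' AND '2016-01-11'
""")
week.show()
```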


3 billion data points per year is pretty small for modern time-series databases such as VictoriaMetrics. This amount of data can be ingested in under 3 minutes at an ingestion rate of 19 million samples per second on a machine with 64 vCPUs. See this article for details.

There are VictoriaMetrics production setups with 10 trillion data points per node, and it scales out to multiple nodes.
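
For illustration, a single-node VictoriaMetrics instance accepts writes in the InfluxDB line protocol and can be queried over its Prometheus-compatible HTTP API; a minimal sketch (the host, measurement and tag names are assumptions):

```python
import time
import requests

VM = "http://localhost:8428"  # default single-node VictoriaMetrics port

# Ingest one sample via the InfluxDB line protocol endpoint (timestamp in nanoseconds).
# By default VictoriaMetrics maps measurement "sensor" + field "value" to the
# metric name "sensor_value".
line = f"sensor,company=acme,id=sensor42 value=21.7 {time.time_ns()}"
requests.post(f"{VM}/write", data=line)

# Query the last hour of that series with PromQL.
resp = requests.get(f"{VM}/api/v1/query_range", params={
    "query": 'sensor_value{id="sensor42"}',
    "start": int(time.time()) - 3600,
    "end": int(time.time()),
    "step": "15s",
})
print(resp.json())
```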

