Situation: I started a new job and was asked to figure out what to do with our sensor data table. It contains 1.3 billion rows of sensor data. The data is quite simple: basically just a sensor identifier, a timestamp, and the sensor's value at that point in time (a double).
Currently, the data is stored in a table in an MS SQL Server database.
By the end of this year, I expect the number of rows to increase to 2-3 billion.
I'm looking for the best way to store and retrieve this data (queried by date), and since this qualifies as "big data" and I have no real experience managing such large data sets, I'm asking here for pointers.
This is not a big company, and our resources aren't unlimited ;)
Additional information about our use case:
- Data is displayed in graphs that show sensor values over time.
- We plan to build an API so that our customers can fetch sensor data for any time period they are interested in (... data from 2 years ago is as relevant as data from the last month).
My research so far has led me to consider the following solutions:
Store data in SQL Server
but partition the table (it is not partitioned right now). This requires the Enterprise edition of SQL Server, which is expensive.
Move the data to Azure SQL Database
There we get partitioning for much less money, but once our database grows past 250 GB it will cost much more (and even more past 500 GB).
Use multiple databases
We could use one database per client. Several small databases would be cheaper than one huge database, but we have many clients and plan for more, so I don't like the idea of managing all those databases.
Azure Table Storage
This is the option I like best. We can partition the data by company/sensor/year/month, use the timestamp as the row key, and store the sensor value.
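To make the idea concrete, here is a minimal sketch of that key scheme using the azure-data-tables Python SDK. The connection string, table name, company, and sensor id are all placeholders I made up, not anything from our actual system:

```python
# Sketch only: one way to model the company/sensor/year/month scheme
# with the azure-data-tables SDK. CONN_STR, "SensorData", "acme" and
# "sensor-0042" are placeholders.
from datetime import datetime, timezone
from azure.data.tables import TableClient

CONN_STR = "<storage-account-connection-string>"  # placeholder

def make_keys(company: str, sensor_id: str, ts: datetime):
    # PartitionKey groups a sensor's data by month; RowKey is the
    # timestamp, so date-range queries become RowKey range scans
    # within a partition (row keys sort lexicographically).
    partition_key = f"{company}_{sensor_id}_{ts:%Y%m}"
    row_key = ts.strftime("%Y%m%d%H%M%S")
    return partition_key, row_key

with TableClient.from_connection_string(CONN_STR, table_name="SensorData") as table:
    ts = datetime(2015, 6, 1, 12, 0, 15, tzinfo=timezone.utc)
    pk, rk = make_keys("acme", "sensor-0042", ts)
    table.create_entity({
        "PartitionKey": pk,
        "RowKey": rk,
        "Value": 23.5,  # the sensor reading (double)
    })
```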
I haven't had time to test query performance yet, but from what I've read it should be fine. There is one significant drawback, though: each HTTP request returns at most 1,000 entities. If we need to get all of a sensor's data for a week, we have to make a lot of HTTP requests. I'm not sure yet how big a problem this is for our use case.
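Worth noting: the 1,000-entity limit is per HTTP response, not per query; the service hands back continuation tokens, and the SDK follows them for you as you iterate. A sketch, reusing the placeholder key layout from above:

```python
# Sketch: querying a week of data. Each underlying HTTP request
# returns <= 1,000 entities; iterating the pager transparently
# issues the follow-up requests via continuation tokens.
from azure.data.tables import TableClient

CONN_STR = "<storage-account-connection-string>"  # placeholder

def week_of_readings(table: TableClient, pk: str, start_rk: str, end_rk: str):
    flt = (f"PartitionKey eq '{pk}' "
           f"and RowKey ge '{start_rk}' and RowKey lt '{end_rk}'")
    return list(table.query_entities(query_filter=flt))

with TableClient.from_connection_string(CONN_STR, table_name="SensorData") as table:
    readings = week_of_readings(table, "acme_sensor-0042_201506",
                                "20150601000000", "20150608000000")
    print(len(readings), "readings")
```

The many HTTP round-trips still happen; the SDK just hides them, so latency for a big range is still something to measure.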
Azure HDInsight (Hadoop in Azure)
As mentioned, I have no big-data experience, and I currently don't understand Hadoop well enough to tell whether it fits our case (exposing sensor data for a given time period through an API). Should I dig in and learn it, or is my time better spent pursuing another alternative?
Does anyone have experience with a similar case? What works for you? Keep in mind that price matters, and a "simple" solution may be preferable to a very complex one, even if the complex one is a few seconds faster.
UPDATE 1: To answer some of the questions in the comments below.
- There are about 12,000 sensors, each of which can report a value every 15 seconds. That works out to 12,000 × (86,400 / 15) ≈ 69 million values per day. In practice not all of these sensors actually report, so we don't get that much data every day, but since we naturally want to scale to more clients and sensors, I really need a solution that can handle many millions of sensor values per day.
- Partitioning is one solution, and using multiple databases and/or multiple tables is something I have considered, yes, but I treat those as fallbacks if/when I have exhausted the other options.
- I have read a bit more about HBase, http://opentsdb.net/ and Google's https://cloud.google.com/bigtable/ , and it seems Hadoop could be a real alternative after all.
UPDATE 2: Today I experimented with both Azure Table Storage and HDInsight (HDI). We don't need much "query flexibility", so I think Azure Table Storage looks promising. It is a little slow to pull data out because of the 1,000-entity limit per query, as I mentioned, but in my tests I found it fast enough for our use cases.
I also stumbled upon OpenTSDB, which is what made me try HDI in the first place. Following the Azure tutorial (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-tutorial-get-started/) I was quickly able to store a million records and test some queries. It was much faster to query than Azure Table Storage; I could even pull 300,000 records in a single HTTP request (which took 30 seconds).
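For anyone curious what the HBase access pattern looks like, here is a rough sketch using the happybase Thrift client. The row-key layout (sensor id + timestamp) mimics the OpenTSDB idea; the host, table name, and column family are placeholders, not what the Azure tutorial uses:

```python
# Sketch: time-range reads in HBase via the Thrift API (happybase).
# Host, table name, and the "d" column family are assumptions.
import happybase

connection = happybase.Connection("my-hbase-thrift-host")  # placeholder
table = connection.table("sensordata")  # placeholder table name

# Rows are stored sorted by key, so one sensor's readings for a week
# form a single contiguous scan instead of many point lookups.
start = b"sensor-0042_20150601000000"
stop = b"sensor-0042_20150608000000"
for row_key, columns in table.scan(row_start=start, row_stop=stop):
    value = float(columns[b"d:value"].decode())
    print(row_key, value)
```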
But it costs quite a bit more than Azure Table Storage, and I think I can optimize my code to improve query performance with Azure Table Storage (a finer-grained partition key plus running queries in parallel). So right now I'm leaning towards Azure Table Storage because of its simplicity, price, and "good enough" performance.
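A sketch of what I mean by "finer-grained partition key plus parallel queries", again with the placeholder names from the earlier snippets: if each day gets its own partition, a week-long query fans out into independent per-partition queries that can run concurrently:

```python
# Sketch: per-day partitions let a week-long query run as several
# independent partition queries in parallel. Placeholder names reused
# from the earlier sketches.
from concurrent.futures import ThreadPoolExecutor
from azure.data.tables import TableClient

CONN_STR = "<storage-account-connection-string>"  # placeholder

def fetch_partition(pk: str):
    # One TableClient per call keeps the worker threads independent.
    with TableClient.from_connection_string(CONN_STR, table_name="SensorData") as t:
        return list(t.query_entities(query_filter=f"PartitionKey eq '{pk}'"))

# e.g. day partitions acme_sensor-0042_20150601 .. 20150607
partitions = [f"acme_sensor-0042_201506{d:02d}" for d in range(1, 8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    readings = [e for batch in pool.map(fetch_partition, partitions) for e in batch]
print(len(readings), "readings across", len(partitions), "partitions")
```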
I will soon be presenting my findings to an external consultant, so I'm very curious to hear his opinion on all of this.