Design question for storing time-series data in Azure Tables

I have software that collects data over a long period of time, at approximately 200 samples per second. Currently an SQL database is used for this. I am looking at using Azure to migrate a lot of my old archived data.

The software uses a multi-tenant architecture, so I plan to use one Azure table per tenant. Each tenant tracks perhaps 10-20 different metrics, so I plan to use the metric identifier (int) as the PartitionKey.

Since each metric will have at most one reading per minute, I plan to use DateTime.Ticks.ToString("d19") as my RowKey.
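
For illustration, here is a minimal sketch of the key scheme described above (the class and method names are hypothetical, not part of the actual software):

    using System;

    public static class SampleKeys
    {
        // PartitionKey = metric identifier, giving one partition per metric.
        public static string BuildPartitionKey(int metricId)
        {
            return metricId.ToString();
        }

        // RowKey = sample timestamp as a 19-digit, zero-padded tick count,
        // so rows within a partition sort chronologically as strings.
        public static string BuildRowKey(DateTime sampleTimeUtc)
        {
            return sampleTimeUtc.Ticks.ToString("d19");
        }
    }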

I lack understanding of how this will scale, so I'm hoping someone can clarify the following:

For performance, Azure will/can split my table into partitions so that everything stays fast. In this case there will be one partition per metric.

However, my rows could potentially span about 5 years of data, so I estimate about 2.5 million rows per partition.

Is Azure smart enough to then split based on RowKey, or am I heading for a bottleneck in the future? I know the rule about not optimizing prematurely, but with something like Azure that doesn't seem as sensible as usual!

Looking for an Azure expert to tell me if I'm on the right lines or whether I should split my data across more tables as well.

1 answer

A few comments:

In addition to how you store the data, you should also think about how you want to retrieve it, as that can significantly change the design. Some of the questions to ask yourself are:

  • When I retrieve data, will I always retrieve data for a particular metric and for a date/time range?
  • Or do I need to retrieve data for all metrics for a particular date/time range? If so, you're looking at a full table scan. Obviously you could avoid this by issuing multiple queries (one query per PartitionKey).
  • Do I need to see the latest results first, or do I not care? If the former, your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19") (see the sketch after this list).
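
For the "latest first" case, here is a minimal sketch of the reversed-tick RowKey mentioned in the last bullet (the helper name is illustrative):

    using System;

    public static class ReverseTickKeys
    {
        // Subtracting from DateTime.MaxValue.Ticks inverts the ordering, so the
        // most recent sample gets the lexicographically smallest RowKey and is
        // returned first when the partition is scanned.
        public static string BuildDescendingRowKey(DateTime sampleTimeUtc)
        {
            return (DateTime.MaxValue.Ticks - sampleTimeUtc.Ticks).ToString("d19");
        }
    }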

Also, since PartitionKey is a string value, you may want to convert the int value to a string padded with leading "0"s so that all your ids appear in order; otherwise you'll get 1, 10, 11, ..., 19, 2, ... etc.
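
For example (the 3-digit width here is an assumption; pick whatever covers your largest metric id):

    // Zero-pad the metric id so partitions sort numerically rather than lexically:
    // 1 -> "001", 10 -> "010", 100 -> "100".
    string partitionKey = metricId.ToString("d3");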

As far as I know, Windows Azure partitions data based on PartitionKey only, not on RowKey. Within a partition, RowKey serves as the unique key. Windows Azure will try to keep data with the same PartitionKey on the same node, but since each node is a physical device (and thus has a size limit), data may spill over to another node as well.

You might want to read this blog post from the Windows Azure Storage team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.

UPDATE: Based on your comments below and some of the information above, let's try to do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states:

Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:

  • Up to 2,000 entities per second
  • Note that this is for a single partition, not a single table. Therefore, a well-partitioned table could process up to 20,000 entities per second, which is the overall account target mentioned above.

Now, you mentioned that you have 10-20 different metric points, and for each metric point you will write at most 1 record per minute, which means you would be writing at most 20 entities per minute per table. That is well within the scalability target of 2,000 entities per second.

Now the question remains about reading. Assuming a user will read one day's worth of data (i.e. 24 × 60 = 1,440 data points) per partition, and assuming the user fetches data for all 20 metrics for that day, each user (and thus each table) will retrieve a maximum of 28,800 data points. The question left for you, I think, is how many such requests per second you can expect and whether that stays within the threshold. If you can somehow extrapolate that information, I think you can draw conclusions about the scalability of your architecture.
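
To make the read scenario concrete, here is a hedged sketch of a single-partition date/time range query using the classic Microsoft.WindowsAzure.Storage.Table SDK; the MetricSample entity and the forward-tick RowKey scheme from your question are assumptions, not something prescribed by the library:

    using System;
    using System.Collections.Generic;
    using Microsoft.WindowsAzure.Storage.Table;

    // Hypothetical entity; only PartitionKey/RowKey matter for the query itself.
    public class MetricSample : TableEntity
    {
        public double Value { get; set; }
    }

    public static class MetricReader
    {
        // Reads one metric (one partition) for a UTC time window.
        public static IEnumerable<MetricSample> ReadRange(
            CloudTable table, string metricPartitionKey, DateTime fromUtc, DateTime toUtc)
        {
            string pkFilter = TableQuery.GenerateFilterCondition(
                "PartitionKey", QueryComparisons.Equal, metricPartitionKey);
            string fromFilter = TableQuery.GenerateFilterCondition(
                "RowKey", QueryComparisons.GreaterThanOrEqual, fromUtc.Ticks.ToString("d19"));
            string toFilter = TableQuery.GenerateFilterCondition(
                "RowKey", QueryComparisons.LessThanOrEqual, toUtc.Ticks.ToString("d19"));

            string filter = TableQuery.CombineFilters(
                pkFilter, TableOperators.And,
                TableQuery.CombineFilters(fromFilter, TableOperators.And, toFilter));

            var query = new TableQuery<MetricSample>().Where(filter);
            return table.ExecuteQuery(query);
        }
    }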

I would also recommend watching this video: http://channel9.msdn.com/Events/Build/2012/4-004 .

Hope this helps.

