Amazon SimpleDB Suitability for Large Time-Series Datasets Coming Out of Thousands of Individual Devices

Question

I am trying to determine if Amazon SimpleDB is suitable for a subset of the data that I have.

I have thousands of deployed autonomous sensors that record data.

Each sensor device reports several values four times per hour, every day, over months and years. I need to keep all of this data for historical statistical analysis. It is generally write once, read many times. Server applications run regularly to query the data and derive other information from it.

Right now, the data rows in SQL look something like this:

(id, device_id, utc_timestamp, value1, value2)

Our existing MySQL solution, with tens of millions of rows, will not scale much further. We ask for things like "tell me the sum of all value1 yesterday" or "show me the average of value2 over the past 8 hours." We do this in SQL, but we can happily move it into code. SimpleDB's "eventual consistency" is fine for our purposes.

I am reading everything I can and am going to start experimenting with our AWS account, but it is not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) map to our problem domain.

Is SimpleDB the right tool for this, and what would be a generic approach?

PS: We mainly use Python, but that should not matter when considering this at a high level. So far, I am aware of the boto library.

Edit:

Continuing my search for solutions, I came across the Stack Overflow question "What is the best open source solution for storing time series data?", which was helpful.

+4

amazon-simpledb time-series

Aitch Jun 04 '11 at 11:01

4 answers

In my opinion, Amazon SimpleDB and Microsoft Azure Tables are a great solution if your queries are fairly simple. As soon as you try to do things that are trivial in relational databases, such as aggregates, you begin to run into problems. So if you are going to do some heavy reporting, it can become messy.
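To illustrate what doing the aggregates in application code might look like, here is a minimal Python sketch. It assumes the store only returns raw rows for a time range and that the summing and averaging happen client-side. The row shape follows the question's (device_id, utc_timestamp, value1, value2) layout; the function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rows shaped like (device_id, utc_timestamp, value1, value2),
# e.g. as returned by a plain range query with no server-side aggregation.

def sum_value1(rows, start, end):
    """Sum value1 over rows with start <= utc_timestamp < end."""
    return sum(r[2] for r in rows if start <= r[1] < end)

def avg_value2(rows, start, end):
    """Average value2 over rows in [start, end); None if there are no rows."""
    vals = [r[3] for r in rows if start <= r[1] < end]
    return sum(vals) / len(vals) if vals else None

# Made-up sample data, relative to a fixed "now" so the example is repeatable.
now = datetime(2011, 6, 4, 12, 0, tzinfo=timezone.utc)
rows = [
    ("dev-1", now - timedelta(hours=30), 5.0, 1.0),  # yesterday
    ("dev-2", now - timedelta(hours=20), 7.0, 2.0),  # yesterday
    ("dev-1", now - timedelta(hours=6), 1.0, 3.0),   # within the past 8 hours
    ("dev-2", now - timedelta(hours=2), 2.0, 5.0),   # within the past 8 hours
]

midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
total_yesterday = sum_value1(rows, midnight - timedelta(days=1), midnight)
avg_last_8h = avg_value2(rows, now - timedelta(hours=8), now)
```

The cost of this approach is that every matching row has to be transferred to the client first, which is where the messiness for heavy reporting comes from.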

+1

Oliver Weichhold Jun 04 '11 at 20:03

It looks like your problem may be best handled with a round-robin database (RRD). An RRD stores time-series data in such a way that the file never grows beyond its initial size. This is extremely cool and very useful for generating graphs and time-series information.
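The fixed-size idea can be sketched in a few lines of Python. This is only an illustration of the round-robin principle, not the format or API of any particular RRD tool; the class name and capacity are hypothetical. Note the trade-off: once the archive is full, the oldest samples are overwritten.

```python
from collections import deque

class RingArchive:
    """Fixed-capacity store: once full, each new sample evicts the oldest."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops items from the left when it overflows.
        self.samples = deque(maxlen=capacity)

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))

    def average(self):
        """Mean of the retained values, or None if nothing is stored."""
        if not self.samples:
            return None
        return sum(v for _, v in self.samples) / len(self.samples)

archive = RingArchive(capacity=4)
for t in range(6):
    archive.add(t, float(t))  # samples 0 and 1 are evicted along the way
```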

0

Richard Hurt Jun 28 '11 at 14:15

I agree with Oliver Weichhold that a cloud database solution will handle the use case you described. You can distribute your data across several SimpleDB domains (i.e. partitions) and store it so that most of your queries can be run against a single domain without having to go through the entire database. Defining a partitioning strategy will be key to a successful move to a cloud-based database. Partitioning a dataset is discussed here.
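As a rough sketch of one possible partitioning scheme, the Python below maps each reading to a per-month domain and to an item with string attributes. The month-based split, the domain naming, the item-name format, and the function names are all assumptions made for illustration; they are not taken from the question or from any AWS recommendation. SimpleDB stores attribute values as strings and compares them lexicographically, so the timestamp is written in a fixed-width ISO 8601 form that sorts correctly.

```python
from datetime import datetime, timezone

def domain_for(ts):
    """Hypothetical scheme: one SimpleDB domain per calendar month."""
    return "readings_%04d_%02d" % (ts.year, ts.month)

def to_item(device_id, ts, value1, value2):
    """Map one SQL-style row to an (item_name, attributes) pair.

    All attribute values are strings. Numeric values would need
    zero-padding if they were ever used in range comparisons.
    """
    stamp = ts.strftime("%Y-%m-%dT%H:%M:%SZ")
    item_name = "%s|%s" % (device_id, stamp)  # unique per device and time
    attributes = {
        "device_id": str(device_id),
        "utc_timestamp": stamp,
        "value1": str(value1),
        "value2": str(value2),
    }
    return item_name, attributes

ts = datetime(2011, 6, 4, 11, 1, tzinfo=timezone.utc)
domain = domain_for(ts)
item_name, attributes = to_item(42, ts, 1.5, 2.0)
```

With a split like this, a query such as "yesterday's total" touches one domain (or two at a month boundary) rather than the whole dataset.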

0

Zaffiro Jul 21 '11 at 18:00

Aitch · Accepted Answer · 2012-03-16T23:13:28+0000

Just following up on this, many months later ...

I actually had the opportunity to talk with Amazon directly about this last summer and eventually got access to the beta program for what later became DynamoDB, but I could not talk about it at the time.

I would recommend it for this kind of scenario, where you need a primary key plus what can be described as a secondary index / range, for example timestamps. This gives you much finer control over your queries, e.g. "show me all the data for device X between Monday and Friday."

We have not yet moved to it for various reasons, but we are still planning to.
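The primary key plus range idea can be modelled in plain Python. This is an in-memory sketch of the access pattern only, not the DynamoDB API or a boto call; the class, method names, and sample data are hypothetical. Items are grouped by a hash key (the device) and kept sorted by a range key (an ISO 8601 timestamp string), so a between-query becomes two binary searches.

```python
import bisect

class HashRangeTable:
    """Toy model: items grouped by hash key, sorted by range key."""

    def __init__(self):
        # hash key -> (sorted list of range keys, items in the same order)
        self._partitions = {}

    def put(self, hash_key, range_key, item):
        keys, items = self._partitions.setdefault(hash_key, ([], []))
        i = bisect.bisect_right(keys, range_key)
        keys.insert(i, range_key)
        items.insert(i, item)

    def query_between(self, hash_key, low, high):
        """Return items for hash_key with low <= range key <= high."""
        if hash_key not in self._partitions:
            return []
        keys, items = self._partitions[hash_key]
        lo = bisect.bisect_left(keys, low)
        hi = bisect.bisect_right(keys, high)
        return items[lo:hi]

table = HashRangeTable()
table.put("device-X", "2011-06-05T23:45:00Z", {"value1": 1})
table.put("device-X", "2011-06-06T00:00:00Z", {"value1": 2})
table.put("device-X", "2011-06-08T12:00:00Z", {"value1": 3})
table.put("device-X", "2011-06-11T00:15:00Z", {"value1": 4})
table.put("device-Y", "2011-06-07T00:00:00Z", {"value1": 99})

# All data for device X from Monday 2011-06-06 through Friday 2011-06-10.
result = table.query_between(
    "device-X", "2011-06-06T00:00:00Z", "2011-06-10T23:59:59Z"
)
```

Only the one device's partition is searched, which is why this layout suits per-device time-range lookups.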

http://aws.amazon.com/dynamodb/
