How to query DynamoDB by date (range key) without an obvious hash key?

I need to keep local data in an iOS application in sync with the data in a DynamoDB table. The table holds ~2K rows with only a hash key (id) and the following attributes:

  • id (uuid)
  • lastModifiedAt (timestamp)
  • name
  • latitude
  • longitude

I am currently scanning and filtering on lastModifiedAt, keeping items where lastModifiedAt is greater than the last update date of the application, but I expect this to become expensive.
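Roughly, what I do today looks like this (a boto3 sketch purely for illustration — the app itself uses the AWS iOS SDK, and the table name is a placeholder):

```python
import boto3
from boto3.dynamodb.conditions import Attr

# Current approach: scan the whole table and filter on lastModifiedAt.
# "places" is a placeholder table name; lastModifiedAt is an epoch timestamp.
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('places')

last_sync = 1457812738  # last time the app synced (example epoch value)

# The filter is applied *after* the read, so the scan still consumes read
# capacity for all ~2K items even when only a handful have changed.
response = table.scan(FilterExpression=Attr('lastModifiedAt').gt(last_sync))
changed = response['Items']
# (Real code would also follow LastEvaluatedKey to page through results.)
```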

The best answer I can find is to add a global secondary index with lastModifiedAt as the range key, but there is no obvious hash key for the GSI.

What is the best practice when you want to query a range with a GSI but have no obvious hash key? Alternatively, if a full scan is the only option, are there any recommendations for keeping the cost down?

+13
amazon-web-services amazon-dynamodb aws-sdk
Mar 12 '16 at 20:58
3 answers

While D. Shauli's answer pointed me in the right direction, it missed two considerations for the GSI:

  • The hash + range combination should be unique, but day + timestamp (the approach it recommends) will not necessarily be unique.
  • With only a day as the hash key, I would need a large number of queries to cover every day since the last update (which may be months or years ago).

So here is the approach I took:

  • Created a global secondary index (GSI) with YearMonth (e.g. 201508) as the hash key and id as the range key.
  • Query the GSI several times, one query for each month since the last update, with each query filtered on lastModifiedAt > [last update timestamp] (a rough sketch follows below).
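
A rough boto3 sketch of those queries (the table, index, and attribute names are placeholders, not the exact ones I used):

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('places')

last_sync = 1457812738            # epoch timestamp of the last sync
months = ['201602', '201603']     # every YearMonth value since the last sync

updated = []
for year_month in months:
    # The key condition restricts the read to one month's partition of the GSI;
    # the filter then drops items older than the last sync.
    response = table.query(
        IndexName='YearMonth-id-index',
        KeyConditionExpression=Key('YearMonth').eq(year_month),
        FilterExpression=Attr('lastModifiedAt').gt(last_sync),
    )
    updated.extend(response['Items'])
    # (Real code would also follow LastEvaluatedKey for large months.)
```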
+6
Mar 22 '16 at 3:09

Even though a Global Secondary Index can meet your requirements, any attempt to include timestamp-related information as part of your Hash Key will most likely create a so-called "hot partition", which is highly undesirable.

Uneven access will occur because the most recent items are retrieved far more frequently than the old ones. This will not only hurt your performance but also make your solution less cost-effective.

See the documentation for more details:

For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions - potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.

That said, id seems to be a good choice for your Hash Key (aka Partition Key), and I would not change it, since GSI keys are partitioned the same way. As a separate note, retrieval performance is best when you provide the entire Primary Key, so we should try to find a solution that allows that whenever possible.

I would suggest creating separate tables to store the primary keys of recently updated items, segmented by when they were updated. You can segment the data into tables at whatever granularity best suits your use case. For example, say you want to segment updates by day:

a. Your daily updates can be stored in tables with the following naming convention: updates_DDMM

b. The updates_DDMM tables would only contain id attributes (the hash keys of the main table).

Now say today is 04/07/16 and the application's last update was 2 days ago; to get the latest entries you would:

i. Scan updates_0504 and updates_0604 to get all the hash keys.

ii. Finally, get the entries from the main table (containing lat/lng, name, etc.) by sending a BatchGetItem with all the hash keys obtained.

BatchGetItem is super fast and does the job like no other operation.
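
A hedged boto3 sketch of steps i and ii (the table names and the one-attribute layout of the daily tables are assumptions taken from the description above):

```python
import boto3

dynamodb = boto3.resource('dynamodb')

# i. Scan each daily updates table and collect the ids it contains.
ids = set()
for day_table_name in ['updates_0504', 'updates_0604']:
    response = dynamodb.Table(day_table_name).scan(
        ProjectionExpression='#i',
        ExpressionAttributeNames={'#i': 'id'},  # alias avoids reserved-word clashes
    )
    ids.update(item['id'] for item in response['Items'])

# ii. Fetch the full records from the main table with BatchGetItem,
# at most 100 keys per request (the per-call limit).
ids = list(ids)
records = []
for start in range(0, len(ids), 100):
    batch = ids[start:start + 100]
    response = dynamodb.batch_get_item(
        RequestItems={'places': {'Keys': [{'id': i} for i in batch]}}
    )
    records.extend(response['Responses']['places'])
    # (Real code would retry anything returned in UnprocessedKeys.)
```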

One could argue that creating additional tables will increase the cost of your overall solution... well, with a GSI you are essentially duplicating your table (if you project all the fields) and paying that extra cost for all ~2K records, whether recently updated or not...

Creating tables like this may seem counter-intuitive, but it is actually a best practice when dealing with time-series data (from the AWS DynamoDB documentation):

[...] applications might show an uneven access pattern across all the items in the table, where the latest customer data is more relevant: your application might access the latest items more frequently, and as time passes these items are accessed less and less, with the oldest items rarely accessed at all. If this is a known access pattern, you can take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where the data access rate is high, request higher throughput; for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally back up these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput, as you do as many delete operations as put operations.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Hope this helps. Regards.

+16
Apr 08 '16 at 4:19

You can use the date part of the timestamp as the hash key and the entire timestamp as the range key.
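
A minimal sketch of what querying such an index could look like (the index and attribute names are made up, and it assumes each item also stores a day attribute, e.g. 2016-03-12, derived from the timestamp):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('places')
# One query per day: hash key is the date portion, range key the full timestamp.
response = table.query(
    IndexName='day-lastModifiedAt-index',
    KeyConditionExpression=(
        Key('day').eq('2016-03-12') & Key('lastModifiedAt').gt(1457812738)
    ),
)
items = response['Items']
```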

+2
Mar 12 '16 at 21:03
