Let me first tell you a few facts about S3. You may already know this, but if not, it explains why your current code may exhibit some kind of "unexpected" behavior.
S3 and Eventual Consistency
S3 provides eventual consistency for overwritten objects. From the S3 FAQ:
Q: What data consistency model does Amazon S3 use?
Amazon S3 buckets in all Regions provide read-after-write consistency for PUTs of new objects and eventual consistency for overwrite PUTs and DELETEs.
Eventual consistency for overwrites means that, whenever an object is updated (i.e., whenever your small XML file is overwritten), clients retrieving the file MAY see the new version, or MAY see the old version. For how long? For an indeterminate amount of time. Consistency is usually achieved in well under 10 seconds, but you must assume that it can ultimately take longer than 10 seconds. More interestingly (unfortunately?), even after a client has successfully retrieved the new version, it MAY still receive the older version on a later request.
The only thing you can be sure of: if a client starts downloading a version of the file, it will download that version in its entirety (in other words, there is no chance of getting, say, the first half of the XML file from the old version and the second half from the new version).
Please note what this means for your script polling every 10 seconds: you may make several requests after the change before your script finally loads the changed version. And even then, after you have detected the change, it is (unfortunately) entirely possible that the next request loads the previous (!) version, triggering another "change" in your code, and the one after that returns the current version again, triggering yet another "change"!
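One way to suppress that "flapping" is to remember every version identifier (e.g., the object's ETag) you have already processed, and only treat a response as a change the first time its identifier appears. The sketch below is a minimal illustration of that idea in Python; the class and method names are mine, not from your script.

```python
# Sketch: suppress "flapping" caused by stale reads. A previously seen
# version that resurfaces due to eventual consistency is NOT reported
# as a new change, because its ETag is already in the seen set.

class ChangeDetector:
    def __init__(self):
        self.seen_etags = set()

    def is_new_version(self, etag):
        """Return True only the first time a given ETag is observed."""
        if etag in self.seen_etags:
            return False  # old version resurfacing, or no change at all
        self.seen_etags.add(etag)
        return True
```

With this, a stale read that returns the old ETag after you have already seen the new one is silently ignored instead of being counted as another change.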
If you are fine with the caveats of S3's eventual consistency, here are some ways you could improve your system.
Idea 1: S3 + SNS Event Notifications
You mentioned that you have thought about using SNS. This could definitely be an interesting approach: you can enable S3 Event Notifications and then receive a notification via SNS whenever the file is updated.
How do you receive that notification? You would need to create a subscription, and here you have several options.
Idea 1.1: S3 + SNS + web application event notifications
If you have a "web application", that is, anything running behind a publicly accessible HTTP endpoint, you can create an HTTP subscriber, and SNS will call your server with a notification whenever the file is updated. This may or may not be possible or desirable in your scenario.
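For reference, SNS delivers to an HTTP subscriber by POSTing a JSON document: a one-time "SubscriptionConfirmation" message (you must fetch its "SubscribeURL" once to activate the subscription), and then "Notification" messages whose "Message" field contains the S3 event as a JSON string. Below is a minimal sketch of the parsing logic such an endpoint would need; the function name is illustrative, and the actual HTTP server around it is omitted.

```python
import json

# Sketch of the handler logic for SNS deliveries to an HTTP endpoint.
# SNS POSTs a JSON document whose "Type" is either
# "SubscriptionConfirmation" or "Notification".

def handle_sns_post(body):
    doc = json.loads(body)
    if doc["Type"] == "SubscriptionConfirmation":
        # A real server would issue a GET to doc["SubscribeURL"] here
        # to confirm the subscription.
        return ("confirm", doc["SubscribeURL"])
    if doc["Type"] == "Notification":
        # The S3 event notification is a JSON string inside "Message".
        s3_event = json.loads(doc["Message"])
        keys = [r["s3"]["object"]["key"] for r in s3_event["Records"]]
        return ("changed", keys)
    return ("ignored", None)
```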
Idea 2: S3 + SQS event notifications
You can create a message queue in SQS and have S3 send notifications directly to the queue. This would also work as S3 + SNS + SQS event notifications, since you can add a queue as a subscriber to the SNS topic (the advantage being that, if you need to add functionality later, you can create more queues and subscribe them to the same topic, thereby receiving "multiple copies" of the notification).
You would call SQS to receive the notifications. You still have to poll, i.e., run a loop and call GET on SQS (the cost of which is roughly the same as an S3 GET, or maybe a little less, depending on the region). The difference is that you can reduce the number of requests: SQS supports long polling with waits of up to 20 seconds. You make a GET call to SQS and, if there are no messages, SQS holds the request open for up to 20 seconds, returning immediately if a message arrives, or returning an empty response if none arrives within those 20 seconds. Thus you send only 1 GET every 20 seconds and still receive notifications faster than you currently do. You potentially halve the number of GETs (one every 10 s against S3 versus one every 20 s against SQS).
In addition, you can choose either a single SQS queue to aggregate the changes to all your XML files, or multiple SQS queues, one per XML file. With a single queue, you greatly reduce the total number of GET requests. With one queue per XML file, you would "only" halve the number of GET requests compared to what you have now.
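A long-polling loop along those lines could look like the following sketch (assuming Python with boto3 and an existing queue URL; both are assumptions on my part, since your current script is a shell script). The key detail is WaitTimeSeconds=20, which turns the receive call into a long poll. The message-parsing helper is pure, so it can be tested without AWS:

```python
import json

def extract_changed_keys(body):
    """Pull the changed object keys out of an S3 event notification
    delivered to SQS (for direct S3 -> SQS delivery, the message body
    is the S3 event JSON itself)."""
    event = json.loads(body)
    return [r["s3"]["object"]["key"] for r in event.get("Records", [])]

def poll_forever(queue_url):
    """Long-poll the queue; one request every ~20 s when idle."""
    import boto3  # assumed available in your environment
    sqs = boto3.client("sqs")
    while True:
        # WaitTimeSeconds=20 enables long polling: the call blocks for
        # up to 20 s and returns as soon as a message arrives.
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   WaitTimeSeconds=20,
                                   MaxNumberOfMessages=10)
        for msg in resp.get("Messages", []):
            for key in extract_changed_keys(msg["Body"]):
                print("changed:", key)  # re-fetch the XML file here
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```

If you route the events through SNS first (S3 + SNS + SQS), the body is an extra SNS envelope and you would unwrap its "Message" field before parsing.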
Idea 3: S3 + AWS Lambda event notifications
You could also use a Lambda function for this. It would require some changes to your environment: you would no longer use a shell script for polling; instead, S3 can be configured to invoke a Lambda function for you in response to an event, such as an update to your XML file. You can write your code in Java, JavaScript, or Python (some people have developed "hacks" to use other languages, including Bash).
The beauty of this is that there is no more polling, and you don't need to maintain a web server (as in "Idea 1.1"). Your code "just runs" whenever there is a change.
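A minimal handler for this setup could look like the sketch below (Python runtime assumed; the actual processing of the file is left as a placeholder comment). S3 invokes it with the same event-notification structure used elsewhere in this answer:

```python
# Minimal sketch of a Lambda handler invoked by an S3 event
# notification. The processing step is a placeholder.

def lambda_handler(event, context):
    changed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        changed.append((bucket, key))
        # ...fetch and process the updated XML file here...
    return {"processed": changed}
```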
Note that no matter which of these ideas you use, you still have to deal with eventual consistency. In other words, you would know that a PUT/POST has happened, but when your code sends a GET, it could still receive the older version...
Idea 4: Use DynamoDB Instead
If you have the flexibility to make a more structural change to your system, you might consider using DynamoDB for this task.
I suggest this because DynamoDB supports strong consistency, even for updated items. Note that this is not the default: by default, DynamoDB operates in eventually consistent mode, but the "retrieve" operations (GetItem, for example) support fully consistent reads.
In addition, DynamoDB has a feature called DynamoDB Streams, a mechanism that lets you receive a stream of changes made to any (or all) items in your table. These notifications can be polled, or even used together with a Lambda function that is invoked automatically whenever a change happens! This, plus the fact that DynamoDB can be used with strong consistency, may help you solve your problem.
With DynamoDB it is usually good practice to keep records small. You mentioned in your comments that your XML files are about 2 kB; I would say that can be considered "small enough" for DynamoDB to be a good fit! (Reasoning: DynamoDB reads are typically billed in multiples of 4 kB; therefore, fully reading one of your XML files would consume just 1 read unit, and, depending on how you do it, e.g., using a Query operation instead of GetItem, you might be able to read two of your XML files from DynamoDB while consuming only 1 read operation.)
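The arithmetic behind that reasoning can be sketched as follows (a back-of-the-envelope helper, not an AWS API; the 4 kB unit and the half-price eventually consistent read are the standard DynamoDB billing rules):

```python
import math

# Back-of-the-envelope model of DynamoDB read cost: a strongly
# consistent read is billed in 4 kB units (item size rounded up to
# the next multiple of 4096 bytes); an eventually consistent read
# costs half as much.

def read_units(item_size_bytes, strongly_consistent=True):
    units = math.ceil(item_size_bytes / 4096)
    return units if strongly_consistent else units / 2

# One 2 kB XML file rounds up to one 4 kB unit -> 1 read unit.
# Two 2 kB files fetched by a single Query (4 kB combined, since a
# Query sums item sizes before rounding) -> still 1 read unit.
```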
Some links: