Use Amazon S3 and CloudFront to intelligently cache web pages

I have a website (powered by Tomcat on Elastic Beanstalk) that generates an artist discography (one page per artist). This can be resource-intensive, and since an artist's pages do not change for a month at a time, I put a CloudFront distribution in front of the site.

I thought this meant that no artist request should ever be served by my server more than once, but it is not that simple. This post explains that each edge location (Europe, US, etc.) gets a cache miss the first time it requests a resource, and that there is a limit on the number of resources kept in the CloudFront cache, so objects can be evicted.

So, to work around this, I changed the server code to save a copy of the web page in an S3 bucket and to check S3 first when a request arrives: if the artist's page already exists on S3, the server fetches it and returns its contents as the web page. This greatly reduces processing, since the web page for a particular artist is created only once.
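For illustration, the flow I ended up with looks roughly like the sketch below (AWS SDK for Java v1; the bucket name, key scheme, and generatePage helper are placeholders, not my real code):

    import java.io.IOException;
    import java.io.InputStream;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3Object;
    import com.amazonaws.util.IOUtils;

    public class ArtistPageServlet extends HttpServlet {
        private static final String BUCKET = "artist-pages"; // placeholder bucket name
        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            String artistId = req.getParameter("artist");
            String key = "artists/" + artistId + ".html"; // placeholder key scheme
            resp.setContentType("text/html");

            if (s3.doesObjectExist(BUCKET, key)) {
                // Cached copy exists: pull it from S3 through the server and relay it.
                try (S3Object obj = s3.getObject(BUCKET, key);
                     InputStream in = obj.getObjectContent()) {
                    IOUtils.copy(in, resp.getOutputStream());
                }
            } else {
                // No cached copy: do the expensive generation once, store it, serve it.
                String html = generatePage(artistId);
                s3.putObject(BUCKET, key, html);
                resp.getWriter().write(html);
            }
        }

        private String generatePage(String artistId) {
            return "<html>...</html>"; // stands in for the expensive discography build
        }
    }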

But:

  • The request still needs to go to my server just to check whether the artist's page exists.
  • If the artist's page exists, the web page (which can sometimes be large, up to 20 MB) is first downloaded to the server, and only then does the server return it to the client.

So I want to know whether I can improve this. I know that an S3 bucket can be configured to redirect to another site. Is there a way to have each artist-page request go to the S3 bucket first and be returned from there if the page exists, and call the server only if it does not?

Alternatively, can I have the server check whether the page exists and then redirect the client to the S3 page, rather than downloading the page to the server first?
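If it helps, the redirect variant I have in mind is something like this fragment inside the handler (just a sketch; the bucket and URL are placeholders, and the object would have to be publicly readable or pre-signed):

    // Instead of streaming the 20 MB page through Tomcat, send the client to S3.
    String key = "artists/" + artistId + ".html";        // placeholder key scheme
    if (s3.doesObjectExist("artist-pages", key)) {       // placeholder bucket
        resp.sendRedirect("https://artist-pages.s3.amazonaws.com/" + key);
        return;
    }
    // ...otherwise fall through and generate the page as before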

+7
java amazon-s3 amazon-web-services amazon-cloudfront
4 answers

The OP says:

sometimes they can be large, up to 20 MB

Since the amount of data you serve can be quite large, I think it is worth doing this in 2 requests instead of one, separating the part that creates the content from the part that serves it. The reason is to minimize the amount of time and resources your server spends retrieving data from S3 and serving it.

AWS supports pre-signed URLs that are valid only for a short period of time; we can use them here to avoid security issues, etc.
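With the AWS SDK for Java v1, for example, producing a short-lived pre-signed GET URL looks roughly like this (bucket and key are placeholders):

    import java.net.URL;
    import java.util.Date;
    import com.amazonaws.HttpMethod;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

    public class PresignedUrls {
        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        /** Return a GET URL for the object that expires after the given delay. */
        public URL presign(String bucket, String key, long validForMillis) {
            Date expiration = new Date(System.currentTimeMillis() + validForMillis);
            return s3.generatePresignedUrl(new GeneratePresignedUrlRequest(bucket, key)
                    .withMethod(HttpMethod.GET)
                    .withExpiration(expiration));
        }
    }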

Your architecture currently looks something like the diagram below: the client initiates a request, you check whether the requested data exists on S3, then fetch and serve it if it is there; otherwise you generate the content and save it to S3:

                   if exists on S3
    client --------> server --------------------> fetch from S3 and serve
                       |
                       | else
                       |------> generate content -------> save to S3 and serve

In terms of network resources, you always consume 2X the bandwidth and time here. If the data exists, you first have to pull it from S3 to your server and then serve it to the client (so that is 2X). If the data does not exist, you send it both to the client and to S3 (again 2X).


Instead, you can try the two approaches below. Both assume that you have a basic page template, with the remaining data obtained via AJAX calls, and both reduce this 2X factor in the overall architecture.

  1. Serve only from S3. This requires changes in the way you build your product, and therefore may not be as easy to integrate.

    In principle, for each incoming request, return an S3 URL if the data already exists; otherwise, create a task for it in SQS (see the sketch after the diagram below), generate the data, and push it to S3. Based on your usage patterns for different artists, you can estimate how long assembling the data takes on average, and thus return a URL that becomes valid after the estimated time-to-completion (T) of the task.

    The client waits for time T and then sends a request to the URL returned earlier, making up to 3 attempts to fetch the data in case of failure. In fact, data that already exists on S3 can be treated as the base case where T = 0.

    In this case you make 2-4 network requests from the client, but only the first of those requests hits your server. You transfer data to S3 only once, when it does not yet exist, and the client always retrieves the data from S3.

                     if exists on S3, return URL
      client --------> server --------------------------------> S3
                         |
                         | else: SQS task
                         |---------------> generate content -------> save to S3
                         return pre-computed URL

                     wait for time `T`
      client -------------------------> S3
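    A minimal sketch of the "else" branch, assuming a separate worker consumes the queue, generates the page, and uploads it to S3 (the queue name and message format are invented here):

        import com.amazonaws.services.sqs.AmazonSQS;
        import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
        import com.amazonaws.services.sqs.model.SendMessageRequest;

        public class GenerationQueue {
            private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            // Hypothetical queue; a worker elsewhere consumes it and pushes pages to S3.
            private final String queueUrl = sqs.getQueueUrl("artist-page-generation").getQueueUrl();

            /** Ask a worker to generate this artist's page and store it on S3. */
            public void enqueue(String artistId) {
                sqs.sendMessage(new SendMessageRequest()
                        .withQueueUrl(queueUrl)
                        .withMessageBody(artistId));
            }
        }

    The server then returns the pre-signed URL (computed as above, with the expiration offset by T) without waiting for the worker.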


  2. Check whether the data exists and make the second network call accordingly.

    This is similar to what you are currently doing when serving data from the server if it does not already exist. Again we make 2 requests here, but this time we serve the data synchronously from the server when it does not exist.

    So, on the first hit we check whether the content was previously generated; we get back either its URL on success or an error message. On success, the next hit goes to S3.

    If the data does not exist on S3, we make a new request (to a different POST URL); on receiving it, the server computes the data and serves it, adding an asynchronous task to push it to S3 (a sketch follows after the diagram below).

                     if exists on S3, return URL
      client --------> server --------------------------------> S3

      client --------> server ---------> generate content -------> serve it
                                            |
                                            |---> add SQS task to push to S3
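    The synchronous branch could look like the fragment below (same placeholder helpers as before; a shared ExecutorService stands in for the SQS-driven upload):

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import com.amazonaws.services.s3.AmazonS3;
        import com.amazonaws.services.s3.AmazonS3ClientBuilder;

        public class SyncServeAsyncStore {
            private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            private final ExecutorService uploader = Executors.newSingleThreadExecutor();

            /** Generate the page for immediate serving; upload to S3 in the background. */
            public String serveAndStore(String bucket, String key, String artistId) {
                String html = generatePage(artistId);                   // compute synchronously
                uploader.submit(() -> s3.putObject(bucket, key, html)); // async push to S3
                return html;                                            // caller writes this to the response
            }

            private String generatePage(String artistId) {
                return "<html>...</html>"; // placeholder for the real generator
            }
        }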
+2

CloudFront caches redirects but does not follow them for you: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#ResponseCustomRedirects .

You have not given specific numbers, but would it work for you to pre-generate all of these pages, place them on S3, and point CloudFront straight at S3?

If feasible, there are several advantages:

  • Separating content generation from content serving will make the system more stable overall.
  • The performance requirements for the content generator will be much lower, since it can regenerate content as slowly as you like.

Of course, if you do not know in advance which pages you need to generate, this will not work.

0

Although I have not done this before, this is the technique I would look at.

  • Start by creating an S3 bucket configured as a website redirect, as you described.

  • Look at S3 event notifications. They only fire when an S3 object is created, but you could try the GET first and, if it gets no response, POST or PUT a "marker" file to the same path (or make an API call) that will fire the event.

https://aws.amazon.com/blogs/aws/s3-event-notification/
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html

  • Once the event fires, either have your server listen for it via SQS, or move your artist-page generator code into an AWS Lambda function subscribed through SNS (a skeleton handler is sketched below).
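If you go the Lambda route, a skeleton handler for the S3 creation event might look like this (needs the aws-lambda-java-events library; the marker-key convention and generator call are hypothetical):

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.S3Event;
    import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;

    public class MarkerCreatedHandler implements RequestHandler<S3Event, Void> {
        @Override
        public Void handleRequest(S3Event event, Context context) {
            for (S3EventNotificationRecord record : event.getRecords()) {
                String bucket = record.getS3().getBucket().getName();
                String key = record.getS3().getObject().getKey();
                context.getLogger().log("Marker created: " + bucket + "/" + key);
                // Hypothetical: derive the artist from the marker key and
                // regenerate the real page, e.g. generateAndUpload(key).
            }
            return null;
        }
    }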

My only concern is where the GET would come from. You do not want just anyone hitting your S3 bucket with an invalid path, or you would be generating pages all over the place. But I will leave that as an exercise for the reader.

0

Why not put a web server such as nginx or Apache in front of Tomcat? Tomcat runs on some other port, such as 8085, while the web server runs on port 80, receives the hits, and has its own cache. Then you do not need S3 at all; cache misses fall back to your server, and CloudFront can stay in front.

That way CloudFront hits your web server; if the page is in its cache, the web server returns it directly, and everything else goes through to Tomcat.

The cache can live in the same process or in Redis, etc., depending on the total size of the data that needs to be cached.

0
