Several Postgres SELECT processes (from Django GET requests) are stuck, resulting in 100% CPU usage

I will try to give as much information as possible here. A complete solution would be wonderful, but I would also appreciate recommendations on how to approach this problem: what to look for in the log files, and so on, since I am new to server maintenance. Any advice is appreciated.

Here is what happens in chronological order:

  • I run two DigitalOcean Droplets (Ubuntu 14.04 VPS)
  • Droplet #1 runs Django, nginx, and Gunicorn
  • Droplet #2 runs Postgres
  • Everything works fine for a month, then suddenly the Postgres Droplet hits 100% CPU usage
  • You can see the htop output when this happens; I added a screenshot
  • Another screenshot is the nginx error.log; you can see the problem started at 15:56:14, which I highlighted with a red frame
  • Running sudo poweroff on the Postgres Droplet and restarting it does not fix the problem
  • Restoring the Postgres Droplet to my last backup (20 hours old) solves the problem, but it keeps recurring. This is the 7th time in 2 days.

I will continue to investigate and will add more information as I find it. Meanwhile, any opinions are welcome.

Thanks.

Postgres server htop output when the problem occurs
nginx error.log when the problem occurs

Update May 20, 2016

  • Slow query logging is now enabled on the Postgres server, as recommended by e4c5
  • After 6 hours, the server froze (100% CPU usage) again, at 8:07 AM. I have added all the related screenshots.
  • The browser shows a 502 error when trying to access the site during the freeze
  • sudo service postgresql restart (and restarting gunicorn and nginx on the Django server) does NOT fix the freeze (I think this is a very interesting point)
  • However, restoring the Postgres server to my previous backup (by now 2 days old) DOES fix the freeze
  • The Postgres log error message is: could not send data to client: Broken pipe
  • The request in the nginx error log is a simple django-rest-framework API call that returns only 20 items (each with some foreign-key data)

Update #2, May 20, 2016. When the freeze occurs, I tried the following, in chronological order (shut everything down, then bring the services back one by one):

  • sudo service postgresql stop → CPU usage drops to 0-10%
  • sudo service gunicorn stop → CPU usage stays at 0-10%
  • sudo service nginx stop → CPU usage stays at 0-10%
  • sudo service postgresql restart → CPU usage stays at 0-10%
  • sudo service gunicorn restart → CPU usage stays at 0-10%
  • sudo service nginx restart → CPU usage jumps to 100% and stays there

So, doesn't this suggest it is not simply about server load or long query times?

This is very confusing, because if I restore the database to my last backup (2 days ago), everything comes back online without my even touching the nginx / gunicorn / Django server...
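One way to see what the stuck backends are actually doing during such a freeze is Postgres's pg_stat_activity view. A diagnostic sketch (column names are for Postgres 9.2+; older releases use current_query instead of query):

```sql
-- Run in psql while the CPU is pegged: shows each non-idle backend,
-- how long its current statement has been running, and its SQL text.
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```

If the same SELECT appears many times with long runtimes, that points at one query piling up rather than a server-wide fault.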


Update June 8, 2016. I turned on slow query logging, set to log any query that takes longer than 1000 ms.
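For reference, this is controlled by log_min_duration_statement in postgresql.conf (a config sketch; the value is in milliseconds, and -1 disables the feature):

```
# postgresql.conf -- log every statement that runs for longer than 1000 ms
log_min_duration_statement = 1000

# optional: prefix log lines with a timestamp and the backend PID
log_line_prefix = '%t [%p] '
```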

This query showed up in the log many times:

 SELECT "products_product"."id",
        "products_product"."seller_id",
        "products_product"."priority",
        "products_product"."media",
        "products_product"."active",
        "products_product"."title",
        "products_product"."slug",
        "products_product"."description",
        "products_product"."price",
        "products_product"."sale_active",
        "products_product"."sale_price",
        "products_product"."timestamp",
        "products_product"."updated",
        "products_product"."draft",
        "products_product"."hitcount",
        "products_product"."finished",
        "products_product"."is_marang_offline",
        "products_product"."is_seller_beta_program",
        COUNT("products_video"."id") AS "num_video"
 FROM "products_product"
 LEFT OUTER JOIN "products_video"
        ON ("products_product"."id" = "products_video"."product_id")
 WHERE ("products_product"."draft" = false AND "products_product"."finished" = true)
 GROUP BY "products_product"."id", "products_product"."seller_id", "products_product"."priority",
          "products_product"."media", "products_product"."active", "products_product"."title",
          "products_product"."slug", "products_product"."description", "products_product"."price",
          "products_product"."sale_active", "products_product"."sale_price", "products_product"."timestamp",
          "products_product"."updated", "products_product"."draft", "products_product"."hitcount",
          "products_product"."finished", "products_product"."is_marang_offline", "products_product"."is_seller_beta_program"
 HAVING COUNT("products_video"."id") >= 8
 ORDER BY "products_product"."priority" DESC, "products_product"."hitcount" DESC
 LIMIT 100

I know it is an ugly query (generated by Django's annotate()). In plain English, it simply means: "give me the list of products that have at least 8 videos."
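To make the intent concrete, the join-group-having logic can be mirrored in plain Python over hypothetical sample rows (purely illustrative; the tuples below are made up, not real data):

```python
from collections import Counter

# Hypothetical rows: products as (id, draft, finished), videos as (id, product_id)
products = [(1, False, True), (2, False, True), (3, True, True)]
videos = [(v, 1) for v in range(8)] + [(v, 2) for v in range(3)]

# LEFT OUTER JOIN + GROUP BY: count videos per product id
video_counts = Counter(product_id for _, product_id in videos)

# WHERE draft = false AND finished = true ... HAVING COUNT(...) >= 8
result = [pid for pid, draft, finished in products
          if not draft and finished and video_counts[pid] >= 8]
```

Here only product 1 passes: product 2 has too few videos, and product 3 is still a draft.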

And here is the EXPLAIN output for this query:

  QUERY PLAN
  ----------------------------------------------------------------------------------------------
  Limit  (cost=351.90..358.40 rows=100 width=933)
    ->  GroupAggregate  (cost=351.90..364.06 rows=187 width=933)
          Filter: (count(products_video.id) >= 8)
          ->  Sort  (cost=351.90..352.37 rows=187 width=933)
                Sort Key: products_product.priority, products_product.hitcount, products_product.id, products_product.seller_id, products_product.media, products_product.active, products_product.title, products_product.slug, products_product.description, products_product.price, products_product.sale_active, products_product.sale_price, products_product."timestamp", products_product.updated, products_product.draft, products_product.finished, products_product.is_marang_offline, products_product.is_seller_beta_program
                ->  Hash Right Join  (cost=88.79..344.84 rows=187 width=933)
                      Hash Cond: (products_video.product_id = products_product.id)
                      ->  Seq Scan on products_video  (cost=0.00..245.41 rows=2341 width=8)
                      ->  Hash  (cost=88.26..88.26 rows=42 width=929)
                            ->  Seq Scan on products_product  (cost=0.00..88.26 rows=42 width=929)
                                  Filter: ((NOT draft) AND finished)
  (11 rows)

--- Update June 8, 2016 #2 --- Since many people have offered suggestions, I will try the fixes one by one and report back periodically.

@e4c5 Here is the information you asked for:

You can think of my site as an online course site like Udemy. There are "Products" (courses), and each product contains several videos. Users can comment on both the product page and each video page.

In many cases, I need to query the list of products ordered by the TOTAL number of comments received (the product's own comments plus the comments on each of its videos).

The Django query that corresponds to the EXPLAIN output above:

 all_products_exclude_draft = Product.objects.all().filter(draft=False)
 products_that_contain_more_than_8_videos = all_products_exclude_draft.annotate(
     num_video=Count('video')
 ).filter(finished=True, num_video__gte=8).order_by('timestamp')[:30]

I just noticed that I (or some other developer on my team) hit the database twice with these two lines of Python.

Here are the Django models for Product and Video:

 from django_model_changes import ChangesMixin

 class Product(ChangesMixin, models.Model):
     class Meta:
         ordering = ['-priority', '-hitcount']

     seller = models.ForeignKey(SellerAccount)
     priority = models.PositiveSmallIntegerField(default=1)
     media = models.ImageField(blank=True, null=True,
                               upload_to=download_media_location,
                               default=settings.MEDIA_ROOT + '/images/default_icon.png',
                               storage=FileSystemStorage(location=settings.MEDIA_ROOT))
     active = models.BooleanField(default=True)
     title = models.CharField(max_length=500)
     slug = models.SlugField(max_length=200, blank=True, unique=True)
     description = models.TextField()
     product_coin_price = models.IntegerField(default=0)
     sale_active = models.BooleanField(default=False)
     sale_price = models.IntegerField(default=0, null=True, blank=True)  # 100.00
     timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
     updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
     draft = models.BooleanField(default=True)
     hitcount = models.IntegerField(default=0)
     finished = models.BooleanField(default=False)
     is_marang_offline = models.BooleanField(default=False)
     is_seller_beta_program = models.BooleanField(default=False)

     def __unicode__(self):
         return self.title

     def get_avg_rating(self):
         rating_avg = self.productrating_set.aggregate(Avg("rating"), Count("rating"))
         return rating_avg

     def get_total_comment_count(self):
         comment_count = self.video_set.aggregate(Count("comment"))
         comment_count['comment__count'] += self.comment_set.count()
         return comment_count

     def get_total_hitcount(self):
         amount = self.hitcount
         for video in self.video_set.all():
             amount += video.hitcount
         return amount

     def get_absolute_url(self):
         view_name = "products:detail_slug"
         return reverse(view_name, kwargs={"slug": self.slug})

     def get_product_share_link(self):
         full_url = "%s%s" % (settings.FULL_DOMAIN_NAME, self.get_absolute_url())
         return full_url

     def get_edit_url(self):
         view_name = "sellers:product_edit"
         return reverse(view_name, kwargs={"pk": self.id})

     def get_video_list_url(self):
         view_name = "sellers:video_list"
         return reverse(view_name, kwargs={"pk": self.id})

     def get_product_delete_url(self):
         view_name = "products:product_delete"
         return reverse(view_name, kwargs={"pk": self.id})

     @property
     def get_price(self):
         if self.sale_price and self.sale_active:
             return self.sale_price
         return self.product_coin_price

     @property
     def video_count(self):
         videoCount = self.video_set.count()
         return videoCount


 class Video(models.Model):
     seller = models.ForeignKey(SellerAccount)
     title = models.CharField(max_length=500)
     slug = models.SlugField(max_length=200, null=True, blank=True)
     story = models.TextField(default=" ")
     chapter_number = models.PositiveSmallIntegerField(default=1)
     active = models.BooleanField(default=True)
     featured = models.BooleanField(default=False)
     product = models.ForeignKey(Product, null=True)
     timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
     updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
     draft = models.BooleanField(default=True)
     hitcount = models.IntegerField(default=0)

     objects = VideoManager()

     class Meta:
         unique_together = ('slug', 'product')
         ordering = ['chapter_number', 'timestamp']

     def __unicode__(self):
         return self.title

     def get_comment_count(self):
         comment_count = self.comment_set.all_jing_jing().count()
         return comment_count

     def get_create_chapter_url(self):
         return reverse("sellers:video_create", kwargs={"pk": self.id})

     def get_edit_url(self):
         view_name = "sellers:video_update"
         return reverse(view_name, kwargs={"pk": self.id})

     def get_video_delete_url(self):
         view_name = "products:video_delete"
         return reverse(view_name, kwargs={"pk": self.id})

     def get_absolute_url(self):
         try:
             return reverse("products:video_detail",
                            kwargs={"product_slug": self.product.slug, "pk": self.id})
         except:
             return "/"

     def get_video_share_link(self):
         full_url = "%s%s" % (settings.FULL_DOMAIN_NAME, self.get_absolute_url())
         return full_url

     def get_next_url(self):
         current_product = self.product
         videos = current_product.video_set.all().filter(chapter_number__gt=self.chapter_number)
         next_vid = None
         if len(videos) >= 1:
             try:
                 next_vid = videos[0].get_absolute_url()
             except IndexError:
                 next_vid = None
         return next_vid

     def get_previous_url(self):
         current_product = self.product
         videos = current_product.video_set.all().filter(chapter_number__lt=self.chapter_number).reverse()
         next_vid = None
         if len(videos) >= 1:
             try:
                 next_vid = videos[0].get_absolute_url()
             except IndexError:
                 next_vid = None
         return next_vid

And here are the indexes of the Product and Video tables that I got from this command:

 my_database_name=# \di 

Note: the screenshot is photoshopped to crop out some other models. Product and Video model indexes


--- Update June 8, 2016 #3 --- @Jerzyk As you suspected, after rechecking all my code I found that I was indeed doing "slicing in memory": I tried to shuffle the first 10 results by doing the following:

 def get_queryset(self):
     all_product_list = Product.objects.all().filter(draft=False).annotate(
         num_video=Count(
             Case(
                 When(
                     video__draft=False,
                     then=1,
                 )
             )
         )
     ).order_by('-priority', '-num_video', '-hitcount')
     the_first_10_products = list(all_product_list[:10])
     the_11th_product_onwards = list(all_product_list[10:])
     random.shuffle(the_first_10_products)
     finalList = the_first_10_products + the_11th_product_onwards
     return finalList

Note: in the code above, I need to count only the videos that are not in draft status.
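As an aside, the "shuffle the first 10, keep the rest in ranked order" step can be written and unit-tested as a pure function on a plain list, independent of the ORM (a sketch; the function name is mine, not from the codebase):

```python
import random

def shuffle_head(items, head_size=10, rng=random):
    """Return a new list whose first `head_size` items are shuffled
    and whose remaining items keep their original order."""
    head = list(items[:head_size])
    tail = list(items[head_size:])
    rng.shuffle(head)
    return head + tail

mixed = shuffle_head(list(range(25)))
```

Keeping this logic separate also makes it obvious that only two small slices need to be materialized, not the whole queryset.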

So this will be one of the things I need to fix. Thank you >_<


--- Below are the relevant screenshots ---

Postgres slow-query log when the freeze occurs (log_min_duration_statement = 500 ms): Postgres log, May 20, 2016

Postgres log, continued from the screenshot above: Postgres log, May 20, 2016 (page 2)

Nginx error.log for the same time period: Nginx log, May 20, 2016

DigitalOcean CPU usage graph just before the freeze: DigitalOcean graph, May 20, 2016 (1)

DigitalOcean CPU usage graph right after the freeze: DigitalOcean graph, May 20, 2016 (2)

2 answers

We can conclude that your problems are caused by a slow query. On its own, each run of that query does not look slow enough to cause timeouts; however, it is possible that several of these queries run simultaneously, and that could cause the meltdown. There are a few things you can do to speed things up.

1) Cache the result

The result of this long-running query can be cached:

 from django.core.cache import cache

 def get_8x_videos():
     cache_key = 'products_videos_join'
     result = cache.get(cache_key, None)
     if not result:
         all_products_exclude_draft = Product.objects.all().filter(draft=False)
         # evaluate the queryset with list() so the rows themselves get cached,
         # not just the lazy queryset object
         result = list(
             all_products_exclude_draft.annotate(num_video=Count('video'))
                                       .filter(finished=True, num_video__gte=8)
                                       .order_by('timestamp')[:30]
         )
         cache.set(cache_key, result)
     return result

This result now comes from memcache (or whatever you use for caching), which means that if a page using it gets two hits in quick succession, the second one will not touch the database at all. You can control how long the object stays cached in memory.
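The get/compute/set pattern described above can be illustrated with a tiny dict-backed stand-in for Django's cache (a sketch; the real cache additionally handles timeouts across requests, serialization, and cross-process sharing):

```python
import time

class TinyCache:
    """Minimal dict-backed cache with per-key expiry, mimicking cache.get/cache.set."""
    def __init__(self):
        self._store = {}

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]
            return default
        return value

    def set(self, key, value, timeout=300):
        self._store[key] = (value, time.monotonic() + timeout)

calls = []

def expensive_query():
    calls.append(1)            # stands in for the slow SQL hit
    return ['product-list']

cache = TinyCache()

def get_products():
    result = cache.get('products_videos_join')
    if result is None:
        result = expensive_query()
        cache.set('products_videos_join', result, timeout=300)
    return result

get_products()
get_products()                 # second call is served from the cache
```

The trade-off is staleness: until the timeout elapses, newly finished products will not appear in the list.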

2) Optimize the query

The first thing that jumps out from the EXPLAIN output is that you are doing sequential scans on the products_product and products_video tables. Usually sequential scans are less desirable than index scans; however, an index scan cannot be used for this query because of the COUNT() and HAVING COUNT() conditions, as well as the massive GROUP BY clause.

Update:

Your query has a LEFT OUTER JOIN; it is possible that an INNER JOIN or a subquery would be faster. To see why, note that grouping the Video table by product_id gives us the set of products that have at least 8 videos.

 inner = RawSQL('SELECT product_id FROM products_video GROUP BY product_id HAVING COUNT(product_id) >= 8', params=[])
 Product.objects.filter(id__in=inner)

The above eliminates the LEFT OUTER JOIN and introduces a subquery. However, it does not give easy access to the actual number of videos for each product, so the query in this form may not be fully usable.

3) Index Improvement

Although it may be tempting to create an index on the draft and finished columns, it would be useless: those columns do not have enough cardinality to be good index candidates. However, it is possible to create a conditional index. Again, a conclusion can only be drawn after seeing your tables.
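For what it is worth, a conditional index in Postgres is spelled as a partial index. A sketch of what one might look like here (hypothetical index name; verify with EXPLAIN before and after, since whether it helps depends entirely on the data distribution):

```sql
-- Index only the rows the hot query touches (draft = false AND finished = true),
-- ordered the way the query sorts them.
CREATE INDEX products_product_live_idx
    ON products_product (priority DESC, hitcount DESC)
    WHERE NOT draft AND finished;
```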


*** Update June 7, 2016: the problem recurred. The CPU reached 100% and stays there. This answer really helps performance, but unfortunately it is not the solution to this problem.


Thanks to a recommendation from the DigitalOcean support team, I tried the settings suggested by this tool:

http://pgtune.leopard.in.ua/

It recommended the following values for my Droplet with 1 CPU core and 1 GB of RAM:

in postgresql.conf:

 max_connections = 200
 shared_buffers = 256MB
 effective_cache_size = 768MB
 work_mem = 1310kB
 maintenance_work_mem = 64MB
 checkpoint_segments = 32
 checkpoint_completion_target = 0.7
 wal_buffers = 7864kB
 default_statistics_target = 100

in /etc/sysctl.conf:

 kernel.shmmax=536870912
 kernel.shmall=131072

So far, my Postgres server has been running fine for 3-4 days, so I guess this is the solution. Thanks everyone!

