Django __in filter query performance

I have a complex database model set up in Django, and I need to do a series of calculations on filtered data. I have a Test object, a TestAttempt object (with a foreign key to Test), and a UserProfile object (with a foreign key to the attempt and a foreign key back to the user). There is a method on TestAttempt that calculates that attempt's score (based on the answers the user submitted, compared to the correct answers associated with the test), and another method on Test that calculates the average score across all of its associated TestAttempts. But sometimes I only need the average over the subset of the associated TestAttempts that belong to a specific set of UserProfiles. So instead of computing the average score for a given test like this:

 [x.score() for x in self.test_attempts.all()] 

and then averaging those values, I run a query like this:

 [x.score() for x in self.test_attempts.filter(profile__id__in=user_id_list).all()] 

where user_id_list is the subset of UserProfile ids I want to average over, passed as a list. My question is this: if user_id_list actually covers the entire set of UserProfiles (so the filter returns the same rows as self.test_attempts.all()), and most of the time it will, is it worth checking for that case and skipping the filter entirely? Or is __in efficient enough that running the filter is fine even when user_id_list contains every user? Also, do I need to worry about making the result of test_attempts distinct(), or can this query structure never produce duplicates?

EDIT: for anyone interested, here is the raw SQL query without the filter:

 SELECT "mc_grades_testattempt"."id", "mc_grades_testattempt"."date", "mc_grades_testattempt"."test_id", "mc_grades_testattempt"."student_id" FROM "mc_grades_testattempt" WHERE "mc_grades_testattempt"."test_id" = 1 

and here it is with the filter:

 SELECT "mc_grades_testattempt"."id", "mc_grades_testattempt"."date", "mc_grades_testattempt"."test_id", "mc_grades_testattempt"."student_id" FROM "mc_grades_testattempt" INNER JOIN "mc_grades_userprofile" ON ("mc_grades_testattempt"."student_id" = "mc_grades_userprofile"."id") WHERE ("mc_grades_testattempt"."test_id" = 1 AND "mc_grades_userprofile"."user_id" IN (1, 2, 3)) 

Note that the array (1, 2, 3) is just an example.
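One way to convince yourself about the distinct() part of the question is to replay the filtered query against a toy copy of the two tables (schema trimmed to the columns the posted SQL uses; the row data is invented). Since student_id points at exactly one mc_grades_userprofile row, the INNER JOIN cannot multiply attempts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy versions of the two tables from the posted SQL.
cur.execute("CREATE TABLE mc_grades_userprofile"
            " (id INTEGER PRIMARY KEY, user_id INTEGER)")
cur.execute("CREATE TABLE mc_grades_testattempt"
            " (id INTEGER PRIMARY KEY, test_id INTEGER, student_id INTEGER)")

cur.executemany("INSERT INTO mc_grades_userprofile VALUES (?, ?)",
                [(10, 1), (11, 2), (12, 3)])
cur.executemany("INSERT INTO mc_grades_testattempt VALUES (?, ?, ?)",
                [(1, 1, 10), (2, 1, 11), (3, 1, 12), (4, 2, 10)])

# The filtered query from the question: each attempt joins to at most
# one profile row, so no attempt can appear twice in the result.
rows = cur.execute("""
    SELECT t.id FROM mc_grades_testattempt t
    INNER JOIN mc_grades_userprofile p ON t.student_id = p.id
    WHERE t.test_id = 1 AND p.user_id IN (1, 2, 3)
    ORDER BY t.id
""").fetchall()

print(rows)  # [(1,), (2,), (3,)] - each attempt exactly once
```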

+7
3 answers
  • The short answer is the standard one: benchmark it. Test both versions in realistic situations and measure the load; that will give you the best answer.

  • There can be no duplicates with this query structure, so you don't need distinct().

  • Is it really a problem to handle both cases in one place? Here's some hypothetical code:

     def average_score(self, user_id_list=None):
         qset = self.test_attempts.all()
         if user_id_list is not None:
             qset = qset.filter(profile__id__in=user_id_list)
         scores = [x.score() for x in qset]
         # ... compute and return the average of scores
  • I don't know what the score method does, but can you not calculate the average at the database level? That would give you a much more noticeable performance gain.

  • And don't forget about caching.
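The database-level averaging suggestion is worth taking seriously. Whether it applies depends on what score() does; if the score can be stored (or derived) as a column, the database can average it in a single round trip. A minimal sketch of the difference using stdlib sqlite3 (table and column names invented for illustration):

```python
import sqlite3
from statistics import mean

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attempt"
            " (id INTEGER PRIMARY KEY, test_id INTEGER, score REAL)")
cur.executemany("INSERT INTO attempt (test_id, score) VALUES (?, ?)",
                [(1, 70.0), (1, 80.0), (1, 90.0), (2, 50.0)])

# One round trip: the database computes the average itself.
(db_avg,) = cur.execute(
    "SELECT AVG(score) FROM attempt WHERE test_id = 1").fetchone()

# Versus pulling every row back and averaging in Python.
py_avg = mean(s for (s,) in
              cur.execute("SELECT score FROM attempt WHERE test_id = 1"))

print(db_avg, py_avg)  # 80.0 80.0 - same value, far less data transferred
```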

+2

From what I understand of the documentation, querysets are lazy: the SQL is built up front, but nothing hits the database until the queryset is actually consumed. So test_attempts.all() generates its SQL once, and the query only executes when you evaluate it, e.g. via .count(), via for t in test_attempts.all():, and so on. At that point Django runs the query and returns a QuerySet of objects (or a single object if you used get()). With this in mind, the number of database calls will be exactly the same in both versions; only the actual SQL differs. As you can see from your edit, the raw queries are different, but both are constructed the same way before any data reaches Django.

In my opinion it would be worse to special-case the all() situation, since you would need to run TWO queries (one to check whether user_id_list covers everyone) just to detect it. I would keep the code you have and skip the check for the all() case, even though you describe it as the most common one. Most modern database engines optimize the join order when planning a query, so the added join should not noticeably hurt performance.

+2

Use annotation instead of iterating over the queryset, triggering a new database hit for each item, and doing the average in Python.

 from django.db.models import Avg

 ms = MyModel.objects.annotate(Avg('some_field'))
 ms[0].some_field__avg  # the average for that instance

This returns a queryset with the average available as an attribute on each object in it. Using the ORM this way may require structural changes to the foreign-key relationships and to which model holds the data, in order to make the annotation convenient. That reshuffling, if needed, tends to have beneficial side effects (data likes to live a certain way), so it's a good exercise.
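To see what annotate() buys you here, it helps to look at the SQL it roughly corresponds to: a GROUP BY with AVG, producing one average per parent row (so something like a hypothetical Test.objects.annotate(Avg('test_attempts__score')) would yield every test's average in one query). A sketch of that shape in stdlib sqlite3, with an invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attempt"
            " (id INTEGER PRIMARY KEY, test_id INTEGER, score REAL)")
cur.executemany("INSERT INTO attempt (test_id, score) VALUES (?, ?)",
                [(1, 70.0), (1, 90.0), (2, 40.0), (2, 60.0)])

# Roughly what an annotate(Avg(...)) compiles to: one row per test,
# each carrying its own average, computed entirely in the database.
per_test = cur.execute("""
    SELECT test_id, AVG(score) FROM attempt
    GROUP BY test_id ORDER BY test_id
""").fetchall()

print(per_test)  # [(1, 80.0), (2, 50.0)]
```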

+2
