TL;DR: I have a table with millions of rows, and I am wondering how best to index it.
I have a Django project that uses SQL Server as a database.
After creating a model with approximately 14 million instances in the production environment, I realized that I was having performance issues:
class UserEvent(models.Model):
    A_EVENT = 'A'
    B_EVENT = 'B'
    types = (
        (A_EVENT, 'Event A'),
        (B_EVENT, 'Event B'),
    )
    event_type = models.CharField(max_length=1, choices=types)
    contract = models.ForeignKey(Contract)
I run a lot of queries that filter on event_type, and they are very slow because the field is not indexed. Filtering the model on this field alone takes almost 7 seconds, whereas a query on the indexed foreign key causes no performance problems:
UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
When I realized this, I also asked myself: "Shouldn't this field be a SmallIntegerField? I only have a small set of options, and queries on integer fields are more efficient than queries on text / varchar fields."
So, from what I understand, I have two options *:
* I understand that there may be a third option, since indexing a low-cardinality field may not bring a serious improvement. But since my values are split [1%-99%] (and I'm looking for the 1% part), indexing this field seems to be a valid option.
A) Just index this field and leave it as CharField.
A_EVENT = 'A'
B_EVENT = 'B'
types = (
    (A_EVENT, 'Event A'),
    (B_EVENT, 'Event B'),
)
event_type = models.CharField(max_length=1, choices=types, db_index=True)
B) Run a migration to convert this field to a SmallIntegerField (I don't want it to be a BooleanField, because more options may be added to the field later), and then index the field.
A_EVENT = 1
B_EVENT = 2
types = (
    (A_EVENT, 'Event A'),
    (B_EVENT, 'Event B'),
)
event_type = models.SmallIntegerField(choices=types, db_index=True)
Option A
Pros: Simplicity
Cons: CharField indexes are less efficient than Integer based indexes
Option B
Pros: Integer indices are more efficient than CharField indices
Cons: I need to perform a complicated operation:
- A schema migration to create a new SmallIntegerField.
- A data migration: copying (and converting) millions of rows from the old field to the new field.
- Updating the project code to use the new field, or running another schema migration to rename the new field to the old name.
- Delete the old field.
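The first two steps of Option B can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table name user_event, the column name event_type_int, the CODE_TO_INT mapping, and the batch size are all hypothetical. In a Django project the same batching logic would more likely live in a data migration (for example, a function run through migrations.RunPython).

```python
import sqlite3

# Hypothetical mapping from the old one-character codes to the new integers.
CODE_TO_INT = {"A": 1, "B": 2}


def backfill_in_batches(conn, batch_size=1000):
    """Copy event_type into event_type_int chunk by chunk, walking the primary key."""
    last_id = 0
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT id, event_type FROM user_event WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "UPDATE user_event SET event_type_int = ? WHERE id = ?",
            [(CODE_TO_INT[code], row_id) for row_id, code in rows],
        )
        conn.commit()  # one short transaction per batch instead of one huge one
        last_id = rows[-1][0]
        batches += 1
    return batches


# Small in-memory table standing in for the real one (5,000 rows, about 1% 'B').
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_event (id INTEGER PRIMARY KEY, event_type TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO user_event (event_type) VALUES (?)",
    [("B" if i % 100 == 0 else "A",) for i in range(5000)],
)
conn.commit()

# Step 1: schema change, add the new nullable integer column.
conn.execute("ALTER TABLE user_event ADD COLUMN event_type_int INTEGER")

# Step 2: data transfer, convert the old values in batches.
batches = backfill_in_batches(conn)
print("batches run:", batches)
```

Working in batches keyed on the primary key means the copy can be stopped and resumed, and no single transaction has to touch all the rows at once.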
To summarize, the real question is here:
Is the performance improvement that I would get from migrating the field to SmallIntegerField worth the risk?
I am inclined to try option A first and check whether the performance improvement is sufficient.
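One way to run that kind of check before touching production is to reproduce the skewed distribution on a small scale and compare the query plan with and without the index. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, a hypothetical table named user_event, and 100,000 rows instead of 14 million, so the planner's behaviour and any timings will not transfer directly.

```python
import sqlite3


def build_table(conn, total_rows=100_000, rare_every=100):
    """Create a hypothetical user_event table where about 1% of rows are 'B'."""
    conn.execute(
        "CREATE TABLE user_event (id INTEGER PRIMARY KEY, event_type TEXT NOT NULL)"
    )
    rows = (("B" if i % rare_every == 0 else "A",) for i in range(total_rows))
    conn.executemany("INSERT INTO user_event (event_type) VALUES (?)", rows)
    conn.commit()


def query_plan(conn):
    """Return the plan text SQLite reports for counting the rare value."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM user_event WHERE event_type = 'B'"
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(row[-1] for row in plan)


def count_rare(conn):
    """Run the same count as UserEvent.objects.filter(event_type='B').count()."""
    return conn.execute(
        "SELECT COUNT(*) FROM user_event WHERE event_type = 'B'"
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
build_table(conn)

plan_before = query_plan(conn)
count_before = count_rare(conn)

# Equivalent in spirit to adding db_index=True on the field.
conn.execute("CREATE INDEX user_event_event_type_idx ON user_event (event_type)")

plan_after = query_plan(conn)
count_after = count_rare(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

If the plan after creating the index mentions the index name, the database is using it for the 1% lookup. On SQL Server the equivalent check would be to inspect the execution plan of the generated query.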
I am also raising this on Stack Overflow because a more general question arose:
- Is there a situation where a CharField with choices is a better option than a BooleanField / IntegerField / SmallIntegerField?
This situation arose because, when defining the project's models, I was inspired by a piece of code from the Django documentation:
YEAR_IN_SCHOOL_CHOICES = (
    ('FR', 'Freshman'),
    ('SO', 'Sophomore'),
    ('JR', 'Junior'),
    ('SR', 'Senior'),
)
year_in_school = models.CharField(max_length=2, choices=YEAR_IN_SCHOOL_CHOICES, default=FRESHMAN)
Why do they use characters when they could use integers, given that the stored value is just an internal representation that should never be displayed?