Django Model Choices: IntegerField vs CharField

TL;DR: I have a table with millions of instances, and I am wondering how I should index it.

I have a Django project that uses SQL Server as a database.

Once one of my models reached approximately 14 million instances in the production environment, I realized that I was having performance issues:

    class UserEvent(models.Model):
        A_EVENT = 'A'
        B_EVENT = 'B'
        types = (
            (A_EVENT, 'Event A'),
            (B_EVENT, 'Event B')
        )
        event_type = models.CharField(max_length=1, choices=types)
        contract = models.ForeignKey(Contract)
        # field_x = (...)
        # field_y = (...)

I run a lot of queries that filter on event_type, and they are very inefficient because the field is not indexed. Filtering the model on this field alone takes almost 7 seconds, whereas a query on an indexed foreign key shows no performance problem:

    UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
    # elapsed time: 0:00:06.921287

    UserEvent.objects.filter(contract_id=62).count()
    # elapsed time: 0:00:00.344261

When I realized this, I also asked myself: "Shouldn't this field be a SmallIntegerField? I have only a small set of options, and queries on integer fields are more efficient than queries on text/varchar fields."

So, from what I understand, I have two options *:

* I understand that there may be a third option, since indexing a field with low cardinality may not bring a significant improvement. However, my values are split [1% - 99%] (and I am looking for the 1% part), so indexing this field seems to be a valid option.

  • A) Just index this field and leave it as CharField.

     A_EVENT = 'A'
     B_EVENT = 'B'
     types = (
         (A_EVENT, 'Event A'),
         (B_EVENT, 'Event B')
     )
     event_type = models.CharField(max_length=1, choices=types, db_index=True)

  • B) Run a migration to convert this field to a SmallIntegerField (I don't want a BooleanField, because more options may be added to the field later), and then index the field.

     A_EVENT = 1
     B_EVENT = 2
     types = (
         (A_EVENT, 'Event A'),
         (B_EVENT, 'Event B')
     )
     event_type = models.SmallIntegerField(choices=types, db_index=True)

Option A

Pros: Simplicity

Cons: CharField indexes are less efficient than Integer based indexes

Option B

Pros: Integer indices are more efficient than CharField indices

Cons: I need to perform a complicated operation:

  • A schema migration to create a new SmallIntegerField.
  • A data migration: copying (and converting) millions of instances from the old field to the new field.
  • Updating the project code to use the new field, or running another schema migration to rename the new field to the old name.
  • Deleting the old field.
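
The first two steps above (adding the new column, then backfilling it in batches) can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module so that it runs anywhere; it is not a Django migration and not SQL Server syntax. The table name `user_event`, the column `event_type_int`, the index name, and the `CHAR_TO_INT` mapping are all hypothetical.

```python
import sqlite3

# Hypothetical mapping from the old one-character codes to the new integers.
CHAR_TO_INT = {"A": 1, "B": 2}


def backfill(conn, batch_size=1000):
    """Copy event_type into event_type_int, walking the primary key in batches."""
    last_id = 0
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, event_type FROM user_event "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return total
        conn.executemany(
            "UPDATE user_event SET event_type_int = ? WHERE id = ?",
            [(CHAR_TO_INT[event_type], row_id) for row_id, event_type in rows],
        )
        conn.commit()  # commit per batch to keep each transaction short
        last_id = rows[-1][0]
        total += len(rows)


# Small in-memory stand-in for the real table: 1% 'B', 99% 'A'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_event (id INTEGER PRIMARY KEY, event_type TEXT)")
conn.executemany(
    "INSERT INTO user_event (event_type) VALUES (?)",
    [("B" if i % 100 == 0 else "A",) for i in range(5000)],
)

# Step 1: schema change, the new column starts out NULL.
conn.execute("ALTER TABLE user_event ADD COLUMN event_type_int INTEGER")
# Step 2: data conversion in batches.
converted = backfill(conn)
# Index the new column once it is populated.
conn.execute("CREATE INDEX user_event_type_int_idx ON user_event (event_type_int)")

b_events = conn.execute(
    "SELECT COUNT(*) FROM user_event WHERE event_type_int = ?", (2,)
).fetchone()[0]
missing = conn.execute(
    "SELECT COUNT(*) FROM user_event WHERE event_type_int IS NULL"
).fetchone()[0]
```

Batching by primary key keeps each transaction small, which matters more on a 14-million-row table than in this toy example.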

To summarize, the real question is here:

Is the performance improvement that I would get from migrating the field to SmallIntegerField worth the risk?

I am inclined to try option A and check whether the performance improvement is sufficient.


I also raised this question in StackOverflow because a more general question arose:

  • Is there any situation where using a CharField for Django choices is a better option than using a Boolean / Integer / SmallIntegerField?

This situation arose because, when defining the project's models, I followed a code sample from the Django documentation:

    YEAR_IN_SCHOOL_CHOICES = (
        ('FR', 'Freshman'),
        ('SO', 'Sophomore'),
        ('JR', 'Junior'),
        ('SR', 'Senior'),
    )
    year_in_school = models.CharField(max_length=2, choices=YEAR_IN_SCHOOL_CHOICES, default=FRESHMAN)

Why do they use characters when they can use integers, since it's just a representation of a value that should never be displayed?

django sql-server indexing django-models
1 answer

Speed of count queries

    UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
    # elapsed time: 0:00:06.921287

Queries of this kind, unfortunately, will always be slow in databases if the table has a large number of records.

MySQL optimizes count queries by reading the index, provided the indexed columns are numeric. That would be a good reason to use SmallIntegerField instead of CharField if you were on MySQL, but apparently you are not. Your mileage will vary on other databases. I am not an expert on SQL Server, but my understanding is that it is particularly poor at using indexes for COUNT(*) queries.
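
One way to check whether a filtered count actually uses an index is to look at the query plan before and after creating it. The sketch below does this with Python's built-in sqlite3 module, purely because it runs without any setup. SQLite's planner is not SQL Server's, so treat it as an illustration of the method rather than a prediction for your database. The table name, index name, and the 1% / 99% data split are assumptions made for the example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_event ("
    "id INTEGER PRIMARY KEY, event_type TEXT, contract_id INTEGER)"
)
# Hypothetical data: 1% of rows are 'B', the remaining 99% are 'A'.
conn.executemany(
    "INSERT INTO user_event (event_type, contract_id) VALUES (?, ?)",
    (("B" if i % 100 == 0 else "A", i % 500) for i in range(100_000)),
)

count_sql = "SELECT COUNT(*) FROM user_event WHERE event_type = ?"

# Without an index the only available plan is a scan of the whole table.
plan_before = query_plan(conn, count_sql, ("B",))

conn.execute("CREATE INDEX user_event_event_type_idx ON user_event (event_type)")

# With the index, the planner can answer the count from the index alone.
plan_after = query_plan(conn, count_sql, ("B",))

b_count = conn.execute(count_sql, ("B",)).fetchone()[0]

print(plan_before)
print(plan_after)
print(b_count)
```

On SQL Server the equivalent check would be to inspect the execution plan of the generated query, which Django can print via `str(queryset.query)`.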

Partitioning

You may be able to improve the overall performance of event_type-related queries by partitioning the data. Since the cardinality of the current index is poor, the query planner will often find it better to perform a full table scan. If the data were partitioned, only the relevant partition would need to be scanned.

Char or Smallint

Which takes up more space, char(2) or smallint? The answer is that it depends on your character set. If the character set requires only one byte per character, a smallint and a char(2) will occupy the same space. Since the field will have very low cardinality, using char or smallint will not make a significant difference in this case.
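
The byte arithmetic behind that comparison can be checked quickly in Python. This is only a sketch of the encoded sizes of the values themselves; it assumes a 2-byte smallint and ignores per-row and per-column overhead, which differs between database engines. The helper name `char_bytes` is made up for the example.

```python
import struct

# A 2-byte signed integer, the usual size of a SQL smallint.
SMALLINT_BYTES = struct.calcsize("<h")


def char_bytes(value, encoding):
    """Number of bytes needed to store `value` in the given encoding."""
    return len(value.encode(encoding))


sizes = {
    "smallint": SMALLINT_BYTES,
    "1 char, single-byte charset": char_bytes("A", "latin-1"),
    "2 chars, single-byte charset": char_bytes("FR", "latin-1"),
    "1 char, UTF-16": char_bytes("A", "utf-16-le"),
    "2 chars, UTF-16": char_bytes("FR", "utf-16-le"),
}

for label, size in sizes.items():
    print(f"{label}: {size} byte(s)")
```

With a single-byte character set, a one-character code is smaller than a smallint and a two-character code is the same size; with a two-byte encoding the character columns are the same size or larger.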
