Python processing speed: line by line vs. in chunks

I am trying to do very simple calculations on a huge file, for example counting the number of labels in some columns or the mean and standard deviation of other columns. The file is too large to fit in memory, and I am currently processing it line by line:

    unique = {key: [] for key in categorical_keys}
    means = {key: 0.0 for key in numerical_keys}
    sds = {key: 0.0 for key in numerical_keys}

    with open('input/train.csv', 'r') as read_file:
        reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
        for i, row in enumerate(reader):
            for key, value in row.iteritems():
                if key in categorical_keys:
                    if row[key] not in unique[key]:
                        unique[key].extend([value])
                elif key in numerical_keys:
                    if value:
                        means[key] = (means[key]*i + float(value))/(i+1)
                        if i > 1:
                            sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i

Now this seems too slow, and I wonder whether it would be faster to process the file in chunks that fit in memory. Would it be faster, and if so, why?

Thank you for your help.

2 answers

Loop optimization

If you need to gain some speed:

  • make sure you really need the speedup (otherwise you will spend too much time on a useless task).
  • start with the loops
    • check whether some loops can be avoided
    • optimize / remove instructions inside the loop
      • every instruction counts
      • every lookup counts

Here is my optimized draft of the code (not tested):

    import csv
    from collections import defaultdict

    unique = defaultdict(set)
    means = {key: 0.0 for key in numerical_keys}
    sds = {key: 0.0 for key in numerical_keys}

    with open('input/train.csv', 'r') as read_file:
        reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
        for i, row in enumerate(reader):
            for key in categorical_keys:
                unique[key].add(row[key])
            for key in numerical_keys:
                try:
                    # shall throw ValueError if None or empty string
                    value = float(row[key])
                    mean_val = (means[key]*i + value)/(i+1)
                    means[key] = mean_val
                    # following fails for i <= 1 with ZeroDivisionError
                    sds[key] = (sds[key]*(i-1) + (value-mean_val)**2)/i
                except (ValueError, ZeroDivisionError):
                    pass

Collection of unique values

You are using a dict with a list of unique values:

 unique = {key: [] for key in categorical_keys} 

and append unique values to it as list elements (inside the loop):

    if key in categorical_keys:
        if row[key] not in unique[key]:
            unique[key].extend([value])

You can safely drop that test (checking whether the value already exists in the list) if you add the value directly to a set; the set takes care of it and collects only unique values.

Using defaultdict, you are guaranteed to get an empty set the first time you touch a key that has not been used yet.
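
A minimal illustration of the idea (the column name 'city' and the values are made up):

    from collections import defaultdict

    unique = defaultdict(set)        # any new key starts with an empty set
    unique['city'].add('Prague')
    unique['city'].add('Prague')     # duplicate, silently ignored by the set
    unique['city'].add('Brno')
    print(unique['city'])            # {'Prague', 'Brno'}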

Do not test which group a key belongs to in every iteration; know the keys in advance

Your code iterates over all of the record's keys on every row and tests which group each key belongs to before doing something with it:

    if key in categorical_keys:
        if row[key] not in unique[key]:
            unique[key].extend([value])
    elif key in numerical_keys:
        if value:
            means[key] = (means[key]*i + float(value))/(i+1)
            if i > 1:
                sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i

You can avoid these tests if your categorical_keys and numerical_keys are set to key names that really exist in the records. Then you can iterate directly over the known key names:

    for key in categorical_keys:
        unique[key].add(row[key])
    for key in numerical_keys:
        try:
            # shall throw ValueError if None or empty string
            value = float(row[key])
            means[key] = (means[key]*i + value)/(i+1)
            if i > 1:
                sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
        except ValueError:
            pass

Reuse already calculated values

Your code recalculates the value:

 float(value) 

Do it once and reuse.

Also, the mean is calculated and stored into means[key], and two lines later the same value is looked up again. It is better to keep it in a local variable and use it twice; every lookup (for example means[key]) costs something.
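
Applied to the loop body above, the difference looks roughly like this (a fragment of the loop, using the same variables as in the optimized draft):

    # before: means[key] is computed, stored, then looked up again
    means[key] = (means[key]*i + value)/(i+1)
    sds[key] = (sds[key]*(i-1) + (value - means[key])**2)/i

    # after: keep the freshly computed mean in a local variable
    mean_val = (means[key]*i + value)/(i+1)
    means[key] = mean_val
    sds[key] = (sds[key]*(i-1) + (value - mean_val)**2)/i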

Catching an exception on failure is usually faster than testing the value

Your code checks that the value is not empty:

    elif key in numerical_keys:
        if value:
            # something here

You can replace it with code that works with the value directly. If the value is invalid, the conversion fails with a ValueError, which is caught and ignored. If most of the values are present, this speeds things up.

    try:
        value = float(value)
        means[key] = (means[key]*i + value)/(i+1)
        if i > 1:
            sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
    except ValueError:
        pass

Can you avoid the test if i > 1:?

This condition is true in most iterations, yet it is checked every time. If you find a way (I did not) to avoid this test, the loop will get a bit faster.

As you suggested, we can resolve it by catching ZeroDivisionError for i <= 1:

    try:
        # shall throw ValueError if None or empty string
        value = float(value)
        means[key] = (means[key]*i + value)/(i+1)
        # for i <= 1 shall raise ZeroDivisionError
        sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
    except (ValueError, ZeroDivisionError):
        pass

Processing the data in chunks

Regarding processing in chunks:

  • it definitely adds some complexity
  • it can speed up the program if done carefully
  • it may also slow things down or produce incorrect results.

Where chunking can improve speed

Reading the file in large chunks

This sounds obvious, but the libraries already take care of it. Expect little or no improvement.
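
For completeness, a hedged sketch of asking for a larger read buffer explicitly; the 1 MiB size is arbitrary, and since csv and the operating system already buffer reads, any gain is likely to be small:

    import csv

    # request a ~1 MiB read buffer instead of the default
    with open('input/train.csv', 'r', buffering=1024 * 1024) as read_file:
        reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
        for row in reader:
            pass  # process the row as before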

Getting CSV Records in Chunks

I am not aware of any option in csv.reader or csv.DictReader that lets you fetch a chunk of records at once. You have to do it yourself; it is possible, and I would recommend itertools.groupby.

Do not expect any speedup from this by itself (it will slow things down a bit), but it is a prerequisite for the other chunk-based optimizations below.
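
A minimal sketch of one way to do it with itertools.groupby; the helper name chunks and the chunk size of 10000 are arbitrary choices:

    import csv
    from itertools import count, groupby

    def chunks(records, size):
        """Yield consecutive lists of up to `size` records."""
        counter = count()
        # consecutive records get the same key value for `size` items in a row,
        # so groupby collects them into one group
        for _, group in groupby(records, key=lambda _record: next(counter) // size):
            yield list(group)

    with open('input/train.csv', 'r') as read_file:
        reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
        for chunk in chunks(reader, 10000):
            pass  # chunk is a list of up to 10000 rows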

Adding a chunk of values to a set

The code currently adds values to the set one at a time. If you do it per chunk (the bigger the better), it will be faster, since every Python call has some small overhead.
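
Building on the hypothetical chunks() helper sketched above, the per-chunk version of the set insertion could look like this (categorical_keys and reader are the same names as in the code above):

    from collections import defaultdict

    unique = defaultdict(set)
    for chunk in chunks(reader, 10000):      # chunks() as sketched above
        for key in categorical_keys:
            # one update() call per chunk instead of one add() call per row
            unique[key].update(row[key] for row in chunk)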

Calculating the means and standard deviations

You can use the statistics package, which probably has optimized code (although it seems to be implemented in pure Python anyway).

In any case, since you are going to process the data in chunks, plain statistics.mean will not work for you directly; you would have to combine the per-chunk results somehow (if that is possible at all).

If you calculate the values yourself, careful coding can bring some speedup, mainly because you get the values for a whole chunk at once and do not dereference them one by one in every iteration.
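
One possible sketch, assuming the chunks() helper above: accumulate the count, sum, and sum of squares per key, and derive the mean and (population) standard deviation at the end. This is simple, but numerically less stable than a running formula for very large or very similar values:

    import math
    from collections import defaultdict

    counts = defaultdict(int)
    sums = defaultdict(float)
    sum_sqs = defaultdict(float)

    for chunk in chunks(reader, 10000):      # chunks() as sketched above
        for key in numerical_keys:
            values = [float(row[key]) for row in chunk if row[key]]
            counts[key] += len(values)
            sums[key] += sum(values)
            sum_sqs[key] += sum(v * v for v in values)

    means = {key: sums[key] / counts[key]
             for key in numerical_keys if counts[key]}
    sds = {key: math.sqrt(sum_sqs[key] / counts[key] - means[key] ** 2)
           for key in means}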

Conclusion

To me, chunk-based optimization seems too complicated, and it is hard to predict whether it would bring any benefit.


Yang's answer is great, but if you want to increase the speed even more, you can investigate further as follows:

Profile the execution line by line: if you do not know what is slow, you cannot improve it ( https://github.com/rkern/line_profiler ).
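
A hedged sketch of how line_profiler is typically used; the file name and function are made up. kernprof injects the @profile decorator at run time, so the script is run as kernprof -l -v profile_me.py rather than with python directly:

    import csv

    @profile            # provided by kernprof, not defined in plain Python
    def summarize(path):
        with open(path, 'r') as read_file:
            reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
            for row in reader:
                pass    # per-line timings show where the loop spends its time

    if __name__ == '__main__':
        summarize('input/train.csv')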

Since most of your code performs numerical operations, a lot of time may be spent on type-checking overhead. There are two options: either annotate the code with types and use Cython, or leave the code as it is and run it under PyPy; both should give you a good boost.
