Loop optimization
If you need to gain some speed:
- make sure you really need the speed-up (otherwise you spend too much time on a useless task)
- start with the loops
- check whether some loops can be avoided entirely
- optimize or remove instructions inside the loop
- every instruction counts
- every lookup counts
Here is my optimized draft of your code (not tested):
import csv
from collections import defaultdict

unique = defaultdict(set)
means = {key: 0.0 for key in numerical_keys}
sds = {key: 0.0 for key in numerical_keys}

with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for i, row in enumerate(reader):
        for key in categorical_keys:
            unique[key].add(row[key])
        for key in numerical_keys:
            try:
                # shall throw ValueError if None or empty string
                value = float(row[key])
                means[key] = (means[key]*i + value)/(i+1)
                # for i <= 1 shall raise ZeroDivisionError
                sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
            except (ValueError, ZeroDivisionError):
                pass
Collection of unique values
You are using a dict with a list of unique values:
unique = {key: [] for key in categorical_keys}
and, inside the loop, append a value only if it is not already an element of that list:
if key in categorical_keys:
    if row[key] not in unique[key]:
        unique[key].extend([value])
You can safely drop that test (checking whether the value already exists in the list) if you add the value directly to a set: the set takes care of it, and only unique values are collected.
Using defaultdict, you make sure that an empty set already exists for any key that has not been used yet.
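A small demonstration of both points (the key name 'color' is just an illustrative placeholder, not from your data):

from collections import defaultdict

unique = defaultdict(set)
unique['color'].add('red')    # 'color' was never used: defaultdict supplies an empty set
unique['color'].add('red')    # duplicate, silently ignored by the set
unique['color'].add('blue')
print(unique['color'])        # two elements: 'red' and 'blue'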
Do not test the key type in every iteration, know it in advance
Your code repeatedly iterates over the record keys, tests what type each key is, and only then does something:
if key in categorical_keys:
    if row[key] not in unique[key]:
        unique[key].extend([value])
elif key in numerical_keys:
    if value:
        means[key] = (means[key]*i + float(value))/(i+1)
        if i > 1:
            sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i
You can avoid these tests if your categorical_keys and numerical_keys are set up in advance to contain exactly the keys that really exist. Then you can iterate directly over the known key names:
for key in categorical_keys:
    unique[key].add(row[key])
for key in numerical_keys:
    try:
        # shall throw ValueError if None or empty string
        value = float(row[key])
        means[key] = (means[key]*i + value)/(i+1)
        if i > 1:
            sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
    except ValueError:
        pass
Reuse already calculated values
Your code recalculates the value:
float(value)
Do it once and reuse.
Also, the mean value is calculated and stored into means[key], and two lines later it is looked up again. It is better to keep the value in a local variable and use it twice; every lookup (for example means[key]) costs something.
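A hedged sketch of what that could look like inside the numerical loop (the local name mean is my own, not from your code):

value = float(row[key])                       # convert once
mean = (means[key]*i + value)/(i+1)           # compute once, keep it local
means[key] = mean
if i > 1:
    sds[key] = (sds[key]*(i-1) + (value - mean)**2)/i   # reuse the local value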
An exception on failure is usually faster than a value test
Your code checks that the value is not empty:
elif key in numerical_keys:
    if value:
You can replace it with code that works with the value directly. If the value is invalid, the conversion fails, and the ValueError exception is caught and ignored. If most of your values are present, this speeds things up.
try:
    value = float(value)
    means[key] = (means[key]*i + value)/(i+1)
    if i > 1:
        sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
except ValueError:
    pass
Can you avoid the test if i > 1:?
This condition is true in most cases, yet you check it on every iteration. If you find a way (I did not) to avoid this test, the loop will run faster.
As you suggested, we can resolve it by catching the ZeroDivisionError raised for small i:
try:
    # shall throw ValueError if None or empty string
    value = float(value)
    means[key] = (means[key]*i + value)/(i+1)
    # for i <= 1 shall raise ZeroDivisionError
    sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
except (ValueError, ZeroDivisionError):
    pass
Processing the data in chunks
Regarding processing in chunks:
- it definitely adds some complexity
- it can speed up the program if done carefully
- it may also slow things down or produce incorrect results.
Where chunking can improve speed
Reading the file in large chunks
This sounds obvious, but the libraries already take care of it. Expect only a slight improvement.
Getting CSV records in chunks
I am not aware of csv.reader or csv.DictReader offering a way to fetch a chunk of records directly; you have to do it yourself. It is possible, and I would recommend itertools.groupby for it.
Do not expect a speed-up from this step by itself (it will actually slow things down a bit), but it is a prerequisite for the other chunk-based optimizations below.
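A minimal sketch of one way to do it with itertools.groupby (the helper name chunked and the chunk size of 10000 are my own choices, not from your code):

import csv
from itertools import groupby

def chunked(iterable, size):
    # group consecutive items by their running index divided by the chunk size
    for _, group in groupby(enumerate(iterable), key=lambda pair: pair[0] // size):
        yield [item for _, item in group]

with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for chunk in chunked(reader, 10000):
        ...  # process one list of rows at a time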
Adding a slice of values to a set
The code currently adds values to the set one at a time. If you do it once per chunk (the bigger the better), it will be faster, since every Python call has some small overhead.
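A hedged sketch, assuming chunk is a list of DictReader rows as produced by the chunking helper above:

for key in categorical_keys:
    # one update() call per chunk instead of one add() call per row
    unique[key].update(row[key] for row in chunk)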
Calculating the means and sds values
You could use the statistics package, hoping for optimized code (but it seems to be written in pure Python anyway).
In any case, since you are going to process the data in chunks, a plain statistics.mean will not work for you, or you would have to combine the per-chunk results somehow afterwards (if that is possible at all).
If you calculate the values yourself, careful coding can bring some speed-up, mainly based on the fact that you get the values of a whole chunk at once and do not have to dereference them value by value in every iteration.
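One possible sketch of such a per-chunk update based on running sums; it assumes chunked, reader, numerical_keys, means and sds from the earlier snippets, and the names sums, sq_sums and counts are mine, not from your code. As in your running formula, sds ends up holding a variance-style value, not its square root:

sums = {key: 0.0 for key in numerical_keys}
sq_sums = {key: 0.0 for key in numerical_keys}
counts = {key: 0 for key in numerical_keys}

for chunk in chunked(reader, 10000):
    for key in numerical_keys:
        # convert the whole chunk at once, skipping values that cannot be parsed
        values = []
        for row in chunk:
            try:
                values.append(float(row[key]))
            except ValueError:
                pass
        sums[key] += sum(values)
        sq_sums[key] += sum(v * v for v in values)
        counts[key] += len(values)

# after the last chunk, recover the aggregates once
for key in numerical_keys:
    n = counts[key]
    if n > 1:
        means[key] = sums[key] / n
        sds[key] = (sq_sums[key] - n * means[key]**2) / (n - 1)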
Conclusions on chunking
To me, the chunking optimization seems too complicated, and it is hard to predict whether it brings any value.