I am looking for a way to speed up my code. I have already managed to speed up most of it, reducing the runtime to 10 hours, but that is still not fast enough, and since I'm running out of time I'm looking for a quick way to optimize it further.
Example:
    # read the csv in chunks of 5000 rows and take the third column of each chunk
    text = pd.read_csv(os.path.join(dir, "text.csv"), chunksize=5000)
    new_text = [np.array(chunk)[:, 2] for chunk in text]
    # flatten the list of per-chunk arrays into one flat list
    new_text = list(itertools.chain.from_iterable(new_text))
In the code above I read in about 6 million rows of text documents in chunks and flatten them. This takes about 3-4 hours and is the main bottleneck of my program.

Edit: I realized I hadn't made clear what the main problem is: the flattening is the part that takes the most time.
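To make the flattening step concrete, here is a sketch of what I believe an equivalent, possibly lighter variant would look like: it pulls only the text column out of each chunk instead of converting the whole chunk to a NumPy object array first (this assumes the text body really is the third column of text.csv):

    import os
    import pandas as pd

    new_text = []
    for chunk in pd.read_csv(os.path.join(dir, "text.csv"), chunksize=5000):
        # take only the text-body column (assumed to be index 2) and
        # extend one flat list, chunk by chunk
        new_text.extend(chunk.iloc[:, 2].tolist())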
Also this part of my program takes a lot of time:
    # zip the training documents with their labels (Python 2: izip is itertools.izip)
    train_dict = dict(izip(text, labels))
    # keep the training label for duplicates, otherwise keep the prediction
    result = [train_dict[test[sample]] if test[sample] in train_dict
              else predictions[sample] for sample in xrange(len(predictions))]
In the code above, the text documents are first zipped with their corresponding labels (this is a machine learning task, and train_dict is the training set). Earlier in the program I generated predictions for the test set. There are duplicates between my training set and my test set, and I need to find them. So I iterate over the test set line by line (2 million lines in total); whenever I find a duplicate, I don't want to use the predicted label but the label of the duplicate from train_dict. The result of this iteration is assigned to the variable result in the code above.
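For what it's worth, the if/else in that comprehension looks each test document up in train_dict twice. A sketch of the same lookup done once per sample with dict.get (same names as in the snippet above, so test is the list of test documents and predictions the list of predicted labels) would be:

    # fall back to the predicted label when the document is not a duplicate
    result = [train_dict.get(test[sample], predictions[sample])
              for sample in xrange(len(predictions))]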
I have heard that there are several Python libraries that can speed up parts of your code, but I don't know which of them could do this job, and honestly I don't have time to investigate it myself, so I need someone to point me in the right direction. Is there any way I could speed up the code snippets above?
Edit 2
I investigated further, and it is definitely a memory problem. I tried to read the file line by line, and after a while the speed dropped sharply; on top of that my RAM usage was almost 100% and Python's disk usage shot up. How can I reduce the memory footprint? Or should I find a way to avoid holding everything in memory at once?
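One idea I am considering is telling pandas to parse only the column I actually need, so the other columns never reach memory at all. A sketch (the usecols value assumes the text body is the third column, index 2):

    import os
    import pandas as pd

    # parse only the text-body column; the other columns are never loaded
    new_text = []
    for chunk in pd.read_csv(os.path.join(dir, "text.csv"),
                             usecols=[2], chunksize=5000):
        new_text.extend(chunk.iloc[:, 0].tolist())  # only one column is left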
Edit 3
Since memory is the root of my problems, I'll describe that part of my program. For the moment I have dropped the predictions, which significantly reduced the complexity of my program; instead I insert a default sample for every non-duplicate in my test set.
    import numpy as np
    import pandas as pd
    import itertools
    import os

    train = pd.read_csv(os.path.join(dir, "Train.csv"), chunksize=5000)
    train_2 = pd.read_csv(os.path.join(dir, "Train.csv"), chunksize=5000)
    test = pd.read_csv(os.path.join(dir, "Test.csv"), chunksize=80000)

    # this file is only 70mb
    sample = list(np.array(pd.read_csv(os.path.join(dir, "Samples.csv")))[:, 2])
    sample = sample[1]

    # flatten the text-body column of the test and train sets
    test_set = [np.array(chunk)[:, 2] for chunk in test]
    test_set = list(itertools.chain.from_iterable(test_set))
    train_set = [np.array(chunk)[:, 2] for chunk in train]
    train_set = list(itertools.chain.from_iterable(train_set))

    # flatten the label column
    labels = [np.array(chunk)[:, 3] for chunk in train_2]
    labels = list(itertools.chain.from_iterable(labels))

    # zipping train and labels (Python 2: izip lives in itertools)
    train_dict = dict(itertools.izip(train_set, labels))

    # finding duplicates
    results = [train_dict[test_set[item]] if test_set[item] in train_dict
               else sample for item in xrange(len(test_set))]
Although this is not my entire program, it is the part of my code that needs optimizing. As you can see, I only use three important modules in this part: pandas, numpy and itertools. The memory problems arise when flattening train_set and test_set. All I do is read in the files, take the parts I need, zip the training documents with their corresponding labels into a dictionary, and then look up the duplicates.
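In case it helps, here is a sketch of what I think a single-pass version of this part could look like: Train.csv is read once with only the two columns I need, and train_dict is built chunk by chunk, so train_set and labels never exist as full lists (column indices 2 and 3 are the text body and label, as described in the next edit; sample is the default sample from the code above):

    import os
    import pandas as pd

    # build the training dictionary in one pass over Train.csv,
    # keeping only the text-body and label columns in memory
    train_dict = {}
    for chunk in pd.read_csv(os.path.join(dir, "Train.csv"),
                             usecols=[2, 3], chunksize=5000):
        bodies = chunk.iloc[:, 0].tolist()        # text bodies (file column 2)
        chunk_labels = chunk.iloc[:, 1].tolist()  # category labels (file column 3)
        train_dict.update(zip(bodies, chunk_labels))

    # look up each test document; fall back to the default sample
    results = []
    for chunk in pd.read_csv(os.path.join(dir, "Test.csv"),
                             usecols=[2], chunksize=80000):
        for doc in chunk.iloc[:, 0]:
            results.append(train_dict.get(doc, sample))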
Edit 4
As requested, here is an explanation of my data sets. My Train.csv contains 4 columns: the first column contains an identifier for each sample, the second column contains the titles, the third column contains the text bodies (100 to 700 words each), and the fourth column contains the category labels. Test.csv contains only the identifiers, titles and text bodies. The columns are separated by commas.