Looking for a quick way to speed up my code

I am looking for a way to speed up my code. I managed to speed up most of it, reducing the runtime to 10 hours, but it's still not fast enough, and since I'm running out of time I'm looking for a quick way to optimize it further.

Example:

    text = pd.read_csv(os.path.join(dir,"text.csv"), chunksize = 5000)
    new_text = [np.array(chunk)[:,2] for chunk in text]
    new_text = list(itertools.chain.from_iterable(new_text))

In the above code, I read about 6 million rows of text documents in chunks and flatten them. This code takes about 3-4 hours and is the main bottleneck of my program. Edit: I realize I didn't make the main problem clear enough: flattening is the part that takes the most time.
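For reference, a minimal sketch of one way that read might be trimmed (assuming the text really is the third column; the file path and the usecols value here are placeholders): parsing only the column that is needed avoids building the intermediate list of per-chunk arrays.

    import os
    import pandas as pd

    data_dir = "path/to/data"  # placeholder for the directory used above

    # Parse only the third column (index 2) instead of the whole file;
    # the single column comes back as one frame that is flattened once.
    new_text = pd.read_csv(os.path.join(data_dir, "text.csv"), usecols=[2]).iloc[:, 0].tolist()

    # If even one column is too large to hold comfortably, keep the
    # chunksize argument and extend a plain list per chunk instead.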

Also this part of my program takes a lot of time:

    train_dict = dict(izip(text,labels))
    result = [train_dict[test[sample]] if test[sample] in train_dict else predictions[sample] for sample in xrange(len(predictions))]

In the above code, the text documents are first zipped with their corresponding labels (this is a machine learning task, and train_dict is the training set). Earlier in the program I generated predictions for a test set. There are duplicates between my training and test sets, so I need to find those duplicates. I iterate over the test set row by row (2 million rows in total); when I find a duplicate, I don't want to use the predicted label but the label from the duplicate in train_dict. I assign the result of this iteration to the variable result in the above code.
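To make the intent of that comprehension explicit, the same lookup can be written as a plain loop (a sketch assuming test is a list of documents aligned index-for-index with predictions, as the code above implies):

    result = []
    for i, doc in enumerate(test):
        if doc in train_dict:
            result.append(train_dict[doc])   # duplicate: reuse the training label
        else:
            result.append(predictions[i])    # not a duplicate: keep the earlier prediction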

I've heard there are several Python libraries that can speed up parts of your code, but I don't know which of them could do this job, and I honestly don't have time to investigate, so I need someone to point me in the right direction. Is there any way I could speed up the code snippets above?

Edit 2

I investigated further, and this is definitely a memory problem. I tried reading the file line by line, and after a while the speed dropped sharply; on top of that, my RAM usage was close to 100% and Python's disk usage spiked. How can I reduce the memory usage? Or should I find a way to avoid keeping everything in memory?

Edit 3: Since memory is at the root of my problems, I'll outline that part of my program. For now I have dropped the predictions, which significantly reduced the complexity of my program; instead, I insert a standard sample for every non-duplicate in my test set.

    import numpy as np
    import pandas as pd
    import itertools
    import os

    train = pd.read_csv(os.path.join(dir,"Train.csv"), chunksize = 5000)
    train_2 = pd.read_csv(os.path.join(dir,"Train.csv"), chunksize = 5000)
    test = pd.read_csv(os.path.join(dir,"Test.csv"), chunksize = 80000)
    sample = list(np.array(pd.read_csv(os.path.join(dir,"Samples.csv"))[:,2])) #this file is only 70mb
    sample = sample[1]

    test_set = [np.array(chunk)[:,2] for chunk in test]
    test_set = list(itertools.chain.from_iterable(test_set))

    train_set = [np.array(chunk)[:,2] for chunk in train]
    train_set = list(itertools.chain.from_iterable(train_set))

    labels = [np.array(chunk)[:,3] for chunk in train_2]
    labels = list(itertools.chain.from_iterable(labels))

    """zipping train and labels"""
    train_dict = dict(izip(train,labels))

    """finding duplicates"""
    results = [train_dict[test[item]] if test[item] in train_dict else sample for item in xrange(len(test))]

Although this is not my entire program, it is the part of my code that needs to be optimized. As you can see, I only use three important modules in this part: pandas, numpy and itertools. The memory problems arise when flattening train_set and test_set. All I do is read in the files, pull out the parts I need, zip the training documents with their corresponding labels into a dictionary, and then search for duplicates.
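For what it's worth, a sketch of one way to keep peak memory lower (under the assumption that only the body-to-label dictionary is needed afterwards, with the body in column 2 and the label in column 3 of Train.csv, and with a placeholder data directory) is to fill the dictionary chunk by chunk instead of building the flattened train_set and labels lists first:

    import os
    import numpy as np
    import pandas as pd

    data_dir = "path/to/data"  # placeholder for the directory used above

    train_dict = {}
    for chunk in pd.read_csv(os.path.join(data_dir, "Train.csv"), chunksize=5000):
        arr = np.array(chunk)
        # body text in column 2, category label in column 3 (see edit 4)
        for body, label in zip(arr[:, 2], arr[:, 3]):
            train_dict[body] = label
    # Each chunk can be discarded once processed, so the full flattened
    # lists never have to exist in memory at the same time.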

Edit 4: As requested, an explanation of my data sets. My Train.csv contains 4 columns: the first column holds an identifier for each sample, the second column holds the titles, the third column holds the text bodies (100 to 700 words each), and the fourth column holds the category labels. Test.csv contains only the identifiers, titles and text bodies. Columns are separated by commas.

+6
3 answers

Could you post a dummy sample of your data, half a dozen rows or so?

I can't quite follow what your code is doing, and I'm not a Pandas expert, but I think we can speed this code up significantly. It reads all the data into memory and then keeps re-copying the data to different places.

By writing "lazy" code we can avoid all of that re-copying. Ideally we would read one line, transform it the way we want, and store it in its final place. The code also uses indexing where it should simply iterate over the values; we can pick up some speed there as well.

Is the code you posted your actual code, or something you put together for this post? It seems to contain some errors, so I'm not sure what it actually does. In particular, train and labels would contain identical data.

I'll check back to see whether you have posted sample data. If so, I can probably write a "lazy" version for you that does less re-copying of data and will be faster.

EDIT: based on your new info, here is my dummy data:

    id,title,body,category_labels
    0,greeting,hello,noun
    1,affirm,yes,verb
    2,deny,no,verb

Here is code that reads the data above:

    def get_train_data(training_file):
        with open(training_file, "rt") as f:
            next(f)  # throw away "headers" in first line
            for line in f:
                lst = line.rstrip('\n').split(',')
                # lst contains: id,title,body,category_labels
                yield (lst[2], lst[3])  # map the text body to its category label

    train_dict = dict(get_train_data("data.csv"))

And here is a faster way to build the results:

 results = [train_dict.get(x, sample) for x in test] 

Instead of re-indexing test to find the next element, we simply iterate over the values in test. The dict.get() method handles the "if x in train_dict" check that we need.
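If Test.csv is laid out with the body as its third column (an assumption based on edit 4), the test side can stay lazy too by reusing the same generator idea:

    def get_test_bodies(test_file):
        with open(test_file, "rt") as f:
            next(f)  # skip the header line
            for line in f:
                lst = line.rstrip('\n').split(',')
                yield lst[2]  # text body (naive split: assumes no commas inside the body)

    results = [train_dict.get(body, sample) for body in get_test_bodies("Test.csv")]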

+1

You can try Cython. It supports numpy and can give you a nice speedup. Here's an introduction and an explanation of what you need to do: http://www.youtube.com/watch?v=Iw9-GckD-gQ

+1

If the order of your rows is not important, you could use sets: take the intersection of the train set and the test set to find the duplicates and add their training labels to your result first, then use the set difference (test set minus train set) to add the items that are in your test set but not in the train set. This saves checking every sample individually against the train set.
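A rough sketch of that idea (assuming train_dict maps body text to label, test_set is the list of test bodies, and sample is the default label, as in the question; note that this produces a mapping rather than an ordered list):

    test_bodies = set(test_set)
    train_bodies = set(train_dict)

    # Duplicates: bodies present in both train and test get their known label.
    results = {body: train_dict[body] for body in test_bodies & train_bodies}

    # Everything only in the test set falls back to the default sample label.
    for body in test_bodies - train_bodies:
        results[body] = sample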

0
