Python - using threads or queues to iterate over a for loop that calls a function

I am new to Python and am writing a script that transfers point cloud data from other programs into Autodesk Maya. I have the script working, but I'm trying to make it faster. I have a for loop that iterates through a list of numbered files, i.e. datafile001.txt, datafile002.txt and so on. I am wondering if there is a way to have it process more than one file at a time, perhaps using threads or a queue? Below is the code I have been working on:

def threadedFuntion(args):
    if len(sourceFiles) > 3:
        for count, item in enumerate(sourceFiles):
            t1 = Thread(target=convertPcToPdc, args=(sourceFiles[filenumber1], particlesName, startframe, endframe, pdcIncrements, outputDirectory, variableFolder, acceptableArrayforms, dataType))
            t1.start()
            t2 = Thread(target=convertPcToPdc, args=(sourceFiles[filenumber2], particlesName, startframe, endframe, pdcIncrements, outputDirectory, variableFolder, acceptableArrayforms, dataType))
            t2.start()
            t3 = Thread(target=convertPcToPdc, args=(sourceFiles[filenumber3], particlesName, startframe, endframe, pdcIncrements, outputDirectory, variableFolder, acceptableArrayforms, dataType))
            t3.start()
            t4 = Thread(target=convertPcToPdc, args=(sourceFiles[filenumber4], particlesName, startframe, endframe, pdcIncrements, outputDirectory, variableFolder, acceptableArrayforms, dataType))
            t4.start()

This obviously doesn't work, for a number of reasons. First, it will only ever create 4 threads; I would like the option to use more or fewer. Second, it errors out, I believe because it is trying to reuse threads? As I said, I'm pretty new to Python and this is a little over my head. I've read a few posts on here but can't get any of them to work correctly. I think a queue may be what I need but I couldn't figure it out; I experimented with the condition statement and with join(), but again couldn't get what I want.

More specifically, the function reads a text file, retrieves the coordinates, and then exports them as a binary file for Maya to read. Each of these text files typically has 5-10 million x, y, z coordinates, which takes quite a long time to process. One file takes about 30 minutes to 1 hour on a fairly decent computer, yet Task Manager says Python is only using 12% of the processor and about 1% of the RAM, so if I could run several of these at once it would make getting through these 100+ files go much faster. I wouldn't have thought multithreading/queueing a for loop would be that hard, but I've been lost, trying solutions that fail, for about a week.
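For reference, here is a minimal sketch of the queue-based pattern being asked about: a fixed pool of worker threads pulls file names off a shared queue.Queue until it is empty. The convertPcToPdc call and its arguments are the ones from the snippet above (left as a comment so the sketch runs standalone), and worker_count is a placeholder for however many simultaneous conversions you want:

import threading
import queue

def convert_worker(work_queue):
    # each worker pulls file paths off the shared queue until it is empty
    while True:
        try:
            path = work_queue.get_nowait()
        except queue.Empty:
            return  # no work left, let this thread exit
        try:
            # call the conversion function from the snippet above, e.g.
            # convertPcToPdc(path, particlesName, startframe, endframe, ...)
            print("converting %s" % path)
        finally:
            work_queue.task_done()

def convert_all(source_files, worker_count=4):
    # worker_count controls how many files are processed at once
    work_queue = queue.Queue()
    for path in source_files:
        work_queue.put(path)
    threads = [threading.Thread(target=convert_worker, args=(work_queue,))
               for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    convert_all(["datafile001.txt", "datafile002.txt", "datafile003.txt"],
                worker_count=2)

One caveat: because of CPython's global interpreter lock, threads only overlap while blocked on I/O, so for CPU-bound parsing like this the multiprocessing module is usually what actually spreads the work across cores.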

Thanks to everyone for any help, I really appreciate it and I think this site is awesome. This is my first post, but I feel like I've learned Python almost entirely just from reading on here.

2 answers

Subclass threading.Thread and put your work function in that class as part of run().

import threading
import time
import random

class Worker(threading.Thread):
    def __init__(self, srcfile, printlock, **kwargs):
        super(Worker, self).__init__(**kwargs)
        self.srcfile = srcfile
        self.lock = printlock  # so threads don't step on each other's prints

    def run(self):
        with self.lock:
            print("starting %s on %s" % (self.ident, self.srcfile))
        # do whatever you need to, return when done
        # example: sleep for a random interval up to 10 seconds
        time.sleep(random.random() * 10)
        with self.lock:
            print("%s done" % self.ident)

def threadme(srcfiles):
    printlock = threading.Lock()
    threadpool = []
    for file in srcfiles:
        threadpool.append(Worker(file, printlock))
    for thr in threadpool:
        thr.start()
    # this loop will block until all threads are done
    # (however it won't necessarily join them in the order they finish)
    for thr in threadpool:
        thr.join()
    print("all threads are done")

if __name__ == "__main__":
    threadme(["abc", "def", "ghi"])

As requested, to limit the number of threads, use the following:

def threadme(infiles, threadlimit=None, timeout=0.01):
    assert threadlimit is None or threadlimit > 0, \
        "need at least one thread"
    printlock = threading.Lock()
    srcfiles = list(infiles)
    threadpool = []
    # keep going while there is work to do or work being done
    while srcfiles or threadpool:
        # while there is room, remove source files and add to the pool
        while srcfiles and \
                (threadlimit is None or len(threadpool) < threadlimit):
            file = srcfiles.pop()
            wrkr = Worker(file, printlock)
            wrkr.start()
            threadpool.append(wrkr)
        # remove completed threads from the pool
        # (iterate over a copy so remove() doesn't skip entries)
        for thr in threadpool[:]:
            thr.join(timeout=timeout)
            if not thr.is_alive():
                threadpool.remove(thr)
    print("all threads are done")

if __name__ == "__main__":
    for lim in (1, 2, 3, 4):
        print("--- Running with thread limit %i ---" % lim)
        threadme(("abc", "def", "ghi"), threadlimit=lim)

Note that this will actually process the sources in reverse order (because of the list pop()). If you want them run in their original order, reverse the list somewhere, or use a deque and popleft().
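A tiny self-contained illustration of the deque variant:

from collections import deque

# deque.popleft() removes from the front in O(1),
# so files come out in their original order
srcfiles = deque(["abc", "def", "ghi"])
while srcfiles:
    print(srcfiles.popleft())  # prints abc, def, ghi -- in order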


I would recommend using mrjob for this.

mrjob is a Python implementation of MapReduce.

The following is the mrjob code for a multithreaded word count over multiple text files:

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def get_words(self, key, line):
        for word in line.split():
            yield word, 1

    def sum_words(self, word, occurrences):
        yield word, sum(occurrences)

    def steps(self):
        return [self.mr(self.get_words, self.sum_words)]

if __name__ == '__main__':
    MRWordCounter.run()

This code maps all of the input files in parallel (counting the words in each file), then reduces the per-file counts into a single set of word totals.
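As a usage sketch (assuming the class above is saved as wordcount.py, a hypothetical module name; the in-process runner API varies between mrjob versions):

# mrjob jobs are usually run straight from the shell:
#   python wordcount.py datafile001.txt datafile002.txt
# They can also be driven in-process, roughly like this:
from wordcount import MRWordCounter

job = MRWordCounter(args=["datafile001.txt", "datafile002.txt"])
with job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():  # raw "word\tcount" lines
        word, count = job.parse_output_line(line)
        print(word, count)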

