Reading multiple files using threading / multiprocessing

I am currently extracting .txt files from a list of folder paths in FileNameList, and that works. But my main problem is that it is too slow when there are too many files.

I use this code to print the list of txt files:

import os
import sys

# FileNameList is my set of folder paths
for filefolder in FileNameList:
    for file in os.listdir(filefolder):
        if "txt" in file:
            filename = filefolder + "\\" + file
            print filename

Any help or suggestion on using threads / multiprocessing to make this faster will be appreciated. Thanks in advance.

4 answers

Multithreading or multiprocessing will not speed this up; your bottleneck is the storage device.


So you mean there is no way to speed it up? My script reads a bunch of files, then reads each line and stores it in a database.

The first rule of optimization is to ask yourself whether you should bother at all. If your program is run only once or a few times, optimizing it is a waste of time.

The second rule is that before you do anything else, you measure where the problem is.

Write a simple program that sequentially reads the files, splits them into lines, and stuffs the lines into a database. Run this program under a profiler to see where it spends most of its time.

Only then do you know which part of the program needs to be accelerated.
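For example, a minimal profiling sketch, assuming a flat list of .txt paths (txt_files) and a store_line placeholder for your own database code (neither is from the question), could look like this:

import cProfile
import pstats

def read_and_store(filenames):
    # sequential baseline: read every file, split it into lines,
    # and hand each line to the (placeholder) database code
    for filename in filenames:
        with open(filename) as f:
            for line in f:
                store_line(line)   # placeholder for your DB insert

cProfile.run('read_and_store(txt_files)', 'read.prof')
pstats.Stats('read.prof').sort_stats('cumulative').print_stats(10)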


Here are some pointers.

  • You could try reading the files using mmap .
  • You can use multiprocessing.Pool to distribute the reading of multiple files over different cores (see the sketch after this list). But then the data from those files ends up in different processes and must be sent back to the parent process using IPC. This has significant overhead for large amounts of data.
  • In the CPython implementation of Python, only one thread at a time can execute Python bytecode. The actual reading from files is not blocked by this, but processing the results is. So it is doubtful whether threads will offer much improvement.
  • Stuffing the rows into the database will probably always be the main bottleneck, because that is where everything comes together. How much of a bottleneck it is depends on the database: whether it is in memory or on disk, whether it allows several programs to update it at the same time, and so on.
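As a rough illustration of the multiprocessing.Pool point, here is a minimal sketch; it assumes FileNameList is the folder list from the question and makes each worker return only a small result (a line count) so the IPC cost stays low. mmap could be swapped into the worker if you want to try it:

import os
from multiprocessing import Pool

def count_lines(path):
    # each worker process reads one file and returns only a small result,
    # so little data has to travel back to the parent over IPC
    with open(path, 'rb') as f:
        return path, f.read().count(b'\n')

if __name__ == '__main__':
    # build a flat list of .txt paths from the FileNameList folders
    txt_files = [os.path.join(folder, name)
                 for folder in FileNameList
                 for name in os.listdir(folder)
                 if name.endswith('.txt')]
    pool = Pool()                      # one worker per CPU core by default
    for path, lines in pool.imap_unordered(count_lines, txt_files):
        print(path, lines)
    pool.close()
    pool.join()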

You can get a speed-up, depending on the number and size of your files. See this answer to a similar question: efficiently reading files in python with a need to split on "\n"

Essentially, you can read multiple files in parallel with multithreading, multiprocessing, or in some other way (e.g. with an iterator)... and you can get a speed-up. The simplest thing is to use a library like pathos (yes, I'm the author), which provides multiprocessing, multithreading and other options under one common API, so you can code it once and then switch between different backends until you find what works fastest for your case.

There are many options for different types of maps (on the pool object), as you can see here: Python Multiprocessing - monitoring the process of pool.map .

While the following is not the most imaginative example, it shows a doubly nested map (equivalent to a doubly nested for loop) and how easy it is to swap the inner map and other options on it.

>>> import pathos
>>> p = pathos.pools.ProcessPool()
>>> t = pathos.pools.ThreadPool()
>>> s = pathos.pools.SerialPool()
>>>
>>> f = lambda x,y: x+y
>>> # two blocking maps, threads and processes
>>> t.map(p.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> # two blocking maps, threads and serial (i.e. python's map)
>>> t.map(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> # an unordered iterative map and a blocking map, threads and serial
>>> t.uimap(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
<multiprocess.pool.IMapUnorderedIterator object at 0x103dcaf50>
>>> list(_)
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>>

I have found that, in general, unordered iterative maps ( uimap ) are the fastest, but then you must not care about the order in which results are processed, since they may come back out of order. As for speed... surround the above with a call to time.time or similar.
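For example, a small timing sketch that reuses the pools and f defined above (just an illustration, not part of the original answer):

>>> import time
>>> start = time.time()
>>> res = list(t.uimap(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)]))
>>> print(time.time() - start)   # wall-clock seconds for the nested map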

Get pathos here: https://github.com/uqfoundation


In this case, you can try using multithreading. But keep in mind that every non-atomic operation will run in a single thread, due to the Python GIL (Global Interpreter Lock). If you run it across multiple machines, it is possible to go faster. You can use something like a producer/worker setup:

  • The producer (one thread) holds the list of files and feeds a queue
  • The workers (more than one thread) take file paths from the queue and push the contents to the database

Look at the queues and pipes in multiprocessing (real separate subprocesses) to get around the GIL.

Using these two communication objects, you can create interesting blocking or non-blocking programs.
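Here is a minimal sketch of that producer/worker pattern with multiprocessing.Queue; the store_in_db helper, the worker count, and FileNameList are placeholders, not part of this answer:

import os
from multiprocessing import Process, Queue

def worker(queue):
    # each worker process pulls file paths from the queue until it
    # receives the None sentinel, then exits
    while True:
        path = queue.get()
        if path is None:
            break
        with open(path) as f:
            for line in f:
                store_in_db(line)        # placeholder for your DB code

if __name__ == '__main__':
    queue = Queue()
    workers = [Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()

    # the producer: push every .txt file found under FileNameList
    for folder in FileNameList:
        for name in os.listdir(folder):
            if name.endswith('.txt'):
                queue.put(os.path.join(folder, name))

    for _ in workers:
        queue.put(None)                  # one sentinel per worker
    for w in workers:
        w.join()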

Side note: keep in mind that not every db connection is thread safe.
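For example, with sqlite3 one safe pattern is to give every worker its own connection instead of sharing one; the lines table and db_path below are hypothetical:

import sqlite3

def db_worker(queue, db_path):
    # each worker opens its own connection; sqlite3 connections should
    # not be shared across threads or processes
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    while True:
        path = queue.get()
        if path is None:
            break
        with open(path) as f:
            cur.executemany("INSERT INTO lines(text) VALUES (?)",
                            ((line,) for line in f))
        conn.commit()
    conn.close()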

