Efficient file reading in Python when the text needs to be split on '\n'

I traditionally read files:

    file = open(fullpath, "r")
    allrecords = file.read()
    delimited = allrecords.split('\n')
    for record in delimited[1:]:
        record_split = record.split(',')

and

    with open(os.path.join(txtdatapath, pathfilename), "r") as data:
        datalines = (line.rstrip('\r\n') for line in data)
        for record in datalines:
            split_line = record.split(',')
            if len(split_line) > 1:

But when I process these files in a multiprocessing pool, I get a MemoryError. What is the best way to read a file line by line when the text I am reading also needs to be split on '\n'?

Here is the multiprocessing code:

    pool = Pool()
    fixed_args = (targetdirectorytxt, value_dict)
    varg = ((filename,) + fixed_args for filename in readinfiles)
    op_list = pool.map_async(PPD_star, list(varg), chunksize=1)
    while not op_list.ready():
        print("Number of files left to process: {}".format(op_list._number_left))
        time.sleep(60)
    op_list = op_list.get()
    pool.close()
    pool.join()

Here is the error log:

    Exception in thread Thread-3:
    Traceback (most recent call last):
      File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
        self.run()
      File "C:\Python27\lib\threading.py", line 763, in run
        self.__target(*self.__args, **self.__kwargs)
      File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
        task = get()
    MemoryError

I'm trying to install pathos, as Mike kindly suggested, but I'm having problems. Here is my installation command:

 pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre 

But here are the error messages I get:

    Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
      Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-build\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/zipball/master
    Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
      Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jptyuser\ppft\setup.py) egg_info for package ppft
        warning: no files found matching 'python-restlib.spec'
    Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\python27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
    Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\python27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
    Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
      Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
      Some externally hosted files were ignored (use --allow-external pyre to allow).
    Cleaning up...
    No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
    Storing debug log for failure in C:\Users\xxx\pip\pip.log

I am installing on Windows 7, 64-bit. In the end, I was able to install using easy_install.

But now it fails because I have too many open files:

    Finished reading in Exposures...
    Reading Samples from: C:\XXX\XXX\XXX\
    Traceback (most recent call last):
      File "events.py", line 568, in <module>
        mdrcv_dict = ReadDamages(damage_dir, value_dict)
      File "events.py", line 185, in ReadDamages
        res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
      File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multiprocessing.py", line 230, in amap
        return _pool.map_async(star(f), zip(*args)) # chunksize
      File "events.py", line 184, in <genexpr>
        files = (open(name, 'r') for name in readinfiles[0:])
    IOError: [Errno 24] Too many open files: 'C:\\xx.csv'

Currently, using the multiprocessing library, I pass parameters and dictionaries into my function, open the associated file, and then return a dictionary. Here is an example of how I am doing this at present; what would be a sensible way to do this with pathos?

    def PP_star(args_flat):
        return PP(*args_flat)

    def PP(pathfilename, txtdatapath, my_dict):
        return com_dict

    fixed_args = (targetdirectorytxt, my_dict)
    varg = ((filename,) + fixed_args for filename in readinfiles)
    op_list = pool.map_async(PP_star, list(varg), chunksize=1)

How can I execute the same function with pathos.multiprocessing?

3 answers

Just iterate over the lines instead of reading the whole file, like this:

    with open(os.path.join(txtdatapath, pathfilename), "r") as data:
        for dataline in data:
            split_line = dataline.split(',')
            if len(split_line) > 1:
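As a sketch of how this might plug into the worker from the question (the PP and com_dict names are taken from there; the exact dictionary layout is only a guess):

    import os

    def PP(pathfilename, txtdatapath, my_dict):
        com_dict = {}
        with open(os.path.join(txtdatapath, pathfilename), "r") as data:
            # stream the file line by line instead of read()-ing it whole
            for dataline in data:
                split_line = dataline.rstrip('\r\n').split(',')
                if len(split_line) > 1:
                    # hypothetical layout: first field as key, remaining fields as value
                    com_dict[split_line[0]] = split_line[1:]
        return com_dict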

Let's say we have a file1.txt:

    hello35
    1234123
    1234123
    hello32
    2492wow
    1234125
    1251234
    1234123
    1234123
    2342bye
    1234125
    1251234
    1234123
    1234123
    1234125
    1251234
    1234123

file2.txt:

    1234125
    1251234
    1234123
    hello35
    2492wow
    1234125
    1251234
    1234123
    1234123
    hello32
    1234125
    1251234
    1234123
    1234123
    1234123
    1234123
    2342bye

and so on, through to file5.txt:

    1234123
    1234123
    1234125
    1251234
    1234123
    1234123
    1234123
    1234125
    1251234
    1234125
    1251234
    1234123
    1234123
    hello35
    hello32
    2492wow
    2342bye

I would suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do this.

    >>> import pathos
    >>> thpool = pathos.multiprocessing.ThreadingPool()
    >>> mppool = pathos.multiprocessing.ProcessingPool()
    >>>
    >>> def rstrip(line):
    ...     return line.rstrip()
    ...
    >>> # get your list of files
    >>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
    >>> # open the files
    >>> files = (open(name, 'r') for name in fnames)
    >>> # read each file in asynchronous parallel
    >>> # while reading and stripping each line in parallel
    >>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
    >>> # get the result when it done
    >>> res.ready()
    True
    >>> data = res.get()
    >>> # if not using a files iterator -- close each file by uncommenting the next line
    >>> # files = [file.close() for file in files]
    >>> data[0]
    ['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
    >>> data[1]
    ['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
    >>> data[-1]
    ['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']

However, if you want to check how many files are left to finish, you may want to use an "iterated" map (imap) instead of the "asynchronous" map (amap). See this post for more details: Python Multiprocessing - tracking the process of pool.map operation.
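For example, here is a rough sketch of that imap variant, assuming the same thpool, mppool, rstrip and fnames as in the session above (the progress printing is just one possible way to report it):

    files = (open(name, 'r') for name in fnames)
    data = []
    # results arrive in order as each file finishes, so progress can be reported
    for done, lines in enumerate(thpool.imap(mppool.map, [rstrip]*len(fnames), files), 1):
        print("Files finished: {} of {}".format(done, len(fnames)))
        data.append(lines)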

Get pathos here: https://github.com/uqfoundation


Try the following:

    for line in file('file.txt'):
        print line.rstrip()

Of course, instead of printing them, you can also add them to a list or perform some other operation on each line.
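For instance, a minimal sketch of the list-collecting variant:

    # collect the stripped lines in a list instead of printing them
    stripped = []
    for line in open('file.txt'):
        stripped.append(line.rstrip())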

