Efficient file reading in Python when the text needs to be split on '\n'

I traditionally read files:

    file = open(fullpath, "r")
    allrecords = file.read()
    delimited = allrecords.split('\n')
    for record in delimited[1:]:
        record_split = record.split(',')

and

    with open(os.path.join(txtdatapath, pathfilename), "r") as data:
        datalines = (line.rstrip('\r\n') for line in data)
        for record in datalines:
            split_line = record.split(',')
            if len(split_line) > 1:

But when I process these files in a multiprocessing pool, I get a MemoryError. What is the best way to read a file line by line when the text I am reading also needs to be split on '\n'?

Here is the multiprocessing code:

    pool = Pool()
    fixed_args = (targetdirectorytxt, value_dict)
    varg = ((filename,) + fixed_args for filename in readinfiles)
    op_list = pool.map_async(PPD_star, list(varg), chunksize=1)
    while not op_list.ready():
        print("Number of files left to process: {}".format(op_list._number_left))
        time.sleep(60)
    op_list = op_list.get()
    pool.close()
    pool.join()

Here is the error log:

    Exception in thread Thread-3:
    Traceback (most recent call last):
      File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
        self.run()
      File "C:\Python27\lib\threading.py", line 763, in run
        self.__target(*self.__args, **self.__kwargs)
      File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
        task = get()
    MemoryError

I'm trying to install pathos, as Mike kindly suggested, but I'm having problems. Here is my installation command:

 pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre 

But here are the error messages I get:

    Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
      Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-build\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/zipball/master
    Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
      Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jptyuser\ppft\setup.py) egg_info for package ppft
        warning: no files found matching 'python-restlib.spec'
    Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\python27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
    Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\python27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
    Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
      Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
      Some externally hosted files were ignored (use --allow-external pyre to allow).
    Cleaning up...
    No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
    Storing debug log for failure in C:\Users\xxx\pip\pip.log

I am installing on Windows 7, 64-bit. In the end, I was able to install using easy_install.

But now it fails because I have too many open files:

    Finished reading in Exposures...
    Reading Samples from: C:\XXX\XXX\XXX\
    Traceback (most recent call last):
      File "events.py", line 568, in <module>
        mdrcv_dict = ReadDamages(damage_dir, value_dict)
      File "events.py", line 185, in ReadDamages
        res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
      File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multiprocessing.py", line 230, in amap
        return _pool.map_async(star(f), zip(*args)) # chunksize
      File "events.py", line 184, in <genexpr>
        files = (open(name, 'r') for name in readinfiles[0:])
    IOError: [Errno 24] Too many open files: 'C:\\xx.csv'

Currently, using the multiprocessing library, I pass parameters and dictionaries into my function, open the associated file, and then return a dictionary. Here is an example of how I am doing this at present; what would be a sensible way to do this with pathos?

    def PP_star(args_flat):
        return PP(*args_flat)

    def PP(pathfilename, txtdatapath, my_dict):
        return com_dict

    fixed_args = (targetdirectorytxt, my_dict)
    varg = ((filename,) + fixed_args for filename in readinfiles)
    op_list = pool.map_async(PP_star, list(varg), chunksize=1)

How can I execute the same function with pathos.multiprocessing?

3 answers

Just iterate over the lines instead of reading the whole file, like this:

    with open(os.path.join(txtdatapath, pathfilename), "r") as data:
        for dataline in data:
            split_line = dataline.split(',')
            if len(split_line) > 1:
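As a sketch of how this might plug into the worker from the question (the PP and com_dict names are taken from there; the exact dictionary layout is only a guess):

    import os

    def PP(pathfilename, txtdatapath, my_dict):
        com_dict = {}
        with open(os.path.join(txtdatapath, pathfilename), "r") as data:
            # stream the file line by line instead of read()-ing it whole
            for dataline in data:
                split_line = dataline.rstrip('\r\n').split(',')
                if len(split_line) > 1:
                    # hypothetical layout: first field as key, remaining fields as value
                    com_dict[split_line[0]] = split_line[1:]
        return com_dict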

Let's say we have a file1.txt:

    hello35
    1234123
    1234123
    hello32
    2492wow
    1234125
    1251234
    1234123
    1234123
    2342bye
    1234125
    1251234
    1234123
    1234123
    1234125
    1251234
    1234123

file2.txt:

    1234125
    1251234
    1234123
    hello35
    2492wow
    1234125
    1251234
    1234123
    1234123
    hello32
    1234125
    1251234
    1234123
    1234123
    1234123
    1234123
    2342bye

and so on, through to file5.txt:

    1234123
    1234123
    1234125
    1251234
    1234123
    1234123
    1234123
    1234125
    1251234
    1234125
    1251234
    1234123
    1234123
    hello35
    hello32
    2492wow
    2342bye

I would suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do this.

    >>> import pathos
    >>> thpool = pathos.multiprocessing.ThreadingPool()
    >>> mppool = pathos.multiprocessing.ProcessingPool()
    >>>
    >>> def rstrip(line):
    ...     return line.rstrip()
    ...
    >>> # get your list of files
    >>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
    >>> # open the files
    >>> files = (open(name, 'r') for name in fnames)
    >>> # read each file in asynchronous parallel
    >>> # while reading and stripping each line in parallel
    >>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
    >>> # get the result when it done
    >>> res.ready()
    True
    >>> data = res.get()
    >>> # if not using a files iterator -- close each file by uncommenting the next line
    >>> # files = [file.close() for file in files]
    >>> data[0]
    ['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
    >>> data[1]
    ['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
    >>> data[-1]
    ['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']

However, if you want to check how many files are left to finish, you may want to use an "iterated" map (imap) instead of the "asynchronous" map (amap). See this post for more details: Python Multiprocessing - tracking the process of pool.map operation.
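For example, here is a rough sketch of that imap variant, assuming the same thpool, mppool, rstrip and fnames as in the session above (the progress printing is just one possible way to report it):

    files = (open(name, 'r') for name in fnames)
    data = []
    # results arrive in order as each file finishes, so progress can be reported
    for done, lines in enumerate(thpool.imap(mppool.map, [rstrip]*len(fnames), files), 1):
        print("Files finished: {} of {}".format(done, len(fnames)))
        data.append(lines)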

Get pathos here: https://github.com/uqfoundation


Try the following:

    for line in file('file.txt'):
        print line.rstrip()

Of course, instead of printing them, you can also add them to a list or perform some other operation on each line.
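For instance, a minimal sketch of the list-collecting variant:

    # collect the stripped lines in a list instead of printing them
    stripped = []
    for line in open('file.txt'):
        stripped.append(line.rstrip())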

