Preferred block size when reading / writing blobs

I need to read and write huge binary files. Is there a preferred or even optimal number of bytes (what I call BLOCK_SIZE) that I should read() at a time?

1 byte is obviously too little, and I don’t think reading 4 GB into RAM is a good idea either. Is there a “better” block size? Or does it depend on the file system (I'm on ext4)? What do I need to consider?

Python's open() even provides a buffering argument. Would I need to configure that too?

Here is example code that simply concatenates the two files in-0.data and in-1.data into out.data (in real life there is more processing, which is not relevant to this question). BLOCK_SIZE is set equal to io.DEFAULT_BUFFER_SIZE, which is the default size used for buffering:

    from pathlib import Path
    from functools import partial

    DATA_PATH = Path(__file__).parent / '../data/'

    out_path = DATA_PATH / 'out.data'
    in_paths = (DATA_PATH / 'in-0.data', DATA_PATH / 'in-1.data')

    BLOCK_SIZE = 8192


    def process(data):
        pass


    with out_path.open('wb') as out_file:
        for in_path in in_paths:
            with in_path.open('rb') as in_file:
                for data in iter(partial(in_file.read, BLOCK_SIZE), b''):
                    process(data)
                    out_file.write(data)
                # while True:
                #     data = in_file.read(BLOCK_SIZE)
                #     if not data:
                #         break
                #     process(data)
                #     out_file.write(data)
1 answer

Let the OS decide for you. Use the mmap module:

https://docs.python.org/3.4/library/mmap.html

It uses your OS's underlying memory-mapping mechanism to map the contents of a file into the process's address space.

Keep in mind that the file size limit is 2 GB if you use 32-bit Python, so be sure to use the 64-bit version if you decide to go this route.

For instance:

    import mmap

    f1 = open('input_file', 'r+b')
    m1 = mmap.mmap(f1.fileno(), 0)
    f2 = open('out_file', 'a+b')  # out_file must be non-empty (mmap cannot map an empty file)
    m2 = mmap.mmap(f2.fileno(), 0)
    m2.resize(len(m1))
    m2[:] = m1  # copy input_file to out_file
    m2.flush()  # flush results

Note that you never had to call read() or decide how many bytes to load into RAM. This example simply copies one file to another, but, as in your example, you can do whatever processing you need. Also note that although the entire file is mapped into the address space, this does not mean it has actually been copied into RAM. The OS loads it piecewise, at its own discretion.
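
Applied to the concatenation case from the question, the same idea might look roughly like the sketch below. It is only an illustration under some assumptions: the names concat_with_mmap and CHUNK are hypothetical, process() is a stand-in for the real processing, the inputs are mapped read-only, and the output is still written with an ordinary write(). The slice size only bounds how much data each process() call sees; which pages are actually resident in RAM is still up to the OS.

```python
import mmap
from pathlib import Path

CHUNK = 1024 * 1024  # hypothetical slice size; only bounds each process() call


def process(data):
    """Stand-in for the real per-chunk processing (hypothetical)."""
    return data


def concat_with_mmap(in_paths, out_path, chunk=CHUNK):
    """Write every input file to out_path in order, reading inputs via mmap."""
    with open(out_path, 'wb') as out_file:
        for in_path in in_paths:
            if Path(in_path).stat().st_size == 0:
                continue  # an empty file cannot be memory-mapped
            with open(in_path, 'rb') as in_file, \
                    mmap.mmap(in_file.fileno(), 0,
                              access=mmap.ACCESS_READ) as mapped:
                for start in range(0, len(mapped), chunk):
                    # Slicing copies just this window into a bytes object.
                    out_file.write(process(mapped[start:start + chunk]))
```

For the files in the question, this could be called as concat_with_mmap(in_paths, out_path).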
