How can I treat a section of a file as if it were a file itself?

I have data that is stored either in a collection of files or in a single compound file. The compound file is formed by concatenating all the separate files and then preceding everything with a header that gives the offsets and sizes of the constituent parts. I'd like to have a file-like object that presents a view of the compound file, where the view represents just one of the member files. (That way, I can have functions for reading data that accept either a real file object or a "view" object, and they needn't worry about how any particular data set is stored.) What library will do this for me?

The mmap class looked promising, since it is constructed from a file, a length, and an offset, which is exactly what I have. However, the offset has to be a multiple of the system's allocation granularity, and the files I'm reading don't meet that requirement. The name of the MultiFile class fits the bill, but it is tailored for attachments in e-mail messages, and my files don't have that structure.
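To make the alignment problem concrete, here is a minimal sketch of the usual workaround: round the offset down to a multiple of mmap.ALLOCATIONGRANULARITY and remember how many extra leading bytes were mapped. The helper name map_section and the throwaway test file are assumptions made for this example only.

```python
import mmap
import tempfile


def map_section(fileobj, offset, length):
    """Map `length` bytes of `fileobj` starting at `offset`.

    Returns (mapping, delta); the requested bytes are
    mapping[delta:delta + length].
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    # Largest legal mmap offset that is <= the requested offset.
    aligned = (offset // granularity) * granularity
    # Leading bytes that get mapped but are not part of the member.
    delta = offset - aligned
    mapping = mmap.mmap(fileobj.fileno(), delta + length,
                        access=mmap.ACCESS_READ, offset=aligned)
    return mapping, delta


# Usage with a throwaway file: 5 bytes of "header", then a 7-byte member.
with tempfile.TemporaryFile() as f:
    f.write(b"HHHHH" + b"payload" + b"TTTTT")
    f.flush()
    mapping, delta = map_section(f, offset=5, length=7)
    member = mapping[delta:delta + 7]
    mapping.close()
```

This gets at the right bytes, but positions reported by the mapping's own seek and tell are shifted by delta, so the mapping is still not a drop-in replacement for a file object that starts at position 0.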

The file operations I'm most interested in are read, seek, and tell. The files I'm reading are binary, so text-oriented functions like readline and next aren't that important. I might eventually need write as well, but I'm willing to give that up, since I'm not sure how appending should behave.

2 answers

I know you were looking for a library, but as soon as I read this question I thought I'd write one myself. So here it is:

    import collections
    import os


    class View:
        """File-like view of `length` bytes of `f`, starting at `offset`."""

        def __init__(self, f, offset, length):
            self.f = f
            self.f_offset = offset  # where the view starts in the underlying file
            self.offset = 0         # current position, relative to the view
            self.length = length

        def seek(self, offset, whence=os.SEEK_SET):
            if whence == os.SEEK_SET:
                self.offset = offset
            elif whence == os.SEEK_CUR:
                self.offset += offset
            elif whence == os.SEEK_END:
                self.offset = self.length + offset
            else:
                # Other values of whence raise an IOError.
                raise IOError("invalid whence value: %r" % (whence,))
            return self.f.seek(self.offset + self.f_offset, os.SEEK_SET)

        def tell(self):
            return self.offset

        def read(self, size=-1):
            # Reposition the shared file object; another view may have moved it.
            self.seek(self.offset)
            if size < 0:
                size = self.length - self.offset
            # Never read past the end of the view.
            size = max(0, min(size, self.length - self.offset))
            self.offset += size
            return self.f.read(size)


    if __name__ == "__main__":
        f = open('test.txt', 'r')
        views = []
        offsets = [i * 11 for i in range(10)]
        for o in offsets:
            # Each record is ".", a one-digit length, then the payload.
            f.seek(o + 1)
            length = int(f.read(1))
            views.append(View(f, o + 2, length))
        f.seek(0)

        # Read every view in one go.
        completes = {}
        for v in views:
            completes[v.f_offset] = v.read()
            v.seek(0)

        # Read the views again, interleaved, three characters at a time.
        strs = collections.defaultdict(str)
        for i in range(3):
            for v in views:
                strs[v.f_offset] += v.read(3)
        strs = dict(strs)  # We want it to raise KeyErrors after that.

        for offset, s in completes.items():
            print(offset, strs[offset], completes[offset])
            assert strs[offset] == completes[offset], "Something went wrong!"
        f.close()

And I wrote another script to create the test.txt file:

    import random
    import string

    with open('test.txt', 'w') as f:
        for i in range(10):
            rand_list = list(string.ascii_letters)
            random.shuffle(rand_list)
            rand_str = "".join(rand_list[:9])
            # Record layout: ".", a one-digit length, then the payload.
            f.write(".%d%s" % (len(rand_str), rand_str))

It worked for me. The files I tested on aren't binary like yours, and they aren't as big as yours, but I hope this is useful. If not, thanks anyway, it was a good challenge :D

Also, I was wondering: if these are in fact several files, why not use some kind of archive file format and use its libraries to read them?
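As a rough illustration of the archive idea, and assuming that repackaging the data is an option, the standard zipfile module already hands out a file-like object per member. The member names and contents below are made up for the example.

```python
import io
import zipfile

# Build a small archive in memory; names and contents are invented.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("part_a.bin", b"\x00\x01\x02\x03")
    archive.writestr("part_b.bin", b"hello world")


def read_prefix(fileobj, nbytes):
    """Works on any object with read(): a real file or an archive member."""
    return fileobj.read(nbytes)


with zipfile.ZipFile(buf) as archive:
    # ZipFile.open() returns a file-like object for a single member.
    with archive.open("part_b.bin") as member:
        prefix = read_prefix(member, 5)
        rest = member.read()
```

The reading functions only ever see an object with read, so they don't need to know whether it came from a standalone file or from inside the archive.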

Hope this helps.


Depending on how sophisticated you need this to be, something like this should work. I've glossed over some details, since I don't know how closely you need to emulate a file object (for example, will you ever use obj.read(), or will you always use obj.read(nbytes)?):

    class FileView(object):
        def __init__(self, file, offset, length):
            self._file = file
            self._offset = offset
            self._length = length

        def seek(self, pos):
            # May need to get a little fancier here to support the second
            # argument to seek.
            return self._file.seek(self._offset + pos)

        def tell(self):
            return self._file.tell() - self._offset

        def read(self, *args):
            # May need to get a little more complicated here to make sure that
            # the number of bytes read is smaller than the number of bytes
            # available for this file.
            return self._file.read(*args)
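The two gaps flagged in the comments (the second argument to seek, and reads running past the end of the section) could be filled in roughly as follows. This is only a sketch of one possible approach, not the code above: it subclasses io.RawIOBase from the standard library so that read and readall come for free from readinto. The class name FileSection is an assumption made for the example.

```python
import io
import os


class FileSection(io.RawIOBase):
    """Read-only, seekable view of `length` bytes of `base` from `start`."""

    def __init__(self, base, start, length):
        super().__init__()
        self._base = base
        self._start = start
        self._length = length
        self._pos = 0  # position relative to the start of the section

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=os.SEEK_SET):
        if whence == os.SEEK_SET:
            new_pos = offset
        elif whence == os.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == os.SEEK_END:
            new_pos = self._length + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if new_pos < 0:
            raise ValueError("negative seek position: %r" % (new_pos,))
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        # Never hand out bytes beyond the end of the section.
        remaining = max(0, self._length - self._pos)
        count = min(len(buffer), remaining)
        if count == 0:
            return 0
        # Reposition every time: other views may share the same base file.
        self._base.seek(self._start + self._pos)
        data = self._base.read(count)
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)


# Usage: a 6-byte "header" followed by two members of 10 and 6 bytes.
base = io.BytesIO(b"HEADERfirst-partsecond")
first = FileSection(base, 6, 10)
second = FileSection(base, 16, 6)
```

Closing a FileSection does not close the underlying file, which seems like the right default when several views share one compound file. Wrapping a view in io.BufferedReader should add buffering if the reads are small.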