How to lazily load a data structure (Python)

I have a way to build a data structure (e.g. from some file contents):

    def loadfile(FILE):
        return # some data structure created from the contents of FILE

Therefore, I can do things like

    puppies = loadfile("puppies.csv") # wait for loadfile to work
    kitties = loadfile("kitties.csv") # wait some more
    print len(puppies)
    print puppies[32]

In the above example, I spent a ton of time actually reading kitties.csv and creating a data structure that I never used. I would like to avoid this waste without constantly checking if not kitties whenever I want to do something. I would like to be able to do

    puppies = lazyload("puppies.csv") # instant
    kitties = lazyload("kitties.csv") # instant
    print len(puppies)                # wait for loadfile
    print puppies[32]

So, if I never try to do something with kitties , loadfile("kitties.csv") will never be called.

Is there a standard way to do this?

After playing with it a little, I came up with the following solution, which appears to work correctly and is rather concise. Are there any alternatives? Are there any drawbacks to this approach that I should keep in mind?

    class lazyload:
        def __init__(self,FILE):
            self.FILE = FILE
            self.F = None
        def __getattr__(self,name):
            if not self.F:
                print "loading %s" % self.FILE
                self.F = loadfile(self.FILE)
            return object.__getattribute__(self.F, name)

What would be even better is if something like this worked:

    class lazyload:
        def __init__(self,FILE):
            self.FILE = FILE
        def __getattr__(self,name):
            self = loadfile(self.FILE) # this never gets called again
                                       # since self is no longer a
                                       # lazyload instance
            return object.__getattribute__(self, name)

But this does not work, because self is a local name inside the method. In fact, it ends up calling loadfile every time you do something.
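A minimal illustration of why the assignment has no lasting effect (Box and become are hypothetical names used only for this sketch): assigning to self inside a method rebinds the local name and leaves the caller's variable pointing at the original object.

```python
class Box(object):
    def become(self, other):
        # Rebinds only the local name "self"; the caller's reference is untouched.
        self = other
        return self

b = Box()
result = b.become([1, 2, 3])

# b still refers to the original Box instance, not to the list.
still_box = isinstance(b, Box)
```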

+7
6 answers

The csv module in the Python standard library does not load any data until you start iterating over it, so it is in fact already lazy.
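A small sketch of that laziness, assuming a hypothetical lines() generator as a stand-in for a real file so that every line read can be recorded. csv.reader accepts any iterable of strings, and creating the reader by itself should not consume anything:

```python
import csv

consumed = []

def lines():
    # Hypothetical stand-in for a file: records each line as it is read.
    for line in ["name,age\n", "Rex,3\n", "Fido,5\n"]:
        consumed.append(line)
        yield line

reader = csv.reader(lines())   # creating the reader reads nothing
before = len(consumed)

header = next(reader)          # pulls exactly one line from the source
after = len(consumed)
```

Here before and after count how many lines had been pulled from the source at each point.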

Edit: if you need to read the entire file in order to build a data structure, a complex lazy-load object that proxies everything is overkill. Just do the following:

    class Lazywrapper(object):
        def __init__(self, filename):
            self.filename = filename
            self._data = None

        def get_data(self):
            if self._data is None:
                self._build_data()
            return self._data

        def _build_data(self):
            # Now open and iterate over the file to build a data structure, and
            # store that data structure in self._data
            raise NotImplementedError  # placeholder until the loading code is written

Using the above class you can do this:

    puppies = Lazywrapper("puppies.csv") # Instant
    kitties = Lazywrapper("kitties.csv") # Instant
    print len(puppies.get_data()) # Wait
    print puppies.get_data()[32] # instant

Or:

    allkitties = kitties.get_data() # wait
    print len(allkitties)
    print allkitties[32]

If you have a lot of data and you really do not need to load all of it, you could also implement something like a class that reads the file until it finds a dog called "Froufrou" and then stops. At that point, though, the best option is to store the data in a database once and for all and access it from there.
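The stop-early idea could look roughly like this. It is only a sketch: find_first is a hypothetical helper, and the column names and sample rows are made up for illustration.

```python
import csv
import io

def find_first(csvfile, column, value):
    """Return the first row whose `column` equals `value`, or None."""
    for row in csv.DictReader(csvfile):
        if row[column] == value:
            return row  # stop reading as soon as a match is found
    return None

# In-memory stand-in for a real CSV file (hypothetical contents).
sample = io.StringIO("name,breed\nRex,Beagle\nFroufrou,Poodle\nFido,Pug\n")
match = find_first(sample, "name", "Froufrou")

missing = find_first(io.StringIO("name,breed\nRex,Beagle\n"), "name", "Froufrou")
```

Rows after the match are never parsed, which is the point of the approach.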

+4

Wouldn't if not self.F lead to another __getattr__ call, putting you in an infinite loop? I think your approach makes sense, but to be safe, I would change that line to:

    if name == "F" and not self.F:

Alternatively, you can make loadfile a class method, depending on what you are doing.

+1

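If you are really concerned about the if statement, you can use a stateful object.

    from collections import MutableMapping

    class LazyLoad( MutableMapping ):
        def __init__( self, source ):
            self.source= source
            self.process= LoadMe( self )
            self.data= None
        def __getitem__( self, key ):
            self.process= self.process.load()
            return self.data[key]
        def __setitem__( self, key, value ):
            self.process= self.process.load()
            self.data[key]= value
        def __contains__( self, key ):
            self.process= self.process.load()
            return key in self.data
        # MutableMapping also requires __delitem__, __iter__ and __len__
        def __delitem__( self, key ):
            self.process= self.process.load()
            del self.data[key]
        def __iter__( self ):
            self.process= self.process.load()
            return iter( self.data )
        def __len__( self ):
            self.process= self.process.load()
            return len( self.data )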

This class delegates the work to the process object, which is either a LoadMe or a DoneLoading object. The LoadMe object actually loads the data; the DoneLoading object does nothing.

Please note that there are no if statements.

    class LoadMe( object ):
        def __init__( self, parent ):
            self.parent= parent
        def load( self ):
            ## Actually load, setting self.parent.data
            return DoneLoading( self.parent )

    class DoneLoading( object ):
        def __init__( self, parent ):
            self.parent= parent
        def load( self ):
            return self

+1

Here is a solution that uses a class decorator to delay initialization until the first use of the object:

    def lazyload(cls):
        original_init = cls.__init__
        original_getattribute = cls.__getattribute__

        def newinit(self, *args, **kwargs):
            # Just cache the arguments for the eventual initialization.
            self._init_args = args
            self._init_kwargs = kwargs
            self.initialized = False
        newinit.__doc__ = original_init.__doc__

        def performinit(self):
            # We call object __getattribute__ rather than super(...).__getattribute__
            # or original_getattribute so that no custom __getattribute__ implementations
            # can interfere with what we are doing.
            original_init(self,
                          *object.__getattribute__(self, "_init_args"),
                          **object.__getattribute__(self, "_init_kwargs"))
            del self._init_args
            del self._init_kwargs
            self.initialized = True

        def newgetattribute(self, name):
            if not object.__getattribute__(self, "initialized"):
                performinit(self)
            return original_getattribute(self, name)

        if hasattr(cls, "__getitem__"):
            original_getitem = cls.__getitem__
            def newgetitem(self, key):
                if not object.__getattribute__(self, "initialized"):
                    performinit(self)
                return original_getitem(self, key)
            newgetitem.__doc__ = original_getitem.__doc__
            cls.__getitem__ = newgetitem

        if hasattr(cls, "__len__"):
            original_len = cls.__len__
            def newlen(self):
                if not object.__getattribute__(self, "initialized"):
                    performinit(self)
                return original_len(self)
            newlen.__doc__ = original_len.__doc__
            cls.__len__ = newlen

        cls.__init__ = newinit
        cls.__getattribute__ = newgetattribute
        return cls

    @lazyload
    class FileLoader(dict):
        def __init__(self, filename):
            self.filename = filename
            print "Performing expensive load operation"
            self[32] = "Felix"
            self[33] = "Eeek"

    kittens = FileLoader("kitties.csv")
    print "kittens is instance of FileLoader: %s" % isinstance(kittens, FileLoader) # Well obviously
    print len(kittens) # Wait
    print kittens[32] # No wait
    print kittens[33] # No wait
    print kittens.filename # Still no wait
    print kittens.filename

Output:

    kittens is instance of FileLoader: True
    Performing expensive load operation
    2
    Felix
    Eeek
    kitties.csv
    kitties.csv

I tried to restore the original magic methods after initialization, but this did not work. It may be necessary to proxy additional magic methods; I have not studied each scenario.

Note that kittens.initialized will always return True because it starts initialization if it has not already been executed. Obviously, one could add an exception for this attribute so that it returns False if no other operation was performed on the object, or the check could be changed to the equivalent of the hasattr call, and the initialized attribute could be deleted after initialization.

+1

Here is a hack that makes the "even better" solution work, but I think it is ugly enough that it is probably best to just use the first solution. The idea is to accomplish the step self = loadfile(self.FILE) by passing the variable name as an attribute:

    class lazyload:
        def __init__(self,FILE,var):
            self.FILE = FILE
            self.var = var
        def __getattr__(self,name):
            x = loadfile(self.FILE)
            globals()[self.var]=x
            return object.__getattribute__(x, name)

Then you can do

    kitties = lazyload("kitties.csv","kitties")
    ^                                 ^
     \                               /
      These two better match exactly

After calling any method on kitties (except kitties.FILE or kitties.var ), it will become completely indistinguishable from what you got with kitties = loadfile("kitties.csv") . In particular, it will no longer be an instance of lazyload and kitties.FILE and kitties.var will no longer exist.

0

If you need to use puppies[32] , you also need to define the __getitem__ method because __getattr__ will not catch this behavior.
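A minimal sketch of that behavior, assuming new-style classes (the only kind in Python 3). Proxy and IndexableProxy are hypothetical names used only for illustration. The obj[key] syntax looks up __getitem__ on the type, so an attribute-forwarding proxy has to define it explicitly:

```python
class Proxy(object):
    """Hypothetical proxy that forwards attribute access to a wrapped object."""
    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        return getattr(self._target, name)

class IndexableProxy(Proxy):
    def __getitem__(self, key):
        # Explicitly forward indexing to the wrapped object.
        return self._target[key]

plain = Proxy(["a", "b", "c"])
count = plain.count("a")       # ordinary attribute: goes through __getattr__

try:
    plain[1]
    indexing_failed = False
except TypeError:
    indexing_failed = True     # __getattr__ was never consulted for __getitem__

item = IndexableProxy(["a", "b", "c"])[1]
```

The same applies to len() and other operations backed by special methods, so each one the proxy should support needs its own forwarding method.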

I use lazy loading for my own needs; here is the code, unadapted:

    class lazy_mask(object):
        '''Fake object, which is substituted in place of masked object'''

        def __init__(self, master, id):
            self.master=master
            self.id=id
            self._result=None
            self.master.add(self)

        def _res(self):
            '''Run lazy job'''
            if not self._result:
                self._result=self.master.get(self.id)
            return self._result

        def __getattribute__(self, name):
            '''proxy all queries to masked object'''
            name=name.replace('_lazy_mask', '')
            #print 'attr', name
            if name in ['_result', '_res', 'master', 'id']:#don't proxy requests for own properties
                return super(lazy_mask, self).__getattribute__(name)
            else:#but proxy requests for masked object
                return self._res().__getattribute__(name)

        def __getitem__(self, key):
            '''provide object["key"] access. Else can raise TypeError: 'lazy_mask' object is unsubscriptable'''
            return self._res().__getitem__(key)

(master is a registry object that loads the data when I call its get() method.)

This implementation works fine with isinstance(), str() and json.dumps().

0
