I am developing a Python command-line utility that potentially involves rather large queries against a set of files. The list of queries is reasonably finite (think DB-indexed columns), and to improve performance within a single run I can build sorted/structured lists, maps, and trees once and then hit those repeatedly, rather than hitting the file system for every lookup.

However, these caches are lost when the process ends and have to be rebuilt every time the script runs, which dramatically increases my program's execution time. I would like to identify the best way to share this data between multiple runs of my command, which may be concurrent, one right after another, or separated by significant delays.
Requirements:
- Must be fast - per-execution processing should be minimized, including disk IO and object construction.
- Must be OS-agnostic (or at least able to hook into comparable underlying behavior on Unix and Windows, which seems more likely).
- Must allow fairly complex querying/filtering - I don't think a plain key/value map will be good enough.
- Does not need to be up-to-date - (slightly) stale data is fine; it's just a cache, and the real data is being written to disk separately.
- Can't use a heavyweight daemon process such as MySQL or MemCached - I want to minimize the installation burden, and asking every user to install these services is too much.
Preferences:
- I would like to avoid any long-running daemon process at all, if possible.
- While I would like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world; fast reads are much more important than fast writes.

In my ideal fantasy world, I could keep Python objects alive directly between executions, something like the way Java threads (e.g. Tomcat requests) share singleton data-store objects, but I realize that may not be possible. The closer I can get to that, the better.
Candidates:
SQLite in memory
SQLite on its own doesn't seem fast enough for my use case, since it's backed by disk and therefore has to read the file on every run. This may not be as bad as it sounds, but it seems necessary to keep the database permanently in memory. SQLite does allow a DB to use memory as storage, but such DBs are destroyed when the program exits and cannot be shared between instances.
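To make the limitation concrete, here is a minimal sketch (the schema and values are made up) showing that an in-memory SQLite DB supports the kind of filtering I need but dies with the process:

```python
import sqlite3

# An in-memory SQLite database lives only as long as this one process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER)")
conn.execute("INSERT INTO files VALUES (?, ?)", ("/tmp/example.txt", 1024))

# Complex querying/filtering works fine...
rows = conn.execute(
    "SELECT path FROM files WHERE size > ? ORDER BY path", (512,)
).fetchall()
print(rows)

conn.close()  # ...but everything vanishes here; nothing is shared or persisted
```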
Flat-file database loaded into memory using mmap
At the opposite end of the spectrum, I could write the caches to disk and then load them into memory with mmap, sharing the same memory space between separate executions. What I don't understand is what happens to the mmap if all processes exit. It's fine if the mmap eventually gets flushed from memory, but I'd want it to linger for a bit (30 seconds? a few minutes?) so that a user can run commands one after another and the cache can be reused. The example I found seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory-mapped files get dropped from memory and need to be reloaded from disk.

I think I could implement this, but it feels very low-level, and I imagine someone has already come up with a more elegant solution. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it seems like it would be very fast, and I could optimize for my specific use case.
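For reference, here's a minimal sketch of what I mean, assuming a hypothetical cache.bin file that an earlier run has already written; my (possibly wrong) understanding is that after all processes unmap the file, the pages live on in the OS page cache until memory pressure evicts them, so back-to-back runs would effectively read from RAM:

```python
import mmap
import os

CACHE_PATH = "cache.bin"  # hypothetical flat cache written by a previous run

# Build the cache once (in reality this would be the expensive index build).
if not os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, "wb") as f:
        f.write(b"sorted-index-bytes-go-here")

# Map the file read-only; the OS, not this process, decides how long the
# pages stay resident after every process has closed its mapping.
with open(CACHE_PATH, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        print(mm[:])  # served from memory once the pages are in the page cache
```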
Sharing Python objects between processes with processing
The processing documentation states that "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I found no further mention of this behavior, but it sounds very promising. Can anyone point me to more information?
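From what I can tell, the modern descendant of this is the multiprocessing module, and Python 3.8+ additionally offers multiprocessing.shared_memory with named blocks that can outlive their creator. A rough sketch of how I imagine using it across runs (the block name is made up, and I'm not sure the block actually survives between runs, since CPython's resource tracker may unlink blocks it considers leaked when a process exits):

```python
from multiprocessing import shared_memory

BLOCK_NAME = "my_cli_cache"  # hypothetical name shared by all runs

try:
    # Attach to a block left behind by a previous run...
    shm = shared_memory.SharedMemory(name=BLOCK_NAME)
    print("cache hit:", bytes(shm.buf[:5]))
except FileNotFoundError:
    # ...or create and populate it on the first run.
    shm = shared_memory.SharedMemory(name=BLOCK_NAME, create=True, size=1024)
    shm.buf[:5] = b"hello"

shm.close()  # detach, but don't unlink() if later runs should still see it
```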
Saving data to a RAM disk
My concern here is OS-specificity, but I could create a RAM disk and then simply read/write to it however I like (SQLite?). The fs.memoryfs package looks like a promising cross-platform alternative, but the comments imply a number of limitations.
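On Linux this seems straightforward, since /dev/shm is typically a RAM-backed tmpfs mount; here's a sketch of pointing SQLite at it (the path and schema are illustrative, and Windows would need a third-party RAM disk, which is exactly the OS-specific problem):

```python
import sqlite3

# /dev/shm is a tmpfs mount on most Linux systems, so this file lives in RAM
# and survives between runs (until reboot) without touching the physical disk.
DB_PATH = "/dev/shm/my_cli_cache.db"  # hypothetical cache location

conn = sqlite3.connect(DB_PATH)
conn.execute(
    "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER)"
)
conn.commit()
conn.close()
```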
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any manual data storage format. Can I combine pickle with any of the options above? Would it be better than flat files or SQLite?
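For instance, something like the following, which unpickles a previous run's indexes if they're fresh enough and rebuilds them otherwise (the file name, freshness window, and build function are all placeholders):

```python
import os
import pickle
import time

CACHE_PATH = "cache.pkl"  # hypothetical on-disk pickle of the built indexes

def load_or_build(build_fn, max_age_seconds=300):
    """Reuse a recent pickled cache if possible; otherwise rebuild and save."""
    try:
        if time.time() - os.path.getmtime(CACHE_PATH) < max_age_seconds:
            with open(CACHE_PATH, "rb") as f:
                return pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        pass  # missing or corrupt cache: fall through and rebuild
    data = build_fn()
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data

# The "expensive" build step is just a sorted map here.
cache = load_or_build(lambda: {"a.txt": 1, "b.txt": 2})
print(cache)
```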
I know there are a lot of questions related to this, but I've done a fair bit of digging and couldn't find anything directly addressing my question about multiple command-line executions.

I fully admit I may be way over-thinking this. I'm just trying to get a feel for my options, and whether they're worth it.
Thank you so much for your help!
dimo414