I use a number of Python sets to store unique objects. Each object has __hash__ and __eq__ overridden.
The sets contain about 200,000 objects in total and take up about 4 GB of memory. That works fine on a machine with more than 5 GB of RAM, but now I need to run the script on a machine that has only 3 GB of RAM.
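Roughly what one of these objects looks like (a minimal sketch with made-up names; my real classes carry many more fields):

    # Minimal sketch of the kind of object stored in the sets (illustrative names only).
    class Movie(object):
        def __init__(self, tmdb_id, title):
            self.tmdb_id = tmdb_id
            self.title = title
            # ... plus genres, actors, directors and other fields parsed from JSON ...

        def __hash__(self):
            return hash(self.tmdb_id)

        def __eq__(self, other):
            return isinstance(other, Movie) and self.tmdb_id == other.tmdb_id

        def __ne__(self, other):
            # Python 2 does not derive __ne__ from __eq__ automatically.
            return not self.__eq__(other)

    movies = set()
    movies.add(Movie(603, "The Matrix"))
    movies.add(Movie(603, "The Matrix"))  # duplicate by tmdb_id, the set keeps only one
    print len(movies)                     # -> 1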
As an experiment I rewrote the script in C#: it reads the same data from the same source and puts it into the CLR analogue of set (HashSet), and instead of 4 GB it took about 350 MB, with roughly the same execution speed (about 40 seconds). But I have to use Python.
Q1: Does Python have some kind of "disk-backed" set, or any other workaround? I assume the set could keep in memory only the "key" data used in the __hash__/__eq__ methods, while everything else is stored on disk. Or maybe there are other ways in Python to have a unique collection of objects that needs more memory than is available on the system.
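What I have in mind, as a sketch only (shelve is just one guess at a disk-backed store, and tmdb_id stands in for whatever the __hash__/__eq__ key really is):

    # Sketch of the idea: keep only the dedup key in RAM, push the full object to disk.
    import shelve

    seen_keys = set()                     # only the small keys stay in memory
    store = shelve.open("objects.shelf")  # full objects are pickled to disk

    def add_unique(obj):
        key = obj.tmdb_id                 # whatever __hash__/__eq__ are based on
        if key in seen_keys:
            return False                  # duplicate - skip it
        seen_keys.add(key)
        store[str(key)] = obj             # shelve keys must be strings
        return True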
Q2: a less practical question: why does a Python set take up so much more memory for the same data?
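For reference, a rough way to see where the memory goes on CPython (using the Movie sketch from above; sys.getsizeof reports only shallow sizes, so the pieces have to be added up by hand):

    # Shallow sizes only: the instance, its per-instance attribute dict,
    # and the set's own hash table (elements not included).
    import sys

    m = Movie(603, "The Matrix")
    print sys.getsizeof(m)                    # the instance object itself
    print sys.getsizeof(m.__dict__)           # every instance drags its own dict along
    print sys.getsizeof(set(xrange(200000)))  # the set's internal table for 200K entries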
I use standard Python 2.7.3 on 64-bit Ubuntu 12.10
Thanks.
Update 1: what the script does:
Read many semi-structured JSON documents (each JSON is a serialized object together with a collection of related objects).
Parse each JSON document to extract the main object and the objects from its aggregated collections. Each parsed object is stored in a set; the set is used to keep only unique objects. At first I used the database for this, but a unique constraint in the database is 100x-1000x slower. Each JSON document is parsed into 1-8 different types of objects, and each type of object is stored in its own set so that only unique objects are kept in memory.
All the data kept in the sets is then saved into a relational database with unique constraints; each set goes into a separate database table.
The whole idea of the script is to take unstructured data, remove the duplicates among the objects aggregated in the JSON documents, and store the structured data in a relational database.
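A rough sketch of the overall flow (field and table names are made up for illustration; the real parser builds proper objects rather than tuples):

    # Simplified pipeline: JSON docs in, one set per object type for dedup,
    # then one database table per set.
    import json
    import sqlite3

    movies, genres = set(), set()        # one set per object type (really 1-8 of them)

    def process_document(raw_json):
        doc = json.loads(raw_json)
        # The real parser builds proper objects; hashable tuples stand in for them here.
        movies.add((doc["id"], doc["title"]))
        for g in doc.get("genres", []):
            genres.add(g)

    def flush_to_db(path="movies.db"):
        # After all documents are processed, each set goes into its own table.
        conn = sqlite3.connect(path)
        with conn:
            conn.executemany("INSERT INTO movies (tmdb_id, title) VALUES (?, ?)",
                             list(movies))
            conn.executemany("INSERT INTO genres (name) VALUES (?)",
                             [(g,) for g in genres])
        conn.close()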
Update 2:
To delnan: I commented out all the lines of code that add objects to the sets, keeping everything else (reading data, parsing, iteration) the same, and the script used roughly 4 GB less memory.
This means it is adding these 200K objects to the sets that makes them take up so much memory. The objects are simple movie data from TMDB: an ID, a list of genres, a list of actors, directors, many other movie details, and possibly a large movie description from Wikipedia.
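For anyone wanting to reproduce the comparison, a simple way to read the peak memory of the run at the end of the script (a sketch; on Linux ru_maxrss is reported in kilobytes):

    # Print peak resident memory of the process at the end of the run.
    import resource

    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print "peak RSS: %.1f MB" % (peak_kb / 1024.0)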