Python: very large set. How to avoid an out-of-memory exception?

I use a Python set to store unique objects. Each object has __hash__ and __eq__ overridden.

The set contains about 200,000 objects. The set itself takes up about 4 GB of memory. It works fine on a machine with more than 5 GB of RAM, but now I need to run the script on a machine that has only 3 GB of RAM.

I rewrote the script in C#: it reads the same data from the same source and puts it into the CLR analogue of set (HashSet), and instead of 4 GB it took about 350 MB, while the execution speed stayed roughly the same (about 40 seconds). But I have to use Python.

Q1: Does Python have any kind of "spill to disk" mechanism or another workaround? I assume it could keep in memory only the "key" data used in the __hash__/__eq__ methods, while everything else is stored on disk. Or maybe Python has other workarounds for keeping a unique collection of objects that needs more memory than is available on the system.
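One possible workaround along those lines is to keep only a small identity key per object in an in-memory set and pickle the full objects to disk, for example with the standard shelve module. This is only a minimal sketch: the key() method, the objects.db filename, and the assumption that the objects are picklable are mine, not from the original script.

    import shelve

    # Sketch of the "keys in memory, objects on disk" idea.  Assumes each
    # object can expose a small hashable identity key (the key() method is
    # hypothetical) and that the objects are picklable.

    seen_keys = set()                      # only the small keys stay in RAM
    store = shelve.open('objects.db')      # full objects are pickled to disk

    def add_unique(obj):
        k = obj.key()                      # e.g. the TMDB id for a movie
        if k in seen_keys:
            return False                   # duplicate: skip it
        seen_keys.add(k)
        store[str(k)] = obj                # shelve keys must be strings
        return True

    # ... process all documents, then:
    # store.close()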

Q2: a less practical question: why does a Python set take up so much more memory for the same data?

I use standard Python 2.7.3 on 64-bit Ubuntu 12.10

Thanks.

Update 1: What the script does:

  • Reads a lot of semi-structured JSON documents (each JSON document consists of a serialized object together with a collection of related objects).

  • Parses each JSON document to extract the main object and the objects from its aggregated collections. Each parsed object is stored in a set; the sets are used to keep only unique objects. At first I used a database, but a unique constraint in the database is 100x-1000x slower. Each JSON document is parsed into 1-8 objects of different types, and each object type is stored in its own set so that only unique objects are kept in memory.

  • All data accumulated in the sets is then saved to a relational database with unique constraints; each set is stored in a separate database table.

The whole idea of the script is to take unstructured data, remove duplicates from the aggregated collections of objects in the JSON documents, and store the structured data in a relational database.
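For illustration, here is a minimal sketch of the dedup stage; the field names ('title', 'genres', 'actors') and the extract_objects() helper are placeholders, not the real schema.

    import json
    from collections import defaultdict

    unique_by_type = defaultdict(set)           # one set per object type

    def extract_objects(doc):
        """Yield (type_name, hashable_object) pairs from one parsed document."""
        yield 'movie', doc['title']             # stand-in for the main object
        for genre in doc.get('genres', []):
            yield 'genre', genre
        for actor in doc.get('actors', []):
            yield 'actor', actor

    def process_document(raw_json):
        for type_name, obj in extract_objects(json.loads(raw_json)):
            unique_by_type[type_name].add(obj)  # relies on __hash__/__eq__

    process_document('{"title": "Alien", "genres": ["horror", "sci-fi"]}')
    # Afterwards each set in unique_by_type is bulk-inserted into its own table.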

Update 2:

To delnan: I commented out all the lines of code that add to the sets, keeping everything else (reading the data, parsing, iterating) the same, and the script used about 4 GB less memory.

This means that it is when these 200K objects are added to the sets that the memory gets consumed. Each object is simple movie data from TMDB: an ID, a list of genres, a list of actors, directors, many other details of the film and, possibly, a long description of the film from Wikipedia.
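For reference, the stored objects look roughly like this (a simplified sketch; I am assuming identity is keyed on the TMDB id, and the field list is far from complete):

    class Movie(object):
        def __init__(self, tmdb_id, title, genres, actors, description):
            self.tmdb_id = tmdb_id              # identity key
            self.title = title
            self.genres = genres                # list of genre names
            self.actors = actors                # list of actor names
            self.description = description      # possibly a long Wikipedia text

        def __hash__(self):
            return hash(self.tmdb_id)

        def __eq__(self, other):
            return isinstance(other, Movie) and self.tmdb_id == other.tmdb_id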

+6
3 answers

Sets do use a lot of memory, but lists don't.

    >>> from sys import getsizeof
    >>> a = range(100)
    >>> b = set(a)
    >>> getsizeof(a)
    872
    >>> getsizeof(b)
    8424

If the only reason you use a set is to prevent duplicates, I would advise you to use a list instead. You can prevent duplicates by testing whether an object is already in your list before adding it. This may be slower than the built-in set machinery, but it will certainly use a lot less memory.
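A minimal illustration of that approach; note that the in test scans the whole list, which is where the slowdown comes from:

    unique_items = []

    def add_if_new(item):
        # Linear membership test: slower than a set, but no hash-table overhead.
        if item not in unique_items:
            unique_items.append(item)

    add_if_new('alien')
    add_if_new('alien')       # duplicate, ignored
    print(unique_items)       # ['alien']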

+4

A better approach would probably be to make the objects you are storing in the set smaller. If they contain unnecessary fields, remove them.

To reduce the per-object overhead, you can also use __slots__ to declare which fields the object uses:

    class Person(object):
        __slots__ = ['name', 'age']

        def __init__(self):
            self.name = 'jack'
            self.age = 99
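As a rough way to see where the savings come from (exact numbers vary by Python version and platform), one can compare a plain instance plus its per-instance __dict__ with a slotted instance:

    import sys

    class PlainPerson(object):
        def __init__(self):
            self.name = 'jack'
            self.age = 99

    class SlottedPerson(object):
        __slots__ = ['name', 'age']
        def __init__(self):
            self.name = 'jack'
            self.age = 99

    p, s = PlainPerson(), SlottedPerson()

    # A plain instance carries a per-instance __dict__; a slotted one does not.
    print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))   # instance + its dict
    print(sys.getsizeof(s))                               # slotted instance only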
+5

Try using __slots__ to reduce your memory usage.

When I last faced this problem with a huge number of objects, using __slots__ reduced memory usage to about a third.

Here is a SO question about __slots__ that you may find interesting.

+2
