Worst-case O(1) time complexity for Python dicts

I need to store about 500M two-character Unicode strings in memory (RAM).

The data structure used must have:

  • Worst-case space complexity: O(n)
  • Worst-case time complexity: O(1) for insertion, read, update, and deletion

I was thinking of using a dict, which is a hash table implementation in Python, but the problem is that it provides O(1) time complexity for the required operations only in the average case, not in the worst case.

I have heard that if the number of records is known in advance, O(1) time complexity can be achieved even in the worst case.

How can I do that?

If this is not possible in Python, can I access memory addresses and the data stored at them directly from my Python code? If so, how?

3 answers

In most cases, the performance hits (usually taken on a collision) are amortized over all calls, so for most realistic uses you will not get O(n) on every call. In fact, the only case where you take an O(n) hit on every call is the pathological case where every key's hash collides with an existing key's hash value (i.e. the worst possible, or unluckiest, use of a hash table).

If, for example, you know your set of keys in advance and you know they will not have hash collisions (i.e. all their hashes are unique), then you will not suffer collision costs. The other main O(n) operation is resizing the hash table, but how often that happens depends on the implementation (growth factor / hash function / collision resolution scheme, etc.) and will also vary from run to run depending on the input data set.
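A quick sanity check of the "all hashes are unique" condition for a known key set might look like this (just a sketch; all_keys is a hypothetical placeholder for your real keys, and this only compares full hash values, not slot-level collisions, which also depend on the table size):

    # Hypothetical known key set; replace with your real two-character keys.
    all_keys = ["aa", "ab", "ac"]

    # If all full hash values are distinct, no two keys share a hash.
    assert len({hash(k) for k in all_keys}) == len(all_keys), "hash collision among keys"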

In any case, you can avoid the sudden slowdowns if you can pre-populate the dict with all the keys. The values can simply be set to None and filled in with their real values later. This way you take the only noticeable performance hit up front, when first "priming" the dict with the keys, and subsequent value insertion should be constant time.
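A minimal sketch of that pre-population idea, reusing the hypothetical all_keys from above:

    # Pre-populate the dict: every key maps to None, so all resizing
    # and collision handling is paid for once, up front.
    table = dict.fromkeys(all_keys)

    # Later updates write to existing slots and never trigger a resize.
    table["ab"] = 42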

A completely different question is how you are going to read/query the structure. Do you need to look up individual values and access them by key? Do they need to be ordered? Perhaps a set would be more appropriate than a dict, if you do not actually need a key:value mapping.

Update:

Based on your description in the comments, this is starting to sound more like a job for a database, even if you are working with a temporary data set. You could use an in-memory relational database (e.g. SQLite). Additionally, you could use an ORM such as SQLAlchemy to interact with the database in a more Pythonic way, without having to write SQL.
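If you go the ORM route, creating an in-memory SQLite engine with SQLAlchemy is a one-liner (a sketch, assuming the sqlalchemy package is installed):

    from sqlalchemy import create_engine

    # An in-memory SQLite database; it disappears when the process ends.
    engine = create_engine("sqlite:///:memory:")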

It even sounds like you may be reading this data out of a database to begin with, so perhaps you can just use that database directly?

Storing/querying/updating a huge number of typed records that are uniquely keyed is exactly what RDBMSs have been specialized for through decades of development and research. Using an in-memory version of an existing relational database (such as SQLite) will probably be the more pragmatic and sustainable choice.

Try using Python's built-in sqlite3 module, and try the in-memory version by passing ":memory:" as the database file path when connecting:

    import sqlite3

    con = sqlite3.connect(":memory:")
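Continuing from that connection, a minimal sketch of how the records could be stored and looked up (the table and column names here are hypothetical, purely for illustration):

    # Hypothetical schema: one row per character code.
    con.execute("CREATE TABLE chars (code INTEGER PRIMARY KEY, value TEXT)")

    # Insert or update a record by its primary key.
    con.execute("INSERT OR REPLACE INTO chars (code, value) VALUES (?, ?)", (1234, "aa"))
    con.commit()

    # Read a record back by its primary key.
    row = con.execute("SELECT value FROM chars WHERE code = ?", (1234,)).fetchone()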

A dictionary technically has a worst case of O(n), but it is very unlikely to occur and probably will not in your case. I would try to use a dictionary and only switch to a different implementation if it turns out not to be sufficient for what you want to do.

Here is a helpful question on the topic.


Is there a reason you care about worst-case performance rather than average performance? Any reasonable hash table will give you O(1) performance on average.

If you really do want worst-case O(1) performance, here are two possible approaches:

  • Keep a vector of max(charCode)-min(charCode) entries and look up the value you want directly by Unicode character code. This works well if your keys fall in a compact enough range that the vector fits in RAM (see the EDIT below for a sketch).

  • Use a brute-force approach to choosing hash functions or dictionary sizes (using a custom dictionary implementation that lets you control this), and keep trying new functions and/or sizes until you get one with no collisions. Expect this to take a very long time; I do not recommend it.

EDIT:

Suppose you know that the minimum character code you will see is 1234 and the maximum you will see is 98765. Further suppose that you have enough RAM to store 98765-1234 elements. I will also assume that you are willing to use the numpy library or some other efficient array implementation. In that case, you can store the values in a vector like this:

    import numpy

    # configuration info
    max_value = 98765  # replace with your number
    min_value = 1234   # replace with your number
    spread = (max_value - min_value)
    dtype = object  # replace with a primitive type if you want to store something simpler

    # create the big vector
    my_data = numpy.empty((spread,), dtype=dtype)

    # insert elements
    my_char_code = ...
    my_value_for_my_char_code = ...
    assert min_value <= my_char_code < max_value
    my_data[my_char_code - min_value] = my_value_for_my_char_code

    # extract elements
    my_char_code = ...
    assert min_value <= my_char_code < max_value
    my_value_for_my_char_code = my_data[my_char_code - min_value]

This is O(1) because the lookup is done with pointer arithmetic and does not depend on the number of elements stored in the array.

This approach can be extremely wasteful of RAM if the number of elements you actually want to store is much smaller than spread. For example, if spread is 4 billion (all of UTF-32), then my_data alone will consume at least 4 billion * 8 bytes per pointer = 32 GB of RAM (and probably much more; I don't know how large Python references are). On the other hand, if min_value is 3 billion and max_value = min_value + 100, the memory usage will be negligible.

