If false positives are acceptable, one possible solution is a Bloom filter. Bloom filters are similar to hash tables, but instead of using a single hash value to index a table of buckets, they use several hashes to index a bitmap. When a string is added, the bits at those indices are set. To check whether a string is in the filter, the string is hashed again, and if all of the corresponding bits are set, the string "is" in the filter (or, more precisely, it probably is).
It does not store any of the string data, so it uses very little memory. The trade-off is that if two strings collide, there is no way to resolve the collision. This means there can be false positives (a string that is not in the filter can hash to the same indices as strings that are in the filter). However, there can be no false negatives; any string that really is in the set will be found in the Bloom filter.
There are several Python implementations. It is also not difficult to roll your own; I remember coding a quick-and-dirty Bloom filter using bitarray, which worked very well.
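To give an idea of how little code rolling your own takes, here is a minimal sketch. It is not the bitarray version mentioned above: it uses a plain `bytearray` from the standard library so that it runs without extra packages. The class name `BloomFilter`, the default sizes, and the choice of deriving the hashes from seeded SHA-256 digests are all assumptions made for illustration, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter for strings (hypothetical helper class)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        # num_bits and num_hashes are arbitrary defaults chosen for the example.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bitmap, all zeros

    def _indices(self, item):
        # Derive num_hashes bit positions by prefixing the data with a seed.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(4, "little") + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        # Set the bit at every index derived from the item.
        for i in self._indices(item):
            self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, item):
        # True only if every derived bit is set; may be a false positive.
        return all(self.bits[i >> 3] & (1 << (i & 7)) for i in self._indices(item))


seen = BloomFilter()
for line in ["alpha", "beta", "gamma"]:
    seen.add(line)

print("beta" in seen)  # always True for an added string (no false negatives)
```

A lookup for a string that was never added will usually return `False`, but it can return `True` if its bits happen to have been set by other strings. Making the bitmap larger relative to the number of stored strings lowers that chance.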