Opposite the bloom filter?

I am trying to optimize a piece of software that basically runs millions of tests. These tests are generated in such a way that there may be some repetition. Of course, I do not want to waste time doing tests that I have already performed, if I can avoid this effectively.

So, I'm thinking of using a Bloom filter to store tests that have already been run. However, the Bloom filter is mistaken on the insecure side for me. It gives false positives. That is, he can report that I checked a test that I do not have. Although this may be acceptable in the scenario I'm working on, I was wondering if there is an equivalent to the Bloom filter, but it is mistaken on the opposite side, that is, it only gives false negatives.

I looked through the literature with no luck.

+50
data-structures bloom-filter
Mar 11 '09 at 18:18
source share
9 answers

Yes, a lossy hash table or LRUCache is an O (1) quick lookup data structure that will give false negatives - if you ask, “If I run test X,” it will tell you: “Yes, you definitely have ", or" I don’t remember. "

Sorry for the extremely rude pseudo code:

setup_test_table(): create test_table( some large number of entries ) clear each entry( test_table, NEVER ) return test_table has_test_been_run_before( new_test_details, test_table ): index = hash( test_details , test_table.length ) old_details = test_table[index].detail // unconditionally overwrite old details with new details, LRU fashion. // perhaps some other collision resolution technique might be better. test_table[index].details = new_test_details if ( old_details === test_details ) return YES else if ( old_details === NEVER ) return NEVER else return PERHAPS main() test_table = setup_test_table(); loop test_details = generate_random_test() status = has_test_been_run_before( test_details, test_table ) case status of YES: do nothing; NEVER: run test (test_details); PERHAPS: if( rand()&1 ) run test (test_details); next loop end. 
+20
Jun 15 '10 at 5:55
source share

The exact data structure that performs this task is a Direct-mapped cache and is commonly used in CPUs.

 function set_member(set, item) set[hash(item) % set.length] = item function is_member(set, item) return set[hash(item) % set.length] == item 
+9
Jun 13 2018-12-12T00:
source share

Is it possible to save tests that you did not run? This should invert the filter behavior.

+6
May 09 '09 at 9:18
source share

What about LRUCache?

0
Jun 17 '09 at 6:46
source share

I'm sorry that I will not help much - I do not think that this is possible. If test execution cannot be ordered, perhaps use a packaged format (8 tests per byte!) Or a good sparse library of arrays to store the results in memory.

0
Jun 17 '09 at 7:17
source share

I think you are giving up part of the decision; to completely eliminate false positives, you still have to keep track of which ones were executed, and essentially use a flowering filter as a shortcut to identify a test that definitely didn't run.

However, since you know the number of tests in advance, you can determine the filter size in such a way as to ensure an acceptable error rate using some well-known formulas; for a 1% probability of a false positive return, you need ~ 9.5 bits / record, so for one million records, 1.2 megabytes is enough. If you reduce the permissible error rate to 0.1%, it only increases to 1.8 MB.

The Wikipedia article Bloom Filters provides an excellent analysis and an interesting overview of the maths in question.

0
May 08 '10 at 16:38
source share
  • Use the bit as above. If you know, no. tests that you are going to run in advance, you will always get the correct results (present, not real) from the data structure.
  • Do you know which keys you will hash? If so, you should run an experiment to see the key distribution in BloomFilter so that you can fine tune it to reproduce false positives or what you have.
  • You might also want to check out HyperLogLog.
0
Feb 17 '15 at 15:02
source share

No, and if you think about it, it will not be very useful. In your case, you cannot be sure that your test run will ever stop, because if there are always “false negatives”, there will always be tests that need to be run ...

I would say that you just need to use a hash.

-one
Mar 11 '09 at 18:41
source share

The data structure you expect does not exist. Since such a data structure should be ambiguous, or, say, a limited set of states. In the same internal state there must be at least two different inputs. Thus, you cannot determine whether both (or more) of them are in the set, only knowledge of at least one of these inputs exists.

EDIT This statement is true only if you are looking for an efficient data structure with memory. If memory is not limited, you can always get a data structure to get 100% accurate results, saving each element of the element.

-one
Jul 24 2018-12-12T00:
source share



All Articles