I am a graduate student in physics, and I am writing code to sort a few hundred gigabytes of data and return pieces of that data when I ask for it. Here's the catch: I don't know a good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. Each set can contain anywhere from 1 to n numbers (in 99.9% of sets, n is less than 15), and there are roughly 1.5-2 billion such sets (unfortunately, a size that rules out brute-force search).
I need to be able to specify a query set of k elements and have every set that contains those k elements (whether it has exactly k elements or more) returned to me.
A simple example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I were to give the query (1,3), I would get back the sets (1,2,3), (1,2,3,4,5), and (1,3,8,9).
The query (11) would return the set (5,8,11).
The query (1,2,3) would return the sets (1,2,3) and (1,2,3,4,5).
The query (50) would return no sets.
By now the pattern should be clear. The main differences between this example and my data are that my sets are larger, the numbers used for each element range from 0 to 16383 (14 bits), and there are many, many more sets.
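Just to pin down the query semantics, here is a minimal brute-force sketch in C++ of what I mean (the names are purely illustrative, and a linear scan like this is exactly what the data size rules out):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using Set = std::vector<uint16_t>;   // each element fits in 14 bits

    // Return every stored set that contains all elements of `query`.
    // Both the stored sets and the query are assumed to be kept sorted.
    std::vector<Set> find_supersets(const std::vector<Set>& data, const Set& query)
    {
        std::vector<Set> hits;
        for (const Set& s : data) {
            // std::includes tests whether the sorted range `query`
            // is contained within the sorted range `s`.
            if (std::includes(s.begin(), s.end(), query.begin(), query.end()))
                hits.push_back(s);
        }
        return hits;
    }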
If it matters, I am writing this program in C++, although I also know Java, C, some assembly, some Fortran, and some Perl.
Does anyone have any tips on how to do this?
edit:
To answer a couple of questions and add a few points:
1.) The data does not change. It was all taken in one long series of runs (each split into 2-gigabyte files).
2.) As for storage space: the raw data is about 250 gigabytes. I estimate that after processing and stripping out a lot of extraneous metadata I don't care about, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much metadata I decide to keep (not counting indexes). Also, if during the initial processing I come across enough sets that are identical, I might be able to compress the data further by storing a counter for repeated events rather than repeating the events over and over.
3.) Each number within a processed set actually contains AT LEAST two numbers: 14 bits for the data itself (the detected energy) and 7 bits of metadata (the detector number). So I will need at least three bytes per number (a rough packing sketch is below, after this list).
4.) My "although in 99.9% of the sets, n is less than 15," the comment is misleading. Having previously looked at some pieces of data, I found that I have sets that contain up to 22 numbers, and the median is 5 numbers per unit, and the average is 6 numbers in the set.
5.) Although I like the idea of an index of pointers into the file, I am a little wary because for queries containing more than one number I am left with the (at least I think) slow task of finding the set of pointers common to all the lists, i.e. intersecting the lists to find the sets shared by every queried number (a sketch of what I mean is below, after this list).
6.) As for the resources available to me, I can scrape together about 300 gigabytes of space once the raw data is on the system (the remainder of my quota on that system). The system is a dual-processor server with two quad-core CPUs and 16 gigabytes of RAM.
7.) Yes, 0 can occur; when it does, it is an artifact of the data acquisition system, but it can happen.
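For point 3, here is roughly how I picture packing one entry into three bytes; the exact bit layout is just one possibility I have in mind, not anything fixed:

    #include <cstdint>

    // 14 bits of energy + 7 bits of detector number = 21 bits,
    // so one entry fits in three bytes with 3 bits to spare.
    struct PackedEntry { uint8_t b[3]; };

    inline PackedEntry pack(uint16_t energy /*0..16383*/, uint8_t detector /*0..127*/)
    {
        uint32_t v = (uint32_t(detector) << 14) | (energy & 0x3FFFu);
        return PackedEntry{ { uint8_t(v & 0xFF),
                              uint8_t((v >> 8) & 0xFF),
                              uint8_t((v >> 16) & 0xFF) } };
    }

    inline void unpack(PackedEntry p, uint16_t& energy, uint8_t& detector)
    {
        uint32_t v = uint32_t(p.b[0]) | (uint32_t(p.b[1]) << 8) | (uint32_t(p.b[2]) << 16);
        energy   = uint16_t(v & 0x3FFFu);
        detector = uint8_t((v >> 14) & 0x7Fu);
    }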
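And for point 5, this is the step I'm worried about: given one sorted list of set IDs (or file offsets) per query value, coming from whatever index I build, intersect the lists to get the sets containing every query value. A plain merge-style intersection, starting from the shortest list, is the best I've come up with so far (names and types are just illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using IdList = std::vector<uint64_t>;   // sorted set IDs or file offsets

    // Intersect two sorted lists with a linear merge.
    IdList intersect(const IdList& a, const IdList& b)
    {
        IdList out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if      (a[i] < b[j]) ++i;
            else if (b[j] < a[i]) ++j;
            else { out.push_back(a[i]); ++i; ++j; }
        }
        return out;
    }

    // Intersect one list per query value, shortest first, so the running
    // result shrinks as quickly as possible (and we can stop early if it empties).
    IdList intersect_all(std::vector<IdList> lists)
    {
        if (lists.empty()) return {};
        std::sort(lists.begin(), lists.end(),
                  [](const IdList& x, const IdList& y) { return x.size() < y.size(); });
        IdList result = lists.front();
        for (size_t k = 1; k < lists.size() && !result.empty(); ++k)
            result = intersect(result, lists[k]);
        return result;
    }

Whether a pass like this is fast enough at the scale I'm dealing with is exactly what I'm unsure about.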