I am a graduate student in physics, and I am writing code to sort a few hundred gigabytes of data and return pieces of that data when I ask for it. Here's the catch: I don't know a good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. Each set can contain anywhere from 1 to n numbers (in 99.9% of sets, n is less than 15), and there are roughly 1.5-2 billion such sets (unfortunately, a size that rules out brute-force search).
I need to be able to specify a query set of k elements and have every set that contains those k elements (whether it has exactly k elements or more) returned to me.
A simple example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I were to give the query (1,3), I would get back the sets (1,2,3), (1,2,3,4,5), and (1,3,8,9).
The query (11) would return the set (5,8,11).
The query (1,2,3) would return the sets (1,2,3) and (1,2,3,4,5).
The query (50) would return no sets.
By now the pattern should be clear. The main differences between this example and my data are that my sets are larger, the numbers used for each element range from 0 to 16383 (14 bits), and there are many, many more sets.
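Just to pin down the query semantics, here is a minimal brute-force sketch in C++ of what I mean (the names are purely illustrative, and a linear scan like this is exactly what the data size rules out):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using Set = std::vector<uint16_t>;   // each element fits in 14 bits

    // Return every stored set that contains all elements of `query`.
    // Both the stored sets and the query are assumed to be kept sorted.
    std::vector<Set> find_supersets(const std::vector<Set>& data, const Set& query)
    {
        std::vector<Set> hits;
        for (const Set& s : data) {
            // std::includes tests whether the sorted range `query`
            // is contained within the sorted range `s`.
            if (std::includes(s.begin(), s.end(), query.begin(), query.end()))
                hits.push_back(s);
        }
        return hits;
    }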
If it matters, I am writing this program in C++, although I also know Java, C, some assembly, some Fortran, and some Perl.
Does anyone have any tips on how to do this?
edit:
To answer a couple of questions and add a few points:
1.) The data does not change. It was all taken in one long series of runs (each split into 2-gigabyte files).
2.) As for storage space: the raw data is about 250 gigabytes. I estimate that after processing and stripping out a lot of extraneous metadata I don't care about, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much metadata I decide to keep (not counting indexes). Also, if during the initial processing I come across enough sets that are identical, I might be able to compress the data further by storing a counter for repeated events rather than repeating the events over and over.
3.) Each number within a processed set actually contains AT LEAST two numbers: 14 bits for the data itself (the detected energy) and 7 bits of metadata (the detector number). So I will need at least three bytes per number (a rough packing sketch is below, after this list).
4.) My "although in 99.9% of the sets, n is less than 15," the comment is misleading. Having previously looked at some pieces of data, I found that I have sets that contain up to 22 numbers, and the median is 5 numbers per unit, and the average is 6 numbers in the set.
5.) Although I like the idea of an index of pointers into the file, I am a little wary because for queries containing more than one number I am left with the (at least I think) slow task of finding the set of pointers common to all the lists, i.e. intersecting the lists to find the sets shared by every queried number (a sketch of what I mean is below, after this list).
6.) As for the resources available to me, I can scrape together about 300 gigabytes of space once the raw data is on the system (the remainder of my quota on that system). The system is a dual-processor server with two quad-core CPUs and 16 gigabytes of RAM.
7.) Yes, 0 can occur; when it does, it is an artifact of the data acquisition system, but it can happen.
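For point 3, here is roughly how I picture packing one entry into three bytes; the exact bit layout is just one possibility I have in mind, not anything fixed:

    #include <cstdint>

    // 14 bits of energy + 7 bits of detector number = 21 bits,
    // so one entry fits in three bytes with 3 bits to spare.
    struct PackedEntry { uint8_t b[3]; };

    inline PackedEntry pack(uint16_t energy /*0..16383*/, uint8_t detector /*0..127*/)
    {
        uint32_t v = (uint32_t(detector) << 14) | (energy & 0x3FFFu);
        return PackedEntry{ { uint8_t(v & 0xFF),
                              uint8_t((v >> 8) & 0xFF),
                              uint8_t((v >> 16) & 0xFF) } };
    }

    inline void unpack(PackedEntry p, uint16_t& energy, uint8_t& detector)
    {
        uint32_t v = uint32_t(p.b[0]) | (uint32_t(p.b[1]) << 8) | (uint32_t(p.b[2]) << 16);
        energy   = uint16_t(v & 0x3FFFu);
        detector = uint8_t((v >> 14) & 0x7Fu);
    }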
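And for point 5, this is the step I'm worried about: given one sorted list of set IDs (or file offsets) per query value, coming from whatever index I build, intersect the lists to get the sets containing every query value. A plain merge-style intersection, starting from the shortest list, is the best I've come up with so far (names and types are just illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using IdList = std::vector<uint64_t>;   // sorted set IDs or file offsets

    // Intersect two sorted lists with a linear merge.
    IdList intersect(const IdList& a, const IdList& b)
    {
        IdList out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if      (a[i] < b[j]) ++i;
            else if (b[j] < a[i]) ++j;
            else { out.push_back(a[i]); ++i; ++j; }
        }
        return out;
    }

    // Intersect one list per query value, shortest first, so the running
    // result shrinks as quickly as possible (and we can stop early if it empties).
    IdList intersect_all(std::vector<IdList> lists)
    {
        if (lists.empty()) return {};
        std::sort(lists.begin(), lists.end(),
                  [](const IdList& x, const IdList& y) { return x.size() < y.size(); });
        IdList result = lists.front();
        for (size_t k = 1; k < lists.size() && !result.empty(); ++k)
            result = intersect(result, lists[k]);
        return result;
    }

Whether a pass like this is fast enough at the scale I'm dealing with is exactly what I'm unsure about.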