DISCLAIMER: I have no practical experience with this, but it is a fun exercise, and after some thought, this is what I came up with:
Since you have 2,000,000 documents, you are pretty much stuck with an integer for the document identifiers, which is 4 bytes + 4 bytes per pair. The comparison value is apparently between 0.00 and 1.00, so I think a single byte will do, encoding 0.00-1.00 as 0..100.
So your table will be: id1, id2, relationship_value
That brings us to exactly 9 bytes per record. Thus (without any overhead) you need about ((2 * 10^6)^2) * 9 / 2 bytes, which is roughly 17 TB.
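For concreteness, a minimal sketch of that base table (assuming SQL Server, since varbinary and SUBSTRING come up below; all names here are made up):

```sql
-- Straightforward pair table: one row per (id1, id2) pair with id1 < id2,
-- storing each pair only once. The 0.00-1.00 score is kept as 0..100 in a tinyint.
CREATE TABLE dbo.DocPairs
(
    id1   int     NOT NULL,  -- lower document id
    id2   int     NOT NULL,  -- higher document id
    score tinyint NOT NULL,  -- relationship * 100
    CONSTRAINT PK_DocPairs PRIMARY KEY (id1, id2)
);
```

(The 9 bytes above count only the raw columns; the primary key and row headers add overhead on top of that.)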
That is, of course, if you use just a plain table like that. Since you do not plan to query it often, I believe performance is not that big a problem, so you could get “creative” and store the values horizontally. Simplifying a bit, you would store a 2-million-by-2-million square of values, where each “intersection” is a single byte representing the relationship between its coordinates. That would “only” require about 3.6 TB, but it would be painful to maintain, and it would not exploit the fact that the relationship is symmetrical.
So, I would suggest a hybrid approach: a table with two columns. The first column holds the “left” document identifier (4 bytes); the second column holds, as a varbinary, one byte for every document whose identifier is higher than the one in the first column, starting with the identifier right above it. Since varbinary only takes up the space it actually needs, this lets us reclaim the space offered by the symmetry of the relation.
In other words,
- record 1 would hold a string of (2,000,000 - 1) bytes as the value of the second column
- record 2 would hold a string of (2,000,000 - 2) bytes as the value of the second column
- record 3 would hold a string of (2,000,000 - 3) bytes as the value of the second column
- etc.
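A rough sketch of that hybrid table (again assuming SQL Server; the table and column names are my own invention):

```sql
-- Hybrid layout: one row per document, with the relations to all higher-numbered
-- documents packed into a single varbinary value.
-- Byte 1 = relation to (id + 1), byte 2 = relation to (id + 2), and so on.
CREATE TABLE dbo.DocRelations
(
    id         int            NOT NULL PRIMARY KEY,  -- the “left” document id
    rel_values varbinary(max) NOT NULL               -- up to 2,000,000 - id bytes
);
```

varbinary(max) is needed here because the longest value (almost 2,000,000 bytes) is far beyond the 8,000-byte limit of plain varbinary(n); anything over 8,000 bytes gets stored off-row as a LOB.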
This way you can get away with something like 2 TB (with overhead) to store the information: the second column adds up to roughly 2,000,000 * 2,000,000 / 2 ≈ 2 * 10^12 bytes. Add compression on top of that, and I'm sure you can fit it on a modern drive.
Of course, the scheme is far from optimal. In fact, retrieving information will require some patience, since you cannot approach things set-based and will pretty much have to scan byte by byte. A nice “advantage” of this approach is that you can easily add new documents by appending one byte to EACH existing record + 1 extra row at the end. Such an operation will be expensive, as it will cause page splits, but at least it is possible without completely rewriting the table. Over time, though, it will lead to quite a bit of fragmentation, and you may want to rebuild the table once in a while to keep it “aligned”. Ah... technical problems.
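A hypothetical sketch of that “add a document” step, reusing the DocRelations table from above (the constant score is just a placeholder; in reality each row gets its own computed value):

```sql
-- Adding document number @newId: append one byte to every existing record,
-- then add an empty row for the new document itself.
DECLARE @newId int     = 2000001;  -- made-up id of the new document
DECLARE @score tinyint = 50;       -- placeholder; really computed per pair

-- The + operator concatenates varbinary values, so this appends one byte per row.
UPDATE dbo.DocRelations
SET rel_values = rel_values + CAST(@score AS binary(1));

-- The new document has no higher-numbered documents yet, so its blob starts empty.
INSERT INTO dbo.DocRelations (id, rel_values)
VALUES (@newId, 0x);
```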
Selecting and updating will require some creative use of SUBSTRING() operations, but nothing too complicated.
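For example, something along these lines should work for reading and writing a single pair, assuming the DocRelations sketch above and that @id1 < @id2 (swap them first if not); the variable values are made up:

```sql
DECLARE @id1 int = 42, @id2 int = 1337, @newScore tinyint = 89;

-- Read: the relation to document @id2 sits at byte offset (@id2 - @id1) in the blob.
SELECT CAST(SUBSTRING(rel_values, @id2 - @id1, 1) AS tinyint) AS score
FROM dbo.DocRelations
WHERE id = @id1;

-- Write: splice the new byte in by rebuilding the value around it.
UPDATE dbo.DocRelations
SET rel_values =
        SUBSTRING(rel_values, 1, @id2 - @id1 - 1)
      + CAST(@newScore AS binary(1))
      + SUBSTRING(rel_values, @id2 - @id1 + 1, DATALENGTH(rel_values) - (@id2 - @id1))
WHERE id = @id1;
```

Since the column is varbinary(max), the .WRITE() clause of UPDATE should also let you patch a single byte in place without rebuilding the whole value.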
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze out the last bit, you could pack 8 values into 7 bytes and save roughly another 250 GB, but that would make things a bit more complicated... then again, it's not as if the data was going to be human-readable anyway =)
PS: This line of thinking is aimed entirely at reducing the amount of space required while staying practical in terms of updating the data. I am not saying it will be fast; in fact, if you go looking for all documents that have a ratio of 0.89 or higher, the system will have to scan the entire table, and even with modern disks that will take a while.
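To make that concrete, such a search would look something like the query below (illustrative only, reusing the DocRelations sketch and assuming an auxiliary numbers table dbo.Numbers(n) with n = 1..2,000,000 already exists), and it has to unpack every single byte:

```sql
-- All pairs with a score of 0.89 or higher (89 after the 0..100 encoding).
SELECT r.id                                             AS id1,
       r.id + n.n                                       AS id2,
       CAST(SUBSTRING(r.rel_values, n.n, 1) AS tinyint) AS score
FROM dbo.DocRelations AS r
JOIN dbo.Numbers      AS n
  ON n.n <= DATALENGTH(r.rel_values)
WHERE SUBSTRING(r.rel_values, n.n, 1) >= 0x59;  -- 0x59 = 89
```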
Keep in mind that all of this is the result of an hour of brainstorming; I actually hope someone will chime in with a neater approach =)