Saving trillions of document similarities

I wrote a program to calculate the pairwise similarity across a set of 2 million documents. The program works, but I am having trouble storing the results. I will not need to access the results often, but occasionally I will need to query them and pull out subsets for analysis. The output basically looks like this:

    1,2,0.35
    1,3,0.42
    1,4,0.99
    1,5,0.04
    1,6,0.45
    1,7,0.38
    1,8,0.22
    1,9,0.76
    ...

Columns 1 and 2 are document identifiers, and column 3 is a similarity score. Since similarity is symmetric, I only need to compute each pair once, but that still leaves me with 2,000,000 * (2,000,000 - 1) / 2 ≈ 2,000,000,000,000 (2 trillion) lines of records.

A text file with 1 million lines of records is already about 9 MB. Extrapolating, that means I would need roughly 18 TB to store the full results as text files.

Is there a more efficient way to store this kind of data? I could use one row per document and drop the duplicated document identifier from the first column, but that only gets me so far. What about file formats, or specialized database systems? This must be a common problem in big data; I have seen papers and blogs report similar analyses, but none of them discuss practical aspects like storage.

performance sql csv bigdata
1 answer

DISCLAIMER: I have no practical experience with this, but it is a fun exercise, and after some thought this is what I came up with:

Since you have 2,000,000 documents, you are pretty much stuck with 4-byte integers for the document identifiers, which gives 4 + 4 bytes per pair. The similarity is apparently between 0.00 and 1.00, so a single byte will do if you encode 0.00-1.00 as 0..100.

So your table will be: id1, id2, relationship_value

That comes to exactly 9 bytes per record. So (without any overhead) you need ((2 * 10^6)^2 * 9) / 2 bytes, which is about 18 TB.
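For reference, a minimal sketch of that base table in SQL Server syntax (the table name is my own choice):

    -- One row per document pair: 4 + 4 + 1 = 9 bytes of raw data per record.
    CREATE TABLE dbo.pair_similarity (
        id1                int     NOT NULL,  -- lower document id
        id2                int     NOT NULL,  -- higher document id (id2 > id1)
        relationship_value tinyint NOT NULL   -- similarity * 100, i.e. 0..100
    );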

That is, of course, if you stick with just the base table. Since you do not plan to query it often, I assume performance is not that big a problem, so you could get "creative" and store the values horizontally instead. Simplifying a bit, you would store a 2-million-by-2-million square, where each "intersection" is a byte representing the relationship between the documents at those coordinates. That would "only" require about 4 TB, but it would be painful to maintain, and it would not exploit the fact that the relationship is symmetric.

So I would suggest a hybrid approach: a table with two columns. The first column holds the "left" document identifier (4 bytes); the second column holds a string of bytes with the values for every document whose identifier is higher than the one in the first column, stored as varbinary. Since varbinary only takes the space it needs, the symmetry of the relation gives us some space back (a schema sketch follows the list below).

In other words,

  • record 1 would have a string of 2,000,000 - 1 bytes as the value of the second column
  • record 2 would have a string of 2,000,000 - 2 bytes as the value of the second column
  • record 3 would have a string of 2,000,000 - 3 bytes as the value of the second column
  • etc.
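A rough sketch of that layout (again assuming SQL Server; the names are my own). Note that it needs varbinary(max), since a plain varbinary(n) tops out at 8,000 bytes and these strings run to almost 2 million:

    -- Byte k of sims holds the 0..100 similarity between doc_id and doc_id + k.
    CREATE TABLE dbo.doc_similarity (
        doc_id int            NOT NULL PRIMARY KEY,
        sims   varbinary(max) NOT NULL
    );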

This way you could get away with something like 2 TB (plus overhead) to store the information. Add compression on top, and I am fairly sure it will fit on a modern drive.

Of course, the system is far from optimal. In fact, querying it will require some patience, since you cannot approach things in a set-based way and will pretty much have to scan byte by byte. A nice "advantage" of this approach is that you can easily add new documents by appending one byte to EACH existing record plus one extra row at the end. Such an operation will be expensive, since it causes page splits; but at least it is possible without rewriting the whole table. Over time this will lead to quite a bit of fragmentation, though, and you may want to rebuild the table once in a while to keep it "aligned". Ah... technical problems.
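As a hypothetical sketch of that append step (assuming the doc_similarity table above, plus a staging table new_sims, which is my own invention, holding the freshly computed 0..100 similarities between the new document and every existing one):

    DECLARE @newId int = 2000001;

    -- Append one byte (the similarity to the new document) to every existing record.
    UPDATE s
    SET    sims = s.sims + CAST(ns.sim_value AS varbinary(1))
    FROM   dbo.doc_similarity AS s
    JOIN   dbo.new_sims       AS ns ON ns.doc_id = s.doc_id;

    -- The new document has no "higher" ids yet, so its own row starts out empty.
    INSERT INTO dbo.doc_similarity (doc_id, sims)
    VALUES (@newId, 0x);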

Selecting and updating will require some creative use of SUBSTRING() operations, but nothing too complicated.
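For example, a point lookup might look like this (a sketch against the table above; since record @id1 stores the values for documents @id1+1 onward in order, the byte for @id2 sits at 1-based offset @id2 - @id1):

    DECLARE @id1 int = 17, @id2 int = 42;  -- assumes @id1 < @id2

    SELECT CAST(SUBSTRING(sims, @id2 - @id1, 1) AS tinyint) / 100.0 AS similarity
    FROM   dbo.doc_similarity
    WHERE  doc_id = @id1;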

PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze out the last bit you could pack 8 values into 7 bytes and save roughly another 250 GB, but that would make things a little more complicated... and the data was never going to be human-readable anyway =)

PS: this line of thinking is aimed entirely at reducing the amount of space required while staying practical in terms of updating the data. I am not saying it will be fast; in fact, if you go looking for all document pairs with a similarity of 0.89 or higher, the system will have to scan the entire table, and even on modern disks that will take a while.
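To illustrate, such a scan might be expressed like this (a sketch only; the numbers helper table, holding integers 1..2,000,000, is my own addition):

    -- Unpack every byte of every row and keep the pairs scoring 0.89 or higher.
    SELECT s.doc_id       AS id1,
           s.doc_id + n.n AS id2,
           CAST(SUBSTRING(s.sims, n.n, 1) AS tinyint) / 100.0 AS similarity
    FROM   dbo.doc_similarity AS s
    JOIN   dbo.numbers        AS n ON n.n <= DATALENGTH(s.sims)
    WHERE  CAST(SUBSTRING(s.sims, n.n, 1) AS tinyint) >= 89;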

Keep in mind that all of this is the result of an hour of brainstorming; I am actually hoping someone will come back with a better approach =)

