Binary search or B-tree index update problem

Imagine that every day the author hands you a new draft of his book. The book is a work in progress, and he does not tell you what he has changed or added.

Your task is to identify the changes and additions and pass only those on to the publisher (who does not have time to reread the whole book every day).

For the purposes of this problem, the book consists of 1 million lines of ASCII text and keeps growing (it is actually a MySQL dump file).

My current idea is to compute a secure hash (e.g. SHA-256) of each line (about 1k characters) and save it to disk. Since each hash is only 32 bytes, the hash file is only 32 MB.

Then, when we get the next file tomorrow, we go through it line by line, creating a new hash for each line and comparing it with the hash from the previous day.

When the process is complete, we overwrite the hash file ready for the next day.
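
A minimal sketch of that hashing step (in Python, with made-up file names), assuming the dump really is one record per line:

    import hashlib

    def write_line_hashes(dump_path, hash_path):
        # One 32-byte SHA-256 digest per line, written back to back:
        # 1 million lines -> roughly 32 MB of hashes.
        with open(dump_path, "rb") as dump, open(hash_path, "wb") as out:
            for line in dump:
                out.write(hashlib.sha256(line).digest())

    write_line_hashes("backup_today.sql", "hashes_today.bin")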

The comparison uses binary search on the hashes, with string comparisons (the < and > operators). This returns a result in an average of four iterations.

I haven't coded the B-tree index solution yet, but how would you handle this?

+4
6 answers

I would use diff.

If I needed to implement it in my own program, I would use one of the algorithms for finding the longest common subsequence of two sequences, treating each file as a sequence of lines.
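
For illustration, Python's difflib already does roughly this: its SequenceMatcher finds the matching blocks between the two line sequences, which is the same idea as an LCS-based diff (file names below are placeholders):

    import difflib

    with open("backup_yesterday.sql") as f:
        old_lines = f.readlines()
    with open("backup_today.sql") as f:
        new_lines = f.readlines()

    # unified_diff is driven by SequenceMatcher, which finds the matching
    # blocks between the two line sequences.
    for change in difflib.unified_diff(old_lines, new_lines,
                                       fromfile="yesterday", tofile="today"):
        print(change, end="")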

+1

"Then, when we get the next file tomorrow, we go through it line by line, creating a new hash for each line and comparing it with the hash from the previous day.

OK: 1 million of today's hash values compared to 1 million of yesterday's values.

Are lines ever inserted or deleted? If not, this is a simple matter of reading the two hash lists in parallel and checking whether the hashes differ.

If there are insertions or deletions, you will need a diff algorithm to determine the scope of the changes.

All of that is fine and not too hard to implement.
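
For the no-insert/no-delete case, the lockstep comparison might look like the following sketch (the paths and the 32-byte digest size are assumptions carried over from the question):

    HASH_SIZE = 32  # length of a SHA-256 digest

    def changed_line_numbers(old_hash_file, new_hash_file):
        changed = []
        with open(old_hash_file, "rb") as old, open(new_hash_file, "rb") as new:
            line_no = 0
            while True:
                a = old.read(HASH_SIZE)
                b = new.read(HASH_SIZE)
                if not a or not b:      # one of the files ran out of hashes
                    break
                if a != b:              # same position, different hash -> changed line
                    changed.append(line_no)
                line_no += 1
        return changed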

In this context, the following does not make sense:

"The comparison uses binary search on the hashes, with string comparisons (the < and > operators). This returns a result in an average of four iterations."

Is there some kind of sorting of the hash values? Or some kind of tree structure?

0

A book of 1 million lines is huge: perhaps 30-50 lines per page, so let's be generous and assume 100 lines per page, which means 10,000 pages per book.

Lines of 1 KB are also much longer than normal; basic readability suggests nowhere near that many characters per line. Do you intend to hash lines of up to 1 KB, or to cut the file into 1 KB chunks? One problem with your scheme is that any duplicated lines will have identical hashes; you can never tell when one of those lines was added or deleted.

You apparently also need to notify the publisher of deleted lines.

Like Glomek, I would use diff on the file. If you keep the file under RCS or CVS, you would store only the current version of the file plus the differences from previous versions. That would also let you provide cumulative differences over a week or a month.

And I probably would not develop my own B-tree indexing either.
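
If you follow the diff suggestion from a script, a minimal sketch might just shell out to the diff utility and keep yesterday's snapshot around (file names are placeholders; this assumes a Unix-like diff is installed):

    import subprocess

    # diff exits with status 1 when the files differ; its unified output is
    # effectively the change list to hand to the publisher.
    result = subprocess.run(
        ["diff", "-u", "backup_yesterday.sql", "backup_today.sql"],
        capture_output=True, text=True)

    with open("changes.patch", "w") as patch:
        patch.write(result.stdout)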

0

The solution you describe is pretty similar to the rsync algorithm. One important point is that rsync has to recognize existing chunks anywhere in the target file, at any offset from the original.

If your files are really line-structured, you can simplify things a bit, as you suggest. If not, you need a rolling checksum.

Also, do you need to recognize reorderings, or just inserts/deletes/replacements?

The most general case is the full rsync algorithm, which goes like this (a rough code sketch follows the list):

  • define the parameters:

    • choose a block size: 512 bytes or 1 KB usually works fine.
    • choose a "strong" checksum, something like MD4; even 64 bits is plenty.
    • choose a "weak" checksum: one that lets you "subtract" the leading byte of a block and "add" the byte just past its end, giving the checksum of the block one byte further on. A 16-bit checksum usually works fine.
  • compute a signature of the old file:

    • walk through the entire old file, computing both the weak and the strong checksum for each block. With 16-bit and 64-bit checksums and 512-byte blocks, that is 10 bytes per block, or 20 KB per megabyte. This is the "signature".
  • create a "patch" from the new file and the signature of the old file:

    • load the signature of the old file, ideally into a hash table with the weak checksums as keys and the strong checksums and block positions as values.
    • read the first block of the new file.
    • compute the weak checksum of the loaded block.
    • look the weak checksum up in the hash table.
    • if it is there, compute the strong checksum and compare it with the one stored in the hash table.
    • if both checksums match, emit a "got it" reference to that block of the old file, advance one whole block, and go back to step 3 (compute the weak checksum).
    • if the strong checksum did not match, or the weak checksum was not in the hash table, "roll" the weak checksum: add the byte just past the end of the block and subtract the byte at its start.
    • append the byte that was "subtracted" to the list of "new" bytes in the patch.
    • go back to step 4 (look up the weak checksum).
  • apply the patch to the old file:

    • "patch" is a list of "new" bytes that occurred when pumping the checksum, as well as a list of "received" blocks that correspond to the old file.
0

This is the method used for incremental loads into a data warehouse. When you cannot identify changed data in the source system, you can take a snapshot of the data and compare it with the previous snapshot to find the differences. This technique is even mentioned in Ralph Kimball's book on the subject, and it is used in an application I helped develop.

You need a hash algorithm with a very wide key, because this approach is vulnerable to birthday collisions. MD5 or anything in the SHA family would be fine. It also cannot detect deletions without a post-process that walks through the differences looking for missing natural keys, and that computation really needs to know the table structure.
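
A hedged sketch of that snapshot-comparison idea: assuming each row has a natural key as its first tab-separated field (a made-up layout, purely for illustration), hashing whole rows lets you classify inserts, updates and deletes between two snapshots:

    import hashlib

    def snapshot(path):
        # Map natural key -> hash of the whole row; assumes one row per line
        # and that the key is the first tab-separated field.
        rows = {}
        with open(path, "rb") as f:
            for line in f:
                key = line.split(b"\t", 1)[0]
                rows[key] = hashlib.sha256(line).digest()
        return rows

    old = snapshot("snapshot_yesterday.tsv")
    new = snapshot("snapshot_today.tsv")

    inserts = new.keys() - old.keys()
    deletes = old.keys() - new.keys()
    updates = {k for k in new.keys() & old.keys() if new[k] != old[k]}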

0

"One problem with your scheme is that any duplicated lines will have identical hashes; you can never tell when one of those lines was added or deleted."

A very good point, but not a problem here: a duplicate row is just a duplicate, and all duplicates are dropped in the next processing step. So yes, you are right, but it does not matter.

Does "diff" link to a page describing what I assume this application is? There is no download link, no language in any language ... What am I missing here?

Some of you have talked about byte-level granularity. That is not needed; only line-level granularity is required, because if anything in a line has changed, the entire line (record) must be processed, since any change within the line affects the whole line.

So we are comparing lines of about 1000 characters (no binary data) across two files (today's snapshot and yesterday's snapshot), each of which is about 1 million lines.

Using a secure hash such as SHA-256 (MD5 has known collisions and is slow by comparison), I can process about 30 MB/s on my laptop. A server will, of course, chew through it much faster.

So if the file is around 1 GB, it takes about 33 seconds to create all the hashes, and reading a 1 GB file through a paged memory window takes about 30 seconds. Not terrifying.

Now we have two arrays of hashes representing the lines of each file. If we sort them, we can use binary search: we iterate through the new file's hashes, looking each one up among the old file's hashes. If we do not find a match, that line is written to the changes file.
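
A small sketch of that sorted-hash lookup, using Python's bisect for the binary search (file names are placeholders):

    import bisect
    import hashlib

    def line_hashes(path):
        with open(path, "rb") as f:
            return [hashlib.sha256(line).digest() for line in f]

    # Sort yesterday's hashes once, then binary-search each of today's hashes.
    old_hashes = sorted(line_hashes("backup_yesterday.sql"))

    with open("backup_today.sql", "rb") as today, open("changes.sql", "wb") as out:
        for line in today:
            h = hashlib.sha256(line).digest()
            i = bisect.bisect_left(old_hashes, h)
            if i == len(old_hashes) or old_hashes[i] != h:
                out.write(line)   # no match yesterday: new or changed line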

Keep in mind that this book of lines (a legacy database) is unknown in every respect: there are no guarantees about row order, the location of changes, or the kind of changes.

The suggestions to read the files side by side are good, but they assume the two files are in the same order up to the first change. That cannot be assumed here: the lines (rows) can be in any order. Also, choosing an arbitrary block size violates the line granularity. For the purposes of this task, lines are immutable.

From this excellent incremental load link: "File comparison capture (also known as the snapshot differential method): this technique works by keeping before and after images of the files that are of concern to the data warehouse. Records are compared to find changes, and record keys are compared to find inserts and deletes. This technique is most appropriate for legacy systems, where triggers generally do not exist and transaction logs are either nonexistent or in a proprietary format. Since most legacy databases have some mechanism for dumping data into files, this technique creates periodic snapshots and then compares the results to produce change records. Certainly, all the problems of static capture are present here. Added complexity is introduced by the challenge of comparing entire rows of information and of identifying and matching keys. This technique is complex in nature and generally not desirable, but in some cases it may be the only solution."

This is most relevant here: "As we move into the realm of terabyte data warehouses, the ability to rebuild the data warehouse from scratch on a nightly basis will go the way of the dinosaur. The logical and efficient approach to updating the data warehouse involves some form of incremental update strategy."

So, am I on the right track? Would a B-tree index offer any benefit?

0
