One of the problems with your scheme is that any duplicate lines will produce the same hash; you can never tell when one of those lines was added or deleted.
Very good point, but not a problem here: a duplicate row is just a duplicate, and all duplicates are dropped in the next processing step anyway. So yes, you are right, but it doesn't hurt anything.
Does "diff" link to a page describing what I assume this application is? There is no download link, no language in any language ... What am I missing here?
Some of you have talked about byte-level granularity. It's not needed; granularity is only needed at the row level, because if anything within a row has changed, the entire row (record) must be reprocessed.
So we are comparing lines of about 1000 characters (no binary data) in two files (today's snapshot and yesterday's snapshot), each of which is roughly a million lines.
Using a secure hash such as SHA-256 (MD5 has known collisions and is slow by comparison), I can process about 30 MB/s on my laptop. The server, of course, will chew through it much faster.
So if the file is around 1 GB, it takes about 33 seconds to create all the hashes, and reading a 1 GB file through memory-mapped pages takes about 30 seconds. Not terrible.
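To make that step concrete, here is a minimal sketch (Python is just my choice for illustration; the file names are placeholders) of reducing each ~1000-character row to a fixed 32-byte SHA-256 digest:

```python
import hashlib

def hash_lines(path):
    """Return one SHA-256 digest per line of a snapshot file."""
    digests = []
    with open(path, "rb") as f:          # rows treated as opaque bytes
        for line in f:
            digests.append(hashlib.sha256(line.rstrip(b"\r\n")).digest())
    return digests

# Hypothetical snapshot file names.
new_hashes = hash_lines("snapshot_today.txt")
old_hashes = hash_lines("snapshot_yesterday.txt")
```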
Now we have two arrays of hashes, one representing the lines of each file. If we sort them, we can use binary search: iterate through the hashes of the new file, looking each one up in the hashes of the old file; if no match is found, that line goes into the changes file.
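A sketch of that sort-and-binary-search step, continuing the previous snippet (again only illustrative): whatever fails the lookup is a change, and running it in the other direction gives the deletions.

```python
from bisect import bisect_left

def find_unmatched(source_hashes, lookup_hashes):
    """Indices of lines in source_hashes with no matching hash in lookup_hashes."""
    sorted_lookup = sorted(lookup_hashes)
    unmatched = []
    for i, h in enumerate(source_hashes):
        pos = bisect_left(sorted_lookup, h)   # binary search
        if pos == len(sorted_lookup) or sorted_lookup[pos] != h:
            unmatched.append(i)
    return unmatched

added_lines   = find_unmatched(new_hashes, old_hashes)   # only in today's snapshot
deleted_lines = find_unmatched(old_hashes, new_hashes)   # only in yesterday's snapshot
```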
Keep in mind that the source (a legacy database) is a complete unknown: there is no guarantee about row order, where the changes are, or what kind of changes they are.
The suggestions for reading page by page are good, but they assume the two files are in the same order up to the first change. That cannot be assumed: rows (lines) can be in any order. Also, choosing an arbitrary block size breaks line granularity. For the purposes of this task, lines are immutable.
From this excellent link on incremental loading: File comparison capture: this method is also known as the snapshot differential method. It works by keeping before and after images of the files of interest to the data warehouse. Records are compared to find changes, and record keys are compared to find inserts and deletes. This technique is most appropriate for legacy systems, because triggers generally do not exist and transaction logs are either nonexistent or in a proprietary format. Since most legacy databases have some mechanism for dumping data into files, this technique creates periodic snapshots and then compares the results to produce change records. Of course, all the problems of static capture apply here as well. Added complexity comes from having to compare entire records and to identify and match keys. This technique is complex by nature and usually undesirable, but in some cases it may be the only solution.
This is the most relevant part: As we move into the era of terabyte data warehouses, the ability to rebuild the data warehouse from scratch on a nightly basis will go the way of the dinosaur. A logical and efficient approach to refreshing a data warehouse involves some form of incremental update strategy.
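As a rough illustration of the record/key comparison the quoted method describes (a sketch only; I'm assuming each row carries a primary key in its first field, and the pipe delimiter and file names are made up for the example):

```python
def load_keyed(path, delimiter="|"):
    """Map an assumed primary key (first field) to the full row text."""
    rows = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            rows[line.split(delimiter, 1)[0]] = line
    return rows

old_rows = load_keyed("snapshot_yesterday.txt")
new_rows = load_keyed("snapshot_today.txt")

# Key comparison finds inserts and deletes; full-row comparison finds updates.
inserts = [new_rows[k] for k in new_rows.keys() - old_rows.keys()]
deletes = [old_rows[k] for k in old_rows.keys() - new_rows.keys()]
updates = [new_rows[k] for k in new_rows.keys() & old_rows.keys()
           if new_rows[k] != old_rows[k]]
```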
So, am I on the right track? Would a B-tree index provide any benefit?