Finding Changes Between 2 HUGE (Text) Files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their name servers. It follows a format such as:

    mydomain NS ns.mynameserver.com.
    mydomain NS ns2.mynameserver.com.
    anotherdomain NS nameservers.com.
    notinalphadomain NS ns.example.com.
    notinalphadomain NS ns1.example.com.
    notinalphadomain NS ns2.example.com.

As you can see, each domain can have several lines (one per name server), and the file is NOT in alphabetical order. These files are 7 GB in size.

I need to take the previous file and the new file and compare them to find:

  • What domains were added
  • What domains were deleted
  • What domains have changed name servers

Since 7 GB is too much to load into memory, I obviously need to read the files as streams. The best method I have come up with so far is to make several passes over both files, one pass for each letter of the alphabet: on the first pass, for example, I load all domains that start with "a". Once I have all the "a" domains from the old and new files, I can do a fairly simple comparison in memory to find the changes.

The problem is that even reading char by char and optimizing as much as I could think of, each pass through a file takes about 200-300 seconds while collecting all the domains for the current letter. So I figure that in its current state I am looking at about an hour to process the files, without even writing the changes to the database (which will take some time of its own). This is on a dual quad-core Xeon box, so throwing more horsepower at it is not really an option for me. This timing may not be a deal-breaker, but I am hoping someone has bright ideas on how to speed things up... Admittedly, I have not tried asynchronous I/O yet; that is my next step.

Thanks in advance for any ideas!

+6
c# text large-files
4 answers

Preparing your data may help, both in terms of the best kind of code (the unwritten kind) and in terms of execution speed.

    cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
    cat today-com-zone | tr A-Z a-z | sort > prepared-today

Your program now only has to run a very simple diff algorithm, and you might even be able to use diff itself:

    diff prepared-today prepared-yesterday

Edit:

Here is an alternative solution that removes that extra processing, at the possible cost of diff running more slowly. It also assumes the use of GnuWin32 CoreUtils:

    sort -f <today-com-zone >prepared-today
    sort -f <yesterday-com-zone >prepared-yesterday
    diff -i prepared-today prepared-yesterday

The output will be a list of additions, deletions and changes. Not necessarily one change record per zone (consider what happens when two domains that are adjacent alphabetically are both deleted). You may need to play with diff's options for the number of context lines to avoid large false-positive change blocks.

You may end up needing to write your own program to take the two sorted input files and run through them in lockstep, per zone. When a new zone is found in the TODAY file, that is a new zone. When a zone is found in the YESTERDAY file but is missing from today's file, that is a deletion. When the same zone is found in both files, compare the NS records: that is either no change or a change of name servers.
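In case it helps to see the shape of that lockstep walk, here is a minimal C# sketch. It assumes the two inputs have already been lower-cased and sorted as shown above (hence the prepared-yesterday / prepared-today names), that the sort was done in plain byte order (e.g. LC_ALL=C) so ordinal comparison matches it, and that lines look like "domain NS nameserver"; it is an illustration, not the poster's code.

    using System;
    using System.Collections.Generic;
    using System.IO;

    class ZoneLockstepDiff
    {
        // Streams one (domain, name-server set) pair at a time from a sorted zone file.
        static IEnumerable<(string Domain, HashSet<string> Ns)> Zones(string path)
        {
            string currentDomain = null;
            var ns = new HashSet<string>();
            foreach (var line in File.ReadLines(path))
            {
                var parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
                if (parts.Length < 3) continue;                 // skip anything malformed
                if (currentDomain != null && parts[0] != currentDomain)
                {
                    yield return (currentDomain, ns);           // domain changed: emit the finished zone
                    ns = new HashSet<string>();
                }
                currentDomain = parts[0];
                ns.Add(parts[2]);                               // third token is the name server
            }
            if (currentDomain != null) yield return (currentDomain, ns);
        }

        static void Main()
        {
            using var oldZones = Zones("prepared-yesterday").GetEnumerator();
            using var newZones = Zones("prepared-today").GetEnumerator();
            bool hasOld = oldZones.MoveNext();
            bool hasNew = newZones.MoveNext();

            while (hasOld || hasNew)
            {
                // When one side runs out, everything left on the other side is an add or delete.
                int cmp = !hasOld ? 1
                        : !hasNew ? -1
                        : string.CompareOrdinal(oldZones.Current.Domain, newZones.Current.Domain);

                if (cmp < 0)
                {
                    Console.WriteLine($"DELETED {oldZones.Current.Domain}");
                    hasOld = oldZones.MoveNext();
                }
                else if (cmp > 0)
                {
                    Console.WriteLine($"ADDED {newZones.Current.Domain}");
                    hasNew = newZones.MoveNext();
                }
                else
                {
                    if (!oldZones.Current.Ns.SetEquals(newZones.Current.Ns))
                        Console.WriteLine($"CHANGED {newZones.Current.Domain}");
                    hasOld = oldZones.MoveNext();
                    hasNew = newZones.MoveNext();
                }
            }
        }
    }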

+5

The question has already been answered, but I will give a more detailed answer, with facts that are useful for everyone to understand. I will try to tie in the existing solutions, and even extend them, with explanations of why things turn out the way they do.

You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/s. It can stream the whole thing in 350 seconds. That is under 6 minutes.

If we assume that the average line is 70 characters, we have 100 million lines. If our disk spins at 6,000 rpm, a full rotation takes 0.01 seconds, so grabbing a random piece of data off the disk can take anywhere from 0 to 0.01 seconds, 0.005 seconds on average. This is called our seek time. If you know exactly where every record is and seek to each line individually, it will cost you 0.005 seconds * 100,000,000 = 500,000 seconds, which is close to 6 days.
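Putting those two figures side by side (using the same pessimistic numbers as above):

\[
t_{\mathrm{stream}} \approx \frac{7\,\mathrm{GB}}{20\,\mathrm{MB/s}} \approx 350\ \mathrm{s} \approx 6\ \mathrm{minutes},
\qquad
t_{\mathrm{seek}} \approx 10^{8}\ \mathrm{seeks} \times 0.005\ \mathrm{s} = 5 \times 10^{5}\ \mathrm{s} \approx 6\ \mathrm{days}.
\]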

Lessons?

  • When working with data on disk, you really want to avoid seeking. You want to stream the data.
  • If possible, you want your data not to be on disk at all.

Now the standard way to solve this problem is to sort the data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together into a larger block. The merge operation streams data in and writes a stream out, which is exactly the kind of access pattern you like on disk. Now, in theory, with 100 million rows you would need 27 merge passes. But in practice most of those passes fit easily in memory. Furthermore, a clever implementation (and nsort appears to be one) can compress intermediate data files to keep more passes in memory. This data set should be highly structured and compressible, in which case all the intermediate data files should be able to fit in RAM. So you entirely avoid the disk except when reading and writing the data.

This is the solution you are facing.


OK, so we have seen how to solve this problem. What more can be said?

Quite a bit. Let us analyze what happened with the database suggestions. A standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in the table. So you scan the index (possibly doing several seeks, although in practice all but the last are usually in RAM), which tells you where your data is in the table, and then you have to seek again to get the data. So grabbing a piece of data out of a large table potentially means two disk seeks. In addition, writing a piece of data to a table means writing the data to the table and updating the index. That means writing in several places. That means more disk seeks.

As I explained at the beginning, disk seeks are bad. You do not want to do them. They are a disaster.

But, you ask, don't the database people know these things? Of course they do. They design databases to do what users ask them to do, and they do not control the users. But they also design them to do the right thing when they can figure out what that is. If you are working with a decent database (for example Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it would be worse to use an index than to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.

In addition, the database is good about not writing all over the place until it needs to. In particular, the database writes something called a WAL (write-ahead log - yes, I know "WAL log" would be redundant) and updates the data in memory. When it gets around to it, it writes the in-memory changes to disk. This batches up the writes and makes it seek less. However, there is a limit to how much can be batched. So maintaining indexes is an inherently expensive operation. That is why the standard advice for large data loads in databases is to drop all the indexes, load the table, and then recreate the indexes.

But all of this shows that databases have limits. If you know the right way to solve the problem, I guarantee that using that solution without the overhead of the database will always be faster. The trick is that very few developers have the knowledge needed to figure out the right solution. And even for those who do, it is much easier to let the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.


And one last bit. What if we have a cluster of machines? The standard solution for that case (popularized by Google, which uses it internally) is called MapReduce. It is based on the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. So we really want to push work out to multiple machines.

The trick that is used is to break the work into three basic steps:

  • Take a large body of data and emit a stream of key/value facts.
  • Sort the facts, partition them by key, and send them off for further processing.
  • Have a reducer that takes a key and its set of values and does something with them.

If necessary, the reducer can send its data into another MapReduce, and you can string together any number of these operations.

From the user's point of view, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - for example, a line - and emits 0 or more key/value pairs) and a reducer (takes a key and its set of values and does something with them), and the gory details can be pushed off onto your MapReduce framework. You never have to worry about the fact that it is using sort under the hood. It can even take care of things such as what to do if one of your worker machines dies in the middle of your job. If you are interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it does not matter what language the mapper and reducer are written in.)

In your case, your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, nameserver). The reducer then receives, for a single domain, the one or two files it appeared in, with the full information, and emits only the facts of interest. Adds are domains that are in the new file but not the old. Drops are domains that are in the old file but not the new. Name-server changes are domains that are in both, but whose name servers have changed.
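Purely to make that data flow concrete, here is a small local C# illustration of the same map, shuffle, reduce shape. It is not Hadoop: the in-memory GroupBy stands in for the framework's distributed, disk-friendly sort, and the file names, the "old"/"new" tags and the line parsing are assumptions taken from this thread rather than anything a real job would be tied to.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class MapReduceShape
    {
        // Map: one zone-file line -> zero or more (domain, (whichFile, nameServer)) pairs.
        static IEnumerable<(string Domain, (string File, string Ns) Value)> Map(string file, string line)
        {
            var parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length >= 3 && parts[1].Equals("NS", StringComparison.OrdinalIgnoreCase))
                yield return (parts[0].ToLowerInvariant(), (file, parts[2].ToLowerInvariant()));
        }

        // Reduce: all the values collected for one domain -> at most one fact of interest.
        static string Reduce(string domain, List<(string File, string Ns)> values)
        {
            var oldNs = values.Where(v => v.File == "old").Select(v => v.Ns).OrderBy(n => n).ToList();
            var newNs = values.Where(v => v.File == "new").Select(v => v.Ns).OrderBy(n => n).ToList();
            if (oldNs.Count == 0) return $"ADDED {domain}";
            if (newNs.Count == 0) return $"DELETED {domain}";
            return oldNs.SequenceEqual(newNs) ? null : $"CHANGED {domain}";
        }

        static void Main()
        {
            var inputs = new[] { ("old", "yesterday-com-zone"), ("new", "today-com-zone") };

            var facts = inputs
                // Map phase: every line of every input becomes key/value pairs.
                .SelectMany(input => File.ReadLines(input.Item2)
                                         .SelectMany(line => Map(input.Item1, line)))
                // Shuffle phase: group everything that shares a key (a real framework
                // does this with a distributed sort, not in memory).
                .GroupBy(kv => kv.Domain, kv => kv.Value)
                // Reduce phase: one call per key.
                .Select(g => Reduce(g.Key, g.ToList()))
                .Where(fact => fact != null);

            foreach (var fact in facts) Console.WriteLine(fact);
        }
    }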

Assuming that your files are readily available in compressed form (so that they can easily be copied to multiple machines), this can let you process the data set much faster than any single machine could.

+4

This is very similar to a Google interview question that goes something like: "Say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM. How would you approach the problem?"

The answer (or rather, one correct answer) is to break the list into manageable chunks, sort each chunk, and then use a merge operation to produce the final sorted list.

So I wonder if a similar approach could work here. As in, starting with the first file, read as much data as you can work with efficiently in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to build a single sorted dataset (this step is optional... you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).

Repeat the steps above for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match, advance both to the next line. Otherwise, record the difference in your result set (or output file), advance whichever file has the lexicographically smaller value to its next line, and repeat.

Not sure how fast this would be, but it is almost certainly faster than doing 26 passes over each file (you have one pass to build the chunks, one pass to merge the chunks, and one pass to compare the sorted datasets).
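Here is a minimal C# sketch of that chunk-sort-merge step, just to make the idea concrete. The chunk size, the temporary and output file names, and the simple linear scan over chunk heads during the merge are arbitrary choices for illustration (with only a handful of chunks a linear scan is fine; many chunks would call for a heap), and a real run over 7 GB files would want tuning and error handling.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class ExternalSort
    {
        const int LinesPerChunk = 1_000_000;   // tune to whatever comfortably fits in RAM

        // Pass 1: read a manageable chunk, sort it in memory, write it out, repeat.
        static List<string> WriteSortedChunks(string inputPath, string tag)
        {
            var chunkFiles = new List<string>();
            var chunk = new List<string>(LinesPerChunk);
            foreach (var line in File.ReadLines(inputPath))
            {
                chunk.Add(line.ToLowerInvariant());
                if (chunk.Count == LinesPerChunk) Flush(chunk, chunkFiles, tag);
            }
            if (chunk.Count > 0) Flush(chunk, chunkFiles, tag);
            return chunkFiles;
        }

        static void Flush(List<string> chunk, List<string> chunkFiles, string tag)
        {
            chunk.Sort(StringComparer.Ordinal);
            string path = $"{tag}-chunk-{chunkFiles.Count}.txt";
            File.WriteAllLines(path, chunk);
            chunkFiles.Add(path);
            chunk.Clear();
        }

        // Pass 2: merge the sorted chunks by repeatedly emitting the smallest current line.
        static void MergeChunks(List<string> chunkFiles, string outputPath)
        {
            var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
            var heads = readers.Select(r => r.ReadLine()).ToList();
            using var output = new StreamWriter(outputPath);

            while (true)
            {
                int min = -1;
                for (int i = 0; i < heads.Count; i++)
                    if (heads[i] != null && (min < 0 || string.CompareOrdinal(heads[i], heads[min]) < 0))
                        min = i;
                if (min < 0) break;                    // every chunk is exhausted

                output.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();  // advance only the reader we consumed from
            }
            readers.ForEach(r => r.Dispose());
        }

        static void Main()
        {
            // Produce one fully sorted copy of each zone file; the two outputs can then
            // be compared line by line as described above.
            MergeChunks(WriteSortedChunks("yesterday-com-zone", "old"), "old-sorted.txt");
            MergeChunks(WriteSortedChunks("today-com-zone", "new"), "new-sorted.txt");
        }
    }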

That, or use a database.

+3

You should read each file once and save the data into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to handle and process large amounts of data like this quickly.

It will still be fairly slow to read all of the data into the database the first time, but you will not have to read the files more than once.
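For what it is worth, here is a hedged sketch of what the queries could look like once the data is loaded. It assumes the two files have already been bulk-loaded into tables old_zone(domain, ns) and new_zone(domain, ns); the table and column names, the choice of SQLite via Microsoft.Data.Sqlite, and the loading step itself are all assumptions for illustration, not something this answer specifies.

    using System;
    using Microsoft.Data.Sqlite;

    class ZoneDiffViaDatabase
    {
        static void Main()
        {
            using var conn = new SqliteConnection("Data Source=zones.db");
            conn.Open();

            // Domains present today but not yesterday.
            Report(conn, "ADDED",
                "SELECT domain FROM new_zone EXCEPT SELECT domain FROM old_zone");

            // Domains present yesterday but not today.
            Report(conn, "DELETED",
                "SELECT domain FROM old_zone EXCEPT SELECT domain FROM new_zone");

            // Domains present in both whose (domain, ns) rows differ in either direction.
            Report(conn, "CHANGED", @"
                WITH ns_diff AS (
                    SELECT domain FROM (SELECT domain, ns FROM new_zone
                                        EXCEPT SELECT domain, ns FROM old_zone)
                    UNION
                    SELECT domain FROM (SELECT domain, ns FROM old_zone
                                        EXCEPT SELECT domain, ns FROM new_zone)
                )
                SELECT DISTINCT domain FROM ns_diff
                 WHERE domain IN (SELECT domain FROM old_zone)
                   AND domain IN (SELECT domain FROM new_zone)");
        }

        static void Report(SqliteConnection conn, string label, string sql)
        {
            using var cmd = conn.CreateCommand();
            cmd.CommandText = sql;
            using var reader = cmd.ExecuteReader();
            while (reader.Read())
                Console.WriteLine($"{label}: {reader.GetString(0)}");
        }
    }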

+2
