The question has already been answered, but I will give a more detailed answer, with background that is useful for everyone to understand. I will also go over the solutions that have already been suggested and explain why things are the way they are.
You have a 7 GB text file. Let's say, pessimistically, that your disk can transfer data at 20 MB/s. Then it can stream the whole file in 350 seconds, which is under 6 minutes.
If we assume the average line is 70 characters, that is about 100 million lines. If the disk spins at 6000 rpm, one rotation takes 0.01 seconds, so fetching a random piece of data from the disk takes anywhere from 0 to 0.01 seconds, or 0.005 seconds on average. That is our seek time. If you knew exactly where each record was and seeked to every row individually, it would take 0.005 seconds * 100,000,000 = 500,000 seconds, which is close to 6 days.
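For concreteness, here is that arithmetic as a tiny Python snippet (the 20 MB/s transfer rate and the 6000 rpm figure are just the pessimistic assumptions from above):

```python
# Back-of-the-envelope numbers: streaming the file vs. seeking to every row.
file_size = 7 * 1000**3              # 7 GB in bytes
transfer_rate = 20 * 1000**2         # pessimistic 20 MB/s
print(file_size / transfer_rate)     # 350 s: under 6 minutes to stream it all

lines = 100_000_000                  # ~70 characters per line
avg_seek = 0.005                     # half of one 0.01 s rotation at 6000 rpm
print(lines * avg_seek / 86400)      # ~5.8 days of pure seeking
```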
Lessons?
- When working with data on disk, you really want to avoid seeks. You want to stream the data.
- If you possibly can, you do not want your data to be on disk at all.
Now the standard way to solve this kind of problem is to sort the data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together into a larger block. The merge operation streams data in and writes a stream out, which is exactly the access pattern that disks like. Now, in theory, with 100 million rows you would need 27 merge passes. But in practice most of those passes fit easily in memory. Furthermore, a clever implementation - which nsort appears to be - can compress the intermediate data files to keep even more passes in memory. This data set is highly structured and compressible, so all of the intermediate data files should be able to fit in RAM. In that case you avoid touching the disk entirely except to read and write the data.
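To make that concrete, here is a minimal sketch of an external merge sort in Python. The chunk size and temporary files are illustrative assumptions, and a real tool like nsort is far more sophisticated (multi-way merges, compression of the intermediate files, and so on):

```python
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, max_lines_in_memory=1_000_000):
    """Sort a file too big for RAM: sort chunks in memory, then stream-merge them."""
    chunk_paths = []
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".chunk")
            tmp.writelines(chunk)
            tmp.close()
            chunk_paths.append(tmp.name)

    # heapq.merge streams the already-sorted chunks: sequential reads plus one
    # sequential write, which is exactly the disk-friendly access pattern.
    chunks = [open(p) for p in chunk_paths]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*chunks))
    for f in chunks:
        f.close()
    for p in chunk_paths:
        os.remove(p)
```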
That is the solution you were already pointed at.
OK, so that covers how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. A standard database has a table and some indexes. An index is just a structured data set that tells you where a given row lives in the table. So you walk the index (possibly with several seeks, though in practice everything but the last level usually sits in RAM), which tells you where your data is in the table, and then you have to seek again to fetch the data itself. So grabbing one piece of data out of a large table potentially means 2 disk seeks. Furthermore, writing a piece of data to a table means writing the data to the table and updating the index. That means writing in several places. That means more disk seeks.
As I explained at the start, disk seeks are bad. You do not want to do this. It is a disaster.
But, you may ask, don't the people who build databases know these things? Of course they do. They build databases to do what users ask of them, and they do not control the users. But they also build them to do the right thing when they can figure out what that is. If you are working with a decent database (for example Oracle or PostgreSQL, but not MySQL), it will have a good idea of when using an index would be worse than doing a merge, and it will choose to do the right thing. But it can only do that if it has the whole context, which is why it is so important to push the work into the database rather than coding up a simple loop.
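To illustrate that point, here is a hedged sketch using Python's built-in sqlite3; the table names old_domains and new_domains are made up purely for shape, and the same idea applies to Oracle or PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect("domains.db")

# Anti-pattern: a simple loop that fires one point lookup per row,
# forcing an index probe (and potential seeks) for every single domain.
# for (domain,) in conn.execute("SELECT domain FROM new_domains"):
#     conn.execute("SELECT registrar FROM old_domains WHERE domain = ?",
#                  (domain,)).fetchone()

# Better: hand the whole question to the database, which sees the full
# context and can choose the access pattern (e.g. a merge) for itself.
rows = conn.execute("""
    SELECT n.domain, o.registrar AS old_registrar, n.registrar AS new_registrar
    FROM new_domains AS n
    LEFT JOIN old_domains AS o USING (domain)
""")
```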
Furthermore, the database is good at not writing all over the place until it has to. In particular, it writes to something called a WAL log (write-ahead log - yes, I know the second "log" is redundant) and updates the data in memory. Every so often it flushes the in-memory changes to disk. That lets it batch up writes so that it seeks less. Still, there is a limit to how much batching it can do, so maintaining indexes is an inherently expensive operation. That is why the standard advice for large data loads is to drop all of the indexes, load the table, and then recreate the indexes.
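A sketch of that bulk-load advice, again with sqlite3; the table name, index name, and tab-separated input file are assumptions:

```python
import sqlite3

conn = sqlite3.connect("domains.db")

# 1. Drop the index so the load is a sequential append to the table,
#    not a table write plus a scattered index update for every row.
conn.execute("DROP INDEX IF EXISTS idx_domains_domain")

# 2. Bulk-load the data (assumed format: "domain<TAB>registrar" per line).
with open("domains.txt") as f:
    conn.executemany(
        "INSERT INTO domains (domain, registrar) VALUES (?, ?)",
        (line.rstrip("\n").split("\t") for line in f),
    )

# 3. Rebuild the index in one pass at the end.
conn.execute("CREATE INDEX idx_domains_domain ON domains (domain)")
conn.commit()
```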
All of this is to say that databases have limitations. If you know the right way that the database would solve your problem, then I guarantee that implementing that solution yourself, without the database's overhead, will always be faster. The trick is that very few developers have the knowledge needed to figure out the right solution. And even for those who do, it is usually much easier to get a database to do it reasonably well than to build the perfect solution from scratch.
And one last bit. What if you have a cluster of machines? The standard solution for that case (popularized by Google, which uses it internally) is called MapReduce. It is based on the observation that the merge sort which is so good for a disk is also very good for distributing work across multiple machines. So we really want to push work into sorts.
The trick used to do this is to break the work into three basic steps:
- Take a large pile of data and emit a stream of key/value facts.
- Sort the facts, partition them by key, and send them on for further processing.
- Have a reducer that takes each key with its set of values and does something with them.
If need be, the reducer can feed its output into another MapReduce, and you can string together any number of these operations.
From a user's point of view, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - for example a line - and emits 0 or more key/value pairs) and a reducer (takes a key and its set of values and does something with them), and the gory details can be pushed off onto your MapReduce framework. You don't even need to know that it uses a sort under the hood. And it can take care of things like what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely used framework that will work with many other languages. (Yes, it is written in Java, but it doesn't matter what language the mapper and reducer are written in.)
In your case, your mapper could start with a piece of data of the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer will then receive, for a single domain, the 1 or 2 files it appeared in, with full information. It then emits only the facts of interest. Adds are domains that are in the new file but not the old. Drops are domains that are in the old file but not the new. Registrar changes are domains that are in both, but whose registrar changed.
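Here is a hedged, single-machine sketch of that mapper and reducer in Python. The tab-separated "domain<TAB>registrar" format and the old/new labels are assumptions, and in a real Hadoop job the framework would perform the sort/shuffle that sorted() stands in for here:

```python
from itertools import groupby

def mapper(path, source):
    """Emit (domain, (source, registrar)) for every line of one input file."""
    with open(path) as f:
        for line in f:
            domain, registrar = line.rstrip("\n").split("\t")
            yield domain, (source, registrar)

def reducer(pairs):
    """pairs must arrive sorted by domain; emit adds, drops, and changes."""
    for domain, group in groupby(pairs, key=lambda kv: kv[0]):
        seen = dict(value for _, value in group)   # e.g. {"old": reg, "new": reg}
        if "old" not in seen:
            yield "ADDED", domain, seen["new"]
        elif "new" not in seen:
            yield "DROPPED", domain, seen["old"]
        elif seen["old"] != seen["new"]:
            yield "CHANGED", domain, seen["old"], seen["new"]

# The MapReduce framework would sort and route these pairs across machines;
# here sorted() fakes that step locally.
pairs = sorted(list(mapper("old.txt", "old")) + list(mapper("new.txt", "new")))
for fact in reducer(pairs):
    print(*fact, sep="\t")
```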
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple machines), this lets you process your data set far faster than any single machine could.