At my work, I had to develop and implement a solution for the following problem:
Given a data set of 30M records, extract (key, value) tuples from a specific field of each record, group them by key and value, and count how many times each value occurs for each key. Write the 5000 most frequent values for each key to a database. Each row of the data set contains up to 100 (key, value) tuples in the form of serialized XML.
I came up with the following solution (using Spring Batch):
Batch job actions:
Step 1. Iterate over the rows of the data set and extract (key, value) tuples. After collecting a fixed number of tuples, flush them to disk. Each tuple goes to a file named by the template '/ chunk-', so all values for a given key are stored in the same directory. Within a single file, the values are stored sorted.
Step 2. Iterate over all the directories and merge the chunk files in each one into a single file, grouping identical values together. Since the values are stored sorted, merging them is straightforward and takes O(n * log k), where "n" is the number of values in the chunk files and "k" is the initial number of chunks.
Step 3. For each merged file (in other words, for each key), read its values sequentially, using a PriorityQueue to keep the top 5000 values without loading all the values into memory. Write the contents of the queue to the database.
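Steps 2 and 3 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Spring Batch implementation: values are plain strings, each chunk is exposed as an already sorted Iterator (in practice it would be backed by a reader over a chunk file), and the merged result is collected in a list rather than streamed to a file. The names TopValuesSketch, Cursor, mergeAndCount and topN are made up for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopValuesSketch {

    /** One input chunk in the merge: its current head value plus the rest of the sorted chunk. */
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> rest) {
            this.rest = rest;
            this.head = rest.next();
        }
    }

    /** Step 2 (sketch): k-way merge of sorted chunks, collapsing equal values into (value, count). */
    public static List<Map.Entry<String, Long>> mergeAndCount(List<Iterator<String>> sortedChunks) {
        // Min-heap ordered by each chunk's current head value.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head));
        for (Iterator<String> chunk : sortedChunks) {
            if (chunk.hasNext()) {
                heap.add(new Cursor(chunk));
            }
        }
        List<Map.Entry<String, Long>> merged = new ArrayList<>();
        String current = null;
        long count = 0;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            if (c.head.equals(current)) {
                count++;
            } else {
                if (current != null) {
                    merged.add(new SimpleEntry<>(current, count));
                }
                current = c.head;
                count = 1;
            }
            // Advance this chunk and put it back into the heap if it has more values.
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        if (current != null) {
            merged.add(new SimpleEntry<>(current, count));
        }
        return merged;
    }

    /** Step 3 (sketch): keep only the n most frequent values using a bounded min-heap. */
    public static List<Map.Entry<String, Long>> topN(Iterable<Map.Entry<String, Long>> counted, int n) {
        // The root is always the least frequent of the values kept so far.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(
                Comparator.comparingLong((Map.Entry<String, Long> e) -> e.getValue()));
        for (Map.Entry<String, Long> e : counted) {
            heap.add(e);
            if (heap.size() > n) {
                heap.poll(); // evict the current weakest entry
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong((Map.Entry<String, Long> e) -> e.getValue()).reversed());
        return result;
    }

    public static void main(String[] args) {
        // Three small sorted "chunks" for one key; a real run would use n = 5000.
        List<Iterator<String>> chunks = Arrays.asList(
                Arrays.asList("a", "b", "b").iterator(),
                Arrays.asList("b", "c").iterator(),
                Arrays.asList("a", "b").iterator());
        System.out.println(topN(mergeAndCount(chunks), 2));
    }
}
```

The heap in topN never holds more than n + 1 entries, so the memory needed for Step 3 is bounded by the size of the top list, not by the number of distinct values per key.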
I spent about a week on this task, mainly because I had not worked with Spring Batch before and because I tried to focus on scalability, which requires a careful implementation of the multi-threaded part.
The problem is that my manager considers this task too easy to justify spending that much time on it.
So the question is: do you know a more efficient solution, or maybe a less efficient one that would be easier to implement? And how long would it take you to implement my solution?
I know about MapReduce-like frameworks, but I can't use them because the application must run on an ordinary PC with 3 cores and 1 GB for the Java heap.
Thank you in advance!
UPD: I think I did not state my question clearly. Let me ask it another way:
Given the problem, and being a project manager or at least a reviewer, would you accept my solution? And how much time would you allot to this task?