Efficient solution for grouping identical values in a large dataset

At my work, I had to develop and implement a solution for the following problem:

Given a data set of 30M records, extract the (key, value) tuples from a specific field of each record, group them by key and value, and keep a count of identical values for each key. Write the 5000 most frequently used values for each key to the database. Each row of the data set contains up to 100 (key, value) tuples in the form of serialized XML.

I came up with the following solution (using Spring Batch):

Batch job actions:

Step 1. Iterate over the rows of the data set and extract the (key, value) tuples. Once a fixed number of tuples has been accumulated, dump them to disk. Each tuple goes to a file with the name pattern '/chunk-', so all values for a given key are stored in the same directory. Within a single file the values are stored sorted.
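
A rough sketch of what this chunking step could look like in plain Java (the class and directory names are illustrative, not my actual Spring Batch code, which wraps this in readers/writers):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical chunk writer: buffers values per key and spills sorted chunk files to disk.
    class ChunkSpiller {
        private final Path baseDir;
        private final int maxBuffered;
        private final Map<String, List<String>> buffer = new HashMap<>();
        private final Map<String, Integer> chunkCounter = new HashMap<>();
        private int buffered = 0;

        ChunkSpiller(Path baseDir, int maxBuffered) {
            this.baseDir = baseDir;
            this.maxBuffered = maxBuffered;
        }

        void add(String key, String value) throws IOException {
            buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            if (++buffered >= maxBuffered) {
                spill();
            }
        }

        // Write one sorted chunk file per key into that key's directory, then clear the buffer.
        void spill() throws IOException {
            for (Map.Entry<String, List<String>> e : buffer.entrySet()) {
                List<String> values = e.getValue();
                Collections.sort(values);
                Path keyDir = baseDir.resolve(e.getKey());
                Files.createDirectories(keyDir);
                int n = chunkCounter.merge(e.getKey(), 1, Integer::sum);
                Files.write(keyDir.resolve("chunk-" + n), values);
            }
            buffer.clear();
            buffered = 0;
        }
    }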

Step 2. Iterate over all the directories and merge their chunk files into a single file per key, grouping identical values. Since the values are stored sorted, merging them is trivial with O(n * log k) complexity, where "n" is the number of values in a chunk file and "k" is the initial number of chunks.
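
For illustration, a minimal sketch of such a k-way merge over one key directory (not my actual code; file layout and names are assumptions):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.PriorityQueue;

    // Hypothetical k-way merge of the sorted chunk files of one key into a single sorted file.
    class ChunkMerger {

        // One open chunk file plus its current line.
        private static final class Cursor {
            final BufferedReader reader;
            String line;
            Cursor(BufferedReader reader, String line) { this.reader = reader; this.line = line; }
        }

        static void mergeKeyDir(Path keyDir, Path mergedFile) throws IOException {
            PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
            try (DirectoryStream<Path> chunks = Files.newDirectoryStream(keyDir, "chunk-*");
                 BufferedWriter out = Files.newBufferedWriter(mergedFile)) {
                for (Path chunk : chunks) {
                    BufferedReader r = Files.newBufferedReader(chunk);
                    String first = r.readLine();
                    if (first != null) heap.add(new Cursor(r, first));
                    else r.close();
                }
                // Each poll/offer costs O(log k), so merging n values over k chunks is O(n * log k).
                while (!heap.isEmpty()) {
                    Cursor c = heap.poll();
                    out.write(c.line);
                    out.newLine();
                    c.line = c.reader.readLine();
                    if (c.line != null) heap.add(c);
                    else c.reader.close();
                }
            }
        }
    }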

Step 3. For each merged file (in other words, for each key), read its values sequentially, using a PriorityQueue to maintain the top 5000 values without loading all the values into memory. Write the contents of the queue to the database.
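
Again just a sketch of the idea, not my actual implementation: since the merged file is sorted, runs of equal values can be counted on the fly and only the K most frequent kept in a bounded min-heap:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Hypothetical top-K selection over one merged (sorted) file.
    class TopValues {

        static final class ValueCount {
            final String value;
            final long count;
            ValueCount(String value, long count) { this.value = value; this.count = count; }
        }

        static PriorityQueue<ValueCount> topK(Path mergedFile, int k) throws IOException {
            // Min-heap ordered by count: the least frequent of the current top-K sits at the head.
            PriorityQueue<ValueCount> top =
                    new PriorityQueue<>(Comparator.comparingLong((ValueCount vc) -> vc.count));
            try (BufferedReader in = Files.newBufferedReader(mergedFile)) {
                String current = null;
                long count = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.equals(current)) {
                        count++;
                    } else {
                        offer(top, current, count, k);
                        current = line;
                        count = 1;
                    }
                }
                offer(top, current, count, k);  // flush the last run
            }
            return top;  // the contents of this queue would then be written to the database
        }

        private static void offer(PriorityQueue<ValueCount> top, String value, long count, int k) {
            if (value == null) return;
            top.add(new ValueCount(value, count));
            if (top.size() > k) top.poll();  // drop the least frequent value
        }
    }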

I spent about a week on this task, mostly because I had not worked with Spring Batch before and because I tried to focus on scalability, which required getting the multi-threaded part exactly right.

The problem is that my manager considers this task too simple to spend that much time on it.

So the question is: do you know a more efficient solution, or perhaps a less efficient one that would be easier to implement? And how long would it take you to implement my solution?

I know about MapReduce-like frameworks, but I can't use them because the application has to run on a simple PC with 3 cores and 1 GB for the Java heap.

Thank you in advance!

UPD: I think I did not clearly state my question. Let me ask another way:

Given this problem, as a project manager or at least a code reviewer, would you accept my solution? And how much time would you allocate for this task?

+8
java algorithm spring-batch batch-processing bigdata
4 answers

Are you sure this approach is faster than pre-scanning the XML once to extract all the keys, and then parsing the XML again for each key? In your solution you are doing a lot of file management, which is definitely not free.

Since you have three cores, you can parse three keys at the same time (as long as the file system can handle the load).
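
Something like this, just as a sketch (parseKey is a placeholder for whatever per-key XML scan you end up writing):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical: process three keys in parallel with a fixed pool matching the three cores.
    class ParallelKeyParser {
        static void parseAll(List<String> keys) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(3);
            for (String key : keys) {
                pool.submit(() -> parseKey(key));  // re-parses the XML for this single key
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        }

        private static void parseKey(String key) {
            // placeholder: scan the XML and count the values for this key only
        }
    }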

+1

Your solution seems reasonable and efficient; however, I would probably use SQL.

While parsing the key/value pairs, I would insert/update a SQL table, and then query the table for the top records.

Here is an example using only T-SQL (SQL 2008, but the concept should work in almost any modern RDBMS).

The SQL between /* START */ and /* END */ are the statements that would need to be executed from your code.

    BEGIN
        -- database table
        DECLARE @tbl TABLE (
            k INT -- key
            , v INT -- value
            , c INT -- count
            , UNIQUE CLUSTERED (k, v)
        )

        -- insertion loop (for testing)
        DECLARE @x INT
        SET @x = 0
        SET NOCOUNT OFF
        WHILE (@x < 1000000)
        BEGIN
            --
            SET @x = @x + 1

            DECLARE @k INT
            DECLARE @v INT
            SET @k = CAST(RAND() * 10 as INT)
            SET @v = CAST(RAND() * 100 as INT)

            -- the INSERT / UPDATE code
            /* START this is the sql you'd run for each row */
            UPDATE @tbl SET c = c + 1 WHERE k = @k AND v = @v
            IF @@ROWCOUNT = 0
                INSERT INTO @tbl VALUES (@k, @v, 1)
            /* END */
            --
        END
        SET NOCOUNT ON

        -- final select
        DECLARE @topN INT
        SET @topN = 50
        /* START this is the sql you'd run once at the end */
        SELECT
            a.k
            , a.v
        FROM (
            SELECT
                ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
                , k
                , v
            FROM @tbl
        ) a
        WHERE a.rid < @topN
        /* END */
    END
+1

It doesn't seem like a lot of work to try the old-fashioned way of just doing it in memory.

I would try the simple approach first, and then, if you run out of memory, process one key per pass (as per @Storstamp's answer).
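
The in-memory version is roughly a nested map of counters, something like this sketch (class and method names are made up):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical in-memory counting: one nested map keyed by key, then by value.
    class InMemoryGrouping {
        private final Map<String, Map<String, Long>> counts = new HashMap<>();

        void add(String key, String value) {
            counts.computeIfAbsent(key, k -> new HashMap<>())
                  .merge(value, 1L, Long::sum);
        }

        // Top-N most frequent values for one key, most frequent first.
        List<Map.Entry<String, Long>> top(String key, int n) {
            List<Map.Entry<String, Long>> entries =
                    new ArrayList<>(counts.getOrDefault(key, Collections.emptyMap()).entrySet());
            entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
            return entries.subList(0, Math.min(n, entries.size()));
        }
    }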

0

If using a "simple" solution is not an option due to the size of the data, my next choice would be an SQL database. However, since most of those require quite a lot of memory (and a lot of overhead in RAM), perhaps you should redirect your search to a NoSQL database such as MongoDB, which can be quite efficient even when working mostly from disk (which is what your environment requires, having only 1 GB of heap).

The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of all the indexes, sorting them), and it can probably do it slightly more efficiently than your solution, because all the data can be sorted and indexed already on insertion, removing the extra steps of sorting the lines in the /chunk- files, merging them, and so on.
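
For illustration only, a minimal sketch of that idea with the MongoDB Java sync driver (the connection string, database, collection, and field names are all made-up placeholders):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import com.mongodb.client.model.Sorts;
    import com.mongodb.client.model.UpdateOptions;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    public class MongoCounterSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> counts =
                        client.getDatabase("grouping").getCollection("counts");
                counts.createIndex(Indexes.ascending("k", "v"));

                // For every (key, value) tuple extracted from a row: upsert and increment a counter.
                String k = "someKey";       // placeholder tuple
                String v = "someValue";
                counts.updateOne(
                        Filters.and(Filters.eq("k", k), Filters.eq("v", v)),
                        Updates.inc("c", 1),
                        new UpdateOptions().upsert(true));

                // At the end, read the top 5000 values for a given key, most frequent first.
                for (Document d : counts.find(Filters.eq("k", k))
                                        .sort(Sorts.descending("c"))
                                        .limit(5000)) {
                    // write d.getString("v") and d.getInteger("c") to the target database
                }
            }
        }
    }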

In the end you will have a solution that is probably much easier to maintain, and it will also allow you to set up different kinds of queries, rather than being optimized only for this particular case.

As a project manager, I would not object to your current solution. It is already fast and solves the problem. However, as an architect, I would object that it is a bit hard to maintain and that it does not use proven technologies that do essentially the same thing you coded yourself. It is hard to beat the tree and hash implementations of modern databases.

0
