Downloading and analyzing large amounts of data

For some research work, I need to analyze a ton of raw motion data (currently almost a gigabyte of data, and growing) and spit out quantitative information and graphs.

I wrote most of it using Groovy (with JFreeChart for charting), and when performance became a problem, I rewrote the main parts in Java.

The problem is that analysis and plotting take about a minute, while loading all the data takes about 5-10 minutes. As you can imagine, this gets really annoying when I want to make small changes to a graph and see the result.

I have a couple of ideas to fix this:

  • Load all the data into an SQLite database.
    Pros: It will be fast. I can run SQL to get aggregated data whenever I need it.

    Cons: I have to write all that code. Also, for some graphs I need access to every data point, so pulling several hundred thousand rows may be slow. (A rough sketch of this option is shown after the list.)

  • Java RMI to return an object. All the data would be loaded into a single root object which, when serialized, is about 200 megabytes. I'm not sure how long it would take to pass a 200 MB object over RMI (same client).

    I would have to start the server and load all the data in, but that's not a big deal.

    The main pro: it should take much less time to write.

  • Run a server that loads the data and then runs Groovy scripts on command inside the server's VM. All in all, this seems like the best idea (for implementation time and performance, as well as other long-term benefits).
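For the SQLite option, here is a minimal sketch of what the bulk load could look like, assuming the xerial sqlite-jdbc driver is on the classpath; the table layout and the Sample class are invented for illustration, not taken from the real project:

```java
// Hypothetical sketch of option 1: bulk-loading parsed motion samples into SQLite via JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class SqliteLoader {

    /** Minimal stand-in for whatever the real parsed data point looks like. */
    public static class Sample {
        final int runId;
        final double t, x, y, z;
        Sample(int runId, double t, double x, double y, double z) {
            this.runId = runId; this.t = t; this.x = x; this.y = y; this.z = z;
        }
    }

    public static void load(List<Sample> samples) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:motion.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS samples "
                         + "(run_id INTEGER, t REAL, x REAL, y REAL, z REAL)");
            }
            conn.setAutoCommit(false);  // one transaction: crucial for SQLite insert speed
            String sql = "INSERT INTO samples (run_id, t, x, y, z) VALUES (?, ?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (Sample s : samples) {
                    ps.setInt(1, s.runId);
                    ps.setDouble(2, s.t);
                    ps.setDouble(3, s.x);
                    ps.setDouble(4, s.y);
                    ps.setDouble(5, s.z);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}
```

Wrapping all the inserts in a single transaction (setAutoCommit(false) plus one commit at the end) is usually what makes or breaks SQLite bulk-insert speed.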

What I would like to know is: how have other people solved this problem?

Post-analysis (3/29/2011): a couple of months after writing this question, I had to learn R to run some statistics. Using R was far simpler and faster for analyzing and aggregating the data than what I had been doing.

In the end, I used Java for the pre-aggregation and did the rest in R. It was also much easier to produce nice-looking charts in R than with JFreeChart.

+6
java groovy
6 answers

Databases scale very well when you have huge amounts of data. In MS SQL we currently group/sum/filter about 30 GB of data in 4 minutes (somewhere around 17 million records, I think).

If the data is not going to grow much more, I would try approach number 2. You can write a simple test application that builds a 200-400 MB object of random data and measures the transfer performance before deciding whether you want to go that route.
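As a rough illustration of that kind of test, here is a throwaway sketch that serves roughly 200 MB of random doubles over RMI from a local registry and times the client call; the interface name, binding name, port, and array size are arbitrary choices for the sketch, and it needs a generous heap (e.g. -Xmx2g) since both copies of the array end up in the same JVM:

```java
// Throwaway benchmark: export a service returning ~200 MB of doubles and time the RMI call.
import java.rmi.NotBoundException;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.Random;

public class RmiTransferTest {

    public interface DataService extends Remote {
        double[] fetchAll() throws RemoteException;
    }

    static class DataServiceImpl implements DataService {
        private final double[] data;

        DataServiceImpl(int doubles) {
            data = new double[doubles];
            Random rnd = new Random();
            for (int i = 0; i < data.length; i++) {
                data[i] = rnd.nextDouble();
            }
        }

        @Override
        public double[] fetchAll() {
            return data;   // serialized and pushed through the socket on every call
        }
    }

    public static void main(String[] args) throws RemoteException, NotBoundException {
        // 25 million doubles * 8 bytes is roughly 200 MB of payload
        DataServiceImpl impl = new DataServiceImpl(25000000);

        Registry registry = LocateRegistry.createRegistry(1099);
        DataService stub = (DataService) UnicastRemoteObject.exportObject(impl, 0);
        registry.rebind("data", stub);

        // Look the service up through the registry so the call really goes through
        // RMI serialization (even within one JVM the stub talks over a socket).
        DataService remote =
                (DataService) LocateRegistry.getRegistry("localhost", 1099).lookup("data");

        long start = System.nanoTime();
        double[] copy = remote.fetchAll();
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println("Received " + copy.length + " doubles in " + elapsedMs + " ms");

        UnicastRemoteObject.unexportObject(impl, true);
    }
}
```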

+5

Before deciding, it's probably worth understanding what is happening with your JVM, as well as with your physical system resources.

There are several factors here:

  • JVM heap size
  • garbage collection algorithms
  • how much physical memory you have
  • how you load the data - is it from files fragmented all over the disk?
  • do you even need to load all the data at once - can it be done in batches?
  • if you do it in batches, you can vary the batch size and see what happens
  • if your system has multiple cores, you could use more than one thread at a time to process/load the data
  • if multiple cores are already in use and disk I/O is the bottleneck, you might be able to load from different physical disks at the same time

You should also take a look at http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp if you are not familiar with the JVM's settings.
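To make the batching and multi-core points above concrete, here is a hypothetical sketch of loading raw data files in fixed-size batches across a thread pool; parseFile() and the flat-directory assumption are placeholders for whatever the real loader does:

```java
// Illustrative only: load files in batches across a pool sized to the machine's cores.
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLoader {

    public static List<double[]> loadAll(File dir) throws InterruptedException, ExecutionException {
        File[] files = dir.listFiles();               // assume a flat directory of raw data files
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        int batchSize = 1000;                         // tune this and re-measure
        List<Future<List<double[]>>> futures = new ArrayList<Future<List<double[]>>>();
        for (int i = 0; i < files.length; i += batchSize) {
            final File[] batch = Arrays.copyOfRange(files, i, Math.min(i + batchSize, files.length));
            futures.add(pool.submit(new Callable<List<double[]>>() {
                public List<double[]> call() {
                    List<double[]> rows = new ArrayList<double[]>();
                    for (File f : batch) {
                        rows.addAll(parseFile(f));    // placeholder for the real parser
                    }
                    return rows;
                }
            }));
        }

        List<double[]> all = new ArrayList<double[]>();
        for (Future<List<double[]>> f : futures) {
            all.addAll(f.get());                      // blocks until that batch is done
        }
        pool.shutdown();
        return all;
    }

    /** Placeholder: the real code would parse one raw motion-data file into data points. */
    private static List<double[]> parseFile(File f) {
        return new ArrayList<double[]>();
    }
}
```

Varying batchSize and the pool size while watching GC logs and disk utilization should show fairly quickly whether CPU, heap, or I/O is the actual bottleneck.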

+2

If your data has relational properties, there is nothing more natural than storing it in an SQL database. There you can solve your biggest problem - performance - at the cost of "just" writing the appropriate SQL code.

To me it seems like the obvious choice.

+1

I would look at doing the analysis in R. It is a statistical language with graphing capabilities. It could put you well ahead, especially if that is the kind of analysis you intend to do. Why write all that code yourself?

+1

I would recommend running a profiler to see which part of the loading process takes the most time and whether there is a possible quick-win optimization. You can download an evaluation license for JProfiler or YourKit.

0

Ah yes: large data structures in Java. Good luck with that, surviving "death by garbage collection" and all. What Java seems best at is wrapping a UI around some other processing engine, although it does free developers from most memory management tasks - for a price. If it were me, I would most likely do the heavy crunching in Perl (I had to recode several pieces of a batch system from Java into Perl at my last job for performance reasons), and then feed the results back into your existing graphing code.

That said, given the options you are considering, you probably want to go the SQL DB route. Just make sure that it really is faster for a few sample queries; look at the query plan data and all that (assuming your system logs or can interactively display such data).
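If the database in question ends up being SQLite (option 1 in the question), the query plan can be inspected straight from JDBC using SQLite's EXPLAIN QUERY PLAN; this sketch assumes the sample schema shown earlier in the thread and an index you might add:

```java
// Hypothetical check of a SQLite query plan via JDBC (xerial sqlite-jdbc assumed).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryPlanCheck {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:motion.db");
             Statement st = conn.createStatement()) {
            st.execute("CREATE INDEX IF NOT EXISTS idx_samples_run ON samples(run_id)");
            try (ResultSet rs = st.executeQuery(
                    "EXPLAIN QUERY PLAN SELECT run_id, AVG(x) FROM samples GROUP BY run_id")) {
                while (rs.next()) {
                    // SQLite reports one row per plan step; the last column holds the description
                    System.out.println(rs.getString(rs.getMetaData().getColumnCount()));
                }
            }
        }
    }
}
```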

Edit (Jim Ferrance), re: Java big-N faster than Perl (comment below): the benchmarks you referenced are mostly small "arithmetic" loops, not something that does a few hundred MB of I/O and stores it in a Map / %hash / Dictionary / associative array for later inspection. Java I/O may well have gotten better, but I suspect all the abstraction still makes it comparatively slow, and I know the GC is a killer. I have not checked this recently; I do not process multi-GB data files on a daily basis in my current job the way I used to.

Feeding the trolls (12/21): I measured Perl to be faster than Java at a bunch of sequential string processing. In fact, depending on which machine I used, Perl was between 3 and 25 times faster than Java for this kind of work (batch + string). Of course, the particular thrash test I put together did not involve any numerical work, which I suspect Java would have done rather better at, nor did it require caching a lot of data in a Map/hash, which I suspect Perl would have done a bit better at. Note, though, that Java did much better at using large numbers of threads.

-4
