Plot only a subset of points?

I am trying to plot a CDF curve for a large dataset containing about 29 million values using ggplot. The way I calculate this is as follows:

    mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
    plot = ggplot(mycounts, aes(x=Value, y=ecd))

The plot takes a long time to draw. I was wondering if there is a clean way to plot only a sample of this dataset (for example, every 10th or 50th point) without compromising the actual result?

2 answers

I'm not sure about your data structure, but a simple call to sample might be enough:

    n <- nrow(mycounts)                               # number of cases in data frame
    mycounts <- mycounts[sample(n, round(n/10)), ]    # get an n/10 sample to the same data frame
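If a deterministic "every k-th point" thinning is preferred over a random sample, something like the following might work. This is only a sketch in base R: the helper name `thin_every_kth`, its arguments, and the toy data are made up for illustration, and it assumes the `ecd` column was already computed on the full data (as in the question), so the points that are kept still lie exactly on the full-data curve.

```r
# Hypothetical helper: keep every k-th row within each group, after sorting
# by the value column. The last point of each group is always kept so the
# curve still reaches 1.
thin_every_kth <- function(df, k = 10, group = "Type", value = "Value") {
  # Sort by group, then by value, so thinning walks along each curve
  df <- df[order(df[[group]], df[[value]]), , drop = FALSE]
  # Position of each row within its group (1, 2, 3, ...)
  pos <- ave(seq_len(nrow(df)), df[[group]], FUN = seq_along)
  # Size of the group each row belongs to
  size <- ave(seq_len(nrow(df)), df[[group]], FUN = length)
  df[(pos - 1) %% k == 0 | pos == size, , drop = FALSE]
}

# Toy data standing in for the real 29-million-row data frame (assumption)
set.seed(1)
toy <- data.frame(Type = rep(c("a", "b"), each = 1000), Value = rnorm(2000))
# ECDF computed on the full data, per Type, before any thinning
toy$ecd <- ave(toy$Value, toy$Type, FUN = function(v) ecdf(v)(v))

thinned <- thin_every_kth(toy, k = 50)
# ggplot(thinned, aes(x = Value, y = ecd, colour = Type)) + geom_step()
```

Because the thinning happens after the ECDF is computed, it only reduces how many points are drawn; it does not change the ECDF values themselves.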

Instead of taking every nth point, can you quantize your data to a sufficient resolution before plotting it? That way, you do not have to display resolution that you do not need (or that would not be visible anyway).

Here is one way to do it. (The function I wrote below is generic, but the example uses the names from your question.)

    library(ggplot2)
    library(plyr)

    ## A data set containing two ramps up to 100, one by 1, one by 10
    tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
                       Value = c(1:10 * 10, 1:100))

    ## Given a data frame and ddply-style arguments, partition the frame
    ## using ddply and summarize the values in each partition with a
    ## quantized ecdf. The resulting data frame for each partition has
    ## two columns: value and value_ecdf.
    dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
      value_colname <- deparse(substitute(.value))
      ddply(df, ..., .fun = function(rdf) {
        xs <- rdf[[value_colname]]
        qxs <- sort(unique(.quantizer(xs)))
        data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
      })
    }

    ## Plot each type ECDF (w/o quantization)
    tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
    qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)

    ## Plot each type ECDF (quantizing to nearest 25)
    rounder <- function(...) function(x) round_any(x, ...)
    tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
    qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)

While the original dataset and the unquantized ECDF set each have 110 rows, the quantized ECDF set is much smaller:

    > dim(tens)
    [1] 110   2
    > dim(tens_cdf)
    [1] 110   3
    > dim(tens_cdfq)
    [1] 10  3
    > tens_cdfq
       Type value value_ecdf
    1     1     0       0.00
    2     1    25       0.25
    3     1    50       0.50
    4     1    75       0.75
    5     1   100       1.00
    6    10     0       0.00
    7    10    25       0.20
    8    10    50       0.50
    9    10    75       0.70
    10   10   100       1.00
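A related variant, sketched here in base R only, is to evaluate each group's ECDF on a fixed number of evenly spaced points instead of rounding the observed values. The helper name `ecdf_on_grid` and its `n_points` argument are hypothetical, not part of plyr or ggplot2, and the sketch assumes that an evenly spaced grid between each group's minimum and maximum gives enough resolution for the plot.

```r
# Hypothetical helper: for each group, evaluate the ECDF of the value column
# on n_points evenly spaced values between that group's min and max.
ecdf_on_grid <- function(df, n_points = 200, group = "Type", value = "Value") {
  pieces <- lapply(split(df, df[[group]], drop = TRUE), function(part) {
    xs <- part[[value]]
    grid <- seq(min(xs), max(xs), length.out = n_points)
    res <- data.frame(value = grid, value_ecdf = ecdf(xs)(grid))
    res[[group]] <- part[[group]][1]   # carry the group label along
    res
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Same toy data set as in the answer: two ramps up to 100
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
                   Value = c(1:10 * 10, 1:100))
tens_grid <- ecdf_on_grid(tens, n_points = 5)
```

With this approach the output size is fixed at `n_points` rows per group regardless of how many values the group contains, which may be convenient when the groups differ a lot in size.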

Hope this helps! :-)
