Probability heatmap in ggplot

I asked this question a year ago and received the following code for a "probability heat map":

numbet <- 32
numtri <- 1e5
prob <- 5/6

# Fill a matrix: column 1 is the trial id, the rest are cumulative outcomes
xcum <- matrix(NA, nrow = numtri, ncol = numbet + 1)
for (i in 1:numtri) {
  x <- sample(c(0, 1), numbet, prob = c(prob, 1 - prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x) / cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep = ""))

# Reshape to long format: one row per (trial, bet) outcome
mxcum <- reshape(data.frame(xcum), varying = 1 + 1:numbet, idvar = "trial",
                 v.names = "outcome", direction = "long", timevar = "bet")

library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
                ymin = c(0, head(seq_along(V1) / length(V1), -1)),
                ymax = seq_along(V1) / length(V1),
                fill = (V1 / sum(V1)))
head(mxcum3)

library(ggplot2)
p <- ggplot(mxcum3, aes(xmin = bet - 0.5, xmax = bet + 0.5,
                        ymin = ymin, ymax = ymax)) +
  geom_rect(aes(fill = fill), colour = "grey80") +
  scale_fill_gradient("Outcome", formatter = "percent", low = "red", high = "blue") +
  scale_y_continuous(formatter = "percent") +
  xlab("Bet")
print(p)

(You may need to modify this code slightly because of changes in newer versions of ggplot2.)
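In particular, the formatter argument was removed from the scale functions in ggplot2 0.9.0. As a hedged sketch (my update, not part of the original code), the two affected lines can be rewritten with labels from the scales package, which modern ggplot2 depends on:

library(ggplot2)
library(scales)

# replaces scale_fill_gradient("Outcome", formatter = "percent", ...)
scale_fill_gradient("Outcome", labels = percent, low = "red", high = "blue")
# replaces scale_y_continuous(formatter = "percent")
scale_y_continuous(labels = percent)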

This is almost what I want. Except that each vertical column should have a different number of bins: the first should have 2, the second 3, the third 4, and so on (N + 1). On the graph, columns 6 and 7 have the same number of bins (7), whereas column 7 should have 8 (N + 1).

If I'm right, the code does this because it works from observed data, and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the right number of bins.

How can I adapt this code to get the correct number of bins?

1 answer

I used R's dbinom to generate the frequency of heads for n=1:32 trials and plotted it below. It is what you would expect. I read some of your previous posts here on SO and on math.stackexchange. However, I don't understand why you want to simulate the experiment rather than generate the values from the binomial random variable directly. If you could explain that, it would be great! I will try to work with @Andrie's simulation solution to check whether I can match the result shown below. You might be interested in this in the meantime.

set.seed(42)
numbet <- 32
numtri <- 1e5
prob <- 5/6

require(plyr)
out <- ldply(1:numbet, function(idx) {
  outcome <- dbinom(idx:0, size = idx, prob = prob)
  bet     <- rep(idx, length(outcome))
  N       <- round(outcome * numtri)
  ymin    <- c(0, head(seq_along(N) / length(N), -1))
  ymax    <- seq_along(N) / length(N)
  data.frame(bet, fill = outcome, ymin, ymax)
})

require(ggplot2)
p <- ggplot(out, aes(xmin = bet - 0.5, xmax = bet + 0.5,
                     ymin = ymin, ymax = ymax)) +
  geom_rect(aes(fill = fill), colour = "grey80") +
  scale_fill_gradient("Outcome", low = "red", high = "blue") +
  xlab("Bet")
print(p)

The plot:

[heatmap plot produced by the ggplot2 code above]

Edit: An explanation of how your old code from Andrie works and why it doesn't give what you intend.

In principle, what Andrie did (or rather, one way to look at it) uses the fact that if you have two binomial distributions X ~ B(n, p) and Y ~ B(m, p), where n, m are the sizes and p the probability of success, then their sum is X + Y ~ B(n + m, p) (1). So the purpose of xcum is to get the outcome for all bets n = 1:32, but to explain it better, let me build up the code step by step. Along the way, the construction of xcum will also become obvious, and you'll see that it can be created far more quickly, without the for-loop and without computing cumsum every time.
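If you want to convince yourself of (1) numerically, here is a minimal sketch (my illustration, not part of Andrie's or this answer's code) comparing the empirical distribution of a sum of two independent binomial samples with the theoretical B(n + m, p):

set.seed(1)
p <- 1/6
x <- rbinom(1e5, size = 3, prob = p)       # X ~ B(3, p)
y <- rbinom(1e5, size = 5, prob = p)       # Y ~ B(5, p)
round(table(x + y) / 1e5, 3)               # empirical distribution of X + Y
round(dbinom(0:8, size = 8, prob = p), 3)  # theoretical B(8, p); should match closely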

If you have followed me so far, the idea is to create a numtri * numbet matrix in which each column (of length numtri) is filled with 0's and 1's with probability 5/6 and 1/6 respectively. That is, if numtri = 1000, each of the numbet (= 32 here) columns will have approximately 834 0's and 166 1's. Let's build it and check it first.

numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob = c(5/6, 1/6), replace = TRUE)))

# check the count of 1's in each column
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166 163 164 176 162 160 177 157 163 166 146 170
# So the counts of 1's are "approximately" what we expect (around 166).

Now each of these columns is a sample of size numtri from a binomial distribution with n = 1. If we add the first two columns and replace the second column with that sum, then by (1), since the probabilities are equal, we get a binomial distribution with n = 2. Similarly, if we add the first three columns and replace the third column with that sum, we get a binomial distribution with n = 3, and so on. The idea is that if you add the columns cumulatively, you get numbet binomial distributions (n from 1 to 32 here). So let's do that.

xcum <- t(apply(xcum, 1, cumsum))

# You can verify that the second column has similar probabilities:
# calculate the frequency of all values in the 2nd column.
> table(xcum[, 2])
  0   1   2
694 285  21
> round(numtri * dbinom(2:0, 2, prob = 5/6))
[1] 694 278  28
# more or less identical, good!

If you now divide the xcum we just created by cumsum(1:numbet) for each row, like this:

 xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet) 

then it will be identical to the xcum matrix that comes out of the for-loop (if generated with the same seed). However, I don't quite understand the reason for this division in Andrie's code, as it is not necessary to produce the desired plot. I suppose it has something to do with the frequency values you mentioned in an earlier post on math.stackexchange.
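As a sanity check (my sketch, not from the original answer), you can verify that the vectorized construction reproduces the question's for-loop version when both use the same seed; the trial-id column is dropped for the comparison:

numtri <- 100; numbet <- 32

set.seed(45)
loopver <- matrix(NA, nrow = numtri, ncol = numbet)
for (i in 1:numtri) {
  x <- sample(c(0, 1), numbet, prob = c(5/6, 1/6), replace = TRUE)
  loopver[i, ] <- cumsum(x) / cumsum(1:numbet)
}

set.seed(45)
vecver <- t(replicate(numtri, sample(c(0, 1), numbet, prob = c(5/6, 1/6), replace = TRUE)))
vecver <- t(apply(vecver, 1, cumsum))
vecver <- vecver / matrix(rep(cumsum(1:numbet), each = numtri), ncol = numbet)

all.equal(loopver, vecver)  # should print TRUE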

Now, on to why you had difficulty getting the graph I attached (the one with n + 1 bins per column):

For a binomial distribution over n = 1:32 trials, with 5/6 as the probability of tails (failure) and 1/6 as the probability of heads (success), the probability of k heads is given by:

nCk * (1/6)^k * (5/6)^(n-k)  # where nCk is "n choose k"
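As a quick check (my addition), this formula agrees with R's dbinom; for example, with n = 7 and k = 2:

n <- 7; k <- 2
choose(n, k) * (1/6)^k * (5/6)^(n - k)  # 0.2344286
dbinom(k, size = n, prob = 1/6)         # same value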

For the test data we generated, the observed relative frequencies of k heads for n=7 (k=0:7) and n=8 (k=0:8) trials were:

# n = 7
   0    1    2    3    4    5
.278 .394 .233 .077 .016 .002

# n = 8
   0    1    2    3    4    5
.229 .375 .254 .111 .025 .006

Why do they both have only 6 bins rather than 8 and 9 bins? Of course, this is due to numtri = 1000. Let's look at the probabilities of each of these 8 and 9 bins, generated directly from the binomial distribution using dbinom, to see why this happens.

# n = 7, output rounded to 3 decimal places
dbinom(7:0, 7, prob = 5/6)
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000

# n = 8, output rounded to 3 decimal places
dbinom(8:0, 8, prob = 5/6)
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000

You see that the probabilities corresponding to k=6,7 (for n=7) and k=6,7,8 (for n=8) are ~0. They are very small in value. The minimum here is actually about 5.95 * 1e-7 (n=8, k=8). This means you can expect to see one such value only if you simulate roughly 1/(5.95 * 1e-7), i.e. about 1.7 million, times. If you check the same for n=32 and k=32, the value is 1.256493 * 1e-25, so you would need on the order of 8 * 1e24 simulations to see even one outcome where all 32 bets come up heads at n=32.
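Those magnitudes are easy to reproduce directly (a small sketch of my own; note that dbinom(8, 8, 1/6) equals dbinom(0, 8, 5/6)):

dbinom(8, size = 8, prob = 1/6)      # (1/6)^8, about 5.95e-07
1 / dbinom(8, size = 8, prob = 1/6)  # 6^8 = 1679616 expected trials per occurrence
dbinom(32, size = 32, prob = 1/6)    # about 1.26e-25 for n = 32, k = 32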

This is why your results had no values for certain bins: the probability of landing in them is very low for the given numtri. For the same reason, generating the probabilities directly from the binomial distribution overcomes this problem/limitation.

I hope I've managed to write this clearly enough for you to follow. Let me know if anything doesn't come through.

Edit 2: When I ran the simulation code above with numtri = 1e6, I got the following counts of k heads for n=7 (k=0:7) and n=8 (k=0:8):

# n = 7
     0      1      2      3      4      5      6      7
279347 391386 233771  77698  15763   1915    117      3

# n = 8
     0      1      2      3      4      5      6      7      8
232835 372466 259856 104116  26041   4271    392     22      1

Note that there are now values for k = 6 and k = 7 for both n = 7 and n = 8. Also, for n = 8 there is one observation of k = 8. Increasing numtri will fill in more of the missing bins, but that requires a huge amount of time/memory (if it is feasible at all).
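For completeness, a sketch of how those counts can be reproduced with the earlier simulation code (my reconstruction; the exact numbers depend on the seed, and a 1e6 x 32 matrix takes a fair amount of time and memory):

numtri <- 1e6
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(c(0, 1), numbet, prob = c(5/6, 1/6), replace = TRUE)))
xcum <- t(apply(xcum, 1, cumsum))
table(xcum[, 7])  # counts of k = 0:7 heads after n = 7 bets
table(xcum[, 8])  # counts of k = 0:8 heads after n = 8 bets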
