Building your first Bayesian Model by Grid Approximation


Note: This article is my notes on Statistical Rethinking (2nd Edition) – Chapter 2

A Bayesian model is all about estimating the posterior probability of its parameters. There are three main categories of techniques for estimating the posterior:

  1. Grid Approximation [This article]
  2. Quadratic approximation
  3. Markov Chain Monte Carlo (MCMC)

In this article, we will walk through what Grid Approximation is, how it relates to Bayes' Theorem, and how to do it in R. The method can also be easily translated to Python.

What is Grid Approximation?

First, let’s recap Bayes’ Theorem in the context of parameter estimation.

    \[\text{Posterior} = \frac{\text{Probability of the data} \times \text{Prior}}{\text{Average probability of the data}}\]

To make this concrete, let’s consider flipping a coin. The outcome of each flip is either heads (H) or tails (T), and we define p as the probability of getting heads (H).

Plugging this into the equation, we have

    \[\text{Pr}(p|H,T) = \frac{\text{Pr}(H,T|p) \times \text{Pr}(p)}{\text{Pr}(H, T)}\]

In the Bayesian view, the coin-flipping exercise is simply a process that generates outcomes based on some probability p of landing heads. The problem is that we do not know the parameter p. If the coin is fair, then p = 0.5. But what if the coin is not fair? Could p be 0.4, 0.3, or 0.1?

Our task is to estimate this parameter p based on the outcomes we observe.
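
To see this generative view in action, here is a minimal sketch (my own illustration, not from the book) that simulates coin flips with R's built-in rbinom, assuming an arbitrary p = 0.7:

R
# Simulate 10 flips of a coin whose (unknown to us) probability of heads is 0.7
# Each draw is 1 (heads) or 0 (tails)
set.seed(42)  # for reproducibility
flips <- rbinom(10, size = 1, prob = 0.7)
flips
sum(flips)  # number of heads observed

In practice we only get to see the flips; Grid Approximation works backwards from such outcomes to the plausible values of p.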

The steps of Grid Approximation

Grid Approximation simply lists out a grid of candidate values of p from 0 to 1, and evaluates each term of the formula above at every grid point:

  1. Prior (\text{Pr}(p)): We assume every value from 0 to 1 is equally likely, i.e. a Uniform distribution.
  2. Likelihood (\text{Pr}(H,T|p)): For each value of p, we compute the probability of seeing h heads in N trials. Say we flip the coin 10 times and see 3 heads, so h=3 and N=10. What is the probability of that result if p=0.5? This is a binomial distribution question (see the snippet after this list).
  3. Average Probability of Data (\text{Pr}(H,T)): We average the probability of the data over all the values of p we test. This acts as a normalisation factor.
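
For instance, the likelihood in step 2 is a single call to R's dbinom. This evaluates the example above, 3 heads in 10 flips with p = 0.5:

R
# Probability of exactly 3 heads in 10 flips, if p = 0.5
dbinom(3, size = 10, prob = 0.5)
## [1] 0.1171875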

Our goal is to estimate the posterior distribution of the parameter p, the probability of tossing heads. Assume we toss the coin 9 times and obtain 6 heads.

The steps of Grid Approximation are as follows:

  1. Define the grid. This means we decide how many points to use in estimating the posterior, then make a list of parameter values on the grid.
  2. Compute the value of the prior at each parameter value on the grid.
  3. Compute the likelihood at each parameter value.
  4. Compute the unstandardised posterior at each parameter value, by multiplying the prior by the likelihood.
  5. Standardise the posterior by dividing each value by the sum of all values, so it sums to 1.

1. Define the grid

R
# Define the grid - From 0 to 1, with 20 steps
# p is the parameter that use to generate likelihood
p_grid <- seq(from = 0, to = 1, length.out = 20)
p_grid
##  [1] 0.00000000 0.05263158 0.10526316 0.15789474 0.21052632 0.26315789
##  [7] 0.31578947 0.36842105 0.42105263 0.47368421 0.52631579 0.57894737
## [13] 0.63157895 0.68421053 0.73684211 0.78947368 0.84210526 0.89473684
## [19] 0.94736842 1.00000000
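
The choice of 20 points is arbitrary: a denser grid gives a finer approximation of the posterior, at the cost of more computation. For illustration (the variable names here are my own):

R
# A coarser and a finer grid - every later step stays exactly the same
p_grid_coarse <- seq(from = 0, to = 1, length.out = 5)
p_grid_fine   <- seq(from = 0, to = 1, length.out = 1000)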

2. Define the prior

A flat prior on the interval from a to b is defined as

    \[\text{Prior}=\frac{1}{(b-a)}=\frac{1}{(1-0)}=1\]

R
# Let's assume a flat prior
# For all values of p they have the same probability
# Flat prior = 1 / (1 - 0) = 1
prior <- rep(1, 20)

# We can also try different priors
# prior <- ifelse(p_grid < 0.5, 0, 1)
# prior <- exp(-5 * abs(p_grid - 0.5))

prior
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

3. Compute the likelihood at each value in the grid

What is the probability of seeing 6 heads and 3 tails in 9 trials, for each of the candidate probabilities we defined in p_grid?

R
# Likelihood of 6 heads in 9 trials, at each grid value of p
likelihood <- dbinom(6, size = 9, prob = p_grid)
likelihood
##  [1] 0.000000e+00 1.518149e-06 8.185093e-05 7.772923e-04 3.598575e-03
##  [6] 1.116095e-02 2.668299e-02 5.292110e-02 9.082698e-02 1.383413e-01
## [11] 1.897686e-01 2.361147e-01 2.666113e-01 2.714006e-01 2.450051e-01
## [16] 1.897686e-01 1.179181e-01 5.026670e-02 8.853845e-03 0.000000e+00
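
Note that the likelihood peaks at the grid point closest to the observed proportion of heads, 6/9 ≈ 0.67:

R
# Grid value with the highest likelihood
p_grid[which.max(likelihood)]
## [1] 0.6842105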

4. Compute the product of likelihood and prior

R
# With a flat prior of 1 everywhere, the unstandardised posterior equals the likelihood
unstd.posterior <- likelihood * prior
unstd.posterior
##  [1] 0.000000e+00 1.518149e-06 8.185093e-05 7.772923e-04 3.598575e-03
##  [6] 1.116095e-02 2.668299e-02 5.292110e-02 9.082698e-02 1.383413e-01
## [11] 1.897686e-01 2.361147e-01 2.666113e-01 2.714006e-01 2.450051e-01
## [16] 1.897686e-01 1.179181e-01 5.026670e-02 8.853845e-03 0.000000e+00

5. Standardise the posterior, so it sums to 1

R
posterior <- unstd.posterior / sum(unstd.posterior)
posterior
##  [1] 0.000000e+00 7.989837e-07 4.307717e-05 4.090797e-04 1.893887e-03
##  [6] 5.873873e-03 1.404294e-02 2.785174e-02 4.780115e-02 7.280739e-02
## [11] 9.987296e-02 1.242643e-01 1.403143e-01 1.428349e-01 1.289433e-01
## [16] 9.987296e-02 6.205890e-02 2.645477e-02 4.659673e-03 0.000000e+00
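
As a sanity check, the standardised posterior is now a proper probability distribution over the grid:

R
# Should sum to 1 after normalisation (up to floating-point precision)
sum(posterior)
## [1] 1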

Tabulate all the outputs

R
library(dplyr)

df <- data.frame(p_values = p_grid,
                 prior = prior,
                 likelihood = likelihood,
                 unstd_posterior = unstd.posterior,
                 posterior = posterior)

knitr::kable(df, col.names = c("p", "Prior", "Likelihood", "Unnormalised Posterior", "Posterior")) %>%
  kableExtra::kable_paper("hover", full_width = FALSE)
| p         | Prior | Likelihood | Unnormalised Posterior | Posterior |
|-----------|-------|------------|------------------------|-----------|
| 0.0000000 | 1     | 0.0000000  | 0.0000000              | 0.0000000 |
| 0.0526316 | 1     | 0.0000015  | 0.0000015              | 0.0000008 |
| 0.1052632 | 1     | 0.0000819  | 0.0000819              | 0.0000431 |
| 0.1578947 | 1     | 0.0007773  | 0.0007773              | 0.0004091 |
| 0.2105263 | 1     | 0.0035986  | 0.0035986              | 0.0018939 |
| 0.2631579 | 1     | 0.0111609  | 0.0111609              | 0.0058739 |
| 0.3157895 | 1     | 0.0266830  | 0.0266830              | 0.0140429 |
| 0.3684211 | 1     | 0.0529211  | 0.0529211              | 0.0278517 |
| 0.4210526 | 1     | 0.0908270  | 0.0908270              | 0.0478012 |
| 0.4736842 | 1     | 0.1383413  | 0.1383413              | 0.0728074 |
| 0.5263158 | 1     | 0.1897686  | 0.1897686              | 0.0998730 |
| 0.5789474 | 1     | 0.2361147  | 0.2361147              | 0.1242643 |
| 0.6315789 | 1     | 0.2666113  | 0.2666113              | 0.1403143 |
| 0.6842105 | 1     | 0.2714006  | 0.2714006              | 0.1428349 |
| 0.7368421 | 1     | 0.2450051  | 0.2450051              | 0.1289433 |
| 0.7894737 | 1     | 0.1897686  | 0.1897686              | 0.0998730 |
| 0.8421053 | 1     | 0.1179181  | 0.1179181              | 0.0620589 |
| 0.8947368 | 1     | 0.0502667  | 0.0502667              | 0.0264548 |
| 0.9473684 | 1     | 0.0088538  | 0.0088538              | 0.0046597 |
| 1.0000000 | 1     | 0.0000000  | 0.0000000              | 0.0000000 |

Finally, let’s visualise the posterior distribution. The posterior is skewed towards higher values of p, peaking around p ≈ 0.68, close to the observed proportion 6/9 ≈ 0.67. Based on the 6 heads we observed, the coin may not be a fair coin!

R
plot(p_grid, posterior, type = 'b', lty = 1, xlab = "Values of Parameter p", ylab = "Posterior Probability")
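
To make it easier to experiment with other priors or observations, the five steps can be wrapped into a small helper function. This is a minimal sketch; the function grid_posterior and its arguments are my own naming, not from the book:

R
# Grid Approximation for a binomial model, in one reusable function
grid_posterior <- function(heads, tosses, grid_size = 20,
                           prior_fn = function(p) rep(1, length(p))) {
  p_grid <- seq(from = 0, to = 1, length.out = grid_size)        # 1. define the grid
  prior <- prior_fn(p_grid)                                      # 2. prior at each point
  likelihood <- dbinom(heads, size = tosses, prob = p_grid)      # 3. likelihood
  unstd.posterior <- likelihood * prior                          # 4. unstandardised posterior
  data.frame(p = p_grid,
             posterior = unstd.posterior / sum(unstd.posterior)) # 5. standardise
}

# Reproduce the example above: 6 heads in 9 tosses with a flat prior
head(grid_posterior(6, 9))

# One of the alternative priors mentioned earlier: rule out p < 0.5
head(grid_posterior(6, 9, prior_fn = function(p) ifelse(p < 0.5, 0, 1)))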

Drawbacks

In simple cases, we can use Grid Approximation to estimate the posterior probability of a parameter. However, as the number of parameters increases, the grid grows exponentially, and Grid Approximation quickly becomes computationally infeasible (see the sketch below). Hence, we will look at Quadratic Approximation and Markov Chain Monte Carlo (MCMC) methods in the next articles.
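
A quick back-of-the-envelope calculation shows why: with 100 grid points per parameter, a model with k parameters needs 100^k grid points in total, as this sketch illustrates:

R
# Grid size explodes exponentially with the number of parameters
n_params <- c(1, 2, 5, 10)
data.frame(parameters = n_params, grid_points = 100^n_params)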

Reference

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd ed.). CRC Press.