Machine Learning Math Statistics Uncategorized

Understanding the Metropolis Hasting algorithm – A tutorial

The Problem

Say you want to evaluate the expectation of a function over a random variable

$$E[f(x)] = \int p(x)f(x)dx$$,

or perhaps find the max of a probability distribution, typically the posterior,

$$\arg \max p(x|y)$$,

where calculating the derivate is intractable since it depends on some feature that comes from some algorithm or some unknown function.

These situations typically make us resort to approximate methods. One way of solving the expectation is to draw \(N\) random samples from \(p(x)\) and when \(N\) is sufficiently large we can approximate the expectation or the max by

$$E[f(x)] \approx \frac{1}{N} \sum\limits_{i=1}^{N}f(x_i)$$,

We can apply the same strategy to finding \(\arg \max p(x|y)\) by sampling from \(p(x|y)\) and taking the max value in the set of samples. When we have acquired a large enough amount of samples, we can be fairly sure of having found the max.

This way of approximating invariably leads to the question how to sample from a distribution \(p(x)\)(or \(p(x|y)\)). However, how can we sample from a pdf that in its current form already is too complicated?

A common way of sampling from a troublesome distribution is to use some kind of Markov Chain Monte-Carlo(MCMC) method. The core idea of MCMC is to generate a Markov chain that converges to the desired distribution after a sufficient amount of samples.

Markov Chains & Detailed Balance

A Markov chain is a sequence of random variables such that the current state only depends on the previous state,

$$p(X_n|X_{n-1},X_{n-2},X_{n-3}…,X_{1}) = p(X_n|X_{n-1}) $$.

To simulate a Markov chain we must formulate a transition kernel, \(T(x_i,x_j)\). The transition kernel is the probability of moving from a state \(x_i\) to a state \(x_j\); it can be either discrete or continuous. In undergraduate courses one often finds that it is discrete and formulated as a transition

Convergence for a Markov Chain means that it has a stationary distribution, \(\pi\). A stationary distribution implies that if we run the Markov Chain repeatedly, for long enough time, the samples we get for each run will always form the same distribution.

So how do we know when the kernel has a stationary distribution? Detailed balance is a sufficient but not necessary condition for a Markov chain to be stationary. Detailed balance essentially says that moving from state \(x_i\) to \(x_j\) has the same probability as moving from \(x_j\) to \(x_i\). Observe though that this does not mean that all states are equally probable. We state detailed balance more formally as follows,

$$\pi(x_i) T(x_i,x_j) = \pi(x_j)T(x_j,x_i), \; \forall x_i,x_j$$.

The MCMC Idea

The key idea for sampling from a distribution, \(p(x)\), using MCMC should be clear from the above. We put \(p(x)\) as our stationary distribution, find a transition kernel, \(T\), that fulfills the detailed balance condition, and generate samples from the Markov chain. By detailed balance the Markov chain will converge to our desired distribution, and we can use the samples for calculating the expectation or finding the max, or for whatever other reason we wanted samples from the distribution.

Finding Detailed Balance

How do we find a transition kernel that fulfills the detailed balance condition? Say we pick any distribution (transition kernel), \(T(x_i|x_j)\), that we can easily sample from. It will naturally be so that we will move from some state \(x_i\) to state \(x_j\) a bit more probable, that is,

$$\pi(x_i) T(x_i,x_j) > \pi(x_j)T(x_j,x_i)$$.

We can compensate for this by multiplying in another transition term, \(A(x_i,x_j)\) such that,

$$\pi(x_i) T(x_i,x_j) A(x_i,x_j) = \pi(x_j)T(x_j,x_i)$$.

This means that our balancing term will have the form,

$$A(x_i,x_j) = \frac{\pi(x_j)T(x_j,x_i)}{\pi(x_i) T(x_i,x_j)}$$.

If we are in state \(x_i\), and have generated a sample from \(T\), \(x_j\). \(A(x_i,x_j)\) thus becomes the binary probability of either transitioning to state \(x_j\) or staying in state \(x_i\). All we have to do to make a decision is to generate a uniform sample, \(u\), on \([0,1]\) and if \( u < A(x_i,x_j)\) we move to state \(x_j\). Sometimes we will have \(A(x_i,x_j)>1\), this means that moving from \(x_i\) to \(x_j\) is likely but the reverse is very unlikely, implying that the sample will need to automatically pass to the new state every time.

The Full Metropolis Hasting Algorithm

The full MH-algorithm thus takes the following form:

(0) For \(n = 1,2,\ldots, N\), set current state to \(x_i\) and do

(1) Sample from the proposal distribution \(q(x_j|x_i)\).

(2) Evaluate \(\alpha = \frac{q(x_i|x_j)p(x_j)}{q(x_j|x_i)p(x_i)}\)

(3) Accept the proposal sample, \(x_j\) with probability \(\min[1,\alpha]\) otherwise keep the current sample \(x_i\) as the new sample.

4 thoughts on “Understanding the Metropolis Hasting algorithm – A tutorial”

  1. Thanks for your explanation about MH. But I have something to ask. See (2) in ‘The Full Metropolis Hasting Algorithm’. I Think that subscripts of x in p are reverse. That is, p(x_j) and p(x_i) is variable of p in numerator and denominator, respectively. I appreciate that you will check them.

  2. Nice article. I think your explanation for A(x_i,x_j)>1 is reversed though. If alpha > 1 then it is very likely we will go from state i to j.

Leave a Reply

Your email address will not be published. Required fields are marked *