Modern Theory of Markov Chains
This page is for the course in 2014. Are you looking for information about last year's course?
Time and location
Description
How many times do we need to shuffle a deck of $n$ cards in order to make them evenly mixed? Will a population of bacteria survive or eventually go extinct? Will a particle jumping randomly from one place to another eventually return to its original whereabouts? How quickly does a Monte Carlo simulation converge (if at all)? Motivated by modern applications and guided by examples, we develop the basic theory of Markov chains and familiarize ourselves with various mathematical techniques invented to analyze them.
This course is part of the master's program Stochastics and Financial Mathematics, but local students from Utrecht are also welcome. The local code for the course is WISS123.
Prerequisites
Working knowledge of the basic concepts of probability theory (at the level covered by a first course in probability) is required. Mathematical maturity and computer skills are desirable.
Literature
Course webpage
http://www.staff.science.uu.nl/~taati001/markov/
Instructor
Siamak Taati
Evaluation
Practical information
Weekly outline
[June 26, 2014]
Summary
This was the last class, dedicated to project presentations.
Marvin told us the fascinating story of riffle shuffling.
Wout talked about Markov random fields and an application
to image restoration. Tineke presented elegant proofs of
inequalities from percolation theory using couplings
and Markov chains.
[June 19, 2014]
Summary
One class of reversible Markov chains for
which the eigenvalues and eigenvectors are easy to find
consists of random walks on finite commutative groups.
The eigenvectors of a random walk on a finite commutative group $(\mathbb{G},+)$
with jump distribution $p:\mathbb{G}\to[0,1]$
turn out to be
the characters
of the group $\mathbb{G}$ (i.e., the group homomorphisms from $\mathbb{G}$
into the multiplicative group of $\mathbb{C}$), irrespective of the jump distribution.
The eigenvalue corresponding to a character $\gamma$
is $\hat{p}(\gamma)\triangleq\sum_{x\in\mathbb{G}}p(x)\gamma(x)$.
We gave a minimal review of Fourier analysis on finite commutative groups
enough to understand and prove the above claim.
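For completeness, here is the one-line verification, sketched in terms of the transition matrix $K(x,y)\triangleq p(y-x)$ of the walk: for every $y\in\mathbb{G}$,
$$(K\gamma)(y)=\sum_{x\in\mathbb{G}}p(x-y)\gamma(x)=\sum_{z\in\mathbb{G}}p(z)\gamma(y+z)=\Big(\sum_{z\in\mathbb{G}}p(z)\gamma(z)\Big)\gamma(y)=\hat{p}(\gamma)\,\gamma(y)\;,$$
where the third equality uses the homomorphism property $\gamma(y+z)=\gamma(y)\gamma(z)$.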
As an example, the characters of $\mathbb{Z}_n$ are the functions $\gamma_k(x)\triangleq\mathrm{e}^{2\pi i kx/n}$ ($k=0,1,\ldots,n-1$). For a lazy simple random walk on $\mathbb{Z}_n$, the eigenvalue corresponding to $\gamma_k$ is $\lambda_k\triangleq\cos^2\frac{k\pi}{n}$. Using the spectral bound we derived last time and a bit of calculation, we obtained the bound $t_\text{mix}(\varepsilon)=O(n^2)$ for its mixing time.

The characters of the product group $\mathbb{Z}_2\times\mathbb{Z}_2\times\cdots\times\mathbb{Z}_2$ ($n$ times) are $$\gamma_A(x_1x_2\cdots x_n)\triangleq (-1)^{\sum_{i\in A} x_i}$$ for $A\subseteq\{1,2,\ldots,n\}$. For the lazy version of the Ehrenfest model, the eigenvalue corresponding to $\gamma_A$ is $\lambda_A\triangleq 1-\frac{k}{n}\leq \mathrm{e}^{-\frac{k}{n}}$, where $k\triangleq|A|$. Using the spectral bound, we obtained the long-awaited upper bound $$t_\text{mix}(\varepsilon)\leq\frac{1}{2}n\log n + O(n)$$ for the mixing time of the Markov chain. This matches the lower bound we had found earlier, hence demonstrating the presence of the cut-off phenomenon in the Ehrenfest model.
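As a quick numerical sanity check of the $\mathbb{Z}_n$ example (not part of the lecture), one can compare the eigenvalues of the transition matrix of the lazy simple random walk with the predicted values $\cos^2\frac{k\pi}{n}$; a minimal sketch in Python (assuming numpy is available):

import numpy as np

n = 12
K = 0.5 * np.eye(n)                # lazy: stay put with probability 1/2
for x in range(n):
    K[x, (x + 1) % n] += 0.25      # step clockwise with probability 1/4
    K[x, (x - 1) % n] += 0.25      # step counterclockwise with probability 1/4

computed = np.sort(np.linalg.eigvals(K).real)
predicted = np.sort([np.cos(k * np.pi / n) ** 2 for k in range(n)])
print(np.allclose(computed, predicted))   # True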
[June 12, 2014]
Summary
We recalled that the transition matrix of a reversible Markov chain
defines a self-adjoint operator on the space
of complex-valued functions $f:S\to\mathbb{C}$.
As a consequence, its eigenvalues are all real
and the corresponding eigenvectors form an orthonormal
basis for $\mathbb{C}^S$ with respect to the inner product weighted by the stationary distribution (the spectral theorem).
Using a bit of linear algebra, we derived an upper bound $$4\|\mathbb{P}_x(X_t\in\cdot)-\pi\|^2 \leq \sum_{i=2}^n \lambda_i^{2t}\varphi_i(x)^2$$ for the total variation distance between the distribution of the chain at time $t$ and the reversible stationary distribution. Here, $1=\lambda_1,\lambda_2,\ldots,\lambda_n$ are the eigenvalues of the transition matrix and $1\equiv\varphi_1,\varphi_2,\ldots,\varphi_n$ are the corresponding eigenvectors, forming an orthonormal basis. Note that if the chain is irreducible and aperiodic, the eigenvalues $\lambda_2,\ldots,\lambda_n$ (all but the largest) are in $(-1,1)$, so the bound tends to $0$ as $t\to\infty$.

In the case of a random walk on a finite Abelian group, we could use the symmetry to simplify the bound, obtaining $$4\|\mathbb{P}_x(X_t\in\cdot)-\pi\|^2 \leq \sum_{i=2}^n \lambda_i^{2t} \;.$$ The above bound was originally found and exploited by Diaconis and Shahshahani and is quite tight, although it requires knowledge of all the eigenvalues and eigenvectors, which are often difficult to find. The second-largest eigenvalue has the dominant effect, and weaker bounds can be given in terms of this eigenvalue alone.
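As a numerical illustration (a toy check, not from the course), one can verify the first bound on a small reversible chain by diagonalizing the symmetrized matrix $D^{1/2}KD^{-1/2}$, where $D$ is the diagonal matrix of $\pi$; a sketch in Python:

import numpy as np

# Lazy random walk on a path with 5 vertices; stationary law ~ degrees.
n = 5
K = np.zeros((n, n))
for x in range(n):
    K[x, x] += 0.5
    nbrs = [y for y in (x - 1, x + 1) if 0 <= y < n]
    for y in nbrs:
        K[x, y] += 0.5 / len(nbrs)
pi = np.array([1, 2, 2, 2, 1]) / 8.0

d = np.sqrt(pi)
A = K * d[:, None] / d[None, :]        # symmetric iff K is reversible w.r.t. pi
lam, V = np.linalg.eigh(A)
order = np.argsort(lam)[::-1]          # lam[0] = 1 after reordering
lam, phi = lam[order], (V / d[:, None])[:, order]  # phi: orthonormal in L^2(pi)

x, t = 0, 10
tv = 0.5 * np.abs(np.linalg.matrix_power(K, t)[x] - pi).sum()
rhs = sum(lam[i] ** (2 * t) * phi[x, i] ** 2 for i in range(1, n))
print(4 * tv ** 2 <= rhs)              # True: the spectral bound holds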
[June 5, 2014]
Summary
Tineke and Wout told us about their progress on their projects.
[May 29, 2014]
Summary
Holiday!
[May 22, 2014]
Summary
No class --- I will be away the whole week.
[May 15, 2014]
Summary
We first finished the derivation of the promised lower bound
$t_\text{mix}(1-\varepsilon) \geq \frac{1}{2}n\log n - O(n)$
for the mixing time of a lazy random walk on the hypercube $\{0,1\}^n$.
We then started with the algebraic/geometric approach to studying Markov chains. A Markov chain with transition matrix $K:S\times S\to[0,1]$ can be understood in terms of iterations of the linear operator $u\in\mathbb{C}^S\mapsto uK$ on probability distributions, or of its transpose $f\in\mathbb{C}^S\mapsto Kf$ on observables $f:S\to\mathbb{C}$. The standard way to understand the iterates of a linear map is through its eigenvalues and eigenfunctions. The starting point is the Perron-Frobenius theorem (tailor-made for finite-state Markov chains): for an irreducible and aperiodic chain, $1$ is a simple eigenvalue of $K$ whose left eigenvector can be chosen to be a (positive) probability distribution, and every other eigenvalue has modulus strictly less than $1$.
This way of looking at Markov chains becomes particularly neat and powerful in the case of reversible Markov chains. In algebraic language, the fact that $K$ is reversible with respect to a (positive) distribution $\pi$ translates to saying that $K$ is self-adjoint with respect to the inner product $\langle f,g\rangle_\pi\triangleq \sum_x \pi(x)f(x)\overline{g(x)}$. That $K$ is self-adjoint means $\langle Kf,g\rangle_\pi=\langle f, Kg\rangle_\pi$. If $K$ is self-adjoint, the spectral theorem (simple proof) is applicable: the eigenvalues of $K$ are all real, and $\mathbb{C}^S$ has an orthonormal basis (with respect to $\langle\cdot,\cdot\rangle_\pi$) consisting of eigenvectors of $K$.
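A quick numerical check of this equivalence (a sketch, using a birth-and-death chain, which is automatically reversible):

import numpy as np

rng = np.random.default_rng(1)
K = np.array([[0.5, 0.5, 0.0, 0.0],      # a birth-and-death chain on {0,1,2,3}
              [0.3, 0.2, 0.5, 0.0],
              [0.0, 0.3, 0.2, 0.5],
              [0.0, 0.0, 0.3, 0.7]])
w = np.ones(4)                            # solve detailed balance along the path
for i in range(1, 4):
    w[i] = w[i - 1] * K[i - 1, i] / K[i, i - 1]
pi = w / w.sum()

inner = lambda f, g: np.sum(pi * f * np.conj(g))     # <f,g>_pi
f, g = rng.standard_normal(4), rng.standard_normal(4)
print(np.isclose(inner(K @ f, g), inner(f, K @ g)))  # True: K is self-adjoint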
[May 8, 2014]
Summary
We had a discussion about the projects.
Wout is working on Markov random fields, which are similar to Markov chains
except that the random variables are indexed by the vertices of a graph.
He mentioned a representation of the probabilities using "energy functions".
He had also found an interesting application of Markov random fields for image enhancement.
Marvin told us about various card-shuffling schemes and his experiments
with them. Riffle shuffling seems to mix the cards very quickly,
whereas the more usual lazy method takes around 2500 shuffles to mix the deck!
Tineke is working on examples with no apparent connection to Markov chains,
where reasoning with Markov chains helps solve the problem.
She is using an article by Olle Häggström as a reference, in which he presents some examples,
but she is also looking for other examples.
In the remaining time, we proceeded further with the calculation of a lower bound for the mixing time of a lazy random walk on the hypercube.
Assignment
The discussions about the projects were quite fun! Let us do this again
in four weeks (June 5).
[May 1, 2014]
Summary
We continued talking about lower bounds for mixing times.
We first gave two examples of Markov chains with clear (somewhat exaggerated) bottlenecks. The first example was a lazy random walk on a graph obtained by connecting two complete graphs (of sizes $k$ and $n-k$) with a single edge. If $U$ is the set of vertices in the first complete subgraph, the bottleneck ratio of $U$ turned out to be $\Phi(U)=\frac{1}{1+k(k+1)}$. For $k\leq n/2$, this gives the bound $t_{\text{mix}}(\frac{1}{2}-\varepsilon) \geq \frac{\varepsilon}{\Phi(U)} = \varepsilon(1+k(k+1))$.

The second example was the Gibbs sampler for the hard-core model with parameter $\lambda$ on a complete bipartite graph with two parts $A$ and $B$ of sizes $k$ and $n-k$. This has a bottleneck between the configurations that have particles on $A$ and the configurations that have particles on $B$. We calculated the bottleneck ratio $\Phi(U)=\frac{k}{n}\frac{\lambda}{1+\lambda}\frac{1}{(1+\lambda)^k - 1}$ for the set $U$ of all configurations that have particles on $A$. For $k=n-k=n/2$, this gave the lower bound $t_{\text{mix}}(\frac{1}{2}-\varepsilon) \geq \frac{\varepsilon}{\Phi(U)} \approx c (1+\lambda)^{n/2}$, with some positive constant $c$ depending on $\varepsilon$ and $\lambda$.

We then started analyzing the lazy random walk on the hypercube (the Ehrenfest model). We devised a strategy for obtaining a lower bound (by examining the concentration of the number of $1$s) but left the calculations for next week.
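The first example is easy to explore numerically; here is a sketch in Python that computes the bottleneck ratio of the first clique directly from the definition (with laziness implemented as holding probability $\frac{1}{2}$; the exact constant depends on such conventions):

import numpy as np
from itertools import combinations

k, m = 4, 8                                   # sizes of the two cliques
n = k + m
A = np.zeros((n, n))                          # adjacency matrix
for i, j in combinations(range(k), 2):
    A[i, j] = A[j, i] = 1
for i, j in combinations(range(k, n), 2):
    A[i, j] = A[j, i] = 1
A[0, k] = A[k, 0] = 1                         # the single connecting edge

deg = A.sum(axis=1)
K = 0.5 * np.eye(n) + 0.5 * A / deg[:, None]  # lazy random walk on the graph
pi = deg / deg.sum()                          # stationary law ~ degrees
U = range(k)                                  # the first clique
flow = sum(pi[x] * K[x, y] for x in U for y in range(k, n))
print(flow / pi[list(U)].sum())               # Phi(U): of order 1/k^2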
[April 22, 2014]
Place: Minnaert 022
Summary
We talked about lower bounds for mixing times.
As an example, we found a lower bound $t_\text{mix}(1-\varepsilon)\geq \frac{\varepsilon^3}{16}n^2 - O(n)$ for the mixing time of a lazy simple random walk on $\mathbb{Z}_n$. This matches (at least in the order of magnitude) the upper bound $t_\text{mix}(\varepsilon)\leq (\frac{1}{4}\log_2\frac{1}{\varepsilon})n^2$ we had found earlier using the coupling method. The idea was to find a test set $A$ with small stationary volume that carries most of the probability mass of $K^t(0,\cdot)$. (Here, $K$ denotes the transition matrix of the chain.)

We then talked about bottleneck obstacles to rapid mixing. For an (ergodic) Markov chain $(X_t)_t$ with state space $S$ and stationary distribution $\pi$, we defined the bottleneck ratio of a set $U\subseteq S$ as $\Phi(U)\triangleq\mathbb{P}_\pi(X_{t+1}\notin U \,|\, X_t\in U)$. If $\Phi(U)$ is small and $\pi(U)$ is not close to $1$, the chain will have a hard time leaving $U$ and diffusing to the entire space. We made this precise and obtained the lower bound $t_\text{mix}(1-\varepsilon)\geq (\varepsilon - \pi(U))/\Phi(U)$. We will discuss a couple of examples next time.
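To see the $n^2$ time scale concretely, one can track the total variation distance of the lazy walk on $\mathbb{Z}_n$ numerically (a small sketch, not from the course):

import numpy as np

n = 30
K = 0.5 * np.eye(n)
for x in range(n):
    K[x, (x + 1) % n] += 0.25
    K[x, (x - 1) % n] += 0.25
pi = np.full(n, 1.0 / n)
for t in [n**2 // 16, n**2 // 4, n**2]:
    dist = np.linalg.matrix_power(K, t)[0]    # law of X_t started from 0
    print(t, 0.5 * np.abs(dist - pi).sum())   # TV distance decays on the n^2 scale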
Assignment
Please work on your projects. In two weeks, we can spend half the class
discussing your progress and exchanging ideas.
You could, for example, identify what questions you would like to answer,
what subject you would like to explore, or what technique you would like
to learn. You could also try to work out a simple example
and tell us about it as a teaser.
[April 15, 2014]
Summary
We studied the top-to-random card shuffling method
to illustrate a few more ideas about the speed of convergence.
We proved that around $n\log n \pm O(n)$ top-to-random shuffles
are necessary and sufficient to mix a deck of $n$ cards.
For the upper bound, we used what we called a "coupling argument without coupling". We observed that once the original bottom card reaches the top and is inserted back in a random place, the deck is completely random. We formulated this by saying that the first time this happens is a strong stationary time, and showed that strong stationary times can be used in the same way as coupling times to give upper bounds for mixing times.

To get the lower bound, we argued that as long as the original $k$th card from the bottom has not reached the top, the distribution of the chain is far from stationary (provided $1/k!$ is small). We used Chebyshev's inequality to give a lower bound for the first time the original $k$th card from the bottom reaches the top.
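The strong stationary time itself is easy to simulate: track how many cards sit below the original bottom card; each shuffle pushes this count up with probability (count+1)/n, a coupon-collecting pattern. A sketch in Python:

import numpy as np

rng = np.random.default_rng(0)

def strong_stationary_time(n):
    b, t = 0, 0                          # b = cards below the original bottom card
    while b < n - 1:                     # until that card has risen to the top
        t += 1
        if rng.random() < (b + 1) / n:   # top card is inserted below it
            b += 1
    return t + 1                         # one more shuffle randomizes its position

n = 52
samples = [strong_stationary_time(n) for _ in range(5000)]
print(np.mean(samples), n * np.log(n))   # mean is n log n + O(n)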
[April 8, 2014]
Summary
We used coupling arguments to estimate the speed of mixing
in two more examples: the lazy version of the Ehrenfest model
and the Gibbs sampler for the hard-core model at low densities.
These examples are remarkable in that although their state space
is huge ($2^n$ states in case of the Ehrenfest model with $n$ balls),
the Markov chain gets very close to its stationary distribution
in a relatively short period (in $n\log n + O(n)$ steps).
For the lazy version of the Ehrenfest model ($\equiv$ lazy random walk on the hypercube), we found that the mixing time satisfies $t_\text{mix}(\mathrm{e}^{-r})\leq n\log n + rn$. For this, we used a coupling that reduced the problem to the so-called coupon collecting problem. To estimate the tail probabilities for the time to collect all coupons, we tried the expectation bound and the variance bound before we found an argument that gave the desired bound.

For the Gibbs sampler for the hard-core model, we obtained a similar bound in the low-density parameter regime: we found that for $\lambda$ (the fugacity parameter) sufficiently small, the mixing time is bounded as $t_\text{mix}(\mathrm{e}^{-r})\leq \alpha_\lambda (n\log n + r n)$, where $\alpha_\lambda$ is a constant depending on $\lambda$. For the proof, we constructed a coupling and interpreted the disagreements between the two coupled chains as contaminations of some sort and studied the evolution of the set of contaminated vertices.
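The coupling for the hypercube is short to simulate: both chains update the same uniformly chosen coordinate with the same random bit, so each chain is marginally the lazy walk, and the chains merge once every coordinate has been refreshed (the coupon collecting moment). A sketch in Python:

import numpy as np

rng = np.random.default_rng(0)

def coupling_time(n):
    x = np.ones(n, dtype=int)     # chain started from all ones
    y = np.zeros(n, dtype=int)    # chain started from all zeros
    t = 0
    while not np.array_equal(x, y):
        t += 1
        i, b = rng.integers(n), rng.integers(2)
        x[i] = y[i] = b           # same coordinate, same new bit
    return t

n = 100
T = [coupling_time(n) for _ in range(200)]
print(np.mean(T), n * np.log(n))  # coupling time is n log n + O(n)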
Assignment
Update: Think about a topic for your project.
Here are a number of suggestions.
[April 1, 2014]
Summary
In order to use Markov chains in Monte Carlo simulations
or randomized algorithms, we would like to know how rapidly
an ergodic Markov chain approaches its stationary distribution.
More specifically, we would like to know how long we should wait
to be sure that the distribution of the chain is within distance
$\varepsilon$ from the stationary distribution.
The smallest such time is called the mixing time
(for accuracy $\varepsilon$).
One approach to estimating the mixing time is to devise a thought experiment in which two copies of the chain are run together (a coupling). We had already used one such thought experiment to prove the convergence theorem (the two chains move independently before meeting, and move together once they have met). More clever couplings (thought experiments), tailor-made for the particular model at hand, can provide good estimates for the mixing time.

For a lazy random walk on a cycle $\mathbb{Z}_n$, we devised a simple coupling that (with the help of the unfortunate pedestrian principle) gave the upper bound $t_\text{mix}(2^{-r})\leq \frac{r}{2}n^2$ for the mixing time with accuracy $2^{-r}$. Similarly, for a lazy random walk on a $d$-dimensional torus $\mathbb{Z}_n^d$, we obtained the bound $t_\text{mix}(2^{-r})\leq \frac{r}{2}d^2n^2$.
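The coupling for the cycle is easy to simulate: a fair coin decides which of the two chains takes a $\pm 1$ step, so each chain is marginally a lazy walk, while the distance between them performs a simple random walk that the coupling freezes upon hitting $0$ (the unfortunate pedestrian). A sketch in Python:

import numpy as np

rng = np.random.default_rng(0)

def coupling_time(n):
    x, y, t = 0, n // 2, 0            # worst-case starting positions
    while x != y:
        t += 1
        step = 1 if rng.random() < 0.5 else -1
        if rng.random() < 0.5:        # the coin picks which chain moves
            x = (x + step) % n
        else:
            y = (y + step) % n
    return t

n = 20
T = [coupling_time(n) for _ in range(2000)]
print(np.mean(T))                     # about n^2/4 (gambler's ruin from n/2)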
[March 25, 2014]
Summary
We discussed the problem of simulating (sampling) random variables/objects
with a given distribution. This is a basic practical problem, whether we want to
study probabilistic models via simulations or to implement randomized algorithms.
We reviewed how to simulate simple discrete or continuous random variables
using independent coin flips or uniformly random reals in the unit interval.
We then discussed more complicated examples (a random coloring of a graph, a random configuration of the hard-core model) in which the naive approach does not help, because the set of possible values is huge and only implicitly known, and the normalizing factor of the distribution is hard to compute. One approach in such cases is to cook up an easy-to-simulate, fast-mixing Markov chain whose unique stationary distribution is the desired distribution. We demonstrated with examples two typical constructions for such a Markov chain: Gibbs samplers and Metropolis chains. Here is an (inefficient) Mathematica implementation (requires Version 9) of a Gibbs sampler for the hard-core model on the toroidal lattice $\mathbb{Z}_n\times\mathbb{Z}_n$ to play with.
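For readers without Mathematica, here is a rough Python sketch of the same kind of Gibbs sampler (a minimal sketch, not the linked notebook):

import numpy as np

rng = np.random.default_rng(0)

def gibbs_hardcore(n, lam, steps):
    config = np.zeros((n, n), dtype=int)   # start from the empty configuration
    for _ in range(steps):
        i, j = rng.integers(n, size=2)     # pick a uniformly random site
        occupied_nbrs = (config[(i - 1) % n, j] + config[(i + 1) % n, j]
                         + config[i, (j - 1) % n] + config[i, (j + 1) % n])
        # Gibbs update: conditioned on the rest, the site is occupied with
        # probability lam/(1+lam) if all its neighbours are vacant, else vacant.
        if occupied_nbrs == 0 and rng.random() < lam / (1 + lam):
            config[i, j] = 1
        else:
            config[i, j] = 0
    return config

sample = gibbs_hardcore(20, 1.0, 200_000)
print(sample.mean())                       # empirical density of occupied sites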
Assignment
Assignment 7
--- due: April 15, 2014
[March 18, 2014]
Summary
Four simple lemmas clarify the notions of recurrence and transience:
In the second half, we gave a proof of the ergodic theorem for Markov chains:
If $(X_t)_{t\geq 0}$ is an irreducible Markov chain with stationary distribution $\pi$,
then the time averages
\[\frac{f(X_0)+f(X_1)+\cdots+f(X_{t-1})}{t}\]
of any real-valued function $f$ on the state space converge almost surely to
the equilibrium mean $\pi(f)$, provided that $\pi(f)$ is well-defined and finite.
This explains the results of the simulations in problem 4(c,d) of assignment 3.
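The theorem is easy to test numerically; a small sketch in Python (toy chain, not from the assignment):

import numpy as np

rng = np.random.default_rng(0)
K = np.array([[0.1, 0.9, 0.0],    # an irreducible three-state chain
              [0.4, 0.1, 0.5],
              [0.5, 0.5, 0.0]])
w, V = np.linalg.eig(K.T)         # stationary distribution: left eigenvector
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

f = np.array([1.0, -2.0, 5.0])    # an arbitrary observable
x, total, t = 0, 0.0, 10**5
for _ in range(t):
    total += f[x]
    x = rng.choice(3, p=K[x])
print(total / t, pi @ f)          # the time average approaches pi(f)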
Assignment
Assignment 6
--- due: April 1, 2014
[March 11, 2014]
Summary
We gave a (partial) proof of the convergence theorem.
We discussed several possible interpretations of the convergence of a sequence of distributions to another distribution. Two examples:
To prove the convergence theorem, we exploited a thought experiment (a coupling argument): running two copies of the Markov chain simultaneously. If the two chains are suitably coupled, they will eventually end up in total agreement, and we can argue that the distance between their distributions goes to zero. We offered a candidate for such a coupling, but it remained to prove that in this coupling, the two chains eventually merge.
Assignment
Assignment 5
--- due: March 25, 2014
[March 4, 2014]
Summary
We mentioned without proof the convergence theorem for Markov chains:
Any Markov chain (on a countable state space) that is
irreducible and aperiodic and has a stationary distribution approaches
that stationary distribution.
We discussed the problem of finding the stationary distribution(s)
of a Markov chain.
In many cases, the symmetries of the chain allow us to identify a stationary distribution
without much pain. We talked about two types of such symmetries:
Finally, we asked:
Do we get a Markov chain if we reverse the direction of time
in a bi-infinite Markov chain?
We found that the time-reversal of any bi-infinite Markov chain in equilibrium
is itself a Markov chain; if the original chain has transition matrix $K$ and equilibrium distribution $\pi$, the reversed chain has transition matrix $\hat{K}(x,y)=\pi(y)K(y,x)/\pi(x)$.
[February 25, 2014]
Summary
Playing with Markov chains in Mathematica [Notebook (requires Version 9) / CDF / PDF]
We discussed the problem of uniqueness of the stationary distribution:
An irreducible Markov chain has at most one stationary distribution.
For finite-state Markov chains, we gave one proof of this fact.
We then discussed two theorems which give dynamical interpretations of the stationary distribution. The first was Kac's recurrence lemma, which links the stationary distribution of an irreducible Markov chain to the expected return times of its states. This gives an implicit proof of the uniqueness theorem in the case of a countable state space. The other theorem gives a construction of a stationary distribution in case the chain has a positively recurrent state. Together with Kac's lemma, this provides a necessary and sufficient condition for the existence of a stationary distribution:
A Markov chain has a stationary distribution if and only if
it has a positively recurrent state.
(Another necessary and sufficient condition was given in the last exercises.)
Notes on uniqueness and return times. (See also the notes on existence below.)
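Kac's lemma ($\mathbb{E}_x[\text{return time to }x]=1/\pi(x)$ for an irreducible chain with stationary distribution $\pi$) is easy to check by simulation; a sketch in Python on a toy chain:

import numpy as np

rng = np.random.default_rng(0)
K = np.array([[0.5, 0.5, 0.0],    # an irreducible three-state chain
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
w, V = np.linalg.eig(K.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

def mean_return_time(x0, trials=20_000):
    total = 0
    for _ in range(trials):
        x, t = x0, 0
        while True:
            x = rng.choice(3, p=K[x])
            t += 1
            if x == x0:
                break
        total += t
    return total / trials

print(mean_return_time(0), 1 / pi[0])   # the two numbers nearly agree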
Assignment
Assignment 4
--- due: March 11, 2014
[February 18, 2014]
Summary
We discussed the problem of existence of stationary distributions.
We first made the setting explicit:
Every Markov chain on a finite state space
has at least one stationary distribution.
Notes on the existence of stationary distribution.
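One standard existence argument (not necessarily the one in the notes) takes Cesàro averages of the distributions along the orbit and extracts a limit point; numerically, the averages already exhibit the phenomenon, even for a periodic chain where the distributions themselves do not converge. A minimal sketch in Python:

import numpy as np

K = np.array([[0.0, 1.0, 0.0],    # a deterministic 3-cycle: K^t(x,.) oscillates
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
t = 999
avg = sum(np.linalg.matrix_power(K, s)[0] for s in range(t)) / t
print(avg)                        # ~ [1/3, 1/3, 1/3]
print(avg @ K - avg)              # ~ 0: the Cesaro average is stationary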
[February 11, 2014]
Summary
We considered two examples:
Assignment
Assignment 2
--- due: February 25, 2014
[February 4, 2014]
Summary
We discussed a few examples:
Assignment
Assignment 1
--- due: February 18, 2014
Last Update: June 28, 2014