Modern Theory of Markov Chains
This page is for the course in 2014. Are you looking for information about last year's course?
Time and location
Description
How many times do we need to shuffle a deck of $n$ cards in order to make them evenly mixed? Will a population of bacteria survive or eventually go extinct? Will a particle jumping randomly from one place to another eventually return to its original whereabouts? How quickly does a Monte Carlo simulation converge (if at all)? Motivated by modern applications and guided by examples, we develop the basic theory of Markov chains and familiarize ourselves with various mathematical techniques invented to analyze them.
This course is part of the master's program Stochastics and Financial Mathematics, but local students from Utrecht are also welcome. The local code for the course is WISS123.
Prerequisites
Working knowledge of the basic concepts of probability theory (at the level covered by a first course in probability) is required. Mathematical maturity and computer skills are desirable.
Literature
Course webpage
http://www.staff.science.uu.nl/~taati001/markov/
Instructor
Siamak Taati
Evaluation
Practical information
Weekly outline
[June 26, 2014]
Summary
This was the last class, dedicated to project presentations.
Marvin told us the fascinating story of riffle shuffling.
Wout talked about Markov random fields and an application
to image restoration. Tineke presented elegant proofs of
inequalities from percolation theory using couplings
and Markov chains.
[June 19, 2014]
Summary
One class of reversible Markov chains for
which the eigenvalues and eigenvectors are easy to find
consists of random walks on finite commutative groups.
The eigenvectors of a random walk on a finite commutative group $(\mathbb{G},+)$
with jump distribution $p:\mathbb{G}\to[0,1]$
turn out to be
the characters
of the group $\mathbb{G}$ (i.e., the group homomorphisms from $\mathbb{G}$
into the multiplicative group of $\mathbb{C}$), irrespective of the jump distribution.
The eigenvalue corresponding to a character $\gamma$
is $\hat{p}(\gamma)\triangleq\sum_{x\in\mathbb{G}}p(x)\gamma(x)$.
We gave a minimal review of Fourier analysis on finite commutative groups
enough to understand and prove the above claim.
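For completeness, here is the one-line verification, sketched in terms of the transition matrix $K(x,y)\triangleq p(y-x)$ of the walk: for every $y\in\mathbb{G}$,
$$(K\gamma)(y)=\sum_{x\in\mathbb{G}}p(x-y)\gamma(x)=\sum_{z\in\mathbb{G}}p(z)\gamma(y+z)=\Big(\sum_{z\in\mathbb{G}}p(z)\gamma(z)\Big)\gamma(y)=\hat{p}(\gamma)\,\gamma(y)\;,$$
where the third equality uses the homomorphism property $\gamma(y+z)=\gamma(y)\gamma(z)$.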
As an example, the characters of $\mathbb{Z}_n$ are the functions $\gamma_k(x)\triangleq\mathrm{e}^{2\pi i kx/n}$ ($k=0,1,\ldots,n-1$). For a lazy simple random walk on $\mathbb{Z}_n$, the eigenvalue corresponding to $\gamma_k$ is $\lambda_k\triangleq\cos^2\frac{k\pi}{n}$. Using the spectral bound we derived last time and a bit of calculation, we obtained the bound $t_\text{mix}(\varepsilon)=O(n^2)$ for its mixing time.

The characters of the product group $\mathbb{Z}_2\times\mathbb{Z}_2\times\cdots\times\mathbb{Z}_2$ ($n$ times) are $$\gamma_A(x_1x_2\cdots x_n)\triangleq (-1)^{\sum_{i\in A} x_i}$$ for $A\subseteq\{1,2,\ldots,n\}$. For the lazy version of the Ehrenfest model, the eigenvalue corresponding to $\gamma_A$ is $\lambda_A\triangleq 1-\frac{k}{n}\leq \mathrm{e}^{-\frac{k}{n}}$, where $k\triangleq|A|$. Using the spectral bound, we obtained the long-awaited upper bound $$t_\text{mix}(\varepsilon)\leq\frac{1}{2}n\log n + O(n)$$ for the mixing time of the Markov chain. This matches the lower bound we had found earlier, hence demonstrating the presence of the cut-off phenomenon in the Ehrenfest model.
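As a quick numerical sanity check of the $\mathbb{Z}_n$ example (not part of the lecture), one can compare the eigenvalues of the transition matrix of the lazy simple random walk with the predicted values $\cos^2\frac{k\pi}{n}$; a minimal sketch in Python (assuming numpy is available):

import numpy as np

n = 12
K = 0.5 * np.eye(n)                # lazy: stay put with probability 1/2
for x in range(n):
    K[x, (x + 1) % n] += 0.25      # step clockwise with probability 1/4
    K[x, (x - 1) % n] += 0.25      # step counterclockwise with probability 1/4

computed = np.sort(np.linalg.eigvals(K).real)
predicted = np.sort([np.cos(k * np.pi / n) ** 2 for k in range(n)])
print(np.allclose(computed, predicted))   # True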
[June 12, 2014]
Summary
We recalled that the transition matrix of a reversible Markov chain
defines a self-adjoint operator on the space
of complex-valued functions $f:S\to\mathbb{C}$.
As a consequence, its eigenvalues are all real
and the corresponding eigenvectors form an orthonormal
basis for $\mathbb{C}^S$ with respect to the inner product weighted by the stationary distribution (the spectral theorem).
Using a bit of linear algebra, we derived an upper bound $$4\|\mathbb{P}_x(X_t\in\cdot)-\pi\|^2 \leq \sum_{i=2}^n \lambda_i^{2t}\varphi_i(x)^2$$ for the total variation distance between the distribution of the chain at time $t$ and the reversible stationary distribution. Here, $1=\lambda_1,\lambda_2,\ldots,\lambda_n$ are the eigenvalues of the transition matrix and $1\equiv\varphi_1,\varphi_2,\ldots,\varphi_n$ are the corresponding eigenvectors, forming an orthonormal basis. Note that if the chain is irreducible and aperiodic, the eigenvalues $\lambda_2,\ldots,\lambda_n$ (all but the largest) are in $(-1,1)$, so the bound tends to $0$ as $t\to\infty$.

In the case of a random walk on a finite Abelian group, we could use the symmetry to simplify the bound, obtaining $$4\|\mathbb{P}_x(X_t\in\cdot)-\pi\|^2 \leq \sum_{i=2}^n \lambda_i^{2t} \;.$$ The above bound was originally found and exploited by Diaconis and Shahshahani and is quite tight, although it requires knowledge of all the eigenvalues and eigenvectors, which are often difficult to find. The second-largest eigenvalue has the dominant effect, and weaker bounds can be given in terms of this eigenvalue alone.
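As a numerical illustration (a toy check, not from the course), one can verify the first bound on a small reversible chain by diagonalizing the symmetrized matrix $D^{1/2}KD^{-1/2}$, where $D$ is the diagonal matrix of $\pi$; a sketch in Python:

import numpy as np

# Lazy random walk on a path with 5 vertices; stationary law ~ degrees.
n = 5
K = np.zeros((n, n))
for x in range(n):
    K[x, x] += 0.5
    nbrs = [y for y in (x - 1, x + 1) if 0 <= y < n]
    for y in nbrs:
        K[x, y] += 0.5 / len(nbrs)
pi = np.array([1, 2, 2, 2, 1]) / 8.0

d = np.sqrt(pi)
A = K * d[:, None] / d[None, :]        # symmetric iff K is reversible w.r.t. pi
lam, V = np.linalg.eigh(A)
order = np.argsort(lam)[::-1]          # lam[0] = 1 after reordering
lam, phi = lam[order], (V / d[:, None])[:, order]  # phi: orthonormal in L^2(pi)

x, t = 0, 10
tv = 0.5 * np.abs(np.linalg.matrix_power(K, t)[x] - pi).sum()
rhs = sum(lam[i] ** (2 * t) * phi[x, i] ** 2 for i in range(1, n))
print(4 * tv ** 2 <= rhs)              # True: the spectral bound holds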
[June 5, 2014]
Summary
Tineke and Wout told us about their progress on their projects.
[May 29, 2014]
Summary
Holiday!
[May 22, 2014]
Summary
No class --- I will be away the whole week.
[May 15, 2014]
Summary
We first finished the derivation of the promised lower bound
$t_\text{mix}(1-\varepsilon) \geq \frac{1}{2}n\log n - O(n)$
for the mixing time of a lazy random walk on the hypercube $\{0,1\}^n$.
We then started with the algebraic/geometric approach to studying Markov chains. A Markov chain with transition matrix $K:S\times S\to[0,1]$ can be understood in terms of iterations of the linear operator $u\in\mathbb{C}^S\mapsto uK$ on probability distributions, or of its transpose $f\in\mathbb{C}^S\mapsto Kf$ on observables $f:S\to\mathbb{C}$. The standard way to understand the iterates of a linear map is through its eigenvalues and eigenfunctions. The starting point is the Perron-Frobenius theorem (tailor-made for finite-state Markov chains): for an irreducible and aperiodic chain, $1$ is a simple eigenvalue of $K$ whose left eigenvector can be chosen to be a (positive) probability distribution, and every other eigenvalue has modulus strictly less than $1$.
This way of looking at Markov chains becomes particularly neat and powerful in the case of reversible Markov chains. In algebraic language, the fact that $K$ is reversible with respect to a (positive) distribution $\pi$ translates to saying that $K$ is self-adjoint with respect to the inner product $\langle f,g\rangle_\pi\triangleq \sum_x \pi(x)f(x)\overline{g(x)}$. That $K$ is self-adjoint means $\langle Kf,g\rangle_\pi=\langle f, Kg\rangle_\pi$. If $K$ is self-adjoint, the spectral theorem (simple proof) is applicable: the eigenvalues of $K$ are all real, and $\mathbb{C}^S$ has an orthonormal basis (with respect to $\langle\cdot,\cdot\rangle_\pi$) consisting of eigenvectors of $K$.
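A quick numerical check of this equivalence (a sketch, using a birth-and-death chain, which is automatically reversible):

import numpy as np

rng = np.random.default_rng(1)
K = np.array([[0.5, 0.5, 0.0, 0.0],      # a birth-and-death chain on {0,1,2,3}
              [0.3, 0.2, 0.5, 0.0],
              [0.0, 0.3, 0.2, 0.5],
              [0.0, 0.0, 0.3, 0.7]])
w = np.ones(4)                            # solve detailed balance along the path
for i in range(1, 4):
    w[i] = w[i - 1] * K[i - 1, i] / K[i, i - 1]
pi = w / w.sum()

inner = lambda f, g: np.sum(pi * f * np.conj(g))     # <f,g>_pi
f, g = rng.standard_normal(4), rng.standard_normal(4)
print(np.isclose(inner(K @ f, g), inner(f, K @ g)))  # True: K is self-adjoint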
[May 8, 2014]
Summary
We had a discussion about the projects.
Wout is working on Markov random fields, which are similar to Markov chains
except that the random variables are indexed by the vertices of a graph.
He mentioned a representation of the probabilities using "energy functions".
He had also found an interesting application of Markov random fields for image enhancement.
Marvin told us about various card-shuffling schemes and his experiments
with them. Riffle shuffling seems to mix the cards very quickly,
whereas the more usual lazy method takes around 2500 shuffles to mix the deck!
Tineke is working on examples with no apparent connection to Markov chains,
where reasoning with Markov chains helps solve the problem.
She is using an article by Olle Häggström as a reference, in which he presents some examples,
but she is also looking for other examples.
In the remaining time, we proceeded further with the calculation of a lower bound for the mixing time of a lazy random walk on the hypercube.
Assignment
The discussions about the projects were quite fun! Let us do this again
in four weeks (June 5).
[May 1, 2014]
Summary
We continued talking about lower bounds for mixing times.
We first gave two examples of Markov chains with clear (somewhat exaggerated) bottlenecks. The first example was a lazy random walk on a graph obtained by connecting two complete graphs (of sizes $k$ and $n-k$) with a single edge. If $U$ is the set of vertices in the first complete subgraph, the bottleneck ratio of $U$ turned out to be $\Phi(U)=\frac{1}{1+k(k+1)}$. For $k\leq n/2$, this gives the bound $t_{\text{mix}}(\frac{1}{2}-\varepsilon) \geq \frac{\varepsilon}{\Phi(U)} = \varepsilon(1+k(k+1))$.

The second example was the Gibbs sampler for the hard-core model with parameter $\lambda$ on a complete bipartite graph with two parts $A$ and $B$ of sizes $k$ and $n-k$. This has a bottleneck between the configurations that have particles on $A$ and the configurations that have particles on $B$. We calculated the bottleneck ratio $\Phi(U)=\frac{k}{n}\frac{\lambda}{1+\lambda}\frac{1}{(1+\lambda)^k - 1}$ for the set $U$ of all configurations that have particles on $A$. For $k=n-k=n/2$, this gave the lower bound $t_{\text{mix}}(\frac{1}{2}-\varepsilon) \geq \frac{\varepsilon}{\Phi(U)} \approx c (1+\lambda)^{n/2}$, with some positive constant $c$ depending on $\varepsilon$ and $\lambda$.

We then started analyzing the lazy random walk on the hypercube (the Ehrenfest model). We devised a strategy for obtaining a lower bound (by examining the concentration of the number of $1$s) but left the calculations for next week.
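The first example is easy to explore numerically; here is a sketch in Python that computes the bottleneck ratio of the first clique directly from the definition (with laziness implemented as holding probability $\frac{1}{2}$; the exact constant depends on such conventions):

import numpy as np
from itertools import combinations

k, m = 4, 8                                   # sizes of the two cliques
n = k + m
A = np.zeros((n, n))                          # adjacency matrix
for i, j in combinations(range(k), 2):
    A[i, j] = A[j, i] = 1
for i, j in combinations(range(k, n), 2):
    A[i, j] = A[j, i] = 1
A[0, k] = A[k, 0] = 1                         # the single connecting edge

deg = A.sum(axis=1)
K = 0.5 * np.eye(n) + 0.5 * A / deg[:, None]  # lazy random walk on the graph
pi = deg / deg.sum()                          # stationary law ~ degrees
U = range(k)                                  # the first clique
flow = sum(pi[x] * K[x, y] for x in U for y in range(k, n))
print(flow / pi[list(U)].sum())               # Phi(U): of order 1/k^2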
[April 22, 2014]
Place: Minnaert 022
Summary
We talked about lower bounds for mixing times.
As an example, we found a lower bound $t_\text{mix}(1-\varepsilon)\geq \frac{\varepsilon^3}{16}n^2 - O(n)$ for the mixing time of a lazy simple random walk on $\mathbb{Z}_n$. This matches (at least in the order of magnitude) the upper bound $t_\text{mix}(\varepsilon)\leq (\frac{1}{4}\log_2\frac{1}{\varepsilon})n^2$ we had found earlier using the coupling method. The idea was to find a test set $A$ with small stationary volume that carries most of the probability mass of $K^t(0,\cdot)$. (Here, $K$ denotes the transition matrix of the chain.)

We then talked about bottleneck obstacles to rapid mixing. For an (ergodic) Markov chain $(X_t)_t$ with state space $S$ and stationary distribution $\pi$, we defined the bottleneck ratio of a set $U\subseteq S$ as $\Phi(U)\triangleq\mathbb{P}_\pi(X_{t+1}\notin U \,|\, X_t\in U)$. If $\Phi(U)$ is small and $\pi(U)$ is not close to $1$, the chain will have a hard time leaving $U$ and diffusing to the entire space. We made this precise and obtained the lower bound $t_\text{mix}(1-\varepsilon)\geq (\varepsilon - \pi(U))/\Phi(U)$. We will discuss a couple of examples next time.
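To see the $n^2$ time scale concretely, one can track the total variation distance of the lazy walk on $\mathbb{Z}_n$ numerically (a small sketch, not from the course):

import numpy as np

n = 30
K = 0.5 * np.eye(n)
for x in range(n):
    K[x, (x + 1) % n] += 0.25
    K[x, (x - 1) % n] += 0.25
pi = np.full(n, 1.0 / n)
for t in [n**2 // 16, n**2 // 4, n**2]:
    dist = np.linalg.matrix_power(K, t)[0]    # law of X_t started from 0
    print(t, 0.5 * np.abs(dist - pi).sum())   # TV distance decays on the n^2 scale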
Assignment
Please work on your projects. In two weeks, we can spend half the class
discussing your progress and exchanging ideas.
You could, for example, identify what questions you would like to answer,
what subject you would like to explore, or what technique you would like
to learn. You could also try to work out a simple example
and tell us about it as a teaser.
[April 15, 2014]
Summary
We studied the top-to-random card shuffling method
to illustrate a few more ideas about the speed of convergence.
We proved that around $n\log n \pm O(n)$ top-to-random shuffles
are necessary and sufficient to mix a deck of $n$ cards.
For the upper bound, we used what we called a "coupling argument without coupling". We observed that once the original bottom card reaches the top and is inserted back in a random place, the deck is completely random. We formulated this by saying that the first time this happens is a strong stationary time, and showed that strong stationary times can be used in the same way as coupling times to give upper bounds for mixing times.

To get the lower bound, we argued that as long as the original $k$th card from the bottom has not reached the top, the distribution of the chain is far from stationary (provided $1/k!$ is small). We used Chebyshev's inequality to give a lower bound for the first time the original $k$th card from the bottom reaches the top.
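The strong stationary time itself is easy to simulate: track how many cards sit below the original bottom card; each shuffle pushes this count up with probability (count+1)/n, a coupon-collecting pattern. A sketch in Python:

import numpy as np

rng = np.random.default_rng(0)

def strong_stationary_time(n):
    b, t = 0, 0                          # b = cards below the original bottom card
    while b < n - 1:                     # until that card has risen to the top
        t += 1
        if rng.random() < (b + 1) / n:   # top card is inserted below it
            b += 1
    return t + 1                         # one more shuffle randomizes its position

n = 52
samples = [strong_stationary_time(n) for _ in range(5000)]
print(np.mean(samples), n * np.log(n))   # mean is n log n + O(n)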
[April 8, 2014]
Summary
We used coupling arguments to estimate the speed of mixing
in two more examples: the lazy version of the Ehrenfest model
and the Gibbs sampler for the hard-core model at low densities.
These examples are remarkable in that although their state space
is huge ($2^n$ states in case of the Ehrenfest model with $n$ balls),
the Markov chain gets very close to its stationary distribution
in a relatively short period (in $n\log n + O(n)$ steps).
For the lazy version of the Ehrenfest model ($\equiv$ lazy random walk on the hypercube), we found that the mixing time satisfies $t_\text{mix}(\mathrm{e}^{-r})\leq n\log n + rn$. For this, we used a coupling that reduced the problem to the so-called coupon collecting problem. To estimate the tail probabilities for the time to collect all coupons, we tried the expectation bound and the variance bound before we found an argument that gave the desired bound.

For the Gibbs sampler for the hard-core model, we obtained a similar bound in the low-density parameter regime: we found that for $\lambda$ (the fugacity parameter) sufficiently small, the mixing time is bounded as $t_\text{mix}(\mathrm{e}^{-r})\leq \alpha_\lambda (n\log n + r n)$, where $\alpha_\lambda$ is a constant depending on $\lambda$. For the proof, we constructed a coupling and interpreted the disagreements between the two coupled chains as contaminations of some sort and studied the evolution of the set of contaminated vertices.
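The coupling for the hypercube is short to simulate: both chains update the same uniformly chosen coordinate with the same random bit, so each chain is marginally the lazy walk, and the chains merge once every coordinate has been refreshed (the coupon collecting moment). A sketch in Python:

import numpy as np

rng = np.random.default_rng(0)

def coupling_time(n):
    x = np.ones(n, dtype=int)     # chain started from all ones
    y = np.zeros(n, dtype=int)    # chain started from all zeros
    t = 0
    while not np.array_equal(x, y):
        t += 1
        i, b = rng.integers(n), rng.integers(2)
        x[i] = y[i] = b           # same coordinate, same new bit
    return t

n = 100
T = [coupling_time(n) for _ in range(200)]
print(np.mean(T), n * np.log(n))  # coupling time is n log n + O(n)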
Assignment
Update: Think about a topic for your project.
Here are a number of suggestions.
[April 1, 2014]
Summary
In order to use Markov chains in Monte Carlo simulations
or randomized algorithms, we would like to know how rapidly
an ergodic Markov chain approaches its stationary distribution.
More specifically, we would like to know how long we should wait
to be sure that the distribution of the chain is within distance
$\varepsilon$ from the stationary distribution.
The smallest such time is called the mixing time
(for accuracy $\varepsilon$).
One approach to estimating the mixing time is to devise a thought experiment in which two copies of the chain are run together (a coupling). We had already used one such thought experiment to prove the convergence theorem (the two chains move independently before meeting, and move together once they have met). More clever couplings (thought experiments), tailor-made for the particular model at hand, can provide good estimates for the mixing time.

For a lazy random walk on a cycle $\mathbb{Z}_n$, we devised a simple coupling that (with the help of the unfortunate pedestrian principle) gave the upper bound $t_\text{mix}(2^{-r})\leq \frac{r}{2}n^2$ for the mixing time with accuracy $2^{-r}$. Similarly, for a lazy random walk on a $d$-dimensional torus $\mathbb{Z}_n^d$, we obtained the bound $t_\text{mix}(2^{-r})\leq \frac{r}{2}d^2n^2$.
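The coupling for the cycle is easy to simulate: a fair coin decides which of the two chains takes a $\pm 1$ step, so each chain is marginally a lazy walk, while the distance between them performs a simple random walk that the coupling freezes upon hitting $0$ (the unfortunate pedestrian). A sketch in Python:

import numpy as np

rng = np.random.default_rng(0)

def coupling_time(n):
    x, y, t = 0, n // 2, 0            # worst-case starting positions
    while x != y:
        t += 1
        step = 1 if rng.random() < 0.5 else -1
        if rng.random() < 0.5:        # the coin picks which chain moves
            x = (x + step) % n
        else:
            y = (y + step) % n
    return t

n = 20
T = [coupling_time(n) for _ in range(2000)]
print(np.mean(T))                     # about n^2/4 (gambler's ruin from n/2)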
[March 25, 2014]
Summary
We discussed the problem of simulating (sampling) random variables/objects
with a given distribution. This is a basic practical problem, whether we want to
study probabilistic models via simulations or to implement randomized algorithms.
We reviewed how to simulate simple discrete or continuous random variables
using independent coin flips or uniformly random reals in the unit interval.
We then discussed more complicated examples (a random coloring of a graph, a random configuration of the hard-core model) in which the naive approach does not help, because the set of possible values is huge and only implicitly known, and the normalizing factor of the distribution is hard to compute. One approach in such cases is to cook up an easy-to-simulate, fast-mixing Markov chain whose unique stationary distribution is the desired distribution. We demonstrated with examples two typical constructions for such a Markov chain: Gibbs samplers and Metropolis chains. Here is an (inefficient) Mathematica implementation (requires Version 9) of a Gibbs sampler for the hard-core model on the toroidal lattice $\mathbb{Z}_n\times\mathbb{Z}_n$ to play with.
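For readers without Mathematica, here is a rough Python sketch of the same kind of Gibbs sampler (a minimal sketch, not the linked notebook):

import numpy as np

rng = np.random.default_rng(0)

def gibbs_hardcore(n, lam, steps):
    config = np.zeros((n, n), dtype=int)   # start from the empty configuration
    for _ in range(steps):
        i, j = rng.integers(n, size=2)     # pick a uniformly random site
        occupied_nbrs = (config[(i - 1) % n, j] + config[(i + 1) % n, j]
                         + config[i, (j - 1) % n] + config[i, (j + 1) % n])
        # Gibbs update: conditioned on the rest, the site is occupied with
        # probability lam/(1+lam) if all its neighbours are vacant, else vacant.
        if occupied_nbrs == 0 and rng.random() < lam / (1 + lam):
            config[i, j] = 1
        else:
            config[i, j] = 0
    return config

sample = gibbs_hardcore(20, 1.0, 200_000)
print(sample.mean())                       # empirical density of occupied sites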
Assignment
Assignment 7
--- due: April 15, 2014
[March 18, 2014]
Summary
Four simple lemmas clarify the notions of recurrence and transience:
In the second half, we gave a proof of the ergodic theorem for Markov chains:
If $(X_t)_{t\geq 0}$ is an irreducible Markov chain with stationary distribution $\pi$,
then the time averages
\[\frac{f(X_0)+f(X_1)+\cdots+f(X_{t-1})}{t}\]
of any real-valued function $f$ on the state space converge almost surely to
the equilibrium mean $\pi(f)$, provided that $\pi(f)$ is well-defined and finite.
This explains the results of the simulations in problem 4(c,d) of assignment 3.
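The theorem is easy to test numerically; a small sketch in Python (toy chain, not from the assignment):

import numpy as np

rng = np.random.default_rng(0)
K = np.array([[0.1, 0.9, 0.0],    # an irreducible three-state chain
              [0.4, 0.1, 0.5],
              [0.5, 0.5, 0.0]])
w, V = np.linalg.eig(K.T)         # stationary distribution: left eigenvector
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

f = np.array([1.0, -2.0, 5.0])    # an arbitrary observable
x, total, t = 0, 0.0, 10**5
for _ in range(t):
    total += f[x]
    x = rng.choice(3, p=K[x])
print(total / t, pi @ f)          # the time average approaches pi(f)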
Assignment
Assignment 6
--- due: April 1, 2014
[March 11, 2014]
Summary
We gave a (partial) proof of the convergence theorem.
We discussed several possible interpretations of the convergence of a sequence of distributions to another distribution. Two examples:
To prove the convergence theorem, we exploited a thought experiment (a coupling argument): running two copies of the Markov chain simultaneously. If the two chains are suitably coupled, they will eventually end up in total agreement, and we can argue that the distance between their distributions goes to zero. We offered a candidate for such a coupling, but it remained to prove that in this coupling, the two chains eventually merge.
Assignment
Assignment 5
--- due: March 25, 2014
[March 4, 2014]
Summary
We mentioned without proof the convergence theorem for Markov chains:
Any Markov chain (on a countable state space) that is
irreducible and aperiodic and has a stationary distribution approaches
that stationary distribution.
We discussed the problem of finding the stationary distribution(s)
of a Markov chain.
In many cases, the symmetries of the chain allow us to identify a stationary distribution
without much pain. We talked about two types of such symmetries:
Finally, we asked:
Do we get a Markov chain if we reverse the direction of time
in a bi-infinite Markov chain?
We found that the time-reversal of any bi-infinite Markov chain in equilibrium
is itself a Markov chain; if the original chain has transition matrix $K$ and equilibrium distribution $\pi$, the reversed chain has transition matrix $\hat{K}(x,y)=\pi(y)K(y,x)/\pi(x)$.
[February 25, 2014]
Summary
Playing with Markov chains in Mathematica [Notebook (requires Version 9) / CDF / PDF]
We discussed the problem of uniqueness of the stationary distribution:
An irreducible Markov chain has at most one stationary distribution.
For finite-state Markov chains, we gave one proof of this fact.
We then discussed two theorems which give dynamical interpretations of the stationary distribution. The first was Kac's recurrence lemma, which links the stationary distribution of an irreducible Markov chain to the expected return times of its states. This gives an implicit proof of the uniqueness theorem in the case of a countable state space. The other theorem gives a construction of a stationary distribution in case the chain has a positively recurrent state. Together with Kac's lemma, this provides a necessary and sufficient condition for the existence of a stationary distribution:
A Markov chain has a stationary distribution if and only if
it has a positively recurrent state.
(Another necessary and sufficient condition was given in the last exercises.)
Notes on uniqueness and return times. (See also the notes on existence below.)
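Kac's lemma ($\mathbb{E}_x[\text{return time to }x]=1/\pi(x)$ for an irreducible chain with stationary distribution $\pi$) is easy to check by simulation; a sketch in Python on a toy chain:

import numpy as np

rng = np.random.default_rng(0)
K = np.array([[0.5, 0.5, 0.0],    # an irreducible three-state chain
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
w, V = np.linalg.eig(K.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

def mean_return_time(x0, trials=20_000):
    total = 0
    for _ in range(trials):
        x, t = x0, 0
        while True:
            x = rng.choice(3, p=K[x])
            t += 1
            if x == x0:
                break
        total += t
    return total / trials

print(mean_return_time(0), 1 / pi[0])   # the two numbers nearly agree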
Assignment
Assignment 4
--- due: March 11, 2014
[February 18, 2014]
Summary
We discussed the problem of existence of stationary distributions.
We first made the setting explicit:
Every Markov chain on a finite state space
has at least one stationary distribution.
Notes on the existence of stationary distribution.
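One standard existence argument (not necessarily the one in the notes) takes Cesàro averages of the distributions along the orbit and extracts a limit point; numerically, the averages already exhibit the phenomenon, even for a periodic chain where the distributions themselves do not converge. A minimal sketch in Python:

import numpy as np

K = np.array([[0.0, 1.0, 0.0],    # a deterministic 3-cycle: K^t(x,.) oscillates
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
t = 999
avg = sum(np.linalg.matrix_power(K, s)[0] for s in range(t)) / t
print(avg)                        # ~ [1/3, 1/3, 1/3]
print(avg @ K - avg)              # ~ 0: the Cesaro average is stationary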
[February 11, 2014]
Summary
We considered two examples:
Assignment
Assignment 2
--- due: February 25, 2014
[February 4, 2014]
Summary
We discussed a few examples:
Assignment
Assignment 1
--- due: February 18, 2014
Last Update: June 28, 2014