Alireza Kazemipour

Some reminders on Linear Algebra

2026-03-22T00:00:00-07:00

In this post I’ll mention a handful of reminders on Linear Algebra.

We know that for an eigenvalue $\lambda$ of a matrix $A$ we have that $Ax = \lambda x$, or equivalently $(A - \lambda I)x = 0$. To find $\lambda$, we solve $\mathrm{det}(A - \lambda I) = 0$. But, why?

The reason is that in order to find non-zero solutions of $(A - \lambda I)x = 0$, $(A - \lambda I)$ must be non-invertible, hence its determinant should be zero!

Facts:

$\mathrm{trace}(A) = \lambda_1 + \lambda_2 + \dots$
$\mathrm{det}(A) = \lambda_1 \cdot \lambda_2 \dots$
Why $\mathrm{Det}(A) = \mathrm{Det}(A^\top)$ for any matrix $A$?

We know that $PA = LU$ (using factorization by elimination), where $P$ is a permutation matrix (each row and each column has exactly one entry equal to one and the rest are zero), $L$ is a lower triangular matrix with its diagonal all equal to one, and $U$ is an upper triangular matrix with the pivots of $A$ on its diagonal. Taking transposes, $(PA)^\top = (LU)^\top$, we have that $A^\top P^\top = U^\top L^\top$. So,

\[\begin{align*} \mathrm{Det}(PA) & = \mathrm{Det}(P)\cdot \mathrm{Det}(A) = \mathrm{Det}(L)\cdot \mathrm{Det}(U), \quad \mathrm{and} \\ \mathrm{Det}\left((PA)^\top\right) & = \mathrm{Det}\left(A^\top\right) \cdot \mathrm{Det}\left(P^\top\right) = \mathrm{Det}\left(U^\top\right)\cdot \mathrm{Det}\left(L^\top\right). \end{align*}\]

Since $L$ and $U$ are triangular, their determinants are the products of their diagonal entries, which transposition does not change, so $\mathrm{Det}(L) = \mathrm{Det}\left(L^\top\right)$ and $\mathrm{Det}(U) = \mathrm{Det}\left(U^\top\right)$. If $\mathrm{Det}(P) = \mathrm{Det}\left(P^\top\right)$, then, because $\mathrm{Det}(P) \neq 0$, it must be the case that $\mathrm{Det}(A) = \mathrm{Det}\left(A^\top\right)$. Now we prove that $\mathrm{Det}(P) = \mathrm{Det}\left(P^\top\right)$:

Let $\pi \in S_n$ be a permutation and $P_\pi$ be the permutation matrix that is obtained by applying $\pi$ to the identity matrix. Then, $(P_\pi)_{ij} = 1$ if $i = \pi(j)$, and otherwise zero. Using the Leibniz formula we have:

\[\mathrm{Det}(P_\pi) = \sum_{\sigma \in S_n}\mathrm{sgn}(\sigma) \prod_{i = 1}^n (P_\pi)_{i\sigma(i)}.\]

The only way the inner product is non-zero is if $i = \pi(\sigma(i))$ for every $i$, that is, $\sigma = \pi^{-1}$.
Hence, $\mathrm{Det}(P_\pi) = \mathrm{sgn}(\pi^{-1}) = \mathrm{sgn}(\pi)$. Here $\mathrm{sgn}(\pi)$ is equal to $(-1)^k$,
where $k$ is the number of inversions of $\pi$, i.e., the number of pairs $i < j$ with $\pi(i) > \pi(j)$. Hence, $\mathrm{Det}(P_\pi) = \pm 1$.
The transpose $P^\top$ is also a permutation matrix, so $\mathrm{Det}(P^\top) = \pm 1$ as well. Using the identity $PP^\top = I$, we have that $\mathrm{Det}(P)\cdot\mathrm{Det}(P^\top) = 1$.
Two numbers in {$-1, +1$} whose product is one must be equal. Hence, $\mathrm{Det}(P)$ must be equal to $\mathrm{Det}\left(P^\top\right)$, so the identity holds. $\square$
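To convince myself of the argument above, here is a minimal sketch that evaluates the Leibniz formula by brute force and compares a matrix with its transpose. The helper names (`sign`, `det`, `permutation_matrix`) and the example matrix `A` are mine, chosen only for illustration, and the approach is only sensible for tiny matrices because it loops over all $n!$ permutations.

```python
from itertools import permutations


def sign(perm):
    """Sign of a permutation given as a tuple: (-1) ** (number of inversions)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1


def det(matrix):
    """Determinant via the Leibniz formula (fine for small n only)."""
    n = len(matrix)
    total = 0
    for sigma in permutations(range(n)):
        term = sign(sigma)
        for i in range(n):
            term *= matrix[i][sigma[i]]
        total += term
    return total


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def permutation_matrix(pi):
    """Matrix whose entry (i, j) is 1 exactly when i == pi[j]."""
    n = len(pi)
    return [[1 if i == pi[j] else 0 for j in range(n)] for i in range(n)]


# Every 4 x 4 permutation matrix satisfies det(P) = sgn(pi) = det(P^T).
for pi in permutations(range(4)):
    P = permutation_matrix(pi)
    assert det(P) == sign(pi) == det(transpose(P))

# A hypothetical integer matrix, just to see the identity on a non-permutation.
A = [[2, -1, 0], [4, 3, 5], [1, 1, -2]]
print(det(A), det(transpose(A)))  # -35 -35
```

Using the inversion count for the sign keeps the check independent of the determinant itself, so the loop over permutation matrices is a genuine test rather than a tautology.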

Only full-rank matrices are invertible.

Because if a square matrix isn’t full-rank, then it has a nontrivial null space, which means zero is one of its eigenvalues,
which means the determinant of the matrix (the product of the eigenvalues) is zero.

If $A$ is an $n \times m$ matrix and $B$ is an $m \times k$ matrix, then

\[\begin{equation*} \mathsf{rank}(AB) \leq \min\{\mathsf{rank}(A), \mathsf{rank}(B)\}. \end{equation*}\]

Let $S$ be the column space of $AB$, that is, the space of all vectors that are linear combinations of the columns of $AB$. Let $s \in S$, then there exists a $k\times 1$ vector $v$ such that

\[\begin{equation*} s = (AB)v. \end{equation*}\]

By the associativity of the matrix-vector product, we can write

\[\begin{equation*} s = A(Bv), \end{equation*}\]

where now $Bv$ is an $m \times 1$ vector. Thus, $s$ is in the column space of $A$. We know that $\mathsf{rank}(A)$ is equal to the dimension of its column space, and since we showed that $S$ is a subspace of the column space of $A$, its dimension is at most the rank of $A$. The dimension of $S$ is equal to the rank of $AB$, hence

\[\begin{equation*} \mathsf{rank}(AB) \leq \mathsf{rank}(A). \end{equation*}\]

With an analogous argument regarding the row space of $B$, we will get that

\[\begin{equation*} \mathsf{rank}(AB) \leq \mathsf{rank}(B). \end{equation*}\]

Therefore, we must have

\[\begin{equation*} \mathsf{rank}(AB) \leq \min\{\mathsf{rank}(A), \mathsf{rank}(B)\}. \tag*{$\square$} \end{equation*}\]
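As a quick sanity check of the inequality, here is a small sketch that computes ranks exactly with Gaussian elimination over fractions. The functions `rank` and `matmul` and the matrices `A` and `B` are hypothetical examples of mine, not something from the book.

```python
from fractions import Fraction


def rank(matrix):
    """Rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# A is 3 x 2 with rank 2, B is 2 x 3 with rank 1 (second row is twice the first).
A = [[1, 0], [0, 1], [1, 1]]
B = [[1, 2, 3], [2, 4, 6]]
AB = matmul(A, B)
print(rank(A), rank(B), rank(AB))  # 2 1 1
assert rank(AB) <= min(rank(A), rank(B))
```

Here the bound is attained through $B$: the product cannot have rank larger than one because every row of $AB$ is a combination of the rows of $B$.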

Reference

Introduction to Linear Algebra
Google Gemini

My path to publish a theory paper

2025-12-21T00:00:00-08:00

2025-12-21

I wanna keep this a running post on how I ultimately write a theoretical paper. Writing such a paper is my sticking point; I feel it’s hard for me and seeing others being able to do it makes me insecure. I sit with them in the same room and I see I don’t have anything intelligence wise less than them, yet I’m hopeless to even understand their papers. I also feel like life doesn’t help me in this regard either; I’m surrounded with people that don’t have enough background and motive to understand their intuitions, at the same time insist on what they feel and persist on their gut feeling that is ungrounded and obviously to me wrong.

As an example of one of my insecurities, I can point to Bo Dai’s or Amin Rakhsha’s publications; elite and elegant. I see them as humans and I see that I’m a human too but unable to do what they do, in spite of the fact that I want to.

I dunno. There is a voice in my head that says don’t give up; maybe the journey has to be exactly like this; maybe it doesn’t matter that you’ll fail eventually but you tried; maybe if you don’t give up you’ll get there eventually. I dunno, I’ll put what I got once in a while here till that paper is published or life made me abandon my yearning. I just know that I won’t be regretful of not trying.

Sorry for venting! It was all I could do today. I really tried to brainstorm, be precise and formal but I failed today. There went nothing. Maybe venting could be a good coping mechanism. Let the time reveal.

2025-12-22

Today, the voice in my head says screw you to giving up (sorry for my ungrace). It says knock and the door shall be opened. It’s so interesting; right after putting down the previous sentence and while I felt lost and hopeless, I did a mental vomit, connected the related topic and realized what PAML paper had done, so I could follow their procedure. It gave me a direction and I made progress! The loss function is well in place!

Some cool questions

2025-10-31T00:00:00-07:00

$\begin{align*} \newcommand{\I}{\mathbb{1}} \newcommand{\R}{\mathbb{R}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\N}{\mathbb{N}} \DeclareMathOperator{\EE}{\mathbb{E}} \DeclareMathOperator{\PP}{\mathbb{P}} \newcommand{\Ev}[1]{\EE\left[ #1 \right]} \newcommand{\Pr}[1]{\PP\left( #1 \right)} \end{align*}$

An elementary theorem in number theory states that if two integers $m$ and $n$ are relatively prime (i.e., greatest common divisor equal to 1 ), then there exist integers $x$ and $y$ (positive or negative) such that $mx + ny = 1$. Using this theorem show the following:
A. If $m$ and $n$ are relatively prime then the set {$xm + ny : x, y$ positive integers} contains all but a finite number of the positive integers.
B. Let $J$ be a set of nonnegative integers whose greatest common divisor is $d$. Suppose also that $J$ is closed under addition, $m, n \in J \Rightarrow m + n \in J$. Then $J$ contains all but a finite number of integers in the set {$0, d, 2d,\dots$}.

Source: Introduction to Stochastic Processes

answer.

Part (A).

First, note that for the set to contain only positive integers, $n$ and $m$ must be nonnegative: if one of them were negative and the other positive, the set would contain every integer. Also, to rule out the trivial case, let us assume $n, m \geq 1$.

Since $\mathsf{gcd} (n, m) = 1$, the multiples $n, 2n, \dots, mn$ are pairwise incongruent modulo $m$: if $yn \equiv y'n \pmod m$, then $m$ divides $(y - y')n$ and hence divides $y - y'$, which forces $y = y'$ when $1 \leq y, y' \leq m$. In other words, the set {$n, 2n, \dots, mn$} forms a complete residue system modulo $m$. Let $N$ be any positive integer. Then $N \equiv ny \pmod m$ for some $1 \leq y \leq m$. Equivalently, we can say $N - ny = mx$ for some integer $x$.

We know that since $1 \leq y \leq m$, $y$ is positive. On the other hand, if $N \geq mn + 1$, then $x = \frac{N - ny}{m} > 0$, hence $x$ is also positive. Therefore, we just proved that for positive integers $x$ and $y$, all positive integers bigger than $mn$ are in the set described in the question.
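The bound above is easy to test by brute force. The following is a small sketch, with function names (`representable`, `missing`) and the pair $m = 5$, $n = 7$ picked by me as an example; it lists the positive integers that cannot be written as $xm + ny$ with $x, y \geq 1$.

```python
from math import gcd


def representable(N, m, n):
    """True if N = x*m + y*n for some integers x, y >= 1.

    It is enough to try y = 1, ..., m: a larger y can be lowered by m
    while x is raised by n without changing the value of x*m + y*n.
    """
    return any((N - y * n) > 0 and (N - y * n) % m == 0 for y in range(1, m + 1))


def missing(m, n, limit):
    """Positive integers up to `limit` that are not representable."""
    return [N for N in range(1, limit + 1) if not representable(N, m, n)]


m, n = 5, 7  # hypothetical coprime pair
assert gcd(m, n) == 1
gaps = missing(m, n, 10 * m * n)
print(max(gaps))  # 35, which equals m * n
assert all(N <= m * n for N in gaps)
```

For this pair the largest missing integer is exactly $mn$, so the threshold found in the proof cannot be lowered here.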

Part (B).

Consider the set $J' =$ {$\frac{j}{d} \mid j \in J$}. If we show that $J'$ contains every integer bigger than some threshold $K$, then we can conclude that $J$ contains every multiple of $d$ bigger than $Kd$, i.e., all but a finite number of the integers in {$0, d, 2d, \dots$}.

By construction, the greatest common divisor of the elements of $J'$ is one, and $J'$ is closed under addition. Now, consider an arbitrary nonzero element of $J'$ and let us denote it by $a$. The set of remainders of the elements of $J'$ when divided by $a$ is $R = $ {$j' \bmod a \mid j' \in J'$}. Since $J'$ is closed under addition, $R$ is closed under addition modulo $a$, and it contains $0$ because $a \in J'$. A finite set with these two properties is a subgroup of the integers modulo $a$, so it consists of the multiples of some divisor $g$ of $a$. Then $g$ divides every element of $J'$, and since the greatest common divisor of $J'$ is one, $g = 1$. Therefore, $R$ forms a complete residue system modulo $a$, i.e., $R = $ {$0, 1, \dots, a - 1$}, where for each $r \in R$ there exists at least one element of $J'$, denoted by $j'_r$, such that $j'_r \equiv r \pmod{a}$. Let us denote the largest of these representatives by $K$:

\[K = \max \left\{j'_0, j'_1, \dots j'_{a - 1} \right\}.\]

Now we show that any integer bigger than $K$ is in $J'$. Let $k$ be an integer bigger than $K$ and let its remainder when divided by $a$ be $r$, i.e., $k \equiv r \pmod{a}$. We know that there is an element in $J'$ with the same remainder, i.e., $j'_r$, therefore:

\[k = j'_r + c\cdot a,\]

for some integer $c$. Since $k > K$ and $j'_r \leq K$, $k - j'_r > 0$ and $c \cdot a$ is a positive multiple of $a$.
Since $j’_r, a \in J’$, and $J’$ is closed under addition, both $j’_r$ and $c \cdot a = \underbrace{a + a + \dots + a}_{c\, \mathrm{times}}$ are in $J’$.
Therefore, their sum, $k$, must also be in $J’$. This proves that $J’$ contains all integers greater than $K$.
Consequently, $J$ contains {$(K + 1)d, (K + 2)d, \dots$}, which completes the proof.

What is an ergodic Markov chain?

answer.

It’s an aperiodic, irreducible, and recurrent Markov chain:

Aperiodic:
The period of the chain is one.

Irreducible:
If there is only one communication class, i.e., if for all $i, j$ there exists an $n = n(i,j)$ with $P_n(i,j) > 0$, then the chain is called irreducible.

Recurrent:
If the chain starts in a transient class (set of states), then with probability one it eventually leaves this class and never returns. Classes with this property are called transient classes and the states are called transient states. Other classes are called recurrent classes with recurrent states. A Markov chain starting in a recurrent class never leaves that class.

Why is the left eigenvector of a transition matrix with the eigenvalue of one equal to the stationary distribution of the Markov chain modelled by the transition matrix?

answer.

Suppose $\bar{\pi}$ is the stationary distribution and $P$ is the transition matrix. Since all rows of $P^\infty$ are the same (each equal to $\bar{\pi}$), for any initial probability vector $\bar{v}$ we have

\[\bar{\pi} = \lim_{n \to \infty} \bar{v}P^n.\]

Hence, we have,

$\bar{\pi} = \lim_{n \to \infty} \bar{v}P^{n + 1} = \left(\lim_{n \to \infty} \bar{v}P^{n}\right)P = \bar{\pi}P.$
The above display says that $\bar{\pi}$ is a left eigenvector of $P$ with eigenvalue one.

Source: Introduction to Stochastic Processes

Prove that every state in an irreducible Markov chain has the same period.

The period of state $i$ is defined as the greatest common divisor of the set $J_i :=$ {$n \geq 1: P_n(i, i) >0$}. Note that $J_i$ is closed under addition because, for $n, m \in J_i$, $P_{n +m} (i, i) = \sum_k P_n(i, k) P_m(k, i) \geq P_n(i, i) P_m(i, i) > 0$. Let $d$ be the greatest common divisor of $J_i$. We have shown before that $J_i$ contains all but a finite number of the multiples of $d$. Hence, $J_i$ contains $md$ for all $m$ greater than some integer $M$.

Let $j$ be another state and $n,m$ such that $P_n(i, j), P_m(j, i) > 0$ (chain is irreducible). Clearly $n + m \in J_i$ and $m + n \in J_j$. If $l \in J_j$, then

\[P_{n+m+l}(i, i) \geq P_{n}(i, j)P_l(j, j)P_m(j, i) > 0.\]

Therefore, $n+m+l \in J_i$. Since $d$ divides both $n + m$ and $n + m + l$, it must divide $l$ as well.
So, we have just shown that if $d$ divides every element of $J_i$
then it divides every element of $J_j$, which means the period of $i$ divides the period of $j$. Swapping the roles of $i$ and $j$ gives the reverse divisibility, so the two periods are equal. From this we see that all states have the same period.

Source: Introduction to Stochastic Processes

Consider the Markov chain with state space {1, 2, 3, 4, 5} and matrix

\[P = \begin{pmatrix} 0 & 1/3 & 2/3 & 0 & 0 \\ 0 & 0 & 0 & 1/4 & 3/4 \\ 0 & 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}\]

What are $P_{1000}(2, 1), P_{1000}(2, 2), P_{1000}(2, 4)$?

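The chain is periodic with period three: from state 1 it moves to {2, 3}, then to {4, 5}, and then back to state 1. Starting from state 2, the chain is therefore in {4, 5} at times $n \equiv 1 \pmod 3$, in state 1 at times $n \equiv 2 \pmod 3$, and in {2, 3} at times $n \equiv 0 \pmod 3$. Since $1000 \equiv 1 \pmod 3$, at time $1000$ the chain can be in neither state 1 nor state 2. Hence, $P_{1000}(2, 1) = P_{1000}(2, 2) = 0$.

On the other hand, to compute $P_{1000}(2, 4)$, note that $1000$ and $4$ both leave remainder one when divided by $3$, so using Python we raise the matrix to the power of $4$ instead. In fact $P_{3k + 1} = P_4$ for every $k \geq 1$, because after a full cycle the chain has passed through state 1 and no longer depends on where it started within its cyclic class. Hence $P_{1000}(2, 4) = P_4(2, 4) = \frac{5}{12} \approx 0.4167$.

\[\begin{equation*} P_4 = \begin{pmatrix} 0 & 0.3333 & 0.6667 & 0 & 0\\ 0 & 0 & 0 & 0.4167 & 0.5833 \\ 0 & 0 & 0 & 0.4167 & 0.5833\\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \end{equation*}\]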
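For what it is worth, the power can also be computed exactly rather than with floating point. Below is a minimal sketch using Python's `fractions` module and repeated squaring; the helper names `matmul` and `matpow` are mine, and this is not necessarily the script I originally used.

```python
from fractions import Fraction as F

P = [
    [0, F(1, 3), F(2, 3), 0, 0],
    [0, 0, 0, F(1, 4), F(3, 4)],
    [0, 0, 0, F(1, 2), F(1, 2)],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matpow(M, k):
    """M ** k by repeated squaring, starting from the identity matrix."""
    n = len(M)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    while k:
        if k & 1:
            result = matmul(result, M)
        M = matmul(M, M)
        k >>= 1
    return result


P1000 = matpow(P, 1000)
# States are 1-based in the text, so state 2 is row index 1.
print(P1000[1][0], P1000[1][1], P1000[1][3])  # 0 0 5/12
assert P1000 == matpow(P, 4)
```

Exact fractions make the periodicity visible: the thousandth power is literally the same matrix as the fourth power, not just close to it.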

Source: Introduction to Stochastic Processes

Let $X_1, X_2,\dots$ be the successive values from independent rolls of a standard six-sided die. Let $S_n= X_1 + \dots + X_n$. Let

\[T_1 = \min \{n \geq 1: S_n \, \text{is divisible by 8} \},\] \[T_2 = \min\{n \geq 1: S_{n - 1}\, \text{is divisible by 8}\}.\]

Find $\mathbb{E}[T_1]$ and $\mathbb{E}[T_2]$. (Hint: consider the remainder of $S_n$ after division by 8 as a Markov chain.)

answer.

The state space of the chain is {$0, 1, 2, 3, 4, 5, 6, 7$}. Let $g(i)$ denote the expected number of rolls until the chain reaches state zero when starting at state $i$. We have that $g(0) = 0$. Starting from $S_0 = 0$, we definitely won’t return to state $0$ with just one roll because the outcome is between $1$ and $6$. Hence,

\[\begin{equation*} \mathbb{E}[T_1] = 1 + \frac{1}{6}\sum_{i = 1}^6 g(i) = 1 + \frac{1}{6}\left(g(1) + g(2) + g(3) + g(4) + g(5) + g(6)\right). \end{equation*}\]

On the other hand,

\[\begin{equation*} \sum_{i=1}^7g(i) = \sum_{i=1}^7 \left(1 + \frac{1}{6}\sum_{j=1}^6 g\left(i + j \bmod{8}\right) \right). \end{equation*}\]

That is, the expected number of rolls starting from state $i \neq 0$ is equal to rolling the die and the expected number of rolls of the resulting state. We rearrange the summations to get:

\[\begin{equation*} \sum_{i=1}^7g(i) = 7 + \frac{1}{6}\sum_{j=1}^6\sum_{i=1}^7 g(i + j \bmod{8}). \end{equation*}\]

Now for each fixed $j$, as $i$ ranges over $1, \dots, 7$, $(i + j \bmod{8})$ ranges over all of {$0, 1, 2, 3, 4, 5, 6, 7$} except the number $j$, and $g(0) = 0$.
Hence, by denoting $S := \sum_{i = 1}^7 g(i)$, we have:

\[\begin{align*} S = 7 + \frac{1}{6}\sum_{j=1}^6(S - g(j)) \Rightarrow \sum_{j = 1}^6 g(j) = 42. \end{align*}\]

Therefore,

\[\begin{equation*} \mathbb{E}[T_1] = 1 + \frac{1}{6}\sum_{i = 1}^6 g(i) = 1 + \frac{1}{6}\left(g(1) + g(2) + g(3) + g(4) + g(5) + g(6)\right) = 8. \end{equation*}\]

For computing $\mathbb{E}[T_2]$, we only need to roll the die once, because the sum of the random variables before the first roll is zero, so $S_{1 - 1} = S_0 = 0$, and $0$ is divisible by 8. Hence $T_2 = 1$ and $\mathbb{E}[T_2] = 1$.
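The value $\mathbb{E}[T_1] = 8$ can be double-checked by solving the first-step equations for $g(1), \dots, g(7)$ directly. This is a sketch under my own naming (`solve` is a small exact Gauss-Jordan routine written for this post, not a library call).

```python
from fractions import Fraction as F


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A must be invertible)."""
    n = len(A)
    M = [[F(x) for x in row] + [F(y)] for row, y in zip(A, b)]
    for c in range(n):
        pivot = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for i in range(n):
            if i != c and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [x - factor * y for x, y in zip(M[i], M[c])]
    return [row[n] for row in M]


# Unknowns g(1), ..., g(7); g(i) = 1 + (1/6) * sum_j g((i + j) mod 8), g(0) = 0.
states = range(1, 8)
A = [[F(0)] * 7 for _ in states]
b = [F(1)] * 7
for i in states:
    A[i - 1][i - 1] += 1
    for roll in range(1, 7):
        nxt = (i + roll) % 8
        if nxt != 0:  # g(0) = 0 contributes nothing
            A[i - 1][nxt - 1] -= F(1, 6)

g = solve(A, b)
expected_T1 = 1 + sum(g[:6]) / 6  # the first roll lands on one of the states 1..6
print(expected_T1)  # 8
```

This agrees with the shortcut in the text, which only needs the sum $g(1) + \dots + g(6) = 42$ and never the individual values.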

Source: Introduction to Stochastic Processes

Let $X_n, Y_n$ be independent Markov chains with state space {$0, 1, 2$} and transition matrix

\[P = \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/2 \\ 0 & 1/2 & 1/2 \end{pmatrix}.\]

Suppose $X_0 = 0, Y_0 = 2$ and let

\[T = \inf\{n: X_n = Y_n\}.\]

(A) Find $\mathbb{E}[T]$.

(B) What is $\mathbb{P}\{X_T = 2\}$?

(C) In the long run, what percentage of the time are both chains in the same state?
[Hint: consider the nine-state Markov chain $Z_n = (X_n, Y_n)$.]

answer.

Part(A).

We need to set up a recursive formula connecting the expected number of steps until two chains meet.

Let $g(i, j)$ be the expected number of steps until the meeting time, when {$X_n$} has started at state $i$ and {$Y_n$} has started in state $j$. We have that $g(k, k) = 0$ for $k \in$ {$0, 1, 2$}. For all other states, since {$X_n$} and {$Y_n$} are independent and both move at every step, we have that:

\[\begin{equation*} g(i, j) = 1 + \sum_{k=0}^2\sum_{l=0}^2 P(i, k)P(j, l)g(k, l). \end{equation*}\]

This gives a set of 6 equations with 6 unknowns
($g(0, 1), g(0, 2), g(1, 0), g(1, 2), g(2, 0),\, \mathrm{and}\, g(2, 1)$) as follows (terms with $g(k, k) = 0$ or zero probability are dropped):

\[\begin{align*} g(0, 1) & = 1 + \frac{1}{8}g(0, 1) + \frac{1}{4}g(0, 2) + \frac{1}{16}g(1, 0) + \frac{1}{8}g(1, 2) + \frac{1}{16}g(2, 0) + \frac{1}{16}g(2, 1) \\ g(0, 2) & = 1 + \frac{1}{4}g(0, 1) + \frac{1}{4}g(0, 2) + \frac{1}{8}g(1, 2) + \frac{1}{8}g(2, 1) \\ g(1, 0) & = 1 + \frac{1}{16}g(0, 1) + \frac{1}{16}g(0, 2) + \frac{1}{8}g(1, 0) + \frac{1}{16}g(1, 2) + \frac{1}{4}g(2, 0) + \frac{1}{8}g(2, 1) \\ g(1, 2) & = 1 + \frac{1}{8}g(0, 1) + \frac{1}{8}g(0, 2) + \frac{1}{8}g(1, 2) + \frac{1}{4}g(2, 1) \\ g(2, 0) & = 1 + \frac{1}{4}g(1, 0) + \frac{1}{8}g(1, 2) + \frac{1}{4}g(2, 0) + \frac{1}{8}g(2, 1) \\ g(2, 1) & = 1 + \frac{1}{8}g(1, 0) + \frac{1}{4}g(1, 2) + \frac{1}{8}g(2, 0) + \frac{1}{8}g(2, 1). \end{align*}\]

Solving this system of equations gives us: $g(0, 2) = \frac{118}{35} \approx 3.37$.

Part (B).

We need to set up a recursive formula connecting how the chains can meet at state 2 while starting at state 0 and 2 respectively.

Let $h(i, j)$ be the probability of first meeting at state 2, when {$X_n$} has started at state $i$ and {$Y_n$} has started in state $j$. We have that $h(1, 1) = h(0 , 0) = 0$ (first meeting should be at state 2), and $h(2, 2) = 1$. For all other states, since {$X_n$} and {$Y_n$} are independent and both move at every step, we have that:

\[\begin{equation*} h(i, j) = \sum_{k=0}^2\sum_{l=0}^2 P(i, k)P(j, l)h(k, l). \end{equation*}\]

This gives a set of 6 equations with 6 unknowns ($h(0, 1), h(0, 2), h(1, 0), h(1, 2), h(2, 0),\, \mathrm{and}\, h(2, 1)$) as follows (the constant in each line is the weight $P(i, 2)P(j, 2)$ of $h(2, 2) = 1$):

\[\begin{align*} h(0, 1) & = \frac{1}{8}h(0, 1) + \frac{1}{4}h(0, 2) + \frac{1}{16}h(1, 0) + \frac{1}{8}h(1, 2) + \frac{1}{16}h(2, 0) + \frac{1}{16}h(2, 1) + \frac{1}{8} \\ h(0, 2) & = \frac{1}{4}h(0, 1) + \frac{1}{4}h(0, 2) + \frac{1}{8}h(1, 2) + \frac{1}{8}h(2, 1) + \frac{1}{8} \\ h(1, 0) & = \frac{1}{16}h(0, 1) + \frac{1}{16}h(0, 2) + \frac{1}{8}h(1, 0) + \frac{1}{16}h(1, 2) + \frac{1}{4}h(2, 0) + \frac{1}{8}h(2, 1) + \frac{1}{8} \\ h(1, 2) & = \frac{1}{8}h(0, 1) + \frac{1}{8}h(0, 2) + \frac{1}{8}h(1, 2) + \frac{1}{4}h(2, 1) + \frac{1}{4} \\ h(2, 0) & = \frac{1}{4}h(1, 0) + \frac{1}{8}h(1, 2) + \frac{1}{4}h(2, 0) + \frac{1}{8}h(2, 1) + \frac{1}{8} \\ h(2, 1) & = \frac{1}{8}h(1, 0) + \frac{1}{4}h(1, 2) + \frac{1}{8}h(2, 0) + \frac{1}{8}h(2, 1) + \frac{1}{4}. \end{align*}\]

Solving this system of equations gives us: $h(0, 2) = \frac{15}{28} \approx 0.536$.
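Parts (A) and (B) share the same coefficient matrix and differ only in the right-hand side, so both can be checked with one small script. The sketch below builds the first-step equations for the pair chain $(X_n, Y_n)$ and solves them exactly; `solve`, `pairs`, and `index` are names I made up for this illustration.

```python
from fractions import Fraction as F
from itertools import product

P = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 4), F(1, 4), F(1, 2)],
    [F(0), F(1, 2), F(1, 2)],
]


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A must be invertible)."""
    n = len(A)
    M = [[F(x) for x in row] + [F(y)] for row, y in zip(A, b)]
    for c in range(n):
        pivot = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for i in range(n):
            if i != c and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [x - factor * y for x, y in zip(M[i], M[c])]
    return [row[n] for row in M]


# The six off-diagonal pairs (i, j) are the unknowns.
pairs = [(i, j) for i, j in product(range(3), repeat=2) if i != j]
index = {pair: n for n, pair in enumerate(pairs)}

A = [[F(0)] * len(pairs) for _ in pairs]
b_time = [F(1)] * len(pairs)  # right-hand side for g (one step is always taken)
b_hit2 = [F(0)] * len(pairs)  # right-hand side for h
for (i, j), row in index.items():
    A[row][row] += 1
    for k, l in product(range(3), repeat=2):
        weight = P[i][k] * P[j][l]  # both chains move independently
        if k != l:
            A[row][index[(k, l)]] -= weight
        elif k == 2:
            b_hit2[row] += weight  # h(2, 2) = 1, while h(0, 0) = h(1, 1) = 0

g = solve(A, b_time)
h = solve(A, b_hit2)
print(g[index[(0, 2)]], h[index[(0, 2)]])  # 118/35 15/28
```

Since both chains use the same transition matrix, the solution is symmetric, $g(i, j) = g(j, i)$, which would let one cut the system down to three unknowns by hand.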

Part (C).

Since the chains are independent, the long-run fraction of time that both are in a given state is the product of their invariant probabilities for that state, and we sum this over the states {$0, 1, 2$}. Let us find the invariant distribution by solving $\bar{\pi} = \bar{\pi}P$:

\[\begin{equation*} \begin{pmatrix} \bar{\pi}_0 \\ \bar{\pi}_1 \\ \bar{\pi}_2 \end{pmatrix}^\top = \begin{pmatrix} \bar{\pi}_0 \\ \bar{\pi}_1 \\ \bar{\pi}_2 \end{pmatrix}^\top \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/2 \\ 0 & 1/2 & 1/2 \end{pmatrix} \quad \mathrm{and} \quad \bar{\pi}_0 + \bar{\pi}_1 + \bar{\pi}_2 = 1. \end{equation*}\]

The solution of this equation is: $\bar{\pi} = \frac{1}{11}(2, 4, 5)$. Hence, the probability that both chains spend time at the same states in a long run is equal to: $\frac{1}{11^2}(4 + 16 + 25) = \frac{45}{121}$.
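The claimed invariant distribution is easy to verify by plugging it back into $\bar{\pi} = \bar{\pi}P$. A minimal check with exact fractions (variable names are mine):

```python
from fractions import Fraction as F

P = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 4), F(1, 4), F(1, 2)],
    [F(0), F(1, 2), F(1, 2)],
]

pi = [F(2, 11), F(4, 11), F(5, 11)]

# Row vector times matrix: (pi P)_j = sum_i pi_i * P[i][j].
pi_next = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
assert pi_next == pi and sum(pi) == 1

# Long-run fraction of time the two independent chains share a state.
same_state = sum(p * p for p in pi)
print(same_state)  # 45/121
```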

Source: Introduction to Stochastic Processes

Let $X_n$ be a Markov chain on state space {1, 2, 3, 4, 5} with transition matrix

\[P = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0\\ 0 & 0 & 0 & 1/5 & 4/5 \\ 0 & 0 & 0 & 2/5 & 3/5 \\ 1 & 0 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 1/2 \end{pmatrix}\]

(A) Suppose $X_0 = 1$. What is the expected number of steps until the chain is in state 4?

(B) Suppose $X_0 = 1$. What is the probability that the chain will enter state 5 before it enters state 3?

answer.

Part (A).

We need to set up a recursive formula connecting the expected number of steps until reaching state 4 for each distinct state. Let $g(i)$ be the expected number of steps until reaching state $4$, when the chain has started at state $i$. We have that $g(4) = 0$. For all other states we have that:

\[\begin{equation*} g(i) = 1 + \sum_{j=1, j \neq 4}^5P(i, j)g(j). \end{equation*}\]

This gives a set of 4 equations with 4 unknowns ($g(1), g(2), g(3),\, \mathrm{and}\, g(5)$) as follows:

\[\begin{align*} g(5) & = 1 + \frac{1}{2}g(1) + \frac{1}{2}g(5) \\ g(3) & = 1 + \frac{2}{5}g(4) + \frac{3}{5}g(5) \\ g(2) & = 1 + \frac{1}{5}g(4) + \frac{4}{5}g(5) \\ g(1) & = 1 + \frac{1}{2}g(2) + \frac{1}{2}g(3) \end{align*}\]

Solving this system of equations gives us: $g(1) = \frac{34}{3}$.

Part (B).

We need to set up a recursive formula connecting how the chain can hit state 5 before state 3, while starting at state 1.

Let $h(i)$ be the probability that the chain hits state $5$ before state 3, when the chain has started at state $i$. We have that $h(5) =1$ and $h(3) = 0$. For all other states we have that:

\[\begin{equation*} h(i) = \sum_{j=1}^5P(i, j)h(j). \end{equation*}\]

This gives a set of 3 equations with 3 unknowns ($h(1), h(2),\, \mathrm{and}\, h(4)$) as follows:

\[\begin{align*} h(4) & = h(1) \\ h(2) & = \frac{1}{5}h(4) + \frac{4}{5}h(5) = \frac{1}{5}h(4) + \frac{4}{5}\\ h(1) & = \frac{1}{2}h(2) + \frac{1}{2}h(3) = \frac{1}{2}h(2) \end{align*}\]

Solving this system of equations gives us: $h(1) = \frac{4}{9}$.
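Both parts are instances of the same first-step analysis: fix the values on some boundary states and solve a linear system for the rest. Here is a sketch of that idea; the function `first_step` and its signature are hypothetical, written only to check the two answers above with exact arithmetic.

```python
from fractions import Fraction as F

P = [
    [0, F(1, 2), F(1, 2), 0, 0],
    [0, 0, 0, F(1, 5), F(4, 5)],
    [0, 0, 0, F(2, 5), F(3, 5)],
    [1, 0, 0, 0, 0],
    [F(1, 2), 0, 0, 0, F(1, 2)],
]


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A must be invertible)."""
    n = len(A)
    M = [[F(x) for x in row] + [F(y)] for row, y in zip(A, b)]
    for c in range(n):
        pivot = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for i in range(n):
            if i != c and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [x - factor * y for x, y in zip(M[i], M[c])]
    return [row[n] for row in M]


def first_step(P, boundary, cost):
    """Solve u(i) = cost + sum_j P[i][j] u(j) off the boundary.

    `boundary` maps 0-based state indices to their fixed values of u.
    """
    free = [i for i in range(len(P)) if i not in boundary]
    A = [[(1 if i == j else 0) - P[i][j] for j in free] for i in free]
    b = [cost + sum(P[i][j] * v for j, v in boundary.items()) for i in free]
    return dict(zip(free, solve(A, b)))


# Part (A): expected steps to reach state 4 (index 3), one unit of cost per step.
g = first_step(P, {3: F(0)}, cost=1)
# Part (B): probability of hitting state 5 (index 4) before state 3 (index 2).
h = first_step(P, {4: F(1), 2: F(0)}, cost=0)
print(g[0], h[0])  # 34/3 4/9
```

The only difference between the two parts is the per-step cost (one for expected time, zero for a hitting probability) and the boundary values.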

Source: Introduction to Stochastic Processes

Theorem. Suppose the state space $S$ and the action space $A$ are finite. Then, there exists a deterministic stationary Markov Blackwell optimal policy.

Proof. Let $\Pi^\mathrm{MD}$ be the set of all deterministic Markov policies for an MDP with finite state and action spaces. Since $\Pi^\mathrm{MD}$ is finite, there exists a sequence of discount factors {$\gamma_n$} converging to one for which there exists a $\pi^* \in \Pi^\mathrm{MD}$ with $\left(\pi^*\right)^\infty = (\pi^*, \pi^*, \dots)$ (the stationary policy that uses $\pi^*$ at every step) being discount optimal for all $\gamma_n$.

The reason the aforementioned fact is true is that $\Pi^\mathrm{MD}$ is finite while the set of discount factors $0 \leq \gamma <1$ is infinite, so by the pigeonhole principle some policy is discount optimal for infinitely many terms of any sequence of discount factors converging to one. Therefore, we can pick a subsequence of discount factors that increases toward one and whose terms all share the same optimal policy.

With having the above fact in mind, for each $\pi \in \Pi^\mathrm{MD}$,

\[v_\gamma^{\left(\pi^*\right)^\infty}(s) - v_\gamma^{\pi^\infty}(s) \geq 0,\]

for all states, $\gamma = \gamma_n$ and $n = 1, 2, \dots$ . Each function on the L.H.S. is a rational function of $\gamma$, and so is their difference. Hence, the difference is zero for all $\gamma$, or equals zero for at most finitely many $\gamma$s. In either case it cannot change sign infinitely often as $\gamma \to 1$, and it is nonnegative along $\gamma_n \to 1$. Therefore, there exists a $\gamma_\pi < 1$ for which the above display holds for $\gamma_\pi \leq \gamma < 1$. Since $\Pi^\mathrm{MD}$ is finite the above display holds for all $\gamma^* \leq \gamma < 1$ where $\gamma^* = \max_\pi \gamma_{\pi}$.

Now that we have fixed $\gamma$, by virtue of the existence of a deterministic stationary Markov policy in the discounted setting, the result follows. $\square$

I’ve been concise and incomplete for this theorem. I’ll add supplementary details one day that I stumbled on this post. :D

Source: Markov Decision Processes: Discrete Stochastic Dynamic Programming

Let $\Omega$ be a measurable set, and let $f: \Omega \to [0, \infty]$ be a non-negative measurable function. Prove that we have $0 \leq \int_\Omega f \leq \infty$. Furthermore, we have $\int_\Omega f = 0$ if and only if $f(x) = 0$ for almost every $x \in \Omega$.

Proof. Since $f$ is a non-negative measurable function, we have that

\[\int_\Omega f = \sup \{\int_\Omega s: s \text{ is non-negative, simple and dominated by } f \}.\]

Step 1: $0 \leq \int_\Omega f \leq \infty$.

Consider a fixed $s$. Since $s$ is simple, $\int_\Omega s = \sum_{j = 1}^N c_j \cdot m(E_j)$,
where $N \in \mathbb{N}$, $c_j > 0$, $m$ is the Lebesgue measure (taking values in the extended real line $\mathbb{R}^*$), $E_1, \dots, E_N \subseteq \Omega$ are measurable,
and $E_i \cap E_j = \varnothing$ for all $i, j \in [N], \mathrm{and}\, i \neq j$. The sum of non-negative terms on the
extended real line $\mathbb{R}^*$ is in $[0, \infty]$, hence so is their supremum.

Step 2: If $f = 0$ a.e., then $\int_\Omega f = 0$.

Since $f$ dominates $s$, then $0 \leq s(x) \leq f(x)$, for all $x \in \Omega$. If $f(x) = 0$ for almost every $x \in \Omega$, then $s = 0$ a.e., hence $\int_\Omega s = 0$, and consequently $\sup \int_\Omega s = 0$.

Step 3: If $\int_\Omega f = 0$, then $f = 0$ a.e.

Consider the set

\[E_j := \{x \in \Omega: f(x) > \frac 1j \}, \quad j \geq 1.\]

The function $s_j := \frac 1j \mathbb{1}_{E_j}$ is simple and also dominated by $f$ because $0 \leq s_j \leq f$. By the definition of the integral as a supremum, we have $0 \leq \frac 1j m(E_j) = \int_\Omega s_j \leq \int_\Omega f = 0$. This means $m(E_j) = 0$ for all $j$. The part of the domain where $f$ is non-zero is

\[\Omega' := \{x : f(x) > 0 \} = \bigcup_{j = 1}^\infty E_j.\]

By the subadditivity property of the Lebesgue measure we have

\[m\left(\Omega' \right) \leq \sum_{j = 1}^\infty m(E_j) = 0 \Rightarrow f = 0\: \mathrm{a.e.}\]

Source: Analysis II

Consider two independent Poisson processes $X(t)$ and $Y(t)$ with parameters $\lambda$ and $\mu$. Let $k \geq 1$ and define

\[\begin{equation*} \tau = \inf \{t > 0: X(t) =k \}, \quad \tau' = \inf \{t > 0: X(t) =k + 2\}. \end{equation*}\]

Find the distribution of $N = Y(\tau’) - Y(\tau)$.

Solution. First we need to understand what the question is actually asking us. $\tau' - \tau$ is the time that it takes for {$X(t)$} to go from $k$ to $k+2$. We can write

\[\begin{equation*} U := T_{k + 2} - T_{k} = \underbrace{T_{k + 2} - T_{k + 1}}_{U_2} + \underbrace{T_{k + 1} - T_{k}}_{U_1}, \end{equation*}\]

where $T_i = \inf \{t > 0: X(t) = i \}$, so that $\tau = T_k$ and $\tau' = T_{k + 2}$. Each of $U_1$ and $U_2$ is the waiting time between two consecutive arrivals (“customers”) of {$X(t)$}. Hence, each follows the exponential distribution with parameter $\lambda$, and the two are independent.

The question is asking us how much {$Y(t)$} will increase in the interval between the arrival time of the $k$th and $k+2$ customer at {$X(t)$}. So, we can write that

\[\begin{equation*} \Pr{N = i \mid U = t} = \Pr{Y(U) = i \mid U = t} = \Pr{Y(t) = i} = e^{-\mu t} \cdot \frac{\left(\mu t\right)^i}{i!}. \end{equation*}\]

So,

\[\begin{equation*} \Pr{N = i} = \int_0^\infty \Pr{N = i \mid U = t} f_U(t) dt. \end{equation*}\]

So, we need the probability density function of $U$. We said that $U = U_1 + U_2$ and each $U_i$ comes from an independent exponential distribution. We have,

\[\begin{align*} f_U(t) &\stackrel{\text{(independence)}}{=} \int_0 ^ tf_{U_1}(x)\cdot f_{U_2}(t - x) dx \\ &= \int_0^t \lambda e^{-\lambda x} \cdot \lambda e^{-\lambda (t - x)} dx \\ & = \lambda^2 te^{-\lambda t}. \end{align*}\]

Therefore,

\[\begin{align*} \Pr{N = i} & = \int_0^\infty \Pr{N = i \mid U = t} f_U(t) dt \\ & = \int_0^\infty e^{-\mu t} \frac{\left(\mu t\right)^i}{i!} \cdot \lambda^2 te^{-\lambda t} dt \\ & = \lambda^2\frac{\mu^i}{i!}\int_0^\infty t^{i + 1} \cdot e^{-(\lambda + \mu) t} dt \\ & = \lambda^2\frac{\mu^i}{i!}\int_0^\infty t^{i + 2 - 1} \cdot e^{-(\lambda + \mu) t} dt \\ & = \lambda^2\frac{\mu^i}{i!}\int_0^\infty t^{b - 1} \cdot e^{-a t} dt, \\ \end{align*}\]

where $b = i + 2$ and $a = (\lambda + \mu)$. Let’s see how to compute

\[\begin{equation} \int_0^\infty t^{b - 1} \cdot e^{-a t} dt. \end{equation}\]

We know that for $z > 0$, the Gamma function is defined as

\[\begin{equation*} \Gamma(z) = (z - 1)! = \int_0^\infty t^{z - 1}e^{-t} dt. \end{equation*}\]

So with the change of variables $u := at$ we have,

\[\begin{equation} \int_0^\infty t^{b - 1} \cdot e^{-a t} dt = \int_0^\infty \left(\frac{u}{a}\right)^{b - 1} \cdot e^{-u} \frac{du}{a} = \left(\frac{1}{a^{b}}\right) \int_0^\infty u^{b - 1} \cdot e^{-u} du = \frac{\Gamma(b)}{a^{b}}. \end{equation}\]

Now replacing $b = i + 2$ and $a = (\lambda + \mu)$,

\[\begin{equation} \int_0^\infty t^{b - 1} \cdot e^{-a t} dt = \frac{\Gamma(i + 2)}{(\lambda + \mu)^{i + 2}}. \end{equation}\]

So,

\[\begin{align*} \Pr{N = i} & = \lambda^2\frac{\mu^i}{i!}\int_0^\infty t^{b - 1} \cdot e^{-a t} dt, \\ & = \lambda^2\frac{\mu^i}{i!} \frac{\Gamma(i + 2)}{(\lambda + \mu)^{i + 2}} \\ & = \lambda^2\frac{\mu^i}{i!} \frac{(i + 1)!}{(\lambda + \mu)^{i + 2}} \\ & = (i + 1)\frac{\lambda^2\mu^i}{(\lambda + \mu)^{i + 2}} \\ & = \binom{i + 2 - 1}{i}\left(\frac{\mu}{\lambda + \mu}\right)^i \left( \frac{\lambda}{\lambda + \mu}\right)^2. \end{align*}\]

This is the negative binomial distribution: $N$ counts the failures before the second success, where each success has probability $\frac{\lambda}{ \lambda + \mu}$.
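One way to sanity-check the final formula without redoing the integral is to compare its mean with an independent computation: conditioning on $U$ gives $\mathbb{E}[N] = \mu\,\mathbb{E}[U] = 2\mu/\lambda$. The sketch below does this for hypothetical rates $\lambda = 3$ and $\mu = 2$ (my choice), truncating the infinite sums at 200 terms.

```python
from fractions import Fraction as F
from math import comb

lam, mu = F(3), F(2)  # hypothetical rates of X(t) and Y(t)
p = lam / (lam + mu)  # "success" probability
q = 1 - p


def pmf(i):
    """P(N = i) = C(i + 1, i) * q**i * p**2, the formula derived above."""
    return comb(i + 1, i) * q**i * p**2


terms = 200  # the tail beyond this is negligible for q = 2/5
total = sum(pmf(i) for i in range(terms))
mean = sum(i * pmf(i) for i in range(terms))

# The probabilities should sum to one and the mean should equal 2 * mu / lam.
print(float(total), float(mean), float(2 * mu / lam))
```

The truncation level of 200 is arbitrary; with $q$ closer to one, more terms would be needed before the partial sums settle.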

Source: [A question on my midterm at STAT 580 that I couldn’t solve during the exam :)]

Let $X_t$ be a Yule process with parameter $\lambda> 0$ and $X_0 =1$. In other words, $X_t$ is a birth and death process with the birth rate $\lambda_n = n \lambda$ and the death rate $\mu_n = 0$. Show

\[\begin{equation*} P_n(t) = e^{-\lambda t}\left[1 - e^{-\lambda t}\right]^{n - 1}, \quad n\geq 1. \end{equation*}\]

Then, assume $X_t, Y_t$ are two independent Yule processes with parameter $\lambda$ and $X_0 = Y_0 = 1$. Determine the conditional distribution of $X_t$ given $X_t + Y_t = N,\; N \geq 2$.

Solution. In birth and death processes we have that

\[\begin{align*} \Pr{X_{t + \Delta t} = n \mid X_t = n} &= 1 - (\lambda_n + \mu_n) \Delta t + o(\Delta t) \\ \Pr{X_{t + \Delta t} = n+1 \mid X_t = n} &= \lambda_n \Delta t + o(\Delta t) \\ \Pr{X_{t + \Delta t} = n-1 \mid X_t = n} &= \mu_n \Delta t + o(\Delta t). \end{align*}\]

On the other hand, let $P_n(t) = \Pr{X_t =n}$. We have,

\[\begin{align*} P'_n(t) &= \lim_{\Delta t \to 0} \frac{P_n(t + \Delta t) - P_n(t)}{\Delta t} \\ &= \lim_{\Delta t \to 0} \frac{P_n(t)\left(1 - (\lambda_n + \mu_n)\Delta t\right) + \lambda_{n-1} P_{n - 1}(t) \Delta t + \mu_{n+1} P_{n + 1}(t) \Delta t - P_n(t)}{\Delta t} \\ & \stackrel{(\mu_n = 0)}{=} \lambda_{n - 1} P_{n - 1}(t) - \lambda_n P_n(t) \\ & \stackrel{(\lambda_n = n\lambda)}{=} (n - 1)\lambda P_{n - 1}(t) - n\lambda P_n(t) \\ & \stackrel{\text{(replacing the solution)}}{=} \lambda e^{-\lambda t}\left[ 1 - e^{-\lambda t}\right]^{n - 1}\left(\frac{n - 1}{1 - e^{-\lambda t}} - n\right) \\ & = \lambda e^{-\lambda t}\left[ 1 - e^{-\lambda t}\right]^{n - 1}\left(\frac{ ne^{-\lambda t} - 1}{1 - e^{-\lambda t}}\right). \end{align*}\]

Now we take the derivative of the proposed solution:

\[\begin{align*} -\lambda e^{-\lambda t}\left[1 - e^{-\lambda t}\right]^{n - 1} + (n - 1)\lambda\left[1 - e^{-\lambda t}\right]^{n - 2} e^{-2\lambda t} &= \lambda e^{-\lambda t}\left[ 1 - e^{-\lambda t}\right]^{n - 1} \left(-1 + \frac{e^{-\lambda t}(n - 1)}{1 - e^{-\lambda t}} \right) \\ & = \lambda e^{-\lambda t}\left[ 1 - e^{-\lambda t}\right]^{n - 1} \left(\frac{ne^{-\lambda t} - 1}{1 - e^{-\lambda t}} \right). \end{align*}\]

For the second part let $q = e^{-\lambda t}$. We have:

\[\begin{align*} \Pr{X_t = k \mid X_t + Y_t = N} &= \frac{\Pr{X_t = k, X_t + Y_t = N}}{\Pr{X_t + Y_t = N}} \\ &\stackrel{\text{(independence)}}{=} \frac{\Pr{X_t = k}\cdot \Pr{Y_t = N - k}}{\Pr{X_t + Y_t = N}} \\ &= \frac{\Pr{X_t = k}\cdot \Pr{Y_t = N - k}}{\sum_{i = 1}^{N - 1} \Pr{X_t = i} \cdot \Pr{Y_t = N - i}} \\ &\stackrel{\text{(part (a))}}{=} \frac{q(1 - q)^{k - 1} \cdot q(1 - q)^{N - k - 1}}{\sum_{i = 1}^{N - 1}q(1 - q)^{i - 1} \cdot q(1 - q)^{N - i - 1}} \\ &= \frac{(1 - q)^{N - 2}}{\sum_{i = 1}^{N - 1}(1 - q)^{N - 2}} = \frac{1}{N - 1}. \end{align*}\]

That is, the uniform distribution on {$1, \dots, N - 1$}.
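The cancellation in the last step is easy to see numerically too. A tiny sketch with hypothetical values of $\lambda$, $t$, and $N$ (any positive choices should behave the same way):

```python
from math import exp

lam, t, N = 0.7, 1.3, 6  # hypothetical parameter values
q = exp(-lam * t)


def yule_pmf(n):
    """P(X_t = n) = q * (1 - q) ** (n - 1) for a Yule process started at 1."""
    return q * (1 - q) ** (n - 1)


# Joint weights P(X_t = k) * P(Y_t = N - k) for k = 1, ..., N - 1.
joint = [yule_pmf(k) * yule_pmf(N - k) for k in range(1, N)]
conditional = [x / sum(joint) for x in joint]
print(conditional)  # every entry is (up to rounding) 1 / (N - 1)
```

Every joint weight equals $q^2(1 - q)^{N - 2}$ regardless of $k$, which is exactly why the conditional distribution comes out flat.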

Source: [A question on my midterm at STAT 580 that I couldn’t solve during the exam :)]

Convergence of Positive Supermartingales

2025-10-27T00:00:00-07:00

\[\begin{align*} \newcommand{\I}{\mathbb{1}} \newcommand{\R}{\mathbb{R}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\N}{\mathbb{N}} \newcommand{\sB}{\mathscr{B}} \DeclareMathOperator{\EE}{\mathbb{E}} \DeclareMathOperator{\PP}{\mathbb{P}} \newcommand{\Ev}[2]{\EE^{#2}\left[ #1 \right]} \newcommand{\Pr}[2]{\PP^{#2}\left( #1 \right)} \end{align*}\]

I really enjoyed reading Chapter two of Discrete Parameter Martingales and I thought why not make a blog post.

The book is quite advanced, but I’ll try my best to state everything clearly and simply given my temporal boredom threshold. Haha :)!

Let me fix the probability space first. Our probability space (as usual) is $(\Omega, \mathscr{F}, \mathbb{P})$. HOWEVER, in what follows it is more convenient to define a subset of the above probability space as $(\Omega', \mathscr{B}, \mathbb{P})$ (because we’re going to work with filtrations and naturally they are sub-$\sigma$-algebras that are growing, so we’re always in the subset case). A mapping $X$ only defined on $\Omega'$ is a random variable on $\Omega'$ if it is measurable with respect to (w.r.t.) the trace $\sigma$-algebra $\mathscr{F} \cap \Omega'$. So, what is a trace $\sigma$-algebra? I found it here.

Definition. Let $\Omega$ be a set, and let $\mathscr{F}$ be a $\sigma$-algebra on $\Omega$. Let $\Omega' \subseteq \Omega$ be a subset of $\Omega$. Then, the trace $\sigma$-algebra (of $\Omega'$ in $\mathscr{F}$), $\mathscr{B}$, is defined as:

\[\mathscr{B} := \{\Omega' \cap F : F \in \mathscr{F}\}.\]

It is a $\sigma$-algebra on $\Omega'$.

Now we can define what a positive supermartingale is. Let $(\Omega, \mathscr{F}, \mathbb{P})$ be our probability space and $(\mathscr{B}_n, n \in \mathbb{N})$ an increasing sequence of sub-$\sigma$-algebras of $\mathscr{F}$ (a.k.a. a filtration).

Definition. An adapted sequence $(X_n, n \in \mathbb{N})$ of positive random variables is called a positive supermartingale if the almost sure (a.s.) inequality

\[X_n \geq \mathbb{E}^{\mathscr{B}_n}(X_{n + 1})\]

is satisfied for all $n \in \mathbb{N}$. A supermartingale is by definition a sequence of random variables (r.v.s) which “decrease in conditional mean”. For a sequence of positive r.v.s denoting the successive values of the fortune of a gambler, the supermartingale condition expresses the property that at each play the game is unfavourable to the player in conditional mean.

We note that the inequality defining a supermartingale implies that

\[X_m \geq \mathbb{E}^{\mathscr{B}_m}(X_p), \; \forall p > m.\]

In fact the definition implies that

\[\mathbb{E}^{\mathscr{B}_m}(X_n) \geq \mathbb{E}^{\mathscr{B}_m}\mathbb{E}^{\mathscr{B}_n}(X_{n + 1}) \stackrel{\text{tower rule}}{=} \mathbb{E}^{\mathscr{B}_m}(X_{n + 1}),\]

if $n \geq m$. Therefore, the sequence ($\mathbb{E}^{\mathscr{B}_m}(X_n)$) is decreasing in $n$ for $n \geq m$, and its first term is $\mathbb{E}^{\mathscr{B}_m}(X_m) = X_m$.

The following proposition is so cool and is so fundamental.

Proposition (Maximal inequality). For every positive supermartingale, the r.v. $\sup_n X_n$ is a.s. finite on the set {$X_0 < \infty$}, and satisfies the following inequality

\[\mathbb{P}^{\mathscr{B}_0}(\sup_n X_n \geq a) \leq \min \left(\frac{X_0}{a}, 1\right)\]

for all constants $a > 0$.

Before going into the proof, remember that $\mathbb{P}^{\mathscr{B}}(A) = \mathbb{E}^{\mathscr{B}}(\mathbb{1}_A)$ and

\[\int_B\mathbb{P}^{\mathscr{B}}(A)d\mathbb{P} = \mathbb{P}(A \cap B), \quad B \in \mathscr{B}.\]

First, we need an auxiliary lemma which constitutes the switching principle for supermartingales. But before that, I need to talk about the concept of the stopping time.

Stopping time (unsurprisingly) is a concept from gambling. A stopping rule for the player is a rule for leaving the game, based at each time on the information at his/her disposal at that time. By this definition, a “dishonest” player who decides to leave the game at some time while already knowing certain subsequent outcomes of the game is excluded. Also, note that the stopping time could be $\infty$, meaning that the game never ends.

Definition. Let $\mathbb{N}^*$ denote $\mathbb{N} \cup \{\infty\}$. A mapping $\nu: \Omega \to \mathbb{N}^*$ is called a stopping time if

\[\{ \nu = n \} \in \mathscr{B}_n, \quad \forall n \in \mathbb{N}.\]

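With a stopping time $\nu$ we associate the $\sigma$-algebra $\mathscr{B}_\nu$, the collection of subsets of $\Omega$ defined by (here $\mathscr{B}_\infty$ denotes the $\sigma$-algebra generated by $\cup_n \mathscr{B}_n$)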

\[\mathscr{B}_\nu = \{B: B \in \mathscr{B}_\infty, B \cap \{\nu = n\} \in \mathscr{B}_n, \forall n \in \mathbb{N}\}.\]

Note that events in $\mathscr{B}_\nu$ are prior to $\nu$.

Back to the considered lemma:

Lemma (master). Given two positive supermartingales $(X^{(i)}_n)(i =1, 2)$ and a stopping time $\nu$ such that $X_\nu^{(1)} \geq X_\nu^{(2)}$ on {$\nu < \infty$}, the formula

\[X_n(\omega) = \begin{cases} X_n^{(1)}(\omega) & \mathrm{if}\; n < \nu(\omega)\\ X_n^{(2)}(\omega) & \mathrm{if}\; n \geq \nu(\omega) \end{cases} \quad (n \in \mathbb{N}),\]

defines a new positive supermartingale.

Proof. Indeed, the defining formula of the $X_n$ can also be written

\[X_n = \mathbb{1}_{\{\nu > n\}}X_n^{(1)} + \mathbb{1}_{\{\nu \leq n\}}X_n^{(2)},\]

so it is clear that $X_n$ is $\mathscr{B}_n$-measurable (the events {$\nu > n$} and {$\nu \leq n$} belong to $\mathscr{B}_n$ because $\nu$ is a stopping time). The supermartingale property of $X^{(i)}_n$ allows us to write

\[\begin{align*} X_n &= \mathbb{1}_{\{\nu > n\}}X_n^{(1)} + \mathbb{1}_{\{\nu \leq n\}}X_n^{(2)} \\ &\geq \mathbb{1}_{\{\nu > n\}} \mathbb{E}^{\mathscr{B}_n}\left[X_{n+1}^{(1)}\right] + \mathbb{1}_{\{\nu \leq n\}} \mathbb{E}^{\mathscr{B}_n}\left[X_{n+1}^{(2)}\right] \\ & = \mathbb{E}^{\mathscr{B}_n}\left[\mathbb{1}_{\{\nu > n\}} X_{n+1}^{(1)} + \mathbb{1}_{\{\nu \leq n\}} X_{n+1}^{(2)}\right]. \tag{since $\{\nu > n\}, \{\nu \leq n\} \in \mathscr{B}_n$} \end{align*}\]

The assumption $X_\nu^{(1)} \geq X_\nu^{(2)}$ on {$\nu < \infty$} implies that $X_{n + 1}^{(1)} \geq X_{n + 1}^{(2)}$ on {$\nu = n + 1$}, and

\[\mathbb{1}_{\{\nu > n\}} X_{n+1}^{(1)} - \mathbb{1}_{\{\nu > n + 1\}} X_{n+1}^{(1)} + \mathbb{1}_{\{\nu \leq n\}} X_{n+1}^{(2)} - \mathbb{1}_{\{\nu \leq n + 1\}} X_{n+1}^{(2)} = \mathbb{1}_{\{\nu = n + 1\}}\left(X_{n+1}^{(1)} - X_{n+1}^{(2)}\right) \geq 0.\]

Hence,

\[\mathbb{1}_{\{\nu > n\}} X_{n+1}^{(1)} + \mathbb{1}_{\{\nu \leq n\}} X_{n+1}^{(2)} \geq \mathbb{1}_{\{\nu > n + 1\}} X_{n+1}^{(1)} + \mathbb{1}_{\{\nu \leq n + 1\}} X_{n+1}^{(2)} = X_{n + 1},\]

and taking $\mathbb{E}^{\mathscr{B}_n}$ of both sides and combining with the previous chain of inequalities gives $X_n \geq \mathbb{E}^{\mathscr{B}_n}\left[X_{n + 1}\right]$. $\square$

Using the above lemma, let us associate with the positive supermartingale of the proposition the stopping time defined by

\[\nu_a = \begin{cases} \min(n:X_n > a) & \mathrm{if}\; \sup_n X_n > a\\ \infty & \mathrm{if}\; \sup_n X_n \leq a \end{cases}\; .\]

Since $X_{\nu_a} > a$ on {$\nu_a < \infty$} and since the constant $a$ can be considered a supermartingale, the formula

$Y_n = \begin{cases} X_n & n < \nu_a\\ a & n \geq \nu_a \end{cases}$ defines a new supermartingale by our previous lemma. Hence, $Y_0 \geq \mathbb{E}^{\mathscr{B}_0}[Y_n]$. Since $Y_0 = \min(X_0, a)$ (it equals $X_0$ if $X_0 \leq a$ and $a$ otherwise), and since $Y_n \geq a \mathbb{1}_{\{\nu_a \leq n\}}$, we have:

\[a \mathbb{P}^{\mathscr{B}_0}(\nu_a \leq n) \leq \min(X_0, a).\]

Letting $n$ tend to infinity, we obtain

\[\mathbb{P}^{\mathscr{B}_0}(\sup_n X_n > a) \leq \min\left(\frac{X_0}{a}, 1\right),\]

since {$\nu_a < \infty$} = {$\sup_n X_n > a$}. It suffices to replace $a$ by $a\left(1 - k^{-1}\right)$ in the inequality above and let $k$ tend to infinity to obtain the same inequality with $\geq$ instead of $>$ on the left-hand side. Let us integrate both sides over the event $B := \{X_0 < \infty\}$, which belongs to $\mathscr{B}_0$. Writing $A := \{\sup_n X_n > a\}$, we find that

\[\begin{align*} \int_{B}\mathbb{P}(A \mid \mathscr{B}_0) d\mathbb{P} & \leq \int_{\{X_0 < \infty\}} \min\left(\frac{X_0}{a}, 1\right)d\mathbb{P} \Rightarrow \\ \int_{B}\mathbb{E}(\mathbb{1}_A \mid \mathscr{B}_0) d\mathbb{P} &\leq \int_{\{X_0 < \infty\}} \min\left(\frac{X_0}{a}, 1\right)d\mathbb{P} \Rightarrow \\ \mathbb{E}[\mathbb{1}_{B} \cdot \mathbb{E}(\mathbb{1}_A \mid \mathscr{B}_0)] &\leq \int_{\{X_0 < \infty\}} \min\left(\frac{X_0}{a}, 1\right)d\mathbb{P} \Rightarrow \\ \mathbb{E}[\mathbb{E}(\mathbb{1}_{B} \cdot\mathbb{1}_A \mid \mathscr{B}_0)] &\leq \int_{\{X_0 < \infty\}} \min\left(\frac{X_0}{a}, 1\right)d\mathbb{P} \Rightarrow \\ \mathbb{E}(\mathbb{1}_{B} \cdot\mathbb{1}_A) &\leq \int_{\{X_0 < \infty\}} \min\left(\frac{X_0}{a}, 1\right)d\mathbb{P} \Rightarrow \\ \mathbb{P}(B \cap A) &\leq \int_{\{X_0 < \infty\}} \min\left(\frac{X_0}{a}, 1\right)d\mathbb{P} \Rightarrow \\ \mathbb{P}(X_0 < \infty, \sup_n X_n > a) &\leq \int_{\{X_0 < \infty\}} \min\left(\frac{X_0}{a}, 1\right)d\mathbb{P}. \end{align*}\]

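As $a$ tends to infinity, the R.H.S. tends to zero by the dominated convergence theorem (the integrand is bounded by $1$ and tends to zero on {$X_0 < \infty$}), while the L.H.S. decreases to $\mathbb{P}(X_0 < \infty, \sup_n X_n = \infty)$, and we have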

\[\mathbb{P}(X_0 < \infty, \sup_n X_n = \infty) = 0. \; \square\]
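To get a feel for the maximal inequality, here is a rough Monte Carlo sketch. Everything concrete in it is my own assumption for illustration: the supermartingale is $X_{n + 1} = X_n U_n$ with i.i.d. factors $U_n \in \{0.5, 1.4\}$ (so $\mathbb{E}[U_n] = 0.95 \leq 1$), $X_0$ is a constant, and the supremum is only taken over a finite horizon, which can only make the left-hand side smaller.

```python
import random

def simulate_running_max(x0, steps, rng):
    """Running maximum of X_{n+1} = X_n * U_n with U_n in {0.5, 1.4}, each w.p. 1/2."""
    x = x0
    running_max = x0
    for _ in range(steps):
        x *= rng.choice((0.5, 1.4))  # E[U_n] = 0.95 <= 1: a positive supermartingale
        running_max = max(running_max, x)
    return running_max

def empirical_tail(x0, a, steps=200, trials=5000, seed=0):
    """Empirical frequency of {max_{n <= steps} X_n >= a}."""
    rng = random.Random(seed)
    hits = sum(simulate_running_max(x0, steps, rng) >= a for _ in range(trials))
    return hits / trials

x0 = 1.0
results = {a: (empirical_tail(x0, a), min(x0 / a, 1.0)) for a in (0.5, 2.0, 4.0)}
for a, (freq, bound) in results.items():
    print(f"a = {a}: empirical {freq:.3f} vs bound {bound:.3f}")
```

The empirical frequencies should sit below $\min(X_0 / a, 1)$; for a strict supermartingale like this one the bound is typically not tight.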

Box. (Analysis II) A box $B$ in $\mathbb{R}^n$ is any set of the form

\[B = \Pi_{i = 1}^n (a_i, b_i) := \{(x_1, \dots, x_n) \in \mathbb{R}^n: x_i \in (a_i, b_i)\: \text{for all}\: 1 \leq i \leq n\},\]

where $b_i \geq a_i$ are real numbers. The volume of this box is defined as

\[\mathrm{vol}(B) := \Pi_{i=1}^n (b_i - a_i).\]

Outer Measure. (Analysis II) If $\Omega$ is a set, we define the outer measure $m^*(\Omega)$ of $\Omega$ to be the quantity

\[m^*(\Omega) := \inf \left\{\sum_{j \in J}\mathrm{vol}(B_j): (B_j)_{j \in J}\: \mathrm{covers}\: \Omega;\: J\: \text{at most countable}\right\}.\]

Measurable Sets. (Analysis II) Let $E$ be a subset of $\R^n$. We say $E$ is Lebesgue measurable, or measurable for short, iff we have the identity

$m^*(A) = m^*(A \cap E) + m^*(A-E)$ for every subset $A$ of $\R^n$. If $E$ is measurable, we define the Lebesgue measure of $E$ to be $m(E)= m^*(E)$; if $E$ is not measurable, we leave $m(E)$ undefined. In words, $E$ being measurable means that if we use the set $E$ to divide up an arbitrary set $A$ into two parts, we keep the additivity property.

Measurable Functions. (Analysis II) Let $\Omega$ be a measurable subset of $\mathbb{R}^n$ and let $f: \Omega \to \R^m$ be a function. The function $f$ is measurable iff $f^{-1}(V)$ is measurable for every open set $V \subseteq \R^m$. Another characterization of measurable functions is given by: Let $\Omega$ be a measurable subset of $\R^n$. A function $f: \Omega \to \R^*$ is said to be measurable iff $f^{-1}((a, \infty])$ is measurable for every real number $a$.

Absolutely Integrable. (Analysis II) Let $\Omega$ be a measurable subset of $\mathbb{R}^n$. A measurable function $f: \Omega \to \mathbb{R}^* := \mathbb{R} \cup \{-\infty, +\infty\}$ is said to be absolutely integrable if the integral $\int_\Omega |f|$ (w.r.t. the Lebesgue measure) is finite.

Pointwise Convergence. (Analysis II) The most obvious notion of convergence of functions is pointwise convergence, or convergence at each point of the domain. Let $\left(f^{(n)} \right)_{n = 1}^\infty$ be a sequence of functions from one metric space $(X, d_X)$ to another $(Y, d_Y)$, and let $f: X \to Y$ be another function. We say that $\left(f^{(n)} \right)_{n = 1}^\infty$ converges pointwise to $f$ if we have

\[\lim_{n \to \infty}f^{(n)}(x) = f(x), \quad \forall x \in X,\]

i.e.,

\[\lim_{n \to \infty}d_Y\left(f^{(n)}(x), f(x)\right) = 0.\]

We call the function $f$ the pointwise limit of the functions $f^{(n)}$.

Limits of measurable functions are measurable. (Analysis II) Let $\Omega$ be a measurable subset of $\mathbb{R}^n$, and for each positive integer $n$, let $f_n: \Omega \to \mathbb{R}^*$ be a measurable function. Then, the functions $\sup_n f_n, \inf_n f_n, \limsup_{n \to \infty} f_n, \mathrm{and}\, \liminf_{n \to \infty} f_n$ are also measurable. In particular, if the $f_n$ converge pointwise to another function $f: \Omega \to \mathbb{R}^*$, then $f$ is also measurable.

Proof. We first prove the claim about $\sup_n f_n$. Call this function $g$. We have to prove that $g^{-1}((a, \infty])$ is measurable for every $a$. First we show that

\[g^{-1}((a, \infty]) = \cup_{n \geq 1}f_n^{-1}((a, \infty]),\]

then the claim follows since the countable union of measurable sets is measurable. To show the above display, consider the two inclusions:

$\subseteq$

If $x \in g^{-1}((a, \infty])$, then $g(x) = \sup_n f_n(x) > a$. If every $f_n(x) \leq a$, then $\sup_n f_n(x) \leq a$ which is a contradiction. Hence, there exists some $n$ with $f_n(x) > a$. Thus, $x \in f_n^{-1}((a, \infty])$ for that $n$, so $x$ belongs to the union.

$\supseteq$

If $x \in \cup_{n \geq 1}f_n^{-1}((a, \infty])$, then for some $n$ we have $f_n(x) > a$. Therefore, $g(x) = \sup_n f_n(x) \geq f_n(x) > a$ and $x \in g^{-1}((a, \infty])$.

Since both inclusions hold, the sets are equal.

One important note: we know that {$g > a$} $= \cup_n$ {$f_n > a$}, but in general {$g \geq a$} $\neq \cup_n$ {$f_n \geq a$}, because the supremum being at least $a$ does not mean that at least one of the $f_n$s attains it (e.g., $f_n \equiv a - \frac 1n$). We can fix this by using an approximation from below:

\[\{g \geq a\} = \cap_{m = 1}^\infty \cup_{n = 1}^\infty \{f_n \geq a - \frac 1m\}.\]

Back to the main proposition, a similar argument works for $\inf_n f_n$. The claims for $\limsup$ and $\liminf$ then follow from the identities

\[\limsup_{n \to \infty} f_n = \inf_{N \geq 1} \sup_{n \geq N} f_n, \quad \liminf_{n \to \infty} f_n = \sup_{N \geq 1} \inf_{n \geq N} f_n.\]

Lebesgue Monotone Convergence Theorem. (Analysis II) Let $\Omega$ be a measurable subset of $\R^n$, and let $f_n$ be a sequence of non-negative measurable functions from $\Omega$ to $[0, \infty]$, which are increasing in the sense that

\[0 \leq f_1(x) \leq f_2(x) \leq \dots \quad \forall x \in \Omega.\]

(Note we are assuming that $f_n(x)$ is increasing with respect to $n$; this is a different notion from $f_n(x)$ increasing with respect to $x$.) Then we have

\[0 \leq \int_\Omega f_1 \leq \int_\Omega f_2 \leq \dots\]

and

\[\int_\Omega \sup_n f_n = \sup_n \int_\Omega f_n.\]

Proof. For the first conclusion, we should prove the following:

If $0 \leq f(x) \leq g(x)$ for all $x \in \Omega$ for measurable non-negative functions $f, g$, then we have $\int_\Omega f \leq \int_\Omega g$. Let $h := g - f \geq 0$. We proceed by showing that $0 \leq \int_\Omega h \leq \infty$. Since $f, g$ are measurable, $h$ is measurable and since $h$ is non-negative as well, its Lebesgue integral is equal to

\[\int_\Omega h = \sup \left\{\int_\Omega s: s \text{ is simple, non-negative and dominated by } h \right\}.\]

Fix an $s$. Since $s$ is simple, it can be written as $s = \sum_{j = 1}^{N} c_j \mathbb{1}_{E_j}$, where $c_j \geq 0$ and the $E_j$ are disjoint measurable sets with $\cup_j E_j = \Omega$, so that $\int_\Omega s = \sum_{j = 1}^{N} c_j m(E_j)$. The sum of non-negative numbers is between zero and $\infty$, hence their $\sup$ is between zero and infinity. (A more direct route that avoids the subtraction $g - f$: every simple function dominated by $f$ is also dominated by $g$, so the supremum defining $\int_\Omega g$ runs over a larger set than the one defining $\int_\Omega f$.)

Now the second conclusion. By the first conclusion applied to $f_n \leq \sup_m f_m$,

\[\int_\Omega \sup_m f_m \geq \int_\Omega f_n\]

for every $n$; taking the supremum over $n$,

\[\int_\Omega \sup_m f_m \geq \sup_n \int_\Omega f_n.\]

If we show that $\int_\Omega \sup_m f_m \leq \sup_n \int_\Omega f_n$, then the proof will be complete. By the definition of $\int_\Omega \sup_m f_m$, it is sufficient to show that every simple non-negative function $s$ dominated by $\sup_m f_m$ satisfies

$\int_\Omega s \leq \sup_n \int_\Omega f_n.$ It is enough to show that

$\int_\Omega s \leq \sup_n \int_\Omega f_n + \epsilon\int_\Omega s$ for every $0 < \epsilon < 1$ (note that our functions are all non-negative, hence the range of the epsilon); the claim then follows by taking limits as $\epsilon \to 0$. $(\star)$

By construction,

\[s(x) \leq \sup_n f_n (x), \quad \forall x \in \Omega.\]

Hence, for all $x \in \Omega$, there exists an $N$ (depending on $x$) such that

$f_N(x) \geq (1 - \epsilon)s(x).$ Since $f_n$ are increasing, this implies that $f_n(x) \geq (1 - \epsilon)s(x)$ for all $n \geq N$.

Define the sets

$E_n := \{x \in \Omega: f_n(x) \geq (1 - \epsilon) s(x)\}.$ Since $f_n$ is measurable by assumption and $s$ is measurable because it’s a simple function, the function $f_n(x) - s(x) + \epsilon s(x)$ is measurable, and hence so is each $E_n$. We also have that $E_1 \subseteq E_2 \subseteq \dots$, and $\cup_{n = 1}^\infty E_n = \Omega$. We have that

\[(1 - \epsilon) \int_{E_n}s = \int_{E_n} (1 - \epsilon)s \leq \int_{E_n} f_n \stackrel{(\star\star)}{\leq} \int_{\Omega} f_n,\]

if we show that $(\star\star)$ holds. Then by taking limits as $\epsilon \to 0$ and taking suprema, it suffices to show that

\[\sup_n \int_{E_n}s = \int_{\Omega}s,\]

which would prove $(\star)$.

Since $s$ is a simple function, $\int_\Omega s = \sum_{j = 1}^N c_j \cdot m(F_j)$, where $F_1, \dots, F_N$ are disjoint measurable subsets of $\Omega$. Similarly, $\int_{E_n} s = \sum_{j = 1}^N c_j \cdot m(F_j \cap E_n)$. Since the sums are finite and the summands are non-negative, it suffices to show that $\sup_n m(F_j \cap E_n) = m(F_j)$. $(\star\star\star)$

So to summarize, to prove $(\star)$ we need to prove the following two claims.

$\int_{E_n} f_n \leq \int_{\Omega} f_n$. $(\star\star)$
$\sup_n m(F_j \cap E_n) = m(F_j)$. $(\star\star\star)$

The following proposition proves $(\star\star)$:

Let $\Omega$ be a measurable set, and $f: \Omega \to [0, \infty]$ be a non-negative measurable function. If $\Omega' \subseteq \Omega$ is measurable, then $\int_{\Omega'} f = \int_{\Omega} f \mathbb{1}_{\Omega'} \leq \int_\Omega f$.

Proof. By the definition of the Lebesgue integral

\[\begin{align*} \int_\Omega f &= \sup \left\{\int_\Omega s: s \text{ is simple, non-negative and dominated by } f \right\}, \\ \int_{\Omega'} f &= \sup \left\{\int_{\Omega'} s: s \text{ is simple, non-negative and dominated by } f \right\}. \end{align*}\]

Fix $s$. We have

\[\begin{align*} \int_\Omega s &= \sum_{j = 1}^N c_j \cdot m(F_j), \\ \int_{\Omega'} s &= \sum_{j = 1}^N c_j \cdot m(F_j \cap \Omega'). \end{align*}\]

But, $F_j \cap \Omega' \subseteq F_j$. Therefore, $m\left(F_j \cap \Omega'\right) \leq m(F_j)$. By multiplying both sides by the non-negative constants $c_j$ and summing over $j$ we get that $\int_{\Omega'} s \leq \int_{\Omega} s$. Taking suprema on both sides completes the proof.

The following proposition helps prove $(\star\star\star)$:

If $A_1 \subseteq A_2 \subseteq \dots$ is an increasing sequence of measurable sets, then

\[m\left(\cup_{j = 1}^\infty A_j \right) = \lim_{j \to \infty} m(A_j).\]

Proof.

Define

\[B_1 := A_1, \quad B_j := A_j \setminus A_{j - 1}, \quad j \geq 2.\]

The $B_j$ are disjoint and measurable. Moreover, $A := \cup_{j = 1}^\infty A_j = \cup_{j = 1}^\infty B_j$, and $A_N = \cup_{j = 1}^N A_j = \cup_{j = 1}^N B_j$. Hence, by countable additivity,

\[m(A) = m\left(\cup_{j = 1}^\infty B_j\right) = \sum_{j = 1}^\infty m(B_j) = \lim_{N \to \infty} \sum_{j = 1}^N m(B_j) = \lim_{N \to \infty} m\left(\cup_{j = 1}^N B_j\right) = \lim_{N \to \infty} m(A_N).\]

So in $(\star\star\star)$, we have

\[\sup_n m(F_j \cap E_n) \stackrel{\text{(monotonicity)}}{=} \lim_{n \to \infty} m(F_j \cap E_n) = m\left(F_j \cap \left(\cup_{n = 1}^\infty E_n\right) \right) = m\left(F_j \cap \Omega \right) = m(F_j).\]

Fatou’s Lemma. (Analysis II) Let $\Omega$ be a measurable subset of $\mathbb{R}^n$, and let $f_1, f_2, \dots$ be a sequence of non-negative measurable functions from $\Omega$ to $[0, \infty]$. Then

\[\int_\Omega \liminf_{n \to \infty} f_n \leq \liminf_{n \to \infty}\int_\Omega f_n.\]

Proof.

\[\begin{align*} \int_\Omega \liminf_{n \to \infty} f_n & = \int_\Omega \sup_{m \geq 1} \inf_{n \geq m} f_n \\ & = \sup_{m \geq 1} \int_\Omega \inf_{n \geq m} f_n \tag{monotone convergence}. \end{align*}\]

$\inf_{n \geq m} f_n \leq f_j$ for all $j \geq m$. Hence, $\int_\Omega \inf_{n \geq m} f_n \leq \int_\Omega f_j$. By taking infima w.r.t. $j$, we have

\[\int_\Omega \inf_{n \geq m} f_n \leq \inf_{j \geq m}\int_\Omega f_j.\]

Therefore, we obtained the following that completes the proof

\[\int_\Omega \liminf_{n \to \infty} f_n \leq \sup_{m \geq 1}\int_\Omega \inf_{n \geq m} f_n \leq \sup_{m \geq 1}\inf_{j \geq m}\int_\Omega f_j = \liminf_{n \to \infty} \int_{\Omega} f_n.\]

Lebesgue Dominated Convergence Theorem. (Analysis II) Let $\Omega$ be a measurable subset of $\mathbb{R}^n$, and let $f_1, f_2, \dots$ be a sequence of measurable functions from $\Omega$ to $\mathbb{R} \cup \{-\infty, +\infty\}$ which converge pointwise. Suppose also that there is an absolutely integrable function $F: \Omega \to [0, \infty]$ such that $|f_n(x)| \leq F(x)$ for all $x \in \Omega$ and all $n = 1, 2, 3, \dots$. Then,

\[\int_{\Omega} \lim_{n \to \infty} f_n = \lim_{n \to \infty}\int_{\Omega} f_n.\]

Proof. If $F$ were infinite on a set of non-zero measure, then $F$ would not be absolutely integrable; thus the set where $F$ is infinite has zero measure. We may delete this set from $\Omega$ (this does not affect any of the integrals) and thus assume without loss of generality that $F(x)$ is finite for every $x \in \Omega$, which implies the same assertion for the $f_n(x)$.

Let $f:\Omega \to \mathbb{R} \cup$ {$-\infty, +\infty$} be the function $f(x) := \lim_{n \to \infty}f_n(x)$ which exists by assumption. Since $f$ is the limit of measurable functions, it’s measurable. Also, since $|f_n(x)| \leq F(x)$ for all $n$ and all $x \in \Omega$ , we see that each $f_n$ is absolutely integrable, and by taking limits we obtain $| f(x)| \leq F(x)$ for all $x \in \Omega$ , so $f$ is also absolutely integrable. Our task is to show that $\lim_{n \to \infty}\int_{\Omega} f_n = \int_\Omega f$. The functions $F + f_n$ are non-negative and converge pointwise to $F + f$. So by Fatou’s lemma

\[\int_\Omega F + f \leq \int_\Omega F + \liminf_{n \to \infty} \int f_n \Rightarrow \int_\Omega f \leq \liminf_{n \to \infty} \int f_n.\]

But also, $F - f_n$ are non-negative and converge pointwise to $F - f$. So, again by Fatou’s lemma

\[\int_\Omega (F - f) \leq \liminf_{n \to \infty} \int_\Omega (F - f_n) = \int_\Omega F - \limsup_{n \to \infty} \int f_n \Rightarrow \int_\Omega f \geq \limsup_{n \to \infty} \int f_n.\]

Hence, we showed that

\[\int_\Omega f \leq \liminf_{n \to \infty} \int f_n \leq \limsup_{n \to \infty} \int f_n \leq \int_\Omega f.\]

Hence, the $\liminf$ and $\limsup$ of $\int_\Omega f_n$ are equal as we wanted.
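The domination hypothesis really is needed. A standard example (not from the text above, so treat it as my own illustration) is $f_n = n \mathbb{1}_{(0, 1/n)}$ on $\Omega = (0, 1)$: each $\int_\Omega f_n = 1$, yet $f_n \to 0$ pointwise, so the limit and the integral cannot be exchanged, and Fatou's inequality is strict ($0 < 1$). The sketch below checks this numerically with a midpoint rule; `f`, `midpoint_integral` and the grid size are hypothetical choices.

```python
def f(n, x):
    """f_n = n * indicator of (0, 1/n) on Omega = (0, 1)."""
    return float(n) if 0.0 < x < 1.0 / n else 0.0

def midpoint_integral(g, cells=2**16):
    """Midpoint-rule approximation of the integral of g over (0, 1)."""
    width = 1.0 / cells
    return sum(g((i + 0.5) * width) for i in range(cells)) * width

# The integrals stay at 1 for every n ...
integrals = [midpoint_integral(lambda x, n=n: f(n, x)) for n in (1, 2, 4, 8, 16)]
# ... while at a fixed point x the values are 0 as soon as n >= 1/x.
pointwise_tail = [f(n, 0.01) for n in (200, 400, 800)]
print(integrals)
print(pointwise_tail)
```

There is no contradiction with the theorem: any $F$ dominating all the $f_n$ must satisfy $F(x) \geq \sup_n f_n(x) \geq \frac{1}{x} - 1$, which is not integrable on $(0, 1)$.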

Remark. The preceding proof of the maximal inequality is valid without any change if we replace the constant $a$ by any positive $\sB_0$-measurable r.v. Say $A$ is such an r.v. Then, if $\mathbb{P}^{\sB_0}(A \leq \sup_n X_n) = 1$, we get $A \leq X_0$, i.e., $X_0$ is the largest $\sB_0$-measurable r.v. dominated by $\sup_n X_n$. More generally, any $\sB_p$-measurable r.v. $A$ that satisfies $A \leq \sup_n X_n$ a.s. also satisfies $A \leq \sup_{n \leq p} X_n$; this can be seen by applying the preceding result to the supermartingale $(\sup_{n \leq p} X_n, X_{p+1}, X_{p+2}, \dots)$ adapted to $(\sB_p, \sB_{p + 1}, \dots)$.

Theorem. Every positive supermartingale $(X_n)$ almost surely converges. Furthermore, the limit $X_\infty = \lim_{n \to \infty} X_n$ a.s. satisfies the following inequality

\[\Ev{X_\infty}{\sB_n} \leq X_n, \quad n \ \in \mathbb{N}.\]

Proof.

First, we’ll discuss what it means that a sequence of real numbers converges.

Given a sequence $(x_n)$ in $\R^*$ and a pair of real numbers $a, b$ where $a < b$, we define the indices $\nu_k$ recursively by

\[\begin{align*} \nu_1 &= \min\{n: n \geq 0, x_n \leq a\} \\ \nu_2 &= \min\{n: n \geq \nu_1, x_n \geq b\} \\ \nu_3 &= \min\{n: n \geq \nu_2, x_n \leq a\} \\ \nu_4 &= \min\{n: n \geq \nu_3, x_n \geq b\} \\ & \vdots . \end{align*}\]

If some $\nu_k$ is not defined, we set it and all subsequent indices equal to $\infty$. Let $\beta_{a, b}$ be the largest value of $p$ for which $\nu_{2p}$ is finite (and $\beta_{a, b} = 0$ if $\nu_2 = \infty$). If all $\nu_k$ are finite, then $\beta_{a, b} = \infty$. $\beta_{a, b}$ denotes the number of upcrossings of the sequence $(x_n)$ on $[a, b]$. We can see that

\[\liminf_{n \to \infty} x_n < a < b < \limsup_{n \to \infty} x_n \Rightarrow \beta_{a, b} = \infty \Rightarrow \liminf_{n \to \infty} x_n \leq a < b \leq \limsup_{n \to \infty} x_n.\]

Therefore, we can deduce that a sequence in $\R^*$ is convergent iff $\beta_{a, b} <\infty$ for every $a < b \in \R$. Now, let’s consider the case of the sequence of r.v.s $(X_n)$. Note that the r.v.s $\nu_k(\omega)$ are stopping times, i.e., {$\nu_k = n$} $\in \sB_n$ for every $n$. This is due to

$\{\nu_{2p} = n\} = \cup_{m < n} \{\nu_{2p - 1} = m\: \mathrm{and}\: X_{m + 1} < b, \dots, X_{n - 1} < b, X_n \geq b\},$ and a similar argument for the odd indices. The event {$\beta_{a, b} \geq p$} $=$ {$\nu_{2p} < \infty$} shows that $\beta_{a, b}$ is also an r.v. Hence, the convergence criterion is

\[\{X_n \to \cdot\} = \cap_{a < b \in \R} \{\beta_{a, b} < \infty\} \stackrel{\Q\, \text{is dense}}{=} \cap_{a < b \in \Q} \{\beta_{a, b} < \infty\}.\]
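The upcrossing count $\beta_{a, b}$ is easy to compute for a finite sequence, which may help make the definition of the $\nu_k$ concrete. This is a minimal sketch; the function name `count_upcrossings` is hypothetical, and for an infinite sequence it only gives the number of upcrossings completed within the observed prefix.

```python
def count_upcrossings(xs, a, b):
    """Number of completed upcrossings of [a, b] by the finite sequence xs (needs a < b)."""
    if not a < b:
        raise ValueError("need a < b")
    count = 0
    waiting_for_low = True  # True while searching for nu_{2p-1}: next index with x_n <= a
    for x in xs:
        if waiting_for_low:
            if x <= a:
                waiting_for_low = False  # nu_{2p-1} found
        elif x >= b:
            count += 1                   # nu_{2p} found: one more upcrossing
            waiting_for_low = True
    return count

oscillating = [(-1) ** n for n in range(100)]   # keeps crossing [-0.5, 0.5]
decaying = [1.0 / (n + 1) for n in range(100)]  # converges to 0
print(count_upcrossings(oscillating, -0.5, 0.5))
print(count_upcrossings(decaying, 0.1, 0.5))
```

The oscillating sequence keeps accumulating upcrossings as the prefix grows, while the convergent one stops after finitely many (here none), in line with the criterion above.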

Rationals are dense. (Analysis 1) If $x$ and $y$ are two rationals such that $x < y$, then there exists a third rational $z$ such that $x < z < y$.

Proof.

Set $z := \frac{x + y}{2}$. Since $x < y$, we have $x = \frac{x + x}{2} < \frac{x + y}{2} < \frac{y + y}{2} = y$, and $z$ is rational because the rationals are closed under addition and division by $2$.

A homeomorphism. (Topology) Let $X$ and $Y$ be topological spaces (like open intervals); let $f: X \to Y$ be a bijection. If both the function $f$ and the inverse function $f^{-1}: Y \to X$ are continuous, then $f$ is called a homeomorphism. You may have studied in modern algebra the notion of an isomorphism between algebraic objects such as groups or rings. An isomorphism is a bijective correspondence that preserves the algebraic structure involved. The analogous concept in topology is that of homeomorphism; it is a bijective correspondence that preserves the topological structure involved.

Hence, to prove our proposition, we need to prove $\beta_{a, b} < \infty$ a.s. for every $0 < a < b \in \R$. We only need to consider positive numbers since $(X_n)$ takes values in $[0, \infty]$: if $\liminf_n X_n < \limsup_n X_n$, then there are (rational) numbers $0 < a < b$ strictly between them. (Alternatively, one can transfer the criterion from $\R^*$ to $[0, \infty]$ through a homeomorphism such as $f(x) = e^x$, with $f(-\infty) = 0$ and $f(+\infty) = \infty$.)

To this end, we need Dubins’ inequalities.

Dubins’ inequalities. For every positive supermartingale $(X_n)$ the upcrossing numbers are r.v.s satisfying

\[\Pr{\beta_{a, b} \geq k}{\sB_0} \leq \left(\frac{a}{b}\right)^k \min\left(\frac{X_0}{a}, 1\right),\]

for every integer $k \geq 1$ and real numbers $0 < a < b$. Therefore (letting $k \to \infty$), the r.v.s $\beta_{a, b}$ are a.s. finite.

Proof. First note that the events {$\omega: \nu_k(\omega) =n$} belong to $\sB_n$, so the $\nu_k$ are stopping times. Now extending the master lemma, we can define the following supermartingale:

\[\begin{align*} Y_n & =1 \quad \mathrm{if}\: 0 \leq n < \nu_1, \\ &= \frac{X_n}{a} \quad \mathrm{if}\: \nu_1 \leq n < \nu_2, \\ &= \frac{b}{a} \cdot 1 \quad \mathrm{if}\: \nu_2 \leq n < \nu_3, \\ & \vdots \\ & = \left(\frac{b}{a}\right)^{k -1} \cdot \frac{X_n}{a} \quad \mathrm{if}\: \nu_{2k - 1} \leq n < \nu_{2k}, \\ & = \left(\frac{b}{a}\right)^{k} \quad \mathrm{if}\: n \geq \nu_{2k}. \end{align*}\]

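In fact, the switching conditions required by the lemma hold at the successive stopping times (wherever they are finite):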

\[\begin{align*} 1 &\geq \frac{X_{\nu_1}}{a}, \\ \frac{X_{\nu_2}}{a} &\geq \frac{b}{a}, \\ &\vdots \\ \left(\frac{b}{a}\right)^{k - 1} \cdot \frac{X_{\nu_{2k}}}{a} &\geq \left(\frac{b}{a}\right)^k. \end{align*}\]

By construction we have that $Y_0 = \min\left(1, \frac{X_0}{a}\right)$. Since {$Y_n$} is a supermartingale we have that $Y_0 \geq \Ev{Y_n}{\sB_0}$. Also on the set {$n \geq \nu_{2k}$}, $Y_n \geq \left(\frac{b}{a}\right)^{k}$, or equivalently $Y_n \geq \left(\frac{b}{a}\right)^{k} \cdot \I_{\{n \geq \nu_{2k}\}}$. Using all these facts, we have:

\[\begin{align*} \left(\frac{b}{a}\right)^k \Pr{\nu_{2k} \leq n}{\sB_0} \leq \min\left(1, \frac{X_0}{a}\right). \end{align*}\]

All that remains is to let $n \to \infty$ and note that {$\nu_{2k} < \infty$} $=$ {$\beta_{a,b} \geq k$}. $\square$
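Dubins’ inequalities can also be probed by simulation. The sketch below is only an illustration under my own assumptions: the supermartingale is $X_{n + 1} = X_n U_n$ with i.i.d. $U_n \in \{0.5, 1.4\}$, $X_0 = 1$ is constant (so the conditional probability given $\sB_0$ is an ordinary probability), and upcrossings are counted over a finite horizon, which can only undercount. All function names are hypothetical.

```python
import random

def count_upcrossings(xs, a, b):
    """Completed upcrossings of [a, b] by the finite sequence xs."""
    count, waiting_for_low = 0, True
    for x in xs:
        if waiting_for_low:
            waiting_for_low = not (x <= a)
        elif x >= b:
            count, waiting_for_low = count + 1, True
    return count

def sample_path(x0, steps, rng):
    """One path of X_{n+1} = X_n * U_n with U_n in {0.5, 1.4}, each w.p. 1/2."""
    path = [x0]
    for _ in range(steps):
        path.append(path[-1] * rng.choice((0.5, 1.4)))
    return path

def upcrossing_tail(x0, a, b, k_max=3, steps=200, trials=5000, seed=1):
    """Empirical frequencies of {beta_{a,b} >= k} for k = 1, ..., k_max."""
    rng = random.Random(seed)
    counts = [count_upcrossings(sample_path(x0, steps, rng), a, b) for _ in range(trials)]
    return [sum(c >= k for c in counts) / trials for k in range(1, k_max + 1)]

x0, a, b = 1.0, 1.0, 2.0
freqs = upcrossing_tail(x0, a, b)
bounds = [(a / b) ** k * min(x0 / a, 1.0) for k in range(1, 4)]
for k, (freq, bound) in enumerate(zip(freqs, bounds), start=1):
    print(f"k = {k}: empirical {freq:.3f} vs bound {bound:.3f}")
```

The geometric decay in $k$ is the point: each additional upcrossing costs at least a factor $a / b$.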

The final part of the proof involves showing that the limit $X_\infty = \lim_{n \to \infty} X_n$ a.s. satisfies the following inequality

$\Ev{X_\infty}{\sB_n} \leq X_n, \quad n \ \in \mathbb{N}.$ The inequality

\[\Ev{\inf_{m \geq n} X_m}{\sB_p} \leq \Ev{X_n}{\sB_p} \leq X_p\]

is valid if $n > p$. Let $n$ tend to infinity, then

\[\Ev{X_\infty}{\sB_p} = \lim_{n \to \infty} \Ev{\inf_{m \geq n} X_m}{\sB_p} \leq X_p, \quad p \in \mathbb{N}.\]

where we used the monotone convergence theorem (in its conditional form) to exchange the expectation and the limit, because $\left(\inf_{m \geq n} X_m, \, n \in \mathbb{N}\right)$ is an increasing sequence of positive r.v.s that converges to $\liminf_n X_n = X_\infty$. The proof is completed.

We had already shown that $\sup_n X_n < \infty$ on {$X_0 < \infty$}, which implies that $X_\infty < \infty$ on {$X_0 < \infty$}; applying this result to the supermartingale $(X_n, n \geq p)$ adapted to the sequence $(\sB_n, n \geq p)$, we find that $X_\infty < \infty$ a.s. on {$X_p < \infty$} for all $p \in \mathbb{N}$, which is another way of completing the last part of our proof.

The end! :)


Some cool things about finite Markov chains and thereof

2025-09-23T00:00:00-07:00

In this post, I wanna summarize Sections 1.5 and 1.6 of Introduction to Stochastic Processes, which kinda blew my mind.

Okay, so we’re dealing with finite Markov chains {$X_n$} that have transition matrix $P$. A Markov chain can have some recurrent classes and some transient states. Therefore, we can divide $P$ into blocks that represent these concepts. Concretely,

\[P = \left(\begin{array}{c|c} \tilde{P} & 0 \\ \hline S & Q \end{array}\right),\]

where $\tilde{P}$ contains the transitions of recurrent classes and $Q$ contains the transitions of transient states. The matrix $Q$ is a substochastic matrix and since it only contains transient states, its eigenvalues are strictly less than one in absolute value, which implies $\lim_{n \to \infty} Q^n = 0$.

So, now let’s prove that if all eigenvalues of a matrix $A$ have absolute value less than one, then $\lim_{n \to \infty} A^n = 0$. We know that we can write any matrix $A$ in its Jordan form as $A = M J M^{-1}$, hence $A^n = MJ^nM^{-1}$. The matrix $J$ is of the form

\[J = \begin{pmatrix} J_1 & \\ & \ddots \\ & & J_s\end{pmatrix},\]

where

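\[J_i =\begin{pmatrix} \lambda_i & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{pmatrix}.\]

As $n$ tends to infinity, $J^n$ tends to zero: the non-zero entries of $J_i^n$ are of the form $\binom{n}{k}\lambda_i^{n - k}$ with $k$ smaller than the size of the block, and these tend to zero because the geometric decay of $|\lambda_i|^{n}$ beats the polynomial growth of $\binom{n}{k}$. As a result, $A^n$ tends to zero.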

Let $i$ be a transient state and $Y_i$ denote the total number of visits to $i$ which by the virtue of being a transient state is finite almost surely. Suppose $X_0 =j$, where $j$ is another transient state. Then,

\[\begin{align*} \mathbb{E}[Y_i \mid X_0 = j] & = \mathbb{E}\left[\sum_{n=0}^\infty \mathbb{I}\left\{X_n = i \right\} \mid X_0=j \right] \\ & = \sum_{n=0}^\infty \mathbb{P}\left(\{X_n = i\} \mid X_0=j \right) \\ & = \sum_{n = 0}^\infty p_n(j, i) \\ & = \left(I + P + P^2 \dots \right)_{ji} \\ & = \left(I + Q + Q^2 \dots \right)_{ji} & \text{(We kept the indices)}. \end{align*}\]

However, we’ll show right away that $\left(I + Q + Q^2 \dots \right)(I - Q) = I$.

By ChatGPT.

Let $A$ be any square matrix whose eigenvalues all have magnitude strictly less than one, then

\[S := I + A + A^2 + \dots = (I - A)^{-1}.\]

Proof.

Consider the partial sum

$S_n = \sum_{k =0}^n A^k.$ Then,

\[\begin{align*} (I - A) S_n \stackrel{\text{telescopic sum}}{=} I - A^{n + 1} \end{align*}.\]

Hence,

$S_n = (I - A)^{-1}\left(I - A^{n + 1}\right),$ where $I - A$ is invertible because $1$ is not an eigenvalue of $A$. As $n$ tends to infinity, $A^{n + 1}$ tends to zero, so $S_n$ tends to $(I - A)^{-1}$.

Therefore, define $M = (I - Q)^{-1}$ to get $\mathbb{E}[Y_i \mid X_0 = j] = M_{ji}$. The upshot is: If we want to compute the expected number of steps until the chain enters a recurrent class, assuming $X_0 = j$, we need only sum $M_{ji}$ over all transient states $i$.
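The series identity used above is easy to check numerically. Here is a minimal sketch with plain Python lists; the $2 \times 2$ matrix `A` is a made-up example (its eigenvalues are roughly $0.48$ and $0.02$), and the helper names are hypothetical.

```python
def mat_mul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_partial_sum(A, terms):
    """S_n = I + A + ... + A^n with n = terms."""
    n = len(A)
    total, power = identity(n), identity(n)
    for _ in range(terms):
        power = mat_mul(power, A)
        total = mat_add(total, power)
    return total

A = [[0.2, 0.5], [0.1, 0.3]]  # hypothetical matrix with eigenvalues inside the unit disc
S = neumann_partial_sum(A, 200)
I_minus_A = [[identity(2)[i][j] - A[i][j] for j in range(2)] for i in range(2)]
residual = mat_mul(I_minus_A, S)  # equals I - A^{201}, so it should be close to I
print(residual)
```

If an eigenvalue had magnitude at least one, the partial sums would not settle down and `residual` would drift away from the identity.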

We can also use this technique to determine the expected number of steps that an irreducible Markov chain takes to go from one state $j$ to another state $i$. We first write the transition matrix $P$ for the chain with $i$ being the first pivot:

\[P = \left( \begin{array}{c|c} P(i, i) & R \\ \hline S & Q \end{array} \right).\]

We then change $i$ to an absorbing state

\[\widetilde{P} = \left( \begin{array}{c|c} 1 & \mathbf{0} \\ \hline S & Q \end{array} \right).\]

Let $T_i$ be the number of steps needed to reach state $i$. For any other state $k$ let $T_{i,k}$ be the number of visits to $k$ before reaching $i$. Then, with $M = (I - Q)^{-1}$ computed from the $Q$ block of $\widetilde{P}$,

\[\mathbb{E}\left[T_i \mid X_0=j \right] = \mathbb{E}\left[\sum_{k\neq i} T_{i, k} \mid X_0=j \right] = \sum_{k \neq i}M_{jk}.\]

We now suppose that there are at least two different recurrent classes and ask the question: starting at a given transient state $j$, what is the probability that the Markov chain eventually ends up in a particular recurrent class?

In order to answer this question, we can assume that the recurrent classes consist of single points $r_1,\dots , r_k$ with $p(r_i, r_i) = 1$. If we order the states so that the recurrent states $r_1,\dots , r_k$ precede the transient states $t_1, \dots, t_s$, then

\[\widetilde{P} = \left( \begin{array}{c|c} I & \mathbf{0} \\ \hline S & Q \end{array} \right).\]

For $i = 1,\dots , s, \, j = 1,\dots , k$, let $\alpha(t_i, r_j)$ be the probability that the chain starting at $t_i$ eventually ends up in recurrent state $r_j$. We set $\alpha(r_i, r_i) = 1$ and $\alpha(r_i, r_j) = 0 \; \mathrm{if}\, i \neq j$. For any transient state $t_i$, conditioning on the first step gives (the sum runs over the whole state space, denoted $S$ here and not to be confused with the block $S$ of $\widetilde{P}$)

\[\begin{align} \alpha(t_i, r_j) &= \mathbb{P}(X_n = r_j \text{ eventually} \mid X_0 = t_i) \\ & = \sum_{x \in S} \mathbb{P}(X_1 = x \mid X_0 = t_i) \mathbb{P}(X_n = r_j \text{ eventually} \mid X_1 = x) \\ & = \sum_{x \in S} P(t_i, x) \alpha(x, r_j). \end{align}\]

In the matrix form, if $A$ is an $s \times k$ matrix with $\alpha(t_i, r_j)$ entries, then the above display can be written as:

$\begin{pmatrix} I \\ A \end{pmatrix} = P \begin{pmatrix} I \\ A \end{pmatrix} = \begin{pmatrix} I & 0 \\ S & Q \end{pmatrix} \begin{pmatrix} I \\ A \end{pmatrix}.$ Hence,

\[A = S + QA \Rightarrow A = (I - Q)^{-1}S = MS.\]

Example. Consider a random walk with absorbing boundary on {$0, 1, \dots, 4$}. If we order the states {$0, 4, 1, 2, 3$} so that the recurrent states precede the transient states then

\[P = \left( \begin{array}{cc|ccc} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ \hline 1/2 & 0 & 0 &1/2 & 0 \\ 0 & 0 & 1/2 & 0 & 1/2 \\ 0 & 1/2 & 0 & 1/2 & 0 \end{array} \right)\]

Then,

\[S = \begin{pmatrix} 1/2 & 0 \\ 0 & 0 \\ 0 & 1/2 \end{pmatrix} \quad M = \begin{pmatrix} 3/2 & 1 & 1/2 \\ 1 & 2 & 1 \\ 1/2 & 1 & 3/2 \end{pmatrix} \quad MS = \begin{pmatrix} 3/4 & 1/4 \\ 1/2 & 1/2 \\ 1/4 & 3/4 \end{pmatrix}.\]

Hence, starting at state 1, the probability that the walk is eventually absorbed at state 0 is 3/4.
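To double-check the numbers in this example, here is a minimal sketch in plain Python with exact fractions. The helper names (`inverse`, `matmul`) are mine, not from any library, and the transient states are assumed to be ordered $1, 2, 3$ and the absorbing ones $0, 4$, as in the display above.

```python
from fractions import Fraction as F

def inverse(a):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(a)
    # Augment with the identity: [a | I].
    aug = [list(row) + [F(int(i == j)) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

half = F(1, 2)
# Transient states ordered 1, 2, 3; absorbing states ordered 0, 4.
Q = [[0, half, 0], [half, 0, half], [0, half, 0]]
S = [[half, 0], [0, 0], [0, half]]

I_minus_Q = [[F(int(i == j)) - F(Q[i][j]) for j in range(3)] for i in range(3)]
M = inverse(I_minus_Q)                      # fundamental matrix (I - Q)^{-1}
A = matmul(M, S)                            # absorption probabilities
expected_steps = [sum(row) for row in M]    # expected time to absorption

print(A[0][0])          # probability of absorption at 0 starting from 1
print(expected_steps)   # expected number of steps from states 1, 2, 3
```

The row sums of $M$ give the expected number of steps before absorption from each transient state, which is the "sum $M_{ji}$ over all transient states" recipe from earlier.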

Example (Gambler’s ruin). Consider the random walk with absorbing boundary on {$0, \dots , N$}. Let $\alpha(j) = \alpha(j, N)$ be the probability that the walker starting at state $j$ eventually ends up absorbed in state $N$. The boundary conditions are $\alpha(0) = 0, \alpha(N) = 1$. For state $0 < j < N$ we have:

\[\alpha(j) = (1 - p)\alpha(j - 1) + p \alpha(j+1)\]

This is a linear difference equation, so let’s solve it. We try $\alpha(j) = \lambda^j$ in the equation:

\[\begin{align} &\lambda^j = (1 - p)\lambda^{j - 1} + p \lambda^{j + 1} \Rightarrow \\ &\lambda = (1 - p) + p\lambda^2 \Rightarrow \\ &\lambda = \frac{1/p \pm \sqrt{1/p^2 - 4\frac{1 - p}{p}}}{2} \\ &= \frac{1 \pm \sqrt{1 - 4(1 - p)p}}{2p} \\ & = \frac{1 \pm \sqrt{4p^2 - 4p + 1}}{2p} \\ & = \frac{1 \pm \sqrt{(2p - 1)^2}}{2p} \Rightarrow \\ &\lambda \in \left\{1, \frac{1 - p}{p}\right\}. \end{align}\]

The two roots are distinct when $p \neq 1/2$ and coincide (a double root at $\lambda = 1$) when $p = 1/2$, in which case $j\lambda^j = j$ is the second independent solution. Hence,

\[\alpha(j) = \begin{cases} c_1 + c_2\left(\frac{1 - p}{p}\right)^j & p \neq 1/2 \\ c_1 + jc_2 & p = 1/2 \end{cases}.\]

After applying the boundary conditions:

\[\alpha(j) = \begin{cases} \frac{1 - \left(\frac{1 - p}{p}\right)^j}{1 - \left(\frac{1 - p}{p}\right)^N} & p \neq 1/2 \\ j/N & p = 1/2 \end{cases}.\]
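As a sanity check, the sketch below evaluates the closed form with $\rho = (1 - p)/p$ (the root of $p\lambda^2 - \lambda + (1 - p) = 0$ other than $1$) using exact fractions, so that one can verify the boundary conditions and the one-step recursion directly. The function name and the values $N = 5$, $p = 1/3$ are just illustrative choices.

```python
from fractions import Fraction

def ruin_win_probability(j, N, p):
    """Probability of reaching N before 0 from state j (up-step probability p)."""
    p = Fraction(p)
    if p == Fraction(1, 2):
        return Fraction(j, N)
    rho = (1 - p) / p  # the root of p*x^2 - x + (1 - p) = 0 other than 1
    return (1 - rho**j) / (1 - rho**N)

N, p = 5, Fraction(1, 3)
alpha = [ruin_win_probability(j, N, p) for j in range(N + 1)]

# Residuals of alpha(j) = (1 - p) alpha(j - 1) + p alpha(j + 1); all should be zero.
residuals = [alpha[j] - ((1 - p) * alpha[j - 1] + p * alpha[j + 1]) for j in range(1, N)]
print(alpha)
print(residuals)
```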

Suppose $p = 1/2$, and let $T$ be the time it takes for the random walk to reach $0$ or $N$, and let

\[G(j) = \mathbb{E}[T \mid X_0 = j].\]

$G(0) = 0, G(N) = 0$ and by considering one step we can see that

\[G (j) = 1 + 1/2 G (j - 1) + 1/2 G (j + 1), \quad j = 1,\dots , N - 1.\]

This is an inhomogeneous linear difference equation, and $G(j) = -j^2$ is a particular solution. After adding the general solution of the homogeneous version, we get:

\[G(j) = -j^2 + c_1 + c_2j.\]

After applying the boundary conditions, we get:

\[G(j) = j(N - j).\]
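A quick way to convince ourselves of this closed form is to plug it back into the one-step recursion. The snippet below is a small check of that kind (with $N = 10$ chosen arbitrarily), not a derivation.

```python
from fractions import Fraction

def expected_exit_time(j, N):
    """Closed form G(j) = j (N - j) for the symmetric walk on {0, ..., N}."""
    return j * (N - j)

N = 10
G = [expected_exit_time(j, N) for j in range(N + 1)]

# One-step recursion: G(j) = 1 + G(j - 1) / 2 + G(j + 1) / 2 for 0 < j < N.
residuals = [G[j] - (1 + Fraction(G[j - 1] + G[j + 1], 2)) for j in range(1, N)]
print(G)
print(residuals)
```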

Example (Simple Random Walk on a Circle). Let $N \geq 2$ be an integer. We can consider {$0, 1, \dots , N -1$} to be a “circle” by assuming that $N - 1$ is adjacent to $0$ as well as $N -2$. The invariant probability is the uniform distribution (why?). Assume that $X_0 = 0$ and let $T_k$ denote the first time at which the number of distinct points visited equals $k$. Then $T_N$ is the first time that every point has been visited. By definition $T_1 = 0$, and $T_2 = 1$. We compute $r(k) = \mathbb{E}[T_k - T_{k-1}]$ for $k = 3,\dots , N$; a little thought will show that the value depends only on $k$ and not on $N$. At time $T_{k-1}$ the chain is at a boundary point, so that one of the neighbors of $X_{T_{k - 1}}$ has been visited and the other has not. In the next step we will either visit the new point or we will go to an interior point. If we go to the interior point, the random walk has to continue until it reaches a boundary point and then we start afresh. Label the visited points {$1, \dots, k - 1$}, so that the boundary points are $1$ and $k - 1$. By $G(j) = j(N - j)$ above (applied after shifting the labels so that the boundary points play the roles of $0$ and $N$), the expected time that it takes the random walk from the interior point next to the boundary point (i.e., from $2$) to reach a boundary point is

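\[(2-1) \bigl((k - 1) - 2\bigr) = k - 3.\]

We therefore get the equation

$r(k) = 1/2 + 1/2[1 + (k - 3) + r(k)],$ whose solution is $r(k) = k - 1$ (which also matches $r(2) = T_2 - T_1 = 1$). Therefore,

\[\mathbb{E}[T_N] = \sum_{k = 2}^{N} r(k) = \sum_{k = 2}^{N} (k - 1) = \frac{N(N - 1)}{2}.\]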
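Solving the equation above for $r(k)$ gives $r(k) = k - 1$, so the expected cover time should be $\sum_{k=2}^N (k - 1) = N(N-1)/2$. Here is a rough Monte Carlo sketch to compare against that value. The choices $N = 6$, the number of trials, and the fixed seed are arbitrary assumptions of mine, and the estimate is of course only approximate.

```python
import random

def cover_time(N, rng):
    """Steps until a simple random walk on the N-cycle, started at 0, has visited every point."""
    position, visited, steps = 0, {0}, 0
    while len(visited) < N:
        step = 1 if rng.random() < 0.5 else -1
        position = (position + step) % N  # wrap around the circle
        visited.add(position)
        steps += 1
    return steps

N, trials = 6, 20000
rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = sum(cover_time(N, rng) for _ in range(trials)) / trials
exact = N * (N - 1) / 2  # sum of r(k) = k - 1 for k = 2, ..., N

print(estimate, exact)
```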

Example (Urn Model). Suppose there is an urn with $N$ balls. Each ball is colored either red or green. In each time period, one ball is chosen at random from the urn and with probability $1/2$ is replaced with a ball of the other color; otherwise, the ball is returned to the urn. Let $j$ denote the number of red balls. This chain would tend to keep the number of red balls and green balls about the same. In fact, the invariant probability is given by the binomial distribution

$\pi(j) = {N \choose j}\frac{1}{2^N}.$ Proof.

\[\begin{align} (\pi P)(j) &= \sum_{k = 0}^N\pi(k) P(k, j) \\ & = \pi(j - 1)P(j - 1, j) + \pi(j)P(j, j) + \pi(j + 1)P(j + 1, j) \\ & = \frac{1}{2^N}{N \choose j - 1}\frac{N - (j - 1)}{2N} + \frac{1}{2^N}{N \choose j}\frac{1}{2} + \frac{1}{2^N}{N \choose j + 1}\frac{j + 1}{2N} \\ & = \frac{1}{2^N} \left[{N \choose j - 1}\frac{N - (j - 1)}{2N} + {N \choose j}\frac{1}{2} + {N \choose j + 1}\frac{j + 1}{2N}\right] \end{align}.\]

And,

\[{N \choose j + 1}\frac{j + 1}{2N} = {N \choose j}\frac{N - j}{2N}, \quad {N \choose j - 1}\frac{N - (j - 1)}{2N} = {N \choose j}\frac{j}{2N}.\]

Hence, their sum equals ${N \choose j}\frac{1}{2}$ and we have:

\[(\pi P)(j) = \sum_{k = 0}^N\pi(k) P(k, j) = {N \choose j}\frac{1}{2^N} = \pi(j). \square\]
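The same identity can be checked numerically. The sketch below builds the transition matrix of the urn chain for a small $N$ (I picked $N = 6$) with exact fractions and verifies $\pi P = \pi$; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def urn_transition_matrix(N):
    """P[j][k] for the urn chain, where j counts red balls."""
    P = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    for j in range(N + 1):
        P[j][j] = Fraction(1, 2)                  # the chosen ball is returned unchanged
        if j > 0:
            P[j][j - 1] = Fraction(j, 2 * N)      # a red ball is drawn and swapped
        if j < N:
            P[j][j + 1] = Fraction(N - j, 2 * N)  # a green ball is drawn and swapped
    return P

N = 6
P = urn_transition_matrix(N)
pi = [Fraction(comb(N, j), 2**N) for j in range(N + 1)]
pi_P = [sum(pi[k] * P[k][j] for k in range(N + 1)) for j in range(N + 1)]

print(pi_P == pi)
```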

Example (Cell Genetics). Suppose each cell has $N$ particles each of type $I$ or $II$. Let $j$ be the number of particles of type $I$. In reproduction, it is assumed that the cell duplicates itself and then splits, distributing the particles. After duplication, the cell has $2j$ particles of type $I$ and $2(N - j)$ particles of type $II$. It then selects $N$ of these $2N$ particles for the next cell. By using the hypergeometric distribution we see that this gives rise to transition probabilities

\[P(j, k) = \frac{\binom{2j}{k}{2(N - j) \choose N - k}}{\binom{2N}{N}}.\]

This Markov chain has two absorbing states, $0$ and $N$. Eventually, all cells will have only particles of type $I$ or of type $II$.
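To see the two absorbing states concretely, the following sketch tabulates the hypergeometric transition probabilities for a small cell (here $N = 4$, an arbitrary choice) and checks that each row sums to one. `math.comb` returns $0$ when the lower index exceeds the upper one, which handles the impossible transitions.

```python
from fractions import Fraction
from math import comb

def cell_transition(j, k, N):
    """Hypergeometric probability of k type-I particles among N drawn from 2N."""
    return Fraction(comb(2 * j, k) * comb(2 * (N - j), N - k), comb(2 * N, N))

N = 4
P = [[cell_transition(j, k, N) for k in range(N + 1)] for j in range(N + 1)]

for row in P:
    print([str(x) for x in row])
```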

Example (Card shuffling). Consider a deck of cards numbered $1, \dots, n$. At each time we draw a card at random and place it at the top of the deck. This can be thought of as a Markov chain on $S_n$, the set of permutations of $n$ elements. If $\lambda$ denotes any permutation (one-to-one correspondence of {$1,\dots, n$} with itself), and $\nu_j$ denotes the permutation corresponding to moving the $j$th card to the top of the deck, then the transition probabilities for this chain are given by

\[p(\lambda, \nu_j \lambda) = \frac{1}{n}, \quad j = 1, \dots ,n.\]

This chain is irreducible and aperiodic (we have self-loops). The unique invariant probability is the uniform measure on $S_n$, the measure that assigns probability $1/n!$ to each permutation (how?).
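One way to approach the "(how?)" is to observe that every deck has exactly $n$ predecessors, each reached with probability $1/n$, so the transition matrix is doubly stochastic and the uniform measure is invariant. The sketch below checks this by brute force for $n = 4$; it is only an illustration for a small deck, and `move_to_top` is a helper name I made up.

```python
from fractions import Fraction
from itertools import permutations

def move_to_top(deck, j):
    """Return the deck (a tuple, top card first) after moving position j to the top."""
    return (deck[j],) + deck[:j] + deck[j + 1:]

n = 4
decks = list(permutations(range(1, n + 1)))
index = {deck: i for i, deck in enumerate(decks)}

# Build the transition matrix of the random-to-top shuffle.
P = [[Fraction(0)] * len(decks) for _ in decks]
for deck in decks:
    for j in range(n):
        P[index[deck]][index[move_to_top(deck, j)]] += Fraction(1, n)

uniform = [Fraction(1, len(decks))] * len(decks)
uniform_P = [
    sum(uniform[a] * P[a][b] for a in range(len(decks))) for b in range(len(decks))
]

print(uniform_P == uniform)
```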

A reminder on Student’s t-distribution and p-values

2025-09-14T00:00:00-07:00

Let’s see what Sir Larry Wasserman has to tell us about it.

So we’re in the realm of hypothesis testing [Aaditya Ramdas calls it stochastic proof by contradiction, I
love this phrase]. Suppose we divide our parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$.
We wish to test

$H_0: \theta \in \Theta_0 \quad\text{versus} \quad H_1: \theta \in \Theta_1$.

We call $H_0$ the null hypothesis, which we’d like to reject because it says nothing interesting is going on
(hence the name null). Let $\mathbb{P}_\theta$ be a probability distribution with support $\mathcal{X}$
parameterized by $\theta$. Define a set $\mathcal{R} \subset \mathcal{X}$ called the rejection region.
Let $X \sim \mathbb{P}_\theta(\cdot)$. Then,

\[X \in \mathcal{R} \Rightarrow \text{ reject } H_0, \\ X \notin \mathcal{R} \Rightarrow \text{ retain } H_0.\]

Usually, the rejection region is of the form

\[\mathcal{R} = \{x \in \mathcal{X}: T(x) > c \},\]

where $T$ is a test statistic and $c$ is a critical value. The problem in hypothesis testing is to find an appropriate test statistic $T$ and an appropriate critical value $c$.

P.S.: A test statistic is a single number calculated from sample data that is used to evaluate a hypothesis in statistical analysis. It quantifies the difference between the observed data and what would be expected if the null hypothesis were true. Essentially, it helps determine how compatible your data is with a specific hypothesis.

Definition: The power function of a test with rejection region $\mathcal{R}$ is defined by

\[\beta(\theta) = \mathbb{P}_\theta(X \in \mathcal{R}).\]

The size of a test is defined to be

\[\sup_{\theta \in \Theta_0} \beta(\theta).\]

A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.
Basically, the level $\alpha$ is an upper bound on the probability of rejecting the null hypothesis when it is actually true.

Definition (The Wald test). Consider testing:

\[H_0: \theta = \theta_0 \quad \mathrm{versus} \quad H_1: \theta \neq \theta_0.\]

Assume that under $H_0$, $\hat{\theta}$ is asymptotically normal, i.e., as $n$ tends to infinity, $W = \frac{\hat{\theta} - \theta_0}{\hat{\sigma}}$ tends in distribution to $\mathcal{N}(0, 1)$ due to the central limit theorem, where $\hat{\sigma}$ is the estimated standard error of $\hat{\theta}$. Then, the size $\alpha$ Wald test is: reject $H_0$ when $|W| > z_{\alpha/2}$, where $z_{\alpha/2} = \Phi^{-1}(1 - \frac \alpha2)$ and $\Phi$ is the CDF of $\mathcal{N}(0, 1)$.

The Wald test uses the central limit theorem and is only asymptotically valid. The t-test is instead used when we have a finite sample size, with the caveat that we assume the data come from a Normal distribution. A random variable $T$ has a (Student’s) t-distribution with $k$ degrees of freedom if it has the following density function:

\[f(t_k) = \frac{\Gamma(\frac{k + 1}{2})}{\sqrt{k\pi}\Gamma(\frac{k}{2})(1 + \frac{t^2}{k})^{\frac{k+1}{2}}},\]

where $\Gamma(\alpha) = \int_0^\infty y^{\alpha - 1}e^{-y}dy$ is the gamma function. The t-distribution is equal to the Cauchy distribution $\left(f(x) = \frac{1}{\pi(1 + x^2)}\right)$ when $k=1$, and as $k$ tends to infinity, it tends to the normal distribution.

Now, assume $X_1, X_2 \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$, where $\theta = (\mu, \sigma)$ are both unknown. Suppose we want to test $\mu = \mu_0$ versus $\mu \neq \mu_0$. Let

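$T = \frac{\hat{\mu} - \mu_0}{\hat{\sigma}},$ where $\hat{\mu}$ is the sample mean and $\hat{\sigma}$ is its estimated standard error (the sample standard deviation divided by $\sqrt{n}$). As in the Wald test, for large $n$, $T$ is approximately $\mathcal{N}(0, 1)$ under $H_0$. However, the exact distribution of $T$ under $H_0$ has the $f(t_{n -1})$ density, i.e., the t-distribution with $n - 1$ degrees of freedom. Hence, if we reject when $|T| > F_T^{-1}(1 - \frac \alpha 2)$, where $F_T$ is the CDF of that t-distribution, then we get a size $\alpha$ test.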

To define the $p$-values, unfortunately Larry Wasserman doesn’t transition smoothly and uses $\alpha$ for different things, which confused me. That said, for the following don’t assume that $\alpha$ was defined above.

Definition. Let $X^n = (X_1, X_2, \dots, X_n)$ (a sample of $n$ copies of the random variable $X$). Suppose
that for every $\alpha \in (0, 1)$, we have a size $\alpha$ test with rejection region $\mathcal{R}_\alpha$. Then,

\[p\text{-value} = \inf\{\alpha: T(X^n) \in \mathcal{R}_\alpha\}\]

That is, the $p$-value is the smallest level (probability) at which we can reject $H_0$ [quite the opposite of the definition above that involved $\beta$].
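To make the "smallest level at which we can reject" reading concrete, here is a small sketch for the two-sided Wald test, where the infimum has the closed form $2\Phi(-|w|)$. It uses `statistics.NormalDist` from the standard library; the numbers ($\hat\theta = 0.62$, $\theta_0 = 0.5$, standard error $0.05$) are made up purely for illustration.

```python
from statistics import NormalDist

def wald_p_value(theta_hat, theta_0, se):
    """Two-sided p-value 2 * Phi(-|w|) for the Wald statistic w."""
    w = (theta_hat - theta_0) / se
    return 2 * NormalDist().cdf(-abs(w))

def wald_rejects(theta_hat, theta_0, se, alpha):
    """Size-alpha Wald test: reject when |w| exceeds the upper alpha/2 normal quantile."""
    w = (theta_hat - theta_0) / se
    return abs(w) > NormalDist().inv_cdf(1 - alpha / 2)

# Hypothetical estimate, null value, and standard error.
p = wald_p_value(theta_hat=0.62, theta_0=0.5, se=0.05)

print(p)
print(wald_rejects(0.62, 0.5, 0.05, alpha=0.05))  # alpha above the p-value
print(wald_rejects(0.62, 0.5, 0.05, alpha=0.01))  # alpha below the p-value
```

The test rejects exactly for those $\alpha$ that exceed the $p$-value, which is what the infimum in the definition captures.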

Induction by Contradiction

2025-09-12T00:00:00-07:00

In this post, we wanna prove the correctness of the proof by induction itself. Lol

Axiom (Well-ordering property). The well-ordering property of $\mathbb{N}$ states that if $S \subseteq \mathbb{N}$ and $S \neq \varnothing$, then there exists an $x \in S$ such that $x \leq y$ for all $y \in S$. In other words, there is always a smallest element.

Theorem (Induction). This concept was invented by Pascal in 1665. Let $P (n)$ be a statement depending on $n \in \mathbb{N}$. Assume that

(Base case) $P(1)$ is true and
(Inductive step) if $P(m)$ is true then $P(m + 1)$ is true.

Then, $P(n)$ is true for all $n \in \mathbb{N}$.

Proof.

Let $S = $ {$n \in \mathbb{N} \mid P(n)$ is not true}. We wish to show that $S= \varnothing$. We will prove this by contradiction. When we prove something by contradiction, we assume the conclusion we want is false, and then show that this leads to a false statement. Rules of logic thus imply that the assumption must be false, i.e., the conclusion holds.

Suppose that $S \neq \varnothing$. Then, by the well-ordering property of $\mathbb{N}$, $S$ has a least element $m \in S$. Since $P(1)$ is true by the base case, it must be that $m > 1$. Since $m$ is the least element of $S$, $m - 1 \notin S \Rightarrow P(m-1)$ is true. By the inductive step, this implies that $P(m)$ is true, i.e., $m \notin S$. But $m \in S$ by assumption, which is a contradiction. Thus $S = \varnothing$, and hence $P(n)$ is true for all $n \in \mathbb{N}$.

References

Real Analysis

When is a deterministic optimal policy in MDPs attainable?

2025-09-10T00:00:00-07:00

In non-theory research papers, I often see that the optimal (deterministic Markov) policy for MDPs is defined as

\[\pi^* \in \arg\max_\pi v_\pi(s), \quad \forall s \in S,\]

where for a real-valued function $g$ on set $X$ $\arg\max$ is defined as

\[\arg\max_{x \in X} g(x) := \{x' \in X: g(x') \geq g(x), \: \forall x \in X \}.\]

But if the $\max$ is not attained, this optimal policy is ill-defined. To rectify that, we can define the optimal policy $\pi^*$ as a policy such that

\[v^*(s) := v_{\pi^*}(s) = \sup_\pi v_\pi(s), \quad \forall s \in S.\]

In this post, I wanna dig into Proposition 4.4.3 in Martin L. Puterman’s book. So, here is the question: When is the supremum defined below attainable, i.e., $\sup = \max$?

\[u^*_t(h_t) = \sup_{a \in A_s} \left\{r_t(s_t, a) + \sum_{j \in S}p_t(j \mid s_t, a)u^*_{t + 1}(h_t, a, j)\right\}.\]

In the above display, $u_t^*$ is the optimal history-dependent value function at time step $t$, $h_t = (s_0, a_0, \dots, s_t)$ is the history until time step $t$, $s_t$ is the state at time step $t$, $r_t$ is the deterministic reward function at time step $t$, $p_t$ is the transition dynamics at time step $t$, and $A_s$ is the set of available actions at state $s \in S$.

Let’s first revisit the background we need.

Background

We will take limits, hence we need to make sure we can. For this, we need to define the concepts of completeness and separability. I’ve already defined completeness in this post about Sobolev space, so I only revisit separability here.

A separable metric space

First we need to know what a dense set is.

A dense set

Let $(X, d)$ be a metric space. Let $E \subset X$ be a subset and $\bar{E}$ be its closure. $E$ is dense if $\bar{E} = X$.

A topological space $X$ (for our purposes, one whose open sets are generated by a metric) is separable if it has a countable dense subset, e.g., the rationals in the reals. Or said differently: you can “approximate” any point in the space by points from a countable subset.

Weak convergence

Let {$\mu_n$}$_{n \in \mathbb{N}}$ be a sequence of probability measures on $(X, \mathcal{F})$. We say that $\mu_n$ converges weakly to a probability measure $\mu$ on $(X, \mathcal{F})$ if $\lim_{n \to \infty}\int f d\mu_n = \int f d\mu$ for all bounded continuous functions $f$ on $X$.

Semicontinuous functions (functions that are only allowed to jump in one direction)

First we need to know what an open set is.

An open set

Let $(X, d)$ be a metric space. Given any $x \in X$ and $r > 0$, define the open ball $B(x, r)$ centred at $x$ with radius $r$ to be the set of all $y \in X$ such that $d(x, y) < r$. Given a set $E$, we say $x$ is an interior point of $E$ if there is some open ball centred at $x$ which is contained in $E$. A set is open if every one of its points is an interior point.

Let $X$ be a complete separable metric space and $f$ a real-valued function on $X$. We say that $f$ is upper semicontinuous (u.s.c.) if, for any sequence {$x_n$} of elements of $X$ which converges to $x^*$,

\[\lim\sup_{n \to \infty} f(x_n) \leq f(x^*).\]

For example, the following function is u.s.c. (the only point to check is $x^* = 1$, where $f(1) = 2$ and $\lim\sup_{n \to \infty} f(x_n) \leq 2$ for any $x_n \to 1$):

\[f(x) = \begin{cases}x + 1 & \mathrm{if}\, x \geq 1 \\ x & x < 1 \end{cases}\]

Similarly, $f$ is said to be lower semicontinuous (l.s.c.) whenever $-f$ is u.s.c., or equivalently, $\lim\inf_{n \to \infty} f(x_n) \geq f(x^*).$ A continuous function is both u.s.c. and l.s.c. For example, the following function is l.s.c. (now $f(1) = 1$ and $\lim\inf_{n \to \infty} f(x_n) \geq 1$ for any $x_n \to 1$):

\[f(x) = \begin{cases}x + 1 & x > 1\\ x & x \leq 1\end{cases}\]

Lemma 1. Let $X$ be a complete separable metric space. Then,

If $f \geq 0$ and $g \geq 0$ are u.s.c. on $X$, then $fg$ is u.s.c. on $X$.
If $f,g$ are u.s.c. on $X$, then $f + g$ is u.s.c. on X.
If {$f_n$} is a decreasing sequence of nonpositive u.s.c. functions on $X$, then $\lim_{n \to \infty} f_n$ is u.s.c. on $X$.

Proof. [ChatGPT]

Part (1). One convenient characterization of upper semicontinuity is that: $h$ is u.s.c. iff for every $\alpha \in \mathbb{R}$, the set {$x: h(x) < \alpha$} is open.

We will show that for every $c \in \mathbb{R}$ the set $U_c := $ {$x \in X: f(x)g(x) < c$} is open. If $c \leq 0$, then $U_c = \varnothing$ because $f,g$ are nonnegative, and $\varnothing$ is open. So, fix $c > 0$. We have

\[\{x \in X: f(x)g(x) < c\} = \cup_{r > 0} \{x \in X: f(x) < r\; \mathrm{and} \;g(x) < \frac{c}{r}\}.\]

Since $f,g$ are u.s.c., the sets {$x \in X: f(x) < r$} and {$x \in X: g(x) < \frac{c}{r}$} are open for any $r > 0$, hence their intersection is open, and a union of open sets is open. $\square$

Part (2). A proof can be given similar to Part (1), however the result is also immediate from the definition of u.s.c.

Part (3). Since the sequence is decreasing, $f_1(x) \geq f_2(x) \geq \dots \geq f_n(x) \geq \dots \geq f(x)$, where $f(x) := \lim_{n \to \infty}f_n(x) = \inf_{n}f_n(x)$. Hence, for any real number $c$, $f(x) < c$ holds if and only if $f_n(x) < c$ for some $n$ (and then also for all $m > n$). Therefore,

\[\{x \in X: f(x) < c\} = \cup_{n =1}^\infty \{x \in X: f_n(x) < c\}.\]

The proof is completed by noting that, since each $f_n$ is u.s.c., {$x \in X: f_n(x) < c$} is open, and a union of open sets is open. $\square$

Proposition 1. Suppose $C$ is a compact subset of a complete separable metric space $X$, and $f$ is u.s.c. on $X$. Then there exists an $x^*$ in $C$ such that $f(x^*) \geq f(x)$ for all $x \in C.$

Proof. Let $y^* = \sup_{x \in C}f(x)$, and let {$x_n$} be a sequence in $C$ for which $\lim_{n \to \infty} f(x_n) = y^*$. Since $C$ is compact, there exists a subsequence {$x_{n_k}$} which has a limit $x^* \in C$. Since $f$ is u.s.c., $f(x^*) \geq \lim\sup_{k \to \infty}f(x_{n_k}) = y^*$. On the other hand, $f(x^*) \leq y^*$ by the definition of the supremum. Hence, $f(x^*) = y^*$. $\square$

Proposition 2. Let $X$ be a countable set, $Y$ a complete separable metric space and $q(x, y)$ a bounded nonnegative real-valued function that is l.s.c. in $y$ for each $x \in X$. Let $f(x)$ be a bounded nonpositive real-valued function on $X$ for which $\sum_{x \in X}f(x)$ is finite. Then,

\[h(y) = \sum_{x \in X}f(x)q(x, y)\]

is u.s.c on $Y$.

Proof. Since $f(x) \leq 0$ and $q(x, \cdot)$ is l.s.c., for each $x \in X$ the function $f(x)q(x, y) \leq 0$ is u.s.c. in $y$ (a nonpositive multiple of an l.s.c. function is u.s.c.). Let {$X_n$} be an increasing sequence of finite subsets of $X$ such that $\cup_{n =1}^\infty X_n = X$. Then, by Part (2) of Lemma 1, we have that $h_n(y) := \sum_{x \in X_n} f(x)q(x, y)$ is u.s.c. in $y$ for each $n$. Since $h_n(y)$ is nonpositive and decreasing in $n$, by Part (3) of Lemma 1, $h(y) = \lim_{n \to \infty} h_n(y)$ is u.s.c. $\square$

The following corollary generalizes the previous proposition to nondiscrete sets. Note that a kernel $q(\cdot \mid y)$ on the Borel subsets of $X$ is continuous if $q(\cdot \mid y_n)$ converges weakly to $q(\cdot \mid y)$ whenever {$y_n$} converges to $y$. It means that for every bounded continuous real-valued function $f$ on $X$,

\[\lim_{n \to \infty} \int_X f(x) q(dx \mid y_n) = \int_X f(x)q(dx \mid y).\]

Corollary. Let $X, Y$ be complete separable metric spaces, $f(x)$ a bounded u.s.c. function on $X$, and $q(\cdot \mid y)$ a continuous kernel on the Borel sets of $X$. Then, $h(y) := \int_X f(x) q(dx \mid y)$ is u.s.c. on $Y$.

Main body

Proposition. Assume $S$ is finite or countable, and that

(1) $A_s$ is finite for each $s \in S$, or
(2) $A_s$ is a compact subset of a complete separable metric space, $r_t(s, a)$ is continuous in $a$ for each $s \in S$, there exists an $M < \infty$ for which $|r_t(s, a)| < M$ for all $a \in A_s$ and $s \in S$, and $p_t(j \mid s, a )$ is continuous in $a$ for each $j \in S$ and $s \in S$ and $t = 1,2,\dots , N$, or
(3) $A_s$ is a compact subset of a complete separable metric space, $r_t(s, a)$ is u.s.c. in $a$ for each $s \in S$, there exists an $M < \infty$ for which $|r_t(s, a)| < M$ for all $a \in A_s$ and $s \in S$, and $p_t(j \mid s, a )$ is l.s.c. in $a$ for each $j \in S$ and $s \in S$ and $t = 1,2,\dots , N$.

Then there exists a deterministic Markov policy which is optimal.

Proof.

We need to show that for each state $s$, there exists an action $a' \in A_s$, for which

\[r_t(s, a') + \sum_{j \in S}p_t(j \mid s, a') u^*_{t + 1}(j) = \sup_{a \in A_s} \left\{r_t(s, a) + \sum_{j \in S}p_t(j \mid s, a) u^*_{t + 1}(j)\right\}.\]

If $A_s$ is finite, the supremum is a maximum over finitely many values, so Part (1) is immediate. Now consider the setting of Part (3). Since $|r_t(s, a)| < M$ for all $a \in A_s$, $s \in S$, and $t \in [N]$, the total reward collected over at most $N$ steps is bounded by $NM$. Therefore, for each $t$, $u^*_t(s) - NM \leq 0$. Now, for a fixed state $s$, we apply Proposition 2 with $X = S$, $Y = A_s$, $f(j) = u^*_{t + 1}(j) - NM$, and $q(j, a) = p_t(j \mid s, a)$. Then,

\[\sum_{j \in S}p_t(j \mid s, a)\left[u_{t + 1}^*(j) - NM\right]\]

is u.s.c. in $a$. Since $\sum_{j \in S}p_t(j \mid s, a) = 1$, this expression equals $\sum_{j \in S}p_t(j \mid s, a)u_{t + 1}^*(j) - NM$, from which we can conclude that $\sum_{j \in S}p_t(j \mid s, a)u_{t + 1}^*(j)$ is u.s.c., because shifting by a constant doesn’t affect semicontinuity. By Part (2) of Lemma 1, we can conclude that $r_t(s, a) + \sum_{j \in S}p_t(j \mid s, a)u_{t + 1}^*(j)$ is u.s.c. in $a$ for each state $s$. Hence, by Proposition 1, the supremum over $A_s$ is attained. Part (2) is a special case of Part (3) since continuous functions are both upper and lower semicontinuous. $\square$
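For the finite-action case, the argument is constructive: at every stage and state we can simply pick a maximizing action. Below is a minimal backward-induction sketch under that assumption. The function signature, the zero terminal value, and the two-state toy MDP are all my own illustrative choices, not Puterman's notation.

```python
def backward_induction(states, actions, reward, transition, horizon):
    """Return (values, policy) for a finite-horizon MDP with finite action sets.

    reward(t, s, a) -> float, transition(t, s, a) -> {next_state: probability}.
    """
    values = {s: 0.0 for s in states}  # terminal value taken to be zero (an assumption)
    policy = []
    for t in reversed(range(horizon)):
        new_values, decision = {}, {}
        for s in states:
            # The sup over a finite set is a max, so a maximizing action exists.
            def q(a):
                return reward(t, s, a) + sum(
                    prob * values[j] for j, prob in transition(t, s, a).items()
                )
            best = max(actions[s], key=q)
            decision[s], new_values[s] = best, q(best)
        values = new_values
        policy.insert(0, decision)  # policy[t][s] is the action at time t in state s
    return values, policy

# Toy MDP: "stay" keeps the state and pays s; "move" flips the state and pays 0.
states = [0, 1]
actions = {0: ["stay", "move"], 1: ["stay", "move"]}
reward = lambda t, s, a: float(s) if a == "stay" else 0.0
transition = lambda t, s, a: {s: 1.0} if a == "stay" else {1 - s: 1.0}

values, policy = backward_induction(states, actions, reward, transition, horizon=2)
print(values)
print(policy)
```

The returned policy depends only on the time step and the current state, i.e., it is deterministic and Markov.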

References

Stochastic Approximation Part 1

2025-09-07T00:00:00-07:00

What sparked my curiosity to investigate why $Q$-learning and SARSA converge was the realization that these methods use their estimates of action values at time step $t$ to estimate the action values at time step $t + 1$ in their update targets. This sounded really weird to me, as though the convergence should not happen. So, I dug in and discovered the answer lies in the topic of stochastic approximation. Stochastic approximation is a fairly big topic, hence I cover it in four separate parts. In the last part I’ll eventually turn my attention to Q-Learning and SARSA.

I’ll mention the assumptions required by each proposition inside the corresponding proof, to show where those assumptions are actually needed.

Needed background

In this section I’ll review some concepts needed throughout the document.

Level sets of a function

The $\alpha$-sublevel set of a function $f: \mathbb{R}^n \to \mathbb{R}$ is defined as

\[C_\alpha = \{x \in \mathsf{dom}(f): f(x) \leq \alpha \}.\]

Stationary point of a function

A stationary point of a function is a point where its gradient is zero.

Limit point of a sequence

Let $\left(a_i \right)^\infty_{i=m}$ be a sequence of real numbers, let $x$ be a real number, and let $\varepsilon>0$ be a real number. We say that $x$ is $\varepsilon$-adherent to $\left(a_i \right)^\infty_{i=m}$ if and only if there exists $n \geq m$ such that $a_n$ is $\varepsilon$-close to $x$. We say that $x$ is continually $\varepsilon$-adherent to $\left(a_i \right)^\infty_{i=m}$ if and only if it is $\varepsilon$-adherent to $\left(a_i \right)^\infty_{i=N}$ for every $N \geq m$. We say that $x$ is a limit point of $\left(a_i \right)^\infty_{i=m}$ if and only if it is continually $\varepsilon$-adherent to $\left(a_i \right)^\infty_{i=m}$ for every $\varepsilon > 0$.

Smooth function

A continuously differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ is $\beta$-smooth if

\[f(y) \leq f(x) + \nabla f(x)^\top(y - x) + \frac \beta2 \lVert y - x\rVert^2 \qquad \forall y,x \in \mathbb{R}^n,\]

for some constant $\beta > 0$. The above condition is implied by Lipschitz continuity of the gradient (and is equivalent to it when $f$ is convex), i.e.,

\[\lVert \nabla f(y) - \nabla f(x)\rVert \leq \beta \lVert y - x\rVert, \qquad \forall y,x \in \mathbb{R}^n.\]

This assumption is satisfied, in particular, if $f$ is twice differentiable and all of its second derivatives are bounded globally by some constant.

Proof. Assume $\lVert \nabla^2 f(x)\rVert \leq M$ for all $x$; then by the mean value theorem we have $\lVert \nabla f(x) - \nabla f(y) \rVert \leq M \lVert x - y\rVert. \square$

Proof that Lipschitz gradients imply the quadratic bound. Fix two vectors $r$ and $z$, let $\xi$ be a scalar parameter, and let $g(\xi) = f(r + \xi z)$. The chain rule yields $\frac{d}{d\xi}g(\xi) = z^\top\nabla f(r + \xi z)$. We have

\[\begin{align*} f(r + z) - f(r) & = g(1) - g(0) = \int_0^1\frac{d}{d\xi}g(\xi)d\xi = \int_0^1\ z^\top\nabla f(r + \xi z) d\xi \\ & \leq \int_0^1\ z^\top\nabla f(r) d\xi + \left\lvert \int_0^1\ z^\top\Bigl(\nabla f(r + \xi z) - \nabla f(r)\Bigr) d\xi \right\rvert \\ & \leq z^\top\nabla f(r) + \int_0^1\ \lVert z \rVert \cdot \lVert\nabla f(r + \xi z) - \nabla f(r)\rVert d\xi. \end{align*}\]

Now assume there exists a $\beta \geq 0$ such that $\lVert \nabla f(x_1) - \nabla f(x_2)\rVert \leq \beta \lVert x_1 - x_2\rVert$ for all $x_1, x_2$ in the domain of $f$. Then,

\[\begin{align*} f(r + z) - f(r) & \leq z^\top\nabla f(r) + \int_0^1\ \lVert z \rVert \cdot \lVert\nabla f(r + \xi z) - \nabla f(r)\rVert d\xi \\ & \leq z^\top\nabla f(r) + \lVert z \rVert \int_0^1\ \beta\xi\lVert z \rVert d\xi \\ & = z^\top\nabla f(r) + \frac \beta2 \lVert z \rVert^2. \end{align*}\]

By replacing $r = x$ and $r + z = y$ the proof is completed. $\square$
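As a numerical illustration (not a proof), the sketch below takes the softplus function $f(x) = \log(1 + e^x)$, whose derivative is the sigmoid and whose second derivative is at most $1/4$, and checks both the Lipschitz condition on the gradient and the quadratic upper bound with $\beta = 1/4$ on a grid. The choice of function and grid is mine.

```python
import math

def f(x):
    return math.log1p(math.exp(x))      # softplus

def grad_f(x):
    return 1.0 / (1.0 + math.exp(-x))   # sigmoid, whose derivative is at most 1/4

beta = 0.25
grid = [i / 4 for i in range(-20, 21)]

# Smallest slack in f(y) <= f(x) + f'(x)(y - x) + (beta / 2)(y - x)^2 over the grid.
worst_gap = min(
    f(x) + grad_f(x) * (y - x) + 0.5 * beta * (y - x) ** 2 - f(y)
    for x in grid for y in grid
)

# Largest observed ratio |f'(y) - f'(x)| / |y - x| over the grid.
max_slope = max(
    abs(grad_f(y) - grad_f(x)) / abs(y - x)
    for x in grid for y in grid if x != y
)

print(worst_gap, max_slope)
```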

Filtrations

Given a measurable space $(\Omega, \mathcal{F})$, a filtration is a sequence $\left(\mathcal{F}_t\right)^n_{t = 0}$ of sub-$\sigma$-algebras of $\mathcal{F}$, where $\mathcal{F}_t \subseteq \mathcal{F}_{t + 1}$ for all $t < n$, $\mathcal{F}_n \subseteq \mathcal{F}$, and $\mathcal{F}_0 = $ {$\varnothing, \Omega$} (note that the set of $\mathcal{F}_0$-measurable functions is the set of constant functions on $\Omega$). A sequence of random variables $(X_t)^n_{t = 1}$ is adapted to the filtration $\mathbb{F} = \left(\mathcal{F}_t\right)^n_{t = 0}$ if $X_t$ is $\mathcal{F}_t$-measurable for each $t$. We also say in this case that $(X_t)_t$ is $\mathbb{F}$-adapted.

(Super) Martingale difference sequence

An $\mathbb{F}$-adapted sequence of random variables $(X_t)_{t \in \mathbb{N}}$ is an $\mathbb{F}$-adapted martingale if

$X_t$ is integrable, i.e., $\mathbb{E}[|X_t|] < \infty$, for each $t$.
$\mathbb{E}[X_{t} \mid \mathcal{F}_{t - 1}] = X_{t - 1}$ almost surely for each $t \in$ {$2, 3, \dots$}.

If the equality in the second point is replaced with $\leq$, then we call $(X_t)_t$ a supermartingale.

Martingale convergence theorem

Let $X_t, t = 0, 1, 2, \dots$, be a sequence of random variables and let $\mathcal{F}_t, t = 0, 1,2 \dots$, be sets of random variables such that $\mathcal{F}_t \subseteq \mathcal{F}_{t + 1}$ for all $t$. Suppose that:

$X_t$ is $\mathcal{F}_t$-measurable.
For each $t$, we have $\mathbb{E}[X_{t + 1} \mid \mathcal{F}_t] = X_t$.
There exists a constant $M$ such that $\mathbb{E}[|X_t|] \leq M$ for all $t$.

Then $X_t$ converges to a random variable $X_\infty$ almost surely.

The proof essentially comes from the fact that the bound $\mathbb{E}[|X_t|] \leq M$ plays the role of an upper/lower bound that makes any sub/supermartingale converge. Since a martingale is both a submartingale and a supermartingale, it converges too. An example of this style of proof is given in the supermartingale convergence section.

Supermartingale convergence theorem

Let $Y_t, X_t,\, \mathrm{and}\, Z_t, t = 0, 1, 2, \dots,$ be three sequences of random variables and let $\mathcal{F}_t, t = 0, 1, 2, \dots,$ be sets of random variables such that $\mathcal{F}_t \subseteq \mathcal{F}_{t+1}$ for all $t$. Suppose that:

The random variables $Y_t, X_t,\, \mathrm{and}\, Z_t$ are nonnegative, and are functions of the random variables in $\mathcal{F}_t$.
For each $t$, we have $\mathbb{E}[Y_{t + 1} \mid \mathcal{F}_{t}] \leq Y_t - X_t + Z_t$.
There holds $\sum_{t = 0}^\infty Z_t < \infty$.

Then, we have $\sum_{t = 0}^\infty X_t < \infty$, and the sequence $Y_t$ converges to a nonnegative random variable $Y$, with probability (w.p.) 1.

Proof (Gemini). Define the following $\mathcal{F}_t$ measurable process:

\[U_t := Y_t + \sum_{i = 0}^{t - 1}(X_i - Z_i).\]

We show that {$U_t$} is a supermartingale. Since $X_i$ and $Z_i$ are $\mathcal{F}_t$-measurable for $i \leq t$,

\[\begin{align*} \mathbb{E}[U_{t + 1} \mid \mathcal{F}_t] &= \mathbb{E}[Y_{t + 1} \mid \mathcal{F}_t] + \sum_{i = 0}^{t}(X_i - Z_i) \leq Y_t +Z_t - X_t + \sum_{i = 0}^{t}(X_i - Z_i) \\ & = Y_t + \sum_{i = 0}^{t - 1}(X_i - Z_i) = U_t. \end{align*}\]

We now show that {$U_t$} is bounded below:

\[U_t := Y_t + \sum_{i = 0}^{t - 1}(X_i - Z_i) \geq 0 + 0 -\sum_{i = 0}^{t - 1}Z_i \geq -\sum_{i = 0}^{\infty}Z_i\]

Since $S_\infty := \sum_{i = 0}^{\infty}Z_i$ is finite by assumption, $U_t \geq -S_\infty > -\infty$. Since {$U_t$} is a supermartingale that is bounded below, the classical convergence theorem for supermartingales bounded below (due to Doob) implies that $U_t$ must converge to a finite limit w.p. 1. Let’s call this limit $U$:

\[\lim_{t\to \infty} U_t = U < \infty.\]

Substituting the definition of $U_t$:

$\lim_{t \to \infty} \left( Y_t + \sum_{i = 0}^{t - 1}(X_i - Z_i)\right) = U,$ which means

\[\lim_{t \to \infty} \left( Y_t + \sum_{i = 0}^{t - 1}X_i\right) = U + S_\infty.\]

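The partial sums $\sum_{i = 0}^{t - 1}X_i$ are nondecreasing in $t$ (because $X_i \geq 0$) and, since $Y_t \geq 0$, they are bounded above by the convergent sequence on the left-hand side. Hence they converge, i.e., $\sum_{i = 0}^{\infty}X_i < \infty$, which also means $X_t \to 0$. Consequently, $Y_t$ converges as well, being the difference of two convergent sequences: $\lim_{t \to \infty} Y_t = Y_\infty < \infty$. $\square$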

Mathematical optimization

A mathematical optimization problem is finding the maximum/minimum of a real-valued function.

Iterative algorithms

Iterative algorithms solve problems by moving towards the solution one iteration at a time under two conditions:

The algorithm is loop invariant meaning that the found solution up until iteration $t$ that is not necessarily the terminal iteration is correct.
The algorithm terminates with probability one.

(Iterative) Stochastic approximation

Optimization problems are often solved by the means of the iterative algorithms. (Iterative) Stochastic approximation algorithms are algorithms that perform the optimization iteratively even in the presence of noise in the available information.

Concretely, let $H: \mathbb{R}^n \to \mathbb{R}^n$ be an operator and $r \in \mathbb{R}^n$ a real-valued vector. Then, we wanna solve

\[\begin{equation*} Hr = r. \end{equation*}\]

One possible approach is the iteration $r := Hr$, or more generally a convex combination of $r$ and $Hr$, i.e., $r:= (1 - \gamma)r + \gamma Hr$ for $\gamma \in (0, 1)$; both iterations leave $r$ unchanged exactly when $Hr = r$. An interesting case is the optimization problem with

\[\begin{equation*} Hr = r - \nabla f(r) \end{equation*},\]

where the solution $Hr = r$ mandates that $\nabla f(r) = 0$, which means $r$ is a stationary point of $f$ [Spoiler: the fixed point of $H$ is the minimizer of $f$ when, e.g., $f$ is convex].

In general, we might not have direct access to $Hr$, but instead some noisy corrupted measurement of it. Then our optimization problem takes the following form, where $w$ denotes the random noise

\[\begin{equation*} r := (1 - \gamma) r + \gamma (Hr + w) \end{equation*}.\]

More explicitly, suppose $Hr = \mathbb{E}[g(r, v)]$, where $v$ is a random variable with a known distribution $v \sim P(\cdot \mid r)$ and $g$ is a known function. We would like to run the iteration:

\[r := (1 - \gamma)r + \gamma \mathbb{E}[g(r, v)],\]

but computing $\mathbb{E}[g(r, v)]$ is generally intractable. Since $P(\cdot \mid r)$ is known, through simulation, we can obtain $k$ samples use the following target:

\[Hr = \frac 1k \sum_{i = 1}^k g(r, v_i),\]

and as $k$ gets large, the law of large numbers gives the confidence that we’ll find the right answer. On the other extreme, we can set $k$ to one and use a single sample, an approach called the Robbins-Monro algorithm, and get

\[r := (1-\gamma)r + \gamma g(r, v_1).\]

In this case we have

\[r := (1 - \gamma)r +\gamma \left(\mathbb{E}[g(r, v)] + g(r, v_1) - \mathbb{E}[g(r, v)] \right) = (1 - \gamma)r +\gamma \left(Hr + w \right),\]

where $w = g(r, v_1) - \mathbb{E}[g(r, v)]$ is the zero mean noise term.
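Here is a small sketch of the single-sample (Robbins-Monro) iteration in plain Python. The choices $g(r, v) = 0.5r + 1 + v$ with $v \sim \mathcal{N}(0, 1)$ (so that $Hr = 0.5r + 1$ and the fixed point is $2$) and the decaying stepsize $\gamma_t = 1/(t+1)$ are assumptions made only for this illustration.

```python
import random


def robbins_monro(g, sample_v, r0, num_iters, seed=0):
    """Run r := (1 - gamma_t) * r + gamma_t * g(r, v_t) with one sample per step."""
    rng = random.Random(seed)
    r = r0
    for t in range(num_iters):
        gamma = 1.0 / (t + 1)  # assumed stepsize schedule for this sketch
        v = sample_v(rng)      # a single noisy sample
        r = (1 - gamma) * r + gamma * g(r, v)
    return r


# Hypothetical problem: Hr = E[g(r, v)] = 0.5 * r + 1, fixed point r* = 2.
r_hat = robbins_monro(
    g=lambda r, v: 0.5 * r + 1.0 + v,
    sample_v=lambda rng: rng.gauss(0.0, 1.0),
    r0=10.0,
    num_iters=100_000,
)
print(r_hat)  # should land close to 2
```

Even though every update uses a single noisy sample, the shrinking stepsize averages the noise out over time.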

So, in general we’re looking for the fixed point of $H$ and we use the following iterative algorithm

\[r_{t + 1}(i) = (1 - \gamma_t(i))r_t(i) + \gamma_t(i)((Hr_t)(i) + w_t(i)).\]

Note that we wrote the update equation for each component of $r$, and we made the stepsize $\gamma$ dependent on the iteration. The reason for making the stepsize depend on the iteration is to make sure our iterative algorithm eventually converges to the fixed point, since a fixed stepsize doesn’t necessarily give us this guarantee. Specifically, the stepsize should meet the following two conditions, known as the Robbins-Monro conditions

\[\text{A)} \sum_{t = 0}^\infty \gamma_t(i) = \infty, \qquad \text{B) }\sum_{t = 0}^\infty \gamma^2_t(i) < \infty, \quad \forall i.\]

The first condition says that the stepsizes should be big enough so we can make progress; since the sum runs over all iterations, it also mandates that every component is updated infinitely often. The second condition says that the stepsizes should be small enough so we can converge (even in the presence of noise). Let’s see why these two conditions are necessary:

Why do we need $\sum_{t = 0}^\infty \gamma_t(i) = \infty$?

The reason is that if we compute how much we have progressed from the initial iteration

\[|r_t(i) - r_0(i)| \stackrel{\text{triangle inequality}}{\leq} \sum_{\tau = 0}^{t - 1} \gamma_\tau(i) |(Hr_\tau)(i) + w_\tau(i) - r_\tau(i)|,\]

and if the magnitude of the updates $|(Hr_\tau)(i) + w_\tau(i) - r_\tau(i)|$ is bounded and $\sum_{t = 0}^\infty \gamma_t(i) = A < \infty$, then the algorithm will be confined within a fixed radius of $r_0$, and if the desired solution is outside of that radius, we’ll never succeed in reaching it.
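A tiny noise-free experiment shows the "confined radius" effect. As an assumption for this sketch, take $Hr = \mu$ (a constant), start from $r_0 = 0$, and compare the summable stepsize $\gamma_t = 1/(t+2)^2$ with the non-summable $\gamma_t = 1/(t+2)$. With the summable choice, $r_t = \mu\left(1 - \prod_{\tau < t}(1 - \gamma_\tau)\right)$ and the product tends to $1/2$, so the iterate stalls around $\mu/2$.

```python
def run(mu, stepsize, num_iters):
    """Noise-free iteration r := (1 - gamma_t) * r + gamma_t * mu from r_0 = 0."""
    r = 0.0
    for t in range(num_iters):
        gamma = stepsize(t)
        r = (1 - gamma) * r + gamma * mu
    return r


mu = 10.0
# sum of stepsizes is finite: the iterate gets stuck halfway
r_summable = run(mu, lambda t: 1.0 / (t + 2) ** 2, 10_000)
# sum of stepsizes is infinite: the iterate reaches mu
r_harmonic = run(mu, lambda t: 1.0 / (t + 2), 10_000)
print(r_summable, r_harmonic)
```

The first run never gets past roughly $5$, no matter how many iterations we add, while the second approaches $10$.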

Why do we need $\sum_{t = 0}^\infty \gamma^2_t(i) < \infty$?

Suppose we want to apply the Robbins-Monro algorithm to a sequence of i.i.d. random variables $v_t$ with an unknown mean $\mu$ and a known variance of one.

\[r_{t + 1} = (1 - \gamma_t)r_t + \gamma_t v_t = (1 - \gamma_t)r_t + \gamma_t \mu + \gamma_t (v_t - \mu).\]

If we want the above iteration to converge to $\mu$, we need to make sure that the accumulated variance of the noise terms stays bounded. So, we need $\sum_{t = 0}^\infty \mathbb{E}\left[\left(\gamma_t (v_t - \mu)\right)^2\right] < \infty$ (a sum of infinitely many nonnegative terms can only be finite if the terms go to zero fast enough). Since the variance of $v_t$ is one,

\[\sum_{t = 0}^\infty \mathbb{E}\left[\left(\gamma_t (v_t - \mu)\right)^2\right] = \sum_{t = 0}^\infty \gamma_t^2.\]

So, the only way that the total variance is bounded is that $\sum_{t = 0}^\infty \gamma^2_t < \infty$.
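To see the effect of the second condition numerically, the sketch below estimates a mean with a constant stepsize (which violates $\sum_t \gamma_t^2 < \infty$) and with $\gamma_t = 1/(t+1)$, and compares the spread of the final iterate across independent runs. The values $\mu = 3$, the Gaussian samples, and the run counts are arbitrary choices for illustration.

```python
import random
import statistics


def final_iterates(stepsize, num_iters, num_runs, mu=3.0, seed=0):
    """Return the last iterate of r := (1 - gamma_t) r + gamma_t v_t for several runs."""
    rng = random.Random(seed)
    finals = []
    for _ in range(num_runs):
        r = 0.0
        for t in range(num_iters):
            gamma = stepsize(t)
            v = rng.gauss(mu, 1.0)  # i.i.d. sample with mean mu, variance one
            r = (1 - gamma) * r + gamma * v
        finals.append(r)
    return finals


const = final_iterates(lambda t: 0.5, num_iters=2000, num_runs=200)
decay = final_iterates(lambda t: 1.0 / (t + 1), num_iters=2000, num_runs=200)
var_const = statistics.pvariance(const)
var_decay = statistics.pvariance(decay)
print(var_const, var_decay)
```

With the constant stepsize the iterate keeps jittering around $\mu$ with a variance that does not shrink, whereas with $\gamma_t = 1/(t+1)$ the iterate is exactly the running sample mean and its variance decays like $1/t$.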

Throughout, we implicitly assume that the stepsize sequence meets the Robbins-Monro conditions, and I won’t mention this explicitly again. The stepsize can also be a random variable, which is the case in reinforcement learning for example; I’ll touch upon this later.

There are three paradigms of optimization problems that can be solved by stochastic approximation. I’ll dig into each of them separately. Instead of listing the required assumptions at the beginning, I’ll introduce them during the proofs to show where they’re needed (except the Robbins-Monro conditions on the stepsize, which I assume are met throughout).

Convergence under a smooth Lyapunov or potential function

One way of establishing convergence to the fixed point $r^*$ is to introduce a Lyapunov function, in other words a potential function, that acts as a distance such that $f(r_{t +1}) < f(r_t)$ whenever $r_t \neq r^*$. Since noise is involved, instead of requiring $f(r_{t +1}) < f(r_t)$, it’s more appropriate to require that the expected direction of the update is a direction of decrease of $f$.

In this section our algorithm is of the form

\[r_{t+1} = r_t + \gamma_t s_t,\]

and for simplicity we assume a universal stepsize sequence for all components of $r$. Here $\lVert \cdot \rVert$ represents the Euclidean norm, i.e., $\lVert r \rVert^2 = r^\top r$, and $\mathcal{F}_t$ represents the history ($\sigma$-algebra) of the algorithm until time $t$ as

$\mathcal{F}_t = \{r_0, \dots, r_t, s_0, \dots, s_{t - 1}, \gamma_0, \dots \gamma_t \}.$ Note that only the sequence of updates until time $t-1$ is included in $\mathcal{F}_t$. I’ll mention two examples of problems that fit into this paradigm.

Stochastic gradient algorithm

In this example we have that

\[r_{t + 1} = r_{t} - \gamma_t (\nabla f(r_t) + w_t),\]

where $s_t = -\nabla f(r_t) - w_t$, the potential function is $f$, and the expected direction of the update, assuming that $\mathbb{E}[w_t \mid \mathcal{F}_t] = 0$, is $\mathbb{E}[s_t \mid \mathcal{F}_t] = -\nabla f(r_t).$
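As a quick illustration, here is a sketch of the stochastic gradient iteration on the quadratic potential $f(r) = \frac 12 \lVert r - a \rVert^2$, whose gradient is $r - a$. The target `a`, the Gaussian noise, and the stepsize $\gamma_t = 1/(t+1)$ are assumptions chosen for this example.

```python
import random


def sgd(grad, r0, num_iters, noise_std=1.0, seed=0):
    """Run r_{t+1} = r_t - gamma_t * (grad f(r_t) + w_t) with zero-mean noise w_t."""
    rng = random.Random(seed)
    r = list(r0)
    for t in range(num_iters):
        gamma = 1.0 / (t + 1)
        g = grad(r)
        # update direction s_t = -(grad f(r_t) + w_t)
        r = [ri - gamma * (gi + rng.gauss(0.0, noise_std)) for ri, gi in zip(r, g)]
    return r


a = [1.0, -2.0]  # hypothetical minimizer of f


def grad_f(r):
    return [ri - ai for ri, ai in zip(r, a)]


r_sgd = sgd(grad_f, r0=[5.0, 5.0], num_iters=50_000)
print(r_sgd)  # should be close to [1, -2]
```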

Euclidean norm pseudo-contractions

Recall the Robbins-Monro algorithm, $r_{t + 1} = (1 - \gamma_t)r_t + \gamma_t(Hr_t + w_t),$ where $H$ is a pseudo-contraction with respect to (w.r.t.) the Euclidean norm, i.e., $\lVert Hr - r^*\rVert \leq \beta \lVert r - r^* \rVert, \beta \in [0, 1), \: \mathrm{and} \: r^* \, \text{is the fixed point of } H.$ In this algorithm $f(r) = \frac 12 \lVert r - r^* \rVert^2$ is the potential function, $\nabla f(r) = r - r^*$, and $s_t = Hr_t - r_t + w_t$ is the update direction. Now, we can verify that the expected direction of the update $s_t$ is a direction of decrease of $f$:

Using the Cauchy-Schwarz inequality (Hölder’s inequality with $p = q = 2$) and the fact that $H$ is a pseudo-contraction w.r.t. the Euclidean norm, we have

\[(Hr - r^*)^\top(r - r^*) \leq \lVert Hr - r^*\rVert \cdot \lVert r - r^* \rVert \leq \beta \lVert r - r^* \rVert^2.\]

Subtract $(r - r^*)^\top(r - r^*)$ from both sides, and we get

\[(Hr - r)^\top(r - r^*) \leq -(1 - \beta) \lVert r - r^* \rVert^2.\]

With $r = r_t$, the inequality can be rewritten as

\[\mathbb{E}[s_t \mid \mathcal{F}_t]^\top\nabla f(r_t) \leq -(1 - \beta)\lVert \nabla f(r_t) \rVert^2,\]

which means that, whenever $r_t \neq r^*$, $\mathbb{E}[s_t \mid \mathcal{F}_t]$ and $\nabla f(r_t)$ are not orthogonal and form an obtuse angle, i.e., the expected update points in a direction of decrease of $f$.
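The inequality $(Hr - r)^\top(r - r^*) \leq -(1 - \beta)\lVert r - r^* \rVert^2$ can be checked numerically on a toy operator. As an assumption for this sketch, take $Hr = r^* + \beta R_\theta (r - r^*)$ in two dimensions, where $R_\theta$ is a rotation; this satisfies $\lVert Hr - r^* \rVert = \beta \lVert r - r^* \rVert$.

```python
import math
import random

BETA = 0.8            # contraction factor (assumed)
THETA = math.pi / 3   # rotation angle (assumed)
R_STAR = (1.0, -1.0)  # fixed point (assumed)


def H(r):
    """Rotate r - r* by THETA, shrink by BETA, and shift back by r*."""
    dx, dy = r[0] - R_STAR[0], r[1] - R_STAR[1]
    rx = math.cos(THETA) * dx - math.sin(THETA) * dy
    ry = math.sin(THETA) * dx + math.cos(THETA) * dy
    return (R_STAR[0] + BETA * rx, R_STAR[1] + BETA * ry)


def descent_gap(r):
    """Return (Hr - r)^T (r - r*) + (1 - beta) * ||r - r*||^2, which should be <= 0."""
    Hr = H(r)
    d = (r[0] - R_STAR[0], r[1] - R_STAR[1])
    inner = (Hr[0] - r[0]) * d[0] + (Hr[1] - r[1]) * d[1]
    return inner + (1 - BETA) * (d[0] ** 2 + d[1] ** 2)


rng = random.Random(0)
gaps = [descent_gap((rng.uniform(-5, 5), rng.uniform(-5, 5))) for _ in range(1000)]
print(max(gaps))  # never positive, up to floating-point error
```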

Proposition. Consider the algorithm $r_{t + 1} = r_t + \gamma_ts_t,$ with the potential function $f: \mathbb{R}^n \to \mathbb{R}$. Under certain assumptions that will be stated in the proof, the following holds with probability one:

a): The sequence $f(r_t)$ converges.

b): We have $\lim_{t \to \infty} \nabla f(r_t) = 0$.

c): Every limit point of $r_t$ is a stationary point of $f$.

Note that the above proposition says nothing about the convergence or the boundedness of the sequence $r_t$. However, if $f$ has bounded level sets, part (a) implies that the sequence $r_t$ is bounded. Moreover, if in addition $f$ has a unique stationary point $r^*$, part (c) implies that $r^*$ is the only limit point of $r_t$, and hence $r_t$ converges to $r^*$.

Proof.

We need our first assumption to begin the proof.

One: We need to assume $f$ is $L$-smooth.

Hence,

\[f(\bar r) \leq f(r) + \nabla f(r)^\top(\bar r - r) + \frac L2 \Vert \bar r - r \Vert^2.\]

By replacing $r = r_t$ and $\bar r = r_{t + 1} = r_t + \gamma_t s_t$ we have

\[f(r_{t + 1}) \leq f(r_t) + \gamma_t \nabla f(r_t)^\top s_t + \gamma_t^2 \frac L2 \Vert s_t \Vert^2.\]

Now we need the next two assumptions. We want the magnitude of the update to be comparable to the gradient of $f$, and we want the expected direction of the update to never become orthogonal to the gradient of $f$.

Two: There exist positive constants $K_1, K_2$ such that

\[\mathbb{E}\left[\Vert s_t \Vert^2 \mid \mathcal{F}_t\right] \leq K_1 + K_2 \Vert \nabla f(r_t) \Vert^2, \: \forall t.\]

Note that $s_t$ is allowed to be nonzero [because of the noise] even if $\nabla f(r_t)$ is zero.

Three: There exists a positive constant $c$ such that

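\[c\Vert \nabla f(r_t) \Vert^2 \leq -\nabla f(r_t)^\top \mathbb{E}[s_t \mid \mathcal{F}_t], \: \forall t.\]

Taking the conditional expectation of the smoothness inequality given $\mathcal{F}_t$ and applying assumptions two and three, we have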

\[\begin{align*} \mathbb{E}[f(r_{t + 1}) \mid \mathcal{F}_t] & \leq f(r_t) - c\gamma_t \Vert \nabla f(r_t) \Vert^2 + \gamma_t^2 \frac L2 \left(K_1 + K_2 \Vert\nabla f(r_t)\Vert^2\right) \\ & = f(r_t) - \gamma_t \left( c - \frac{LK_2\gamma_t}{2} \right)\Vert \nabla f(r_t) \Vert^2 + \frac{LK_1\gamma_t^2}{2}. \end{align*}\]
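The one-step bound can be checked exactly on a toy problem. As an assumption for this sketch, take the scalar potential $f(r) = \frac 12 r^2$ (so $L = 1$) and the update direction $s = -(r + w)$ with $w = \pm\sigma$ equally likely. Then $\mathbb{E}[s^2] = r^2 + \sigma^2$, so we can pick $K_1 = \sigma^2$, $K_2 = 1$, and $c = 1$.

```python
def expected_next_f(r, gamma, sigma):
    """Exact E[f(r + gamma * s)] for f(r) = 0.5 r^2, s = -(r + w), w = +/- sigma."""
    values = []
    for w in (-sigma, sigma):
        r_next = r - gamma * (r + w)
        values.append(0.5 * r_next ** 2)
    return sum(values) / 2


def one_step_bound(r, gamma, sigma):
    """Right-hand side of the bound with L = 1, c = 1, K1 = sigma^2, K2 = 1."""
    L, c, K1, K2 = 1.0, 1.0, sigma ** 2, 1.0
    grad_sq = r ** 2  # ||grad f(r)||^2
    return 0.5 * r ** 2 - gamma * (c - L * K2 * gamma / 2) * grad_sq + L * K1 * gamma ** 2 / 2


checks = [
    (expected_next_f(r, g, s), one_step_bound(r, g, s))
    for r in (-3.0, 0.0, 0.5, 4.0)
    for g in (0.01, 0.3, 1.0, 2.5)
    for s in (0.0, 1.0, 2.0)
]
print(all(lhs <= rhs + 1e-9 for lhs, rhs in checks))
```

For this particular quadratic the two sides coincide, so the bound is tight here; for a general $L$-smooth $f$ it is only an inequality.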

We want to draw a conclusion about the convergence of $f(r_t)$, and we already have a relationship between $\mathbb{E}[f(r_{t + 1}) \mid \mathcal{F}_t]$ and $f(r_t)$ that is reminiscent of a supermartingale. To exploit this relationship through a supermartingale convergence theorem, define

\[\begin{align*} X_t &= \begin{cases} \gamma_t \left( c - \frac{LK_2\gamma_t}{2} \right)\Vert \nabla f(r_t) \Vert^2, & \mathrm{if}\; LK_2\gamma_t \leq 2c, \\ 0, & \mathrm{otherwise}, \end{cases}\\ &\mathrm{and}\\ Z_t &= \begin{cases} \frac{LK_1\gamma_t^2}{2}, & \mathrm{if}\; LK_2\gamma_t \leq 2c, \\ \frac{LK_1\gamma_t^2}{2} - \gamma_t \left( c - \frac{LK_2\gamma_t}{2} \right)\Vert \nabla f(r_t) \Vert^2, & \mathrm{otherwise}. \end{cases} \end{align*}\]

Note that $X_t$ and $Z_t$ are nonnegative and $\mathcal{F}_t$-measurable, and that $\mathbb{E}[f(r_{t + 1}) \mid \mathcal{F}_t] \leq f(r_t) - X_t + Z_t$. Using the assumption $\sum_{t=0}^\infty \gamma_t^2 < \infty$, $\gamma_t$ converges to zero and there exists some finite time after which $LK_2\gamma_t \leq 2c$. Hence, after some finite time we have $Z_t = \frac{LK_1\gamma_t^2}{2}$ and therefore $\sum_{t = 0}^\infty Z_t < \infty$. To use the positive supermartingale convergence theorem, we introduce the next assumption we need.

Four: $f(r) \geq 0, \forall r \in \mathbb{R}^n$.

Now, the positive supermartingale convergence theorem applies: $f(r_t)$ converges and $\sum_t X_t < \infty$. Since $\gamma_t$ converges to zero, we have $LK_2\gamma_t \leq c$ after some finite time, and

\[X_t = \gamma_t \left( c - \frac{LK_2\gamma_t}{2} \right)\Vert \nabla f(r_t) \Vert^2 \geq \frac c2 \gamma_t\lVert \nabla f(r_t)\rVert^2.\]

Hence,

\[\sum_{t = 0}^\infty \gamma_t\lVert \nabla f(r_t)\rVert^2 < \infty.\]

If $\lVert \nabla f(r_t)\rVert$ stayed above some positive constant from some time onward, the condition $\sum_{t =0}^\infty \gamma_t = \infty$ would contradict the finite sum above. Hence, it must be the case that $\lVert \nabla f(r_t)\rVert$ gets arbitrarily close to zero infinitely often, i.e., $\liminf_{t \to \infty} \lVert \nabla f(r_t)\rVert=0$. Why $\liminf$ and not simply $\lim$? Because of the noise, $\lVert \nabla f(r_t)\rVert$ fluctuates, so we still need to prove that its fluctuations die out, so that actually $\lim_{t \to \infty}\lVert \nabla f(r_t)\rVert = 0$.

To prove that $\lVert \nabla f(r_t)\rVert$ doesn’t keep oscillating, we now show that it has finitely many upcrossings.

Fix a positive constant $\epsilon$. We say that the interval $\{t, t+1, \dots, \bar{t}\}$ is an upcrossing interval from $\epsilon/2$ to $\epsilon$ if

\[\Vert \nabla f(r_t) \Vert < \frac{\epsilon}{2}, \quad \Vert \nabla f(r_{\bar{t}}) \Vert > \epsilon,\]

and

\[\frac{\epsilon}{2} \leq \Vert \nabla f(r_\tau) \Vert \leq \epsilon, \quad t < \tau < \bar{t}.\]
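To make the definition concrete, here is a small sketch that counts upcrossing intervals from $\epsilon/2$ to $\epsilon$ in a finite sequence of numbers (think of the values $\Vert \nabla f(r_t) \Vert$ along one sample path). The function name `count_upcrossings` and the example sequence are made up for illustration.

```python
def count_upcrossings(values, eps):
    """Count upcrossing intervals from eps/2 to eps in a sequence of nonnegative numbers.

    An interval starts at the last index with value < eps/2 and ends at the first
    later index with value > eps; in between, values stay in [eps/2, eps].
    """
    count = 0
    below = False  # True once we have seen a value < eps/2 since the last upcrossing
    for x in values:
        if x < eps / 2:
            below = True
        elif x > eps and below:
            count += 1
            below = False
    return count


grad_norms = [0.1, 0.6, 1.2, 0.7, 0.2, 0.8, 1.5, 0.3]
num_up = count_upcrossings(grad_norms, eps=1.0)
print(num_up)  # 2
```

In the example, the first upcrossing runs from the value $0.1$ to $1.2$ and the second from $0.2$ to $1.5$.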

To show the finiteness of the upcrossings for any sample path $\{r_t\}$, we need to show that the effect of the noise terms $w_t$ dies out. Define $\bar{s}_t = \mathbb{E}[s_t \mid \mathcal{F}_t]$, so $w_t = s_t - \bar{s}_t$ and $\mathbb{E}[w_t \mid \mathcal{F}_t] = 0$. Using assumption two we have

\[\Vert \bar{s}_t \Vert^2 + \mathbb{E}\left[\Vert w_t\Vert^2 \mid \mathcal{F}_t\right] = \mathbb{E}\left[\Vert s_t\Vert^2 \mid \mathcal{F}_t\right] \leq K_1 + K_2 \Vert \nabla f(r_t) \Vert^2, \quad \forall t.\]

We define the following $\mathcal{F}_t$-measurable indicator random variable, which equals one whenever the gradient norm is at most $\epsilon$ (in particular, at every time $t, \dots, \bar{t} - 1$ of an upcrossing interval):

\[\chi_t = \begin{cases}1, & \mathrm{if}\, \Vert \nabla f(r_t) \Vert \leq \epsilon, \\ 0, & \mathrm{otherwise}. \end{cases}\]

The following lemma states that the accumulated effect of the noise over the times with $\chi_t = 1$ converges almost surely. We will later use it to show that the noise accumulated over the upcrossing intervals vanishes.

Lemma. The sequence defined by $u_t$ converges w.p. 1:

\[u_t = \sum_{\tau =0}^{t - 1}\chi_\tau\gamma_\tau w_\tau, \quad u_0 := 0.\]

Proof. Initially assume that $\sum_{t=0}^\infty \gamma_t^2 \leq A$, for some constant $A$. Note that:

\[\mathbb{E}[\chi_t\gamma_t w_t \mid \mathcal{F}_t] = \chi_t\gamma_t \mathbb{E}[w_t \mid \mathcal{F}_t]=0,\]

and therefore $\mathbb{E}\left[u_{t + 1} \mid \mathcal{F}_t\right] = \mathbb{E}\left[u_t + \chi_t\gamma_t w_t \mid \mathcal{F}_t\right] = u_t,$ so $\{u_t\}$ is a martingale (and its increments $\chi_t\gamma_t w_t$ form a martingale difference sequence).

If $\chi_t$ is zero, then $\mathbb{E}\left[\Vert u_{t+1} \Vert^2 \mid \mathcal{F}_t \right] = \mathbb{E}\left[\Vert u_{t}\Vert^2 \mid \mathcal{F}_t \right] = \Vert u_{t}\Vert^2$. If on the other hand, $\chi_t$ is one then,

\[\begin{align} \mathbb{E}\left[\Vert u_{t+1} \Vert^2 \mid \mathcal{F}_t \right]& = \mathbb{E}\left[\Vert u_t + \gamma_t w_t\Vert^2 \mid \mathcal{F}_t \right] \\ & = \Vert u_{t} \Vert^2 + 2\gamma_t u^\top_t\mathbb{E}[w_t \mid \mathcal{F}_t] + \gamma^2_t\mathbb{E}[\Vert w_t \Vert^2 \mid \mathcal{F}_t] \\ & \leq \Vert u_{t} \Vert^2 + \gamma^2_t\left(K_1 + K_2\epsilon^2\right), \end{align}\]

where the cross term vanishes because $\mathbb{E}[w_t \mid \mathcal{F}_t] = 0$, and the last step uses assumption two together with $\Vert \nabla f(r_t) \Vert \leq \epsilon$ (since $\chi_t = 1$). Now, we take the unconditional expectation of both sides, apply the tower rule, and sum over $t$ (a telescoping sum) to obtain

\[\mathbb{E}\left[\Vert u_t \Vert^2 \right] \leq (K_1 + K_2\epsilon^2)\sum_{\tau = 0}^\infty \gamma_\tau^2 \leq (K_1 + K_2\epsilon^2)A, \quad \forall t,\]

and since $\Vert u_t \Vert \leq 1 + \Vert u_t \Vert^2$, we have $\sup_t \mathbb{E}[\Vert u_t \Vert] < \infty$. Now, we can apply the martingale convergence theorem to $\{u_t\}$ and conclude that it converges almost surely.
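The second-moment bound can be verified exactly on a toy martingale. As assumptions for this sketch, take scalar noise $w_t = \pm 1$ with equal probability (so $\mathbb{E}[w_t^2] = 1$ plays the role of $K_1 + K_2\epsilon^2$), set $\chi_t = 1$ for all $t$, and use deterministic stepsizes $\gamma_t = 1/(t+1)$. Enumerating all sign patterns gives $\mathbb{E}[u_T^2]$ without sampling error.

```python
from itertools import product


def second_moment(gammas):
    """Exact E[u_T^2] for u_T = sum_t gamma_t * w_t with w_t = +/- 1 equally likely."""
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=len(gammas)):
        u = sum(g * w for g, w in zip(gammas, signs))
        total += u * u
    return total / 2 ** len(gammas)


gammas = [1.0 / (t + 1) for t in range(12)]
m2 = second_moment(gammas)
bound = sum(g * g for g in gammas)
print(m2, bound)  # the two numbers agree
```

The cross terms cancel in expectation, which is exactly why the second moment of $u_t$ is controlled by $\sum_t \gamma_t^2$.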

Now let’s consider the case that $\gamma_t$ is stochastic and $\sum_{t=0}^\infty \gamma^2_t$ is finite but not bounded by a deterministic constant. Consider an arbitrary positive integer $k$ and let $u^k_t$ be the process that is equal to $u_t$ as long as $\sum_{\tau=0}^{t} \gamma^2_\tau \leq k$ and stays constant afterward. By the argument above, $u^k_t$ converges almost surely for every $k$. Let $\Omega_k$ denote the set of sample paths $(r_0, r_1, \dots)$ for which $u^k_t$ doesn’t converge. Each $\Omega_k$ has measure zero, hence the countable union $\cup_{k=1}^\infty \Omega_k$ has measure zero too. Since $\sum_{t=0}^\infty \gamma^2_t$ is finite, for every sample path outside $\cup_{k=1}^\infty \Omega_k$ there exists some $k$ with $\sum_{t=0}^\infty \gamma^2_t \leq k$, so that $u_t = u_t^k$ for all $t$. Therefore, $u_t$ converges almost surely.

Let us now consider a sample path with infinitely many upcrossings and let $\{t_k, \dots, \bar{t}_k\}$ be the $k$th such interval. Since $\chi_t = 1$ for $t_k \leq t \leq \bar{t}_k - 1$ and $u_t$ converges, the above lemma gives $\lim_{k \to \infty} \sum_{t = t_k}^{\bar{t}_k - 1} \gamma_t w_t = 0,$ and similarly $\lim_{k \to \infty} \gamma_{t_k} w_{t_k} = 0$. Now we have

\[\begin{align*} \Vert \nabla f(r_{t_k + 1}) \Vert - \Vert \nabla f(r_{t_k}) \Vert & \leq \Vert \nabla f(r_{t_k + 1}) - \nabla f(r_{t_k})\Vert \\ &\leq L \Vert r_{t_k + 1} - r_{t_k} \Vert \\ &= L \gamma_{t_k} \Vert \bar{s}_{t_k} + w_{t_k} \Vert \\ & \leq L \gamma_{t_k} \Vert \bar{s}_{t_k} \Vert + L \gamma_{t_k} \Vert w_{t_k} \Vert. \end{align*}\]

The right-hand side of the above display goes to zero as $k$ goes to infinity: $\Vert \bar{s}_{t_k} \Vert^2$ is bounded by $K_1 + K_2 \epsilon^2$ and $\gamma_{t_k}$ goes to zero, and we have just shown that $\gamma_{t_k} \Vert w_{t_k} \Vert$ goes to zero as well. Then, since $\Vert \nabla f(r_{t_k + 1}) \Vert \geq \epsilon/2$ (by the definition of an upcrossing interval), we have $\Vert \nabla f(r_{t_k}) \Vert \geq \epsilon/4$ for all large enough $k$. Moreover, for every $k$ we have

\[\begin{align*} \frac \epsilon2 & \leq \Vert \nabla f(r_{\bar{t}_k}) \Vert - \Vert \nabla f(r_{t_k}) \Vert \\ & \leq \Vert \nabla f(r_{\bar{t}_k}) - \nabla f(r_{t_k}) \Vert \\ & \leq L \Vert r_{\bar{t}_k} - r_{t_k} \Vert \\ & \leq L \sum_{t = t_k}^{\bar{t}_k -1} \gamma_t\Vert \bar{s}_t\Vert + L \left\Vert \sum_{t = t_k}^{\bar{t}_k -1} \gamma_t w_t \right\Vert. \end{align*}\]

We have proved that the second term on the right-hand side of the above display goes to zero. Also, for $t_k \leq t \leq \bar{t}_k - 1, \, \Vert \bar{s}_t \Vert^2 \leq K_1 + K_2 \epsilon^2$, which by the inequality $x \leq x^2 +1$ implies that $\Vert \bar{s}_t \Vert \leq 1 + K_1 + K_2 \epsilon^2:=d$. By taking the $\liminf_{k \to \infty}$ from the above display we have

\[\liminf_{k \to \infty} \sum_{t = t_k}^{\bar{t}_k - 1} \gamma_t \geq \frac{\epsilon}{2Ld}.\]

For $t_k \leq t \leq \bar{t}_k - 1$ and $k$ large enough we have $\Vert \nabla f(r_t) \Vert \geq \epsilon/4$, so

\[\begin{align*} \liminf_{k \to \infty} \sum_{t = t_k} ^{\bar{t}_k - 1} \gamma_t \Vert \nabla f(r_t)\Vert^2 \geq \frac{\epsilon}{2Ld} \cdot \frac{\epsilon^2}{16}. \end{align*}\]

The upcrossing intervals are disjoint, and there are infinitely many of them, each eventually contributing at least a fixed positive amount. Summing over all of them gives $\sum_{t=0}^{\infty} \gamma_t \Vert \nabla f(r_t)\Vert^2 =\infty$. This is a contradiction, because right after introducing assumption four we showed that $\sum_{t=0}^{\infty} \gamma_t \Vert \nabla f(r_t)\Vert^2 < \infty$. Hence, the number of upcrossings must be finite.

Given that $\Vert \nabla f(r_t) \Vert$ comes arbitrarily close to zero infinitely often and that there are finitely many upcrossings, it follows that $\Vert \nabla f(r_t) \Vert$ can exceed $\epsilon$ only a finite number of times, and $\limsup_{t \to \infty}\Vert \nabla f(r_t) \Vert \leq \epsilon$. Since $\epsilon$ was arbitrary, it follows that $\limsup_{t \to \infty}\Vert \nabla f(r_t) \Vert = 0$, and part (b) of the proposition has been proved. Finally, if $r$ is a limit point of $r_t$, then by the continuity of $\nabla f$, $\nabla f(r)$ is the limit of some subsequence of $\nabla f(r_t)$ and must be equal to 0, which establishes part (c).

\[\begin{equation*}\textbf{END OF PART 1!}\end{equation*}\]

References

Neuro-Dynamic Programming
Convex Optimization
CPSC 327: Data Structures and Algorithms Spring 2025
CSC 236 H1F, Lecture 8
Calculus
Bandit algorithms
Instructor: Parimal Parag
Introduction to Online Convex Optimization
Analysis 1
ChatGPT and Google Gemini [Both a lot]

The law of total variance in practice

2025-08-19T00:00:00-07:00

In this post, I’ll solve an example that requires the use of the law of total variance in RL.

Background

Background on some technical tools used in the main section. Before we start, note that for two random variables $X\, \mathrm{and}\, Y$, $\mathbb{E}[Y \mid X]$ is a shorthand notation for $\mathbb{E}[Y \mid \sigma(X)]$!

The law of total variance

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $\mathcal{G}_1 \subseteq \mathcal{G}_2 \subseteq \mathcal{F}$ two sub-$\sigma$-algebras of $\mathcal{F}$, $X$ an integrable random variable on $(\Omega, \mathcal{F}, \mathbb{P})$ with finite variance. The following holds true:

$\mathbb{V}(X \mid \mathcal{G}_1) = \mathbb{E}[\mathbb{V}(X \mid \mathcal{G}_2) \mid \mathcal{G}_1] + \mathbb{V}[\mathbb{E}(X \mid \mathcal{G}_2) \mid \mathcal{G}_1].$
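As a sanity check, the sketch below verifies the law of total variance exactly on a small discrete joint distribution of $(Y, X)$, taking $\mathcal{G}_1$ to be trivial and $\mathcal{G}_2 = \sigma(Y)$, so the statement reads $\mathbb{V}(X) = \mathbb{E}[\mathbb{V}(X \mid Y)] + \mathbb{V}[\mathbb{E}(X \mid Y)]$. The numbers in `joint` are made up.

```python
# Hypothetical joint pmf: keys are (y, x), values are probabilities summing to one.
joint = {(0, 1.0): 0.2, (0, 3.0): 0.3, (1, 2.0): 0.1, (1, 6.0): 0.4}


def mean(pairs):
    """Mean of a discrete distribution given as (value, probability) pairs."""
    return sum(v * p for v, p in pairs)


def var(pairs):
    """Variance of a discrete distribution given as (value, probability) pairs."""
    m = mean(pairs)
    return sum(p * (v - m) ** 2 for v, p in pairs)


# marginal of Y
p_y = {}
for (y, _), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# conditional mean and variance of X given Y = y
cond_mean, cond_var = {}, {}
for y, py in p_y.items():
    cond = [(x, p / py) for (yy, x), p in joint.items() if yy == y]
    cond_mean[y] = mean(cond)
    cond_var[y] = var(cond)

total_var = var([(x, p) for (_, x), p in joint.items()])
expected_cond_var = sum(p_y[y] * cond_var[y] for y in p_y)
var_cond_mean = var([(cond_mean[y], p_y[y]) for y in p_y])
print(total_var, expected_cond_var + var_cond_mean)  # equal
```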

The law of the unconscious statistician (LOTUS)

Let $\mathbb{P}_X$ be the pushforward of the random element $X \in \mathcal{X}$. For any real-valued measurable function $f: \mathcal{X} \to \mathbb{R}$, we have, in the discrete case and in general respectively,

\[\mathbb{E}[f(X)] = \sum_xf(x)\mathbb{P}_X(x),\]

\[\mathbb{E}[f(X)] = \int_\mathcal{X} f(x)\mathrm{d}\mathbb{P}_X(x),\]

provided that either the right-hand side, or the left-hand side exist. This is known as the “law of the unconscious statistician”, or LOTUS.

Jensen’s inequality

Let $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, +\infty\}$, and $\mathrm{dom}(f) = \{x \in \mathbb{R}^d: f(x) < \infty \}$ for a function $f: \mathbb{R}^d \to \bar{\mathbb{R}}$.

Jensen’s inequality: Let $f: \mathbb{R}^d \to \bar{\mathbb{R}}$ be a measurable convex function and $X$ be an $\mathbb{R}^d$-valued random element on some probability space such that $\mathbb{E}[X]$ exists and $X \in \mathrm{dom}(f)$ holds almost surely. Then,

\[\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]).\]

An example of the law of total variance in practice

We want to compare the variance of the target for SARSA and expected SARSA. The update rule of SARSA is

\[Q_{t + 1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) - Q_t(S_t, A_t) \right].\]

And the update rule for expected SARSA is

\[Q_{t + 1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha \left[R_{t + 1} + \gamma \sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) - Q_t(S_t, A_t) \right],\]

where $\pi$ is the fixed policy that was used to generate the data $S_0, A_0, R_1, \, \dots\;$.

Let $H_t = (S_0, A_0, R_1, \dots, S_t)$, and $H'_t = (S_0, A_0, R_1, \dots, S_t, A_t)$.

Show that

\[\mathbb{V}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H_{t + 1}\right] \geq \mathbb{V}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert H_{t + 1}\right].\]

First, note that using the Markov property (and treating $Q_t$ as fixed given the history), we can replace $H_{t + 1}$ in the above expressions with $S_{t + 1}$. Second,

\[\mathbb{V}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert S_{t + 1}\right] = 0.\]

This is because, given $S_{t + 1}$ (which is not random anymore), the randomness in the actions is averaged out by the expectation, so there is no randomness remaining. Third,

\[\begin{align*} \mathbb{V}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1}\right] & = \mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1})^2 \middle\vert S_{t + 1}\right] - \mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1}\right]^2 \quad \text{(def of variance)} \\ & = \sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(S_{t + 1}, a)^2 - \left(\sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(S_{t + 1}, a) \right)^2 \quad \text{(def of expectation and LOTUS)} \\ &\geq \sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(S_{t + 1}, a)^2 - \sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(S_{t + 1}, a)^2 \quad \text{(Jensen's inequality with } x \mapsto x^2 \text{)} \\ &= 0. \end{align*}\]

Right off the bat, we already knew that the variance is always non-negative anyway. 😅

When does equality hold?

Well, you’re asking when the variance of a random variable is zero. The answer is when it’s deterministic (almost surely constant). For a deterministic random variable $X$ we have that $\mathbb{E}[X] = X$. Hence, we want

\[\mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1}\right] = Q_t(S_{t + 1}, A_{t + 1}),\]

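which happens when the policy is deterministic at $S_{t + 1}$, or more generally when $Q_t(S_{t + 1}, \cdot)$ takes the same value on every action that $\pi(\cdot \mid S_{t + 1})$ selects with positive probability.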
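For a fixed next state, the conditional variance above is just the variance of the action values under the policy. The sketch below computes it for made-up action values `q` and two made-up policies, one stochastic and one deterministic.

```python
def action_value_variance(pi, q):
    """Variance of Q(s', A') when A' ~ pi(. | s'), for a fixed next state s'."""
    mean = sum(p * v for p, v in zip(pi, q))
    second_moment = sum(p * v * v for p, v in zip(pi, q))
    return second_moment - mean ** 2


q = [1.0, 4.0, -2.0]  # hypothetical action values Q_t(s', a)
var_stochastic = action_value_variance([0.2, 0.5, 0.3], q)
var_deterministic = action_value_variance([0.0, 1.0, 0.0], q)
print(var_stochastic, var_deterministic)
```

The stochastic policy gives a strictly positive variance, while the deterministic one gives zero, matching the equality condition.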

Show that

\[\mathbb{V} \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H'_t \right] \geq \mathbb{V}\left[R_{t + 1} + \gamma \sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert H'_t\right],\]

that is, the appropriate conditional variance of the SARSA target is always at least as large as that of the expected SARSA target.

For convenience, let $Z_t = (S_t, A_t)$. By the Markov property, conditioning on $H'_t$ can be replaced with conditioning on $Z_t$. We have,

\[\begin{align*} & \mathbb{V} \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H'_t \right] - \mathbb{V}\left[R_{t + 1} + \gamma \sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert H'_t\right] = \\ & \mathbb{V} [R_{t + 1} \mid Z_t] + \gamma^2 \mathbb{V} \left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] + 2\mathrm{Cov}\left[R_{t + 1}, \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] - \mathbb{V} [R_{t + 1} \mid Z_t] - \gamma^2 \mathbb{V}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert Z_t\right] - 2\mathrm{Cov}\left[R_{t + 1}, \gamma \sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert Z_t \right]= \\ & \gamma^2 \mathbb{V} \left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] + 2\mathrm{Cov}\left[R_{t + 1}, \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] - \gamma^2 \mathbb{V}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert Z_t\right] - 2\mathrm{Cov}\left[R_{t + 1}, \gamma \sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert Z_t \right]. \end{align*}\]

For the covariance terms, note that although $R_{t + 1}$ is not independent of $Q_t(S_{t + 1}, A_{t + 1})$ in general, it is once we condition on $Z_t$, provided that the reward is conditionally independent of $S_{t + 1}$ (and hence of $A_{t + 1}$) given $Z_t$. So, the covariance terms are zero. Hence, we end up with

\[\mathbb{V} \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H'_t \right] - \mathbb{V}\left[R_{t + 1} + \gamma \sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert H'_t\right] = \gamma^2 \left(\mathbb{V} \left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] - \mathbb{V}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert Z_t\right]\right).\]

Now we apply the law of total variance with $\mathcal{G}_1 = \sigma(Z_t)$ and $\mathcal{G}_2 = \sigma(Z_t, S_{t + 1})$ (by the Markov property, conditioning on $\mathcal{G}_2$ reduces to conditioning on $S_{t + 1}$):

\[\begin{align*} &\gamma^2 \left(\mathbb{V} \left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] - \mathbb{V}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert Z_t\right]\right) = \\ &\gamma^2 \left(\mathbb{E}\left[\mathbb{V}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1} \right] \middle\vert Z_t \right] + \mathbb{V}\left[\mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1} \right] \middle\vert Z_t \right] \right) - \\ &\gamma^2 \left( \mathbb{E}\left[\mathbb{V}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert S_{t + 1}\right] \middle\vert Z_t \right] + \mathbb{V}\left[\mathbb{E}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert S_{t + 1}\right] \middle\vert Z_t \right] \right). \end{align*}\]

First, $\mathbb{E}\left[\mathbb{V}\left[\sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert S_{t + 1} \right] \middle\vert Z_t \right] = 0$, because given $S_{t + 1}$ everything inside becomes deterministic. Also, the two remaining variance-of-conditional-expectation terms are equal and cancel:

\[\mathbb{V}\left[\mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1} \right] \middle\vert Z_t \right] = \mathbb{V}\left[ \sum_{a'} \pi(a' \mid S_{t + 1})Q_t(S_{t + 1}, a') \middle\vert Z_t \right]= \mathbb{V}\left[\mathbb{E}\left[ \sum_{a'} \pi(a' \mid S_{t + 1})Q_t(S_{t + 1}, a') \middle\vert S_{t + 1}\right] \middle\vert Z_t \right]\]

So, we have

\[\mathbb{V} \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H'_t \right] - \mathbb{V}\left[R_{t + 1} + \gamma \sum_{a' \in \mathcal{A}} \pi\left(a' \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a' \right) \middle\vert H'_t\right] = \gamma^2 \mathbb{E}[\mathbb{V}[Q_t(S_{t + 1}, A_{t + 1}) \mid S_{t + 1}] \mid Z_t] \geq 0.\]

The inequality holds since the variance is always non-negative.
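The final identity can be checked exactly on a tiny made-up model. In the sketch below, everything is conditional on a fixed $Z_t$: the reward distribution, the next-state distribution, the policy, and the table `Q` are all hypothetical, and the reward is drawn independently of the next state (the assumption used for the covariance terms).

```python
GAMMA = 0.9
reward_pmf = {0.0: 0.5, 1.0: 0.5}        # R_{t+1} given Z_t (hypothetical)
next_state_pmf = {"s1": 0.3, "s2": 0.7}  # S_{t+1} given Z_t (hypothetical)
policy = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.9, "b": 0.1}}
Q = {("s1", "a"): 1.0, ("s1", "b"): 3.0, ("s2", "a"): 0.0, ("s2", "b"): 5.0}


def variance(outcomes):
    """Variance of a discrete distribution given as (value, probability) pairs."""
    m = sum(p * x for x, p in outcomes)
    return sum(p * (x - m) ** 2 for x, p in outcomes)


# enumerate every outcome of both targets with its probability
sarsa, expected_sarsa = [], []
for r, pr in reward_pmf.items():
    for s, ps in next_state_pmf.items():
        v = sum(pa * Q[(s, a)] for a, pa in policy[s].items())
        expected_sarsa.append((r + GAMMA * v, pr * ps))
        for a, pa in policy[s].items():
            sarsa.append((r + GAMMA * Q[(s, a)], pr * ps * pa))

gap = variance(sarsa) - variance(expected_sarsa)
# gamma^2 * E[ V[Q(S', A') | S'] | Z ]
predicted = GAMMA ** 2 * sum(
    ps * variance([(Q[(s, a)], pa) for a, pa in policy[s].items()])
    for s, ps in next_state_pmf.items()
)
print(gap, predicted)  # equal and non-negative
```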