MasterOfMagicMatrices

Why Does Q-Learning Algorithm Converge?

I'm reading Watkins's version. He builds a stochastic process called the action replay process (ARP) on top of the MDP. I'm trying to understand every step of the paper, but I'm having difficulty proving Lemma B.3 and the main theorem.

Question 1: how to prove this lemma?

Lemma 1:
Let $\{\xi_n\}$ be a sequence of bounded random variables with expectation $\mathfrak{E}$, and let $0 \le \beta_n < 1$ satisfy $\sum_{i=1}^{\infty} \beta_i = +\infty$ and $\sum_{i=1}^{\infty} \beta_i^2 < +\infty$.
Define the sequence $X_{n+1} = X_n + \beta_n(\xi_n - X_n)$. Then:

$$
P\left( \lim_{n \to \infty} X_n = \mathfrak{E} \right) = 1
$$

Question 2: How to prove the main theorem?

Theorem: Convergence of the Q-Learning Algorithm

Suppose the Markov Decision Process (MDP) is finite, and the exploration strategy never terminates from any state and assigns a non-zero probability to every action in every state. Suppose the learning rates lie in $(0,1)$, their series diverges, and the series of their squares converges. Then the Q-learning algorithm converges to the optimal action values with probability 1.

Below is my comprehension of this material.


Preface

A commonly used inequality

$$
-x > \ln(1 - x), \quad 0 < x < 1
$$

Proof: Let $f(x) = \ln(1 - x) + x$, for $0 < x < 1$. Then $f(0) = 0$.

$$
f'(x) = \frac{-1}{1 - x} + 1 = \frac{x}{x - 1} < 0
$$

Hence $f$ is strictly decreasing on $(0,1)$, so $f(x) < f(0) = 0$, i.e. $-x > \ln(1 - x)$ for $0 < x < 1$. Q.E.D.


Fundamental Theorem
If $a_n > -1$, then

$$
\prod_{n=1}^\infty (1 + a_n) = 0 \Leftrightarrow \sum_{n=1}^\infty \ln(1 + a_n) = -\infty
$$

Proof: Let $P_k = \prod_{n=1}^k (1 + a_n)$, then

$$
\ln P_k = \ln\left(\prod_{n=1}^k (1 + a_n)\right) = \sum_{n=1}^k \ln(1 + a_n)
$$

Thus,

$$
\sum_{n=1}^\infty \ln(1 + a_n) = -\infty \Leftrightarrow \lim_{k \to \infty} \sum_{n=1}^k \ln(1 + a_n) = -\infty \Leftrightarrow \lim_{k \to \infty} \ln P_k = -\infty \Leftrightarrow \lim_{k \to \infty} P_k = 0
$$

Q.E.D.


Corollary
If $0 \le b_n < 1$ and $\sum_{n=1}^\infty b_n = +\infty$, then

$$
\prod_{n=1}^\infty (1 - b_n) = 0
$$

Proof: Consider the subsequence $\{b_{n_k}\}$ consisting of the non-zero terms of $\{b_n\}$ (it is infinite, since otherwise $\sum_{n=1}^\infty b_n$ could not diverge). Since $-b_{n_k} > -1$, the fundamental theorem gives:

$$
\prod_{n=1}^\infty (1 - b_n) = \prod_{k=1}^\infty (1 - b_{n_k}) = 0 \Leftrightarrow \sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty
$$

We now show $\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$.
Since $0 < 1 - b_{n_k} < 1$, each term $\ln(1 - b_{n_k}) < 0$, and $\sum_{k=1}^\infty b_{n_k} = +\infty$ (dropping the zero terms does not change the sum). The divergence of the log series is not immediately obvious, so we argue by contradiction:

Assume $\sum_{k=1}^\infty \ln(1 - b_{n_k}) \ne -\infty$. Since each term is negative, this implies convergence, i.e.,

$$
\sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty
$$

But the inequality from the preface gives $\ln(1 - b_{n_k}) < -b_{n_k}$ term by term, so $\sum_{k=1}^\infty \ln(1 - b_{n_k}) \le \sum_{k=1}^\infty (-b_{n_k}) = -\infty$, contradicting $\sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty$.
Therefore, $\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$, and so

$$
\prod_{n=1}^\infty (1 - b_n) = 0
$$

Q.E.D.
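
For a concrete feel of the corollary, here is a tiny numerical illustration (a sketch only; `numpy` and the choice $b_n = 1/(n+1)$ are my own): with a divergent harmonic-type series, the partial products collapse to 0 even though every single factor is close to 1.

```python
# Illustration of the corollary with b_n = 1/(n+1): sum b_n diverges,
# so the partial products of (1 - b_n) must tend to 0.
import numpy as np

b = 1.0 / np.arange(2, 100002)                 # b_n = 1/(n+1) for n = 1, 2, ...
partial_products = np.cumprod(1 - b)
print(partial_products[[9, 99, 999, 99999]])   # ~ 1/11, 1/101, 1/1001, 1/100001
```

Here the product even has a closed form, $\prod_{k=2}^{m}(1 - 1/k) = 1/m$, which is why the printed values are so clean.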


The Essence of Mathematical Truth: Induction
Observe a linear-looking relation, fantasize wildly, then coldly examine whether it is truly valid.

Given $X_1$ and the recursive formula:

$$
X_{n+1} = X_n + \beta_n(\xi_n - X_n) = (1 - \beta_n)X_n + \beta_n \xi_n
$$

Show that

$$
X_{n+1} = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)
$$

Proof:

  1. Base case: $n = 1$

$$
X_2 = (1 - \beta_1)X_1 + \beta_1 \xi_1 = \xi_1 \beta_1 + X_1 (1 - \beta_1)
$$

holds.

  2. Inductive step: assume the formula is true for $n$, and prove it for $n+1$:

$$
X_{n+2} = (1 - \beta_{n+1})X_{n+1} + \beta_{n+1} \xi_{n+1}
$$

Plug in inductive hypothesis:

$$
= (1 - \beta_{n+1})\left[\sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)\right] + \beta_{n+1} \xi_{n+1}
$$

$$
= \sum_{j=1}^{n+1} \xi_j \beta_j \prod_{i=j}^{n} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^{n+1} (1 - \beta_i)
$$

  3. By induction, the formula holds for all positive integers $n$. Q.E.D.
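
As a sanity check of the closed form (a minimal sketch; the random $\beta_j$, $\xi_j$ and the variable names are only for illustration), the recursion and the explicit sum agree to machine precision:

```python
# Compare the recursion X_{k+1} = X_k + beta_k (xi_k - X_k) with the closed form.
import numpy as np

rng = np.random.default_rng(1)
n = 50
beta = rng.uniform(0.0, 1.0, size=n)   # beta_1, ..., beta_n in [0, 1)
xi = rng.normal(size=n)                # bounded-in-practice xi_1, ..., xi_n
x1 = rng.normal()                      # X_1

# Run the recursion up to X_{n+1}.
x = x1
for k in range(n):
    x = x + beta[k] * (xi[k] - x)

# Closed form: X_{n+1} = sum_j xi_j beta_j prod_{i=j+1}^{n}(1 - beta_i) + X_1 prod_{i=1}^{n}(1 - beta_i)
closed = x1 * np.prod(1 - beta)
for j in range(n):
    closed += xi[j] * beta[j] * np.prod(1 - beta[j + 1:])

print(abs(x - closed))                 # ~1e-15
```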

Now, let's relax for a while — it's movie time.


1. Definition of Action Replay Process

Given an $n$-step finite MDP with possibly varying learning rates $\alpha_i$: at step $i$, the agent is in state $x_i$, takes action $a_i$, receives random reward $r_i$, and transitions to a new state $y_i$.

Action Replay Process (ARP) is a re-examination of state $x$ and action $a$ within a given MDP.

Suppose we focus on state $x$ and action $a$, and consider an MDP consisting of $n$ steps.

We add a step 0 in which the agent immediately terminates and receives reward $Q_0(x,a)$.

During steps 1 to $n$, due to the randomness of the MDP, the agent may take action $a$ in state $x$ at time steps $1 \le n^{i_1} < n^{i_2} < \cdots < n^{i_*} \le n$.

If action $a$ is never taken at $x$ during these $n$ steps, the only replay candidate is step 0.

When $i_* \ge 1$, to determine ARP's next reward and state, we sample an index $n^{i_e}$ as follows:

$$
n^{i_e} =
\begin{cases}
n^{i_*}, & \text{with probability } \alpha_{n^{i_*}} \\
n^{i_{*-1}}, & \text{with probability } (1 - \alpha_{n^{i_*}})\alpha_{n^{i_{*-1}}} \\
\vdots \\
0, & \text{with probability } \prod_{i=1}^{i_*}(1 - \alpha_{n^i})
\end{cases}
$$

Then, after one ARP step, the state $\langle x, n \rangle$ transitions to $\langle y_{n^{i_e}}, n^{i_e} - 1 \rangle$, and the reward is $r_{n^{i_e}}$ (if index 0 is drawn, the ARP terminates with reward $Q_0(x,a)$).
Clearly, $n^{i_e} - 1 < n$, so the level strictly decreases and the ARP terminates with probability 1. Thus, the ARP is a finite process almost surely.

To summarize, the core transition formula is:

$$
\langle x, n \rangle \overset{a}{\rightarrow} \langle y_{n^{i_e}}, n^{i_e} - 1 \rangle, \quad \text{reward } r_{n^{i_e}}
$$
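
To make the sampling rule concrete, here is a minimal sketch of one ARP transition. It assumes we already hold a recorded episode `episode[t-1] = (x_t, a_t, r_t, y_t)` and the learning rates `alpha`; the function name and data layout are my own, not Watkins's notation.

```python
# A sketch of a single ARP transition from <x, n> under action a.
import numpy as np

def arp_step(x, n, a, episode, alpha, rng):
    """Return ((next_state, next_level), reward) for one ARP step from <x, n>.

    episode[t-1] = (x_t, a_t, r_t, y_t); alpha[t-1] = alpha_t.
    Drawing index 0 means termination with reward Q_0(x, a) = 0.
    """
    # Replay indices: the steps t <= n at which (x, a) was actually taken, newest first.
    replay = [t for t in range(n, 0, -1) if episode[t - 1][:2] == (x, a)]
    for t in replay:
        # Index t is chosen with probability alpha_t times the product of (1 - alpha) above it.
        if rng.random() < alpha[t - 1]:
            _, _, r, y = episode[t - 1]
            return (y, t - 1), r              # go to <y_t, t - 1> with reward r_t
    return None, 0.0                          # index 0: terminate with reward Q_0(x, a) = 0
```

Walking the replay indices from the top down and flipping an $\alpha$-biased coin at each one reproduces exactly the case distribution above.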


2. Properties of the Action Replay Process

We now examine ARP’s properties, particularly in comparison to MDPs. Given an MDP rule and a (non-terminating) instance, we can construct an ARP accordingly.

Property 1

$$
\forall n, x, a,\quad Q^*_{ARP}(\langle x, n \rangle, a) = Q_n(x, a)
$$

Proof:
Using mathematical induction on $n$:

  1. Base case $n=1$:
  • If the MDP did not take $a$ at $x$ in step 1, ARP gives reward $Q_0(x,a) = 0 = Q_1(x,a)$
  • If $(x,a) = (x_1, a_1)$, then:

    $$
    Q^*_{ARP}(\langle x, 1 \rangle, a) = \alpha_1 r_1 + (1 - \alpha_1) Q_0(x,a) = \alpha_1 r_1 = Q_1(x,a)
    $$

  2. Inductive step: assume $Q^*_{ARP}(\langle x, k-1 \rangle, a) = Q_{k-1}(x,a)$ for all $x, a$; show the claim for $k$:
  • If $(x,a) \ne (x_k, a_k)$, then:

    $$
    Q^*_{ARP}(\langle x, k \rangle, a) = Q^*_{ARP}(\langle x, k-1 \rangle, a) = Q_{k-1}(x,a) = Q_k(x,a)
    $$

  • If $(x,a) = (x_k, a_k)$, then:

    $$
    Q^*_{ARP}(\langle x, k \rangle, a) = \alpha_k \left[r_k + \gamma \max_b Q^*_{ARP}(\langle y_k, k-1 \rangle, b)\right] + (1 - \alpha_k) Q^*_{ARP}(\langle x, k-1 \rangle, a) = \alpha_k \left[r_k + \gamma \max_b Q_{k-1}(y_k, b)\right] + (1 - \alpha_k) Q_{k-1}(x,a) = Q_k(x,a)
    $$

  3. Therefore, $Q^*_{ARP}(\langle x, n \rangle, a) = Q_n(x,a)$ for all $n$. Q.E.D.
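
Property 1 can also be checked numerically on a random episode: compute $Q_n$ with the ordinary tabular Q-learning update, compute $Q^*_{ARP}$ bottom-up over the levels, and compare. The sketch below assumes a small random episode, $\gamma = 0.9$, and $\alpha_t = 1/t$; all names (`episode`, `q_arp`, ...) are illustrative.

```python
# Numerical check of Property 1 on one random episode.
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, horizon, gamma = 3, 2, 40, 0.9

# One recorded episode: (x_t, a_t, r_t, y_t) for t = 1..horizon.
episode = []
x = rng.integers(num_states)
for t in range(1, horizon + 1):
    a = rng.integers(num_actions)
    r = rng.normal()
    y = rng.integers(num_states)
    episode.append((x, a, r, y))
    x = y

alpha = [1.0 / t for t in range(1, horizon + 1)]            # learning rate alpha_t

# Q_n(x, a): ordinary Q-learning updates along the episode.
q = np.zeros((horizon + 1, num_states, num_actions))        # q[n] holds Q_n
for t, (xt, at, rt, yt) in enumerate(episode, start=1):
    q[t] = q[t - 1]
    target = rt + gamma * q[t - 1, yt].max()
    q[t, xt, at] = (1 - alpha[t - 1]) * q[t - 1, xt, at] + alpha[t - 1] * target

# Q*_ARP(<x, n>, a): dynamic programming bottom-up over the levels n = 0..horizon.
q_arp = np.zeros((horizon + 1, num_states, num_actions))    # level 0 equals Q_0 = 0
for n in range(1, horizon + 1):
    for s in range(num_states):
        for a in range(num_actions):
            value, survive = 0.0, 1.0
            # Walk the replay indices from the top level down.
            for t in range(n, 0, -1):
                xt, at, rt, yt = episode[t - 1]
                if (xt, at) != (s, a):
                    continue
                value += survive * alpha[t - 1] * (rt + gamma * q_arp[t - 1, yt].max())
                survive *= 1 - alpha[t - 1]
            # Remaining probability: terminate at level 0 with reward Q_0(s, a) = 0.
            q_arp[n, s, a] = value

print(np.abs(q_arp - q).max())   # ~1e-15: Q*_ARP(<x, n>, a) equals Q_n(x, a)
```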

Property 2: In the ARP, let $\{\langle x_j, n_j \rangle\}$ denote the sequence of states visited starting from $\langle x_1, n_1 \rangle$ under some sequence of actions. For all $l, s, \epsilon > 0$, there exists $h > l$ such that for all $n_1 > h$,

$$
P(n_{s+1} < l) < \epsilon
$$

Proof:

Let us first consider a single step, say the final one: the case where the sampled index $n^{i_e}$ falls below $n^{i_l}$, i.e., below level $l$.
In the ARP, starting from $\langle x, n_1 \rangle$ and taking action $a$, let $i_h$ denote the largest replay index and $i_l$ the smallest replay index with $n^{i_l} \ge l$. The probability of reaching a level lower than $l$ in one step is:

$$
\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right]
= \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right]
= \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right]
$$

But note that, with the convention $\alpha_{n^0} = 1$ (step 0 receives all of the remaining probability), the sum telescopes to 1:

$$
\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] = 1
$$

Therefore,

$$
\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}}
$$

As long as $\sum_{i} \alpha_{n^i}$ diverges along the subsequence of steps at which $(x,a)$ is taken (this is what the conditions on the learning rates must guarantee), then as $h \to \infty$:

$$
\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}} \to 0
$$

Moreover, since the MDP is finite, we have:

$$
\forall l_j \in \mathbb{N}^*, \forall \eta_j > 0, \exists M_j > 0, \forall n_j > M_j, \forall x_j, a_j,
$$

starting from $\langle x_j, n_j \rangle$, after taking action $a_j$,

$$
P(n_{j+1} \ge l_j) > 1 - \eta_j
$$

Indexing by $j$, we apply this conclusion recursively from step $s$ back to step 1, choosing the $\eta_j$ small enough. Then the probability that $n_{s+1} \ge l = l_s$ is at least:

$$
\prod_{j=1}^{s} (1 - \eta_j) \ge 1 - \epsilon,
$$

where $n_{j+1} \ge l_j$ and $\langle x_{j+1}, n_{j+1} \rangle$ is the state reached from $\langle x_j, n_j \rangle$ after executing $a_j$. Q.E.D.

Now, define:

$$
P_{xy}^{(n)}[a] = \sum_{m=1}^{n-1} P^{ARP}_{\langle x, n \rangle \langle y, m \rangle}[a],
$$

that is, the probability that, in the ARP, taking action $a$ from $\langle x, n \rangle$ leads to state $y$ (at whatever lower level). Similarly, let $\mathfrak{R}_{x}^{(n)}(a)$ denote the expected immediate ARP reward for taking $a$ in $\langle x, n \rangle$.


Lemma 1 (Robbins-Monro):
Let $\{\xi_n\}$ be a sequence of bounded random variables with expectation $\mathfrak{E}$, and let $0 \le \beta_n < 1$ satisfy $\sum_{i=1}^{\infty} \beta_i = +\infty$ and $\sum_{i=1}^{\infty} \beta_i^2 < +\infty$.
Define the sequence $X_{n+1} = X_n + \beta_n(\xi_n - X_n)$. Then:

$$
P\left( \lim_{n \to \infty} X_n = \mathfrak{E} \right) = 1
$$

My attempt:

$$
X_{n+1} = X_n + \beta_n(\xi_n - X_n) = (1 - \beta_n) X_n + \beta_n \xi_n
$$

By induction, we obtain:

$$
X_{n+1} = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^{n} (1 - \beta_i)
$$

From a corollary of a fundamental theorem:

$$
\prod_{i=1}^{\infty} (1 - \beta_i) = 0
$$

Hence:

$$
\lim_{n \to \infty} X_n = \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i)
= \frac{ \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - 0}
$$

$$
= \frac{ \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - \prod_{i=1}^{\infty} (1 - \beta_i)}
= \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \frac{ \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - \prod_{i=1}^{n} (1 - \beta_i)}
$$
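
I cannot yet complete the proof, but a simulation at least matches the statement. The sketch below (my own illustration, with $\beta_n = 1/n$ and bounded i.i.d. $\xi_n$ of mean $0.5$) runs the recursion and watches $X_n$ drift to $\mathfrak{E}$:

```python
# Robbins-Monro style averaging: X_{n+1} = X_n + beta_n (xi_n - X_n).
import numpy as np

rng = np.random.default_rng(2)
x, true_mean = 0.0, 0.5
for n in range(1, 200001):
    xi = rng.uniform(0.0, 1.0)       # bounded, E[xi_n] = 0.5
    beta = 1.0 / n                   # sum beta_n = inf, sum beta_n^2 < inf
    x += beta * (xi - x)
print(abs(x - true_mean))            # small (about 1e-3 here), shrinking as n grows
```

With $\beta_n = 1/n$ the recursion is exactly the running sample mean, so this special case is just the strong law of large numbers; the point of Lemma 1 is that the same conclusion holds for any admissible $\{\beta_n\}$.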


Property 3

$$
P\left\{\lim_{n\to\infty}P_{xy}^{(n)}[a]=P_{xy}[a]\right\}=1, \quad P\left[\lim_{n\to\infty}\mathfrak{R}_{x}^{(n)}(a)=\mathfrak{R}_{x}(a)\right]=1
$$

The two convergence statements inside the probability operator $P$ in fact hold uniformly over $x$, $y$, and $a$: since the MDP is finite, the estimates converge almost surely and uniformly in $x$, $y$, $a$.

Proof:
We will use Lemma 1 and construct a clever proof to demonstrate this property.

From the definition of ARP, we have:

$$
\mathfrak{R}_{x}^{(n^{i+1})}(a)=\alpha_{n^{i+1}}r_{n^{i+1}}+(1-\alpha_{n^{i+1}})\mathfrak{R}_{x}^{(n^{i})}(a)=\mathfrak{R}_{x}^{(n^{i})}(a)+\alpha_{n^{i+1}}\left[r_{n^{i+1}}-\mathfrak{R}_{x}^{(n^{i})}(a)\right]
$$

Although after the MDP process completes, $r_{n^{i+1}}$ is just a deterministic sequence with respect to $i$, its value can be viewed as a sample from a certain random variable. Therefore, as long as:

  1. $0 \le \alpha_{n^i} < 1$
  2. $\sum_{i=1}^\infty \alpha_{n^i} = +\infty$
  3. $\sum_{i=1}^\infty \alpha_{n^i}^2 < +\infty$

then

$$
P\left[\lim_{i\to\infty}\mathfrak{R}_{x}^{(n^i)}(a)=\mathfrak{R}_{x}(a)\right]=1,
$$

which implies

$$
P\left[\lim_{n\to\infty}\mathfrak{R}_{x}^{(n)}(a)=\mathfrak{R}_{x}(a)\right]=1.
$$

Now define $X_n(x,a,y)$: if at step $n$ the agent executes action $a$ in state $x$ and reaches $y_n = y$, this value is 1; otherwise, it is 0.

Then:

$$
P_{xy}^{(n^{i+1})}[a] = (1-\alpha_{n^{i+1}})P_{xy}^{(n^i)}[a] + \alpha_{n^{i+1}}X_{n^{i+1}}(x,a,y) = P_{xy}^{(n^i)}[a] + \alpha_{n^{i+1}}\left[X_{n^{i+1}}(x,a,y) - P_{xy}^{(n^i)}[a]\right]
$$

Although $X_{n^{i+1}}(x,a,y)$ is a deterministic sequence depending on $i$, it can also be regarded as a sample value from a random variable.

Therefore,

$$
P\left\{\lim_{i\to\infty}P_{xy}^{(n^{i+1})}[a]=P_{xy}[a]\right\}=1,
$$

so

$$
P\left\{\lim_{n\to\infty}P_{xy}^{(n)}[a]=P_{xy}[a]\right\}=1.
$$

In fact, since the MDP is finite, the expressions inside the probability operator converge uniformly for any $x$, $y$, and $a$. Q.E.D.
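
The same recursions can be run empirically. The sketch below (illustrative names; the "true" $P_{xy}[a]$ and rewards are generated only so the error can be measured) applies the two Robbins-Monro updates above along the replay subsequence of a single $(x,a)$ pair, with $\alpha = 1/i$:

```python
# Running the Property 3 updates for one fixed (x, a) pair.
import numpy as np

rng = np.random.default_rng(3)
num_y = 4
p_true = rng.dirichlet(np.ones(num_y))      # true transition probabilities P_xy[a]
r_mean = rng.normal(size=num_y)             # mean reward given each successor y

p_est = np.zeros(num_y)                     # P^{(n)}_{xy}[a]
r_est = 0.0                                 # R^{(n)}_x(a)
for i in range(1, 100001):                  # i-th occurrence of (x, a)
    alpha = 1.0 / i
    y = rng.choice(num_y, p=p_true)
    r = r_mean[y] + rng.normal(scale=0.1)
    indicator = np.eye(num_y)[y]            # X_n(x, a, .) as a one-hot vector
    p_est += alpha * (indicator - p_est)
    r_est += alpha * (r - r_est)

print(np.abs(p_est - p_true).max())         # -> 0
print(abs(r_est - p_true @ r_mean))         # -> 0
```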


Lemma 2
Now consider $s$ Markov chains that share the same state and action spaces (they differ only in their state transition probabilities). Suppose in the $i$th chain, the expected reward from taking action $a$ in state $x$ is $R^i_x(a)$, and the transition probability to state $y$ is $p^i_{xy}[a]$. At each step, the index of the Markov chain used increases by 1.

If, for a given $\eta > 0$ and all $a, i, x, y$,

$$
|R_x^i(a)-R_x(a)| < \eta, \quad |p_{xy}^i[a]-p_{xy}[a]| < \frac{\eta}{R}
$$

(where $R_x(a)$ and $p_{xy}[a]$ are the corresponding quantities of a reference chain, not among the $s$ chains, and $R \ge \sup_{x,a}|R_x(a)|$), then for any state $x$, the $s$-step value estimate $\overline{Q}'(x,a_1,\dots,a_s)$ (given action sequence $a_1,\dots,a_s$) satisfies:

$$
|\overline{Q}'(x,a_1,\dots,a_s)-\overline{Q}(x,a_1,\dots,a_s)| < \eta,
$$

where $\overline{Q}(x,a_1,\dots,a_s)$ is the value on the new chain.

Proof:
Suppose that the MDP has $n$ states and let

$$
\overline{Q}(x,a_1,a_2) = R_x(a_1) + \gamma \sum_y p_{xy}[a_1] R_y(a_2)
$$

and

$$
\overline{Q}'(x,a_1,a_2) = R^1_x(a_1) + \gamma \sum_y p^1_{xy}[a_1] R^2_y(a_2)
$$

Then,

$$
|\overline{Q}(x,a_1,a_2) - \overline{Q}'(x,a_1,a_2)| \le |R_x(a_1) - R_x^1(a_1)| + \gamma \sum_y |p_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R^2_y(a_2)|
$$

$$
< \eta + \gamma \sum_y \left|p_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R_y(a_2) + p^1_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R^2_y(a_2)\right|
$$

$$
\le \eta + \gamma \sum_y \left|p_{xy}[a_1] - p^1_{xy}[a_1]\right||R_y(a_2)| + \gamma \sum_y p^1_{xy}[a_1]|R_y(a_2) - R^2_y(a_2)|
$$

$$
\le \eta + \gamma \sum_y \frac{\eta}{R}R + \gamma\eta = \eta + \gamma(n\eta) + \gamma\eta = \eta(1 + \gamma + n\gamma) < \eta(n+2)
$$

Since $\eta$ is arbitrary, we can rescale it (run the argument with $\eta/(n+2)$ in place of $\eta$ in the hypotheses) and treat the bound as $|\overline{Q}(x,a_1,a_2) - \overline{Q}'(x,a_1,a_2)| < \eta$.

Thus, by mathematical induction (Hint: the $Q$-values are bounded, and the recurrence is similar to a Bellman equation but with $R$ replaced by corresponding $Q$-values), we get:

$$
|\overline{Q}'(x,a_1,\dots,a_s)-\overline{Q}(x,a_1,\dots,a_s)|<\eta
$$

Q.E.D.
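
A rough numerical check of the two-step case is below (my own sketch: a random chain, crude perturbations within $\eta$ and $\eta/R$, and no renormalisation of the perturbed transition rows); the worst-case gap indeed stays below the $\eta(n+2)$ bound derived above.

```python
# Two-step case of Lemma 2: perturbed vs reference value estimates.
import numpy as np

rng = np.random.default_rng(4)
ns, na, gamma, eta = 5, 3, 0.9, 1e-3

R = rng.uniform(-1, 1, size=(ns, na))                 # reference rewards R_x(a)
Rmax = np.abs(R).max()                                # R >= sup |R_x(a)|
P = rng.dirichlet(np.ones(ns), size=(ns, na))         # reference transitions p_xy[a]

# Chains 1 and 2: rewards within eta, transitions within eta / R (crudely clipped).
R1 = R + rng.uniform(-eta, eta, size=R.shape)
R2 = R + rng.uniform(-eta, eta, size=R.shape)
P1 = np.clip(P + rng.uniform(-eta / Rmax, eta / Rmax, size=P.shape), 0, None)

worst = 0.0
for x in range(ns):
    for a1 in range(na):
        for a2 in range(na):
            qbar  = R[x, a1]  + gamma * P[x, a1]  @ R[:, a2]
            qbar1 = R1[x, a1] + gamma * P1[x, a1] @ R2[:, a2]
            worst = max(worst, abs(qbar - qbar1))
print(worst, eta * (ns + 2))        # worst gap stays below eta * (n + 2)
```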


Theorem: Convergence of the Q-Learning Algorithm

Suppose the Markov Decision Process (MDP) is finite, and the exploration strategy never terminates from any state and assigns a non-zero probability to every action in every state. Suppose the learning rates lie in $(0,1)$, such that their series diverges and the series of their squares converges. Then the Q-learning algorithm converges to the optimal action values with probability 1.

My attempt:

Now, in a finite MDP, under the given exploration strategy, one can construct a non-terminating MDP instance along with the sequence $Q_n(x,a)$. Thus, the corresponding action replay process (ARP) can be defined. By Property 1, we have $\forall n,x,a,~ Q^*_{ARP}(\langle x,n\rangle,a) = Q_n(x,a)$. Therefore, we next analyze $Q^*_{ARP}(\langle x,n\rangle,a)$.

By definition, $Q_0(x,a) = 0 < \frac{M}{1 - \gamma}$, where $M$ bounds the absolute values of the rewards (without loss of generality, let $M \ge 1$). For any $\epsilon > 0$, there exists $s > 0$ such that $\gamma^s \cdot \frac{M}{1 - \gamma} < \frac{\epsilon}{6}$.

By Property 3, the following holds:

$$
P\left(\left\{\lim_{n\to\infty} P_{xy}^{(n)}[a] = P_{xy}[a]\right\} \cap \left[\lim_{n\to\infty} \mathfrak{R}_{x}^{(n)}(a) = \mathfrak{R}_{x}(a)\right]\right) = 1,
$$

uniformly for all $x, y, a$. Therefore, with probability 1, there exists $l > 0$ such that for all $n > l$, and for all $x, y, a$:

$$
|P_{xy}^{(n)}[a] - P_{xy}[a]| < \frac{\epsilon}{M},\quad |\mathfrak{R}_{x}^{(n)}(a) - \mathfrak{R}_{x}(a)| < \epsilon.
$$

According to Property 2, there exists $h > l + 1$, such that for all $n_1 > h$:

$$
P(n_{s+1} < l + 1) < \min\left\{\frac{\epsilon(1 - \gamma)}{6Ms}, \frac{\epsilon}{3s(s + 1)M}\right\},
$$

and hence:

$$
P(n_{s+1} \ge l + 1) > 1 - \min\left\{\frac{\epsilon(1 - \gamma)}{6Ms}, \frac{\epsilon}{3s(s + 1)M}\right\}.
$$

Now consider the case where $n_1 > \cdots > n_s > n_{s+1} \ge l + 1 > l$. Then for all $x, y, a$ and $1 \le i \le s+1$, since $n_i > l$, we have:

$$
|P_{xy}^{(n_i)}[a] - P_{xy}[a]| < \frac{\epsilon}{M},\quad |\mathfrak{R}_{x}^{(n_i)}(a) - \mathfrak{R}_{x}(a)| < \epsilon.
$$

By Lemma 2:

$$
|\overline{Q}_{ARP}(\langle x,n_1\rangle,a_1,\dots,a_s) - \overline{Q}(x,a_1,\dots,a_s)| < \epsilon, \quad \forall a_1,\dots,a_s,x.
$$

When $s$ is sufficiently large:

$$
|Q_{n_1}(x,a) - \overline{Q}(x,a,\dots,a_s)| = |Q^*_{ARP}(\langle x,n_1\rangle,a) - \overline{Q}(x,a,\dots,a_s)| \le \epsilon, \quad \forall a,x.
$$

Therefore:
$$
|Q^*_{ARP}(\langle x,n_1\rangle,a_1)-Q^*(x,a_1)|\le|Q^*_{ARP}(\langle x,n_1\rangle,a_1)-\overline{Q}_{ARP}(\langle x,n_1\rangle,a_1,\dots,a_s)|+|\overline{Q}_{ARP}(\langle x,n_1\rangle,a_1,\dots,a_s)-\overline{Q}(x,a_1,\dots,a_s)|+|\overline{Q}(x,a_1,\dots,a_s)-Q^*(x,a_1)|
$$

(Note that they share the same $a_1,\dots,a_s$, so I think $|Q^*_{ARP}(\langle x,n_1\rangle,a_1)-\overline{Q}_{ARP}(\langle x,n_1\rangle,a_1,\dots,a_s)|<\frac{\epsilon}{6}$ and $|\overline{Q}(x,a_1,\dots,a_s)-Q^*(x,a_1)|<\frac{\epsilon}{6}$ may not always hold at the same time. Is there any evidence that both can be made to hold simultaneously, or should these lemmas and properties be used in a different way?)
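
Finally, although it does not answer the question above, an end-to-end experiment shows the theorem's conclusion in action. The sketch assumes a small random finite MDP, a uniformly random (hence non-terminating, every-action-positive-probability) exploration policy, and learning rates $\alpha = (\text{visit count})^{-0.7}$, which satisfy the divergence and square-summability conditions; all names are illustrative.

```python
# Tabular Q-learning on a random finite MDP versus Q* from value iteration.
import numpy as np

rng = np.random.default_rng(5)
ns, na, gamma = 4, 2, 0.8
P = rng.dirichlet(np.ones(ns), size=(ns, na))       # transition kernel P_xy[a]
R = rng.uniform(0, 1, size=(ns, na))                # expected rewards R_x(a)

# Q* by value iteration: the target the theorem says Q_n converges to.
q_star = np.zeros((ns, na))
for _ in range(1000):
    q_star = R + gamma * P @ q_star.max(axis=1)

# Tabular Q-learning under uniformly random exploration.
q = np.zeros((ns, na))
visits = np.zeros((ns, na))
x = 0
for _ in range(500_000):
    a = rng.integers(na)                            # every action has non-zero probability
    y = rng.choice(ns, p=P[x, a])
    r = R[x, a] + rng.normal(scale=0.1)             # noisy reward with the right mean
    visits[x, a] += 1
    alpha = visits[x, a] ** -0.7                    # sum alpha = inf, sum alpha^2 < inf
    q[x, a] += alpha * (r + gamma * q[y].max() - q[x, a])
    x = y

print(np.abs(q - q_star).max())                     # small, and shrinking as steps grow
```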
