14 Convergences I: almost sure, \(L_2\), \(L_1\), in probability
14.1 Motivations
We need to put topological structures on the space of random variables living on some probability space. As random variables are (measurable) functions, we shall borrow and adapt notions used in Analysis: pointwise convergence (Section 14.2) and convergence in \(L_p\), \(1 \leq p <\infty\) (Section 14.3).
Finally, we define and investigate convergence in probability. This notion weakens both \(L_p\) and almost sure (pointwise) convergence. Just like \(L_p\) convergence, it can be metrized.
Convergence in probability and almost sure convergence are illustrated by the weak and strong laws of large numbers (Section 14.5 and Section 14.6). Laws of large numbers assert that, under mild conditions, empirical means converge towards expectations; they are the workhorses of statistical learning theory.
In Section 14.7, we look at non-asymptotic counterparts of the weak law of large numbers: we establish exponential tail bounds for sums of independent random variables (under stringent integrability assumptions).
14.2 Almost sure convergence
The notion of almost sure convergence mirrors the notion of pointwise convergence in probabilistic settings.
Recall that a sequence of real-valued functions \((f_n)_n\) mapping some space \(\Omega\) to \(\mathbb{R}\) converges pointwise to \(f: \Omega \to \mathbb{R}\), if for each \(\omega \in \Omega\), \(f_n(\omega) \to f(\omega)\). There is no uniformity condition.
In the next definition, we assume that random variables are real-valued. The definition is easily extended to multivariate settings.
Almost sure convergence is (just) pointwise convergence with probability \(1\). Almost sure convergence is not tied to integrability. Note that all random variables involved in the above statements live on the same probability space. We may wonder whether we can design a metric for almost sure convergence. The answer is, in general, no, just as for pointwise convergence.
14.3 Convergence in \(L_p\)
In this section, we consider random variables that satisfy integrability assumptions. The scope of \(L_p\) convergence is thus narrower than the scope of almost sure convergence.
We already introduced \(L_p\) convergence in Chapter 3. We recall it for the sake of readability.
Recall that \(L_p\) spaces are nested (by Hölder’s inequality) and complete.
Almost sure convergence is not tied to integrability, so we cannot expect almost sure convergence to imply \(L_p\) convergence in general. But we can ask whether \(L_p\) convergence implies almost sure convergence. The next statement is a by-product of the proof of the completeness of \(L_p\) spaces, see Section 4.7.
A counter-example given in Section 4.7 shows that convergence in \(L_p\) does not imply almost sure convergence.
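To make the distinction concrete, here is a small simulation sketch of a classical “sliding blocks” sequence on \(([0,1], \mathcal{B}([0,1]), \text{Lebesgue})\) (not necessarily the counter-example of Section 4.7): the \(L_1\) norms \(\mathbb{E}|X_n|\) tend to \(0\), yet every sample path takes the value \(1\) infinitely often, so there is no almost sure convergence.

```python
# Sliding-blocks indicators on [0, 1]: for n = 2^k + j, X_n is the indicator of
# [j / 2^k, (j + 1) / 2^k). Then E|X_n| = 2^{-k} -> 0 (convergence in L_1, and in
# every L_p), but X_n(omega) = 1 for one n in each dyadic level, for every omega.
import numpy as np

def X(n, omega):
    """n-th sliding-block indicator evaluated at omega, n = 2^k + j with 0 <= j < 2^k."""
    k = int(np.floor(np.log2(n)))
    j = n - 2**k
    return float(j / 2**k <= omega < (j + 1) / 2**k)

rng = np.random.default_rng(0)
omega = rng.uniform()                                  # one sample point
values = [X(n, omega) for n in range(1, 2**12)]
print("E|X_1000| = 2^{-9} =", 2.0 ** -int(np.log2(1000)))
print("X_n(omega) still hits 1 for some n > 1000:", any(v == 1.0 for v in values[1000:]))
```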
14.4 Convergence in probability
If we denote by \(L_0=L_0(\Omega, \mathcal{F}, P)\) the set of real-valued random variables, then, like almost sure convergence, the notion of convergence in probability is relevant to all sequences in \(L_0\). And like convergence in \(L_p\), \(p\geq 1\), convergence in probability can be metrized.
This is an immediate consequence of Markov’s inequality.
Proof. Assuming convergence in probability, for every \(\epsilon > 0\), \[\begin{array}{rl} \mathbb{E} \Big[ 1 \wedge |X_n -X|\Big] & \leq \mathbb{E} \Big[ (1 \wedge |X_n -X|)\mathbb{I}_{|X-X_n| \geq \epsilon}\Big] + \mathbb{E} \Big[ (1 \wedge |X_n -X|)\mathbb{I}_{|X-X_n| < \epsilon}\Big] \\ & \leq P \Big\{|X-X_n| \geq \epsilon \Big\} + \epsilon\end{array}\] and the limsup of the right-hand side is not larger than \(\epsilon\). As we can take \(\epsilon\) arbitrarily small, this entails that the limit of the left-hand side is zero.
Conversely, for all \(0< \epsilon< 1\) \[\begin{array}{rl} P \Big\{|X-X_n| \geq \epsilon \Big\} & \leq \frac{1}{\epsilon} \mathbb{E}\Big[ 1 \wedge |X-X_n|\Big] \,. \end{array}\] Hence \(\lim_n \mathbb{E} \Big[ 1 \wedge |X_n -X|\Big] = 0\) entails \(\lim_n P \big\{|X-X_n| \geq \epsilon \big\} =0\). As this holds for all \(\epsilon>0\), \(\lim_n \mathbb{E} \Big[ 1 \wedge |X_n -X|\Big] = 0\) entails convergence in Probability.
Proof. Assume \(X_n \to X\) a.s., that is \(|X_n -X| \to 0\) almost surely. As \(|X_n - X| \wedge 1\) is dominated by the constant \(1\), dominated convergence yields \[\lim_n \mathbb{E}\Big[ |X_n -X| \wedge 1\Big] = 0\] which entails convergence in probability of \((X_n)_n\) towards \(X\).
Now, we come to a metric which fits perfectly with convergence in probability.
Note that we have to check that \(\mathrm{d}_{\mathrm{KF}}\) is indeed a distance. This is the content of the next proposition.
Proof. Write \(\epsilon = \mathrm{d}_{\mathrm{KF}}(X, Y)\) and, for \(a \geq 0\), let \(A_a = \Big\{ |X-Y| > a \Big\}\). For every \(a > \epsilon\), by the definition of \(\mathrm{d}_{\mathrm{KF}}\) as an infimum, there is some \(t \in [\epsilon, a)\) with \(P(A_t) \leq t\); as \(A_a \subseteq A_t\), the event \(A_a\) has probability at most \(t < a\). Moreover, if \(\epsilon < a < b\), then \(A_b \subseteq A_a\), and \(A_\epsilon = \cup_n A_{\epsilon + 1/n}\). By monotone convergence (continuity from below), \[P\Big(A_\epsilon\Big) = \lim_{n} \uparrow P\Big(A_{\epsilon + 1/n}\Big) \leq \lim_n \Big(\epsilon + \tfrac{1}{n}\Big) = \epsilon \, ,\] that is, the infimum in the definition of \(\mathrm{d}_{\mathrm{KF}}\) is attained.
Proof. We check that \(\mathrm{d}_{\mathrm{KF}}\) satisfies the triangle inequality. By the preceding proposition, there exist two events \(B\) and \(C\), with respective probabilities at most \(\mathrm{d}_{\mathrm{KF}}(X, Y)\) and \(\mathrm{d}_{\mathrm{KF}}(Y, Z)\), such that \[|X(\omega) -Y(\omega)| \leq \mathrm{d}_{\mathrm{KF}}(X, Y) \qquad \text{on } B^c\] and \[|Z(\omega) -Y(\omega)| \leq \mathrm{d}_{\mathrm{KF}}(Z, Y) \qquad \text{on } C^c\,.\]
On \(B^c \cap C^c\), by the triangle inequality on \(\mathbb{R}\): \[|X(\omega) - Z(\omega)| \leq \mathrm{d}_{\mathrm{KF}}(X, Y) + \mathrm{d}_{\mathrm{KF}}(Y, Z)\, .\]
We conclude by observing \[\begin{array}{rl} P \Big( |X(\omega) - Z(\omega)| > \mathrm{d}_{\mathrm{KF}}(X, Y) + \mathrm{d}_{\mathrm{KF}}(Y, Z) \Big) & \leq P\Big( (B^c \cap C^c)^c\Big)\\ & = P(B \cup C) \\ & \leq P(B) + P(C) \\ & \leq \mathrm{d}_{\mathrm{KF}}(X, Y) + \mathrm{d}_{\mathrm{KF}}(Y, Z) \, . \end{array}\] By the definition of \(\mathrm{d}_{\mathrm{KF}}\) as an infimum, this entails \(\mathrm{d}_{\mathrm{KF}}(X, Z) \leq \mathrm{d}_{\mathrm{KF}}(X, Y) + \mathrm{d}_{\mathrm{KF}}(Y, Z)\).
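Assuming the definition used above, \(\mathrm{d}_{\mathrm{KF}}(X, Y) = \inf\big\{\epsilon \geq 0 : P\{|X-Y| > \epsilon\} \leq \epsilon\big\}\), the Ky Fan distance can be approximated by simulation. A minimal Monte-Carlo sketch; the pair \((X, Y)\), the sample size and the grid resolution are arbitrary choices.

```python
# Monte-Carlo estimate of the Ky Fan distance between two coupled random variables.
import numpy as np

def ky_fan_estimate(x, y, grid_size=1000):
    """Estimate d_KF(X, Y) from paired samples x, y."""
    d = np.sort(np.abs(x - y))
    n = d.size
    eps_grid = np.linspace(0.0, max(d[-1], 1.0), grid_size)
    # empirical P(|X - Y| > eps), read off the sorted sample
    tail = 1.0 - np.searchsorted(d, eps_grid, side="right") / n
    return eps_grid[np.argmax(tail <= eps_grid)]       # first eps with tail <= eps

rng = np.random.default_rng(1)
m = 100_000
x = rng.normal(size=m)
y = x + 0.1 * rng.normal(size=m)                       # Y is a small perturbation of X
print("estimated d_KF(X, Y) ~", ky_fan_estimate(x, y))
```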
We leave the following questions as exercises:
- Is \(\mathcal{L}_0(\Omega, \mathcal{F}, P)\) complete under the Ky-Fan metric?
- Does convergence in probability imply almost sure convergence?
- Does convergence in probability imply convergence in \(L_p, p\geq 1\)?
Finally, we state a more general definition of convergence in probability. The notion can be tailored to random variables that map some universe to some metric space. The connections with almost sure convergence and \(L_p\) convergence remain unchanged.
14.5 Weak law of large numbers
The weak and the strong law of large numbers are concerned with the convergence of empirical means of independent, identically distributed, integrable random variables towards their common expectation.
Proof. Assume first that \(\mathbb{E}\Big[\Big(X_i-\mu\Big)^2\Big] = \sigma^2 < \infty.\) Then, for all \(\epsilon>0\), by the Markov-Chebychev inequality: \[\begin{array}{rl} P\Big\{ \Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big| > \epsilon\Big\} & \leq \frac{\mathbb{E} \Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big|^2 }{\epsilon^2} \\ & = \frac{\mathbb{E}\Big[\Big(X_i-\mu\Big)^2\Big] }{n \epsilon^2} \\ & = \frac{\sigma^2}{n \epsilon^2} \end{array}\] because the variance of a sum of independent random variables equals the sum of the variances of the summands.
The right-hand side converges to \(0\) for all \(\epsilon >0\). The WLLN holds for square-integrable random variables.
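A quick empirical check of the Chebychev bound \(P\big\{\big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\big| > \epsilon\big\} \leq \sigma^2/(n\epsilon^2)\); the distribution, \(n\), \(\epsilon\) and the number of repetitions are arbitrary choices.

```python
# Frequency of large deviations of the empirical mean versus the Chebychev bound.
import numpy as np

rng = np.random.default_rng(2)
n, eps, reps = 1_000, 0.1, 10_000
samples = rng.exponential(scale=1.0, size=(reps, n))    # mu = 1, sigma^2 = 1
deviation_freq = (np.abs(samples.mean(axis=1) - 1.0) > eps).mean()
chebychev_bound = 1.0 / (n * eps**2)
print(f"empirical P(|mean - mu| > {eps}) ~ {deviation_freq:.4f}, Chebychev bound {chebychev_bound:.4f}")
```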
Let us turn to the general case. Without loss of generality, assume all the \(X_n\) are centered. Let \(\tau >0\) be a truncation threshold (whose value will be tuned later). For each \(i \in \mathbb{N}\), \(X_i\) is decomposed into a sum: \[X_i = X^\tau_i + Y^\tau_i\]
with \(X^\tau_i = \mathbb{I}_{|X_i|\leq \tau} X_i\) and \(Y^\tau_i = \mathbb{I}_{|X_i|>\tau} X_i\). For every \(\epsilon >0\),
\[\Big\{ \Big|\frac{1}{n}\sum_{i=1}^n X_i \Big| >\epsilon\Big\} \subseteq \Big\{ \Big|\frac{1}{n}\sum_{i=1}^n X^\tau_i \Big| > \frac{\epsilon}{2}\Big\} \cup \Big\{ \Big|\frac{1}{n}\sum_{i=1}^n Y^\tau_i \Big| >\frac{\epsilon}{2}\Big\} \, .\]
Invoking the union bound, Markov’s inequality (twice), and the boundedness of the variances of the \(X^\tau_i\)’s leads to:
\[\begin{array}{rl} P\Big\{ \Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big| > \epsilon\Big\} & \leq P \Big\{ \Big|\frac{1}{n}\sum_{i=1}^n X^\tau_i \Big| > \frac{\epsilon}{2}\Big\} + P \Big\{ \Big|\frac{1}{n}\sum_{i=1}^n Y^\tau_i \Big| >\frac{\epsilon}{2}\Big\} \\ & \leq 4 \frac{\mathbb{E}\Big|\frac{1}{n}\sum_{i=1}^n X^\tau_i \Big|^2}{\epsilon^2} + 2 \frac{\mathbb{E}\Big|\frac{1}{n}\sum_{i=1}^n Y^\tau_i \Big|}{\epsilon} \\ & \leq \frac{4 \tau^2}{n\epsilon^2} + 2 \frac{\mathbb{E}\Big|\frac{1}{n}\sum_{i=1}^n Y^\tau_i \Big|}{\epsilon} \\ & \leq \frac{4 \tau^2}{n\epsilon^2} + 2 \frac{1}{n}\sum_{i=1}^n \frac{\mathbb{E}\Big|Y^\tau_i \Big|}{\epsilon} \\ & \leq \frac{4 \tau^2}{n\epsilon^2} + 2 \frac{\mathbb{E} \Big|Y^\tau_1 \Big|}{\epsilon} \, . \end{array}\]
Taking \(n\) to infinity leads to \[\limsup_n P\Big\{ \Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big| > \epsilon\Big\} \leq 2 \frac{\mathbb{E}\Big|Y^\tau_1 \Big|}{\epsilon} \, .\]
Now, as \(\tau \uparrow \infty\), \(|Y^\tau_1| \downarrow 0\) while \(|Y^\tau_1| \leq |X_1|\); dominated convergence (here a special case of monotone convergence) guarantees that \(\lim_{\tau \uparrow \infty} \frac{\mathbb{E}\Big|Y^\tau_1 \Big|}{\epsilon}=0\).
This completes the proof of the WLLN.
14.6 Strong law of large numbers
The relevant setting is an infinite product space, endowed with the cylinder \(\sigma\)-algebra and the infinite product distribution.
Recall the first Borel-Cantelli Lemma: if \((A_k)_{k \in \mathbb{N}}\) is a sequence of events with \(\sum_{k} P(A_k) < \infty\), then with probability \(1\) only finitely many of the \(A_k\) occur.
Proof. An outcome \(\omega\) belongs to infinitely many events \(A_k\) iff \(\omega \in \cap_{n} \cup_{k\geq n} A_k\). By monotone convergence (continuity from above), and since the tails of the convergent series \(\sum_k P(A_k)\) vanish, \[\begin{array}{rl} P \Big\{ \omega : \omega \text{ belongs to infinitely many events } A_k\Big\} & = P \Big\{ \cap_{n} \cup_{k\geq n} A_k \Big\} \\ & = \lim_n \downarrow P \Big\{ \cup_{k\geq n} A_k \Big\} \\ & \leq \lim_n \downarrow \sum_{k \geq n} P \Big\{ A_k \Big\} \\ & = 0 \, . \end{array}\]
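A small simulation sketch: with \(A_k = \{U_k \leq 1/k^2\}\) for independent uniform \(U_k\), the series \(\sum_k P(A_k)\) converges, so almost surely only finitely many \(A_k\) occur. The summable choice \(1/k^2\) is arbitrary.

```python
# First Borel-Cantelli Lemma: along each run, only a handful of the events occur,
# and none occurs for large k.
import numpy as np

rng = np.random.default_rng(3)
k = np.arange(1, 1_000_001)
for run in range(3):
    occurred = rng.uniform(size=k.size) <= 1.0 / k**2
    last = int(k[occurred][-1]) if occurred.any() else 0
    print(f"run {run}: {int(occurred.sum())} events occurred, the last one at k = {last}")
```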
Proof (Proof of SLLN (direct part)). The event \(\Big\{ \omega : \lim_n \sum_{i=1}^n \frac{X_i}{n} = \mu \Big\}\) belongs to the tail \(\sigma\)-algebra. By Kolmogorov’s \(0\)-\(1\) law (stated at the end of this section), to check the Strong Law of Large Numbers, it suffices to check that this event has non-zero probability.
Moreover, using the usual decomposition \(X = (X)_+ - (X)_-\) where \((X)_+\) and \((X)_-\) are the positive and negative parts of \(X\), we observe that we can assume without loss of generality that \(X_i\)’s are non-negative.
Recall the definition of truncated variables \(X_i^i = \mathbb{I}_{X_i \leq i}X_i\) for \(i \in \mathbb{N}\). Let \(S_n = \sum_{i=1}^n X_i\) and \(T_n = \sum_{i=1}^n X_i^i\).
The difference \(S_n - T_n = \sum_{i=1}^n (X_i - X^i_i)\) is a sum of non-negative random variables. As \[P \{ X_i - X^i_i >0 \} = P\{ X_i >i \} = P\{ X_1 > i\} \, ,\]
thanks to \(\mathbb{E} X_1 < \infty\) (recall that \(\sum_{i \in \mathbb{N}} P\{X_1 > i\} \leq \mathbb{E} X_1\) for non-negative \(X_1\)), \[\sum_{i \in \mathbb{N}} P \{ X_i - X^i_i >0 \} < \infty \, .\]
By the first Borel-Cantelli Lemma, this implies that almost surely, only finitely many events \(\{ X_i - X^i_i >0 \}\) are realized. Hence almost surely, \(T_n\) and \(S_n\) differ by at most a bounded number of summands, and \(\lim_n \uparrow (S_n - T_n)\) is finite.
Now \[\lim_n \uparrow \mathbb{E} \frac{T_n}{n} = \mathbb{E} X_1 \, ,\] since, by monotone convergence, \(\mathbb{E}\big[X_1 \mathbb{I}_{X_1 \leq i}\big] \uparrow \mathbb{E} X_1\) as \(i \uparrow \infty\), and Cesàro averaging preserves the limit.
We shall first check that \(T_{n(k)}/n(k)\) converges almost surely towards \(\mathbb{E} X_1\) for some (almost) geometrically increasing subsequence \((n(k))_{k \in \mathbb{N}}\).
Fix \(\alpha>1\) and let \(n(k) = \lfloor \alpha^k \rfloor\). If for all \(\epsilon>0\), almost surely, only finitely many events \[\Big\{ \Big|T_{n(k)} - \mathbb{E}T_{n(k)} \Big| \geq n(k) \epsilon \Big\}\] occur, then \(\Big|T_{n(k)} - \mathbb{E}T_{n(k)} \Big|/n(k)\) converges almost surely to \(0\) and thus \(T_{n(k)}/n(k)\) converges almost surely to \(\mathbb{E}X_1\).
Let \[\Theta = \sum_{k\in \mathbb{N}} P\Big\{ \Big|T_{n(k)} - \mathbb{E}T_{n(k)} \Big| \geq n(k) \epsilon \Big\} \, .\]
Thanks to truncation, each \(T_{n(k)}\) is square-integrable. By Chebychev’s inequality: \[P\Big\{ \Big|T_{n(k)} - \mathbb{E}T_{n(k)} \Big| \geq n(k) \epsilon \Big\} \leq \frac{\operatorname{var}(T_{n(k)})}{\epsilon^2 n(k)^2} \, .\]
As \(X_i^i\)’s are independent, \[\begin{array}{rl} \operatorname{var}(T_{n(k)}) & = \sum_{i \leq n(k)} \operatorname{var}(X_i^i) \\ & \leq \sum_{i \leq n(k)} \mathbb{E}\Big[(X_i^i)^2\Big] \\ & = \sum_{i \leq n(k)} \int_0^\infty 2 t P \{ X^i_i >t \} \mathrm{d}t \\ & \leq \sum_{i \leq n(k)} \int_0^i 2 t P \{ X_1 >t \} \mathrm{d}t \, . \end{array}\]
\[\begin{array}{rl} \Theta & \leq \sum_{k\in \mathbb{N}} \frac{1}{\epsilon^2 n(k)^2}\sum_{i \leq n(k)} \int_0^i 2 t P \{ X_1 >t \} \mathrm{d}t \\ & = \frac{1}{\epsilon^2} \sum_{i \in \mathbb{N}} \int_0^i 2 t P \{ X_1 >t \} \mathrm{d}t \sum_{k: n(k)\geq i} \frac{1}{n(k)^2} \, . \end{array}\] Thanks to the fact that \(\lfloor x \rfloor \geq x/2\) for \(x \geq 1\), and by summing a geometric series, the following holds: \[ \sum_{k: n(k)\geq i} \frac{1}{n(k)^2} = \sum_{k: \lfloor \alpha^k \rfloor \geq i} \frac{1}{\lfloor \alpha^k \rfloor^2} \leq \frac{4}{i^2} \frac{\alpha^2}{\alpha^2- 1} \, .\]
\[\begin{array}{rl} \Theta & \leq \frac{4\alpha^2}{\epsilon^2(\alpha^2-1)} \sum_{i \in \mathbb{N}} \frac{1}{i^2} \int_0^i 2 t P \{ X_1 >t \} \mathrm{d}t \\ & \leq \frac{4\alpha^2}{\epsilon^2(\alpha^2-1)} \sum_{i \in \mathbb{N}} \frac{1}{i^2} \sum_{j<i} \int_{j}^{j+1} 2t P \{ X_1 >t \} \mathrm{d}t \\ & \leq \frac{4\alpha^2}{\epsilon^2(\alpha^2-1)} \sum_{j=0}^\infty \int_{j}^{j+1} 2t P \{ X_1 >t \} \mathrm{d}t \sum_{i >j} \frac{1}{i^2} \\ & \leq \frac{4\alpha^2}{\epsilon^2(\alpha^2-1)} \sum_{j=0}^\infty \int_{j}^{j+1} 2t P \{ X_1 >t \} \mathrm{d}t \frac{2}{j\vee 1} \\ & \leq 8\frac{4\alpha^2}{\epsilon^2(\alpha^2-1)} \sum_{j=0}^\infty \int_{j}^{j+1} P \{ X_1 >t \} \mathrm{d}t \\ & \leq 8\frac{4\alpha^2}{\epsilon^2(\alpha^2-1)} \mathbb{E} X_1 \\ & < \infty \, . \end{array}\] By the first Borel-Cantelli Lemma, with probability \(1\), only finitely many events \[\Big\{ \Big|T_{n(k)} - \mathbb{E}T_{n(k)} \Big| \geq n(k) \epsilon \Big\}\] occur. As this holds for each \(\epsilon>0\), it holds simultaneously for all \(\epsilon= 1/m\), \(m \in \mathbb{N}\), which implies that \(\Big|T_{n(k)} - \mathbb{E}T_{n(k)} \Big|/n(k)\) converges almost surely to \(0\). Combined with \(\lim_n \uparrow \mathbb{E}\frac{T_n}{n} = \mathbb{E}X_1\) and the fact that \(S_n - T_n\) is almost surely bounded, this implies that \(S_{n(k)}/n(k)\) converges almost surely to \(\mathbb{E}X_1\).
To complete the proof, we need to check that this holds for \(S_n/n\).
If \(n(k) \leq n < n(k+1)\), as \((S_n)_n\) is non-decreasing, \[\frac{n(k)}{n(k+1)}\frac{S_{n(k)}}{n(k)}\leq \frac{S_n}{n}\leq \frac{n(k+1)}{n(k)}\frac{S_{n(k+1)}}{n(k+1)}\] with \[\frac{1}{\alpha} \Big(1 - \frac{1}{\alpha^k} \Big)\leq \frac{n(k)}{n(k+1)} \qquad \text{and} \qquad \frac{n(k+1)}{n(k)} \leq \alpha \left(1 + \frac{1}{\lfloor \alpha^k\rfloor}\right) \, .\]
Taking \(k\) to infinity, almost surely \[\frac{1}{\alpha} \mathbb{E} X_1 \leq \liminf_n \frac{S_n}{n} \leq \limsup_n \frac{S_n}{n} \leq \alpha \mathbb{E}X_1 \, .\]
Finally, we may choose \(\alpha\) arbitrarily close to \(1\), to establish the desired result.
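A simulation sketch of the Strong Law of Large Numbers for an integrable but heavy-tailed distribution (infinite variance, so the square-integrable argument of the WLLN proof does not apply directly); the Pareto index and the sample size are arbitrary choices.

```python
# Sample paths of S_n / n for i.i.d. Pareto(alpha) variables with alpha = 1.5:
# finite mean alpha / (alpha - 1) = 3, infinite variance.
import numpy as np

rng = np.random.default_rng(4)
n, alpha = 200_000, 1.5
for path in range(3):
    x = (1.0 - rng.uniform(size=n)) ** (-1.0 / alpha)   # inverse-CDF sampling of Pareto(alpha)
    running_mean = np.cumsum(x) / np.arange(1, n + 1)
    print(f"path {path}: S_n/n at n = {n} is {running_mean[-1]:.3f} (E X_1 = {alpha / (alpha - 1):.3f})")
```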
Remark 14.1. In the statement of the Theorem, we can replace the independence assumption by a pairwise independence assumption.
Theorem 14.4 shows that, under the independence assumption, the conditions in Theorem 14.3 are tight. Before proceeding to the proof of Theorem 14.4, we state and prove the second Borel-Cantelli Lemma.
Proof. An outcome \(\omega\) does not belong to infinitely many events \(A_k\) iff \(\omega \in \cup_{n} \cap_{k\geq n} A^c_k\). By monotone convergence and by independence of the \(A_k\)’s, \[\begin{array}{rl} P \Big\{ \omega : \omega \text{ does not belong to infinitely many events } A_k\Big\} & = P \Big\{ \cup_{n} \cap_{k\geq n} A^c_k \Big\} \\ & = \lim_n \uparrow P \Big\{ \cap_{k\geq n} A^c_k \Big\} \\ & = \lim_n \uparrow \lim_{m \uparrow \infty } \downarrow P \Big\{ \cap_{k=n}^m A^c_k \Big\} \\ & = \lim_n \uparrow \lim_{m \uparrow \infty } \downarrow \prod_{k=n}^m \Big( 1 - P (A_k) \Big) \\ & = \lim_n \uparrow \prod_{k=n}^\infty \Big( 1 - P ( A_k ) \Big) \\ & \leq \lim_n \uparrow \exp\Big( - \sum_{k=n}^\infty P ( A_k)\Big) \\ & = \lim_n \uparrow 0 \\ & = 0 \, , \end{array}\] where the inequality uses \(1 - x \leq \mathrm{e}^{-x}\) and the next equality uses the divergence of \(\sum_{k} P(A_k)\).
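A companion simulation sketch for the divergent case: with independent events \(A_k = \{U_k \leq 1/k\}\), the series \(\sum_k P(A_k)\) diverges, so almost surely infinitely many \(A_k\) occur. The harmonic choice \(1/k\) is arbitrary.

```python
# Second Borel-Cantelli Lemma: events keep occurring, even for very large k.
import numpy as np

rng = np.random.default_rng(5)
k = np.arange(1, 1_000_001)
occurred = rng.uniform(size=k.size) <= 1.0 / k
print("number of events among the first 10^6:", int(occurred.sum()))   # around log(10^6) ~ 14
print("largest index at which an event occurred:", int(k[occurred].max()))
```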
We may assume that \(X_i\)’s are non-negative random variables.
Proof. In order to check that the \(X_i\)’s are integrable, it suffices to show that \[\sum_{n=0}^\infty P \big\{ X_1 > n \big\} = \sum_{n=0}^\infty P \big\{ X_n > n \big\} < \infty\, ,\] since \(\mathbb{E} X_1 \leq \sum_{n=0}^\infty P\big\{X_1 > n\big\}\) for non-negative \(X_1\).
Let \(S_n = \sum_{i=1}^n X_i\). Observe that \[\begin{array}{rl} \Big\{ \omega : X_{n+1}(\omega) > n+1 \Big\} & = \Big\{ \omega : S_{n+1}(\omega) - S_{n}(\omega) > n+1 \Big\} \\ & = \Big\{ \omega : \frac{S_{n+1}(\omega)}{n+1} - \frac{S_{n}(\omega)}{n} > 1 - \frac{S_{n}(\omega)}{n(n+1)} \Big\} \, . \end{array}\] Assume by contradiction that the \(X_i\)’s are not integrable. Then \(\sum_n P\{X_{n+1} > n+1\} = \infty\) and, as these events are independent, the second Borel-Cantelli Lemma entails that with probability \(1\), infinitely many events \[\Big\{ \omega : \frac{S_{n+1}}{n+1} - \frac{S_{n}}{n} > 1 - \frac{S_{n}}{n(n+1)} \Big\}\]
occur. But this cannot happen if \(S_n/n\) converges towards a finite limit: in that case, the left-hand side \(\frac{S_{n+1}}{n+1} - \frac{S_{n}}{n}\) tends to \(0\) while the right-hand side \(1 - \frac{S_{n}}{n(n+1)}\) tends to \(1\).
The law of large numbers is the cornerstone of consistency proofs.
Before shifting to non-asymptotic exponential inequalities, we point out a general result about events that depend on the limiting behavior of sequences of independent random variables.
Observe that the event that \(\frac{1}{n}\sum_{i=1}^n X_i\) converges towards a finite limit belongs to the tail \(\sigma\)-algebra. The Strong Law of Large Numbers tells us that, under integrability and independence assumptions, this tail event has probability \(1\). This is no accident. Kolmogorov’s \(0\)-\(1\) law asserts that, under independence, tail events have trivial probabilities (\(0\) or \(1\)).
Proof. It suffices to check that any event \(A \in \mathcal{T}\) satisfies \(P(A)^2 = P(A)\) (which forces \(P(A) \in \{0, 1\}\)), or equivalently that \(P(A) = P(A \cap A) = P(A) \times P(A)\), that is, \(A\) is independent of itself.
For any \(n\), as an event in \(\sigma\big(X_{n+1}, X_{n+2}, \ldots \big)\), \(A\) is independent from any event in \(\sigma\big(X_1, \ldots, X_n\big)\). But this entails that \(A\) is independent from any event in \(\cup_n \sigma\big(X_1, \ldots, X_n\big)\).
Observe that \(\cup_n \sigma\big(X_1, \ldots, X_n\big)\) is a \(\pi\)-system. Hence, \(A\) is independent from any event from the \(\sigma\)-algebra generated by \(\cup_n \sigma\big(X_1, \ldots, X_n\big)\), which happens to be \(\mathcal{F}\). As \(A \in \mathcal{T} \subset \mathcal{F}\), \(A\) is independent from itself.
14.7 Exponential inequalities
Laws of large numbers are asymptotic statements. In applications, for instance in Statistics and in Statistical Learning Theory, it is often desirable to have guarantees for fixed \(n\). Exponential inequalities are refinements of Chebychev’s inequality. Under strong integrability assumptions on the summands, it is possible and relatively easy to derive sharp tail bounds for sums of independent random variables.
14.7.1 Hoeffding’s Lemma
Let \(Y\) be a random variable taking values in a bounded interval \([a,b]\) and let \(\psi_Y(\lambda)=\log \mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}\). Then, for all \(\lambda \in \mathbb{R}\), \[\operatorname{var}(Y) \leq \frac{(b-a)^2}{4}\qquad \text{and} \qquad \psi_Y(\lambda) \leq \frac{\lambda^2}{2} \frac{(b-a)^2}{4} \, .\]
Proof. The upper bound on the variance of \(Y\) has been established in Section 4.4.
Now let \(P\) denote the distribution of \(Y\) and let \(P_{\lambda}\) be the probability distribution with density \[x \rightarrow e^{-\psi_{Y}\left( \lambda\right) }e^{\lambda (x - \mathbb{E}Y)}\]
with respect to \(P\).
Since \(P_{\lambda}\) is concentrated on \([a,b]\) (\(P_\lambda([a, b]) = P([a, b]) =1\)), the variance of a random variable \(Z\) with distribution \(P_{\lambda}\) is bounded by \((b-a)^2/4\). Note that \(P_0 = P\).
Dominated convergence arguments allow us to compute the derivatives of \(\psi_Y(\lambda)\). Namely \[\psi'_Y(\lambda) = \frac{\mathbb{E}\Big[ (Y- \mathbb{E}Y) e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}} = \mathbb{E}_{P_\lambda} Z\] and \[\psi^{\prime\prime}_Y(\lambda) = \frac{\mathbb{E}\Big[ (Y- \mathbb{E}{Y})^2 e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}} - \Bigg(\frac{\mathbb{E}\Big[ (Y- \mathbb{E}{Y}) e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}}\Bigg)^2 = \operatorname{var}_{P_\lambda}(Z) \, .\]
Hence, thanks to the variance upper bound: \[\begin{array}{rl} \psi_Y^{\prime\prime}(\lambda) & \leq \frac{(b-a)^2}{4}~. \end{array}\] Note that \(\psi_{Y}(0) = \psi_{Y}'(0) =0\), and by Taylor’s theorem, for some \(\theta\) between \(0\) and \(\lambda\), \[ \psi_Y(\lambda) = \psi_Y(0) + \lambda\psi_Y'(0) + \frac{\lambda^2}{2}\psi_Y''(\theta) \leq \frac{\lambda^2(b-a)^2}{8}~.\]
The upper bound on the variance is sharp in the special case of a Rademacher random variable \(X\) whose distribution is defined by \(P\{X =-1\} = P\{X =1\} = 1/2\). Then one may take \(b=-a=1\) and \(\operatorname{var}(X) =1=\left( b-a\right)^2/4\).
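A quick numerical sanity check of the bound \(\psi_Y(\lambda) \leq \lambda^2(b-a)^2/8\) for a Bernoulli random variable (values in \([a, b] = [0, 1]\)); the parameter \(p\) and the \(\lambda\) grid are arbitrary choices.

```python
# Hoeffding's Lemma for Y ~ Bernoulli(p): psi_Y(lambda) = log(p e^{lambda(1-p)} + (1-p) e^{-lambda p}).
import numpy as np

p = 0.3
lam = np.linspace(-5.0, 5.0, 201)
psi = np.log(p * np.exp(lam * (1 - p)) + (1 - p) * np.exp(-lam * p))
print("max of psi_Y(lambda) - lambda^2 / 8 over the grid:", float((psi - lam**2 / 8).max()))  # <= 0
```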
We can now build on Hoeffding’s Lemma to derive very practical tail bounds for sums of bounded independent random variables.
The proof is based on the so-called Cramer-Chernoff bounding technique and on Hoeffding’s Lemma.
Proof. The upper bound on variance follows from \(\operatorname{var}(S) = \sum_{i=1}^n \operatorname{var}(X_i)\) and from the first part of Hoeffding’s Lemma.
For the upper-bound on \(\log \mathbb{E} \mathrm{e}^{\lambda S}\), \[\begin{array}{rl} \log \mathbb{E} \mathrm{e}^{\lambda S} & = \log \mathbb{E} \mathrm{e}^{\sum_{i=1}^n \lambda (X_i - \mathbb{E} X_i)} \\ & = \log \mathbb{E} \Big[\prod_{i=1}^n \mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big] \\ & = \log \Big(\prod_{i=1}^n \mathbb{E} \Big[\mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big]\Big) \\ & = \sum_{i=1}^n \log \mathbb{E} \Big[\mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big] \\ & \leq \sum_{i=1}^n \frac{\lambda^2 (b_i-a_i)^2}{8} \\ & = \frac{\lambda^2 v}{2} \end{array}\] where the third equality comes from independence of the \(X_i\)’s and the inequality follows from invoking Hoeffding’s Lemma for each summand.
The Cramer-Chernoff technique consists of using Markov’s inequality with exponential moments. \[\begin{array}{rl} P \big\{ S \geq t \big\} & \leq \inf_{\lambda \geq 0}\frac{\mathbb{E} \mathrm{e}^{\lambda S}}{\mathrm{e}^{\lambda t}} \\ & \leq \exp\Big(- \sup_{\lambda \geq 0} \big( \lambda t - \log \mathbb{E} \mathrm{e}^{\lambda S}\big) \Big)\\ & \leq \exp\Big(- \sup_{\lambda \geq 0}\big( \lambda t - \frac{\lambda^2 v}{2}\big) \Big) \\ & = \mathrm{e}^{- \frac{t^2}{2v} } \, . \end{array}\]
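A quick empirical check of the resulting bound \(P\{S \geq t\} \leq \mathrm{e}^{-t^2/(2v)}\) for centered sums of independent uniform \([0,1]\) variables, for which \(v = \sum_{i=1}^n (b_i - a_i)^2/4 = n/4\); the values of \(n\), \(t\) and the number of repetitions are arbitrary.

```python
# Hoeffding's inequality versus the empirical tail of a centered sum of uniforms.
import numpy as np

rng = np.random.default_rng(6)
n, t, reps = 100, 10.0, 100_000
sums = rng.uniform(size=(reps, n)).sum(axis=1) - n / 2.0      # centered sums
empirical = (sums >= t).mean()
hoeffding = np.exp(-t**2 / (2 * (n / 4.0)))
print(f"empirical P(S >= {t}) ~ {empirical:.5f}, Hoeffding bound {hoeffding:.5f}")
```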
Hoeffding’s inequality provides interesting tail bounds for binomial random variables, which are sums of independent \([0,1]\)-valued random variables. However, in some cases, the variance upper bound used in Hoeffding’s inequality is excessively conservative. Think for example of a binomial random variable with parameters \(n\) and \(\mu/n\): the variance proxy obtained from the boundedness assumption is of order \(n\), while the true variance is \(\mu(1 - \mu/n) \leq \mu\). This motivates the next two exponential inequalities, stated in Theorem 14.7 and Theorem 14.8.
Remark 14.2. Bennett’s inequality provides us with improved tail bounds for the binomial random variable with parameters \(n\) and \(\mu/n\). This binomial random variable is distributed like the sum of \(n\) independent Bernoulli random variables with parameter \(\mu/n\). This fits within the scope of Bennett’s inequality: we can choose \(b=1\) and \(v=\mu\). The resulting upper bound on the logarithmic moment generating function coincides with the logarithmic moment generating function of a centered Poisson random variable with parameter \(\mu\), see Theorem 12.3.
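A numerical comparison for a binomial random variable with parameters \(n\) and \(\mu/n\), assuming the standard form of Bennett’s bound, \(\exp\big(-(v/b^2)\, h(bt/v)\big)\) with \(h(u) = (1+u)\log(1+u) - u\) (to be matched against the exact statement of the theorem above); the values of \(n\), \(\mu\) and \(t\) are arbitrary.

```python
# Exact binomial tail versus the Poisson-like Bennett bound (b = 1, v = mu) and
# the Hoeffding bound (variance proxy n / 4 from boundedness alone).
import numpy as np
from scipy.stats import binom

n, mu, t = 10_000, 5.0, 10.0

def h(u):
    return (1 + u) * np.log(1 + u) - u

exact = binom.sf(mu + t - 1, n, mu / n)          # P{X >= mu + t} for X ~ Bin(n, mu/n)
bennett = np.exp(-mu * h(t / mu))
hoeffding = np.exp(-t**2 / (2 * (n / 4.0)))
print(f"exact tail {exact:.2e}, Bennett-type bound {bennett:.2e}, Hoeffding bound {hoeffding:.3f}")
```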
Proof. The proof combines the Cramer-Chernoff technique with an ad hoc upper bound on \(\log \mathbb{E} \mathrm{e}^{\lambda (X_i - \mathbb{E}X_i)}\).
By homogeneity, we may assume \(b=1\).
Note that \(\phi(\lambda)/\lambda^2\) is non-decreasing over \(\mathbb{R}\), where \(\phi(u) = \mathrm{e}^u - u - 1\). Hence, for \(x\leq 1\) and \(\lambda \geq 0\), \(\phi(\lambda x)\leq x^2 \phi(\lambda)\).
\[\begin{array}{rl} \log \mathbb{E} \mathrm{e}^{\lambda (X_i - \mathbb{E}X_i)} & = \log \mathbb{E} \mathrm{e}^{\lambda X_i} - \lambda \mathbb{E}X_i \\ & \leq \mathbb{E} \mathrm{e}^{\lambda X_i} - 1 - \lambda \mathbb{E}X_i \\ & = \mathbb{E} \phi(\lambda X_i) \\ & \leq \mathbb{E}\big[X_i^2\big] \phi(\lambda) \, , \end{array}\] where the second line uses \(\log u \leq u - 1\) and the last line uses \(\phi(\lambda X_i) \leq X_i^2 \phi(\lambda)\) (as \(X_i \leq 1\)). Summing over \(i\) and invoking the Cramer-Chernoff technique completes the proof.
Whereas Bennett’s bound works well for Poisson-like random variables, our last bound is geared towards Gamma-like random variables. It is one of the pillars of statistical learning theory.
Proof. The proof combines again the Cramer-Chernoff technique with an ad hoc upper bound on \(\log \mathbb{E} \mathrm{e}^{\lambda (S - \mathbb{E}S)}\).
Let again \(\phi(u)=e^u-u-1\) for \(u\in \mathbb{R}\).
For \(\lambda>0\), \[\begin{array}{rl} \phi(\lambda X_i) & = \sum_{k=2}^\infty \frac{\lambda^k X_i^k}{k!} \\ & \leq \frac{\lambda^2 X_i^2}{2!} + \sum_{k=3}^\infty \frac{\lambda^k (X_i)_+^k}{k!} \, , \end{array}\] where, on the event \(\{X_i \leq 0\}\), the inequality follows from \(\phi(u) \leq u^2/2\) for \(u \leq 0\). For \(c> \lambda>0\), \[\begin{array}{rl} \log \mathbb{E} \mathrm{e}^{\lambda S} & = \sum_{i=1}^n \log \mathbb{E} \mathrm{e}^{\lambda (X_i - \mathbb{E}X_i)} \\ & \leq \sum_{i=1}^n \mathbb{E} \phi(\lambda X_i) \\ & \leq \frac{\lambda^2 \sum_{i=1}^n \mathbb{E} X_i^2}{2!} + \sum_{k=3}^\infty \frac{\lambda^k \sum_{i=1}^n \mathbb{E}(X_i)_+^k}{k!} \\ & \leq \frac{\lambda^2 v}{2} + \sum_{k=3}^\infty \frac{\lambda^k v c^{k-2}}{2} \\ & = \frac{\lambda^2 v}{2 (1 - c \lambda)} \, . \end{array}\] The tail bound follows by maximizing \[\sup_{\lambda \in [0,1/c)} \lambda t - \frac{\lambda^2 v}{2 (1 - c \lambda)} = \frac{v}{c^2} \sup_{\eta \in [0,1)} \eta \frac{ct}{v} - \frac{\eta^2}{2(1-\eta)} \, .\]
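As a sanity check of this last optimization, the supremum admits the classical closed form \(\frac{v}{c^2}\, h_1\big(\frac{ct}{v}\big)\) with \(h_1(u) = 1 + u - \sqrt{1 + 2u}\), which is at least \(\frac{t^2}{2(v + ct)}\), the familiar form of the Bernstein bound. A small numerical verification; the values of \(v\), \(c\) and \(t\) are arbitrary.

```python
# Grid maximization of lambda t - lambda^2 v / (2 (1 - c lambda)) over [0, 1/c)
# against the closed form (v / c^2) h_1(c t / v), h_1(u) = 1 + u - sqrt(1 + 2 u).
import numpy as np

v, c, t = 2.0, 0.5, 3.0
lam = np.linspace(0.0, 1.0 / c, 1_000_000, endpoint=False)
numeric_sup = (lam * t - lam**2 * v / (2 * (1 - c * lam))).max()
u = c * t / v
closed_form = (v / c**2) * (1 + u - np.sqrt(1 + 2 * u))
bernstein = t**2 / (2 * (v + c * t))
print(f"numeric sup {numeric_sup:.6f}, closed form {closed_form:.6f}, t^2/(2(v+ct)) = {bernstein:.6f}")
```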
14.8 Bibliographic remarks
(Dudley, 2002) contains a thorough discussion of the various kinds of convergence that can be defined for random variables. In particular, it offers a general perspective on topological issues in probability spaces and tackles the problems raised by random variables that take values in (possibly infinite-dimensional) metric spaces.
Laws of large numbers and \(0\)-\(1\) laws fit in the more general framework of ergodic theorems, see (Dudley, 2002) or (Durrett, 2010). An important example of a law of large numbers is the Asymptotic Equipartition Property (AEP) in Information Theory. Note that it holds for a much larger class of sources than the set of memoryless sources (infinite product probability spaces). See (Cover & Thomas, 1991) or (Csiszár & Körner, 1981).
Introductions to exponential inequalities and their applications can be found in (Massart, 2007) and (Boucheron, Lugosi, & Massart, 2013).