10 Discrete Conditioning
10.1 Roadmap
Conditioning is central to probabilistic reasoning. In this lesson, we investigate discrete conditioning. In this setting, the definition of conditional probability raises no difficulty, and the definition of conditional expectation looks deceptively simple. The discrete setting thus lends itself to intuitive definitions and manipulations.
The simplest notion we meet is conditional probability with respect to a specific event with positive probability (Section 10.2). Conditional probability offers an intuitive interpretation of independence.
In Section 10.3 we state, check and discuss Bayes formula.
In Section 10.4, we define conditional expectation with respect to an atomic \(\sigma\)-algebra. In particular, this defines conditional expectation with respect to a discrete random variable.
In Section 10.5, we characterize conditional expectation as an optimal predictor. This characterization is very helpful when defining conditional expectation in the general setting.
10.2 Conditioning with respect to an event
If \(X\) is a standard Gaussian random variable on \((\Omega, \mathcal{F}, P)\), and the event \(A\) is defined by \(\{ \omega : X(\omega) \geq t\}\) for some \(t\geq 0\), we may condition on \(A\) and define \(P\{B \mid A\}\) for \(B = \{ \omega : |X(\omega)|\geq 2t\}\).
We get
\[P\{ B \mid A \} = \frac{P\{ X \geq 2 t\}}{P\{ X \geq t\}} \,.\]
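As a sanity check, this identity can be verified numerically. Below is a minimal Monte Carlo sketch (the function names are ours); the exact tail probability is computed through the complementary error function.

```python
import math
import random

def gaussian_tail(t):
    # P{X >= t} for a standard Gaussian X, via the complementary error function.
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def conditional_prob_mc(t, n=200_000, seed=0):
    # Monte Carlo estimate of P{|X| >= 2t | X >= t} for t >= 0:
    # count samples in B among the samples in A.
    rng = random.Random(seed)
    in_a = in_b = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        if x >= t:                 # event A
            in_a += 1
            if abs(x) >= 2 * t:    # event B
                in_b += 1
    return in_b / in_a

t = 0.5
exact = gaussian_tail(2 * t) / gaussian_tail(t)   # P{X >= 2t} / P{X >= t}
estimate = conditional_prob_mc(t)
```

The empirical conditional frequency should agree with the ratio of tail probabilities up to Monte Carlo error.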
We may check the next proposition by going back to the definition of a probability distribution.
Proof. \(P(\cdot \mid A)\) maps \(\mathcal{F}\) to \([0,1]\).
We have \(P(\Omega \mid A)= P(A \cap \Omega)/ P(A) = P(A) /P(A)= 1\).
Let \((B_n)_n\) be a monotone increasing sequence of events, then \[ \begin{array}{rl} P (\cup_n B_n \mid A) & = \frac{P((\cup_n B_n) \cap A)}{P(A)} \\ & = \frac{P(\cup_n (B_n \cap A))}{P(A)} \\ & = \frac{\lim_n P(B_n \cap A)}{P(A)} \\ & = \lim_n P(B_n \mid A) \, . \end{array} \] Together with finite additivity, which is immediate from the definition, this shows that \(P(\cdot \mid A)\) is a probability measure on \((\Omega, \mathcal{F})\).
We may consider the distribution of random variables on \((\Omega, \mathcal{F})\) under \(P\{\cdot \mid A\}\). We compute the expectation of \(X\) under \(P\{ \cdot \mid A\}\):
\[\mathbb{E}_{P\{\cdot \mid A\}} X = \frac{\mathbb{E} [\mathbb{I}_A\, X]}{P\{A\}} \,.\]
This quantity is often denoted by \(\mathbb{E}[X \mid A]\); we will try to avoid this possibly misleading notation.
Example 10.1 Assume \(X\) is standard normally distributed. One may investigate the distribution of \(X^2\) conditionally on the event \(A = \{ \omega : X(\omega)\geq t\}\). For \(t>1\), we have
\[ \begin{array}{rl} \mathbb{E}_{P\{\cdot \mid X \geq t\}} X^2 & = \frac{\int_t^\infty x^2 \phi(x) \mathrm{d}x}{\int_t^\infty \phi(x) \mathrm{d}x}\\ & \leq \frac{t^2}{1-1/t} + 1 \, . \end{array} \]
where the upper bound is obtained by repeated integration by parts.
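The computation can be sketched as follows. Write \(\overline{\Phi}(t) = \int_t^\infty \phi(x)\,\mathrm{d}x\) for the Gaussian survival function. Using \(\phi'(x) = -x\,\phi(x)\), one integration by parts handles the numerator, and the classical Gaussian tail lower bound \(\overline{\Phi}(t) \geq (1/t - 1/t^3)\,\phi(t)\) (itself a by-product of integrating by parts again) yields the bound:

```latex
\begin{aligned}
\int_t^\infty x^2 \phi(x)\,\mathrm{d}x
  &= \bigl[-x\,\phi(x)\bigr]_t^\infty + \int_t^\infty \phi(x)\,\mathrm{d}x
   = t\,\phi(t) + \overline{\Phi}(t)\,,\\
\frac{\int_t^\infty x^2 \phi(x)\,\mathrm{d}x}{\overline{\Phi}(t)}
  &= \frac{t\,\phi(t)}{\overline{\Phi}(t)} + 1
  \leq \frac{t\,\phi(t)}{(1/t - 1/t^3)\,\phi(t)} + 1
   = \frac{t^2}{1 - 1/t^2} + 1
  \leq \frac{t^2}{1 - 1/t} + 1\,.
\end{aligned}
```

The last inequality uses \(1/t^2 \leq 1/t\) for \(t \geq 1\).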
The distribution of \(X\) given \(A\) is not Gaussian. Under \(A\), \(X\) is very concentrated in the neighborhood of \(t,\) and tends to be more concentrated as \(t\) goes to infinity.
Knowing the probability distribution given event \(A\) makes it possible to investigate the independence of events with respect to \(A\). The next, trivial, proposition is worth recalling.
10.3 Bayes formula
Bayes formula is sometimes used in probabilistic causation theory. This is a difficult matter. Causality is a subtle notion and we will refrain from making causal interpretations.
Proof. By definition, \(P\{A_i \mid B\}= P\{A_i \cap B\}/ P\{B\}= P\{A_i \} \times P\{B \mid A_i \}/ P\{B\}\).
Moreover, since the events \(A_j\) are pairwise disjoint, \[ \begin{array}{rl} P \{ B\} & = P\{B \cap (\cup_{j \in \mathcal{I}} A_j)\}\\ & = P\{\cup_{j \in \mathcal{I}} (B \cap A_j)\} \\ & = \sum_{j \in \mathcal{I}} P\{B \cap A_j \} \\ & = \sum_{j \in \mathcal{I}} P\{A_j \} \times P\{B \mid A_j \}. \end{array} \]
In the preceding proposition, \(P\{A_i \}\) is called the prior probability of \(A_i\) and \(P\{A_i \mid B\}\) the posterior probability.
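Bayes formula over a finite partition is a one-line computation. The sketch below (the setup is ours: a hypothetical two-atom partition, a rare condition and its complement, with a diagnostic test that reports positive with probability 95% on one atom and 5% on the other) maps priors and likelihoods to posteriors.

```python
def posteriors(priors, likelihoods):
    # Bayes formula over a finite partition (A_j):
    # P{A_i | B} = P{A_i} P{B | A_i} / sum_j P{A_j} P{B | A_j}.
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Hypothetical example: prior P{A_1} = 0.01, P{A_2} = 0.99,
# likelihoods P{B | A_1} = 0.95, P{B | A_2} = 0.05.
post = posteriors([0.01, 0.99], [0.95, 0.05])
```

Despite the high likelihood on the rare atom, its posterior probability stays modest: the evidence is dominated by the large prior mass of the complement.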
10.4 Conditional expectation with respect to a discrete \(\sigma\)-algebra
While the general notion of conditional expectation requires some abstraction, we can introduce conditioning with respect to a discrete \(\sigma\)-algebra starting from the elementary notion of conditional probability with respect to an event with positive probability.
While \(\mathbb{E}_{P_{\{\cdot |A_i \}}} [X]\) is a real number, \(\mathbb{E} [X \mid \mathcal{G}]\) is a \(\mathcal{G}\)-measurable function from \(\Omega\) to \(\mathbb{R}\):
\[ \mathbb{E} [X \mid \mathcal{G}](\omega) = \sum_{i \in \mathcal{I}} \mathbb{E}_{P_{\{\cdot |A_i \}}} [X] \times \mathbf{1}_{A_i}(\omega) \qquad \forall \omega \in \Omega\, . \]
These two kinds of objects should not be confused. We will refrain from using notation \(\mathbb{E}[X \mid A_i]\) since it may be confusing: \(\mathbb{E}[X \mid A_i]\) might denote either \(\mathbb{E}_{P_{\{\cdot |A_i \}}} [X]\) or \(\mathbb{E} [X \mid \sigma(A_i)]\) where \(\sigma(A_i)\) is the sigma-algebra generated by \(A_i\): \(\{ A_i, A_i^c, \Omega, \emptyset\}\).
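On a finite probability space, the displayed formula can be evaluated atom by atom. A minimal sketch (names and the die example are ours): the atoms of \(\mathcal{G}\) are taken to be the level sets of a labelling function.

```python
from collections import defaultdict

def cond_expectation(outcomes, prob, x, atom):
    # E[X | G] for the atomic sigma-algebra G whose atoms are the level
    # sets of `atom`: on atom A_i, the conditional expectation equals
    # E[1_{A_i} X] / P{A_i}.
    num = defaultdict(float)  # accumulates E[1_{A_i} X]
    den = defaultdict(float)  # accumulates P{A_i}
    for w in outcomes:
        num[atom(w)] += prob[w] * x(w)
        den[atom(w)] += prob[w]
    return lambda w: num[atom(w)] / den[atom(w)]

# A fair die, conditioned on the parity of the outcome:
# E[X | even] = (2 + 4 + 6) / 3 = 4, E[X | odd] = (1 + 3 + 5) / 3 = 3.
die = range(1, 7)
p = {w: 1.0 / 6.0 for w in die}
e_cond = cond_expectation(die, p, lambda w: w, lambda w: w % 2)
```

The returned object is a function of \(\omega\), constant on each atom, as the definition requires.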
Proof. We need to check points i.) and ii.):
- \(\mathbb{E} \left[ X \mid \mathcal{G} \right]\) satisfies the first property in Proposition 10.4.
- If \(Z\) satisfies Proposition 10.4, then \(Z = \mathbb{E}\left[ X \mid \mathcal{G} \right]\) \(P\)-almost-surely.
Checking i.)
If \(Y\) is \(\mathcal{G}\)-measurable, then \(Y = \sum_{i \in \mathcal{I}} \lambda_i \mathbf{1}_{A_i}\) for some real-valued sequence \((\lambda_i)_{i \in \mathcal{I}}\) .
Then \[ \begin{array}{rl} \mathbb{E} [Y \mathbb{E} \left[ X \mid \mathcal{G} \right]] & = \mathbb{E} \left[ \left( \sum_{i \in \mathcal{I}} \lambda_i \mathbf{1}_{A_i} \right) \left( \sum_{j \in \mathcal{I}} \mathbf{1}_{A_j} \frac{\mathbb{E} [ \mathbf{1}_{A_j} X]}{P\{A_j \}} \right) \right]\\ & = \mathbb{E} \left[ \sum_{i \in \mathcal{I}} \lambda_i \mathbf{1}_{A_i} \frac{\mathbb{E} [ \mathbf{1}_{A_i} X]}{P\{A_i \}} \right] \quad \text{since } \mathbf{1}_{A_i} \mathbf{1}_{A_j} = 0 \text{ for } i \neq j\\ & = \sum_{i \in \mathcal{I}} \lambda_i \frac{\mathbb{E} [ \mathbf{1}_{A_i} X]}{P\{A_i \}}\, \mathbb{E} \left[ \mathbf{1}_{A_i} \right] \quad \text{linearity of expectation}\\ & = \sum_{i \in \mathcal{I}} \lambda_i \mathbb{E} [ \mathbf{1}_{A_i} X]\\ & = \mathbb{E} \left[ \left( \sum_{i \in \mathcal{I}} \lambda_i \mathbf{1}_{A_i} \right) X \right]\\ & = \mathbb{E} \left[ YX \right] . \end{array} \]
Checking ii.)
Assume \(Z\) satisfies Proposition 10.4.
Fix an index \(i \in \mathcal{I}\) and set \(Y = \mathbf{1}_{A_i}\).
As \(Z\) is \(\mathcal{G}\)-measurable, there exists a real-valued sequence \(\left( \mu_j \right)_{j \in \mathcal{I}}\), such that \(Z = \sum_{j \in \mathcal{I}} \mu_j \mathbf{1}_{A_j}.\)
Thus, relying on the fact that events \(A_j\) are pairwise disjoint:
\[ \begin{array}{rl} \mathbb{E} \left[ ZY \right] & = \mathbb{E} \left[ \sum_{j \in \mathcal{I}} \mu_j \mathbf{1}_{A_j} \mathbf{1}_{A_i} \right] = \mu_i P\{A_i \} \, . \end{array} \]
By the defining property of \(Z\), we have \[\mathbb{E} [ZY] = \mathbb{E} [XY] = \mathbb{E} [X \mathbf{1}_{A_i}]\,.\]
Finally, for all \(i \in \mathcal{I}\), \(\mu_i= \mathbb{E}[X\mathbf{1}_{A_i}]/ P\{A_i \}.\)
We can now conclude \(Z = \mathbb{E}[X \mid \mathcal{G}]\).
10.5 Conditional expectation as prediction
The next proposition reveals the role of conditional expectation in prediction/approximation problems.
Recall that a \(\mathcal{G}\)-measurable random variable is a function that remains constant on each \(A_i, i \in \mathcal{I}\).
Proof. Let \(Y\) be a random variable on \((\Omega,\mathcal{F})\). If we are trying to predict \(Y\) at best by a \(\mathcal{G}\)-measurable random variable, we are looking for a sequence of coefficients \((b_i)_{i\in \mathcal{I}}\) that minimizes:
\[ \begin{array}{rl} \mathbb{E}_P \left[ \Big( Y - \sum_{i\in \mathcal{I}} b_i \mathbf{1}_{A_i} \Big)^2 \right] & = \mathbb{E}_P \left[ \Big( \sum_{i\in \mathcal{I}} (Y-b_i) \mathbf{1}_{A_i} \Big)^2 \right] \\ & = \sum_{i\in \mathcal{I}} \mathbb{E}_P \left[ \left( Y-b_i\right)^2 \mathbf{1}_{A_i} \right] \\ & = \sum_{i\in \mathcal{I}} P\{A_i\}\, \mathbb{E}_{P\{\cdot \mid A_i\}} \left[ \left( Y-b_i\right)^2 \right] \end{array} \]
Thus for each \(i\), \(b_i\) must coincide with the expectation of \(Y\) under \(P\{\cdot \mid A_i\}.\) The best prediction of \(Y\), in the sense of the quadratic error, among the \(\mathcal{G}\)-measurable functions is the conditional expectation of \(Y\) with respect to \(\mathcal{G}\).
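The optimality can be checked numerically on a small example. A sketch (the die example and names are ours): the quadratic risk of a \(\mathcal{G}\)-measurable predictor is computed outcome by outcome, and perturbing one coefficient away from the conditional expectation increases the risk.

```python
def quadratic_risk(outcomes, prob, y, atom, coeffs):
    # E[(Y - sum_i b_i 1_{A_i})^2] for the G-measurable predictor that
    # takes the value coeffs[i] on atom A_i.
    return sum(prob[w] * (y(w) - coeffs[atom(w)]) ** 2 for w in outcomes)

# Fair die, G generated by the parity of the outcome; the conditional
# expectation takes value 3 on odd outcomes and 4 on even outcomes.
die = range(1, 7)
p = {w: 1.0 / 6.0 for w in die}
best = quadratic_risk(die, p, lambda w: w, lambda w: w % 2, {1: 3.0, 0: 4.0})
perturbed = quadratic_risk(die, p, lambda w: w, lambda w: w % 2, {1: 3.5, 0: 4.0})
```

Any other choice of coefficients gives a strictly larger risk, in line with the proposition.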
The properties identified in Proposition 10.4 and in the proposition above serve as a definition of conditional expectation with respect to a general \(\sigma\)-algebra.
10.6 Properties of conditional expectation
We state without proof a number of useful properties of conditional expectation with respect to discrete \(\sigma\)-algebras. We shall prove them in full generality later.
Prove the following identity (the tower property):
\[\mathbb{E} X = \mathbb{E}[\mathbb{E}[X \mid \mathcal{G}]] \, .\]
10.7 Application: Galton-Watson processes I
The size of generation \(k\geq 0\) is defined recursively by
\[Z_0 = 1, \qquad Z_{k+1} = \sum_{i=1}^{Z_k} X^k_{i} \, ,\]
where the offspring counts \((X^k_i)_{i \geq 1, k \geq 0}\) are independent, identically distributed, \(\mathbb{N}\)-valued random variables.
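The recursion translates directly into a simulation. A minimal sketch (function names are ours); each individual of generation \(k\) reproduces independently according to the offspring sampler.

```python
import random

def galton_watson(offspring, generations, rng):
    # Generation sizes Z_0, ..., Z_{generations} of a Galton-Watson
    # process: Z_{k+1} is the sum of Z_k i.i.d. offspring counts.
    z, sizes = 1, [1]
    for _ in range(generations):
        z = sum(offspring(rng) for _ in range(z))
        sizes.append(z)
    return sizes
```

For instance, with exactly two children per individual the generation sizes double at each step, and with zero children the process is extinct after one step.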
The \(\sigma\)-algebra \(\sigma(Z_k)\) is discrete (atomic): it is generated by the pairwise disjoint events \(\Big\{ Z_k = a\Big\}\) for \(a \in \mathbb{N}\).
Proof. On the event \(\Big\{ Z_k = a\Big\}\), we can determine the conditional distribution of \(Z_{k+1}\). Since
\[ \begin{array}{rl} \Big\{ Z_{k+1} = b \wedge Z_k = a\Big\} & = \Big\{ \sum_{i=1}^a X^k_i = b \wedge Z_k = a\Big\} \\ & = \Big\{ \sum_{i=1}^a X^k_i = b\Big\} \cap \Big\{ Z_k = a \Big\} \, , \end{array} \]
and since \(\sum_{i=1}^a X^k_i\) is independent of \(Z_k\) (which depends only on the offspring counts of generations \(0, \dots, k-1\)), we have
\[ P \Big\{ Z_{k+1} = b \mid Z_k = a\Big\} = P \Big\{ \sum_{i=1}^a X^k_i = b \mid Z_k = a\Big\} = P \Big\{ \sum_{i=1}^a X^k_i = b \Big\} \, . \]
On the event \(\{Z_k = a\}\), \(Z_{k+1}\) is distributed like the sum of \(a\) independent copies of \(X^0_1\):
\[ \begin{array}{rl} \mathbb{E} \Big[ Z_{k+1}\mid \sigma(Z_k)\Big] & = \sum_{a=0}^\infty \mathbb{E}_{P\{\cdot \mid Z_k=a\}}\Big[Z_{k+1}\Big] \times \mathbb{I}_{Z_k=a} \\ & = \sum_{a=0}^\infty \mathbb{E}_{P\{\cdot \mid Z_k=a\}}\Big[ \sum_{i=1}^a X^k_i \Big] \times \mathbb{I}_{Z_k=a} \\ & = \sum_{a=0}^\infty \mathbb{E}\Big[ \sum_{i=1}^a X^k_i \Big] \times \mathbb{I}_{Z_k=a} \\ & = \sum_{a=0}^\infty \sum_{i=1}^a \mathbb{E}\Big[ X^k_i \Big] \times \mathbb{I}_{Z_k=a} \\ & = \sum_{a=0}^\infty a\, \mathbb{E} X^0_1 \times \mathbb{I}_{Z_k=a} \\ & = \mathbb{E} X^0_1 \times Z_k \, . \end{array} \]
An immediate corollary is:
\[\mathbb{E}Z_k = (\mathbb{E} X^0_1)^k \qquad\text{for all } k\geq 0\,.\]
The sequence of expected sizes of generations forms a geometric sequence.
A Galton-Watson process is said to be sub-critical if the expectation of the offspring distribution is smaller than \(1\).
Proof. Denote by \(E_k\) the event \(\{ Z_k = 0\}\). Observe that the sequence \((E_k)_k\) is increasing: extinction is absorbing, as \(Z_k = 0\) entails \(Z_{k+1} = 0\). Let \(E_{\infty} = \cup_{k=0}^\infty E_k\) denote the extinction event.
By Markov's inequality,
\[P \{ E_k^c \} = P\{ Z_k \geq 1 \} \leq \mathbb{E} Z_k \,.\]
In the sub-critical case, \(\mathbb{E} Z_k = (\mathbb{E} X^0_1)^k \to 0\), hence \(P\{E_k^c\} \downarrow 0\) and \(P\{ E_k\} \uparrow 1\). By monotone convergence (continuity from below), \(P(E_\infty) = 1\).
The expected size of the total progeny of subcritical branching process is equal to \[ \sum_{k=0}^\infty \mathbb{E} Z_k = \sum_{k=0}^\infty (\mathbb{E} X^0_1)^k = \frac{1}{1 - \mathbb{E} X^0_1} \, . \]
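This formula can be probed by simulation. A sketch under an assumed offspring law (the law, names, and parameters below are ours, chosen so that the offspring mean is \(0.7\)): run the process until extinction, count all individuals, and average.

```python
import random

def total_progeny(offspring, rng):
    # Total number of individuals ever born in a Galton-Watson process
    # started from a single ancestor (a.s. finite in the sub-critical case).
    z, total = 1, 1
    while z > 0:
        z = sum(offspring(rng) for _ in range(z))
        total += z
    return total

# Hypothetical sub-critical offspring law: P{0} = 0.5, P{1} = 0.3,
# P{2} = 0.2, so the offspring mean is 0.7 and 1 / (1 - 0.7) = 10 / 3.
def offspring(rng):
    return rng.choices([0, 1, 2], weights=[5, 3, 2])[0]

rng = random.Random(42)
n = 100_000
average = sum(total_progeny(offspring, rng) for _ in range(n)) / n
```

The empirical average of the total progeny should be close to \(10/3\), up to Monte Carlo error.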
Working with discrete conditioning allows us to derive non-trivial statements about the Galton-Watson process without knowing much about the offspring distribution beyond the fact that its expectation is smaller than \(1\). We still do not know the details of the distribution of \(Z_k\), let alone the distribution of \(\sum_{k=0}^\infty Z_k\).