10  Discrete Conditioning

10.1 Roadmap

Conditioning is central to probabilistic reasoning. In this lesson, we investigate discrete conditioning. In this setting, the definition of conditional probability is not an issue. The definition of conditional expectation can be deceptively simple. Nevertheless the discrete setting lends itself to intuitive definitions and manipulations.

The simplest notion we meet is conditional probability with respect to a specific event with positive probability (Section 10.2). Conditional probability offers an intuitive interpretation of independence.

In Section 10.3 we state, check and discuss Bayes formula.

In Section 10.4, we define conditional expectation with respect to an atomic \(\sigma\)-algebra. This defines conditional expectation with respect to a discrete random variables.

In Section 10.5, we characterize conditional expectation as an optimal predictor. This characterization is very helpful when defining conditional expectation in the general setting.

10.2 Conditioning with respect to an event

Definition 10.1 Let \(P\) be a probability distribution on \((\Omega, \mathcal{F})\). Let \(A \in \mathcal{F}\) be such that \(P\{A\} > 0\). Let \(B\) be another event ( \(B \in \mathcal{F}\) ), the probability of \(B\) given \(A\) is defined as

\[P\{B \mid A\}= \frac{P\{A \cap B\}}{P\{A\}} .\]


If \(X\) is a standard Gaussian random variable on \((\Omega, \mathcal{F})\), and event \(A\) is defined by \(\{ \omega : X(\omega) \geq t\}\) for some \(t\geq 0\), we may condition on event \(A\) and define \(P\{B \mid A\}\) for \(B = \{ \omega : |X(\omega)|\geq 2t\}\).

We get

\[P\{ B \mid A \} = \frac{P\{ X \geq 2 t\}}{P\{ X \geq t\}} \,.\]

We may check the next proposition by considering once again the definition of probability distributions.

Proposition 10.1 Let \(P\) be a probability distribution on \((\Omega, \mathcal{F})\).

Let \(A \in \mathcal{F})\) be such that \(P\{A\} > 0\), then \(P\{\cdot \mid A\}\) (\(P\) given \(A\)) defines a probability distribution over \((\Omega, \mathcal{F}).\)


Proof. \(P(\cdot \mid A)\) maps \(\mathcal{F}\) to \([0,1]\).

We have \(P(\Omega \mid A)= P(A \cap \Omega)/ P(A) = P(A) /P(A)= 1\)

Let \((B_n)_n\) be a monotone increasing sequence of events, then \[ \begin{array}{rl} P (\cup_n B_n \mid A) & = \frac{P((\cup_n B_n) \cap A)}{P(A)} \\ & = \frac{P(\cup_n (B_n \cap A))}{P(A)} \\ & = \frac{\lim_n P(B_n \cap A)}{P(A)} \\ & = \lim_n P(B_n \mid A) \, . \end{array} \]


We may consider the distribution of random variables on \((\Omega, \mathcal{F})\) under \(P\{\cdot \mid A\}\). We compute the expectation of \(X\) under \(P\{ \cdot \mid A\}\):

\[\mathbb{E}_{P\{\cdot \mid A\}} X = \frac{\mathbb{E} [\mathbb{I}_A\, X]}{P\{A\}} \,.\]

This is often denoted by \(\mathbb{E}[X \mid A]\), we will try to avoid this possibly misleading notation.


Example 10.1 Assume \(X\) is standard normally distributed. One may investigate the distribution of \(X^2\) conditionnally on event \(A = \{ \omega : X(\omega)\geq t\}\). For \(t>1\), we have

\[ \begin{array}{rl} \mathbb{E}_{P\{\cdot \mid X \geq t\}} X^2 & = \frac{\int_t^\infty x^2 \phi(x) \mathrm{d}x}{\int_t^\infty \phi(x) \mathrm{d}x}\\ & \leq \frac{t^2}{1-1/t} + 1 \, . \end{array} \]

where the upper bound is obtained by repeated integration by parts.

The distribution of \(X\) given \(A\) is not Gaussian. Under \(A\), \(X\) is very concentrated in the neighborhood of \(t,\) and tends to be more concentrated as \(t\) goes to infinity.


Knowing the probability distribution given event \(A\) enables to investigate independence of events with respect to \(A\) The next trivial proposition is worth reminding.

Proposition 10.2 If \(A\) and \(B \in \mathcal{F}\) satisfy \(P\{A\}> 0\), then \(A\) and \(B\) are independent under \(P\) iff

\[P\{B\mid A\}= P\{B\}.\]


10.3 Bayes formula


Bayes formula is sometimes used in probabilistic causation theory. This is a difficult matter. Causality is a subtle notion and we will refrain from making causal interpretations.

Proposition 10.3 (Bayes formula) Let \(P\) be a probability distribution on \((\Omega, \mathcal{F})\), let \((A_i)_{i \in \mathcal{I} \subseteq \mathbb{N}}\) be a collection of pairwise disjoint events, with non-zero probability such that \(\cup_{i \in \mathcal{I}} A_i = \Omega\) (\((A_i)_i\) form a complete system of events), let \(B\) be an event with non-zero probability, then for all \(i \in \mathcal{I}\),

\[P\{A_i \mid B\} = \frac{P\{A_i \} \times P\{B \mid A_i \}}{\sum_{j \in \mathcal{I}} P\{A_j\}\times P\{B \mid A_j\}}\]


Proof. By definition, \(P\{A_i \mid B\}= P\{A_i \cap B\}/ P\{B\}= P\{A_i \} \times P\{B \mid A_i \}/ P\{B\}\).

Morever \[ \begin{array}{rl} P \{ B\} & = P\{B \cap (\cup_{j \in \mathcal{I}} A_j)\}\\ & = P\{\cup_{j \in \mathcal{I}} (B \cap A_j)\} \\ & = \sum_{j \in \mathcal{I}} P\{B \cap A_j \} \\ & = \sum_{j \in \mathcal{I}} P\{A_j \} \times P\{B \mid A_j \}. \end{array} \]


In the preceding proposition, \(P\{A_i \}\) is called the prior probability of \(A_i\) and \(P\{A_i \mid B\}\) the posterior probability.

10.4 Conditional expectation with respect to a discrete \(\sigma\)-algebra

While the general notion of conditional expectation requires some abstraction, we can introduce conditioning with respect to a discrete \(\sigma\)-algebra starting from the elementary notion of conditional probability with respect to an event with positive probability.


Definition 10.2 Let \(\Omega\) be a universe, \(\mathcal{F}\) a \(\sigma\)-algebra of events on \(\Omega\), \(P\) a probability distribution on \((\Omega, \mathcal{F})\), let \((A_i)_{i \in \mathcal{I} \subseteq \mathbb{N}}\) be pairwise disjoint events, with non-zero probability such that \(\cup_i A_i = \Omega .\) Let \(\mathcal{G}\) be the atomic \(\sigma\)-algebra generated by \((A_i)_{i \in \mathcal{I}}\).

Let \(X\) be a random variable from \((\Omega, \mathcal{F})\) to \((\mathcal{X}, \mathcal{H})\), the conditional expectation of \(X\) with respect to \(\mathcal{G}\) is the random variable defined as

\[ \mathbb{E} [X \mid \mathcal{G}] = \sum_{i \in \mathcal{I}} \mathbb{E}_{P_{\{\cdot |A_i \}}} [X] \times \mathbf{1}_{A_i} . \]


While \(\mathbb{E}_{P_{\{\cdot |A_i \}}} [X]\) is a real number, \(\mathbb{E} [X \mid \mathcal{G}]\) is a \(\mathcal{G}\)-measurable function from \(\Omega\) to \(\mathcal{X}\):

\[ \mathbb{E} [X \mid \mathcal{G}](\omega) = \sum_{i \in \mathcal{I}} \mathbb{E}_{P_{\{\cdot |A_i \}}} [X] \times \mathbf{1}_{A_i}(\omega) \qquad \forall \omega \in \Omega\, . \]

These two kinds of objects should not be confused. We will refrain from using notation \(\mathbb{E}[X \mid A_i]\) since it may be confusing: \(\mathbb{E}[X \mid A_i]\) might denote either \(\mathbb{E}_{P_{\{\cdot |A_i \}}} [X]\) or \(\mathbb{E} [X \mid \sigma(A_i)]\) where \(\sigma(A_i)\) is the sigma-algebra generated by \(A_i\): \(\{ A_i, A_i^c, \Omega, \emptyset\}\).


Proposition 10.4 Let \(P\) be a probability distribution on \((\Omega, \mathcal{F})\). Let \((A_i)_{i \in \mathcal{I} \subseteq \mathbb{N}}\) be a collection of pairwise disjoint events, with non-zero probability satisfying \(\cup_{i \in \mathcal{I}} A_i = \Omega.\) Let \(\mathcal{G}=\sigma\Big((A_i)_{i \in \mathcal{I}}\Big)\) denote the sigma-algebra generated by \((A_i)_{i \in \mathcal{I}}\).The random variable \(X\) is assumed to be \(P\)- integrabe.

  1. The conditional expectation \(\mathbb{E} [X \mid \mathcal{G}]\) is a \(\mathcal{G}\)-measurable random variable, that satisfies

\[ \mathbb{E} \left[ YX \right] = \mathbb{E} \left[ Y \mathbb{E} [X \mid \mathcal{G}] \right] \qquad \forall Y \in \sigma(\mathcal{G}), Y \text{ bounded.} \]

  1. If two \(\mathcal{G}\)-measurable random variables \(Z, T\) satisfy \(\mathbb{E} \left[ YX \right] = \mathbb{E} \left[ Y Z \right] = \mathbb{E}[YT]\), for all \(Y \in \sigma(\mathcal{G}), Y \text{ bounded}\), then \(Z=T\) almost surely.

Proof. We need to ckeck points 1.) and 2.):

  1. \(\mathbb{E} \left[ X \mid \mathcal{G} \right]\) satisfies first property in Proposition Proposition 10.4.
  2. If \(Z\) satisfies Proposition 10.4, then \(Z = \mathbb{E}\left[ X \mid \mathcal{G} \right]\) \(P\)-almost-surely.

Checking i.)

If \(Y\) is \(\mathcal{G}\)-measurable, then \(Y = \sum_{i \in \mathcal{I}} \lambda_i \mathbf{1}_{A_i}\) for some real-valued sequence \((\lambda_i)_{i \in \mathcal{I}}\) .

Then \[ \begin{array}{rl} \mathbb{E} [Y \mathbb{E} \left[ X \mid \mathcal{G} \right]] & = \mathbb{E} \left[ \left( \sum_{i \in \mathcal{I}} \lambda_i \mathbf{1}_{A_i} \right) \left( \sum_{j \in \mathcal{I}} \mathbf{1}_{A_j} \frac{\mathbb{E} [ \mathbf{1}_{A_j} X]}{P\{A_j \}} \right) \right]\\ & = \left. \mathbb{E} \left[ \sum_{i \in \mathcal{I}} \lambda_i \mathbf{1}_{A_i} \frac{\mathbb{E} [ \mathbf{1}_{A_i} X]}{P\{A_i \}} \right) \right]\\ & = \sum_{i \in \mathcal{I}} \lambda_i \mathbb{E} [ \mathbf{1}_{A_i} X] \frac{\mathbb{E} \left[ \mathbf{1}_{A_i} \right]}{P\{A_i \}} \quad \text{linearity of expectation}\\ & = \sum_{i \in \mathcal{I}} \lambda_i \mathbb{E} [ \mathbf{1}_{A_i} X]\\ & = \mathbb{E} \left[ \left( \sum_{i \in \mathcal{I}} \lambda_i \mathbf{1}_{A_i} \right) X \right]\\ & = \mathbb{E} \left[ YX \right] . \end{array} \]

Checking ii.)

Assume \(Z\) satisfies Proposition 10.4.

Let us define \(Y\) using \(Y = \mathbf{1}_{A_i}\), for some index \(i \in \mathcal{I}\).

As \(Z\) is \(\mathcal{G}\)-measurable, there exists a real-valued sequence \(\left( \mu_j \right)_{j \in \mathcal{I}}\), such that \(Z = \sum_{j \in \mathcal{I}} \mu_j \mathbf{1}_{A_j}.\)

Thus, relying on the fact that events \(A_j\) are pairwise disjoint:

\[ \begin{array}{rl} \mathbb{E} \left[ ZY \right] & = \mathbb{E} \left[ \sum_{j \in \mathcal{I}} \mu_j \mathbf{1}_{A_j} \mathbf{1}_{A_i} \right] = \mu_i P\{A_i \} \end{array} \]

By the defining property of \(Z\), we have \[\mathbb{E} [ZY] = \mathbb{E} [XY] = \mathbb{E} [X \mathbf{1}_{A_i}]\,.\]

Finally, for all \(i \in \mathcal{I}\), \(\mu_i= \mathbb{E}[X\mathbf{1}_{A_i}]/ P\{A_i \}.\)

We can now conclude \(Z = \mathbb{E}[X \mid \mathcal{G}]\).


10.5 Conditional expectation as prediction

The next proposition reveals the role of conditional expectation in prediction/approximation problems.

Proposition 10.5 Let \(Y\) be a square-integrable random variable on \((\Omega, \mathcal{F}, P)\) and \(\mathcal{G}\) a discrete sub-\(\sigma\)-algebra of \(\mathcal{F}\). The conditional expectation of \(Y\) with respect to \(\mathcal{G}\) minimizes

\[\mathbb{E}\left[\big(Y - Z\big)^2\right]\]

amongst \(\mathcal{G}\)-measurable square-integrable random variables.

Recall that a \(\mathcal{G}\)-measurable random variable is a function that remains constant on each \(A_i, i \in \mathcal{I}\).

Proof. If \(Y\) is a random variable on \((\Omega,\mathcal{F}),\) and if we are trying to predict at best \(Y\) from a \(\mathcal{G}\)-measurable random variable , we are looking for a sequence of coefficients \((b_i)_{i\in \mathcal{I}}\) that minimizes:

\[ \begin{array}{rl} \mathbb{E}_P \left[ \Big( Y - \sum_{i\in \mathcal{I}} b_i \mathbf{I}_{A_i} \Big)^2 \right] & = \mathbb{E}_P \left[ \Big( \sum_{i\in \mathcal{I}} (Y-b_i) \mathbf{I}_{A_i} \Big)^2 \right] \\ & = \sum_{i\in \mathcal{I}} \mathbb{E}_P \left[ \left( Y-b_i\right)^2 \mathbf{I}_{A_i} \right] \\ & = \sum_{i\in \mathcal{I}} P\{A_i\}\, \mathbb{E}_{P\{\cdot \mid A_i\}} \left[ \left( Y-b_i\right)^2 \right] \end{array} \]

Thus for each \(i\), \(b_i\) must coincide with the expectation of \(Y\) under \(P\{\cdot \mid A_i\}.\) The best prediction of \(Y\), in the sense of the quadratic error, among the \(\mathcal{G}\)-measurable functions is the conditional expectation of \(Y\) with respect to \(\mathcal{G}\).

The properties identified by propositions Proposition 10.4 and @ref(prp:espercondpred) serve as a definition for the conditional expectation with respect to a general \(\sigma\)-algebra.

10.6 Properties of conditional expectation

We state without proof a number of useful properties of conditional expectation with respect to discrete \(\sigma\)-algebras. We shall prove them in full generality later.

Proposition 10.6 If \(X \leq Y\), \(P\)-a.s., then

\[\mathbb{E}[X \mid \mathcal{G}] \leq \mathbb{E}[Y \mid \mathcal{G}] \qquad P \text{-p.s.}\]

Proposition 10.7 \[ \mathbb{E}[ aX + bY \mid \mathcal{G}] = a \mathbb{E}[X \mid \mathcal{G}] + b \mathbb{E}[Y \mid \mathcal{G}] \,. \]

Proposition 10.8 If \(\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}\)

\[ \mathbb{E}\left[\mathbb{E}\left[ X \mid \mathcal{G}\right] \mid \mathcal{H} \right] = \mathbb{E}\left[ X \mid \mathcal{H} \right] \, . \]

\[\mathbb{E}\left[\mathbb{E}\left[ X \mid \mathcal{H}\right] \mid \mathcal{G} \right] = \mathbb{E}\left[ X \mid \mathcal{H} \right] \, . \]

Prove the proposition.

\[\mathbb{E} X = \mathbb{E}[\mathbb{E}[X \mid \mathcal{G}]] \, .\]

10.7 Application: Galton-Watson processes I

The size of generation \(k\geq 0\) is defined recursively by

\[Z_0 = 1, \qquad Z_{k+1} = \sum_{i=1}^{Z_k} X^k_{i} \, .\]

The \(\sigma\)-algebra \(\sigma(Z_k)\) is discrete/atomic, it is generated by the pairwise disjoint events \(\Big\{ Z_k = a\Big\}\) for \(a \in \mathbb{N}\).

Proposition 10.9 In a Galton-Watson (homogeneous) branching process with reproduction number \(\mu\), the conditional expectation of the size of the size of the \(k+1^{\text{th}}\) generation with respect to the size of the \(k^{ \text{th}}\) generation is a linear function of the size of the \(k^{ \text{th}}\) generation:

\[\mathbb{E}\Big[Z_{k+1} \mid \sigma(Z_k)\Big] = \mathbb{E}X^0_1 \times Z_k = \mu \times Z_k\]


Proof. On the event \(\Big\{ Z_k = a\Big\}\), we can determine the conditional distribution of \(Z_{k+1}\).

\[ \begin{array}{rl} \{ Z_{k+1} = b \wedge Z_k = a\Big\} & = \Big\{ \sum_{i=1}^a X^k_i = b \wedge Z_k = a\Big\} \\ & = \Big\{ \sum_{i=1}^a X^k_i = b\Big\} \cap \Big\{ Z_k = a \Big\} \end{array} \]

we have \[ \begin{array}{rl} P \Big\{ Z_{k+1} = b \mid Z_k = a\Big\} = P \Big\{ \sum_{i=1}^a X^k_i = b \mid Z_k = a\Big\} = P \Big\{ \sum_{i=1}^a X^k_i = b \Big\} \end{array} \]

On the event \(\{Z_k = a\}\), \(Z_{k+1}\) is distributed like the sum of \(a\) independent copies of \(X^0_1\):

\[ \begin{array}{rl} \mathbb{E} \Big[ Z_{k+1}\mid \sigma(Z_k)\Big] & = \sum_{a=0}^\infty \mathbb{E}_{P(\mid Z_k=a)}\Big[Z_{k+1}\Big] \times \mathbb{I}_{Z_k=a} \\ & = \sum_{a=0}^\infty \mathbb{E}_{P(\mid Z_k=a)}\Big[ \sum_{i=1}^a X^k_i \Big] \times \mathbb{I}_{Z_k=a} \\ & = \sum_{a=0}^\infty \mathbb{E}\Big[ \sum_{i=1}^a X^k_i \Big] \times \mathbb{I}_{Z_k=a} \\ & = \sum_{a=0}^\infty \sum_{i=1}^a \mathbb{E}\Big[ X^k_i \Big] \times \mathbb{I}_{Z_k=a} \\ & = \sum_{a=0}^\infty a \mathbb{E} X^0_1 \times \mathbb{I}_{Z_k=a} \\ & = \mathbb{E} X^0_1 \times Z_k \, . \end{array} \]


An immediate corollary is:

\[\mathbb{E}Z_k = (\mathbb{E} X^0_1)^k \qquad\text{forall } k\geq 0\,.\]

The sequence of expected sizes of generations forms a geometric sequence.

A Galton-Watson process is said to be sub-critical if the expectation of the offspring distribution is smaller than \(1\).


Proposition 10.10 (Extinction under sub-critical offspring distribution) The extinction probability of a sub-critical branching process is equal to \(1\).


Proof. Denote by \(E_k\) the event \(\{ Z_k = 0\}\). Observe that the sequence \((E_k)_k\) is increasing. Denote by \(E_{\infty} = \cup_{k=0}^\infty E_k\).

\[P \{ E_k^c \} = P\{ Z_k \geq 1 \} \leq \mathbb{E} Z_k \,.\]

Hence \(P\{E_k^c\} \downarrow 0\) and \(P\{ E_k\} \uparrow 1\). By monotone convergence \(P(E_\infty) = 1\).


The expected size of the total progeny of subcritical branching process is equal to \[ \sum_{k=0}^\infty \mathbb{E} Z_k = \sum_{k=0}^\infty (\mathbb{E} X^0_1)^k = \frac{1}{1 - \mathbb{E} X^0_1} \, . \]

Working with discrete conditioning allows us to derive non-trivial statements about the Galton-Watson process without knowing much about the offspring distribution beyond the fact that its expectation is smaller than \(1\). We still ignore the the details of the distribution of \(Z_k\), let alone of the distribution of \(\sum_{k=0}^\infty Z_k\).