6  Characterizations of discrete probability distributions

6.1 Motivation

In full generality, a probability distribution is a complex object: a \([0,1]\)-valued function defined over a \(\sigma\)-algebra of subsets. A concrete \(\sigma\)-algebra, let alone the abstract notion of a \(\sigma\)-algebra, is not easily grasped, so looking for simpler characterizations of probability distributions is a sensible goal. When facing questions like "are two probability distributions equal?", we know it suffices to check that the two distributions coincide on generating families of events. This makes cumulative distribution functions precious tools. Cumulative distribution functions and their generalized inverses (quantile functions) are very convenient when handling maxima, minima, or more generally order statistics of collections of independent random variables, but when it comes to handling sums of independent random variables or branching processes, they are of moderate help.

In this lesson, we review a way of characterizing probability distributions over \((\mathbb{N}, \mathcal{F}=2^{\mathbb{N}})\) through functions defined on the real line: probability generating functions (Section 6.2). Later (Chapter 12), we will survey more general tools: Laplace transforms (Section 12.2) and characteristic functions, which extend Fourier transforms to probability distributions (Section 12.3).

All three methods are distinct in scope, but they rely on the same idea and share common features. Indeed, probability generating functions can be seen as a special case of Laplace transforms, and the latter can in turn be seen as a special case of Fourier transforms.

All three methods do characterize probability distributions: each comes with an inversion formula. All three also provide a seamless treatment of sums of independent random variables.

All three methods relate the integrability of probability distributions to the smoothness of their transforms.

Probability generating functions, Laplace transforms and characteristic functions bring important analytical machinery into Probability Theory. From Analysis, we get off-the-shelf arguments to establish smoothness properties of the transforms, and with a little more work, we can construct the inversion formulae.

6.2 Probability generating function

In this section, \(X\) is an integer-valued random variable, with distribution \(P\), cumulative distribution function \(F\) and probability mass function \(p\). Recall that \(P\) is completely characterized by the much simpler objects \(F\) and \(p\). Now, let \(Y\) be another integer-valued random variable living on the same probability space as \(X\), independent of \(X\), with distribution \(Q\), cumulative distribution function \(G\) and probability mass function \(q\). What can we tell about the distribution of \(X+Y\)? Is it easy to figure out its cumulative distribution function or its probability mass function?

The probability mass function of (the distribution of) \(X+Y\) is the convolution of \(p\) and \(q\): \[\begin{align*} \mathbb{P}\{ X + Y = n\} & = \sum_{k=0}^n \mathbb{P}\{ X + Y = n \wedge X = k\} \\ & = \sum_{k=0}^n \mathbb{P}\{ Y = n - k \wedge X = k\} \\ & = \sum_{k=0}^n \mathbb{P}\{ X = k\} \, \mathbb{P}\{ Y = n-k\} \qquad \text{(by independence)} \\ & = \sum_{k=0}^n p(k) q(n-k) \\ & = p \star q (n) \, . \end{align*}\]
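
For readers who like to see the convolution formula in action, here is a minimal numerical sketch (the parameter choices and the use of numpy are illustrative assumptions, not part of the text): convolving the probability mass functions of two independent binomial variables with the same success probability reproduces the probability mass function of their sum.

```python
import numpy as np
from math import comb

def binom_pmf(n, p):
    # PMF of Binomial(n, p) on {0, ..., n}
    return np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])

# X ~ Binomial(3, 0.4) and Y ~ Binomial(5, 0.4), independent, so X + Y ~ Binomial(8, 0.4).
p, q = binom_pmf(3, 0.4), binom_pmf(5, 0.4)

conv = np.convolve(p, q)                      # (p * q)(n) for n = 0, ..., 8
print(np.allclose(conv, binom_pmf(8, 0.4)))   # True
```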

Another function characterizes probability distributions and delivers instantaneous information about the distribution of sums of independent integer-valued random variables and many other things.

Definition 6.1 (Probability Generating Function) The probability generating function (PGF) of a probability distribution over \(\mathbb{N}\), defined by its probability mass function (PMF) \(f\) is the function \(G: [0,1] \to \mathbb{R}\) defined by: \[ G(s) = \sum_{n=0}^\infty f(n) s^n \, . \]
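
As a minimal numerical sketch of the definition (the truncation level and the Poisson example are illustrative choices), a PGF can be approximated by truncating the series:

```python
from math import exp, factorial

def pgf(pmf, s, n_max=100):
    # Approximate G(s) = sum_{n >= 0} f(n) s^n by truncating the series at n_max.
    return sum(pmf(n) * s**n for n in range(n_max + 1))

# Illustration with the Poisson(2) probability mass function.
mu = 2.0
poisson_pmf = lambda n: exp(-mu) * mu**n / factorial(n)

print(pgf(poisson_pmf, 0.5))   # close to exp(mu * (0.5 - 1)), i.e. about 0.3679
```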

Example 6.1 The probability generating function of basic discrete distributions is easily computed. The results are useful and suggestive.

  • Bernoulli distribution with parameter \(p\): \[ G(s) = (1 - p) s^0 + p s^1 = 1 + p (s-1) \]
  • Binomial distribution with parameters \(n\) and \(p\): \[ G(s) = \sum_{k=0}^n \binom{n}{k} p^k (1-p)^{n-k} s^k = \left(ps + 1- p\right)^n = \left( 1 + p(s-1)\right)^n \]
  • Poisson distribution with parameter \(\mu\): \[ G(s) = \sum_{n=0}^\infty \mathrm{e}^{-\mu} \frac{\mu^n}{n!} s^n = \mathrm{e}^{\mu (s-1)} \,. \] Note that if \(p=\mu/n\) and \(n \uparrow \infty\): \[ \lim_{n \uparrow \infty} \left( 1 + \frac{\mu}{n}(s-1)\right)^n = \mathrm{e}^{\mu(s-1)} \, . \] We will come back to this observation later; a numerical illustration follows the list.
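
The observation above can be checked numerically. A minimal sketch (the values \(\mu = 3\) and \(s = 0.4\) are arbitrary illustrations): the Binomial\((n, \mu/n)\) PGF approaches the Poisson\((\mu)\) PGF as \(n\) grows.

```python
from math import exp

mu, s = 3.0, 0.4
target = exp(mu * (s - 1))                        # Poisson(mu) PGF at s

for n in [10, 100, 1000, 10_000]:
    binomial_pgf = (1 + (mu / n) * (s - 1)) ** n  # Binomial(n, mu/n) PGF at s
    print(n, binomial_pgf, abs(binomial_pgf - target))
```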

The next observation follows almost immediately from the definition of probability generating functions.

Proposition 6.1 A probability generating function \(G\) satisfies the following conditions:

  • \(G\) is non-negative over \([0,1]\);
  • \(G(0) = P\{0\}, \quad G(1)=1\);
  • \(G\) is non-decreasing over \([0,1]\);
  • \(G\) is continuous and convex.

Proof. Non-negativity, monotonicity and convexity are immediate: \(G\) is a convex combination (with weights \(f(n)\)) of the functions \(s \mapsto s^n\), which are non-negative, non-decreasing and convex on \([0,1]\). Evaluating at \(0\) and \(1\) gives \(G(0) = f(0) = P\{0\}\) and \(G(1) = \sum_n f(n) = 1\). Continuity follows from the Weierstrass M-test: since \(\sum_n f(n) = 1 < \infty\), the series \(\sum_n f(n) s^n\) converges uniformly on \([0,1]\), so \(G\) is a uniform limit of continuous functions.
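
A quick numerical sanity check of these properties, on an illustrative example (the Binomial\((5, 0.4)\) PGF and the grid size are arbitrary choices): the values on a grid are non-decreasing and have non-negative second differences, as expected for a non-decreasing convex function.

```python
import numpy as np

G = lambda s: (1 + 0.4 * (s - 1)) ** 5    # Binomial(5, 0.4) PGF

s = np.linspace(0.0, 1.0, 101)
values = G(s)

print(np.isclose(values[0], 0.6 ** 5), np.isclose(values[-1], 1.0))  # G(0) = P{0}, G(1) = 1
print(np.all(np.diff(values) >= 0))         # non-decreasing on the grid
print(np.all(np.diff(values, n=2) >= 0))    # consistent with convexity
```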

6.3 Inversion formula

Generatingfunctionology lies at the crossroads of combinatorics, real analysis, complex analysis, and probability theory. Defining the PGF as a power series brings into probability theory a collection of theorems that facilitate the identification of probability distributions or that connect integrability properties of the probability distribution with smoothness properties of the PGF.

Keep in mind that a generating function defines a function from the set of complex numbers \(\mathbb{C}\) to \(\mathbb{C}\): \[ G(z) = \sum_{n=0}^\infty p(n) z^n \qquad\text{for all } z \in \mathbb{C} \text{ such that the series converges}\, . \]

Characterizing the domain of a function defined in that way is crucial. The next proposition is at the core of Power Series theory.

Proposition 6.2 The radius of convergence of the generating function \(G\)

\[ G(z) = \sum_{n\in \mathbb{N}} p(n) z^n, \qquad z \in \mathbb{C} \] is the unique \(R \in [0, \infty) \cup \{+ \infty\}\) such that:

  • for every \(z \in \mathbb{C}\) with \(|z| > R\), the series \(\sum_{n\in \mathbb{N}} p(n) z^n\) diverges.
  • for every \(z \in \mathbb{C}\) with \(|z| < R\), the series \(\sum_{n\in \mathbb{N}} p(n) z^n\) is absolutely convergent.

The open disk \(\{ z : z \in \mathbb{C}, |z| < R \}\) is called the disk of convergence of \(G\). The circle \(\{ z : z \in \mathbb{C}, |z| = R \}\) is called the circle of convergence of \(G\).

Proof.

Hadamard’s rule allows for fast determination of the radius of convergence:

Proposition 6.3 (Hadamard’s rule) The radius of convergence \(R\) of a probability generating function \(G(z) = \sum_{n \in \mathbb{N}} p(n)z^n\) satisfies \[ \frac{1}{R} = \limsup_n (p(n))^{1/n} \, . \] The radius of convergence of a probability generating function is always at least \(1\): since \(p(n) \leq 1\) for all \(n\), \(\limsup_n (p(n))^{1/n} \leq 1\).
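
As an illustration of Hadamard’s rule (the geometric distribution with \(q = 0.3\) is an arbitrary choice), \((p(n))^{1/n}\) can be computed for increasing \(n\); it approaches \(1-q\), in agreement with a radius of convergence \(1/(1-q)\) as noted in the bullet points below.

```python
q = 0.3   # geometric distribution: p(n) = q * (1 - q)**(n - 1) for n >= 1

for n in [10, 100, 1000]:
    p_n = q * (1 - q) ** (n - 1)
    print(n, p_n ** (1 / n))                    # approaches 1 - q = 0.7

print("radius of convergence:", 1 / (1 - q))    # = 1 / 0.7, about 1.43
```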

The radius of convergence contains qualitative information about tail behavior:

  • For Poisson distributions, the radius of convergence is infinite. This reflects the fast decay of the tail probability of Poisson distributions.
  • For geometric distributions, \(p(n) = q (1-q)^{n-1}\), the radius of convergence is \(1/(1-q)\).
  • For power law distributions like \(p(n) = n^{-r}/\zeta(r)\) with \(r>1\), the radius of convergence is exactly \(1\).

Proof.

Just knowing the radius of convergence of a function defined by a power series expansion already tells us about the smoothness properties of that function.

If \(G\) is defined by a power series \(G(z) = \sum_{n \in \mathbb{N}} a_n z^n\) with radius of convergence \(R\), then \(G\) is differentiable on the open disk of convergence, with (complex) derivative \(G'(z)= \sum_{n \in \mathbb{N}} (n+1) a_{n+1} z^n\). The derivative \(G'\) and \(G\) have the same radius of convergence.

Proof.

This general statement about power series entails a very useful corollary for probability generating functions.

Corollary 6.1 (Inversion formula) Let \(g\) be the probability generating function associated with the probability mass function \(f\). Then \(g\) is infinitely many times differentiable over \([0,1)\) with derivatives \[ g^{(n)}(s) = \sum_{k=n}^\infty \frac{k!}{(k-n)!} \times f(k) s^{k-n} \, , \] more specifically: \[ g^{(n)}(0) = n! \times f(n) \, . \] A probability distribution over \(\mathbb{N}\) is characterized by its probability generating function.


Proof. The property is true for \(n=0\).

Assume it holds for all integers up to \(n\). For \(s\in [0, 1)\) and \(0 < |h|<1-s-\delta\), where \(\delta\) is a small positive number, \[\begin{align*} \frac{g^{(n)}(s+h) - g^{(n)}(s)}{h} & = \sum_{k=n}^\infty \frac{k!}{(k-n)!} \times f(k) \Big(\sum_{j=0}^{k-n-1}(s+h)^{k-n-1-j} s^{j} \Big) \, . \end{align*}\] The absolute value of the internal sum is smaller than \((k-n) (1-\delta)^{k-n-1}\), and \[ \sum_{k=n}^\infty \frac{k!}{(k-n-1)!} \times f(k) \times (1-\delta)^{k-n-1} < \infty \] for all \(0<\delta<1\). By the Dominated Convergence Theorem (applied to the counting measure on \(\mathbb{N}\)), \[\begin{align*} \lim_{h \to 0} \frac{g^{(n)}(s+h) - g^{(n)}(s)}{h} & = \sum_{k=n+1}^\infty \frac{k!}{(k-n-1)!} \times f(k) \times s^{k-n-1} \,, \end{align*}\] which is the claimed expression for \(g^{(n+1)}(s)\). Evaluating \(g^{(n)}\) at \(s=0\), only the term \(k=n\) survives, whence \(g^{(n)}(0) = n! \times f(n)\): the probability mass function, hence the distribution, is recovered from the probability generating function.
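
The inversion formula lends itself to symbolic verification. A minimal sketch (assuming the sympy library is available; the Poisson PGF is used as an illustration): differentiating the PGF \(n\) times at \(0\) and dividing by \(n!\) recovers the probability mass function.

```python
import sympy as sp

s, mu = sp.symbols('s mu', positive=True)
G = sp.exp(mu * (s - 1))                   # PGF of a Poisson(mu) distribution

for n in range(5):
    f_n = sp.diff(G, s, n).subs(s, 0) / sp.factorial(n)   # g^(n)(0) / n!
    print(n, sp.simplify(f_n))             # exp(-mu) * mu**n / n!
```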


Example 6.2 The probability generating function of a Poisson distribution with parameter \(\mu\) equals \(\exp(\mu(s-1))\): \[\begin{align*} g(s) & = \sum_{k=0}^\infty \mathrm{e}^{-\mu} \frac{\mu^k}{k!} s^k = \mathrm{e}^{-\mu} \sum_{k=0}^\infty \frac{(\mu s)^k}{k!} \\ & = \mathrm{e}^{-\mu} \times \mathrm{e}^{\mu s} = \mathrm{e}^{\mu(s-1)} \, . \end{align*}\] If we meet a probability distribution with such a PGF, we know it is a Poisson distribution (with parameter \(\mu\)).

6.4 Sums of independent random variables and probability generating functions

Probability Generating Functions provide us with a handy tool to investigate sums of independent random variables.

Proposition 6.4 Let \(X\), \(Y\) be independent integer-valued random variables, with probability generating functions \(G_X\) and \(G_Y\). The probability generating function \(G_{X+Y}\) of \(X+Y\) is \(G_X\times G_Y\): \[ G_{X+Y} = G_X \times G_Y \, . \]


Proof. The proof relies on the fact that non-negative convergent series are commutatively convergent.

For any \(n\): \[ \left\{ \omega : X(\omega) + Y(\omega) = n \right\} = \cup_{k=0}^n \left\{ \omega : X(\omega)=k \text{ and } Y(\omega) = n -k \right\} \,. \] The union on the right-hand side is a disjoint union and, by independence, \(\mathbb{P}\{X=k, Y=n-k\} = \mathbb{P}\{X=k\} \times \mathbb{P}\{Y=n-k\}\). Hence \[\begin{align*} \sum_{n=0}^\infty \mathbb{P}\{ X + Y = n\}\times s^n & = \sum_{n=0}^\infty \left( \sum_{k=0}^n \mathbb{P}\{ X=k\}\times \mathbb{P}\{ Y=n-k\} \right) s^n \\ & = \sum_{k=0}^\infty \mathbb{P}\{ X=k\} s^k \sum_{n=k}^\infty \mathbb{P}\{ Y=n-k\} s^{n-k} \\ & = G_X(s) \times G_Y(s) \, . \end{align*}\]

In measure-theoretic language, the proposition is a consequence of the Tonelli-Fubini Theorem: \[\begin{align*} G_{X+Y}(s) & = \mathbb{E}\left[s^{X+Y}\right] \\ & = \mathbb{E}\left[s^{X} \times s^{Y}\right] \\ & = \int_{\mathbb{R}^2} s^x s^y \,\mathrm{d}(P_X \otimes P_Y)(x,y) \\ & = \int_{\mathbb{R}} \int_{\mathbb{R}} s^x s^y \,\mathrm{d}P_X(x)\, \mathrm{d}P_Y(y) \\ & = \int_{\mathbb{R}} s^y \left(\int_{\mathbb{R}} s^x \,\mathrm{d}P_X(x)\right) \mathrm{d}P_Y(y) \\ & = \int_{\mathbb{R}} s^y\, G_X(s)\, \mathrm{d}P_Y(y) \\ & = G_X(s) \times G_Y(s) \, . \end{align*}\]
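
A Monte Carlo sketch of Proposition 6.4 (the distributions Poisson(2) and Binomial(5, 0.3), the sample size, and the evaluation point \(s = 0.6\) are illustrative assumptions): estimating \(\mathbb{E}[s^{X+Y}]\) from independent samples agrees with the product of the two PGFs computed in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.6

# X ~ Poisson(2) and Y ~ Binomial(5, 0.3), independent.
X = rng.poisson(2.0, size=1_000_000)
Y = rng.binomial(5, 0.3, size=1_000_000)

estimate = np.mean(s ** (X + Y))                             # Monte Carlo estimate of G_{X+Y}(s)
product = np.exp(2.0 * (s - 1)) * (1 + 0.3 * (s - 1)) ** 5   # G_X(s) * G_Y(s)

print(estimate, product)   # both close to 0.237
```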


Example 6.3 If \(X\) and \(Y\) are independent Poisson random variables with parameters \(\mu\) and \(\nu\), then \(G_{X+Y}(s) = \exp(\mu(s-1))\times \exp(\nu(s-1))= \exp((\mu+\nu)(s-1))\). This is another proof that \(X+Y\) is Poisson distributed with parameter \(\mu+\nu\).

6.5 Smoothness and integrability

A PGF is infinitely many times differentiable inside the (open) disk of convergence. If the radius of convergence is larger than \(1\) (as for Poisson or geometric distributions), this entails that the PGF is infinitely many times differentiable at \(1\). If the radius of convergence is exactly \(1\), the differentiability on the circle of convergence is not prescribed by the general theory.

Theorem 6.1 (Integrability and probability generating functions) Let \(X\) be an integer-valued random variable with probability generating function \(G\). Then \(\mathbb{E} X^p < \infty\) iff \(G\) is \(p\)-times differentiable (from the left) at \(1\), and in that case \[ G^{(p)}(1) = \mathbb{E}\left[X (X-1) \ldots (X-p+1)\right] \, . \]


Proof. Assume first that \(G\) is \(p\)-times differentiable on the left at \(1\). We need to establish that \(X^p\) is integrable.

Conversely, assume that \(X^p\) is integrable.
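
As an illustration of the theorem (assuming the sympy library is available; the Poisson PGF is an arbitrary example): differentiating the PGF at \(1\) yields the factorial moments, from which the mean and variance follow.

```python
import sympy as sp

s, mu = sp.symbols('s mu', positive=True)
G = sp.exp(mu * (s - 1))                   # Poisson(mu) PGF

EX = sp.diff(G, s).subs(s, 1)              # E[X] = G'(1)
EX_fact2 = sp.diff(G, s, 2).subs(s, 1)     # E[X(X - 1)] = G''(1)
variance = sp.simplify(EX_fact2 + EX - EX**2)

print(EX, EX_fact2, variance)              # mu, mu**2, mu
```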


The next question arises quickly: when is a function from \([0,1]\) to \([0,\infty)\) a probability generating function? This question is addressed later from a broader perspective.