# %%
import numpy as np

In the sequel, \(X\) is a (data) numerical matrix, that is, an element of \(\mathbb{R}^{n \times p}\). The rows of \(X\) (the individuals) are the sample points. Each sample point is a tuple of \(p\) elements (the so-called variables).
The sample mean is defined as \[ \overline{X}= \left( \frac{1}{n}\sum_{i=1}^n X_{i,j}\right)_{j\leq p} \]
In linear algebra, \(\overline{X}\) is the result of a vector-matrix multiplication: \[ \overline{X} = \frac{1}{n}\begin{pmatrix} 1 & \ldots & 1\end{pmatrix} \times X \]
Here, we view \(\overline{X}\) as a row vector built from column averages.
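The identity above is easy to check numerically. Here is a minimal sketch; the names X_demo, n, and p are illustrative choices, not part of the text above:

# %%
import numpy as np

n, p = 6, 3
X_demo = np.random.randn(n, p)
ones_row = np.ones(n)                  # the row vector (1, ..., 1)
np.allclose(ones_row @ X_demo / n,     # (1/n) (1 ... 1) X
            np.mean(X_demo, axis=0))   # True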
Centering \(X\) consists of subtracting the column average from each matrix element. \[X - \begin{pmatrix}1 \\ \vdots \\ 1\end{pmatrix} \times \overline{X}\]
Note that centering amounts to projecting the columns of \(X\) onto the \((n-1)\)-dimensional subspace of \(\mathbb{R}^n\) that is orthogonal to \(\begin{pmatrix} 1 & \ldots & 1\end{pmatrix}^\top\):
\[\begin{align*} X - \begin{pmatrix}1 \\ \vdots \\ 1\end{pmatrix} \times \overline{X} & = \underbrace{ \left(\text{Id}_n - \frac{1}{n} \begin{pmatrix} 1 \\ \vdots \\ 1\end{pmatrix} \begin{pmatrix} 1 & \ldots & 1\end{pmatrix}\right)}_{\text{projector matrix}} X \end{align*}\]
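Here is a minimal sketch checking that the matrix above is indeed a projector (idempotent) and that it centers the columns; the names H and X_demo are illustrative:

# %%
import numpy as np

n = 5
ones = np.ones((n, 1))
H = np.eye(n) - ones @ ones.T / n   # projector orthogonal to (1, ..., 1)^T
X_demo = np.random.randn(n, 3)
print(np.allclose(H @ H, H))                                   # True: H is idempotent
print(np.allclose(H @ X_demo, X_demo - X_demo.mean(axis=0)))   # True: H centers the columns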
Let us call \(Y\) the matrix obtained from centering the columns of \(X\).
Scaling \(Y\) consists of dividing each coefficient of \(Y\) by \(1/\sqrt{n}\) times the (Euclidean) norm of its column.
Let us call \(\sigma_j\) the Euclidean norm of column \(j\) of \(Y\) (\(1\leq j \leq p\)) divided by \(\sqrt{n}\): \[ \sigma_j^2 = \frac{1}{n}\sum_{i=1}^n Y_{i,j}^2 = \frac{1}{n} \sum_{i=1}^n \left(X_{i,j} - \overline{X}_j\right)^2 \]
This is also the (empirical) standard deviation of the \(j^{\text{th}}\) column of \(X\).
The standardized matrix \(Z\) is obtained from the next multiplication \[ Z = Y \times \begin{pmatrix} \sigma_1 & 0 & \ldots & 0 \\ 0 & \sigma_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \ldots & 0 & \sigma_p \end{pmatrix}^{-1} \]
Note that for \(i\leq n\), \(j \leq p\): \[Z_{i,j} = \frac{X_{i,j} - \overline{X}_j}{\sigma_j}\]
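To connect the matrix product with the entrywise formula, here is a small check; X_demo is again an illustrative name:

# %%
import numpy as np

X_demo = np.random.randn(5, 3)
Y_demo = X_demo - X_demo.mean(axis=0)
sigma = np.std(X_demo, axis=0)            # denominator n, as in the definition above
np.allclose(Y_demo @ np.diag(1.0 / sigma),
            (X_demo - X_demo.mean(axis=0)) / sigma)  # True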
\(Z\) is called the \(z\)-score matrix. It shows up in many statistical analyses.
In the statistical computing environment R, a function called scale() computes the \(z\)-score matrix. One may ask whether NumPy offers such a function.
In NumPy, there is no single function equivalent to R’s scale() function. However, you can achieve the same result using broadcasting.
Let us first generate a toy data matrix with random (Gaussian) coefficients. This is an opportunity to introduce np.random.
We will work with \(n=5\) and \(p=3\).
We first build a (positive definite) covariance matrix. We ensure positive definiteness by building the matrix from a Cholesky factor.
# %%
#| label: build-covariance
L = np.array([
    [1., 0., 0.],
    [.5, 1., 0.],
    [.5, .5, 1.]])
C = L @ L.transpose()
# L is the Cholesky factor of C

# %%
# Generate a sample of 5 independent normal vectors
# with mean (1, 2, 3) and covariance C
mu = np.array([1, 2, 3])
X = np.random.randn(5, 3) @ L.transpose() + mu
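As a quick sanity check (a sketch, assuming the cells above have been run), np.linalg.cholesky should recover L from C, since the Cholesky factor of a positive definite matrix with positive diagonal is unique:

# %%
np.allclose(np.linalg.cholesky(C), L)  # True: cholesky recovers the lower factor L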
We just did something strange: we added a matrix with shape \((5,3)\) and a vector of length \(3\). In linear algebra, this is not legitimate. We relied on the device called broadcasting; see below.
In NumPy, the function np.mean with a well-chosen optional axis argument returns a 1D array filled with the column averages.

# %%
# Compute column averages,
# that is, compute the arithmetic mean along axis `0`
emp_mean = np.mean(X, axis=0)

We can magically center the columns using broadcasting.

# %%
X_centered = X - emp_mean

If broadcasting were not possible, we could still achieve the result by resorting to NumPy's implementation of matrix multiplication:
# %%
X - np.ones((5, 1), dtype=np.float16) @ emp_mean.reshape(1, 3)

array([[ 0.41846646, -0.72225791, -0.78003353],
       [-2.50364484, -0.7034133 , -1.21231068],
       [-0.41703883,  0.21260877, -0.25539926],
       [ 1.62442711,  1.52986466,  1.19133333],
       [ 0.8777901 , -0.31680222,  1.05641013]])
This expression is a centered version of X, identical to X_centered:
# %%
X_centered - (X - np.ones((5, 1), dtype=np.float16) @ emp_mean.reshape(1, 3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
We now compute the column standard deviations using the function np.std with the axis argument set to 0. Note that np.std uses the \(1/n\) convention by default (ddof=0), matching the definition of \(\sigma_j\) above.
# %%
emp_std = np.std(X, axis=0)

The \(z\)-score matrix is obtained using another broadcasting operation.
# %%
Z = (X - emp_mean) / emp_std  # same as X_centered / emp_std

Finally, let us perform the sanity checks:
# %%
#| label: sanity-check-z-matrix-mean
np.mean(Z, axis=0)  # Z is column centered

array([-4.44089210e-17, -3.33066907e-17,  2.22044605e-16])

# %%
np.std(Z, axis=0)  # Z is standardized

array([1., 1., 1.])
Alternatively, we can use scipy.stats.zscore, which provides functionality similar to R's scale():
# %%
from scipy.stats import zscore

X_scaled_scipy = zscore(X, axis=0)
X_scaled_scipy

array([[ 0.29550841, -0.86295532, -0.80637625],
       [-1.76799861, -0.84043975, -1.25325195],
       [-0.29450026,  0.25402542, -0.26402441],
       [ 1.14712152,  1.82788563,  1.23156617],
       [ 0.61986894, -0.37851598,  1.09208644]])
scipy.stats.zscore centers and scales the data by default. Beware of the denominator, though: zscore uses ddof=0 (denominator \(n\)) by default, whereas R's scale() divides by the sample standard deviation (denominator \(n-1\)); pass ddof=1 to reproduce R's default behavior.
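A minimal sketch of the difference, assuming X is the matrix generated above:

# %%
print(np.allclose(zscore(X, axis=0),          # ddof=0: denominator n
                  (X - X.mean(axis=0)) / X.std(axis=0)))
print(np.allclose(zscore(X, axis=0, ddof=1),  # ddof=1: denominator n - 1, like R's scale()
                  (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)))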
Centering and standardization are classical preprocessing steps before Principal Component Analysis (and before many Machine Learning procedures).
The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimension and works its way left. Two dimensions are compatible when
- they are equal, or
- one of them is 1.
In our setting, the shapes of X and emp_mean are (5, 3) and (3,). The rightmost dimensions are equal, hence compatible.
Input arrays do not need to have the same number of dimensions. The resulting array will have the same number of dimensions as the input array with the greatest number of dimensions, where the size of each dimension is the largest size of the corresponding dimension among the input arrays. Note that missing dimensions are assumed to have size one.
In our setting, emp_mean is (virtually) reshaped to (1, 3), and the two arrays are fully compatible. To match the leading dimension of X, we can stack five copies of the reshaped emp_mean; this is just like left-multiplying emp_mean.reshape(1, 3) by np.ones((5, 1)).
When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.
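These compatibility rules can be probed directly with np.broadcast_shapes (available since NumPy 1.20); a small sketch:

# %%
import numpy as np

print(np.broadcast_shapes((5, 3), (3,)))    # (5, 3): trailing dimensions match
print(np.broadcast_shapes((5, 1), (1, 3)))  # (5, 3): size-1 dimensions are stretched
try:
    np.broadcast_shapes((5, 3), (4,))       # 3 and 4 are incompatible
except ValueError as err:
    print(err)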
In statistics, it is commonplace to sketch a probability distribution (empirical or not) using a location and a spread/scale parameter. If we add a constant to every outcome, the location parameter is shifted by the constant while the scale/spread parameter is left invariant. If we multiply every outcome by a positive constant, both the location and the spread/scale parameters are multiplied by that constant.
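These equivariance properties are easy to verify numerically for the mean and the standard deviation; a minimal sketch with illustrative constants a and b:

# %%
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
a, b = 2.0, 5.0
print(np.allclose(np.mean(x + b), np.mean(x) + b))  # shift moves the location...
print(np.allclose(np.std(x + b), np.std(x)))        # ...and leaves the spread invariant
print(np.allclose(np.mean(a * x), a * np.mean(x)))  # scaling multiplies the location...
print(np.allclose(np.std(a * x), a * np.std(x)))    # ...and the spread (a > 0 here)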
Choosing the expectation/mean as the location parameter is traditional; it is motivated by the fact that the expectation minimizes the mean quadratic error, but it is not a panacea. The empirical mean is well known for its sensitivity to outliers; in other words, it lacks robustness. Picking the standard deviation as the scale/spread parameter raises the same issues.
Many alternative location/scale parameters have been proposed aiming to achieve good tradeoffs between robustness and computability. The median is a popular location parameter, and the median absolute deviation around the median is a popular scale/spread parameter.
Let us implement a flexible scaling function that can use the median as the location parameter and the median absolute deviation around the median as the scale parameter.
# %%
def my_scaling(X, center=None, scale=None):
    """
    Center and scale a data matrix using flexible location and spread functions.

    This function demonstrates how to use functions as objects in Python to create
    a flexible scaling function that can work with different location and scale
    parameters. By default, it performs z-score standardization (centering by mean
    and scaling by standard deviation), but can be configured to use robust
    alternatives like median and median absolute deviation.

    Parameters
    ----------
    X : array_like
        Input data matrix of shape (n, p) where n is the number of observations
        and p is the number of variables. Each row is an observation, each column
        is a variable.
    center : callable, None, or False, optional
        Function to compute the location parameter along axis=0.
        - If None (default), uses `np.mean` for column-wise means.
        - If a callable (e.g., `np.median`), applies it to compute location.
        - If False, no centering is performed (location set to zeros).
    scale : callable, None, or False, optional
        Function to compute the spread/scale parameter along axis=0.
        - If None (default), uses `np.std` for column-wise standard deviations.
        - If a callable (e.g., `median_absdev`), applies it to compute spread.
        - If False, no scaling is performed (spread set to ones).

    Returns
    -------
    X_scaled : ndarray
        Centered and scaled data matrix of the same shape as X.
        Each column is centered by subtracting its location parameter and
        scaled by dividing by its spread parameter.
        Formula: (X - location) / spread

    Notes
    -----
    This function illustrates the use of functions as first-class objects in Python.
    By passing different functions for `center` and `scale`, you can achieve:
    - Standard z-score: `my_scaling(X)` or `my_scaling(X, np.mean, np.std)`
    - Robust scaling: `my_scaling(X, np.median, median_absdev)`
    - Centering only: `my_scaling(X, scale=False)`
    - Scaling only: `my_scaling(X, center=False)`

    Examples
    --------
    >>> import numpy as np
    >>> X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    >>> # Standard z-score standardization
    >>> my_scaling(X)
    array([[-1.22474487, -1.22474487, -1.22474487],
           [ 0.        ,  0.        ,  0.        ],
           [ 1.22474487,  1.22474487,  1.22474487]])
    >>> # Robust scaling using median and MAD
    >>> my_scaling(X, center=np.median, scale=median_absdev)
    array([[-1., -1., -1.],
           [ 0.,  0.,  0.],
           [ 1.,  1.,  1.]])
    >>> # Centering only (no scaling)
    >>> my_scaling(X, scale=False)
    array([[-3., -3., -3.],
           [ 0.,  0.,  0.],
           [ 3.,  3.,  3.]])
    """
    if center is None:
        center = np.mean
    if center:
        # center is a callable: compute the column-wise location
        location = center(X, axis=0)
    else:
        # center is False: no centering
        location = np.zeros(X.shape[1])
    if scale is None:
        scale = np.std
    if scale:
        # scale is a callable: compute the column-wise spread
        spread = scale(X, axis=0)
    else:
        # scale is False: no scaling
        spread = np.ones(X.shape[1])
    return (X - location) / spread

# %%
def median_absdev(X, axis=0):
"""
Compute the median absolute deviation (MAD) around the median.
The median absolute deviation is a robust measure of statistical dispersion.
It is defined as the median of the absolute deviations from the median of
the data. Unlike the standard deviation, it is less sensitive to outliers.
Parameters
----------
X : array_like
Input array or object that can be converted to an array.
For a 2D array, rows typically represent observations and columns
represent variables.
axis : int, optional
Axis along which the median and MAD are computed. Default is 0.
For a 2D array with axis=0, computes MAD for each column.
For axis=1, computes MAD for each row.
Returns
-------
mad : ndarray
Median absolute deviation along the specified axis.
The shape of the output is the same as the input array with the
specified axis removed.
Notes
-----
The MAD is computed as:
MAD = median(|X - median(X)|)
This function first computes the median along the specified axis, then
computes the absolute deviations from that median, and finally returns
the median of those absolute deviations.
Examples
--------
>>> import numpy as np
>>> X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> median_absdev(X, axis=0)
array([3., 3., 3.])
>>> # For a single column
>>> x = np.array([1, 2, 3, 4, 5, 100]) # 100 is an outlier
>>> median_absdev(x)
1.5 # Robust to the outlier
"""
med = np.median(X, axis=axis)
return np.median(np.abs(X - med), axis=axis)scipy.stats.median_abs_deviation() offers a function for computing the MAD. Besides an axis parameter, it proposes nan_policy parameter.