The well-known least squares estimator (LSE) for the coefficients of a linear model is the "best" possible estimator according to several different criteria. Three types of such optimality conditions under which the LSE is "best" are discussed below. In the process, we also briefly look at the "best" estimators of the variance in a linear model.

Let's fix the concepts first, and then explore how they apply to LSE.

Some definitions and implications

UMVU estimators

As one would expect, a uniform minimum variance unbiased (or UMVU) estimator $\delta(x)$ of $g(\theta)$ is an unbiased estimator such that $\mathrm{Var}_{\theta}\, \delta(X) \leq \mathrm{Var}_{\theta}\, \delta^\prime(X)$ for any other unbiased estimator $\delta^\prime(x)$ of $g(\theta)$ and any $\theta\in\Omega$.

Invariance

Let $X \sim P_{\theta}$ for some $\theta\in\Omega$. That is, $X$ is distributed according to one of the distributions in the family $\mathcal{P} = \{ P_{\theta},\ \theta \in \Omega \}$. Let $G$ be a group of bijective transformations of the sample space of $X$ onto itself.

If for any $g\in G$ it holds that $gX \sim P_{\theta^\prime}$ for some $\theta^\prime \in \Omega$, and if as $\theta$ traverses $\Omega$ so does $\theta^\prime$, then $\mathcal{P}$ is invariant under $G$ (Definition 2.1 in Chapter 3 TPE).

The principle of invariance has some interesting implications:

  • If $G$ leaves $\mathcal{P}$ invariant, then for each $g\in G$ there is a bijective transformation $\bar{g}$ of $\Omega$ onto itself such that $\theta^\prime = \bar{g}\theta$. Such transformations $\bar{g}$ form a group $\overline{G}$, and we have that

    $$ \begin{eqnarray} \nonumber P_{\theta}(gX \in A) &=& P_{\bar{g}\theta}(X \in A), \\\ E_{\theta}\, \psi(gX) &=& E_{\bar{g}\theta}\, \psi(X), \label{eq:invariant} \end{eqnarray} $$

    for any function $\psi$ whose expectation is defined.

  • If $h(\bar{g}\theta)$ depends on $\theta$ only through $h(\theta)$, then there is a transformation $g^\ast$ such that $h(\bar{g}\theta) = g^\ast h(\theta)$ for all $\theta\in\Omega$.

  • (Definition 2.4 in Chapter 3 TPE) The problem of estimating $h(\theta)$ with the loss function $L$ is called invariant under $G$ if $h(\bar{g}\theta)$ depends on $\theta$ only through $h(\theta)$ and if $L(\bar{g}\theta, g^\ast d) = L(\theta, d)$ for all $\theta$, $d$ and $g\in G$.

Equivariant estimators

In an invariant estimation problem, an estimator $\delta(x)$ is equivariant if $$\delta(gx) = g^\ast \delta(x),$$ for all $g\in G$ (Definition 2.5 in Chapter 3 TPE). That is, the estimator $\delta$ respects the principle of invariance.

In particular, equation ($\ref{eq:invariant}$) implies that the risk function of any equivariant estimator is constant on orbits of the group of transformations $G$.

The least squares estimator is UMVU and MRE

Consider a linear model $y = X\beta + \varepsilon$, where $y\in\mathbb{R}^n$, $X\in\mathbb{R}^{n\times p}$ with $p < n$ and $\mathrm{rank}(X) = p$, $\beta\in\mathbb{R}^p$, and $\varepsilon_{i} \overset{\mathrm{i.i.d.}}{\sim} N(0,\sigma^2)$ for $i\in\{1,\dots,n\}$.

For convenience, denote $\xi := X\beta$. It holds that $\xi\in\Pi$, where $\Pi$ denotes the $p$-dimensional subspace of $\mathbb{R}^n$ spanned by the columns of $X$.

Orthogonal coordinate transformation

Consider the orthogonal transformation $z = Qy$, where $Q\in\mathbb{R}^{n\times n}$ is an orthogonal matrix whose first $p$ rows span $\Pi$. Denote by $\eta := Q\xi$ the expectation of $z$. It follows that $\eta_{p+1} = \dots = \eta_{n} = 0$. Thus we have that $z_{i} \sim N(\eta_{i}, \sigma^2)$ for $i=1,\dots,p$ and $z_{j} \sim N(0,\sigma^2)$ for $j=p+1,\dots,n$. Moreover, all entries of $z$ are independent.
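To make this construction concrete, here is a minimal numerical sketch (using numpy on a simulated design matrix; the dimensions and parameter values are arbitrary choices for illustration). It builds an orthogonal $Q$ whose first $p$ rows span $\Pi$ from the complete QR decomposition of $X$, and checks that $\eta_{p+1} = \dots = \eta_{n} = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0

X = rng.normal(size=(n, p))          # design matrix of full column rank p
beta = np.array([2.0, -1.0, 0.5])
xi = X @ beta                        # xi = E(y), lies in the column space Pi

# Complete QR decomposition: the first p columns of Q_full form an orthonormal
# basis of Pi, so Q := Q_full.T is orthogonal with its first p rows spanning Pi.
Q_full, _ = np.linalg.qr(X, mode="complete")
Q = Q_full.T

eta = Q @ xi                         # eta = Q xi, the mean of z = Q y
print(np.allclose(eta[p:], 0.0))     # True: eta_{p+1} = ... = eta_n = 0

y = xi + sigma * rng.normal(size=n)  # one draw from the linear model
z = Q @ y                            # independent N(eta_i, sigma^2) coordinates
```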

Writing out the multivariate normal density of $z$ makes it apparent that $(z_{1}, \dots, z_{p}, s^2)$, with $s^2 = \sum_{j=p+1}^n z_{j}^2$, is a complete sufficient statistic for $(\eta, \sigma^2)$.

It follows that $\sum_{i=1}^p \lambda_{i} z_{i}$ is UMVU for $\sum_{i=1}^p \lambda_{i} \eta_{i}$ and that $s^2 / (n-p)$ is UMVU for $\sigma^2$, because both estimators are unbiased and are functions of the complete sufficient statistic.
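As a quick sanity check, the following Monte Carlo sketch (numpy, with arbitrary values for $\eta$, $\lambda$ and $\sigma$) illustrates that $\sum_{i=1}^p \lambda_{i} z_{i}$ and $s^2/(n-p)$ are unbiased, and that an alternative linear estimator which also uses a $z_{j}$ with $j > p$ remains unbiased but has larger variance, as the UMVU property requires.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 20, 3, 2.0
eta = np.array([1.0, -2.0, 0.5])     # means of z_1, ..., z_p (arbitrary)
lam = np.array([0.3, 0.7, -1.0])     # coefficients lambda_1, ..., lambda_p
target = lam @ eta                   # estimand: sum_i lambda_i eta_i

reps = 200_000
z = sigma * rng.normal(size=(reps, n))
z[:, :p] += eta                      # z_i ~ N(eta_i, sigma^2), z_j ~ N(0, sigma^2)

umvu_lin = z[:, :p] @ lam            # UMVU estimator of sum_i lambda_i eta_i
alt_lin = umvu_lin + z[:, -1]        # still unbiased (E z_n = 0), but noisier
umvu_var = (z[:, p:] ** 2).sum(axis=1) / (n - p)   # s^2 / (n - p)

print(umvu_lin.mean(), alt_lin.mean(), target)     # both means close to the estimand
print(umvu_lin.var(), alt_lin.var())               # the UMVU estimator has smaller variance
print(umvu_var.mean(), sigma**2)                   # close to sigma^2
```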

Clearly, $\sum_{i=1}^p \lambda_{i} z_{i}$ is equivariant under the transformations

$$ \begin{eqnarray} z_{i}^\prime &=& z_{i} + a_{i},\ i = 1,\dots,p, \quad z_{j}^\prime = z_{j},\ j = p+1,\dots,n, \nonumber \\\ \eta_{i}^\prime &=& \eta_{i} + a_{i},\ i = 1,\dots,p, \quad \eta_{j}^\prime = \eta_{j},\ j = p+1,\dots,n, \nonumber \\\ d^\prime &=& d + \sum_{i = 1}^p a_{i} \lambda_{i}. \nonumber \end{eqnarray} $$

It follows that the estimator $\sum_{i=1}^p \lambda_{i} z_{i}$ is also the minimum risk equivariant (MRE) estimator of $\sum_{i=1}^p \lambda_{i} \eta_{i}$ (with the loss function $L(\eta, d) = \rho(d - \sum_{i=1}^p \lambda_{i} \eta_{i})$, where $\rho$ is convex and even). Moreover, it can be shown that $s^2 / (n-p+2)$ is MRE for $\sigma^2$ under the loss function $(d-\sigma^2)^2 / \sigma^4$ (see Problem 4.3 in Chapter 3 TPE).
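The effect of the divisor is easy to see in a small simulation (again a sketch, with arbitrary $n$, $p$ and $\sigma$): under the loss $(d-\sigma^2)^2/\sigma^4$ the risks of $s^2/(n-p)$ and $s^2/(n-p+2)$ are $2/(n-p)$ and $2/(n-p+2)$ respectively, so the MRE divisor wins despite introducing bias.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 15, 4, 1.5
reps = 500_000

# s^2 is distributed as sigma^2 times a chi-squared with n - p degrees of freedom
s2 = sigma**2 * rng.chisquare(df=n - p, size=reps)

def risk(divisor):
    """Monte Carlo risk of s^2 / divisor under the loss (d - sigma^2)^2 / sigma^4."""
    return ((s2 / divisor - sigma**2) ** 2 / sigma**4).mean()

print(risk(n - p))      # approx. 2 / (n - p)     = 0.1818...
print(risk(n - p + 2))  # approx. 2 / (n - p + 2) = 0.1538..., strictly smaller
```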

We refer to Theorem 4.3 in Chapter 3 TPE, and the references therein, for more rigour and detail.

UMVU and MRE estimators in the original space

We have derived the UMVU and MRE estimators in terms of $z$, the orthogonally transformed version of $y$. However, it is more useful to have UMVU and MRE estimators in terms of the original variables $y$.

As is well known, the least squares estimator of $\mathrm{E}(y) = \xi$ is given by $\hat{y} = X (X^T X)^{-1} X^T y$, the orthogonal projection of $y$ onto $\Pi$. It is obtained by minimizing the sum of squares $\|y - \xi\|_{2}^2 = \|y - X\beta\|_{2}^2$ over $\xi\in\Pi$. Since $Q$ is orthogonal and $\eta_{j} = 0$ for $j > p$, we have that

$$ \begin{equation} \label{eq:least_squares} \sum_{i=1}^n (y_{i} - \xi_{i})^2 = \sum_{i=1}^p (z_{i} - \eta_{i})^2 + \sum_{i=p+1}^n z_{i}^2. \end{equation} $$

Since the left hand side is minimized over $\xi\in\Pi$ by $\hat{y}$ and the right hand side is minimized by $\hat{\eta}_{i} = z_{i}$ for $i = 1,\dots,p$ (and $\hat{\eta}_{i} = 0$ for $i>p$), it holds that $\hat{y} = Q^T\hat{\eta}$. Thus, the LSE $\hat{y}$ is a linear function of $z$, and therefore the estimator $\sum_{i=1}^n \lambda_{i} \hat{y}_{i}$ is UMVU for $\sum_{i=1}^n \lambda_{i} \xi_{i}$ by the argument given above (namely because $\sum_{i=1}^p \lambda^\prime_{i} z_{i}$ is UMVU for $\sum_{i=1}^p \lambda^\prime_{i} \eta_{i}$, with $\lambda^\prime = Q\lambda$). For more detail see Theorem 4.4 in Chapter 3 TPE.

Likewise, it follows from the argument given above for the orthogonally transformed $z$ that the estimator $\sum_{i=1}^n \lambda_{i} \hat{y}_{i}$ is MRE for $\sum_{i=1}^n \lambda_{i} \xi_{i}$ under the transformations $y^\prime = y + b$ with $b\in\Pi$ and the loss function $L(\xi, d) = \rho(d - \sum_{i=1}^n \lambda_{i} \xi_{i})$, provided $\rho$ is convex and even. See Corollary 4.5 in Chapter 3 TPE for detail.

Similarly, rewriting $s^2 = \sum_{i=p+1}^n z_{i}^2 = \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2$ (take $\xi = \hat{y}$, and hence $\eta = \hat{\eta}$, in equation ($\ref{eq:least_squares}$)), it follows that the UMVU and MRE estimators of $\sigma^2$ are given by $\sum_{i=1}^n (y_{i} - \hat{y}_{i})^2 / (n-p)$ and $\sum_{i=1}^n (y_{i} - \hat{y}_{i})^2 / (n-p+2)$ respectively.

Finally, the LSE $\hat{\beta} = (X^T X)^{-1}X^T y$ is UMVU and MRE for $\beta$ by the same argument, because it is a linear function of $\hat{y}$: since $X^T \hat{y} = X^T y$, we have $\hat{\beta} = (X^T X)^{-1}X^T \hat{y}$.
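The identities used above are easy to verify numerically. The following sketch (numpy, simulated data, with $Q$ built from the complete QR decomposition of $X$ as before) checks that $\hat{y} = Q^T\hat{\eta}$, that $\sum_{i=1}^n (y_{i}-\hat{y}_{i})^2 = \sum_{j=p+1}^n z_{j}^2$, and that $\hat{\beta}$ can be recovered from $\hat{y}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.normal(size=(n, p))                      # full-column-rank design
y = X @ rng.normal(size=p) + rng.normal(size=n)  # one draw from the model

# LSE in the original coordinates: orthogonal projection of y onto Pi
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# Orthogonal coordinates: Q = Q_full.T, whose first p rows span Pi
Q_full, _ = np.linalg.qr(X, mode="complete")
z = Q_full.T @ y
eta_hat = np.concatenate([z[:p], np.zeros(n - p)])

print(np.allclose(y_hat, Q_full @ eta_hat))                          # y_hat = Q^T eta_hat
print(np.allclose(((y - y_hat) ** 2).sum(), (z[p:] ** 2).sum()))     # RSS = sum_{j>p} z_j^2
print(np.allclose(beta_hat, np.linalg.solve(X.T @ X, X.T @ y_hat)))  # beta_hat from y_hat
```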

The least squares estimator is BLUE

A best linear unbiased estimator (BLUE) is an unbiased estimator that is linear in $y$ and uniformly achieves the smallest variance among all linear unbiased estimators (i.e., it is UMVU within the class of linear estimators).

In the context of linear models, an advantage of this optimality criterion over UMVU and MRE is that it does not rely on the normality assumption. That is, we merely need to assume that $\mathrm{E}(y) = \xi = X\beta$ and $\mathrm{Cov}(y) = \sigma^2 I$, without any further distributional assumptions.

Assume we aim to estimate $\sum_{i=1}^n \lambda_{i} \xi_{i} = \lambda^T X\beta$. By linearity the estimator has the form $\delta(y) = a^T y$. Unbiasedness requires $a^T X \beta = \lambda^T X \beta$ for every $\beta\in\mathbb{R}^p$, which forces $X^T (a - \lambda) = 0$, i.e. $X^T a = X^T \lambda$. Among all such $a$ we minimize $\mathrm{Var}(a^T y) = \sigma^2 a^T a$. Introducing a vector $m\in\mathbb{R}^p$ of Lagrange multipliers, the stationarity conditions are

$$ \begin{eqnarray} a + Xm &=& 0 \nonumber \\\ X^T a &=& X^T \lambda. \nonumber \end{eqnarray} $$

It is easily verified that this system is solved by $a = X(X^T X)^{-1} X^T \lambda$ and $m = -(X^T X)^{-1} X^T \lambda$. The BLUE of $\lambda^T X\beta$ is therefore $a^T y = \lambda^T X(X^T X)^{-1} X^T y = \lambda^T \hat{y}$. In particular, taking $\lambda = X(X^T X)^{-1} c$ for $c\in\mathbb{R}^p$ shows that $c^T\hat{\beta}$ is BLUE for $c^T\beta$, i.e. $\hat{\beta} = (X^T X)^{-1} X^T y$ is BLUE for $\beta$.
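Below is a small numerical sketch (numpy, simulated design, arbitrary $\lambda$) that solves the stationarity system directly as one linear system, confirms the closed-form solution for $a$ and $m$, and checks that the resulting BLUE $a^T y$ coincides with $\lambda^T \hat{y}$, i.e. with plugging the LSE into the estimand.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = rng.normal(size=(n, p))
lam = rng.normal(size=n)             # arbitrary coefficient vector lambda

# Solve the stationarity system  a + X m = 0,  X^T a = X^T lambda
# as a single (n + p) x (n + p) linear system.
A = np.block([[np.eye(n), X], [X.T, np.zeros((p, p))]])
rhs = np.concatenate([np.zeros(n), X.T @ lam])
a, m = np.split(np.linalg.solve(A, rhs), [n])

# Closed-form solution from the text
XtX_inv_Xt_lam = np.linalg.solve(X.T @ X, X.T @ lam)
print(np.allclose(a, X @ XtX_inv_Xt_lam))
print(np.allclose(m, -XtX_inv_Xt_lam))

# The BLUE a^T y equals lambda^T y_hat, the LSE plugged into the estimand
y = X @ rng.normal(size=p) + rng.normal(size=n)
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(a @ y, lam @ y_hat))
```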

TPE takes a different approach to proving that the LSE is BLUE (see Theorem 4.12 in Chapter 3, which TPE calls Gauss' Theorem on Least Squares). Moreover, it follows that the LSE is also MRE among all linear estimators (see Corollary 4.13 in Chapter 3 TPE).