Chapter 6 of TPE (Theory of Point Estimation) studies the large-sample behavior of estimators, in particular consistency, asymptotic normality and asymptotic efficiency as the sample size grows. Here, I aim to apply these results in the setting of linear models.

Overview of the results on asymptotics in TPE

Theorem 6.3.7 establishes the existence of a consistent sequence of roots of the log-likelihood in a one-parameter family of distributions (i.e. a sequence of estimators $\hat{\theta}_n$ such that $\hat{\theta}_n \rightarrow \theta$ in probability, where $\theta$ denotes the true parameter to be estimated). The theorem does not tell us how exactly to pick a consistent sequence, unless the log-likelihood has a unique root for all $n$, in which case the sequence of maximum likelihood estimators is consistent (see Corollary 6.3.8).

Theorem 6.3.10 extends the previous result by establishing that any consistent sequence of roots of the log-likelihood is asymptotically normal with asymptotic variance $\frac{1}{\mathcal{I}(\theta)}$, the inverse of the Fisher information. Such a sequence $\hat{\theta}_n$ is called an efficient likelihood estimator. In particular, $\hat{\theta}_n$ is asymptotically efficient, meaning that it attains the Cramér–Rao lower bound asymptotically.
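In symbols, for such a sequence $\hat{\theta}_n$ (and under the regularity conditions of the theorem),

$$\sqrt{n}\,(\hat{\theta}_n - \theta) \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}\left(0, \frac{1}{\mathcal{I}(\theta)}\right).$$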

Finally, Theorem 6.5.1 extends the previous results to the multi-parameter case. When the true parameter $\theta$ is a vector, there exist solutions $\hat{\theta}_n$ to the likelihood equation (under some additional regularity conditions), such that $\hat{\theta}_{jn} \rightarrow \theta_j$ in probability and $\sqrt{n}(\hat{\theta}_n - \theta) \overset{\mathcal{L}}{\rightarrow} \mathcal{N}(0,\mathcal{I}(\theta)^{-1})$. In particular, $\hat{\theta}_{jn}$ is asymptotically efficient for each $j$.

Additionally, TPE presents multiple results on how to find a consistent estimator when the log-likelihood has multiple roots. For the one-parameter case see Theorem 6.4.3, Corollary 6.4.4 and Examples 6.4.6, 6.4.7 and 6.4.10, as well as other results in Section 4 of Chapter 6 in TPE. These results are extended to the multi-parameter situation in Theorem 6.5.3 and Corollary 6.5.4.

Application to a stochastic linear model

The usual formulation of a linear model is $y_i = x_i^T \beta + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. error terms, and the $x_i$ are (deterministic) measurements of the predictor variables for the $i$th subject. The above results on asymptotics cannot be applied to this model, because the observations $y_i$ are not identically distributed. However, if we consider the $x_i$ to be stochastic and i.i.d., then the tuples $(y_i, x_i)$ are i.i.d. observations of a random vector $(y, x)$, and consequently the above theory is applicable.

Thus, we consider the following model formulation:

$$ \begin{eqnarray} y_i | x_i &=& x_i^T \beta + \varepsilon_i, \quad y_i, \varepsilon_i \in \mathbb{R}, x_i, \beta \in \mathbb{R}^p, \nonumber \\ x_i &\sim& \mathrm{i.i.d.}\,F_{\xi}, \quad \varepsilon_i \sim \mathrm{i.i.d.}\,\mathcal{N}(0,\sigma^2), \quad x_i \,\mathrm{and}\, \varepsilon_i \,\mathrm{are\,independent}, \nonumber \\ \mathrm{E}(x_i) &=& \int z \,\mathrm{d}F_{\xi}(z) \,\mathrm{and}\, \mathrm{E}(x_i x_i^T) = \int zz^T \,\mathrm{d}F_{\xi}(z) \,\mathrm{both\,exist}. \nonumber \end{eqnarray} $$

Writing $y = (y_1, \dots, y_N)^T \in \mathbb{R}^{N}$, $X = (x_1, \dots, x_N)^T \in \mathbb{R}^{N\times p}$, and letting $f(\cdot\,; \xi)$ denote the density of $F_{\xi}$, it follows that the log-likelihood is given by

$$l(\beta, \sigma^2; \xi) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \sum_{i=1}^N \log f(x_i; \xi).$$
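As a quick sanity check, the $(\beta, \sigma^2)$-part of this expression can be compared against a sum of Gaussian log-densities. The following minimal Python sketch does exactly that; the data are simulated, and taking a standard-normal distribution for $F_{\xi}$ together with arbitrary parameter values is an assumption made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data (assumption for illustration: F_xi = standard normal, p = 3)
N, p = 200, 3
beta_true = np.array([1.0, -2.0, 0.5])
sigma2_true = 1.5
X = rng.standard_normal((N, p))
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)

def gaussian_loglik(beta, sigma2, y, X):
    """(beta, sigma^2)-part of the log-likelihood, i.e. without the term sum_i log f(x_i; xi)."""
    N = len(y)
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * N * np.log(sigma2) - rss / (2 * sigma2)

# The formula agrees with the sum of N(x_i^T beta, sigma^2) log-densities of the y_i
ll_formula = gaussian_loglik(beta_true, sigma2_true, y, X)
ll_direct = stats.norm.logpdf(y, loc=X @ beta_true, scale=np.sqrt(sigma2_true)).sum()
print(np.isclose(ll_formula, ll_direct))  # True
```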

The log-likelihood is additively separable: $\xi$ enters only through the last term, while $(\beta, \sigma^2)$ enter only through the first three. Consequently, the mixed second derivatives of $l$ with respect to $\xi$ and $(\beta, \sigma^2)$ vanish, the information matrix is block-diagonal in $\xi$ and $(\beta, \sigma^2)$, and the estimation of $(\beta, \sigma^2)$ and $\xi$ can be performed separately without loss of efficiency.

Maximum likelihood estimators and information matrix

Using standard rules for the differentiation of scalar forms with respect to vectors and matrices, we arrive at

$$ \begin{eqnarray} \frac{\partial l}{\partial \beta} &=& -\frac{1}{\sigma^2} (X^T X \beta - X^T y), \nonumber \\ \frac{\partial l}{\partial \sigma^2} &=& -\frac{N}{2\sigma^2} + \frac{\|y - X\beta\|_2^2}{2(\sigma^2)^2}, \nonumber \\ \frac{\partial^2 l}{\partial \beta \partial \beta^T} &=& -\frac{1}{\sigma^2} X^T X, \nonumber \\ \frac{\partial^2 l}{\partial \beta \partial \sigma^2} &=& \frac{\partial^2 l}{\partial \sigma^2 \partial \beta} = \frac{1}{(\sigma^2)^2} (X^T X \beta - X^T y), \nonumber \\ \frac{\partial^2 l}{\partial \sigma^2 \partial \sigma^2} &=& \frac{N}{2(\sigma^2)^2} - \frac{\|y - X\beta\|_2^2}{(\sigma^2)^3}. \nonumber \end{eqnarray} $$
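Sign and factor mistakes are easy to make here, so a finite-difference comparison of the analytic first derivatives is a cheap safeguard. The sketch below is again only illustrative (simulated data, arbitrary parameter values); it checks $\frac{\partial l}{\partial \beta}$ and $\frac{\partial l}{\partial \sigma^2}$ against central differences of the Gaussian part of the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3
beta0, sigma2_0 = np.array([1.0, -2.0, 0.5]), 1.5
X = rng.standard_normal((N, p))
y = X @ beta0 + rng.normal(scale=np.sqrt(sigma2_0), size=N)

def loglik(theta):
    # theta = (beta, sigma^2); Gaussian part only, since the xi-term does not depend on (beta, sigma^2)
    beta, sigma2 = theta[:-1], theta[-1]
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * N * np.log(sigma2) - rss / (2 * sigma2)

def analytic_grad(theta):
    beta, sigma2 = theta[:-1], theta[-1]
    g_beta = -(X.T @ X @ beta - X.T @ y) / sigma2
    g_sigma2 = -N / (2 * sigma2) + np.sum((y - X @ beta) ** 2) / (2 * sigma2 ** 2)
    return np.append(g_beta, g_sigma2)

def numeric_grad(f, theta, h=1e-5):
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (f(theta + e) - f(theta - e)) / (2 * h)  # central differences
    return g

# Check at a point away from the true parameter so that the gradient is not close to zero
theta = np.append(beta0 + 0.3, 1.2 * sigma2_0)
print(np.allclose(analytic_grad(theta), numeric_grad(loglik, theta), rtol=1e-4))  # True
```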

Setting the first derivatives to zero, it follows that the maximum likelihood estimators are

$$ \begin{equation} \label{LSE} \hat{\beta} = (X^T X)^{-1} X^T y, \end{equation} $$

which is also known as the least squares estimator of $\beta$, and

$$ \begin{equation} \label{MSE} \hat{\sigma}^2 = \frac{\|y - X\hat{\beta}\|_2^2}{N}. \end{equation} $$
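As a concrete illustration, the following sketch computes ($\ref{LSE}$) and ($\ref{MSE}$) on simulated data (again with the purely illustrative choice of a standard-normal $F_{\xi}$) and confirms that $\hat{\beta}$ coincides with the output of a generic least squares solver.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from the stochastic linear model (F_xi taken to be standard normal for illustration)
N, p = 500, 3
beta_true = np.array([1.0, -2.0, 0.5])
sigma2_true = 1.5
X = rng.standard_normal((N, p))
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)

# Maximum likelihood / least squares estimators
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # (X^T X)^{-1} X^T y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / N    # ||y - X beta_hat||_2^2 / N

print(beta_hat, sigma2_hat)                          # close to beta_true and sigma2_true
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```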

In order to obtain the information matrix we take expectations of the second derivatives:

$$ \begin{eqnarray} \mathrm{E}\left(\frac{\partial^2 l}{\partial \beta \partial \beta^T}\right) &=& -\frac{1}{\sigma^2} \sum_{i=1}^N \mathrm{E}(x_i x_i^T) = - \frac{N}{\sigma^2} \int zz^T \,\mathrm{d}F_{\xi}(z), \nonumber \\ \mathrm{E}\left(\frac{\partial^2 l}{\partial \beta \partial \sigma^2}\right) &=& \frac{1}{(\sigma^2)^2} \left[ \mathrm{E}(X^T X \beta) - \mathrm{E}(\mathrm{E}(X^T y \,|\, X))\right] \nonumber \\ &=& \frac{1}{(\sigma^2)^2} \left[ \mathrm{E}(X^T X \beta) - \mathrm{E}(X^T X \beta) \right] = 0, \nonumber \\ \mathrm{E}\left(\frac{\partial^2 l}{\partial \sigma^2 \partial \sigma^2}\right) &=& \frac{N}{2(\sigma^2)^2} - \frac{\mathrm{E}(\mathrm{E}(\|y - X\beta\|_2^2 \,|\, X))}{(\sigma^2)^3} \nonumber \\ &=& \frac{N}{2(\sigma^2)^2} - \frac{\mathrm{E}(\|X \beta + \varepsilon - X\beta\|_2^2)}{(\sigma^2)^3} \nonumber \\ &=& \frac{N}{2(\sigma^2)^2} - \frac{N\sigma^2}{(\sigma^2)^3} = - \frac{N}{2(\sigma^2)^2}. \nonumber \end{eqnarray} $$

Consequently, recalling that the information matrix is the negative of the expected Hessian, the information of $(\beta, \sigma^2)$ is given by

$$ \mathcal{I}\left( \begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix} \right) = \begin{pmatrix} \frac{N}{\sigma^2} \int zz^T \,\mathrm{d}F_{\xi}(z) & 0 \\ 0 & \frac{N}{2(\sigma^2)^2} \end{pmatrix}. $$
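In applications $F_{\xi}$ and $\sigma^2$ are unknown, but the information matrix can be estimated by plugging in the empirical second moment $\frac{1}{N} X^T X$ for $\int zz^T \,\mathrm{d}F_{\xi}(z)$ and $\hat{\sigma}^2$ for $\sigma^2$. A minimal sketch of such a plug-in estimate (continuing the simulated-data example above; all names are illustrative) is shown below; its smallest eigenvalue being positive also anticipates condition (C) discussed later.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 500, 3
beta_true = np.array([1.0, -2.0, 0.5])
sigma2_true = 1.5
X = rng.standard_normal((N, p))
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / N

# Plug-in estimate of the block-diagonal information matrix of (beta, sigma^2)
info_hat = np.zeros((p + 1, p + 1))
info_hat[:p, :p] = X.T @ X / sigma2_hat      # estimates (N / sigma^2) * Int z z^T dF_xi(z)
info_hat[p, p] = N / (2 * sigma2_hat ** 2)

print(np.linalg.eigvalsh(info_hat).min() > 0)  # True: the estimate is positive definite
```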

Consistency, asymptotic normality and efficiency

In the model considered above, $(\beta, \sigma^2)$ parametrize distinct distributions with a common support, and the observations $(y_i, x_i)$ are i.i.d. Therefore, we can use the standard asymptotic theory presented in Chapter 6 of TPE.

Assumptions

Still, in order to apply Theorem 6.5.1 in TPE, a number of additional conditions have to be verified. These are given as (A), (B), (C) and (D) in Section 5 of Chapter 6, and are checked in the following.

(A) This is the assumption that the parameter space contains an open neighborhood of the true parameter $(\beta, \sigma^2)$ on which $l$ is three times differentiable. Unless the model is degenerate (i.e. $\sigma^2 = 0$, so that the true parameter lies on the boundary of the parameter space), the assumption holds.

(B) Here we need to verify:

  1. $\mathrm{E}\left(\frac{\partial l}{\partial \beta}\right) = -\frac{1}{\sigma^2} \mathrm{E}(X^T X \beta - X^T y) = 0$.
  2. $\mathrm{E}\left(\frac{\partial l}{\partial \sigma^2}\right) = -\frac{N}{2\sigma^2} + \frac{\mathrm{E}\|y - X\beta\|_2^2}{2(\sigma^2)^2} = 0$.
  3. $\forall j,k \in \{1,\dots,p+1\} : \mathrm{E}\left(-\frac{\partial^2 l}{\partial \theta_j \partial \theta_k} \right) = \mathrm{E}\left(\frac{\partial l}{\partial \theta_j} \cdot \frac{\partial l}{\partial \theta_k} \right)$, where $\theta := (\beta^T, \sigma^2)^T$. This easily follows by observing that

    $$\frac{\partial^2 l}{\partial \theta_j \partial \theta_k} = \frac{\partial^2}{\partial \theta_j \partial \theta_k} \log(g) = \frac{\frac{\partial^2 g}{\partial \theta_j \partial \theta_k}}{g} - \frac{\frac{\partial g}{\partial \theta_j} \frac{\partial g}{\partial \theta_k}}{g^2} = \frac{\frac{\partial^2 g}{\partial \theta_j \partial \theta_k}}{g} - \frac{\partial l}{\partial \theta_j} \frac{\partial l}{\partial \theta_k},$$

    where $g$ denotes the joint density of the observations (so that $l = \log g$), and then taking expectations on both sides: under the regularity conditions, differentiation and integration can be interchanged, so that $\mathrm{E}\left(\frac{1}{g}\frac{\partial^2 g}{\partial \theta_j \partial \theta_k}\right) = 0$.

(C) The information matrix $\mathcal{I}\left( \begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix} \right)$ has to be positive definite. Being the covariance matrix of the score vector, it is always at least positive semi-definite. In practice, positive definiteness can be safely assumed, because a covariance matrix $\mathrm{Cov}(z)$ fails to be positive definite if and only if $a^T z$ is constant with probability 1 for some nonzero vector $a$; for the matrix above, this would require the predictors to satisfy an exact linear relation with probability 1.

(D) This condition essentially requires that the third derivatives of the log-likelihood be bounded, uniformly over a small neighborhood of the true parameter $(\beta, \sigma^2)$, by functions with finite expectation. Differentiating the second derivatives computed above once more, and using the moment assumptions on $F_{\xi}$ together with the fact that $\sigma^2$ is bounded away from zero on such a neighborhood, the condition is seen to be satisfied.

Conclusion

Thus, all assumptions of Theorem 6.5.1 in TPE are satisfied, and the following conclusions hold:

  1. The maximum likelihood estimators ($\ref{LSE}$) and ($\ref{MSE}$) are consistent for estimating $\beta$ and $\sigma^2$. That is, the well-known least squares estimator $\hat{\beta}$ converges in probability to the true parameter $\beta$, and $\hat{\sigma}^2$ converges in probability to the true $\sigma^2$.

  2. The derived maximum likelihood estimators ($\ref{LSE}$) and ($\ref{MSE}$) are asymptotically normal:

    $$\mathcal{I}\left( \begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix} \right)^{1/2} \left[ \begin{pmatrix} \hat{\beta} \\ \hat{\sigma}^2 \end{pmatrix} - \begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix} \right] \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}(0, I),$$

    where

    $$ \mathcal{I}\left( \begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix} \right) = \begin{pmatrix} \frac{N}{\sigma^2} \int zz^T \,\mathrm{d}F_{\xi}(z) & 0 \\ 0 & \frac{N}{2(\sigma^2)^2} \end{pmatrix}. $$

  3. The estimator $\hat{\sigma}^2$ as well as each estimator $\hat{\beta}_i$ (for $i\in\{1,\dots,p\}$) is asymptotically efficient in the sense that it attains its respective Cramér–Rao lower bound asymptotically. A small simulation check of these conclusions is sketched below.
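A quick Monte Carlo experiment makes the conclusions above concrete. The sketch below uses the same illustrative assumptions as before (standard-normal $F_{\xi}$, arbitrary $\beta$ and $\sigma^2$): it repeatedly simulates the model and standardizes $\hat{\beta}_1 - \beta_1$ by its asymptotic standard deviation $\sqrt{\sigma^2/N}$, which is what the inverse information gives when $\int zz^T \,\mathrm{d}F_{\xi}(z)$ is the identity. The empirical mean and variance of the standardized values should then be close to 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, reps = 400, 3, 2000
beta_true = np.array([1.0, -2.0, 0.5])
sigma2_true = 1.5

z = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((N, p))    # F_xi = N(0, I), so Int z z^T dF_xi(z) = I
    y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    # Standardize the first coordinate by its asymptotic standard deviation sqrt(sigma^2 / N)
    z[r] = (beta_hat[0] - beta_true[0]) / np.sqrt(sigma2_true / N)

print(z.mean(), z.var())  # approximately 0 and 1, respectively
```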

Alternative proof

In Section 3.6.2 of Demidenko (2013) "Mixed Models: Theory and Applications with R" (2nd ed.), consistency and asymptotic normality of the least squares estimator in a linear model with stochastic predictors $x_i$ are established based on the Law of Large Numbers and a multivariate version of the Central Limit Theorem.

Asymptotics of a deterministic linear model

Even though, for the reasons delineated above, the theory of Chapter 6 in TPE cannot be applied when the predictors $x_i$ are considered deterministic rather than random, asymptotic properties of the estimators can still be established.

  • Fahrmeir and Kaufmann (1985) "Consistency and Asymptotic Normality of the Maximum Likelihood Estimator in Generalized Linear Models" prove, under some regularity conditions, weak and strong consistency as well as asymptotic normality of the maximum likelihood estimator in generalized linear models, of which the linear model is a simple special case.

  • Lai, Robbins and Wei (1978) "Strong consistency of least squares estimates in multiple regression" prove that the least squares estimator $\hat{\beta} = (X^T X)^{-1} X^T y$ is strongly consistent under the condition $(X^T X)^{-1} \rightarrow 0$ as $N\rightarrow \infty$, assuming only independent errors with suitable moment conditions and no particular error distribution.

  • Section 13.1.1 in Demidenko (2013) "Mixed Models: Theory and Applications with R" (2nd ed.) presents the following proof of asymptotic normality of the least squares estimator.

    Consider the model $y_i = x_i^T \beta + \varepsilon_i$ for $i = 1,\ldots,N$, where $y_i, \varepsilon_i \in \mathbb{R}$ and $x_i, \beta \in \mathbb{R}^p$. Assume that the $\varepsilon_i$ are i.i.d. with $\mathrm{E}(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Assume further that $\lim_{N\to\infty} N^{-1} \sum_{i=1}^N x_i x_i^T = V$ for some positive definite matrix $V$, and that there is a constant $c$ such that $\|x_i\| \leq c$ for all $i$. Then, by a multivariate version of the Central Limit Theorem,

    $$ \begin{equation} \label{CLT} \frac{1}{\sqrt{N}} \sum_{i=1}^N x_i \varepsilon_i \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}(0, \sigma^2 V). \end{equation} $$

    The least squares estimator can be expressed as

    $$ \begin{equation} \label{OLS} \hat{\beta} = \left(\sum_{i=1}^N x_i x_i^T \right)^{-1} \left(\sum_{i=1}^N x_i y_i \right). \end{equation} $$

    Combining $(\ref{OLS})$ with the model equation gives $\hat{\beta} - \beta = \left(\sum_{i=1}^N x_i x_i^T\right)^{-1} \sum_{i=1}^N x_i \varepsilon_i$, so that, by $(\ref{CLT})$ and Slutsky's theorem,

    $$\sqrt{N}(\hat{\beta} - \beta) = \left(\frac{1}{N}\sum_{i=1}^N x_i x_i^T\right)^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^N x_i \varepsilon_i \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}\left(0, \sigma^2 V^{-1}\right).$$

    Informally, for large $N$,

    $$\hat{\beta} \overset{\mathrm{approx.}}{\sim} \mathcal{N}\left(\beta, \sigma^2 \left(\sum_{i=1}^N x_i x_i^T \right)^{-1} \right).$$

    A small numerical check of this approximation is sketched after this list.
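Below is the small numerical check of this approximation for a fixed (deterministic) design referenced above. The design and the error distribution are arbitrary choices made for illustration; note in particular that the errors are uniform rather than normal, so the approximate normality of $\hat{\beta}$ is genuinely an asymptotic effect.

```python
import numpy as np

rng = np.random.default_rng(5)
N, reps = 400, 2000
beta_true = np.array([1.0, -2.0])
sigma2 = 1.0 / 3.0                                     # variance of Uniform(-1, 1) errors

# Fixed, bounded design (intercept plus an equally spaced regressor), reused in every replication
X = np.column_stack([np.ones(N), np.linspace(-1.0, 1.0, N)])
v_slope = np.linalg.inv(X.T @ X)[1, 1]                 # ((X^T X)^{-1}) entry of the slope coefficient

z = np.empty(reps)
for r in range(reps):
    eps = rng.uniform(-1.0, 1.0, size=N)               # i.i.d. non-normal errors with mean 0
    y = X @ beta_true + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    z[r] = (beta_hat[1] - beta_true[1]) / np.sqrt(sigma2 * v_slope)

print(z.mean(), z.var())  # approximately 0 and 1, consistent with the normal approximation
```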

It is also worth pointing out that under a normality assumption on the error terms, the (small sample) distributions of the estimators $\hat{\beta}$ and $\hat{\sigma}^2$ can be derived exactly (in particular, $\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2 (X^T X)^{-1})$ and $\frac{\|y - X\hat{\beta}\|_2^2}{\sigma^2} \sim \chi_{N-p}^2$). The normality assumption is in fact standard in much of the statistical modeling literature. Moreover, under normality the estimators enjoy further optimality properties, such as uniform minimum variance unbiasedness and minimum risk equivariance.