Memento

A mathematical and statistical memento, under construction...

This memento is a compilation of mathematical and statistical topics drawn from various online sources, including PlanetMath and Wikipedia.

Likelihood

Let $ \mathbf{X}=(X_1,\ldots,X_n) $ be a random vector and

$$\lbrace f_{\mathbf{X}}(\boldsymbol{x}\mid\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta \rbrace$$

a statistical model parametrized by $ \boldsymbol{\theta}=(\theta_1,\ldots,\theta_k) $, the parameter vector in the parameter space $ \Theta $. The likelihood function is a map $ L: \Theta \to \mathbb{R} $ given by

$$L(\boldsymbol{\theta}\mid\boldsymbol{x}) = f_{\mathbf{X}}(\boldsymbol{x}\mid\boldsymbol{\theta}).$$

In other words, the likelihood function has the same functional form as the probability density function, but the emphasis shifts from $ \boldsymbol{x} $ to $ \boldsymbol{\theta} $: the pdf is a function of the $ x $'s with the parameters $ \theta $ held constant, whereas $ L $ is a function of the parameters $ \theta $ with the $ x $'s held constant.

When there is no confusion, $ L(\boldsymbol{\theta}\mid\boldsymbol{x}) $ is abbreviated to be $ L(\boldsymbol{\theta}) $.

The parameter vector $ \hat{\boldsymbol{\theta}} $ such that $ L(\hat{\boldsymbol{\theta}})\geq L(\boldsymbol{\theta}) $ for all $ \boldsymbol{\theta}\in\Theta $ is called a maximum likelihood estimate, or MLE, of $ \boldsymbol{\theta} $.

Since many density functions are exponential in form, it is often easier to compute the MLE of a likelihood function $ L $ by maximizing the natural logarithm of $ L $, known as the log-likelihood function:

$$\ell(\boldsymbol{\theta}\mid\boldsymbol{x}) = \operatorname{ln}(L(\boldsymbol{\theta}\mid\boldsymbol{x}))$$

This works because the logarithm is strictly increasing, so $ L $ and $ \ell $ attain their maxima at the same $ \boldsymbol{\theta} $.

Examples:

  1. A coin is tossed $ n $ times and $ m $ heads are observed. Assume that the probability of a head after one toss is $ \pi $. What is the MLE of $ \pi $?

    Solution: Define the outcome of a toss to be 0 if a tail is observed and 1 if a head is observed. Next, let $ X_i $ be the outcome of the $ i $th toss. For any single toss, the density function is $ \pi^x(1-\pi)^{1-x} $ where $ x\in \lbrace 0,1\rbrace $. Assuming that the tosses are independent events, the joint probability density is

    $$f_{\mathbf{X}}(\boldsymbol{x}\mid\pi)=\binom{n}{\Sigma x_i}\pi^{\Sigma x_i}(1-\pi)^{\Sigma (1-x_i)}=\binom{n}{m}\pi^m(1-\pi)^{n-m},$$

    which is also the likelihood function $ L(\pi) $. Therefore, the log-likelihood function has the form

    $$\ell(\pi\mid\boldsymbol{x})=\ell(\pi)=\operatorname{ln}\binom{n}{m}+m\operatorname{ln}(\pi)+(n-m)\operatorname{ln}(1-\pi).$$

    Using standard calculus, we get that the MLE of $ \pi $ is

    $$\hat{\pi}=\frac{m}{n}=\overline{x}.$$
  2. Suppose a sample of $ n $ data points $ X_i $ is collected. Assume that $ X_i\sim N(\mu,\sigma^2) $ and that the $ X_i $'s are independent of each other. What is the MLE of the parameter vector $ \boldsymbol{\theta}=(\mu,\sigma^2) $?

    Solution: The joint pdf of the $ X_i $, and hence the likelihood function, is

    $$L(\boldsymbol{\theta}\mid\boldsymbol{x})=\frac{1}{\sigma^n(2\pi)^{n/2}}\operatorname{exp}(-\frac{\Sigma(x_i-\mu)^2}{2\sigma^2}).$$

    The log-likelihood function is

    $$\ell(\boldsymbol{\theta}\mid\boldsymbol{x})=-\frac{\Sigma(x_i-\mu)^2}{2\sigma^2}-\frac{n}{2}\operatorname{ln}(\sigma^2)-\frac{n}{2}\operatorname{ln}(2\pi).$$

    Taking the first derivative (gradient), we get

    $$\frac{\partial\ell}{\partial \boldsymbol{\theta}}=(\frac{\Sigma(x_i-\mu)}{\sigma^2},\frac{\Sigma(x_i-\mu)^2}{2\sigma^4}-\frac{n}{2\sigma^2}).$$

    Setting

    $$\frac{\partial\ell}{\partial \boldsymbol{\theta}}=\boldsymbol{0}$$

    (i.e., setting the score function to zero) and solving for $ \boldsymbol{\theta}=(\mu,\sigma^2) $, we get

    $$\boldsymbol{\hat{\theta}}=(\hat{\mu},\hat{\sigma}^2)=(\overline{x},\frac{n-1}{n}s^2),$$

    where $ \overline{x}=\Sigma x_i/n $ is the sample mean and $ s^2=\Sigma (x_i-\overline{x})^2/(n-1) $ is the sample variance. Finally, we verify that $ \hat{\boldsymbol{\theta}} $ is indeed the MLE of $ \boldsymbol{\theta} $ by checking that the Hessian of $ \ell $ is negative definite at $ \hat{\boldsymbol{\theta}} $. Both examples are checked numerically in the R sketches below.
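
As a quick numerical check of Example 1, the following R sketch maximizes the log-likelihood directly and compares the result with the closed form $ \hat{\pi}=m/n $ (the values of $ n $ and $ m $ are illustrative):

    # Example 1 (coin tosses): numerical MLE of pi versus the closed form m/n
    n <- 50; m <- 18                                   # illustrative data
    loglik <- function(p) lchoose(n, m) + m * log(p) + (n - m) * log(1 - p)
    fit <- optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)
    c(numerical = fit$maximum, closed_form = m / n)    # both approximately 0.36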
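
Similarly for Example 2, this minimal sketch maximizes the normal log-likelihood numerically and compares the result with $ (\overline{x},\frac{n-1}{n}s^2) $; the simulated data and optimizer settings are illustrative:

    # Example 2 (normal sample): numerical MLE of (mu, sigma^2) versus the closed form
    set.seed(1)
    x <- rnorm(100, mean = 2, sd = 3)                  # illustrative data
    negll <- function(theta)                           # theta = (mu, sigma^2)
      -sum(dnorm(x, mean = theta[1], sd = sqrt(theta[2]), log = TRUE))
    fit <- optim(c(0, 1), negll, method = "L-BFGS-B", lower = c(-Inf, 1e-6))
    rbind(numerical   = fit$par,
          closed_form = c(mean(x), (length(x) - 1) / length(x) * var(x)))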

Deviance

Background

In testing the fit of a generalized linear model $ \mathcal{P} $ of some data (with response variable Y and explanatory variable(s) X), one approach is to compare $ \mathcal{P} $ with a similar model $ \mathcal{P}_0 $. By similarity we mean: given $ \mathcal{P} $ with response variable $ Y_i\sim f_{Y_i} $ and link function $ g $ such that $ g(\operatorname{E}[Y_i])={\textbf{X}_i}^{\operatorname{T}}\boldsymbol{\beta} $, the model $ \mathcal{P}_0 $

  1. is a generalized linear model of the same data,
  2. has its response variable $ Y $ distributed as $ f_Y $, the same as in $ \mathcal{P} $,
  3. has the same link function $ g $ as in $ \mathcal{P} $, such that $ g(\operatorname{E}[Y_i])={\textbf{X}_i}^{\operatorname{T}}\boldsymbol{\beta_0} $.

Notice that the only possible difference is found in the parameters $ \boldsymbol{\beta} $.

It is desirable for this $ \mathcal{P}_0 $ to serve as a base model when more than one model is being assessed. Two possible candidates for $ \mathcal{P}_0 $ are the null model and the saturated model. The null model $ \mathcal{P}_{null} $ is one in which only one parameter $ \mu $ is used, so that $ g(\operatorname{E}[Y_i])=\mu $ and all responses have the same predicted outcome. The saturated model $ \mathcal{P}_{max} $ is the other extreme, where the maximum number of parameters is used so that the observed response values equal the predicted response values exactly: $ g(\operatorname{E}[Y_i])={\textbf{X}_i}^{\operatorname{T}}\boldsymbol{\beta}_{max}=y_i $.

Definition
The deviance of a model $ \mathcal{P} $ (generalized linear model) is given by

$$\operatorname{dev}(\mathcal{P})=2\big[\ell(\hat{\boldsymbol{\beta}}_{max}\mid\textbf{y})-\ell(\hat{\boldsymbol{\beta}}\mid\textbf{y})\big],$$

where $ \ell $ is the log-likelihood function, $ \hat{\boldsymbol{\beta}} $ is the MLE of the parameter vector $ \boldsymbol{\beta} $ from $ \mathcal{P} $ and $ \hat{\boldsymbol{\beta}}_{max} $ is the MLE of parameter vector $ \boldsymbol{\beta}_{max} $ from the saturated model $ \mathcal{P}_{max} $.

Example
For a normal or general linear model, the link function is the identity:

$$\operatorname{E}[Y_i]={\textbf{x}_i}^{\operatorname{T}}\boldsymbol{\beta},$$

where the $ Y_i $'s are mutually independent and normally distributed as $ N(\mu_i,\sigma^2) $. The log-likelihood function is given by

$$\ell(\boldsymbol{\beta}\mid\textbf{y})=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2-\frac{n\operatorname{ln}(2\pi\sigma^2)}{2},$$

where $ \mu_i={\textbf{x}_i}^{\operatorname{T}}\boldsymbol{\beta} $ is the predicted response value and $ n $ is the number of observations.

For the model in question, suppose $ \hat{\mu}_i={\textbf{X}_i}^{\operatorname{T}}\hat{\boldsymbol{\beta}} $ is the expected mean calculated from the maximum likelihood estimate $ \hat{\boldsymbol{\beta}} $ of the parameter vector $ \boldsymbol{\beta} $. So,

$$\ell(\hat{\boldsymbol{\beta}}\mid\textbf{y})=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\hat{\mu}_i)^2-\frac{n\operatorname{ln}(2\pi\sigma^2)}{2}.$$

For the saturated model $ \mathcal{P}_{max} $, the predicted value $ (\hat{\mu}_{max})_i $ equals the observed response value $ y_i $. Therefore,

$$\ell(\hat{\boldsymbol{\beta}}_{max}\mid\textbf{y})=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-(\hat{\mu}_{max})_i)^2-\frac{n\operatorname{ln}(2\pi\sigma^2)}{2}=-\frac{n\operatorname{ln}(2\pi\sigma^2)}{2}.$$

So the deviance is

$$\operatorname{dev}(\mathcal{P})=2\big[\ell(\hat{\boldsymbol{\beta}}_{max}\mid\textbf{y})-\ell(\hat{\boldsymbol{\beta}}\mid\textbf{y})\big]=\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\hat{\mu}_i)^2,$$

which is exactly the residual sum of squares, or RSS, used in regression models, scaled by $ 1/\sigma^2 $.
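
This identity can be seen directly in R: for the Gaussian family, the deviance reported by glm() is the unscaled version, i.e. the RSS without the $ 1/\sigma^2 $ factor. A minimal sketch with simulated (illustrative) data:

    # Gaussian GLM: glm()'s reported deviance equals the residual sum of squares
    set.seed(2)
    x <- runif(40)
    y <- 1 + 2 * x + rnorm(40, sd = 0.5)               # illustrative data
    fit <- glm(y ~ x, family = gaussian)
    c(deviance = deviance(fit), RSS = sum((y - fitted(fit))^2))   # identical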

Remarks

  • The deviance is necessarily non-negative.
  • The distribution of the deviance is asymptotically a chi square distribution with $ n-p $ degrees of freedom, where $ n $ is the number of observations and $ p $ is the number of parameters in the model $ \mathcal{P} $.
  • If two generalized linear models $ \mathcal{P}_1 $ and $ \mathcal{P}_2 $ are nested, say $ \mathcal{P}_1 $ is nested within $ \mathcal{P}_2 $, we can test $ H_0 $: the model for the data is $ \mathcal{P}_1 $ with $ p_1 $ parameters, against $ H_1 $: the model for the data is the more general $ \mathcal{P}_2 $ with $ p_2 $ parameters, where $ p_1<p_2 $. The deviance difference $ \Delta $(dev)$ =\operatorname{dev}(\mathcal{P}_1)-\operatorname{dev}(\mathcal{P}_2) $ can be used as a test statistic; it approximately follows a chi square distribution with $ p_2-p_1 $ degrees of freedom (see the R sketch below).
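
The nested-model comparison in the last remark can be sketched in R with two Poisson GLMs on simulated data (the covariates and sample size are illustrative); anova() reports the deviance difference and the corresponding chi-square p-value:

    # Nested Poisson GLMs: deviance difference compared against chi-square(p2 - p1)
    set.seed(3)
    x1 <- runif(100); x2 <- runif(100)
    y  <- rpois(100, lambda = exp(0.5 + x1))           # x2 has no real effect
    fit1 <- glm(y ~ x1,      family = poisson)         # P1, with p1 parameters
    fit2 <- glm(y ~ x1 + x2, family = poisson)         # P2, with p2 parameters
    anova(fit1, fit2, test = "Chisq")                  # dev(P1) - dev(P2) on p2 - p1 df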

Sums of squares (SS)

Four Types of Sums of Squares for ANOVA Effects (copyright 2006, Karl L. Wuensch)

  • Type I SS are order-dependent (hierarchical, sequential); each effect is adjusted for all other effects that appear earlier (to the left) in the model, but not for any effects that appear later in the model. Type I SS are computed as the decrease in the Error SS (increase in the Model SS) when the effect is added to a model. The sum of all of the effects' SS will equal the total Model SS for Type I SS -- this is not generally true for the other types of SS (which exclude some or all of the variance that cannot be unambiguously allocated to one and only one effect). Type I SS are appropriate for balanced (orthogonal, equal n) analyses of variance in which the effects are specified in proper order (main effects, then two-way interactions, then three-way interactions, etc.) and for trend analysis where the powers for the quantitative factor are ordered from lowest to highest in the model statement.
  • Type II SS are the reduction in the SSE due to adding the effect to a model that contains all other effects except effects that contain the effect being tested. An effect is contained in another effect if it can be derived by deleting terms in that effect.
  • Type III SS are each adjusted for all other effects in the model, regardless of order. They are strongly recommended in nonorthogonal ANOVA. Type IV SS are identical to Type III SS for designs with no missing cells. (An R illustration of the different SS types follows this list.)
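
A small R illustration of how the SS types differ on an unbalanced (nonorthogonal) two-way design: anova() gives the sequential Type I SS, while the Anova() function from the car package (assumed to be installed) gives Type II and Type III SS. The simulated data are illustrative.

    # Type I versus Type II/III SS in an unbalanced two-way ANOVA
    set.seed(4)
    d <- data.frame(A = factor(rep(c("a1", "a2"), times = c(12, 8))),
                    B = factor(rep(c("b1", "b2", "b1", "b2"), times = c(7, 5, 3, 5))))
    d$y <- rnorm(20, mean = as.numeric(d$A) + as.numeric(d$B))
    fit <- lm(y ~ A * B, data = d,
              contrasts = list(A = "contr.sum", B = "contr.sum"))
    anova(fit)                     # Type I (sequential) SS -- order-dependent
    car::Anova(fit, type = "II")   # Type II SS
    car::Anova(fit, type = "III")  # Type III SS (needs sum-to-zero contrasts, set above)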

Conjugate distributions

In Bayesian probability theory, a class of prior probability distributions $ p(\theta) $ is said to be conjugate to a class of likelihood functions $ p(x|\theta) $ if the resulting posterior distributions $ p(\theta|x) $ are in the same family as $ p(\theta) $ (see the table of conjugate pairs below).

\input{conjugates_table}
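
As a concrete instance, the beta family is conjugate to the binomial likelihood: a Beta($ a,b $) prior on $ \pi $ combined with $ m $ heads in $ n $ tosses yields a Beta($ a+m,b+n-m $) posterior. A minimal R check on a grid, with illustrative prior and data values:

    # Conjugacy check: Beta prior x binomial likelihood -> Beta posterior
    a <- 2; b <- 2                                     # Beta(a, b) prior on pi
    n <- 50; m <- 18                                   # data: m heads in n tosses
    p <- seq(0.005, 0.995, by = 0.005)
    unnorm <- dbeta(p, a, b) * dbinom(m, n, p)         # prior times likelihood
    post   <- unnorm / sum(unnorm * 0.005)             # normalize on the grid
    max(abs(post - dbeta(p, a + m, b + n - m)))        # small (grid error only)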

Wishart and inverted-Wishart distributions

Let $ X_i \sim N_n(\mu_i,\Sigma),\quad i=1,\ldots, p $ be independent $ n $-dimensional multivariate normally distributed random variables. Let $ S= \sum_{i=1}^p X_i X_i^{'} $, and let $ M $ be the $ p\times n $ matrix with $ \mu_1, \ldots, \mu_p $ as rows.
Then the joint distribution of the elements of $ S $ is said to be a Wishart distribution on $ p $ degrees of freedom, and is denoted by $ W_n(p,\Sigma,M) $. If $ M=0 $, the distribution is central and is denoted by $ W_n(p,\Sigma) $.

$ W_n $ has a density function when $ p \geq n $:

$$f_{\mathbf W}(w)=\frac{\left|w\right|^{(p-n-1)/2}\left|{\mathbf \Sigma}\right|^{-p/2}e^{\left[ - {\rm tr}({\mathbf \Sigma}^{-1}w/2 )\right]}}{2^{np/2}\Gamma_n(p/2)}$$

$ \Gamma_n(\cdot) $ is the multivariate gamma function.
The Wishart distribution is a multivariate generalization of the $ \chi^2 $ distribution.
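
Base R can draw from the central Wishart via rWishart(); in the notation above the sampled matrices are $ n\times n $ on $ p $ degrees of freedom, and $ \operatorname{E}[S]=p\Sigma $. A minimal sketch with an illustrative $ \Sigma $:

    # Central Wishart: sample with rWishart() and check that E[S] = p * Sigma
    Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)           # illustrative 2 x 2 scale matrix
    p <- 10                                            # degrees of freedom
    draws <- rWishart(5000, df = p, Sigma = Sigma)     # a 2 x 2 x 5000 array
    apply(draws, c(1, 2), mean)                        # close to p * Sigma
    p * Sigma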

Inverse-Wishart

The inverse-Wishart distribution is used as the conjugate prior for the covariance matrix of a multivariate normal distribution.

$ \mathbf{B}\sim W^{-1}({\mathbf\Psi},m) $ if its probability density function is written as follows:

$$ p(\mathbf{B}|\mathbf{\Psi},m) = \frac{\left|{\mathbf\Psi}\right|^{m/2}\left|\mathbf B\right|^{-(m+p+1)/2}e^{-\mathrm{tr}({\mathbf\Psi}{\mathbf B}^{-1})/2}}{2^{mp/2}\Gamma_p(m/2)}$$

where $ \mathbf{B} $ is a $ p\times p $ matrix. The matrix $ \mathbf{\Psi} $ is assumed to be positive definite.

If $ {\mathbf A}\sim W({\mathbf\Sigma},m) $ and $ {\mathbf\Sigma} $ is $ p \times p $, then $ {\mathbf B}={\mathbf A}^{-1} $ has an inverse Wishart distribution $ {\mathbf B}\sim W^{-1}(\mathbf{\Psi},m) $ with $ \mathbf{\Psi} = \mathbf{\Sigma}^{-1} $.

Gaussian Conjugate

Let $ \mathbf{\Sigma} $ be a covariance matrix whose prior $ p(\mathbf{\Sigma}) $ is a $ W^{-1}({\mathbf\Psi},m) $ distribution. If the observations $ \mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_n] $ are independent $ p $-variate Gaussian variables drawn from a $ N(\mathbf{0},{\mathbf \Sigma}) $ distribution, then the conditional distribution $ {p(\mathbf{\Sigma}|\mathbf{X})} $ has a $ W^{-1}({\mathbf A}+{\mathbf\Psi},n+m) $ distribution, where $ {\mathbf{A}}=\mathbf{X}\mathbf{X}^T $ is $ n $ times the sample covariance matrix.

Because the prior and posterior distributions are from the same family, the inverse Wishart distribution is conjugate to the multivariate Gaussian.
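
A minimal R sketch of this update: simulate $ n $ zero-mean $ p $-variate Gaussian observations, form $ \mathbf{A}=\mathbf{X}\mathbf{X}^T $, and read off the posterior parameters $ ({\mathbf A}+{\mathbf\Psi},n+m) $. The prior scale, degrees of freedom and "true" covariance below are illustrative; the last line uses the inverse-Wishart mean $ \mathbf{\Psi}^*/(\nu-p-1) $.

    # Inverse-Wishart conjugate update for the covariance of N(0, Sigma) data (sketch)
    p <- 3; m <- 7; n <- 200
    Psi   <- diag(p)                                   # prior scale matrix (illustrative)
    Sigma <- diag(p) + 0.5                             # "true" covariance used to simulate
    X     <- t(chol(Sigma)) %*% matrix(rnorm(p * n), p, n)   # p x n data matrix
    A     <- X %*% t(X)                                # n times the sample covariance
    post_scale <- Psi + A                              # posterior: W^{-1}(Psi + A, m + n)
    post_df    <- m + n
    post_scale / (post_df - p - 1)                     # posterior mean, close to Sigma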

Gamma distributions in R and BUGS

Gamma

BUGS:

$$x\sim dgamma(r,\mu) =\frac{1}{\Gamma(r)}\mu^rx^{r-1}e^{-\mu x};~x>0$$

R:

$$dgamma(x, shape = a, scale = s)=\frac{1}{\Gamma(a)}\frac{1}{s^a}x^{a-1}e^{-\frac{x}{s}}$$
$$E(X)=as; Var(X)=as^2$$

Correspondence:

$$\mu=\frac{1}{s}$$
$$r=a$$

If $ a $ is an integer, the distribution is that of the sum of $ a $ independent exponentially distributed random variables, each of which has mean $ s $ (equivalently, rate parameter $ s^{-1} $).
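
The correspondence can be checked directly, since R's dgamma() accepts either a rate or a scale argument: the BUGS density dgamma($ r,\mu $) is R's dgamma(x, shape = r, rate = mu). The values below are illustrative:

    # BUGS dgamma(r, mu) corresponds to R's shape/rate (or shape with scale = 1/mu)
    r <- 3; mu <- 2; x <- 1.7                          # illustrative values
    dgamma(x, shape = r, rate = mu)                    # BUGS-style parametrization
    dgamma(x, shape = r, scale = 1 / mu)               # identical value
    c(mean = r / mu, variance = r / mu^2)              # E(X) = a*s, Var(X) = a*s^2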

Inverse gamma

Definition

$ x\sim\mathcal{IG}(\alpha,\beta) $

$$f(x)=\frac{1}{\Gamma(\alpha)}\beta^\alpha\frac{1}{x^{\alpha+1}}e^{\frac{-\beta}{x}}$$

Derivation from $ \Gamma $ distribution

If $ x\sim\Gamma(a,s) $ with shape $ a $ and scale $ s $ (the R parametrization above), then $ y=\frac{1}{x} $ follows $ \mathcal{IG}(\alpha,\beta) $ with

$$\alpha=a$$
$$\beta=\frac{1}{s}$$
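
A quick simulation check of this change of variables in R, comparing the simulated mean of $ 1/x $ with the analytic inverse-gamma mean $ \beta/(\alpha-1) $ (the shape and scale values are illustrative):

    # If x ~ Gamma(shape = a, scale = s), then y = 1/x ~ IG(alpha = a, beta = 1/s)
    a <- 3; s <- 0.5                                   # illustrative shape and scale
    y <- 1 / rgamma(1e5, shape = a, scale = s)
    c(simulated_mean = mean(y),
      analytic_mean  = (1 / s) / (a - 1))              # E[y] = beta / (alpha - 1) = 1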