Memento
This memento is a compilation of mathematical and statistical subjects taken from various online sources, including PlanetMath and the Wikipedia encyclopedia.
Likelihood
Let $X=(X_1,\ldots,X_n)$ be a random vector and
$$\{f_{\boldsymbol{\theta}} : \boldsymbol{\theta}\in\Theta\}$$
a statistical model parametrized by $\boldsymbol{\theta}=(\theta_1,\ldots,\theta_k)$, the parameter vector in the parameter space $\Theta$. The likelihood function is a map $L:\Theta\to\mathbb{R}$ given by
$$L(\boldsymbol{\theta}\mid\boldsymbol{x}) = f_{\boldsymbol{\theta}}(\boldsymbol{x}).$$
In other words, the likelihood function is functionally the same in form as a probability density function. However, the emphasis is changed from the $\boldsymbol{x}$ to the $\boldsymbol{\theta}$. The pdf is a function of the $x_i$'s while holding the parameters $\theta_j$'s constant; $L$ is a function of the parameters $\theta_j$'s while holding the $x_i$'s constant.
When there is no confusion, $L(\boldsymbol{\theta}\mid\boldsymbol{x})$ is abbreviated to $L(\boldsymbol{\theta})$.
The parameter vector $\hat{\boldsymbol{\theta}}$ such that $L(\hat{\boldsymbol{\theta}}\mid\boldsymbol{x})\ge L(\boldsymbol{\theta}\mid\boldsymbol{x})$ for all $\boldsymbol{\theta}\in\Theta$ is called a maximum likelihood estimate, or MLE, of $\boldsymbol{\theta}$.
Many of the density functions are exponential in form, so it is often easier to compute the MLE by maximizing the natural logarithm of $L$, known as the log-likelihood function:
$$\ell(\boldsymbol{\theta}) = \ln L(\boldsymbol{\theta}),$$
which has the same maximizer as $L$ due to the monotonicity of the log function.
Examples:
- A coin is tossed $n$ times and $m$ heads are observed. Assume that the probability of a head after one toss is $p$. What is the MLE of $p$?
Solution: Define the outcome of a toss to be 0 if a tail is observed and 1 if a head is observed. Next, let $x_i$ be the outcome of the $i$th toss. For any single toss, the density function is $p^{x}(1-p)^{1-x}$, where $x\in\{0,1\}$. Assuming that the tosses are independent events, the joint probability density is
$$\prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{m}(1-p)^{n-m},$$
which is also the likelihood function $L(p)$. Therefore, the log-likelihood function has the form
$$\ell(p) = m\ln p + (n-m)\ln(1-p).$$
Using standard calculus (setting $\ell'(p) = m/p - (n-m)/(1-p) = 0$), we get that the MLE of $p$ is $\hat{p} = m/n$. (A numerical check in R follows these examples.)
- Suppose a sample of $n$ data points $x_1,\ldots,x_n$ is collected. Assume that the $x_i \sim N(\mu,\sigma^2)$ and that the $x_i$'s are independent of each other. What is the MLE of the parameter vector $\boldsymbol{\theta}=(\mu,\sigma^2)$?
Solution: The joint pdf of the $x_i$, and hence the likelihood function, is
$$L(\mu,\sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\Big).$$
The log-likelihood function is
$$\ell(\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.$$
Taking the first derivative (gradient), we get
$$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu), \qquad \frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2.$$
Setting $\nabla\ell = 0$ and solving for $(\mu,\sigma^2)$, we have
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2,$$
where $\bar{x}$ is the sample mean and $\hat{\sigma}^2$ is the (biased) sample variance. Finally, we verify that $(\hat{\mu},\hat{\sigma}^2)$ is indeed the MLE of $\boldsymbol{\theta}$ by checking the negativity of the second derivatives (for each parameter).
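Both closed-form answers can be checked numerically. The following R sketch (with made-up values for $n$, $m$, and the normal sample) maximizes each log-likelihood with R's general-purpose optimizers and compares the result with the analytic MLEs; it is an illustration only.

    # Example 1: coin tossing -- compare optimize() with the analytic MLE m/n
    n <- 100; m <- 37                                   # hypothetical counts
    loglik.coin <- function(p) m * log(p) + (n - m) * log(1 - p)
    p.hat <- optimize(loglik.coin, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
    c(numeric = p.hat, analytic = m / n)                # both approximately 0.37

    # Example 2: normal sample -- compare optim() with the closed-form MLEs
    set.seed(1)
    x <- rnorm(200, mean = 5, sd = 2)                   # hypothetical data
    negloglik <- function(theta) {                      # theta = (mu, log sigma^2)
      -sum(dnorm(x, mean = theta[1], sd = sqrt(exp(theta[2])), log = TRUE))
    }
    fit <- optim(c(0, 0), negloglik)
    rbind(numeric  = c(fit$par[1], exp(fit$par[2])),
          analytic = c(mean(x), mean((x - mean(x))^2)))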
Deviance
Background
In testing the fit of a generalized linear model $\hat{M}$ of some data (with response variable $Y$ and explanatory variable(s) $\boldsymbol{X}$), one way is to compare $\hat{M}$ with a similar model $M$. By similarity we mean: given $\hat{M}$ with the response variable $Y$ and link function $g$ such that $g(\mathrm{E}[Y]) = \boldsymbol{X}\hat{\boldsymbol{\beta}}$, the model $M$
- is a generalized linear model of the same data,
- has the response variable $Y$ distributed in the same family as found in $\hat{M}$,
- has the same link function $g$ as found in $\hat{M}$, such that $g(\mathrm{E}[Y]) = \boldsymbol{X}\boldsymbol{\beta}$.
Notice that the only possible difference is found in the parameters $\boldsymbol{\beta}$.
It is desirable for this $M$ to serve as a base model when more than one model is being assessed. Two possible candidates for $M$ are the null model and the saturated model. The null model $M_0$ is one in which only one parameter $\mu$ is used, so that $g(\mathrm{E}[Y]) = \mu$: all responses have the same predicted outcome. The saturated model $M_S$ is the other extreme, where the maximum number of parameters is used so that the observed response values equal the predicted response values exactly, $\hat{y}_i = y_i$.
Definition
The deviance of a (generalized linear) model $\hat{M}$ is given by
$$D = 2\big[\ell(\hat{\boldsymbol{\beta}}_S\mid\boldsymbol{y}) - \ell(\hat{\boldsymbol{\beta}}\mid\boldsymbol{y})\big],$$
where $\ell$ is the log-likelihood function, $\hat{\boldsymbol{\beta}}$ is the MLE of the parameter vector from $\hat{M}$, and $\hat{\boldsymbol{\beta}}_S$ is the MLE of the parameter vector from the saturated model $M_S$.
Example
For a normal or general linear model, where the link function is the identity:
$$\mathrm{E}[Y_i] = \mu_i = \boldsymbol{x}_i^{\mathsf{T}}\boldsymbol{\beta},$$
where the $Y_i$'s are mutually independent and normally distributed as $N(\mu_i,\sigma^2)$. The log-likelihood function is given by
$$\ell(\boldsymbol{\mu}\mid\boldsymbol{y}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2 - \frac{n}{2}\ln(2\pi\sigma^2),$$
where $\mu_i$ is the predicted response value and $n$ is the number of observations.
For the model in question, suppose $\hat{\mu}_i$ is the predicted mean calculated from the maximum likelihood estimate $\hat{\boldsymbol{\beta}}$ of the parameter vector $\boldsymbol{\beta}$. So,
$$\ell(\hat{\boldsymbol{\beta}}\mid\boldsymbol{y}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\hat{\mu}_i)^2 - \frac{n}{2}\ln(2\pi\sigma^2).$$
For the saturated model $M_S$, the predicted value $\hat{y}_i$ equals the observed response value $y_i$. Therefore,
$$\ell(\hat{\boldsymbol{\beta}}_S\mid\boldsymbol{y}) = -\frac{n}{2}\ln(2\pi\sigma^2).$$
So the deviance is
$$D = 2\big[\ell(\hat{\boldsymbol{\beta}}_S\mid\boldsymbol{y}) - \ell(\hat{\boldsymbol{\beta}}\mid\boldsymbol{y})\big] = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\hat{\mu}_i)^2,$$
which, up to the dispersion factor $1/\sigma^2$ (exactly, when $\sigma^2 = 1$), is the residual sum of squares, or RSS, used in regression models.
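As a quick check of the RSS identity, the R sketch below fits a Gaussian GLM (identity link, dispersion treated as 1 in deviance()) to invented data and compares the reported deviance with the residual sum of squares; the data and variable names are purely illustrative.

    set.seed(2)
    x <- runif(50)                         # hypothetical predictor
    y <- 1 + 2 * x + rnorm(50, sd = 0.5)   # hypothetical response
    fit <- glm(y ~ x, family = gaussian)   # normal model with identity link
    c(deviance = deviance(fit),
      rss      = sum(residuals(fit, type = "response")^2))  # identical values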
Remarks
- The deviance is necessarily non-negative.
- The distribution of the deviance is asymptotically a chi-square distribution with $n - p$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of parameters in the model $\hat{M}$.
- If two generalized linear models $\hat{M}_1$ and $\hat{M}_2$ are nested, say $\hat{M}_1$ is nested within $\hat{M}_2$, we can perform the hypothesis test $H_0$: the model for the data is $\hat{M}_1$ with $p_1$ parameters, against $H_1$: the model for the data is the more general $\hat{M}_2$ with $p_2$ parameters, where $p_1 < p_2$. The deviance difference
$$\Delta(\mathrm{dev}) = D(\hat{M}_1) - D(\hat{M}_2)$$
can be used as a test statistic; it is approximately chi-square distributed with $p_2 - p_1$ degrees of freedom (see the R illustration after this list).
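A minimal R sketch of the nested-model test, assuming a Poisson GLM on invented data; anova() with test = "Chisq" carries out exactly the deviance-difference test described above.

    set.seed(3)
    x <- rnorm(100)
    y <- rpois(100, lambda = exp(0.5 + 0.8 * x))  # hypothetical Poisson responses
    m1 <- glm(y ~ 1, family = poisson)            # nested (intercept-only) model
    m2 <- glm(y ~ x, family = poisson)            # more general model
    dev.diff <- deviance(m1) - deviance(m2)       # Delta(dev)
    pchisq(dev.diff, df = 1, lower.tail = FALSE)  # approximate p-value
    anova(m1, m2, test = "Chisq")                 # the same test, done by R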
Sums of squares (SS)
Four Types of Sums of Squares for ANOVA Effects (Copyright 2006, Karl L. Wuensch)
- Type I SS are order-dependent (hierarchical, sequential); each effect is adjusted for all other effects that appear earlier (to the left) in the model, but not for any effects that appear later in the model. Type I SS are computed as the decrease in the Error SS (increase in the Model SS) when the effect is added to a model. The sum of all of the effects SS will equal the total Model SS for Type I SS -- this is not generally true for the other types of SS (which exclude some or all of the variance that cannot be unambiguously allocated to one and only one effect). Type I SS are appropriate for balanced (orthogonal, equal n) analyses of variance in which the effects are specified in proper order (main effects, then two-way interactions, then three-way interactions, etc.) and for trend analysis where the powers for the quantitative factor are ordered from lowest to highest in the model statement. (The order dependence is illustrated in the R sketch after this list.)
- Type II SS are the reduction in the SSE due to adding the effect to a model that contains all other effects except effects that contain the effect being tested. An effect is contained in another effect if it can be derived by deleting terms in that effect.
- Type III SS are each adjusted for all other effects in the model, regardless of order. Strongly recommended in nonorthogonal ANOVA. Type IV SS are identical to Type III SS for designs with no missing cells.
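To see the order dependence of Type I SS concretely, the R sketch below (not Wuensch's code) runs base R's sequential anova() on a deliberately unbalanced, made-up two-factor design, fitting the same model with the factors in both orders; the per-factor SS change while the total Model SS does not.

    set.seed(4)
    # Unbalanced (non-orthogonal) two-factor design, invented for illustration
    a <- factor(rep(c("a1", "a2"), times = c(12, 8)))
    b <- factor(sample(c("b1", "b2"), 20, replace = TRUE))
    y <- 1 + (a == "a2") + 0.5 * (b == "b2") + rnorm(20)

    anova(lm(y ~ a + b))  # Type I SS: a unadjusted, b adjusted for a
    anova(lm(y ~ b + a))  # Type I SS: b unadjusted, a adjusted for b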
Conjugate distributions
In Bayesian probability theory, a class of prior probability distributions $p(\theta)$ is said to be conjugate to a class of likelihood functions $p(x\mid\theta)$ if the resulting posterior distributions $p(\theta\mid x)$ are in the same family as $p(\theta)$ (see the table below).
\input{conjugates_table}
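As a concrete instance of conjugacy (one of the standard prior-likelihood pairs), a Beta prior is conjugate to a binomial likelihood: observing $m$ successes in $n$ trials turns $\mathrm{Beta}(a,b)$ into $\mathrm{Beta}(a+m,\,b+n-m)$. The R sketch below, with invented numbers, evaluates prior times likelihood on a grid and confirms it matches the closed-form conjugate posterior.

    a <- 2; b <- 3; n <- 20; m <- 13                   # hypothetical prior and data
    theta <- seq(0.001, 0.999, length.out = 999)
    post <- dbeta(theta, a, b) * dbinom(m, n, theta)   # prior x likelihood
    post <- post / (sum(post) * diff(theta)[1])        # normalize on the grid
    conj <- dbeta(theta, a + m, b + n - m)             # conjugate closed form
    max(abs(post - conj))                              # small (grid error only)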
Wishart and inverted-Wishart distributions
Let $X_1,\ldots,X_n$ be independent $p$-dimensional random variables which are multivariate normally distributed, $X_i \sim N_p(\mu_i,\Sigma)$. Let $M$ be the $n\times p$ matrix whose rows are the mean vectors $\mu_i$, and let $X$ be the $n\times p$ matrix with the $X_i$ as rows.
Then the joint distribution of the elements of $S = X^{\mathsf{T}}X$ is said to be a Wishart distribution on $n$ degrees of freedom, and is denoted by $W_p(\Sigma, n, M)$. If $M = 0$, the distribution is central and is denoted by $W_p(\Sigma, n)$.
$S \sim W_p(\Sigma, n)$ has a density function when $n \ge p$:
$$f(S) = \frac{|S|^{(n-p-1)/2}\exp\big(-\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}S)\big)}{2^{np/2}\,|\Sigma|^{n/2}\,\Gamma_p(n/2)},$$
where $\Gamma_p(\cdot)$ is the multivariate gamma function.
The Wishart distribution is a multivariate generalization of the $\chi^2$ distribution.
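R's stats package provides rWishart() for the central case. The sketch below draws from $W_2(\Sigma, 10)$ for an arbitrary $\Sigma$ and checks the textbook mean $\mathrm{E}[S] = n\Sigma$ empirically; the particular $\Sigma$ and degrees of freedom are made up.

    set.seed(5)
    Sigma <- matrix(c(2, 0.5,
                      0.5, 1), 2, 2)              # hypothetical 2x2 covariance
    S <- rWishart(5000, df = 10, Sigma = Sigma)   # 2 x 2 x 5000 array of draws
    apply(S, c(1, 2), mean)                       # empirical mean of the draws
    10 * Sigma                                    # theoretical mean E[S] = n * Sigma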
Inverse-Wishart
The inverse-Wishart distribution is used as the conjugate prior for the covariance matrix of a multivariate normal distribution. A random $p\times p$ matrix $B$ has an inverse-Wishart distribution, written $B \sim W_p^{-1}(\Psi, m)$, if its probability density function is written as follows:
$$f(B) = \frac{|\Psi|^{m/2}\,|B|^{-(m+p+1)/2}\exp\big(-\tfrac{1}{2}\operatorname{tr}(\Psi B^{-1})\big)}{2^{mp/2}\,\Gamma_p(m/2)},$$
where $\Psi$ is a $p\times p$ matrix. The matrix $\Psi$ is supposed to be positive definite.
If $A \sim W_p(\Sigma, m)$ and $\Sigma$ is $p\times p$, then $B = A^{-1}$ has an inverse-Wishart distribution $W_p^{-1}(\Psi, m)$ with $\Psi = \Sigma^{-1}$.
Gaussian Conjugate
Let $\Sigma$ be a covariance matrix whose prior $p(\Sigma)$ has a $W_p^{-1}(\Psi, m)$ distribution. If the observations $X = [x_1,\ldots,x_n]$ are independent $p$-variate Gaussian variables drawn from a $N_p(0,\Sigma)$ distribution, then the conditional distribution $p(\Sigma\mid X)$ has a $W_p^{-1}(A+\Psi,\, n+m)$ distribution, where $A = XX^{\mathsf{T}}$ is $n$ times the sample covariance matrix.
Because the prior and posterior distributions are from the same family, the inverse Wishart distribution is conjugate to the multivariate Gaussian.
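A minimal sketch of this update in R, using only base functions and invented numbers: a draw from an inverse-Wishart $W_p^{-1}(\Psi, m)$ can be obtained by inverting a Wishart draw with scale $\Psi^{-1}$, so the posterior $W_p^{-1}(A+\Psi,\, n+m)$ can be sampled as below.

    set.seed(6)
    p <- 2; m <- 7
    Psi <- diag(p)                                  # hypothetical prior scale matrix
    Sigma.true <- matrix(c(1, 0.6, 0.6, 2), 2, 2)   # "unknown" covariance
    X <- t(chol(Sigma.true)) %*% matrix(rnorm(p * 50), p, 50)  # p x n zero-mean data
    n <- ncol(X)
    A <- X %*% t(X)                                 # n times the sample covariance

    # One posterior draw: Sigma | X ~ inverse-Wishart(A + Psi, n + m)
    W <- rWishart(1, df = n + m, Sigma = solve(A + Psi))[, , 1]
    Sigma.draw <- solve(W)

    # Posterior mean of inverse-Wishart(A + Psi, n + m) is (A + Psi) / (n + m - p - 1)
    (A + Psi) / (n + m - p - 1)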
Gamma distributions in R and BUGS
Gamma
BUGS: x ~ dgamma(r, mu) has density
$$f(x) = \frac{\mu^{r} x^{r-1} e^{-\mu x}}{\Gamma(r)}, \qquad x > 0,$$
with mean $r/\mu$ and variance $r/\mu^{2}$.
R: dgamma(x, shape, rate, scale = 1/rate) has density
$$f(x) = \frac{1}{\Gamma(\mathrm{shape})\,\mathrm{scale}^{\mathrm{shape}}}\, x^{\mathrm{shape}-1} e^{-x/\mathrm{scale}}, \qquad x > 0,$$
with mean $\mathrm{shape}\times\mathrm{scale}$ and variance $\mathrm{shape}\times\mathrm{scale}^{2}$.
Correspondence:
$$r = \mathrm{shape}, \qquad \mu = \mathrm{rate} = 1/\mathrm{scale}.$$
If the shape parameter $k$ is an integer, then the distribution represents the sum of $k$ independent exponentially distributed random variables, each of which has a mean of $\theta$ (the scale), which is equivalent to a rate parameter of $1/\theta$.
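Both the parameter correspondence and the sum-of-exponentials property can be checked in R; the shape and rate values below are arbitrary.

    set.seed(7)
    r <- 3; mu <- 2; x <- 1.7            # BUGS-style shape r and rate mu (arbitrary)
    # dgamma with rate = mu reproduces the BUGS density mu^r x^(r-1) e^(-mu x) / Gamma(r)
    c(dgamma(x, shape = r, rate = mu),
      mu^r * x^(r - 1) * exp(-mu * x) / gamma(r))

    # Integer shape k: Gamma(k, scale = theta) is a sum of k exponentials with mean theta
    k <- 4; theta <- 0.5
    sums <- rowSums(matrix(rexp(k * 10000, rate = 1 / theta), ncol = k))
    c(mean(sums), k * theta)             # empirical vs theoretical mean
    ks.test(sums, pgamma, shape = k, scale = theta)  # KS comparison with Gamma(k, theta)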
Inverse gamma
Definition
$$f(x;\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha-1}\exp(-\beta/x), \qquad x > 0,$$
with shape parameter $\alpha > 0$ and scale parameter $\beta > 0$.
Derivation from the Gamma distribution
If $X \sim \mathrm{Gamma}(k,\theta)$ with shape $k$ and scale $\theta$, then $Y = 1/X$ follows an inverse-gamma distribution with
$$\alpha = k, \qquad \beta = 1/\theta$$
(equivalently, if $X \sim \mathrm{Gamma}(\alpha,\ \mathrm{rate}=\beta)$ then $1/X \sim \mathrm{Inv\text{-}Gamma}(\alpha,\beta)$).
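A quick R check of this derivation, comparing the inverse-gamma density from the definition above against the change-of-variables density of $Y = 1/X$; the values of $\alpha$, $\beta$, and the evaluation point are arbitrary.

    alpha <- 3; beta <- 2; y <- 0.8     # arbitrary shape, scale, evaluation point
    # Inverse-gamma density from the definition above
    dinvgamma <- function(y, a, b) b^a / gamma(a) * y^(-a - 1) * exp(-b / y)
    # Density of Y = 1/X when X ~ Gamma(shape = alpha, rate = beta):
    # f_Y(y) = f_X(1/y) / y^2
    c(dinvgamma(y, alpha, beta),
      dgamma(1 / y, shape = alpha, rate = beta) / y^2)  # the two values agree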