Memento
This memento is a compilation of mathematical and statistical topics taken from various online sources, including PlanetMath and Wikipedia.
Likelihood
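Let $\mathbf{X}=(X_1,\ldots,X_n)$ be a random vector and
\[ \{ f_{\mathbf{X}}(\boldsymbol{x}\mid\boldsymbol{\theta}) : \boldsymbol{\theta}\in\Theta \} \]
a statistical model parametrized by $\boldsymbol{\theta}=(\theta_1,\ldots,\theta_k)$, the parameter vector in the parameter space $\Theta$. The likelihood function is a map $L:\Theta\to\mathbb{R}$ given by
\[ L(\boldsymbol{\theta}\mid\boldsymbol{x}) = f_{\mathbf{X}}(\boldsymbol{x}\mid\boldsymbol{\theta}). \]
In other words, the likelihood function has the same functional form as a probability density function. However, the emphasis is changed from the $\boldsymbol{x}$ to the $\boldsymbol{\theta}$: the pdf is a function of the $x_i$'s while holding the parameters $\theta_j$'s constant, whereas $L$ is a function of the parameters $\theta_j$'s while holding the $x_i$'s constant.
When there is no confusion, $L(\boldsymbol{\theta}\mid\boldsymbol{x})$ is abbreviated to $L(\boldsymbol{\theta})$.
The parameter vector $\hat{\boldsymbol{\theta}}$ such that $L(\hat{\boldsymbol{\theta}})\geq L(\boldsymbol{\theta})$ for all $\boldsymbol{\theta}\in\Theta$ is called a maximum likelihood estimate, or MLE, of $\boldsymbol{\theta}$.
Many density functions are exponential in nature, so it is often easier to compute the MLE of a likelihood
function $L$ by finding the maximum of the natural log of $L$, known as the log-likelihood function:
\[ \ell(\boldsymbol{\theta}\mid\boldsymbol{x}) = \ln L(\boldsymbol{\theta}\mid\boldsymbol{x}). \]
Because the log function is monotonically increasing, $L$ and $\ell$ attain their maxima at the same $\hat{\boldsymbol{\theta}}$.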
Examples:

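A coin is tossed $n$ times and $m$ heads are observed. Assume that the probability of a head after one toss is $\pi$. What is the MLE of $\pi$?
Solution: Define the outcome of a toss to be 0 if a tail is observed and 1 if a head is observed. Next, let $X_i$ be the outcome of the $i$th toss. For any single toss, the density function is $\pi^{x}(1-\pi)^{1-x}$, where $x\in\{0,1\}$. Assuming that the tosses are independent events, the joint probability density is
\[ f_{\mathbf{X}}(\boldsymbol{x}\mid\pi) = \prod_{i=1}^{n}\pi^{x_i}(1-\pi)^{1-x_i} = \pi^{m}(1-\pi)^{n-m}, \]
which is also the likelihood function $L(\pi)$. Therefore, the log-likelihood function has the form
\[ \ell(\pi) = m\ln\pi + (n-m)\ln(1-\pi). \]
Using standard calculus (setting $d\ell/d\pi = m/\pi - (n-m)/(1-\pi) = 0$), we get that the MLE of $\pi$ is
\[ \hat{\pi} = \frac{m}{n}. \]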
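As a quick numerical check of the coin-toss result, the sketch below evaluates the log-likelihood $m\ln\pi + (n-m)\ln(1-\pi)$ on a grid and compares its maximizer with the closed form $m/n$. The function names, the grid size, and the data (7 heads in 20 tosses) are illustrative assumptions, not part of the original example.

```python
import math

def coin_loglik(pi, n, m):
    """Log-likelihood of m heads in n independent tosses with head probability pi."""
    return m * math.log(pi) + (n - m) * math.log(1.0 - pi)

def grid_mle(n, m, steps=10000):
    """Crude grid search over the open interval (0, 1); for illustration only."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda p: coin_loglik(p, n, m))

n, m = 20, 7                  # hypothetical data: 7 heads in 20 tosses
pi_hat_closed = m / n         # closed-form MLE
pi_hat_grid = grid_mle(n, m)  # numerical maximizer of the log-likelihood
print(pi_hat_closed, pi_hat_grid)
```

The grid search is deliberately naive; it only confirms that the maximizer of the log-likelihood agrees with the calculus result up to the grid resolution.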

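Suppose a sample of $n$ data points $X_i$ is collected. Assume that $X_i\sim N(\mu,\sigma^2)$ and that the $X_i$'s are independent of each other. What is the MLE of the parameter vector $\boldsymbol{\theta}=(\mu,\sigma^2)$?
Solution: The joint pdf of the $X_i$, and hence the likelihood function, is
\[ L(\boldsymbol{\theta}\mid\boldsymbol{x}) = \frac{1}{\sigma^{n}(2\pi)^{n/2}}\exp\left(-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}\right). \]
The log-likelihood function is
\[ \ell(\boldsymbol{\theta}\mid\boldsymbol{x}) = -\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2} - \frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln(2\pi). \]
Taking the first derivative (gradient), we get
\[ \frac{\partial\ell}{\partial\boldsymbol{\theta}} = \left(\frac{\sum_{i=1}^{n}(x_i-\mu)}{\sigma^2},\ \frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^4} - \frac{n}{2\sigma^2}\right). \]
Setting
\[ \frac{\partial\ell}{\partial\boldsymbol{\theta}} = \boldsymbol{0} \]
and solving for $\boldsymbol{\theta}=(\mu,\sigma^2)$, we have
\[ \hat{\boldsymbol{\theta}} = (\hat{\mu},\hat{\sigma}^2) = \left(\bar{x},\ \frac{n-1}{n}s^2\right), \]
where $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$ is the sample mean and $s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$ is the sample variance. Finally, we verify that $\hat{\boldsymbol{\theta}}$ is indeed the MLE of $\boldsymbol{\theta}$ by checking the negativity of the 2nd derivatives (for each parameter).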
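The relation between the MLE of the variance and the usual sample variance can be checked numerically. The sketch below assumes i.i.d. normal data; the helper name `normal_mle`, the seed, and the simulated sample are hypothetical choices made for illustration.

```python
import random
import statistics

def normal_mle(xs):
    """Return (mu_hat, sigma2_hat), the MLE for i.i.d. normal data."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divides by n, not n - 1
    return mu_hat, sigma2_hat

random.seed(0)  # fixed seed so the sketch is reproducible
xs = [random.gauss(5.0, 2.0) for _ in range(200)]  # hypothetical sample

mu_hat, sigma2_hat = normal_mle(xs)
n = len(xs)
s2 = statistics.variance(xs)  # sample variance, divides by n - 1
print(mu_hat, sigma2_hat, (n - 1) / n * s2)
```

The last two printed values should agree up to floating-point error, which is the identity $\hat{\sigma}^2 = \frac{n-1}{n}s^2$.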
Deviance
Background
In testing the fit of a generalized linear model $P$ of some data (with response variable $Y$ and explanatory variable(s) $X$), one way is to compare $P$ with a similar model $P_0$. By similarity we mean: given $P$ with the response variable $Y_i\sim f_{Y_i}$ and link function $g$ such that $g(E[Y_i]) = X_i^{T}\boldsymbol{\beta}$, the model $P_0$
 is a generalized linear model of the same data,
 has the response variable $Y_i$ distributed as $f_{Y_i}$, the same as in $P$,
 has the same link function $g$ as in $P$, such that $g(E[Y_i]) = X_i^{T}\boldsymbol{\beta}_0$.
Notice that the only possible difference is found in the parameters $\boldsymbol{\beta}$.
It is desirable for this $P_0$ to serve as a base model when more than one model is being assessed. Two possible candidates for $P_0$ are the null model and the saturated model. The null model $P_{null}$ is one in which only one parameter $\mu$ is used, so that $g(E[Y_i]) = \mu$ and all responses have the same predicted outcome. The saturated model $P_{max}$ is the other extreme, where the maximum number of parameters is used so that the predicted response values equal the observed response values exactly: $\hat{\mu}_i = y_i$ for every $i$.
Definition
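The deviance of a model $P$ (generalized linear model) is given by
\[ \operatorname{dev}(P) = 2\left[\ell(\hat{\boldsymbol{\beta}}_{max}\mid\boldsymbol{y}) - \ell(\hat{\boldsymbol{\beta}}\mid\boldsymbol{y})\right], \]
where $\ell$ is the log-likelihood function, $\hat{\boldsymbol{\beta}}$ is the MLE of the parameter vector $\boldsymbol{\beta}$ from $P$ and $\hat{\boldsymbol{\beta}}_{max}$ is the MLE of the parameter vector $\boldsymbol{\beta}_{max}$ from the saturated model $P_{max}$.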
Example
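For a normal or general linear model, where the link function is the identity:
\[ E[Y_i] = X_i^{T}\boldsymbol{\beta}, \]
where the $Y_i$'s are mutually independent and normally distributed as $N(\mu_i,\sigma^2)$. The log-likelihood function is given by
\[ \ell(\boldsymbol{\beta}\mid\boldsymbol{y}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2 - \frac{n}{2}\ln(2\pi\sigma^2), \]
where $\mu_i = X_i^{T}\boldsymbol{\beta}$ is the predicted response value, and $n$ is the number of observations.
For the model $P$ in question, suppose $\hat{\mu}_i = X_i^{T}\hat{\boldsymbol{\beta}}$ is the expected mean calculated from the maximum likelihood estimate $\hat{\boldsymbol{\beta}}$ of the parameter vector $\boldsymbol{\beta}$. So,
\[ \ell(\hat{\boldsymbol{\beta}}\mid\boldsymbol{y}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\hat{\mu}_i)^2 - \frac{n}{2}\ln(2\pi\sigma^2). \]
For the saturated model $P_{max}$, the predicted value $(\hat{\mu}_{max})_i$ equals the observed response value $y_i$. Therefore,
\[ \ell(\hat{\boldsymbol{\beta}}_{max}\mid\boldsymbol{y}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i-(\hat{\mu}_{max})_i\right)^2 - \frac{n}{2}\ln(2\pi\sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2). \]
So the deviance is
\[ \operatorname{dev}(P) = 2\left[\ell(\hat{\boldsymbol{\beta}}_{max}\mid\boldsymbol{y}) - \ell(\hat{\boldsymbol{\beta}}\mid\boldsymbol{y})\right] = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\hat{\mu}_i)^2, \]
which is the residual sum of squares, or RSS, used in regression models, scaled by $1/\sigma^2$.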
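The normal-model computation can be reproduced numerically: twice the gap between the saturated and fitted log-likelihoods should match the residual sum of squares divided by $\sigma^2$. In the sketch below, the responses, the fitted values and the known variance are hypothetical numbers chosen for illustration, and `normal_loglik` is an assumed helper name.

```python
import math

def normal_loglik(y, mu, sigma2):
    """Log-likelihood of independent y_i ~ N(mu_i, sigma2)."""
    n = len(y)
    rss = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return -rss / (2.0 * sigma2) - 0.5 * n * math.log(2.0 * math.pi * sigma2)

y = [1.0, 2.1, 2.9, 4.2, 5.1]        # hypothetical observed responses
mu_hat = [1.1, 2.0, 3.0, 4.0, 5.0]   # hypothetical fitted values from some model P
sigma2 = 0.5                         # variance treated as known

# The saturated model reproduces each observation, so its fitted values are y itself.
deviance = 2.0 * (normal_loglik(y, y, sigma2) - normal_loglik(y, mu_hat, sigma2))
scaled_rss = sum((yi - mi) ** 2 for yi, mi in zip(y, mu_hat)) / sigma2
print(deviance, scaled_rss)
```

The two printed numbers agree up to rounding, since the $\ln(2\pi\sigma^2)$ terms cancel in the difference.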
Remarks
 The deviance is necessarily nonnegative.
 The distribution of the deviance is asymptotically a chi-square distribution with $n-p$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of parameters in the model $P$.
 If two generalized linear models $P_1$ and $P_2$ are nested, say $P_1$ is nested within $P_2$, we can test the hypothesis $H_0$: the model for the data is $P_1$ with $p_1$ parameters, against $H_1$: the model for the data is the more general $P_2$ with $p_2$ parameters, where $p_1<p_2$. The deviance difference $\Delta(\operatorname{dev}) = \operatorname{dev}(P_1)-\operatorname{dev}(P_2)$ can be used as a test statistic, and it approximately follows a chi-square distribution with $p_2-p_1$ degrees of freedom.
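The nested-model test in the last remark can be sketched for the simplest case: a null model (one parameter) nested within a straight-line model (two parameters), both normal with known variance, so the deviance is the scaled RSS. The data, the known variance and the helper names are assumptions for illustration. With one degree of freedom, the chi-square tail probability is `erfc(sqrt(x / 2))`, which avoids any dependency beyond the standard library.

```python
import math

def fit_mean(y):
    """Null model: a single parameter, the common mean."""
    m = sum(y) / len(y)
    return [m] * len(y)

def fit_line(x, y):
    """Simple linear regression by least squares (intercept and slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return [intercept + slope * xi for xi in x]

def scaled_rss(y, mu, sigma2):
    """Deviance of a normal model with known variance sigma2."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / sigma2

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]   # hypothetical data
sigma2 = 0.25                        # assumed known

dev1 = scaled_rss(y, fit_mean(y), sigma2)      # P1 with p1 = 1 parameter
dev2 = scaled_rss(y, fit_line(x, y), sigma2)   # P2 with p2 = 2 parameters
delta = dev1 - dev2                            # nonnegative for nested models
p_value = math.erfc(math.sqrt(delta / 2.0))    # chi-square tail, 1 degree of freedom
print(delta, p_value)
```

A small p-value is evidence against $H_0$, that is, against the smaller model. For more than one degree of freedom a general chi-square tail function would be needed in place of the `erfc` shortcut.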
Sums of squares (SS)
Four Types of Sums of Squares for ANOVA Effects (Copyright 2006, Karl L. Wuensch)
 Type I SS are order-dependent (hierarchical, sequential); each effect is adjusted for all other effects that appear earlier (to the left) in the model, but not for any effects that appear later in the model. Type I SS are computed as the decrease in the Error SS (increase in the Model SS) when the effect is added to a model. The sum of all of the effects SS will equal the total Model SS for Type I SS; this is not generally true for the other types of SS (which exclude some or all of the variance that cannot be unambiguously allocated to one and only one effect). Type I SS are appropriate for balanced (orthogonal, equal n) analyses of variance in which the effects are specified in proper order (main effects, then two-way interactions, then three-way interactions, etc.) and for trend analysis where the powers for the quantitative factor are ordered from lowest to highest in the model statement.
 Type II SS are the reduction in the SSE due to adding the effect to a model that contains all other effects except effects that contain the effect being tested. An effect is contained in another effect if it can be derived by deleting terms in that effect.
 Type III SS are each adjusted for all other effects in the model, regardless of order. Strongly recommended in nonorthogonal ANOVA.
 Type IV SS are identical to Type III SS for designs with no missing cells.
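The order dependence of Type I SS can be illustrated with a small unbalanced example. The sketch below computes sequential SS as successive drops in the error SS when effects enter a least-squares model, once in the order A, B and once in the order B, A. The 0/1 coding, the data and all function names are hypothetical; the tiny Gaussian-elimination solver stands in for a proper linear algebra library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def sse(columns, y):
    """Error SS after least-squares regression of y on the given columns."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations
    fitted = [sum(beta[j] * columns[j][i] for j in range(k)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def type1_ss(order, effects, y):
    """Sequential (Type I) SS: drop in error SS as each effect enters in the given order."""
    cols = [[1.0] * len(y)]  # start from the intercept-only model
    prev = sse(cols, y)
    out = {}
    for name in order:
        cols.append(effects[name])
        cur = sse(cols, y)
        out[name] = prev - cur
        prev = cur
    return out

# Hypothetical unbalanced layout: the two indicators are correlated on purpose.
effects = {"A": [0, 0, 0, 1, 1, 1, 1, 1], "B": [0, 0, 1, 0, 1, 1, 1, 1]}
y = [3.1, 2.9, 4.2, 4.8, 6.1, 5.9, 6.3, 6.0]

ab = type1_ss(["A", "B"], effects, y)
ba = type1_ss(["B", "A"], effects, y)
print(ab, ba)
```

In both orders the effect SS add up to the same total Model SS, but the share credited to A (and to B) changes with the order, which is why Type I SS are only unambiguous in balanced designs.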
Conjugate distributions
In Bayesian probability theory, a class of prior probability distributions $p(\theta)$ is said to be conjugate to a class of likelihood functions $p(x\mid\theta)$ if the resulting posterior distributions $p(\theta\mid x)$ are in the same family as $p(\theta)$ (see the table below).
\input{conjugates_table}
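Wishart and inverse-Wishart distributions
Let $X_1,\ldots,X_p$ be independent $n$-dimensional random variables, which are multivariate normally distributed. Let $X_i\sim N_n(\mu_i,\Sigma)$. Let $X$ be the $p\times n$ matrix with the $X_i^{T}$ as rows, and let $M=E[X]$ be the matrix of means.
Then the joint distribution of the elements of $W=X^{T}X$ is said to be a Wishart distribution on $p$ degrees of freedom, and is denoted by $W_n(p,\Sigma,M)$. If $M=0$, the distribution is central and is denoted by $W_n(p,\Sigma)$.
In the central case, $W$ has a density function when $p\geq n$:
\[ f(W) = \frac{|W|^{(p-n-1)/2}\exp\left(-\frac{1}{2}\operatorname{tr}(\Sigma^{-1}W)\right)}{2^{pn/2}\,|\Sigma|^{p/2}\,\Gamma_n(p/2)}, \]
where
\[ \Gamma_n(p/2) = \pi^{n(n-1)/4}\prod_{j=1}^{n}\Gamma\left(\frac{p+1-j}{2}\right) \]
is the multivariate gamma function.
The Wishart distribution is a multivariate generalization of the $\chi^2$ distribution.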
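The link to the chi-square distribution is easiest to see in one dimension: with scalar zero-mean normal variables of variance $\sigma^2$, the sum of $p$ squares is $\sigma^2$ times a chi-square variable on $p$ degrees of freedom, with mean $p\sigma^2$. The simulation below is a minimal sketch of that special case only; the seed, sample size, parameter values and function name are illustrative assumptions.

```python
import random

def wishart_1d_draw(p, sigma2, rng):
    """One draw of W = sum of p squared N(0, sigma2) variables (the one-dimensional case)."""
    sd = sigma2 ** 0.5
    return sum(rng.gauss(0.0, sd) ** 2 for _ in range(p))

rng = random.Random(1)  # fixed seed for reproducibility
p, sigma2 = 4, 2.0      # hypothetical degrees of freedom and variance
draws = [wishart_1d_draw(p, sigma2, rng) for _ in range(20000)]
mean_w = sum(draws) / len(draws)
print(mean_w, p * sigma2)  # empirical mean versus the theoretical mean p * sigma2
```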
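Inverse-Wishart
The inverse-Wishart distribution is used as the conjugate prior for the covariance matrix of a multivariate normal distribution. In this section $p$ denotes the dimension and $m$ the degrees of freedom.
$B\sim W^{-1}(\Psi,m)$ if its probability density function is written as follows:
\[ p(B\mid\Psi,m) = \frac{|\Psi|^{m/2}\,|B|^{-(m+p+1)/2}\exp\left(-\frac{1}{2}\operatorname{tr}(\Psi B^{-1})\right)}{2^{mp/2}\,\Gamma_p(m/2)}, \]
where $B$ is a $p\times p$ matrix. The matrix $\Psi$ is supposed to be positive definite.
If $A\sim W(\Sigma,m)$ and $\Sigma$ is $p\times p$, then $B=A^{-1}$ has an inverse-Wishart distribution $B\sim W^{-1}(\Sigma^{-1},m)$, that is, with $\Psi=\Sigma^{-1}$.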
Gaussian Conjugate
Let $\Sigma$ be a covariance matrix whose prior $p(\Sigma)$ has a $W^{-1}(\Psi,m)$ distribution. If the observations $X=[x_1,\ldots,x_n]$ are independent $p$-variate Gaussian variables drawn from a $N(0,\Sigma)$ distribution, then the conditional distribution $p(\Sigma\mid X)$ has a $W^{-1}(A+\Psi,\,n+m)$ distribution, where $A=XX^{T}$ is $n$ times the sample covariance matrix.
Because the prior and posterior distributions are from the same family, the inverse-Wishart distribution is conjugate to the multivariate Gaussian.
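The conjugate update amounts to simple bookkeeping: for zero-mean Gaussian observations, the scale matrix gains the sum of outer products of the observations and the degrees of freedom gain the number of observations. The sketch below assumes an inverse-Wishart prior with scale matrix `psi` and `m` degrees of freedom; the prior values, the three observations and the function names are hypothetical.

```python
def outer_sum(xs):
    """A = sum_i x_i x_i^T for a list of p-dimensional observations."""
    p = len(xs[0])
    return [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]

def inverse_wishart_update(psi, m, xs):
    """Posterior parameters (psi + A, m + n) for zero-mean Gaussian data."""
    a = outer_sum(xs)
    p = len(psi)
    psi_post = [[psi[i][j] + a[i][j] for j in range(p)] for i in range(p)]
    return psi_post, m + len(xs)

psi = [[1.0, 0.0], [0.0, 1.0]]                # hypothetical prior scale matrix
m = 4                                         # hypothetical prior degrees of freedom
xs = [[1.0, 0.5], [-0.5, 1.5], [0.2, -1.0]]   # hypothetical zero-mean observations

psi_post, m_post = inverse_wishart_update(psi, m, xs)
print(psi_post, m_post)
```

Only the parameters are updated here; evaluating or sampling from the posterior density would need determinants and matrix inverses, which this sketch leaves out.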
Gamma distributions in R and BUGS
Gamma
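BUGS: \texttt{x \textasciitilde{} dgamma(r, mu)}, with shape $r$ and rate $\mu$, has density
\[ f(x) = \frac{\mu^{r}x^{r-1}e^{-\mu x}}{\Gamma(r)},\quad x>0. \]
R: \texttt{dgamma(x, shape, rate)} (equivalently \texttt{scale = 1/rate}), with shape $a$ and scale $s$, has density
\[ f(x) = \frac{x^{a-1}e^{-x/s}}{s^{a}\,\Gamma(a)},\quad x>0. \]
Correspondence: $r=a=k$ and $\mu=1/s=1/\theta$, where $k$ denotes the shape and $\theta$ the scale.
If $k$ is an integer then the distribution represents the sum of $k$ independent exponentially distributed random variables, each of which has a mean of $\theta$ (which is equivalent to a rate parameter of $\theta^{-1}$).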
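The sum-of-exponentials property, and the difference between the rate and scale conventions, can be checked by simulation. The sketch below assumes an integer shape $k$ and a scale $\theta$ (so the rate is $1/\theta$); the seed, sample size and parameter values are arbitrary. Python's `random.expovariate` takes a rate, while `random.gammavariate` takes a shape and a scale, so both conventions appear side by side.

```python
import random

rng = random.Random(42)  # fixed seed for reproducibility
k, theta = 3, 2.0        # hypothetical integer shape and scale
rate = 1.0 / theta       # the rate convention (as used by BUGS)

# Sum of k independent exponentials, each with mean theta (rate 1/theta).
sums = [sum(rng.expovariate(rate) for _ in range(k)) for _ in range(50000)]
# Direct gamma draws; gammavariate is parametrized by shape and scale.
direct = [rng.gammavariate(k, theta) for _ in range(50000)]

mean_sums = sum(sums) / len(sums)
mean_direct = sum(direct) / len(direct)
print(mean_sums, mean_direct, k * theta)  # both means should be close to k * theta
```

Passing a rate where a scale is expected (or the reverse) is the usual source of disagreement between tools, which is why the parametrization is worth checking before comparing results.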
Inverse gamma
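Definition
\[ f(y;\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,y^{-\alpha-1}\exp\left(-\frac{\beta}{y}\right),\quad y>0, \]
with shape $\alpha>0$ and scale $\beta>0$.
Derivation from Gamma distribution
If $X\sim\mathrm{Gamma}(k,\theta)$ with shape $k$ and scale $\theta$, then $Y=1/X$ follows an inverse-gamma distribution with $\alpha=k$ and $\beta=1/\theta$.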
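The change of variables behind this derivation can be verified pointwise. Assuming $X$ is gamma distributed with shape $k$ and scale $\theta$, and $Y=1/X$, the density of $Y$ is $f_X(1/y)\,/\,y^2$, which should coincide with the inverse-gamma density with shape $k$ and scale $1/\theta$. The function names, parameter values and evaluation points below are illustrative assumptions.

```python
import math

def gamma_pdf(x, k, theta):
    """Gamma density with shape k and scale theta."""
    return x ** (k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)

def inv_gamma_pdf(y, alpha, beta):
    """Inverse-gamma density with shape alpha and scale beta."""
    return beta ** alpha / math.gamma(alpha) * y ** (-alpha - 1) * math.exp(-beta / y)

k, theta = 2.5, 0.8   # hypothetical shape and scale of the gamma variable X
points = (0.3, 1.0, 2.7)
for y in points:
    via_transform = gamma_pdf(1.0 / y, k, theta) / y ** 2   # f_X(1/y) * |d(1/y)/dy|
    direct = inv_gamma_pdf(y, k, 1.0 / theta)
    print(y, via_transform, direct)
```

The two columns agree up to floating-point error at every evaluation point.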