Partial derivatives

Drawing contours is fine when you have only a couple of parameters, but it gets much harder with more. If you have more than two parameters, you actually have to do a 2-D likelihood profile; this is the analogue of the 1-D profiles we learned to calculate, but now you have to take every combination of the two parameters you're interested in (e.g. a 50 × 50 grid of parameter values), maximize with respect to all the other parameters for every point on that surface, and then use the values you've calculated to draw a contour. Especially when the likelihood function itself is hard to calculate, this can be really tedious.
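For a concrete picture of the grid-and-contour part, here is a minimal R sketch with simulated data and a plain two-parameter normal model (so there is nothing extra to maximize over); the data, the grid ranges, and the 50 × 50 grid size are just illustrative choices:

    set.seed(1)
    x <- rnorm(20, mean = 5, sd = 2)            # hypothetical data
    nll <- function(mu, sigma) -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
    muvec    <- seq(3, 7, length = 50)          # 50 x 50 grid of parameter values
    sigmavec <- seq(1, 4, length = 50)
    nllmat <- outer(muvec, sigmavec, Vectorize(nll))
    ## draw the joint 95% confidence region: the contour at the minimum plus
    ## qchisq(0.95, 2)/2 (the bivariate analogue of the chi^2_1(alpha)/2 cutoff below)
    contour(muvec, sigmavec, nllmat,
            levels = min(nllmat) + qchisq(0.95, 2)/2,
            xlab = "mu", ylab = "sigma")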

One shortcut we can take in general is to look at the second derivative(s) of the log-likelihood as a function of the parameter(s), which gives information about the curvature of the surface, which gives us information about how rapidly the log-likelihood gets worse, which gives us an estimate of the confidence intervals. This involves a second level of approximation (again becoming more accurate as the number of data points increases), but it can be worth it when you want to calculate bivariate regions quickly or deal with correlations in high-dimensional parameter spaces.

To motivate this, let's briefly go back to a one-dimensional normal distribution and do some algebra to calculate the maximum likelihood point, maximum likelihood value, and confidence boundaries. The likelihood for a normal distribution (say we are trying to estimate a constant mean, μ, given a constant (known) variance, σ²) is

L(μ,σ) = ∏ᵢ [1/(√(2π) σ)] exp( −(xᵢ−μ)²/(2σ²) ),

so the negative log-likelihood is (ignoring the constant term n log √(2π))

−log L(μ,σ) = n log σ + ∑ᵢ (xᵢ−μ)²/(2σ²).
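As a quick sanity check, this negative log-likelihood is easy to code in R; the function below is just a sketch (the data vector x is hypothetical), and it should agree with R's built-in dnorm() once we add back the constant n log √(2π) that we dropped:

    nll <- function(mu, sigma, x) length(x)*log(sigma) + sum((x - mu)^2/(2*sigma^2))
    set.seed(1)
    x <- rnorm(10, mean = 3, sd = 1.5)              # hypothetical data
    nll(3, 1.5, x) + length(x)*log(sqrt(2*pi))      # add back the dropped constant
    -sum(dnorm(x, mean = 3, sd = 1.5, log = TRUE))  # same number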

The derivative with respect to μ is ∑ −(xᵢ−μ)/σ² = −(1/σ²)(∑xᵢ − nμ), which is 0 (minimizing the negative log-likelihood) when μ = ∑xᵢ/n. (In this case the likelihood profile for μ is particularly simple: no matter what σ is, μ = ∑xᵢ/n minimizes the negative log-likelihood.) Since σ is known, the n log σ term is just a constant, so the minimum negative log-likelihood is that constant plus ∑(xᵢ−μ)²/(2σ²) evaluated at μ = ∑xᵢ/n.
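We can confirm this numerically; the sketch below (simulated data, with σ arbitrarily fixed at 3) minimizes the negative log-likelihood over μ with optimize() and recovers the sample mean:

    set.seed(1)
    sigma <- 3                                   # treat sigma as known
    x <- rnorm(25, mean = 7, sd = sigma)         # hypothetical data
    nll <- function(mu) length(x)*log(sigma) + sum((x - mu)^2/(2*sigma^2))
    optimize(nll, interval = range(x))$minimum   # location of the numerical minimum
    mean(x)                                      # analytic answer, sum(x)/n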

Now we know the maximum likelihood estimate of μ. Where is the univariate confidence boundary for μ? In other words, for what value of μ is the negative log-likelihood equal to the minimum value plus χ²₁(α)/2? Write the maximum likelihood estimate, ∑xᵢ/n, as m, and the value we are looking for as c. The minimum negative log-likelihood is (a constant plus) ∑(xᵢ−m)²/(2σ²), while the negative log-likelihood for μ = c is (the same constant plus) ∑(xᵢ−c)²/(2σ²); the deviance is twice the difference, so

deviance = χ²₁(α) = 2 [ ∑ᵢ (xᵢ−c)²/(2σ²) − ∑ᵢ (xᵢ−m)²/(2σ²) ].    (3)

Multiplying both sides by σ² and simplifying slightly:

∑ (xᵢ−c)² − ∑ (xᵢ−m)² = χ²₁(α) σ².    (4)

Expanding the squared terms (and using ∑ᵢ₌₁ⁿ C = nC, where C is any constant):

∑ xᵢ² − 2c ∑ xᵢ + nc² − ∑ xᵢ² + 2m ∑ xᵢ − nm² = χ²₁(α) σ².    (5)

Cancelling ∑xᵢ² and substituting ∑xᵢ = mn:

−2cmn + nc² + 2m²n − nm² = χ²₁(α) σ².    (6)

Grouping terms:

n (m² − 2cm + c²) = χ²₁(α) σ².    (7)

Collapsing the square and dividing by n:

(m−c)² = χ²₁(α) σ²/n.    (8)

Taking square roots on both sides and rearranging:

c = m ± √(χ²₁(α)) · σ/√n.    (9)

This might look familiar. We've just rederived the expression for the confidence limits of the mean! A few things to note:

• σ/√n is the standard error of the mean;

• it turns out that √(χ²₁(α)), which appears in the last equation, is identical to the corresponding two-tailed quantile of the normal distribution (try sqrt(qchisq(0.95,1)) and qnorm(0.975) in R to check this; see the quick check after this list). The 0.975 appears because this translates to a two-tailed test on the normal distribution but a one-tailed test on the χ² distribution, since the χ² is the distribution of a squared normal deviate;

• you can also note that the test is on a normal distribution rather than a Student's t distribution, which is a reminder that the LRT is an asymptotic result for large data sets, just as the normal distribution is OK to use instead of a t distribution for large n;

• all of this assumes that we know the true variance, σ². As H&M point out in Chapter 7, we really ought to draw a bivariate confidence region for the mean and the variance, which would give us a wider confidence region.
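Here is the quick check in R mentioned above (a sketch with simulated data, treating σ as known; the sample size and parameter values are arbitrary):

    set.seed(1)
    sigma <- 2
    x <- rnorm(50, mean = 10, sd = sigma)
    m <- mean(x)
    ## 95% confidence limits from equation (9)
    m + c(-1, 1) * sqrt(qchisq(0.95, 1)) * sigma/sqrt(length(x))
    ## square root of the chi-square quantile vs. the two-tailed normal quantile
    sqrt(qchisq(0.95, 1))   # about 1.96 ...
    qnorm(0.975)            # ... the same number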

For the normal distribution, the second derivative of negative log-likelihood wrt μ equals

d²[ ∑(xᵢ−μ)²/(2σ²) ]/dμ² = d[ −2 ∑(xᵢ−μ)/(2σ²) ]/dμ    (10)

= d[ −2 (∑xᵢ − nμ)/(2σ²) ]/dμ    (11)

= 2n/(2σ²)    (12)

= n/σ².    (13)

So the reciprocal of the second derivative is σ²/n, the variance of the mean. In general, for a one-parameter likelihood with parameter a, the confidence limits lie at the maximum likelihood estimate plus or minus

√( χ²₁(α) · (∂²(−log L)/∂a²)⁻¹ ).    (14)

We can get the second derivative by (i) calculating it analytically (sometimes not as hard as you'd think); (ii) calculating it numerically, using

∂²f/∂a² (at a = m) = ∂(∂f/∂a)/∂a (at a = m)    (15)

≈ [ ∂f/∂a (at a = m+Δx) − ∂f/∂a (at a = m) ] / Δx    (16)

≈ [ ( f((m+Δx)+Δx) − f(m+Δx) )/Δx − ( f(m+Δx) − f(m) )/Δx ] / Δx    (17)

= [ f(m+2Δx) − 2 f(m+Δx) + f(m) ] / (Δx)²;    (18)

(iii) letting R calculate it for us (if we are using nlm), by putting hessian=TRUE into the function call.
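Here is a sketch of options (ii) and (iii) for the normal-mean example (simulated data, σ treated as known; Δx and the sample size are arbitrary choices):

    set.seed(1)
    sigma <- 2
    x <- rnorm(20, mean = 5, sd = sigma)
    nll <- function(mu) length(x)*log(sigma) + sum((x - mu)^2/(2*sigma^2))
    m <- mean(x)                           # maximum likelihood estimate of mu
    ## (ii) finite-difference second derivative, equation (18)
    dx <- 0.001                            # small, but not so small that round-off dominates
    d2 <- (nll(m + 2*dx) - 2*nll(m + dx) + nll(m)) / dx^2
    d2                                     # compare with the analytic value n/sigma^2
    length(x)/sigma^2
    ## approximate 95% confidence limits from equation (14)
    m + c(-1, 1) * sqrt(qchisq(0.95, 1) / d2)
    ## (iii) let nlm() compute the Hessian for us
    fit <- nlm(nll, p = m, hessian = TRUE)
    fit$hessian                            # 1 x 1 matrix, again close to n/sigma^2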

If we have more than one parameter, the same idea still works (and indeed becomes worth doing, which it isn't really in the one-dimensional case), but we have to learn a little bit more about second derivatives. A multi-parameter likelihood surface has more than one second partial derivative: in fact, what we get is a matrix of second partial derivatives, called the Hessian. For example, for the negative log-likelihood L of the normal distribution with parameters μ and σ, the Hessian looks like this:

( ∂²L/∂μ²      ∂²L/∂μ∂σ )
( ∂²L/∂μ∂σ     ∂²L/∂σ²  ).    (19)

In the very simplest case (one parameter), this reduces to a single number (e.g. ∂²L/∂μ² as above), and the estimated variance of the parameter is just (∂²L/∂μ²)⁻¹ (again as above).

In the next most complicated case, the parameters are uncorrelated, and the matrix is diagonal:

( ∂²L/∂μ²     0        )
( 0           ∂²L/∂σ²  ).    (20)

Then we can compute the variance of each parameter independently: each is the reciprocal of the corresponding second partial derivative (e.g., (∂²L/∂μ²)⁻¹ and (∂²L/∂σ²)⁻¹).

When the off-diagonal elements are not zero, however, we have to find the inverse of the matrix numerically, which we can do in R using solve(hess) if hess is the Hessian matrix. If I is the matrix we get this way, and a and b are our two parameters, then the new matrix is

I = ( σa²    σab )
    ( σab    σb² ),    (21)

where σa² and σb² are the variances of a and b and σab is the covariance between them; the correlation between the parameters is σab/(σa σb).
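Putting the pieces together, here is a sketch for the two-parameter normal model (simulated data; the variable names are arbitrary): fit both μ and σ with nlm(), ask for the Hessian, and invert it with solve() to get the matrix in equation (21):

    set.seed(1)
    x <- rnorm(20, mean = 5, sd = 2)                 # hypothetical data
    nll2 <- function(p) -sum(dnorm(x, mean = p[1], sd = p[2], log = TRUE))
    ## (nlm may warn if it tries sigma <= 0 along the way; optimizing log(sigma)
    ## instead would avoid this in a real analysis)
    fit <- nlm(nll2, p = c(mean(x), sd(x)), hessian = TRUE)
    vmat <- solve(fit$hessian)            # the matrix called I in equation (21)
    vmat                                  # variances on the diagonal, covariance off it
    sqrt(diag(vmat))                      # approximate standard errors of mu and sigma
    vmat[1, 2] / prod(sqrt(diag(vmat)))   # correlation between the parameters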

The ellipse function in misc.R will plot an ellipse on a graph given either a variance-covariance matrix or a vector of variances of both parameters and a correlation between them.
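If you don't have misc.R handy, the computation behind such an ellipse is short; the sketch below (with a made-up center and variance-covariance matrix) maps a circle of radius √(χ²₂(α)) through the Cholesky factor of the matrix to draw the boundary of the quadratic (ellipse) approximation to the 95% region:

    center <- c(1, 2)                                  # hypothetical parameter estimates
    V <- matrix(c(0.04, 0.015, 0.015, 0.02), 2, 2)     # hypothetical variance-covariance matrix
    r <- sqrt(qchisq(0.95, 2))                         # radius for an approximate 95% region
    theta <- seq(0, 2*pi, length = 200)
    unit  <- rbind(cos(theta), sin(theta))             # points on the unit circle
    ell   <- t(t(chol(V)) %*% (r * unit) + center)     # stretch, rotate, and re-center
    plot(ell, type = "l", xlab = "parameter a", ylab = "parameter b")
    points(center[1], center[2], pch = 3)              # mark the center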

[Figure: approximate 95% confidence ellipse compared with the true likelihood contours for the parameters d and r.]

Comparing the (approximate) 95% confidence ellipse to the real contours for this case where n = 20 (that's how many simulated points are in our data set), it doesn't look too bad. The real region is a bit skewed: it includes more points where d and r are both larger than the maximum likelihood estimate, and fewer where both are smaller, while the approximate ellipse is symmetric around the maximum likelihood estimate.

Finally, this method extends to more than two parameters, even if it gets a bit hard to draw the pictures. If you have p parameters, then the Hessian is a p × p matrix. If you invert the Hessian using solve you get a (symmetric) matrix

I = ( σ1²    σ12    …    σ1p )
    ( σ21    σ2²    …    σ2p )
    (  ⋮      ⋮     ⋱     ⋮  )
    ( σp1    σp2    …    σp² ),    (22)

where σi² is the estimated variance of parameter i and σij = σji is the estimated covariance between parameters i and j; the correlation between i and j is σij/(σi σj).
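R's cov2cor() function carries out exactly this rescaling, turning any variance-covariance matrix into the corresponding correlation matrix; a tiny sketch with a made-up matrix:

    V <- matrix(c(0.20, 0.05, 0.01,
                  0.05, 0.10, 0.02,
                  0.01, 0.02, 0.30), nrow = 3)   # hypothetical 3-parameter vcov matrix
    cov2cor(V)                                   # sigma_ij/(sigma_i sigma_j); diagonal = 1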

That's probably a big enough mouthful for one file: see the next lecture for information on comparing different models (but note that we've laid a lot of the groundwork for that topic already).
