Implementing and evaluating the nested maximum likelihood estimation technique

Estimating parameters describing response time distributions is difficult. The most commonly used method for parameter estimation is the maximum likelihood method (ML). However, this method applied on the three-parameter Weibull distribution returns biased estimates and the amount of bias is unknown. A recent method, that we call nested maximum likelihood, was proposed by Gourdin, Hansen and Jaumard (1994). Due to its complexity, it has never been used and tested systematically. Here I compare it to the maximum likelihood method. The results shows that nested maximum likelihood is slightly better than ML. Although the gains are marginal, the method has important implications for future research in parameter estimation.


Denis Cousineau Université de Montréal
Estimating parameters describing response time distributions is difficult.The most commonly used method for parameter estimation is the maximum likelihood method (ML).However, this method applied on the three-parameter Weibull distribution returns biased estimates and the amount of bias is unknown.A recent method, that we call nested maximum likelihood, was proposed by Gourdin, Hansen and Jaumard (1994).Due to its complexity, it has never been used and tested systematically.Here I compare it to the maximum likelihood method.The results shows that nested maximum likelihood is slightly better than ML.Although the gains are marginal, the method has important implications for future research in parameter estimation.
Statistical methods in psychology are dominated by the normal distribution.However, very few measures in experimental psychology follow this distribution.For example, the time to complete a task (maybe the most direct access to cognitive processes) is always asymmetrical with a long tail to the right (e. g.Cousineau and Shiffrin, 2004, Hockley, 1984, with an exception, Hopkins and Kristofferson, 1980).Hence, it is of prime importance that we move towards a description of the response times (RTs) that acknowledge this asymmetry.
The most natural such description assumes that there is a true minimum RT below which it is not possible to respond (fast-guessing notwithstanding).Then, three convenient descriptors could be: the lowest possible RT, the width of the distribution and the degree of asymmetry.See Rouder, Lu, Speckman, Sun and Jiang (2005) for reasons supporting this choice.Whereas the width is akin to standard deviation, We would like to thank Sébastien Hélie for his comments on an earlier version of this text.Request for reprint should be addressed to Denis Cousineau, Département de psychologie, Université de Montréal, C. P. 6128, succ.Centre-ville, Montréal (Québec) H3C 3J7, CANADA, or using e-mail at Denis.Cousineau@Umontreal.CA.This research was supported in part by le Conseil de la Recherche en Sciences Naturelles et en Génie du Canada.
there is nothing resembling a mean parameter in the above descriptors, highlighting the conceptual gap with the normal distribution.Figure 1 illustrates two distributions from Cousineau and Larochelle (2004) with the corresponding descriptors.
A parametric approach for quantifying parameters consists in first assuming an underlying theoretical distribution and then adjusting its parameters to the data set through best-fitting techniques.

Theoretical distributions
There exist many families of distributions that could be fit to a data set in order to get parameters (Luce, 1986, Townsend andAshby, 1983).One of the most often-used distribution in psychology is the ExGaussian distribution (e.g.Ratcliff, 1978, Hohle, 1965).However, it has the implausible assumption that valid RTs could occurs before the stimulus (the Gumbel distribution has the same assumption).An alternative distribution is the Lognormal distribution (also called the Galton distribution; Limpton, Stahel and Appt, 2001, West and Schlesinger, 1990, Galton, 1879).However, more and more, the Weibull distribution is used (Weibull, 1952, Logan, 1992, Palmer, 1998, Tuerlinckx, 2004, and many others).Rouder, Lu, Speckman, Sun and Jiang (2005) review practical reasons to use this family of distributions.Also, Cousineau, Goodman and Shiffrin (2003) suggest that it could be a consequence of the way the human brain works.Figure 2 illustrates some possible Weibull distributions along with their parameters.The parameters are the shift (the left-right position of the minimum), the scale (the width) and the shape (the asymmetry).They are often denoted with the greek letters α, β and γ respectively.

Fitting techniques
There exist a few families of techniques to find the bestfitting values of the parameters.One is the method of moment (e. g.Harter and Moore, 1965), another is through Bayesian estimation techniques (e. g.Rouder, Sun, Speckman, Lu and Zhou, 2003), but the most commonly used method is the maximum likelihood (ML) parameter estimation method.Refer to Myung (2003) for a tutorial or Cousineau and Larochelle (1997), Cousineau, Brown and Heathcote (2004).It requires a function computing the likelihood of one possible set of parameters given empirical RTs (noted X).This function is noted L(α, β, γ | X).This function is subjected to a maximization procedure which varies freely the parameter values until the likelihood is maximized.Often, minus the likelihood function is minimized, as minimization procedures used to be more easily available.Also, to avoid underflow on most computers, the log of the likelihood function is used.Hence, the process is to find α, β and γ such that minus the log of the likelihood is minimized, noted in short: ( ) Α, Β and Γ are the domains of the parameters.For the In practical application, it is preferable to use Γ = {0 < γ < 5} as RT distributions are never asymmetrical to the right.

Why another method?
ML is the most efficient method to estimate parameters (see next for a formal definition of efficiency).However, it is also known to be biased (Hirose, 1999): On average, the parameter estimated is not going to be equal to the true parameter of a given population.The bias can be quite large for small sample sizes.For example, for a sample of 8 RTs (taken from a simulated population), the scale parameter is underestimated on average by near 40%!For larger samples, the bias tends to disappear (asymptotically unbiased).The trouble is that the exact amount of bias is unknown.One consequence is that it is not possible to compare parameters taken from samples differing in size.Also, we don't know whether bias depends on the asymmetry or not.This is why new techniques may potentially be important: They may find estimates with smaller bias.Previous variations on the ML methods are MPS (Cheng and Amin, 1983), QMP (Heathcote, Brown and Cousineau, 2004) and prior-informed ML (Cousineau, submitted).

The nested maximum likelihood technique
This technique was proposed by Gourdin, Hansen and Jaumard a decade ago (1994).However, due to the complexity of implementing this method within traditional computer languages, it has never been used.Further, the authors never tested their method on samples taken from simulated populations with known parameters so that the amount of bias be estimated.
The method differs from regular ML in that it does not fit all three parameters simultaneously.Instead, it explores one parameter, adjusting the other two so that this parameter yields the best fit possible.To adjust the two "inner" parameters, it just acts likewise: exploring one parameter, the last one being best-fitted accordingly.Hence, this technique replaces one difficult minimization problem with three simpler (one-dimensional) nested problems.
Replacing -Log(L(θ | X) ) with LL(θ) for short, the method is formally given by:  Gourdin et al. (1994) were able to demonstrate that this nested ML method has all the properties of the regular ML method (efficiency, asymptotically unbiased when n is large).However, for a small n, we don't know if bias is smaller with this technique compared to other techniques.
The pseudo-code of the implementation proposed by Gourdin et al. spans two whole pages; the finished program certainly required over 40 pages in C. Here, we propose the same procedure implemented using Mathematica which takes only one page (Wolfram, 1996).See Listing 1 for the full program.

Testing nested ML estimates against regular ML estimates
To assess the capabilities of nested ML to estimate correctly the parameters, we ran a series of Monte Carlo simulations.The samples are taken from a simulated population with known parameters and then the parameters were estimated using both methods.Because α and β are scale parameters, they were fixed at α = 300 and β = 100.the shape, being one possible cause of bias, was varied (γ = 1.0, 1.5 or 2.0) as well as the sample size (n = 8, 16, 32, 64, 128).
Simulations for each combination of γ and n were replicated a thousand times.We measured: 1-Bias: the difference between the true parameters and the mean of the estimates.Formally, Bias = E (θi -θT ) = θ -θT where θ is one of the parameter, θT is the true parameter value, θi is the estimated parameter on simulation i, and E is the mean.2-Efficiency: the amount of variability in the estimates around their average.Bad methods have very large efficiency so that fitting one data set can result in wildly differing estimates.Formally, where SD is the standard deviation.Both methods used the gradient descent minimization algorithm (Chandler, 1965) and both had the same parameter domains: A = {0 < α < Min(X) }, B = {β > 0}, Γ= {0 < γ < 5}.
The results are seen in Figure 3.As seen, both methods are biased in the same directions (shift is underestimated, scale and shape are overestimated) and by the same magnitude.Whereas bias is little dependant on the shape parameter, it seems that efficiency is, being worst for γ = 2.
The same results are presented in Figure 4 collapsed across the γ values.In this figure, we used the root mean square error of estimation (RMSE) where It can be shown that RMSE 2 = Bias 2 + Eff 2 .It is a good measure when both bias and efficiency are equally important.As seen, the parameter α is the best estimated parameter (in percent, RMSE are 4.7% and 5.0% across sample sizes for nested ML and ML respectively).The other two parameters are poorly estimated (in percent, near 27% and 39% for both methods).There is systematically a small advantage of nested MLE over ML, but the gain is very small.

Discussion
Nested ML is definitely not the solution to adopt for practical applications.So why bother?The demonstrations that accompany the method have profound implications for future research.It shows that the problem of parameter estimation can be broken down in encapsulated problems that can be attacked independently.Among other things, it opens the door to mixed solutions.For example, one parameter might be estimated using another technique than ML.So doing will not yield an efficient method (as ML are generally the most efficient) but if bias can be reduced so that global RMSE will not deteriorate, it is going to be an important progress.

Figure 1 .Figure 2 .
Figure 1.Two distributions of RTs with an illustration of the parameters describing them.The distribution to the right is shifted towards large RTs, has smaller width and is more asymmetrical than the one on the left.0 200 400 600 800 1000 1200 1400 Frequency new α value is tried at the top level (3), it starts a cascade of computations going down to the bottom level (1).

Figure 3 .
Figure3.Bias and efficiency for all the combinations of ϒ (from left to right) of the estimated parameters (from top to bottom).In each graph, the left part is the nested ML estimation technique and the right part is the regular ML estimation technique.The symbols denotes the various samples sizes (from left to right: 8, 16, 32, 64 and 128).

Figure 4 .
Figure 4. RMSE for the three parameters across ϒ. Format follows that of Figure 4. On the left is RMSE, on the right, RMSE expressed in percentage of the true parameter value.