Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ($m(x,\theta_i)=\theta_1+\theta_2 x^{(\theta_3)}$).
Consider a set of $m$ data points, $(x_1,y_1),(x_2,y_2),\dots,(x_m,y_m),$ and a curve (model function) $\hat{y}=f(x,\boldsymbol{\beta}),$ that in addition to the variable $x$ also depends on $n$ parameters, $\boldsymbol{\beta}=(\beta_1,\beta_2,\dots,\beta_n),$ with $m\geq n.$ It is desired to find the vector $\boldsymbol{\beta}$ of parameters such that the curve best fits the given data in the least squares sense, that is, the sum of squares
$$S=\sum_{i=1}^{m}r_i^2$$
is minimized, where the residuals $r_i$ are given by
$$r_i=y_i-f(x_i,\boldsymbol{\beta}),\qquad i=1,2,\dots,m.$$
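The minimum value of S occurs when the gradient is zero. Since the model contains n parameters there are n gradient equations:
$$\frac{\partial S}{\partial\beta_j}=2\sum_{i}r_i\frac{\partial r_i}{\partial\beta_j}=0\qquad(j=1,\ldots,n).$$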
In a nonlinear system, the derivatives $\partial r_i/\partial\beta_j$ are functions of both the independent variable and the parameters, so in general these gradient equations do not have a closed-form solution. Instead, initial values must be chosen for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation,
$$\beta_j\approx\beta_j^{k+1}=\beta_j^{k}+\Delta\beta_j.$$
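Here, k is an iteration number and the vector of increments, $\Delta\boldsymbol{\beta}$, is known as the shift vector. At each iteration the model is linearized by approximation to a first-order Taylor polynomial expansion about $\boldsymbol{\beta}^{k}$:
$$f(x_i,\boldsymbol{\beta})\approx f(x_i,\boldsymbol{\beta}^{k})+\sum_{j}\frac{\partial f(x_i,\boldsymbol{\beta}^{k})}{\partial\beta_j}\left(\beta_j-\beta_j^{k}\right)=f(x_i,\boldsymbol{\beta}^{k})+\sum_{j}J_{ij}\,\Delta\beta_j.$$
The Jacobian matrix $\mathbf{J}$, with elements $J_{ij}=\partial f(x_i,\boldsymbol{\beta}^{k})/\partial\beta_j$, depends on the independent variable and the parameters, so it changes from one iteration to the next. In terms of the linearized model, $\partial r_i/\partial\beta_j=-J_{ij}$ and the residuals are given by
$$r_i=y_i-f(x_i,\boldsymbol{\beta})\approx\Delta y_i-\sum_{s=1}^{n}J_{is}\,\Delta\beta_s,\qquad\Delta y_i=y_i-f(x_i,\boldsymbol{\beta}^{k}).$$
Substituting these expressions into the gradient equations, they become
$$-2\sum_{i=1}^{m}J_{ij}\left(\Delta y_i-\sum_{s=1}^{n}J_{is}\,\Delta\beta_s\right)=0,$$
which, on rearrangement, become n simultaneous linear equations, the normal equations
$$\sum_{i=1}^{m}\sum_{s=1}^{n}J_{ij}J_{is}\,\Delta\beta_s=\sum_{i=1}^{m}J_{ij}\,\Delta y_i\qquad(j=1,\ldots,n).$$
The normal equations are written in matrix notation as
$$\left(\mathbf{J}^{\mathsf{T}}\mathbf{J}\right)\Delta\boldsymbol{\beta}=\mathbf{J}^{\mathsf{T}}\,\Delta\mathbf{y}.$$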
These equations form the basis for the Gauss–Newton algorithm for a non-linear least squares problem.
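As a minimal sketch of the iteration that these equations define, the following Python code applies repeated Gauss–Newton steps to a hypothetical two-parameter model $f(x,\boldsymbol{\beta})=\beta_1 e^{\beta_2 x}$. The model, the noise-free synthetic data, the starting values and all function names are assumptions chosen for illustration. The 2×2 normal equations are solved by Cramer's rule, and no protection against divergence is included.

```python
import math

def model(x, b):
    # Hypothetical model (an assumption): f(x, beta) = beta_1 * exp(beta_2 * x)
    return b[0] * math.exp(b[1] * x)

def jacobian_row(x, b):
    # Analytical derivatives J_i1 = df/dbeta_1 and J_i2 = df/dbeta_2
    e = math.exp(b[1] * x)
    return e, b[0] * x * e

def gauss_newton_shift(xs, ys, b):
    # Accumulate the normal equations (J^T J) dbeta = J^T dy
    a11 = a12 = a22 = g1 = g2 = 0.0
    for x, y in zip(xs, ys):
        j1, j2 = jacobian_row(x, b)
        dy = y - model(x, b)
        a11 += j1 * j1
        a12 += j1 * j2
        a22 += j2 * j2
        g1 += j1 * dy
        g2 += j2 * dy
    # Solve the 2x2 system by Cramer's rule
    det = a11 * a22 - a12 * a12
    return (g1 * a22 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det

def gauss_newton(xs, ys, b0, iterations=20):
    b = list(b0)
    for _ in range(iterations):
        d1, d2 = gauss_newton_shift(xs, ys, b)
        b = [b[0] + d1, b[1] + d2]  # beta^(k+1) = beta^k + dbeta
    return b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]  # synthetic data, true beta = (2, -0.5)
beta = gauss_newton(xs, ys, (1.5, -0.3))
```

With these starting values the refinement should settle on the parameters used to generate the data; a poorer starting point may diverge, since the sketch applies each shift in full.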
Note the sign convention in the definition of the Jacobian matrix in terms of the derivatives. Formulas linear in $J$ may appear with a factor of $-1$ in other articles or the literature.
When the observations are not equally reliable, a weighted sum of squares may be minimized,
$$S=\sum_{i=1}^{m}W_{ii}r_i^2.$$
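Each element of the diagonal weight matrix W should, ideally, be equal to the reciprocal of the error variance of the measurement.[1] The normal equations are then, more generally,
$$\left(\mathbf{J}^{\mathsf{T}}\mathbf{WJ}\right)\Delta\boldsymbol{\beta}=\mathbf{J}^{\mathsf{T}}\mathbf{W}\,\Delta\mathbf{y}.$$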
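A minimal sketch of a weighted refinement, assuming a hypothetical one-parameter model $f(x,\beta)=e^{-\beta x}$ and assumed measurement standard deviations `sigmas`, so that each weight is $W_{ii}=1/\sigma_i^2$. With a single parameter the weighted normal equations reduce to one scalar equation. The data, names and starting value are illustrative only.

```python
import math

def weighted_shift(xs, ys, sigmas, b):
    """One weighted Gauss-Newton shift for the assumed model f(x, b) = exp(-b * x)."""
    jt_w_dy = 0.0  # scalar J^T W dy
    jt_w_j = 0.0   # scalar J^T W J
    for x, y, sigma in zip(xs, ys, sigmas):
        w = 1.0 / sigma ** 2       # reciprocal of the error variance
        f = math.exp(-b * x)
        j = -x * f                 # df/db
        dy = y - f
        jt_w_dy += j * w * dy
        jt_w_j += j * w * j
    return jt_w_dy / jt_w_j

xs = [0.5, 1.0, 2.0, 3.0]
sigmas = [0.1, 0.1, 0.2, 0.4]              # assumed standard deviations
ys = [math.exp(-0.7 * x) for x in xs]      # noise-free synthetic data, true b = 0.7
b = 0.5
for _ in range(10):
    b += weighted_shift(xs, ys, sigmas, b)
```

Because the synthetic data are noise-free, the weights here affect only the path of the refinement, not the final value; with noisy data they would also change the estimate.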
In linear least squares the objective function, S, is a quadratic function of the parameters.
Some problems of ill-conditioning and divergence can be corrected by finding initial parameter estimates that are near to the optimal values. A good way to do this is by computer simulation. Both the observed and calculated data are displayed on a screen. The parameters of the model are adjusted by hand until the agreement between observed and calculated data is reasonably good. Although this will be a subjective judgment, it is sufficient to find a good starting point for the non-linear refinement. Initial parameter estimates can also be created using transformations or linearizations. Better still, evolutionary algorithms such as the Stochastic Funnel Algorithm can lead to the convex basin of attraction that surrounds the optimal parameter estimates.[citation needed] Hybrid algorithms that use randomization and elitism, followed by Newton methods, have been shown to be useful and computationally efficient.[citation needed]
Any method among the ones described below can be applied to find a solution.
The common sense criterion for convergence is that the sum of squares does not increase from one iteration to the next. However, this criterion is often difficult to implement in practice, for various reasons. A useful convergence criterion is
$$\left|\frac{S^{k}-S^{k+1}}{S^{k}}\right|<0.0001.$$
The value 0.0001 is somewhat arbitrary and may need to be changed. An alternative criterion is
$$\left|\frac{\Delta\beta_j}{\beta_j}\right|<0.001,\qquad j=1,\ldots,n.$$
Again, the numerical value is somewhat arbitrary; 0.001 is equivalent to specifying that each parameter should be refined to 0.1% precision. This is reasonable when it is less than the largest relative standard deviation on the parameters.
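The two kinds of test discussed here, one on the relative change in the sum of squares and one on the relative size of each parameter shift, can be sketched as follows. The function names and default tolerances are assumptions; as the text notes, the numerical values are somewhat arbitrary.

```python
def sum_of_squares_converged(s_old, s_new, tol=1e-4):
    """Relative change in S between successive iterations is below tol."""
    if s_old == 0.0:
        return True  # already an exact fit; avoids division by zero
    return abs(s_old - s_new) / s_old < tol

def parameters_converged(beta, shift, tol=1e-3):
    """Every shift is below tol relative to its parameter.

    Written as |d| < tol * |b| to avoid dividing by a zero parameter;
    a parameter that is exactly zero is never reported as converged.
    """
    return all(abs(d) < tol * abs(b) for b, d in zip(beta, shift))
```

For instance, `parameters_converged([2.0, -0.5], [0.001, 0.0001])` holds because both shifts are under 0.1% of the corresponding parameter values.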
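There are models for which it is either very difficult or even impossible to derive analytical expressions for the elements of the Jacobian. Then, the numerical approximation
$$\frac{\partial f(x_i,\boldsymbol{\beta})}{\partial\beta_j}\approx\frac{\delta f(x_i,\boldsymbol{\beta})}{\delta\beta_j}$$
is obtained by calculation of $f(x_i,\boldsymbol{\beta})$ for $\beta_j$ and $\beta_j+\delta\beta_j$. The increment $\delta\beta_j$ should be chosen so that the numerical derivative is not subject to approximation error by being too large, or round-off error by being too small.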
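A minimal sketch of such a numerical Jacobian, using a forward difference for each parameter. The relative step size `rel_step` and its scaling by the parameter magnitude are assumptions, not a recommendation from the source; the exponential model in the usage lines is likewise hypothetical.

```python
import math

def numerical_jacobian(f, xs, beta, rel_step=1e-6):
    """Forward-difference approximation to J_ij = df(x_i, beta) / dbeta_j."""
    rows = []
    for x in xs:
        f0 = f(x, beta)
        row = []
        for j in range(len(beta)):
            # Increment scaled to the parameter, with a floor for small values
            delta = rel_step * max(abs(beta[j]), 1.0)
            shifted = list(beta)
            shifted[j] += delta
            row.append((f(x, shifted) - f0) / delta)
        rows.append(row)
    return rows

def model(x, b):
    # Hypothetical model used only to exercise the function
    return b[0] * math.exp(b[1] * x)

J = numerical_jacobian(model, [0.0, 1.0, 2.0], [2.0, -0.5])
```

Each row of `J` corresponds to one data point and each column to one parameter, matching the layout of the analytical Jacobian.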
Some information on parameter errors, confidence limits and residuals is given in the corresponding section on the Weighted least squares page.
Multiple minima can occur in a variety of circumstances.
Not all multiple minima have equal values of the objective function. False minima, also known as local minima, occur when the objective function value is greater than its value at the so-called global minimum. To be certain that the minimum found is the global minimum, the refinement should be started with widely differing initial values of the parameters. When the same minimum is found regardless of starting point, it is likely to be the global minimum.
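The multi-start check described above can be sketched as follows, assuming a hypothetical one-parameter model $f(x,\beta)=\sin(\beta x)$, whose periodicity is one way of producing several minima. Each start is refined by Gauss–Newton with step halving, and the final sums of squares are compared. The data, the starting values and the function names are illustrative assumptions.

```python
import math

def sum_sq(xs, ys, b):
    return sum((y - math.sin(b * x)) ** 2 for x, y in zip(xs, ys))

def refine(xs, ys, b, iterations=50, max_halvings=30):
    """One-parameter Gauss-Newton refinement that only accepts reductions in S."""
    s = sum_sq(xs, ys, b)
    for _ in range(iterations):
        # J_i = x_i * cos(b * x_i); scalar normal equation (J^T J) db = J^T dy
        num = sum(x * math.cos(b * x) * (y - math.sin(b * x)) for x, y in zip(xs, ys))
        den = sum((x * math.cos(b * x)) ** 2 for x in xs)
        if den == 0.0:
            break
        shift = num / den
        for _halving in range(max_halvings):
            s_trial = sum_sq(xs, ys, b + shift)
            if s_trial < s:
                b, s = b + shift, s_trial
                break
            shift *= 0.5
        else:
            break  # no reduction found: treat as converged
    return b, s

xs = [0.1 * i for i in range(1, 11)]
ys = [math.sin(2.0 * x) for x in xs]   # noise-free synthetic data, true b = 2
starts = [-6.0, 1.8, 9.0, 15.0]        # widely differing initial values
results = [refine(xs, ys, b0) for b0 in starts]
best_b, best_s = min(results, key=lambda r: r[1])
```

Starts that end with a clearly larger sum of squares than `best_s` have been trapped in a false minimum; agreement between several starts lends support to the lowest one being the global minimum.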
When multiple minima exist there is an important consequence: the objective function will have a maximum value somewhere between two minima. The normal equations matrix is not positive definite at a maximum in the objective function, as the gradient is zero and no unique direction of descent exists. Refinement from a point (a set of parameter values) close to a maximum will be ill-conditioned, so such a point should be avoided as a starting point. For example, when fitting a Lorentzian the normal equations matrix is not positive definite when the half-width of the band is zero.[2]
A non-linear model can sometimes be transformed into a linear one. Such an approximation is, for instance, often applicable in the vicinity of the best estimator, and it is one of the basic assumptions in most iterative minimization algorithms. When a linear approximation is valid, the model can be used directly for inference with generalized least squares, where the equations of the Linear Template Fit[3] apply.
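Another example of a linear approximation would be when the model is a simple exponential function,
$$f(x_i,\boldsymbol{\beta})=\alpha e^{\beta x_i},$$
which can be transformed into a linear model by taking logarithms:
$$\log f(x_i,\boldsymbol{\beta})=\log\alpha+\beta x_i.$$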
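A minimal sketch of this kind of transformation, assuming the exponential model has the form $y=\alpha e^{\beta x}$ with all observations positive. Taking logarithms gives a straight line in $x$, which is fitted by ordinary linear least squares. The function name and the synthetic data are assumptions; the result is best regarded as a source of initial estimates, since the transformation also changes how the errors are weighted.

```python
import math

def fit_exponential_by_logs(xs, ys):
    """Fit log(y) = log(alpha) + beta * x by linear least squares (needs y > 0)."""
    m = len(xs)
    logs = [math.log(y) for y in ys]
    x_mean = sum(xs) / m
    l_mean = sum(logs) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxl = sum((x - x_mean) * (l - l_mean) for x, l in zip(xs, logs))
    beta = sxl / sxx                           # slope
    alpha = math.exp(l_mean - beta * x_mean)   # back-transformed intercept
    return alpha, beta

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]    # noise-free synthetic data
alpha, beta = fit_exponential_by_logs(xs, ys)
```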
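Another example is furnished by Michaelis–Menten kinetics, used to determine two parameters $V_{\max}$ and $K_m$:
$$v=\frac{V_{\max}[S]}{K_m+[S]}.$$
The Lineweaver–Burk plot
$$\frac{1}{v}=\frac{1}{V_{\max}}+\frac{K_m}{V_{\max}[S]}$$
of $1/v$ against $1/[S]$ is linear in the parameters $1/V_{\max}$ and $K_m/V_{\max}$.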
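The normal equations
$$\left(\mathbf{J}^{\mathsf{T}}\mathbf{WJ}\right)\Delta\boldsymbol{\beta}=\mathbf{J}^{\mathsf{T}}\mathbf{W}\,\Delta\mathbf{y}$$
are solved for the shift vector $\Delta\boldsymbol{\beta}$ at each iteration, and the parameters are updated as $\boldsymbol{\beta}^{k+1}=\boldsymbol{\beta}^{k}+\Delta\boldsymbol{\beta}$.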
If divergence occurs, a simple expedient is to reduce the length of the shift vector, $\Delta\boldsymbol{\beta}$, by a fraction, f:
$$\boldsymbol{\beta}^{k+1}=\boldsymbol{\beta}^{k}+f\,\Delta\boldsymbol{\beta}.$$
When using shift-cutting, the direction of the shift vector remains unchanged. This limits the applicability of the method to situations where the direction of the shift vector is not very different from what it would be if the objective function were approximately quadratic in the parameters, $\boldsymbol{\beta}^{k}$.
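Shift-cutting can be sketched as a small helper that scales the shift vector without changing its direction. Successive halving of the fraction is one simple choice and is an assumption here, as are the function name, the `max_halvings` limit and the toy objective in the usage lines.

```python
def cut_shift(sum_sq, beta, shift, max_halvings=20):
    """Return (new_beta, fraction) with S reduced, halving the fraction as needed.

    sum_sq is a callable giving the objective S for a parameter list.
    If no reduction is found, beta is returned unchanged with fraction 0.
    """
    s_old = sum_sq(beta)
    fraction = 1.0
    for _ in range(max_halvings):
        trial = [b + fraction * d for b, d in zip(beta, shift)]
        if sum_sq(trial) < s_old:
            return trial, fraction
        fraction *= 0.5  # same direction, shorter step
    return list(beta), 0.0

# Toy objective with a single residual r = beta_1 - 1 (illustration only)
objective = lambda b: (b[0] - 1.0) ** 2
new_beta, used_fraction = cut_shift(objective, [0.0], [3.0])
```

In this toy case the full shift overshoots the minimum and increases S, so the first halving is accepted.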
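If divergence occurs and the direction of the shift vector is so far from its "ideal" direction that shift-cutting is not very effective, that is, the fraction f required to avoid divergence is very small, the direction must be changed. This can be achieved by using the Marquardt parameter.[5] In this method the normal equations are modified
$$\left(\mathbf{J}^{\mathsf{T}}\mathbf{WJ}+\lambda\mathbf{I}\right)\Delta\boldsymbol{\beta}=\mathbf{J}^{\mathsf{T}}\mathbf{W}\,\Delta\mathbf{y},$$
where $\lambda$ is the Marquardt parameter and $\mathbf{I}$ is an identity matrix.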
Various strategies have been proposed for the determination of the Marquardt parameter. As with shift-cutting, it is wasteful to optimize this parameter too stringently. Rather, once a value has been found that brings about a reduction in the value of the objective function, that value of the parameter is carried to the next iteration, reduced if possible, or increased if need be. When reducing the value of the Marquardt parameter, there is a cut-off value below which it is safe to set it to zero, that is, to continue with the unmodified Gauss–Newton method. The cut-off value may be set equal to the smallest singular value of the Jacobian.[6] A bound for this value is given by $1/\operatorname{tr}\left(\mathbf{J}^{\mathsf{T}}\mathbf{WJ}\right)^{-1}$, where tr is the trace function.[7]
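A minimal sketch of a refinement that uses the Marquardt parameter, assuming unit weights, a hypothetical model $f(x,\boldsymbol{\beta})=\beta_1 e^{\beta_2 x}$, and one simple strategy: divide the parameter by 10 after a step that reduces S and multiply it by 10 after a step that does not. The factor of 10, the starting value of `lam` and all names are assumptions, not the strategy of any particular reference.

```python
import math

def model(x, b):
    return b[0] * math.exp(b[1] * x)

def sum_sq(xs, ys, b):
    return sum((y - model(x, b)) ** 2 for x, y in zip(xs, ys))

def marquardt_shift(xs, ys, b, lam):
    # Build J^T J and J^T dy (unit weights), then add lam to the diagonal
    a11 = a12 = a22 = g1 = g2 = 0.0
    for x, y in zip(xs, ys):
        e = math.exp(b[1] * x)
        j1, j2 = e, b[0] * x * e
        dy = y - b[0] * e
        a11 += j1 * j1
        a12 += j1 * j2
        a22 += j2 * j2
        g1 += j1 * dy
        g2 += j2 * dy
    a11 += lam
    a22 += lam
    det = a11 * a22 - a12 * a12
    return (g1 * a22 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det

def fit(xs, ys, b0, lam=1e-3, iterations=50):
    b = list(b0)
    s = sum_sq(xs, ys, b)
    for _ in range(iterations):
        d1, d2 = marquardt_shift(xs, ys, b, lam)
        trial = [b[0] + d1, b[1] + d2]
        s_trial = sum_sq(xs, ys, trial)
        if s_trial < s:
            b, s = trial, s_trial   # accept the step, relax towards Gauss-Newton
            lam /= 10.0
        else:
            lam *= 10.0             # reject the step, shorten and rotate the shift
    return b, s

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]  # noise-free synthetic data
beta, s_min = fit(xs, ys, (1.5, -0.3))
```

A large `lam` turns the shift towards the direction of steepest descent and shortens it, while a small `lam` recovers the Gauss–Newton shift.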
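The minimum in the sum of squares can be found by a method that does not involve forming the normal equations. The residuals with the linearized model can be written as
$$\mathbf{r}=\Delta\mathbf{y}-\mathbf{J}\,\Delta\boldsymbol{\beta}.$$
The Jacobian is subjected to an orthogonal decomposition; the QR decomposition serves to illustrate the process:
$$\mathbf{J}=\mathbf{QR},$$
where $\mathbf{Q}$ is an orthogonal $m\times m$ matrix and $\mathbf{R}$ is an $m\times n$ matrix partitioned into an $n\times n$ upper triangular block $\mathbf{R}_n$ and an $(m-n)\times n$ zero block:
$$\mathbf{R}=\begin{bmatrix}\mathbf{R}_n\\\mathbf{0}\end{bmatrix}.$$
The residual vector is left-multiplied by $\mathbf{Q}^{\mathsf{T}}$:
$$\mathbf{Q}^{\mathsf{T}}\mathbf{r}=\mathbf{Q}^{\mathsf{T}}\Delta\mathbf{y}-\mathbf{R}\,\Delta\boldsymbol{\beta}=\begin{bmatrix}\left(\mathbf{Q}^{\mathsf{T}}\Delta\mathbf{y}\right)_n-\mathbf{R}_n\,\Delta\boldsymbol{\beta}\\\left(\mathbf{Q}^{\mathsf{T}}\Delta\mathbf{y}\right)_{m-n}\end{bmatrix}.$$
This has no effect on the sum of squares since $S=\mathbf{r}^{\mathsf{T}}\mathbf{Q}\mathbf{Q}^{\mathsf{T}}\mathbf{r}=\mathbf{r}^{\mathsf{T}}\mathbf{r}$ because $\mathbf{Q}$ is orthogonal. The minimum value of S is attained when the upper block is zero. Therefore, the shift vector is found by solving
$$\mathbf{R}_n\,\Delta\boldsymbol{\beta}=\left(\mathbf{Q}^{\mathsf{T}}\Delta\mathbf{y}\right)_n.$$
These equations are easily solved as $\mathbf{R}_n$ is upper triangular.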
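A minimal sketch of the orthogonal-decomposition route, assuming NumPy is available. `numpy.linalg.qr` in its default reduced mode returns only the first n columns of Q and the n×n upper triangular block of R, which is all that is needed; the triangular system is then solved by back substitution. The Jacobian and residuals below come from a hypothetical exponential model and are illustrative only.

```python
import numpy as np

def shift_by_qr(J, dy):
    """Solve R_n dbeta = (Q^T dy)_n without forming the normal equations."""
    Q, R = np.linalg.qr(J)          # reduced QR: Q is m x n, R is n x n
    rhs = Q.T @ dy
    n = R.shape[0]
    shift = np.zeros(n)
    for i in range(n - 1, -1, -1):  # back substitution, last row first
        shift[i] = (rhs[i] - R[i, i + 1:] @ shift[i + 1:]) / R[i, i]
    return shift

# Hypothetical model f = b1 * exp(b2 * x), evaluated at beta^k = (1.5, -0.3)
x = np.arange(5.0)
e = np.exp(-0.3 * x)
J = np.column_stack([e, 1.5 * x * e])    # columns: df/db1, df/db2
dy = 2.0 * np.exp(-0.5 * x) - 1.5 * e    # observed minus calculated
shift = shift_by_qr(J, dy)
```

The result agrees with the solution of the normal equations for the same `J` and `dy`, but avoids squaring the condition number of the Jacobian.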
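A variant of the method of orthogonal decomposition involves singular value decomposition, in which R is diagonalized by further orthogonal transformations:
$$\mathbf{J}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf{T}},$$
where $\mathbf{U}$ is orthogonal, $\boldsymbol{\Sigma}$ is a diagonal matrix of singular values and $\mathbf{V}$ is the orthogonal matrix of right singular vectors of $\mathbf{J}$. In this case the shift vector is given by
$$\Delta\boldsymbol{\beta}=\mathbf{V}\boldsymbol{\Sigma}^{-1}\left(\mathbf{U}^{\mathsf{T}}\Delta\mathbf{y}\right)_n.$$
The relative simplicity of this expression is very useful in theoretical analysis of non-linear least squares. The application of singular value decomposition is discussed in detail in Lawson and Hanson.[6]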
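A minimal sketch of the singular value decomposition variant, again assuming NumPy. With `full_matrices=False`, `numpy.linalg.svd` returns the first n columns of U, the n singular values and the transpose of V, so the shift is obtained by dividing the projected residuals by the singular values. The sketch assumes none of the singular values is zero; the example data are hypothetical.

```python
import numpy as np

def shift_by_svd(J, dy):
    """Shift vector from the thin SVD J = U diag(sigma) V^T."""
    U, sigma, Vt = np.linalg.svd(J, full_matrices=False)
    # dbeta = V diag(1/sigma) U^T dy; assumes every sigma is non-zero
    return Vt.T @ ((U.T @ dy) / sigma)

# Hypothetical model f = b1 * exp(b2 * x), evaluated at beta^k = (1.5, -0.3)
x = np.arange(5.0)
e = np.exp(-0.3 * x)
J = np.column_stack([e, 1.5 * x * e])
dy = 2.0 * np.exp(-0.5 * x) - 1.5 * e
shift = shift_by_svd(J, dy)
```

Inspecting `sigma` also gives a direct indication of ill-conditioning: a very small singular value relative to the largest signals a poorly determined combination of parameters.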
There are many examples in the scientific literature where different methods have been used for non-linear data-fitting problems.
Direct search methods depend on evaluations of the objective function at a variety of parameter values and do not use derivatives at all. They offer alternatives to the use of numerical derivatives in the Gauss–Newton method and gradient methods.
More detailed descriptions of these, and other, methods are available in Numerical Recipes, together with computer code in various languages.