We now devote our attention to a method of category (3) from page .
The method differs from the Steepest Descent and Conjugate Gradients methods, which
are both of category (2), in that information from the second derivative is used to
locate the minimum of the function $f$. This results in faster convergence, but
not necessarily less computing time. The computation of the second derivative and the
handling of its matrix can be very time-consuming, especially for large systems.
The idea behind the Newton-Raphson method is to approximate the given function
in each iteration by a quadratic function, as given in equation
(4.6), and then move to the minimum of this quadratic. The quadratic function
at a point $\mathbf{x}$ in a suitable neighbourhood of the current point $\mathbf{x}_k$ is
given by a truncated Taylor series:
\begin{equation}
  f(\mathbf{x}) \approx f(\mathbf{x}_k) + \mathbf{g}_k^{T}(\mathbf{x}-\mathbf{x}_k)
  + \tfrac{1}{2}(\mathbf{x}-\mathbf{x}_k)^{T}\mathbf{H}_k(\mathbf{x}-\mathbf{x}_k),
  \tag{4.15}
\end{equation}
where both the gradient $\mathbf{g}_k = \nabla f(\mathbf{x}_k)$ and the Hessian matrix
$\mathbf{H}_k = \nabla^{2} f(\mathbf{x}_k)$ are evaluated at $\mathbf{x}_k$.
The derivative of (4.15) with respect to $\mathbf{x}$ is
\begin{equation}
  \nabla f(\mathbf{x}) \approx \mathbf{g}_k
  + \tfrac{1}{2}\left(\mathbf{H}_k + \mathbf{H}_k^{T}\right)(\mathbf{x}-\mathbf{x}_k).
  \tag{4.16}
\end{equation}
The Hessian matrix is always symmetric if the function is twice
continuously differentiable at every point, which is the case here. Hence,
(4.16) reduces to
\begin{equation}
  \nabla f(\mathbf{x}) \approx \mathbf{g}_k + \mathbf{H}_k(\mathbf{x}-\mathbf{x}_k).
  \tag{4.17}
\end{equation}
If we assume that $f$ takes its minimum at $\mathbf{x}_{k+1}$, the
gradient there is zero; i.e.
\begin{equation}
  \mathbf{H}_k(\mathbf{x}_{k+1}-\mathbf{x}_k) = -\mathbf{g}_k,
  \tag{4.18}
\end{equation}
which is nothing else but a linear system. The Newton-Raphson method uses the
solution $\mathbf{x}_{k+1}$ of this system as the next current point, resulting in the iterative formula
\begin{equation}
  \mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{H}_k^{-1}\mathbf{g}_k,
  \tag{4.19}
\end{equation}
where $\Delta\mathbf{x}_k = -\mathbf{H}_k^{-1}\mathbf{g}_k$ is often referred to as the Newton
direction.
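As a minimal illustration of the update in (4.19) and the linear system (4.18), the following Python/NumPy sketch performs one Newton-Raphson iteration; the names \texttt{newton\_step}, \texttt{grad} and \texttt{hess} are assumptions introduced for this example only. Note that the linear system is solved directly rather than forming the inverse of the Hessian, which is generally cheaper and numerically more stable.
\begin{verbatim}
import numpy as np

def newton_step(x_k, grad, hess):
    """One Newton-Raphson iteration: solve H_k dx = -g_k and return x_k + dx."""
    g_k = grad(x_k)                   # gradient evaluated at the current point
    H_k = hess(x_k)                   # Hessian evaluated at the current point
    dx = np.linalg.solve(H_k, -g_k)   # Newton direction from the linear system (4.18)
    return x_k + dx                   # next current point, equation (4.19)
\end{verbatim}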
If the approximation in (4.15) is valid, the method will converge in just a
few iterations. For example, if the function to be optimized, $f$, is an
$n$-dimensional quadratic function, it will converge in only one step from any starting
point.
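As a quick numerical check of this one-step property, the sketch below (which assumes the \texttt{newton\_step} routine above is in scope; the matrix $\mathbf{A}$ and vector $\mathbf{b}$ are arbitrary illustrative choices) applies a single step to the quadratic $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^{T}\mathbf{A}\mathbf{x} - \mathbf{b}^{T}\mathbf{x}$, whose minimizer is $\mathbf{A}^{-1}\mathbf{b}$.
\begin{verbatim}
import numpy as np

# Quadratic test function f(x) = 1/2 x^T A x - b^T x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

grad = lambda x: A @ x - b          # gradient of the quadratic
hess = lambda x: A                  # constant Hessian

x0 = np.array([10.0, -7.0])         # arbitrary starting point
x1 = newton_step(x0, grad, hess)    # one Newton-Raphson step

print(np.allclose(x1, np.linalg.solve(A, b)))   # True: minimum reached in one step
\end{verbatim}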
Even though the convergence may seem to be fast, judged by the number of iterations, each iteration does include the calculation of the second derivative and the handling of the Hessian. The performance of the method is therefore very dependent on certain qualities of the Hessian. One of these qualities is positive definiteness. For the method to converge towards a minimum, the Newton direction needs to be a direction of descent. Thus, we require
\[
  \mathbf{g}_k^{T}(\mathbf{x}_{k+1}-\mathbf{x}_k) < 0,
\]
which, by inserting (4.19), gives
\[
  \mathbf{g}_k^{T}\mathbf{H}_k^{-1}\mathbf{g}_k > 0.
\]
This inequality is satisfied at all points for which $\mathbf{g}_k \neq \mathbf{0}$ if
$\mathbf{H}_k$ is positive definite. The further $\mathbf{x}_k$
is from the solution, the worse the quadratic approximation in
(4.15) gets, which can result in the Hessian not being positive definite.
The method is then no longer guaranteed to proceed toward a minimum; it may end up at
any other critical point, whether it be a saddle point or a maximum point.
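One simple way to detect a Hessian that is not positive definite is to attempt a Cholesky factorization, which succeeds exactly when the matrix is positive definite; the helper below is a minimal sketch of this test (the name \texttt{is\_positive\_definite} is an illustrative assumption, not notation from the text).
\begin{verbatim}
import numpy as np

def is_positive_definite(H):
    """Return True if the symmetric matrix H is positive definite."""
    try:
        np.linalg.cholesky(H)   # succeeds only for positive definite matrices
        return True
    except np.linalg.LinAlgError:
        return False
\end{verbatim}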
There are several ways to make sure that the Hessian is positive definite, e.g. by
adding a large enough $\mu\mathbf{I}$, where $\mathbf{I}$
is the unit matrix and $\mu$
is a positive scalar, or by diagonalizing the matrix and adjusting its eigenvalues.
Descriptions of some of these methods can be found in textbooks like \cite{GMW,DS}.
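As an illustration of the first of these remedies, the following sketch (assuming the \texttt{is\_positive\_definite} helper above; the starting shift and growth factor are arbitrary choices) increases $\mu$ until $\mathbf{H} + \mu\mathbf{I}$ becomes positive definite.
\begin{verbatim}
import numpy as np

def shifted_hessian(H, mu0=1e-3, factor=10.0, max_tries=20):
    """Add mu*I with increasing mu until H + mu*I is positive definite."""
    if is_positive_definite(H):
        return H
    mu = mu0
    for _ in range(max_tries):
        H_shifted = H + mu * np.eye(H.shape[0])
        if is_positive_definite(H_shifted):
            return H_shifted
        mu *= factor                 # try a larger shift
    raise RuntimeError("could not make the Hessian positive definite")
\end{verbatim}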
The size of the Hessian can also be crucial to the effectiveness of the Newton-Raphson
method. For systems with a large number of dimensions, i.e. when the function
has a large number of variables, both the computation of the matrix
and the calculations that involve it will be very time-consuming. This can be mended by
either using only the diagonal terms of the Hessian, i.e. ignoring the cross terms, or
by not recalculating the Hessian at each iteration (which is possible because the second
derivative often varies slowly); a sketch of the former is given below.
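The diagonal-only variant replaces the linear solve in \texttt{newton\_step} with an elementwise division by the diagonal of the Hessian; the sketch below is again only illustrative and uses the same assumed \texttt{grad} and \texttt{hess} callables as before.
\begin{verbatim}
import numpy as np

def diagonal_newton_step(x_k, grad, hess):
    """Newton-like step using only the diagonal of the Hessian (cross terms ignored)."""
    g_k = grad(x_k)
    d_k = np.diag(hess(x_k))        # diagonal second derivatives only
    return x_k - g_k / d_k          # elementwise "solve" of the diagonal system
\end{verbatim}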
Another serious disadvantage of the Newton-Raphson method is that it is not necessarily globally convergent, meaning that it may not converge from an arbitrary starting point. Even when it does converge, it is not unusual that it expends significant computational effort in getting close enough to the solution for the approximation in (4.15) to become valid, after which the convergence is fast. A remedy for this problem is to adjust the step size taken in the Newton direction, ensuring that the function value decreases. Equation (4.19) then becomes
\[
  \mathbf{x}_{k+1} = \mathbf{x}_k - \lambda_k\,\mathbf{H}_k^{-1}\mathbf{g}_k,
  \qquad 0 < \lambda_k \le 1.
\]
There are several methods for doing this, for example the linear searches and
the fixed step size mentioned in Subsection 4.3.1, but it is recommended to use a
backtracking scheme \citep[e.g. by][]{Press}, where the full Newton step, $\lambda_k = 1$,
is tried first. If this step fails to satisfy the criterion for the decrease of the
function, one backtracks in a systematic way along the Newton direction. The advantage
of doing this is that one will still be able to enjoy the fast convergence of the
Newton-Raphson method close to the solution. For details on the backtracking scheme, see
for example \citep[page 377]{Press} and \citep[Section 6.3]{DS}.
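The sketch below illustrates such a damped Newton step with a conventional sufficient-decrease (Armijo) test; the constants \texttt{c} and \texttt{rho} are common illustrative choices and are not taken from the references cited above.
\begin{verbatim}
import numpy as np

def damped_newton_step(x_k, f, grad, hess, c=1e-4, rho=0.5, max_backtracks=30):
    """Newton step with backtracking: try the full step, then shrink lambda until
    the function value decreases sufficiently along the Newton direction."""
    g_k = grad(x_k)
    dx = np.linalg.solve(hess(x_k), -g_k)        # full Newton direction
    lam = 1.0                                    # full Newton step tried first
    f_k = f(x_k)
    for _ in range(max_backtracks):
        x_new = x_k + lam * dx
        if f(x_new) <= f_k + c * lam * (g_k @ dx):   # sufficient decrease criterion
            return x_new
        lam *= rho                               # backtrack along the Newton direction
    return x_k                                   # no acceptable step found
\end{verbatim}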
In spite of the problems mentioned above, the Newton-Raphson method enjoys a certain amount of popularity because of its fast convergence in a sufficiently small neighbourhood of the stationary value. The convergence is quadratic, which, loosely speaking, means that the number of significant digits doubles after each iteration. In comparison, the convergence of the Steepest Descent method is at best linear, and for the Conjugate Gradients method it is superlinear. The resulting algorithm for the Newton-Raphson method using backtracking is given in Algorithm 4.3.
As shown in Subsection 4.3.1, the method of Steepest Descent starts out with a rather rapid convergence, while Newton-Raphson has just the opposite quality: it starts out slowly and ends with very rapid convergence. It is too obvious to be ignored that these two methods can be combined, resulting in a very efficient method. One starts out with the Steepest Descent method, switches to the Newton-Raphson method when the progress of the former gets slow, and then enjoys the quadratic convergence of the latter; a rough sketch of such a switch is given below. An example of this type of combination is discussed next (and in Section 5.3: the Hybrid method).
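Purely to illustrate the idea of such a switch (and not the specific Hybrid method of Section 5.3), one could combine the routines sketched earlier as follows; the gradient-norm threshold used as the switching criterion and the fixed descent step are arbitrary assumptions.
\begin{verbatim}
import numpy as np

def hybrid_minimize(x0, f, grad, hess, switch_tol=1e-1, step=1e-2, max_iter=200):
    """Illustrative hybrid: crude fixed-step Steepest Descent far from the solution,
    switching to damped Newton-Raphson once the gradient becomes small."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < 1e-10:                  # converged
            break
        if np.linalg.norm(g) < switch_tol:
            x = damped_newton_step(x, f, grad, hess)   # fast quadratic convergence
        else:
            x = x - step * g                           # steepest descent phase
    return x
\end{verbatim}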