Comparing the iterative formula of the Steepest Descent (4.5) with
(4.22) shows that the Steepest Descent method is identical to the
Newton-Raphson method when the inverse Hessian is replaced by
$\alpha_k\mathbf{I}$, $\mathbf{I}$ being the unit matrix. This is
the basis for the Secant method, also known as the Variable Metric
method.
As seen in the previous subsection, the Hessian matrix needed in the Newton-Raphson method can be both laborious to calculate and to invert for systems with a large number of dimensions. The idea of the Secant method is not to use the Hessian matrix directly, but rather to start off the procedure with an approximation to the matrix which, as one gets closer to the solution, gradually approaches the Hessian.
The earliest Secant methods were formulated in terms of constructing a sequence of
matrices $\mathbf{A}_k$ which builds up the inverse Hessian; that is,
$$\lim_{k\to\infty} \mathbf{A}_k = \mathbf{H}^{-1}.$$
It is even better if the limit is achieved after $n$ iterations instead of after infinitely many. The iterative formula is then
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k\,\mathbf{A}_k\,\mathbf{g}_k, \tag{4.24}$$
where $\mathbf{g}_k$ is the gradient. Later, methods for constructing the
ordinary Hessian were presented. The inverse Hessian approximation might seem to offer
an advantage in terms of the number of arithmetic operations required to perform an
iteration, since the Hessian update still requires the matrix to be inverted. This seeming
defect was eliminated by modifying the Cholesky factors of the Hessian at each iteration
instead of updating the Hessian itself [GMW]. The
number of arithmetic operations is then the same for the two methods. Both methods are
in use today, the choice being a matter of convenience and taste. The inverse
Hessian update will be used in this thesis, and therefore only this method will be
presented in the following. For details on the Hessian update, see [GMW, DS].
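As a minimal sketch of a single iteration of (4.24), assuming NumPy arrays; the names secant_step, x, A, and grad_f are illustrative and do not appear in the thesis:

```python
import numpy as np

def secant_step(x, A, grad_f, alpha=1.0):
    """One step of x_{k+1} = x_k - alpha_k * A_k * g_k, cf. eq. (4.24)."""
    g = grad_f(x)            # gradient at the current point
    return x - alpha * A @ g # move along the approximate Newton direction
```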
For an $n$-dimensional quadratic function, the inverse Hessian will be obtained after at most $n$ iterations, and the solution will then be located, from what we saw in Subsection 4.3.3, in a single iteration. Hence, the solution is reached after at most $n$ iterations. In other words, this method resembles the method of Steepest Descent near the starting point of the procedure, while near the optimal point it metamorphoses into the Newton-Raphson method.
The Secant method is in several senses quite similar to the Conjugate Gradients method. They both
converge after $n$ iterations for an $n$-dimensional quadratic function, they both
accumulate information from successive iterations, and they both require the
calculation of only the first derivative. (The latter makes the Secant method part of
category (2) from page .) The main difference between the two methods
is the way they store and update the information: the Secant method requires an
$n \times n$ matrix while the Conjugate Gradients method only needs an $n$-dimensional
vector. Usually, the storage of the matrix in the Secant method is no disadvantage, at
least not for a moderate number of variables.
So how are the matrices $\mathbf{A}_k$ updated at each iteration? There are two main schemes
[GMW], namely Davidon-Fletcher-Powell (DFP) and Broyden-Fletcher-Goldfarb-Shanno
(BFGS). Let us first consider the DFP algorithm. At a point
$\mathbf{x}_k$, the approximated inverse Hessian at the subsequent point
$\mathbf{x}_{k+1}$ is given by [Press]
$$\mathbf{A}_{k+1} = \mathbf{A}_k
+ \frac{\boldsymbol{\delta}_k\boldsymbol{\delta}_k^{T}}{\boldsymbol{\delta}_k^{T}\boldsymbol{\gamma}_k}
- \frac{(\mathbf{A}_k\boldsymbol{\gamma}_k)(\mathbf{A}_k\boldsymbol{\gamma}_k)^{T}}{\boldsymbol{\gamma}_k^{T}\mathbf{A}_k\boldsymbol{\gamma}_k}, \tag{4.25}$$
where $\boldsymbol{\delta}_k = \mathbf{x}_{k+1} - \mathbf{x}_k$ and
$\boldsymbol{\gamma}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$,
$\mathbf{g}$ being the gradient. This expression
is constructed by forcing the
$\boldsymbol{\delta}_k$-s to be mutually
$\mathbf{H}$-conjugate. Even
though the deduction of equation (4.25) assumes a quadratic function,
it essentially applies to all kinds of functions and systems. Another property of
(4.25) is that if
$\mathbf{A}_0$ is positive definite, it will remain so for
each $k$; i.e.
$\{\mathbf{A}_k\}$ is a sequence of positive definite matrices. For proofs and a
detailed deduction of (4.25), see [Walsh].
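A minimal sketch of the DFP update (4.25) in NumPy, assuming that the step $\boldsymbol{\delta}_k$ and the gradient difference $\boldsymbol{\gamma}_k$ are available as vectors; the function name dfp_update is a hypothetical choice for illustration:

```python
import numpy as np

def dfp_update(A, delta, gamma):
    """DFP update of the inverse Hessian approximation, cf. eq. (4.25)."""
    Ag = A @ gamma
    return (A
            + np.outer(delta, delta) / (delta @ gamma)  # rank-one term built from the step
            - np.outer(Ag, Ag) / (gamma @ Ag))           # removes the old curvature along gamma
```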
The BFGS algorithm was later introduced to improve some undesirable features of the DFP formula, concerning, for example, roundoff error and convergence tolerance. The BFGS updating formula is exactly the same, apart from one additional term [Press]:
$$\mathbf{A}_{k+1} = \mathbf{A}_k
+ \frac{\boldsymbol{\delta}_k\boldsymbol{\delta}_k^{T}}{\boldsymbol{\delta}_k^{T}\boldsymbol{\gamma}_k}
- \frac{(\mathbf{A}_k\boldsymbol{\gamma}_k)(\mathbf{A}_k\boldsymbol{\gamma}_k)^{T}}{\boldsymbol{\gamma}_k^{T}\mathbf{A}_k\boldsymbol{\gamma}_k}
+ \left(\boldsymbol{\gamma}_k^{T}\mathbf{A}_k\boldsymbol{\gamma}_k\right)\mathbf{u}_k\mathbf{u}_k^{T},$$
where $\mathbf{u}_k$ is defined by
$$\mathbf{u}_k = \frac{\boldsymbol{\delta}_k}{\boldsymbol{\delta}_k^{T}\boldsymbol{\gamma}_k}
- \frac{\mathbf{A}_k\boldsymbol{\gamma}_k}{\boldsymbol{\gamma}_k^{T}\mathbf{A}_k\boldsymbol{\gamma}_k}.$$
The BFGS formula is usually to be preferred [GMW].
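A corresponding sketch of the BFGS update in the same notation, with the vector $\mathbf{u}_k$ supplying the additional term; the function name bfgs_update is again hypothetical:

```python
import numpy as np

def bfgs_update(A, delta, gamma):
    """BFGS update of the inverse Hessian: the DFP formula plus one extra term."""
    Ag = A @ gamma
    dg = delta @ gamma            # delta^T gamma
    gAg = gamma @ Ag              # gamma^T A gamma
    u = delta / dg - Ag / gAg     # the vector defining the additional term
    return (A
            + np.outer(delta, delta) / dg
            - np.outer(Ag, Ag) / gAg
            + gAg * np.outer(u, u))  # the extra BFGS term
```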
The step length $\alpha_k$ in equation (4.24) is determined as for the
Newton-Raphson method; i.e. either by line search, by backtracking, or by choosing an
initial value for $\alpha_k$ and gradually changing it to unity as the solution is being
approached. The algorithm for the BFGS update with backtracking is given in Algorithm
4.4.
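The following is a rough sketch of such an iteration, not Algorithm 4.4 itself: a BFGS loop with a simple backtracking rule, using the bfgs_update function sketched above. The function name bfgs_minimize, the tolerances, and the backtracking constants are illustrative assumptions.

```python
import numpy as np

def bfgs_minimize(f, grad_f, x0, tol=1e-8, max_iter=100):
    """Secant (quasi-Newton) minimization with BFGS update and backtracking."""
    x = np.asarray(x0, dtype=float)
    A = np.eye(x.size)                 # start from the unit matrix, cf. Steepest Descent
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -A @ g                     # search direction from eq. (4.24)
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5               # backtrack until sufficient decrease
        x_new = x + alpha * d
        g_new = grad_f(x_new)
        delta, gamma = x_new - x, g_new - g
        if delta @ gamma > 0:          # skip updates that would spoil positive definiteness
            A = bfgs_update(A, delta, gamma)  # bfgs_update as sketched above
        x, g = x_new, g_new
    return x
```

Applied to an $n$-dimensional quadratic $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^{T}\mathbf{H}\mathbf{x} - \mathbf{b}^{T}\mathbf{x}$, such a sketch locates the minimum $\mathbf{H}^{-1}\mathbf{b}$ after a handful of iterations, in line with the $n$-iteration property discussed above.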
The Secant method has become very popular for optimization: it converges fast (at best superlinearly), it is stable, and it requires relatively modest computing time at each iteration. The method is often preferred over the Conjugate Gradients method, but the latter has an advantage for systems with a large number of dimensions (e.g. 10,000). Since both the DFP and the BFGS updating formulas are guaranteed to preserve the positive definiteness of the initial approximation of the inverse Hessian (assuming that there are no roundoff errors resulting in loss of positive definiteness), the Secant method will only be able to locate minimum points.