

XIII. Inner Construction of the Approximate Subdifferential: Methods of ε-Descent

Prerequisites. Basic concepts of numerical optimization (Chap. II); bundling mechanism for minimizing convex functions (Chap. IX, essential); definition and elementary properties of approximate subdifferentials and difference quotients (Chap. XI, especially §XI.2.2).

Introduction. In this chapter, we study a first minimization algorithm in detail, including its numerical implementation. It is able to minimize rather general convex functions, without any smoothness assumptions - in contrast to the algorithms of Chap. II. On the other hand, it is fully implementable, which was not the case of the bundling algorithm presented in Chap. IX.

We do not study this algorithm for its practical value: in our opinion, it is not suitable for "real life" problems. However, it is the one that comes first to mind when starting from the ideas developed in the previous chapters. Furthermore, it provides a good introduction to the (more useful) methods to come in the next chapters: rather than a dead end, it is a point of departure, comparable to the steepest-descent method for minimizing smooth functions, which is bad but basic. Finally, it can be looked at in a context different from optimization proper, namely that of separating closed convex sets.

The situation is the same as in Chap. IX:

- $f : \mathbb{R}^n \to \mathbb{R}$ is a (finite-valued) convex function to be minimized;

- $f(x)$ and $s(x) \in \partial f(x)$ are computed when necessary by a black box (U1), as in §II.1.2;

- the norm $|||\cdot|||$ used to compute steepest-descent directions and projections (see §VIII.1 and §IX.1) is the Euclidean norm: $|||\cdot||| = \|\cdot\| = \sqrt{\langle\cdot,\cdot\rangle}$.

In other words, we want to minimize $f$ even though the information available is fairly poor: one subgradient at each point, instead of the full subdifferential as in Chap. VIII. Recall that the notation $s(x)$ is misleading but can be used in practice; this was explained in Remark VIII.3.5.1.

1 Introduction. Identifying the Approximate Subdifferential

1.1 The Problem and Its Solution

In Chap. IX, we have studied a mechanism for constructing the subdifferential of $f$ at a given $x$, or at least to resolve the following alternatives:

- If there exists a hyperplane separating $\partial f(x)$ and $\{0\}$, i.e. a descent direction, i.e. a $d$ with $f'(x, d) < 0$, find one.

J.-B. Hiriart-Urruty et al., Convex Analysis and Minimization Algorithms II, © Springer-Verlag Berlin Heidelberg 1993


- Or, if there does not exist any such direction, explain why, i.e. find a subgradient which is (approximately) $0$ - the difficulty being that the black box (U1) is not supposed to ever answer $s(x) = 0$.

This mechanism consisted in collecting information about $\partial f(x)$ in a bundle, and worked as follows (see Algorithm IX.1.6 for example):

- Given a compact convex polyhedron $S$ under-estimating $\partial f(x)$, i.e. with $S \subset \partial f(x)$, one computed a hyperplane separating $S$ and $\{0\}$. This hyperplane was essentially defined by its normal vector $d$, interpreted as a direction, and one actually computed the best such hyperplane, namely the projection of $0$ onto $S$.

- Then one made a line-search along this d, with two possible exits:

(a) The hyperplane actually separated $\partial f(x)$ and $\{0\}$, in which case the process was terminated. The line-search was successful and $f$ could be improved in the direction $d$, to obtain the next iterate, say $x^+$.

(b) Or the hyperplane did not separate $\partial f(x)$ and $\{0\}$, in which case the line-search produced a new subgradient, say $s^+ \in \partial f(x)$, to improve the current $S$ - the line-search was unsuccessful and one looped to redo it along a new direction, issuing from the same $x$, but obtained from the better $S$.

We explained that this mechanism could be grafted onto each iteration of a descent scheme. It would thus allow the construction of descent directions without computing the full subdifferential explicitly, a definite advantage over the steepest-descent method of Chap. VIII. A fairly general algorithm would then be obtained, which could for example minimize dual functions associated with abstract optimization problems; such an algorithm would thus be directly comparable to those of §XII.4.

Difficulty 1.1.1 If we do such a grafting, however, two difficulties appear, which were also mentioned in Remark IX.1.7:

(i) According to the whole idea of the process, the descent direction found in case (a) will be "at best" the steepest-descent direction. As a result, the sequence of iterates will very probably not be a minimizing sequence: this is the message of §VIII.2.2.

(ii) Finding the subgradient $s^+$ in case (b) requires performing an endless line-search: the stepsize must tend to $0$ so as to compute the directional derivative $f'(x, d)$ and to obtain $s^+$ by virtue of Lemma VI.6.3.4. Unless $f$ has some very specific properties (such as being piecewise affine), this cannot be implemented.

In other words, the grafting will result in a non-convergent and non-implementable algorithm. □

Yet the idea is good if a simple precaution is taken: rather than the purely local set $\partial f(x)$, it is $\partial_\varepsilon f(x)$ that we must identify. It gathers differential information from a specific, "finitely small", neighborhood of $x$ - called $V_\varepsilon(x)$ in Theorem XI.4.2.5 - instead of a neighborhood shrinking to $\{x\}$, as is the case for $\partial f(x)$ - see Theorem VI.6.3.1. Accordingly, we can guess that this will fix (i) and (ii) above. In fact:


(i$_\varepsilon$) A direction $d$ such that $f_\varepsilon'(x, d) < 0$ is downhill not only at $x$ but at any point in $V_\varepsilon(x)$, so a line-search along such a direction is able to drive the next iterate out of $V_\varepsilon(x)$, and the difficulty 1.1.1(i) will be eliminated.

(ii$_\varepsilon$) A subgradient at a point of the form $x + td$ is in $\partial_\varepsilon f(x)$ for $t$ small enough, more precisely whenever $x + td \in V_\varepsilon(x)$; so it is no longer necessary to force the stepsize in (ii) to tend to $0$.

Definition 1.1.2 A nonzero $d \in \mathbb{R}^n$ is said to be a direction of ε-descent for $f$ at $x$ if $f_\varepsilon'(x, d) < 0$; in other words, $d$ defines a hyperplane separating $\partial_\varepsilon f(x)$ and $\{0\}$.

A point $x \in \mathbb{R}^n$ is said to be ε-minimal if there is no such separating $d$, i.e. $f_\varepsilon'(x, d) \geq 0$ for all $d$, i.e. $0 \in \partial_\varepsilon f(x)$. □

The reason for this terminology is straightforward:

Proposition 1.1.3 A direction $d$ is of ε-descent if and only if

$$f(x + td) < f(x) - \varepsilon \quad \text{for some } t > 0.$$

A point $x$ is ε-minimal if and only if it minimizes $f$ within $\varepsilon$, i.e.

$$f(y) \geq f(x) - \varepsilon \quad \text{for all } y \in \mathbb{R}^n.$$

PROOF. Use for example Theorem XI.2.1.1: for given $d \neq 0$, let $\eta := f_\varepsilon'(x, d)$. If $\eta < 0$, we can find $t > 0$ such that

$$f(x + td) - f(x) + \varepsilon \leq \tfrac{1}{2} t\eta < 0.$$

Conversely, $\eta \geq 0$ means that

$$f(x + td) - f(x) + \varepsilon \geq t\eta \geq 0 \quad \text{for all } t > 0. \qquad \Box$$

Then consider the following algorithmic scheme, of "ε-descent". The descent iterations are denoted here by a superscript $p$; the subscript $k$ that was used in Chap. VIII will be reserved for the bundling iterations to come later.

Algorithm 1.1.4 (Conceptual Algorithm of ε-Descent) Start from some $x^1 \in \mathbb{R}^n$. Choose $\varepsilon > 0$. Set $p = 1$.

STEP 1. If $0 \in \partial_\varepsilon f(x^p)$ stop. Otherwise compute a direction of ε-descent, say $d^p$.

STEP 2. Make a line-search along $d^p$ to obtain a stepsize $t^p > 0$ such that

$$f(x^p + t^p d^p) < f(x^p) - \varepsilon.$$

Set $x^{p+1} := x^p + t^p d^p$. Replace $p$ by $p + 1$ and loop to Step 1. □
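To fix ideas, here is a minimal sketch of this conceptual scheme in code. It is ours, not the book's: the callable `eps_direction`, returning an ε-descent direction or `None` when $0 \in \partial_\varepsilon f(x)$, is a hypothetical oracle - constructing it is precisely the subject of the rest of the chapter - and a crude grid scan stands in for the line-search of §2.

```python
import numpy as np

def conceptual_eps_descent(f, eps_direction, x1, eps, max_iter=1000):
    """Sketch of Algorithm 1.1.4 (oracle `eps_direction` is hypothetical)."""
    x = np.asarray(x1, dtype=float)
    for _ in range(max_iter):
        d = eps_direction(x)                 # Step 1: None signals eps-minimality
        if d is None:
            return x
        # Step 2: find t with f(x + t d) < f(x) - eps; such a t exists by
        # Proposition 1.1.3, here located by a crude geometric scan
        for t in np.geomspace(1e-8, 1e8, 321):
            if f(x + t * d) < f(x) - eps:
                break
        else:
            raise RuntimeError("no eps-descent step found along d")
        x = x + t * d                        # each loop decreases f by more than eps
    return x
```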

If this scheme is implementable at all, it will be an interesting alternative to the steepest-descent scheme, clearly eliminating the difficulty 1.1.1 (i).

Proposition 1.1.5 In Algorithm 1.1.4, either $f(x^p) \to -\infty$, or the stop of Step 1 occurs for some finite iteration index $p^*$, at which $x^{p^*}$ is ε-minimal.


PROOF. Fairly obvious: one has by construction

$$f(x^p) < f(x^1) - (p - 1)\varepsilon \quad \text{for } p = 1, 2, \ldots$$

Therefore, if $f$ is bounded from below, say by $\bar f := \inf_x f(x)$, the number $p^*$ of iterations cannot be larger than $[f(x^1) - \bar f]/\varepsilon$, after which the stopping criterion of Step 1 plays its role. □
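As a quick numerical illustration of this bound (the figures are ours, purely for orientation):

$$f(x^1) - \bar f = 10, \qquad \varepsilon = 0.1 \qquad \Longrightarrow \qquad p^* \leq \frac{f(x^1) - \bar f}{\varepsilon} = 100,$$

so at most a hundred descent steps can occur, each decreasing $f$ by more than $0.1$.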

Remark 1.1.6 This is just the basic scheme, for which several variants can easily be imagined. For example, $\varepsilon = \varepsilon^p$ can be chosen at each iteration. A stop at iteration $p^*$ will then mean that $x^{p^*}$ is $\varepsilon^{p^*}$-minimal. Various conditions will ensure that this stop does eventually occur; for example

$$\sum_{p=1}^{p^*} \varepsilon^p \to +\infty \quad \text{when } p^* \to \infty. \tag{1.1.1}$$

It is not the first time that we encounter this idea of taking a divergent series: the subgradient algorithm of §XII.4.1 also used one for its stepsizes. It is appropriate to recall Remark XII.4.1.5: here again, a condition like (1.1.1) makes little sense in practice. This is particularly visible with $p^*$ appearing explicitly in the summation: when choosing $\varepsilon^p$, we do not know if the present $p$th iteration is going to be the last one. In the present context, we just mention one simple "online" rule, more reasonable: take a fixed $\varepsilon$ but, when $0 \in \partial_\varepsilon f(x^p)$, diminish $\varepsilon$, say divide it by 2, until it reaches a final acceptable threshold. It is straightforward to extend the proof of Proposition 1.1.5 to such a strategy.

A general convergence theory of these schemes with varying ε's is rather trivial and of little interest; we will not insist on this aspect here. The real issue would be of a practical nature, consisting of a study of the values of $\varepsilon$ ($= \varepsilon^p$), and/or of the norming $|||\cdot|||$, to reach maximal efficiency - whatever this means! □

Remark 1.1.7 To solve a numerical problem (for example of optimization) one normally constructs a sequence - say $\{x^p\}$, usually infinite - for which one proves some desired asymptotic property: for example that any, or some, cluster point of $\{x^p\}$ solves the initial problem - recall Chap. II, and more particularly (II.1.1.8). In this sense, the statement of Proposition 1.1.5 appears as rather non-classical: we construct a finite sequence $\{x^p\}$, and then we establish how close the last iterate comes to optimality. This point of view will be systematically adopted here. It makes convergence results easier to prove in our context, and furthermore we believe that it better reflects what actually happens on the computer.

Observe the double role played by $\varepsilon$ at each iteration of Algorithm 1.1.4: an "active" role to compute $d^p$, and a "passive" role as a stopping criterion. We should really have two different epsilons: one, possibly variable, used to compute the direction; and the other to stop the algorithm. Such considerations will be of importance for the next chapter. □


Having eliminated the difficulty 1.1.1(i), we must now check that 1.1.1(ii) is eliminated as well; this is not quite trivial and motivates the present chapter. We therefore study one single iteration of Algorithm 1.1.4, i.e. one execution of Step 1, regardless of any sophisticated choice of $\varepsilon$. In other words, we are interested in the static aspect of the algorithm, in which the current iterate $x^p$ is considered as fixed.

Dropping the index $p$ from our notation, we are given fixed $x \in \mathbb{R}^n$ and $\varepsilon > 0$, and we apply the mechanism of Chap. IX to construct compact convex polyhedra in $\partial_\varepsilon f(x)$.

If we look at Step 1 of Algorithm 1.1.4 in the light of Algorithm IX.1.6, we see that it is going to be an iterative subprocess which works as follows; we use the subscript $k$ as in Chap. IX.

Process 1.1.8 The black box (U1) has computed a first ε-subgradient $s_1$, and a first polyhedron $S_1$ is on hand:

$$s_1 := s(x) \in \partial f(x) \subset \partial_\varepsilon f(x) \quad \text{and} \quad S_1 := \{s_1\}.$$

- At stage $k$, having the current polyhedron $S_k \subset \partial_\varepsilon f(x)$, compute the best hyperplane separating $S_k$ and $\{0\}$, i.e. compute

$$d_k := -\operatorname{Proj}(0 \mid S_k)$$

(recall that Proj denotes the Euclidean projection onto the closed convex hull of a set).

- Then determine whether

(a$_\varepsilon$) $d_k$ separates not only $S_k$ but even $\partial_\varepsilon f(x)$ from $\{0\}$; then the work is finished, $\partial_\varepsilon f(x)$ is properly identified, $d_k$ is an ε-descent direction and we can pass to the next iteration in Algorithm 1.1.4;

or

(b$_\varepsilon$) $d_k$ does not separate $\partial_\varepsilon f(x)$ from $\{0\}$; then enlarge $S_k$ with a new $s_{k+1} \in \partial_\varepsilon f(x)$ and loop to the next $k$. □

Of course, the above alternatives (a$_\varepsilon$)-(b$_\varepsilon$) play the same role as (a)-(b) of §1.1. In Chapter IX, (a)-(b) were resolved by a line-search minimizing $f$ along $d_k$. In case (b), the optimal stepsize was $0$, the event $f(x + t d_k) < f(x)$ was impossible to obtain, and stepsizes $t \downarrow 0$ were produced. The new element $s_{k+1}$, to be appended to $S_k$, was then a by-product of this line-search, namely a corresponding cluster point of $\{s(x + t d_k)\}_{t \downarrow 0}$.

1.2 The Line-Search Function

Here, in order to resolve the alternatives (a$_\varepsilon$)-(b$_\varepsilon$), and detect whether $d_k$ is an ε-descent direction (instead of a mere descent direction), we can again minimize $f$ along $d_k$ and see whether $\varepsilon$ can thus be dropped from $f$. This is no good, however: in case of failure we will not obtain a suitable $s_{k+1}$.


Instead of simply applying Proposition 1.1.3 and checking whether

$$f(x + t d_k) < f(x) - \varepsilon \quad \text{for some } t > 0, \tag{1.2.1}$$

a much better idea is to use Definition 1.1.2 itself: compute the support function $f_\varepsilon'(x, d_k)$ and check whether it is negative. From Theorem XI.2.1.1, this means minimizing the perturbed difference quotient, a problem that the present §1.2 is devoted to. Clearly enough, the alternatives (a$_\varepsilon$)-(b$_\varepsilon$) will thus be resolved; but in case of failure, when (b$_\varepsilon$) holds, we will see below that a by-product of this minimization is an $s_{k+1}$ suitable for our enlargement problem.

Naturally, the material in this section takes a lot from §XI.2. In the sequel, we drop the index $k$: given $x \in \mathbb{R}^n$, $d \neq 0$ and $\varepsilon > 0$, the perturbed difference quotient is as in §XI.2.2:

$$q_\varepsilon(t) := \begin{cases} +\infty & \text{if } t \leq 0, \\[4pt] \dfrac{f(x + td) - f(x) + \varepsilon}{t} & \text{if } t > 0, \end{cases} \tag{1.2.2}$$

and we are interested in

$$\inf_{t > 0}\, q_\varepsilon(t). \tag{1.2.3}$$

Remark 1.2.1 The above problem (1.2.3) looks like a line-search, just as in Chap. II: after all, it amounts to finding a stepsize along the direction $d$. We should mention, however, that such an interpretation is slightly misleading, as the motivation is substantially different. First, the present "line-search" is aimed at diminishing the perturbed difference quotient, rather than the objective function itself.

More importantly, as we will see in §2.2 below, $q_\varepsilon$ must be minimized rather accurately. By contrast, we have insisted long enough in §II.3 to make it clear that a line-search had little to do with one-dimensional minimization: its role was rather to find a "reasonable" stepsize along the given direction, "reasonable" being understood in terms of the objective function.

We just note here that the present "line-search" will certainly not produce a zero stepsize, since $q_\varepsilon(t) \to +\infty$ when $t \downarrow 0$ (cf. Proposition XI.2.2.2). This confirms our observation (ii$_\varepsilon$) in §1.1. □

Minimizing $q_\varepsilon$ is a (hidden) convex problem, via the change of variable $u = 1/t$. We recall the following facts from §XI.2.2: the function

$$u \mapsto r_\varepsilon(u) := \begin{cases} q_\varepsilon\!\left(\frac{1}{u}\right) = u\left[f\!\left(x + \frac{1}{u}d\right) - f(x) + \varepsilon\right] & \text{if } u > 0, \\[4pt] f_\infty'(d) = \lim_{t \to \infty} q_\varepsilon(t) & \text{if } u = 0, \\[4pt] +\infty & \text{if } u < 0 \end{cases} \tag{1.2.4}$$

is in $\operatorname{Conv}\mathbb{R}$. Its subdifferential at $u > 0$ is

$$\partial r_\varepsilon(u) = \left\{ \varepsilon - e\!\left(x, x + \tfrac{1}{u}d, s\right) : s \in \partial f\!\left(x + \tfrac{1}{u}d\right) \right\};$$

here,

$$e(x, y, s) := f(x) - [f(y) + \langle s, x - y \rangle] \tag{1.2.5}$$


is the linearization error made at $x$ when $f$ is linearized at $y$ with slope $s$ (see the transportation formula of Proposition XI.4.2.2). The function $r_\varepsilon$ is minimized over a nonempty compact interval $U_\varepsilon = [\underline{u}_\varepsilon, \bar{u}_\varepsilon] \subset [0, +\infty[$, and its positive minima are characterized by the property

$$e\!\left(x, x + \tfrac{1}{u}d, s\right) = \varepsilon \quad \text{for some } s \in \partial f\!\left(x + \tfrac{1}{u}d\right).$$

Since we live in the geometrical world of positive stepsizes, we find it convenient to translate these results into the $t$-language. They allow the following supplement to the classification at the end of §XI.2.2:

Lemma 1.2.2 The set

$$T_\varepsilon := \left\{ t = \tfrac{1}{u} : u \in U_\varepsilon \text{ and } u > 0 \right\}$$

of minima of $q_\varepsilon$ is a closed interval (possibly empty, possibly not bounded from above), which does not contain $0$. Denoting by $\underline{t}_\varepsilon := 1/\bar{u}_\varepsilon$ and $\bar{t}_\varepsilon := 1/\underline{u}_\varepsilon$ the endpoints of $T_\varepsilon$ (the convention $1/0 = +\infty$ is used), there holds for any $t > 0$:

(i) $t \in T_\varepsilon \iff e(x, x + td, s) = \varepsilon$ for some $s \in \partial f(x + td)$;
(ii) $t < \underline{t}_\varepsilon \iff e(x, x + td, s) < \varepsilon$ for all $s \in \partial f(x + td)$;
(iii) $t > \bar{t}_\varepsilon \iff e(x, x + td, s) > \varepsilon$ for all $s \in \partial f(x + td)$.

Of course, (iii) is pointless if $\underline{u}_\varepsilon = 0$.

PROOF. All this comes from the change of variable $t = 1/u$ in the convex function $r_\varepsilon$ (remember that it is minimized "at finite distance"). The best way to "see" the proof is probably to draw a picture, for example Figs. 1.2.1 and 1.2.2. □

For our present concern of minimizing $q_\varepsilon$, the two cases illustrated respectively by Fig. 1.2.1 ($\bar{u}_\varepsilon > 0$, $T_\varepsilon$ nonempty) and Fig. 1.2.2 ($\bar{u}_\varepsilon = \underline{u}_\varepsilon = 0$, $T_\varepsilon$ empty) are rather different. For each stepsize $t > 0$, take $s \in \partial f(x + td)$. When $t$ increases,

- the inverse stepsize $u = 1/t$ decreases;
- being the slope of the convex function $r_\varepsilon$, the number $\varepsilon - e(x, x + td, s)$ decreases (not continuously);
- therefore $e(x, x + td, s)$ increases.

Altogether, $e(x, x + td, s)$ starts from $0$ and increases with $t > 0$; a discontinuity occurs at each $t$ such that $\partial f(x + td)$ has a nonzero breadth along $d$. The difference between the two cases in Figs. 1.2.1 and 1.2.2 is whether or not $e(x, x + td, s)$ crosses the value $\varepsilon$; this property conditions the non-vacuousness of $T_\varepsilon$. The sketch below illustrates this jump behaviour on a simple piecewise affine example.
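Here is the announced toy illustration (the function and data are ours): with $f(x) = |x|$, $x = 1$ and $d = -1$, the error $e(t)$ is $0$ before the kink of $f$ is crossed at $t = 1$ and jumps to $2$ after it.

```python
import numpy as np

x, d = 1.0, -1.0
f = abs
s = lambda y: np.sign(y) if y != 0 else 1.0    # one particular subgradient

def e(t):                                      # linearization error (2.1.1)
    y = x + t * d
    return f(x) - f(y) + t * s(y) * d

for t in [0.25, 0.5, 0.99, 1.01, 2.0, 10.0]:
    print(f"t = {t:5.2f}   e(t) = {e(t):4.1f}")   # 0 below t = 1, then 2
# For eps in ]0, 2[ the level eps is crossed at t = 1, so T_eps = {1}
# (the Fig. 1.2.1 case); for eps > 2 it is never crossed and T_eps is
# empty (the Fig. 1.2.2 case).
```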

Lemma 1.2.2 is of course crucial for a one-dimensional search aimed at minimizing $q_\varepsilon$. First, it shows that $q_\varepsilon$ is mildly nonconvex (the technical term is quasi-convex). Also, the number $\varepsilon - e$, calculated at a given $t$, contains all the essential information of a "derivative". If it happens to be $0$, $t$ is optimal; otherwise, it indicates whether $q_\varepsilon$ is increasing or decreasing near $t$.

Fig. 1.2.1. A "normal" approximate difference quotient

Fig. 1.2.2. An approximate difference quotient with no minimum (its infimum is the dotted level $f_\varepsilon'(x, d)$)

We must now see why it is more interesting to solve (1.2.3) than (1.2.1). Concerning the test for appropriate descent, minimizing $q_\varepsilon$ does some good in terms of diminishing $f$:

Lemma 1.2.3 Suppose that $d$ is a direction of ε-descent: $f_\varepsilon'(x, d) < 0$. Then

$$f(x + td) < f(x) - \varepsilon$$

for $t > 0$ close enough to $T_\varepsilon$ (i.e. $t$ large enough when $T_\varepsilon = \emptyset$).

PROOF. Under the stated assumption, $q_\varepsilon(t)$ has the sign of its infimum, namely "$-$". □

This fixes the (a$_\varepsilon$)-problem in Process 1.1.8. As for the new $s^+$, necessary in the (b$_\varepsilon$)-case, we remember that it must enjoy two properties:

- It must be an ε-subgradient at $x$. By virtue of the transportation formula (XI.4.2.2), this requires precisely that $q_\varepsilon$ be locally decreasing: we have

$$\partial f(x + td) \ni s \in \partial_\varepsilon f(x) \quad \text{if and only if} \quad e(x, x + td, s) \leq \varepsilon. \tag{1.2.6}$$

- It must separate the current $S$ from $\{0\}$ "sufficiently well", which means it must make $\langle s^+, d \rangle$ large enough; see for example Remark IX.2.1.3.

Then the next result will be useful.

Lemma 1.2.4 Suppose that $d$ is not a direction of ε-descent: $f_\varepsilon'(x, d) \geq 0$. Then:

(j) If $t \in T_\varepsilon$, there is some $s \in \partial f(x + td)$ such that

$$\langle s, d \rangle \geq 0 \quad \text{and} \quad s \in \partial_\varepsilon f(x).$$

(jj) Suppose $T_\varepsilon = \emptyset$. Then

$$\partial f(x + td) \subset \partial_\varepsilon f(x) \quad \text{for all } t \geq 0,$$

and for all $\eta > 0$, there exists $M > 0$ such that

$$t \geq M, \ s \in \partial f(x + td) \implies \langle s, d \rangle \geq -\eta.$$

PROOF. [(j)] Combine Lemma 1.2.2(i) with (1.2.6) to obtain $s \in \partial f(x + td) \cap \partial_\varepsilon f(x)$ such that

$$f(x) = f(x + td) - t\langle s, d \rangle + \varepsilon;$$

since $t > 0$, the property $f(x) \leq f(x + td) + \varepsilon$ implies $\langle s, d \rangle \geq 0$.

[(jj)] In case (jj), every $t > 0$ comes under case (ii) of Lemma 1.2.2, and any $s(t) \in \partial f(x + td)$ is in $\partial_\varepsilon f(x)$ by virtue of (1.2.6). Also, since $d$ is not a direction of ε-descent,

$$f(x + td) \geq f(x) - \varepsilon \geq f(x + td) - t\langle s(t), d \rangle - \varepsilon \quad [\text{because } s(t) \in \partial f(x + td)].$$

Thus, $\langle s(t), d \rangle + \varepsilon/t \geq 0$, and $\liminf_{t \to +\infty} \langle s(t), d \rangle \geq 0$. □

Remark 1.2.5 Compare with §XI.4.2: for all $t$ smaller than the largest element $t_\varepsilon(d)$ of $T_\varepsilon$, $\partial f(x + td) \subset \partial_\varepsilon f(x)$ because $x + td \in V_\varepsilon(x)$. For $t = t_\varepsilon(d)$, one can only ascertain that some subgradient at $x + td$ is an ε-subgradient at $x$ and can fruitfully enlarge the current polyhedron $S$. For $t > t_\varepsilon(d)$, $\partial f(x + td) \cap \partial_\varepsilon f(x) = \emptyset$ because $x + td \notin V_\varepsilon(x)$. □

In summary, when $t$ minimizes $q_\varepsilon$ (including the case "$t$ large enough" if $T_\varepsilon = \emptyset$), it is theoretically possible to find in $\partial f(x + td)$ the necessary subgradient $s^+$ to improve the current approximation of $\partial_\varepsilon f(x)$. We will see in §2 that it is not really possible to find $s^+$ in practice; rather, it is a convenient approximation of it which will be obtained during the process of minimizing $q_\varepsilon$.

1.3 The Schematic Algorithm

We begin to see how Algorithm 1.1.4 will actually work: Step 1 will be a projection onto a convex polyhedron, followed by a minimization of the convex function $r_\varepsilon$. In anticipation of §2, we restate Algorithm 1.1.4, with a somewhat more detailed Step 1. Also, we incorporate a criterion to stop the algorithm without waiting for the (unlikely) event "$0 \in S_k$".

Algorithm 1.3.1 (Schematic Algorithm of ε-Descent) Start from some $x^1 \in \mathbb{R}^n$. Choose the descent criterion $\varepsilon > 0$ and the convergence parameter $\delta > 0$. Set $p = 1$.

STEP 1.0 (starting to find a direction of ε-descent). Compute $s(x^p) \in \partial f(x^p)$. Set $s_1 = s(x^p)$, $k = 1$.

STEP 1.1 (computing the trial direction). Solve the minimization problem in $\alpha$

$$\min\Big\{ \big\|\textstyle\sum_{i=1}^k \alpha_i s_i\big\|^2 : \alpha \in \Delta_k \Big\}, \tag{1.3.1}$$

where $\Delta_k$ is the unit simplex of $\mathbb{R}^k$. Set $d_k = -\sum_{i=1}^k \alpha_i s_i$; if $\|d_k\| \leq \delta$ stop.

STEP 1.2 (line-search). Minimize $r_\varepsilon$ and conclude:

(a$_\varepsilon$) either $f_\varepsilon'(x^p, d_k) < 0$; then go to Step 2;

(b$_\varepsilon$) or $f_\varepsilon'(x^p, d_k) \geq 0$; then obtain a suitable $s_{k+1} \in \partial_\varepsilon f(x^p)$, replace $k$ by $k + 1$ and loop to Step 1.1.

STEP 2 (descent). Obtain $t > 0$ such that $f(x^p + t d_k) < f(x^p) - \varepsilon$. Set $x^{p+1} = x^p + t d_k$, replace $p$ by $p + 1$ and loop to Step 1.0. □

Remark 1.3.2 Within Step 1, $k$ increases by one at each subiteration. The number of such subiterations is not known in advance, so the number of subgradients to be stored can grow arbitrarily large, and the complexity of the projection problem as well. We know, however, that it is not really necessary to keep all the subgradients at each iteration: Theorem IX.2.1.7 tells us that the $(k+1)$st projection must be made onto a polyhedron which can be as small as the segment $[-d_k, s_{k+1}]$. In other words, the number of subgradients to be stored, i.e. the number of variables in the quadratic problem of Step 1.1, can be as small as 2 (but at the price of a substantial loss in actual performance: remember Fig. IX.2.2.3). □
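The quadratic problem (1.3.1) is small and highly structured. A minimal way to solve it - our illustration, not the book's prescription; any QP code would do - is projected gradient with an exact Euclidean projection onto the simplex:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the unit simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def trial_direction(S, iters=5000):
    """Step 1.1: minimize ||sum_i a_i s_i||^2 over the unit simplex and
    return d_k = -Proj(0 | conv{s_1,...,s_k}); S has the s_i as rows."""
    k = S.shape[0]
    a = np.full(k, 1.0 / k)
    G = S @ S.T                                   # Gram matrix of the bundle
    L = np.linalg.norm(G, 2) + 1e-12              # Lipschitz constant of the gradient
    for _ in range(iters):
        a = project_simplex(a - (G @ a) / L)      # projected-gradient step
    return -(a @ S), a

# tiny check: with s1 = (1, 0) and s2 = (0, 1), the projection of 0 onto
# the segment is (1/2, 1/2), so d = (-1/2, -1/2)
d, a = trial_direction(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(d, a)
```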

Remark 1.3.3 If we compare this algorithm to those of Chap. II, we see that it works rather similarly: at each iterate $x^p$, it constructs a direction $d_k$, along which it performs a line-search. However there are two new features, both characteristic of all the minimization algorithms to come.

- First, the direction is computed in a rather sophisticated way, as the solution of an auxiliary minimization problem, instead of being given explicitly in terms of the "gradient" $s_k$.

- The second feature brings a rather fundamental difference: it is that the line-search has two possible exits (a$_\varepsilon$) and (b$_\varepsilon$). The first case is the normal one and could be called a descent step, in which the current iterate $x^p$ is updated to a better one. In the second case, the iterate is kept where it is and the next line-search will start from the same $x^p$. As in the bundling Algorithm IX.1.6, this can be called a null-step. □

We can guess that there will be two rather different cases in Step 1.2, corresponding to those of Figs. 1.2.1 and 1.2.2. This will be seen in more detail in §2, but we give already here an idea of the convergence proof, assuming a simple situation: only the case of Fig. 1.2.1 occurs, $q_\varepsilon$ is minimized exactly at each line-search, and correspondingly, an $s_{k+1}$ predicted by Lemma 1.2.4 is found at each null-step. Note that the last two assumptions are rather unrealistic.

Theorem 1.3.4 Suppose that each execution of Step 1.2 in Algorithm 1.3.1 produces an optimal $u_k > 0$ and a corresponding $s_{k+1} \in \partial f(x^p + \frac{1}{u_k} d_k)$ such that $e_k = \varepsilon$, where

$$e_k := f(x^p) - f\Big(x^p + \tfrac{1}{u_k} d_k\Big) + \tfrac{1}{u_k}\langle s_{k+1}, d_k \rangle.$$

Then: either $f(x^p) \to -\infty$, or the stop of Step 1.1 occurs for some finite iteration index $p^*$, at which there holds

$$f(x^{p^*}) \leq f(y) + \varepsilon + \delta \|y - x^{p^*}\| \quad \text{for all } y \in \mathbb{R}^n. \tag{1.3.2}$$

PROOF. Suppose $f(x^p)$ is bounded from below. As in Proposition 1.1.5, Step 2 cannot be executed infinitely many times. At some finite iteration index $p^*$, Algorithm 1.3.1 therefore loops between Steps 1.1 and 1.2, i.e. case (b$_\varepsilon$) always occurs at this iteration. Then Lemma 1.2.4(j) applies for each $k$: first

$$s_{k+1} \in \partial_{e_k} f(x^{p^*}) = \partial_\varepsilon f(x^{p^*}) \quad \text{for all } k;$$

we deduce in particular that the sequence $\{s_k\} \subset \partial_\varepsilon f(x^{p^*})$ is bounded (Theorem XI.1.1.4). Second, $\langle s_{k+1}, d_k \rangle \geq 0$ which, together with the minimality conditions for the projection $-d_k$, yields

$$\langle -d_k, s_j - s_{k+1} \rangle \geq \|d_k\|^2 \quad \text{for } j = 1, \ldots, k.$$

Lemma IX.2.1.1 applies (with $s_k = -d_k$ and $m = 1$): $d_k \to 0$ if $k \to \infty$, so the stop must eventually occur. Finally, each $-d_k$ is a convex combination of ε-subgradients $s_1, \ldots, s_k$ of $f$ at $x^{p^*}$ (note that $s_1 \in \partial_\varepsilon f(x^{p^*})$ by construction) and is therefore an ε-subgradient itself:

$$f(y) \geq f(x^{p^*}) - \langle d_k, y - x^{p^*} \rangle - \varepsilon \quad \text{for all } y \in \mathbb{R}^n.$$

This implies (1.3.2) by the Cauchy-Schwarz inequality. □

It is worth mentioning that the division Step 1 - Step 2 is somewhat artificial, especially with Remark 1.3.3 in mind; we will no longer make this division. Actually, it is Step 1.2 which contains the bulk of the work - and Step 1.1 to a lesser extent. Step 1.2 even absorbs the work of Step 2 (the suitable $t$ and its associated $s$ have already been found during the line-search) and of Step 1.0 (the starting $s_1$ has been found at the end of the previous line-search). We have here one more illustration of the general principle that line-searches are the most important ingredient when implementing optimization algorithms.

Remark 1.3.5 The boundedness of $\{s_k\}$ for each outer iteration $p$ is a key property. Technically, it is interesting to observe that it is automatically satisfied, because of the very fact that $s_k \in \partial_\varepsilon f(x^p)$: no boundedness of $\{x^p\}$ is needed.

Along the same idea, the assumption that $f$ is finite everywhere has little importance: it could be replaced by something more local, for example

$$S_{f(x^1)}(f) := \{x \in \mathbb{R}^n : f(x) \leq f(x^1)\} \subset \operatorname{int} \operatorname{dom} f;$$

Theorem XI.1.1.4 would still apply. This would just imply some sophistication in the line-search of the next section, to cope with the case of (U1) being called at a point $x^p + t d_k \notin \operatorname{dom} f$. In this case, the question arises of what (U1) should return for $f(x^p + t d_k)$ and $s(x^p + t d_k)$. □


2 A Direct Implementation: Algorithm of ε-Descent

This section contains the computational details that are necessary to implement the algorithm introduced in §1, sketched as Algorithm 1.3.1. We mention that these details are by no means fundamental for the next chapters and can therefore be skipped by a casual reader - it has already been mentioned that methods of ε-descent are not advisable for actual use in "real life" problems. On the other hand, these details give a good idea of the kind of questions that arise when methods for nonsmooth optimization are implemented.

To obtain a really implementable form of Algorithm 1.3.1, we need to specify two calculations: in Step 1.1, how the α-problem is solved, and in Step 1.2, how the line-search is performed. The α-problem is a classical convex quadratic minimization problem with linear constraints; as such, it poses no particular difficulty. The line-search, on the contrary, is rather new since it consists of minimizing the nonsmooth function $q_\varepsilon$ or $r_\varepsilon$. It forms the subject of the present section, in which we largely use the principles of §II.3.

A totally clear implementation of a line-search implies answering three questions.

(i) Initialization: how should the first trial stepsize be chosen? Here, the line-search must be initialized at each new direction $d_k$. No really satisfactory initialization is known - and this is precisely one of the drawbacks of the present algorithm. We will not study this problem, considering that the choice $t = 1$ is simplest, if not excessively sensible! (instead of 1, one must of course at least choose a number which, when multiplied by $\|d_k\|$, gives a reasonable move from $x^p$).

(ii) Iteration: given the current stepsize, assumed not suitable, how can the next one be chosen? This will be the subject of §2.1.

(iii) Stopping criterion: when can the current stepsize be considered as suitable (§2.2)? As is the case with all line-searches, this is by far the most delicate question in the present context. It is more crucial here than ever: it gives not only the conditions for stopping the line-search, but also those for determining the next subgradient to be added to the current approximation of $\partial_\varepsilon f(x^p)$.

2.1 Iterating the Line-Search

In order to make the notation less cumbersome, when no confusion is possible, we drop the indices $p$ and $k$, since they are fixed here and in the next subsection. We are therefore given $x \in \mathbb{R}^n$ and $0 \neq d \in \mathbb{R}^n$, the starting point and the direction for the current line-search under study. We are also given $\varepsilon > 0$, which is crucial in all this chapter.

As already mentioned in §1, the aim of the line-search is to minimize the function $q_\varepsilon$ of (1.2.2), or equivalently the function $r_\varepsilon$ of (1.2.4). We prefer to base our development on the more natural $q_\varepsilon$, despite its nonconvexity. Of course, the reciprocal correspondence (1.2.4) must be kept in mind and used whenever necessary.

At the point $y = x + td$, it is convenient to simplify the notation: $s(t)$ will stand for $s(x + td) \in \partial f(x + td)$ and

$$e(t) := e(x, x + td, s(t)) = f(x) - f(x + td) + t\langle s(t), d \rangle \tag{2.1.1}$$

for the linearization error (1.2.5). This creates no confusion because $x$ and $d$ are fixed and because, once again, the subgradient $s$ is considered as a single-valued function, depending on the black box (U1). Recall that $\varepsilon - e(t)$ is in $\partial r_\varepsilon(1/t)$ (we remind the reader that the last expression means the subdifferential of the convex function $u \mapsto r_\varepsilon(u)$ at the point $u = 1/t$).

The problem to be solved in this subsection is: during the process of minimizing $q_\varepsilon$, where should we place each trial stepsize? Applying the principles of §II.3.1, we must decide whether a given $t > 0$ is

(O) convenient, so the minimization of $q_\varepsilon$ can be stopped;
(L) on the left of the set of convenient stepsizes;
(R) on their right.

The key to designing (L) and (R) lies in the statements 1.2.2 and 1.2.3: in fact, call the black box (U1) to compute $s(t) \in \partial f(x + td)$; then compute $e(t)$ of (2.1.1) and compare it to $\varepsilon$. There are three possibilities:

(O) := "$\{e(t) = \varepsilon\}$" (an extraordinary event!). Then this $t$ is optimal and the line-search is finished.

(L) := "$\{e(t) < \varepsilon\}$". This $t$ is too small, in the sense that no optimal $t$ can lie on its left. It can serve as a lower bound for all subsequent stepsizes, therefore set $t_L = t$ before looping to the next trial.

(R) := "$\{e(t) > \varepsilon\}$". Not only must $T_\varepsilon$ lie on the left of this $t$, but also $s(t) \notin \partial_\varepsilon f(x)$, so $t$ is too large (see Proposition XI.4.2.5 and the discussion following it). This makes two reasons to set $t_R = t$ before looping.

Apart from the stopping criterion, we are now in a position to describe the line-search in some detail. The notation in the following algorithm is exactly that of §II.3.

Algorithm 2.1.1 (Prototype Line-Search for ε-Descent)

STEP 0 (initialization). Set $t_L = 0$, $t_R = +\infty$. Choose an initial $t > 0$.

STEP 1 (work). Obtain $f(x + td)$ and $s(t) \in \partial f(x + td)$; compute $e(t)$ defined by (2.1.1).

STEP 2 (dispatching). If $e(t) < \varepsilon$ set $t_L = t$; if $e(t) > \varepsilon$ set $t_R = t$.

STEP 3 (stopping criterion). Apply the stopping criterion, which must include in particular: stop the line-search if $e(t) = \varepsilon$ or if $f(x + td) < f(x) - \varepsilon$.

STEP 4 (new stepsize). If $t_R = +\infty$ then extrapolate, i.e. compute a new $t > t_L$. If $t_R < +\infty$ then interpolate, i.e. compute a new $t \in \,]t_L, t_R[$. Loop to Step 1. □
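A direct transcription of this prototype, as a sketch under our own simplest choices: doubling for extrapolation and bisection for interpolation (both satisfy the safeguard-reduction property recalled below), and `oracle` is an assumed black box returning $(f(y), s(y))$. Step 3 here is only the minimal test; the full criterion is the subject of §2.2.

```python
import numpy as np

def prototype_line_search(oracle, x, d, eps, fx, t0=1.0, max_cycles=100):
    """Sketch of Algorithm 2.1.1; `oracle(y)` returns (f(y), s(y))."""
    tL, tR, t = 0.0, np.inf, t0                       # Step 0
    for _ in range(max_cycles):
        ft, st = oracle(x + t * d)                    # Step 1: call the black box
        e = fx - ft + t * np.dot(st, d)               # linearization error (2.1.1)
        if e < eps:                                   # Step 2: dispatching
            tL = t
        elif e > eps:
            tR = t
        if e == eps or ft < fx - eps:                 # Step 3 (minimal version)
            return t, ft, st
        t = 2.0 * t if tR == np.inf else 0.5 * (tL + tR)   # Step 4
    return t, ft, st                                  # give up after max_cycles
```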

As explained in §II.3.1, and as can be seen by a long enough contemplation of this algorithm, two sequences of stepsizes $\{t_R\}$ and $\{t_L\}$ are generated. They have the properties that $\{t_L\}$ is increasing and $\{t_R\}$ is decreasing. At each cycle,

$$t_L \leq t \leq t_R \quad \text{for any } t \text{ minimizing } q_\varepsilon.$$

Once an interpolation is made, i.e. once some real $t_R$ has been found, no extrapolation is ever made again. In this case one can ascertain that $q_\varepsilon$ has a minimum "at finite distance". The new $t$ in Step 4 must be computed so that, if infinitely many extrapolations [resp. interpolations] are made, then $t_L \to \infty$ [resp. $t_R - t_L \to 0$]. This is what was called the safeguard-reduction Property II.3.1.3.

Remark 2.1.2 See §II.3.4 for suitable safeguarded strategies in Step 4. Without entering into technical details, we mention a possibility for the interpolation formula; it is aimed at minimizing a convex function, so we switch to the $(u, r)$-notation, instead of $(t, q)$.

Suppose that some $u_L > 0$ (i.e. $t_R = 1/u_L < +\infty$) has been generated: we do have an actual bracket $[u_L, u_R]$, with corresponding actual function- and slope-values; let us denote them by $r_L$ and $r_L'$, $r_R$ and $r_R'$ respectively. We will subscript by $G$ the endpoint that is better than the other. Look for example at Fig. 2.1.1: $u_G = u_L$ because $r_L < r_R$. Finally, call $\bar u$ the minimum of $r_\varepsilon$, assumed unique for simplicity.

Fig. 2.1.1. Quadratic and polyhedral approximations

To place the next iterate, say $u_+$, two possible ideas can be exploited.

Idea Q (quadratic). Assume that $r_\varepsilon$ is smooth near its minimum. Then it is a good idea to adopt a smooth model for $r_\varepsilon$, for example a convex quadratic function $Q$:

$$r_\varepsilon(u) \approx Q(u) := r_G + r_G'(u - u_G) + \tfrac{1}{2} c (u - u_G)^2,$$

where $c > 0$ estimates the local curvature of $r_\varepsilon$. This yields a proposal $u_Q := \operatorname{argmin} Q$ for the next iterate.

Idea P (polyhedral). Assume that $r_\varepsilon$ is kinky near its minimum. Then, best results will probably be obtained with a piecewise affine model $P$:

$$r_\varepsilon(u) \approx P(u) := \max\{r_L + r_L'(u - u_L),\ r_R + r_R'(u - u_R)\},$$

yielding a next proposal $u_P := \operatorname{argmin} P$.

The limited information available from the function $r_\varepsilon$ makes it hard to guess a proper value for $c$, and to choose safely between $u_Q$ and $u_P$. Nevertheless, the following strategy is efficient (a sketch of it in code follows):

- Remembering the secant method of Remark II.2.3.5, take for $c$ the difference quotient between slopes computed on the side of $\bar u$ where $u_G$ lies: in Fig. 2.1.1, these slopes would have been computed at $u_G$ and at a previous $u_L$ (if any). By virtue of Theorem I.4.2.1(iii), $\{r_G'\}$ has a limit, so normally this $c$ behaves itself.

- Then choose for $u_+$ either $u_Q$ or $u_P$, namely the one that is closer to $u_G$; in Fig. 2.1.1, for example, $u_+$ would be chosen as $u_P$.

It can be shown that the resulting interpolation formulae have the following properties: without any additional assumption on $r_\varepsilon$, i.e. on $f$, $u_G$ does converge to $\bar u$, even though $c$ may grow without bound. If $r_\varepsilon$ enjoys some natural additional assumption concerning difference quotients between slopes, then the convergence is also fast in some sense. □
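Here is the announced compact sketch of this interpolation strategy (all names ours; the curvature estimate $c$ is taken as given, e.g. from a secant on earlier slopes):

```python
import numpy as np

def next_u(uL, rL, dL, uR, rR, dR, c):
    """Q/P interpolation of Remark 2.1.2 (sketch). (uL, rL, dL) and
    (uR, rR, dR) are the bracket endpoints with values and slopes
    (dL < 0 < dR); c > 0 estimates the curvature near the better endpoint."""
    uG, rG, dG = (uL, rL, dL) if rL < rR else (uR, rR, dR)   # better endpoint G
    u_Q = uG - dG / c                      # argmin of the quadratic model Q
    # argmin of P = max of two affine models: their intersection point
    u_P = (rL - rR + dR * uR - dL * uL) / (dR - dL)
    u_plus = u_Q if abs(u_Q - uG) <= abs(u_P - uG) else u_P  # closer to uG
    return np.clip(u_plus, min(uL, uR), max(uL, uR))         # stay in the bracket

# toy check: symmetric bracket around a kink at u = 1; both models agree
print(next_u(0.5, 0.5, -1.0, 1.5, 0.5, 1.0, c=2.0))          # prints 1.0
```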


2.2 Stopping the Line-Search

It has already been mentioned again and again that the stopping criterion (Step 3 in Algorithm 2.1.1) is by far the most important ingredient of the line-search. Without it, no actual implementation can even be considered. It is a pivot between Algorithms 2.1.1 and 1.3.1.

As is clear from Step 1.2 in Algorithm 1.3.1, the stopping test consists in choosing between three possibilities:

- either the present $t$ is not suitable and the line-search must be continued;

- or the present $t$ is suitable because case (a$_\varepsilon$) is detected: $f_\varepsilon'(x^p, d_k) < 0$; this simply amounts to testing the descent inequality

$$f(x^p + t d_k) < f(x^p) - \varepsilon; \tag{2.2.1}$$

if it is true, update $x^p$ to $x^{p+1} := x^p + t d_k$ and increase $p$ by 1;

- or the present $t$ is suitable because case (b$_\varepsilon$) is detected: $f_\varepsilon'(x^p, d_k) \geq 0$ and also a suitable ε-subgradient $s_{k+1}$ has been obtained; then append $s_{k+1}$ to the current polyhedral approximation $S_k$ of $\partial_\varepsilon f(x^p)$ and increase $k$ by 1.

The second possibility is quite trivial to test and requires no special comment. As for the third, it has been studied all along in Chap. IX. Assume that (2.2.1) never happens, i.e. that $k \to \infty$ in Step 1 of Algorithm 1.3.1. Then the whole point is to realize that $0 \in \partial_\varepsilon f(x^p)$, i.e. to have $d_k \to 0$. For this, each iteration $k$ must produce $s_{k+1} \in \mathbb{R}^n$ satisfying two properties, already seen before Lemma 1.2.4:

$$s_{k+1} \in \partial_\varepsilon f(x^p), \tag{2.2.2}$$

$$\langle s_{k+1}, d_k \rangle \geq \text{"0"}. \tag{2.2.3}$$

Here, the symbol "0" can be taken as the mere number 0, as was the case in the framework of Theorem 1.3.4. However, we saw in Remark IX.2.1.3 that "0" could also be a suitable negative tolerance, namely a fraction of $-\|d_k\|^2$. Actually, the following specification of (2.2.3):

$$\langle s_{k+1}, d_k \rangle \geq -m'\|d_k\|^2, \quad m' \in \,]0, 1[, \tag{2.2.4}$$

preserves the desired property $d_k \to 0$: we are in the framework of Lemma IX.2.1.1. On the other hand, if $s_{k+1}$ is to be the current subgradient $s(t) = s(x^p + t d_k)$ ($t > 0$ being the present trial stepsize, $t_L$ or $t_R$), we know from (1.2.6) that (2.2.2) is equivalent to $e(x^p, x^p + t d_k, s(t)) \leq \varepsilon$, i.e.

$$f(x^p) - f(x^p + t d_k) + t\langle s(t), d_k \rangle \leq \varepsilon. \tag{2.2.5}$$

It can be seen that the two requirements (2.2.4) and (2.2.5) are antagonistic. They may even be incompatible:

Example 2.2.1 With $n = 1$, take the objective function

$$x \mapsto f(x) = \max\{f_1(x), f_2(x)\} \quad \text{with} \quad f_1(x) := 2 - x, \quad f_2(x) := \exp x. \tag{2.2.6}$$

The solution $\bar x$ of $\exp x + x = 2$ is the only point where $f$ is not differentiable. It is also the minimum of $f$, and $f(\bar x) = \exp \bar x$. Note that $\bar x \in \,]0, 1[$ and $f(\bar x) \in \,]1, 2[$. The subdifferential of $f$ is

$$\partial f(x) = \begin{cases} \{-1\} & \text{if } x < \bar x, \\ [-1, \exp \bar x] & \text{if } x = \bar x, \\ \{\exp x\} & \text{if } x > \bar x. \end{cases}$$

Now suppose that the ε-descent Algorithm 1.3.1 is initialized at $x^1 = 0$, where the direction of search is $d_1 = 1$. The trace-function $t \mapsto f(0 + t \cdot 1)$ is differentiable except at a certain $\bar t \in \,]0, 1[$, corresponding to $\bar x$. The linearization-error function $e$ of (2.1.1) is

$$e(t) = \begin{cases} 0 & \text{if } 0 \leq t < \bar t, \\ 2 + (t - 1)\exp t & \text{if } t > \bar t. \end{cases}$$

At $t = \bar t$, the number $e(\bar t)$ is somewhere in $[0, e^*]$ (depending on which subgradient $s(\bar x)$ is actually computed by the black box), where

$$e^* := 2 + (\bar t - 1)\exp \bar t = 3\bar t - \bar t^{\,2} > 1.$$

This example is illustrated in Fig. 2.2.1. Finally suppose that $\varepsilon \leq e^*$, for example $\varepsilon = 1$.

Fig. 2.2.1. Discontinuity of the function $e$

Clearly enough,

- if $t \in [0, \bar t[$, then $\langle s(t), d_1 \rangle = -1 = -\|d_1\|^2$, so (2.2.4) cannot hold, no matter how $m'$ is chosen in $]0, 1[$;

- if $t > \bar t$, then $e(t) > e^* \geq \varepsilon$, so (2.2.5) cannot hold.

In other words, it is impossible to obtain simultaneously (2.2.4) and (2.2.5) unless the following two extraordinary events happen:

(i) the line-search must produce the particular stepsize $t = \bar t$ (said otherwise, $q_\varepsilon$ or $r_\varepsilon$ must be exactly minimized);

(ii) at this $\bar x = 0 + \bar t \cdot 1$, the black box must produce a rather particular subgradient $s$, namely one between the two extreme points $-1$ and $\exp \bar x$, so that $\langle s, d_1 \rangle$ is large enough and (2.2.5) has a chance to hold.

Note also that, if $\varepsilon$ is large enough, namely $\varepsilon \geq 2 - f(\bar x)$, (2.2.1) cannot be obtained, just by definition of $\bar x$. A small numerical check of these claims is sketched below. □
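The announced numerical check (our computation): bisection on $\exp x + x = 2$ locates $\bar t$, and both expressions for $e^*$ agree at a value slightly above $1$, so $\varepsilon = 1$ indeed falls in the bad range.

```python
import numpy as np

# Solve exp(x) + x = 2 by bisection to get the kink xbar = tbar, then
# evaluate e* = 2 + (tbar - 1) exp(tbar) = 3 tbar - tbar^2.
lo, hi = 0.0, 1.0
for _ in range(80):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.exp(mid) + mid < 2.0 else (lo, mid)
tbar = 0.5 * (lo + hi)
e_star = 2.0 + (tbar - 1.0) * np.exp(tbar)
print(tbar)                           # about 0.4429
print(e_star, 3 * tbar - tbar**2)     # about 1.132 > 1, both formulas agree
t = tbar + 1e-6                       # e(t) jumps from 0 to above e* at tbar
print(2.0 + (t - 1.0) * np.exp(t))    # slightly above e_star
```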


Remark 2.2.2 A reader not familiar with numerical computations may not consider (i) as so extraordinary. He should however remember how a line-search algorithm has to work (see §II.3, and especially Example II.3.1.5). Furthermore, $\bar t$ is given by a non-solvable equation and cannot be computed exactly. This is why we bothered to take a somewhat complicated $f$ in (2.2.6). The example would be just as good, say, with $f(x) = |x - 1|$. □

From this example, we draw the following conclusion: it may happen that, for the given $f$, $x$, $d$ and $\varepsilon$, no line-search algorithm can produce a suitable new element in $\partial_\varepsilon f(x)$; and this no matter how $m'$ is chosen in $]0, 1[$. When this phenomenon happens, Algorithm 2.1.1 is stuck: it loops forever between Step 1 and Step 4. Of course, relaxing condition (2.2.4), say by taking $m' \geq 1$, does not help because then, it is Algorithm 1.3.1 which may be stuck, looping within Step 1: $d_k$ will have no reason to tend to $0$ and the stop of Step 1.1 will never occur.

The diagnostic is that (1.2.6), i.e. the transportation formula, does not allow us to reach those ε-subgradients that we need. The remedy is to use a slightly more sophisticated formula, to construct ε-subgradients as combinations of subgradients computed at several other points:

Lemma 2.2.3 Let there be given $x \in \mathbb{R}^n$, $y_j \in \mathbb{R}^n$ and $s_j \in \partial f(y_j)$ for $j = 1, \ldots, m$; set

$$e_j := e(x, y_j, s_j) = f(x) - f(y_j) - \langle s_j, x - y_j \rangle \quad \text{for } j = 1, \ldots, m.$$

With $\alpha = (\alpha_1, \ldots, \alpha_m)$ in the unit simplex $\Delta_m$, set $e := \sum_{j=1}^m \alpha_j e_j$. Then there holds

$$s := \sum_{j=1}^m \alpha_j s_j \in \partial_e f(x).$$

PROOF. Immediate: proceed exactly as for the transportation formula XI.4.2.2. □
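A toy verification of this lemma (example ours): with $f(x) = |x|$ on $\mathbb{R}$, combining the subgradients at $y_1 = 1$ and $y_2 = -1$ with equal weights produces $s = 0$, an $e$-subgradient at $x = 0.5$ with $e = 0.5$. This combination mechanism is exactly what the $s_\mu$ construction below exploits.

```python
import numpy as np

f = abs
x = 0.5
ys, ss = [1.0, -1.0], [1.0, -1.0]                   # y_j and s_j in df(y_j)
e_js = [f(x) - f(y) - s * (x - y) for y, s in zip(ys, ss)]  # errors e_j
alpha = [0.5, 0.5]
e = sum(a * ej for a, ej in zip(alpha, e_js))       # combined error: 0.5
s = sum(a * sj for a, sj in zip(alpha, ss))         # combined subgradient: 0
# s must satisfy f(y) >= f(x) + s*(y - x) - e for all y (e-subgradient at x)
ok = all(f(y) >= f(x) + s * (y - x) - e - 1e-12 for y in np.linspace(-5, 5, 1001))
print(e_js, e, s, ok)                               # [0.0, 1.0] 0.5 0.0 True
```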

This result can be applied to our line-search problem. Suppose that the minima of $q_\varepsilon$ have been bracketed by $t_L$ and $t_R$: in view of Lemma 1.2.2, we have on hand two subgradients $s_L := s(x + t_L d)$ and $s_R := s(x + t_R d)$ satisfying

$$t_L \langle s_L, d \rangle < \varepsilon - f(x) + f(x + t_L d),$$

$$t_R \langle s_R, d \rangle > \varepsilon - f(x) + f(x + t_R d).$$

Then there are two key observations. One is that, since (2.2.1) is assumed not to hold at $t = t_R$ (otherwise we have no problem), the second inequality above implies $0 < \langle s_R, d \rangle$. Second, (1.2.6) implies $s_L \in \partial_\varepsilon f(x)$ and we may just assume $\langle s_L, d \rangle < 0$ (otherwise we have no problem either). In summary,

$$\langle s_L, d \rangle < 0 < \langle s_R, d \rangle. \tag{2.2.7}$$

Therefore, consider a convex combination

$$s_\mu := \mu s_L + (1 - \mu) s_R, \quad \mu \in [0, 1].$$

According to (2.2.7), $\langle s_\mu, d \rangle \geq -m'\|d\|^2$ if $\mu$ is small enough, namely $\mu \leq \bar\mu$ with

$$\bar\mu := \frac{\langle s_R, d \rangle + m'\|d\|^2}{\langle s_R, d \rangle - \langle s_L, d \rangle}.$$

On the other hand, Lemma 2.2.3 shows that $s_\mu \in \partial_\varepsilon f(x)$ if $\mu$ is large enough, namely $\mu \geq \underline\mu$ with

$$\underline\mu := \frac{e_R - \varepsilon}{e_R - e_L} \in \,]0, 1[. \tag{2.2.8}$$

It turns out (and this will follow from Lemma 2.3.2 below) that, from the property $f(x + td) \geq f(x) - \varepsilon$, we have $\underline\mu \leq \bar\mu$ for $t_R - t_L$ small enough. When this happens, we are done because we can find $s_{k+1} = s_\mu$ satisfying (2.2.4) and (2.2.5). The line-search can be stopped, and the current approximation of $\partial_\varepsilon f(x^p)$ can be suitably enlarged.

There remains to treat the case in which no finite $t_R$ can ever be found, which happens if $q_\varepsilon$ behaves as in Fig. 1.2.2. Actually, this case is simpler: $t_L \to \infty$ and Lemma 1.2.4 tells us that $s(x + t_L d)$ is eventually convenient.

A simple and compact way to carry out the calculations is to take for $\mu$ the value (2.2.8), it being understood that $\mu = 1$ if $t_R = \infty$. This amounts to taking systematically $s_\mu \in \partial_\varepsilon f(x)$. Then the stopping criterion is:

(a$_\varepsilon$) stop the line-search if $f(x + td) < f(x) - \varepsilon$ (and pass to the next $x$);

(b$_\varepsilon$) stop the line-search if $\langle s_\mu, d \rangle \geq -m'\|d\|^2$ (and pass to the next $d$).

In all other cases, continue the line-search, i.e. pass to the next trial $t$. The consistency of this stopping criterion will be confirmed in the next section.

2.3 The ε-Descent Algorithm and Its Convergence

We are now in a position to realize this Section 2 and to write down the complete organization of the algorithm proposed. The initial iterate $x^1$ is on hand, together with the black box (U1) which, for each $x \in \mathbb{R}^n$, computes $f(x)$ and $s(x) \in \partial f(x)$. Furthermore, a number $\bar k \geq 2$ is given, which is the maximal number of $n$-vectors that can be stored, in view of the memory allocated in the computer. We choose the tolerances $\varepsilon > 0$, $\delta > 0$ and $m' \in \,]0, 1[$. Our aim is to furnish a final iterate $x^{p^*}$ satisfying

$$f(y) \geq f(x^{p^*}) - \varepsilon - \delta\|y - x^{p^*}\| \quad \text{for all } y \in \mathbb{R}^n. \tag{2.3.1}$$

This gives some hints on how to choose the tolerances: $\varepsilon$ is homogeneous to function-values; as in Chap. II, $\delta$ is homogeneous to gradient-norms; and the value $m' = 0.1$ is reasonable.

The algorithm below is of course a combination of Algorithms 1.3.1, 2.1.1, and of the stopping criterion introduced at the end of §2.2. Notes such as (1) refer to explanations given afterwards.

Algorithm 2.3.1 (Algorithm of ε-Descent) The initial iterate $x^1$ is given. Set $p = 1$. Compute $f(x^p)$ and $s(x^p) \in \partial f(x^p)$. Set $f_0 = f(x^p)$, $s_1 = s(x^1)$.(1)

STEP 1 (starting to find a direction of ε-descent). Set $k = 1$.

STEP 2 (computing the trial direction). Set $d = -\sum_{i=1}^k \alpha_i s_i$, where $\alpha$ solves

$$\min\Big\{ \big\|\textstyle\sum_{i=1}^k \alpha_i s_i\big\|^2 : \alpha \in \Delta_k \Big\}.$$

STEP 3 (final stopping test). If $\|d\| \leq \delta$ stop.(2)

STEP 4 (initializing the line-search). Choose an initial $t > 0$. Set $t_L = 0$, $s_L = s_1$, $e_L = 0$; $t_R = 0$, $s_R = 0$.(3)

STEP 5. Compute $f = f(x^p + td)$ and $s = s(x^p + td) \in \partial f(x^p + td)$; set

$$e = f_0 - f + t\langle s, d \rangle.$$

STEP 6 (dispatching). If $e < \varepsilon$, set $t_L = t$, $e_L = e$, $s_L = s$. If $e \geq \varepsilon$, set $t_R = t$, $e_R = e$, $s_R = s$. If $t_R = 0$, set $\mu = 1$; otherwise set $\mu = (e_R - \varepsilon)/(e_R - e_L)$.(4)

STEP 7 (stopping criterion of the line-search). If $f < f_0 - \varepsilon$ go to Step 11. Set $s_\mu = \mu s_L + (1 - \mu)s_R$.(5) If $\langle s_\mu, d \rangle \geq -m'\|d\|^2$ go to Step 9.

STEP 8 (iterating the line-search). If $t_R = 0$, extrapolate, i.e. compute a new $t > t_L$. If $t_R \neq 0$, interpolate, i.e. compute a new $t$ in $]t_L, t_R[$. Loop to Step 5.

STEP 9 (managing the computer-memory). If $k = \bar k$, delete at least two (arbitrary) elements from the list $s_1, \ldots, s_k$. Insert in the new list the element $-d$ coming from Step 2 and let $k < \bar k$ be the number of elements thus obtained.(6)

STEP 10 (iterating the direction-finding process). Set $s_{k+1} = s_\mu$.(7) Replace $k$ by $k + 1$ and loop to Step 2.

STEP 11 (iterating the descent process). Set $x^{p+1} = x^p + td$, $s_1 = s$.(8) Replace $p$ by $p + 1$ and loop to Step 1. □

Comments

(1) $f_0$ is the objective-value at the origin-point $x^p$ of each line-search; $s_1$ is the subgradient computed at this origin-point.

(2) This stop may not be a real stop but a signal to reduce $\varepsilon$ and/or $\delta$ (cf. Remark 1.3.5). It should be clear that, at this stage of the algorithm, the current iterate satisfies the approximate minimality condition (2.3.1).

(3) The initializations of $s_L$ and $e_L$ are "normal": in view of (1) above, $s_1 \in \partial f(x^p)$ and, as is obvious from (2.1.1), $e(x^p, x^p, s_1) = 0$. On the other hand, the initializations of $t_R$ and $s_R$ are purely artificial (see Algorithm II.3.1.2 and the remark following it): they simply help define $s_\mu$ in the forthcoming Step 7. Initializing $e_R$ is not necessary.

(4) In view of the values of $e_L$ and $e_R$, $\mu$ becomes $< 1$ as soon as some $t_R > 0$ is found.

(5) If $t_R = 0$, then $\mu = 1$ and $s_\mu = s_L$. If $e_R = \varepsilon$, then $\mu = 0$ and $s_\mu = s_R$. Otherwise $s_\mu$ is a nontrivial convex combination. In all cases $s_\mu \in \partial_\varepsilon f(x^p)$.

(6) We have chosen to give a specific rule for cleaning the bundle, instead of staying abstract as in Algorithm IX.2.1.5. If $k < \bar k$, then $k$ can be increased by one and the next subgradient can be appended to the current approximation of $\partial_\varepsilon f(x^p)$. Otherwise, one must make room to store at least the current projection $-d$ and the next subgradient $s_\mu$, thus making it possible to use Theorem IX.2.1.7. Note, however, that if the deletion process turns out to keep every $s_j$ corresponding to $\alpha_j > 0$, then the current projection need not be appended: it belongs to the convex hull of the current subgradients and will not improve the next polyhedron anyway.

(7) Here we arrive from Step 7 via Step 9 and $s_\mu$ is the convenient convex combination found by the line-search. This $s_\mu$ is certainly in $\partial_\varepsilon f(x^p)$: if $\mu = 1$, then $e_L \leq \varepsilon$ and the transportation formula (XI.4.2.2) applies; if $\mu \in \,]0, 1[$, the corresponding linearization error $\mu e_L + (1 - \mu)e_R$ is equal to $\varepsilon$, and it is Lemma 2.2.3 that applies.

(8) At this point, $s$ is the last subgradient computed at Step 5 of the line-search. It is a subgradient at the next iterate $x^{p+1} = x^p + td$. □

A loop from Step 11 to Step 1 represents an actual ε-descent, with $x^p$ moved; this was called a descent-step in Remark 1.3.3. A loop from Step 10 to Step 2 represents one iteration of the direction-finding procedure, i.e. a null-step: one subgradient is appended to the current approximation of $\partial_\varepsilon f(x^p)$. The detailed line-search is expanded between Steps 5 and 8. At each cycle, Step 7 decides whether

- to iterate the line-search, by a loop from Step 8 to Step 5,
- to iterate the direction-finding procedure,
- or to iterate the descent process, with an actual ε-descent obtained.

A compact transcription of this flow in code is sketched below.

Now we have to prove convergence of this algorithm. First, we make sure that each line-search terminates.

Lemma 2.3.2 Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex. Suppose that the extrapolation and interpolation formulae in Step 8 of Algorithm 2.3.1 satisfy the safeguard-reduction Property II.3.1.3. Then, for each iteration $(p, k)$, the number of loops from Step 8 to Step 5 is finite.

PROOF. Suppose for contradiction that, for some fixed iteration $(p, k)$, the line-search does not terminate. Suppose first that no $t_R > 0$ is ever generated. By construction, $\mu = 1$ and $s_\mu = s = s_L$ forever. One has therefore at each cycle

$$f(x^p) - \varepsilon \leq f(x^p + t_L d) \leq f(x^p) + t_L \langle s_\mu, d \rangle,$$

where the first inequality holds because Step 7 never exits to Step 11; the second is because $s_\mu \in \partial f(x^p + t_L d)$. We deduce

$$\langle s_\mu, d \rangle \geq -\varepsilon / t_L. \tag{2.3.2}$$

Now, the assumptions on the extrapolation formulae imply that $t_L \to \infty$. In view of the test Step 7 → Step 9, (2.3.2) cannot hold infinitely often.


Thus some $t_R > 0$ must eventually be generated, and Step 6 shows that, from then on, $\mu e_L + (1 - \mu)e_R = \varepsilon$ at every subsequent cycle. This can be expressed as follows:

$$f(x^p) - \mu f(x^p + t_L d) - (1 - \mu)f(x^p + t_R d) = \varepsilon - t_L \mu \langle s_L, d \rangle - t_R(1 - \mu)\langle s_R, d \rangle.$$

Furthermore, non-exit from Step 7 to Step 11 implies that the left-hand side above is smaller than $\varepsilon$, so

$$t_L \mu \langle s_L, d \rangle + t_R(1 - \mu)\langle s_R, d \rangle \geq 0. \tag{2.3.3}$$

By assumption on the interpolation formulae, $t_L$ and $t_R$ have a common limit $\bar t \geq 0$, and we claim that $\bar t > 0$. If not, the Lipschitz continuity of $f$ in a neighborhood of $x^p$ (Theorem IV.3.1.2) would imply the contradiction

$$0 < \varepsilon \leq e_R = f(x^p) - f(x^p + t_R d) + t_R\langle s_R, d \rangle \leq 2L t_R \to 0$$

($L$ being a Lipschitz constant around $x^p$).

Thus, after division by $\bar t > 0$, (2.3.3) can be written as

$$\mu\langle s_L, d \rangle + (1 - \mu)\langle s_R, d \rangle + \eta \geq 0,$$

where the extra term

$$\eta := \frac{t_L - \bar t}{\bar t}\,\mu\langle s_L, d \rangle + \frac{t_R - \bar t}{\bar t}\,(1 - \mu)\langle s_R, d \rangle$$

tends to $0$ when the number of cycles tends to infinity. This is impossible because of the test Step 7 → Step 9. □

The rest is now a variant of Theorem 1.3.4.

Theorem 2.3.3 The assumptions are those of Lemma 2.3.2. Then, either $f(x^p) \to -\infty$, or the stop of Step 3 occurs for some finite $p^*$, yielding $x^{p^*}$ satisfying the approximate minimality condition (2.3.1).

PROOF. In Algorithm 2.3.1, $p$ cannot go to $+\infty$ if $f$ is bounded from below. Consider the last iteration, say the $p^*$th. From Lemma 2.3.2, each iteration $k$ exits to Step 9. By construction, there holds for all $k$

$$s_{k+1} \in \partial_\varepsilon f(x^{p^*}),$$

hence the sequence $\{s_k\}$ is bounded ($x^{p^*}$ is fixed!), and

$$\langle s_{k+1}, d_k \rangle \geq -m'\|d_k\|^2.$$

We are in the situation of Lemma IX.2.1.1: $k$ cannot go to $+\infty$ in view of Step 3. □

Remark 2.3.4 This proof reveals the necessity of taking the tolerance $m' > 0$ to stop each line-search. If $m'$ were set to $0$, we would obtain a non-implementable algorithm of the type IX.1.6.

Other tolerances can equally be used. In the next section, precisely, we will consider variations around the stopping criterion of the line-search. Here, we propose the following exercise: reproduce Lemma 2.3.2 with $m' = 0$, but with the descent criterion in Step 7 replaced by

"If $f < f_0 - \varepsilon'$ go to Step 11",

where $\varepsilon'$ is fixed in $]0, \varepsilon[$. □


3 Putting the Algorithm in Perspective

As already stated in Remark 1.1.6, we are not interested in giving a specific choice for $\varepsilon$ in algorithms such as 2.3.1. It is important for numerical efficiency only; but precisely, we believe that these algorithms cannot be made numerically efficient under their present form. We nevertheless consider two variants, obtained by giving $\varepsilon$ rather special values. They are interesting for the sake of curiosity; furthermore, they illustrate an aspect of the algorithm not directly related to optimization, but rather to separation of closed convex sets - see §IX.3.3.

3.1 A Pure Separation Form

Suppose that $\bar f := \inf\{f(y) : y \in \mathbb{R}^n\} > -\infty$ is known and set

$$\bar\varepsilon := f(x) - \bar f, \tag{3.1.1}$$

$x = x^1$ being the starting point of the ε-descent Algorithm 2.3.1. If we take this value $\bar\varepsilon$ for $\varepsilon$ in that algorithm, the exit from Step 7 to Step 11 will never occur. The line-searches will be made along a sequence of successive directions, all issuing from the same starting $x$.

Then, we could let the algorithm run with $\varepsilon = \bar\varepsilon$, simply as stated in 2.3.1 (the exit-test from Step 7 to Step 11 - and Step 11 itself - could simply be suppressed with no harm). However, this does no good in terms of minimizing $f$: when the algorithm stops, we will learn that the approximate minimality condition (2.3.1) holds with $x^{p^*} = x = x^1$ and $\varepsilon = \bar\varepsilon$; but we knew this before, even with $\delta = 0$.

Fortunately, there are better things to do. Observe first that, instead of minimizing $f$ within $\bar\varepsilon$ (which has already been done), the idea of the algorithm is now to solve the following problem: by suitable calls to the black box (U1), compute $\bar\varepsilon$-subgradients at the given $x$, so as to obtain $0$ - or at least a vector of small norm - in their convex hull. Even though the definition of $\bar\varepsilon$ implies $0 \in \partial_{\bar\varepsilon} f(x)$, the only constructive information concerning this latter set is the black box (U1), and $0$ might never be produced by (U1). In other words, we must try to separate $0$ from $\partial_{\bar\varepsilon} f(x)$. We will fail but the point is to explain this failure.

Remark 3.1.1 The above problem is not a theoretical pastime but is of great interest in Lagrangian relaxation. Suppose that (U1) is a "Lagrange black box", as discussed in §XII.1.1: each subgradient is of the form $s = -c(u)$, $u$ being a primal variable. When we solve the above separation problem, we construct primal points $u_1, \ldots, u_k$ and associated convex multipliers $\alpha_1, \ldots, \alpha_k$ such that

$$\sum_{j=1}^k \alpha_j c(u_j) = 0.$$

In favourable situations - when some convexity is present - the corresponding combination $\sum_{j=1}^k \alpha_j u_j$ is the primal solution that was sought in the first place; see Theorem XII.2.3.4, see also Theorem XII.4.2.5. □


Thus, we are precisely in the framework of §IX.3.3: rather than a minimization algorithm, Algorithm 2.3.1 becomes a separation one, namely a particular instance of Algorithm IX.3.3.1. Using the notation $S := \partial_{\bar\varepsilon} f(x)$, what the latter algorithm needs in its Step 2 is a mechanism for computing the support function $\sigma_S(d)$ of $S$ at any given $d \in \mathbb{R}^n$, together with a solution $s(d)$ in the exposed face $F_S(d)$. The comparison becomes perfect if we interpret the line-search of Algorithm 2.3.1 precisely as this mechanism.

So, upon exit from Step 7 to Step 9, we still want $s^\mu$ to be an $\bar\varepsilon$-subgradient at $x$; but we also want

$$\langle s^\mu, d \rangle \geqslant \langle s, d \rangle \quad \text{for all } s \in S,$$

in order to have $\langle s^\mu, d \rangle = \sigma_S(d)$. The next result shows that, once again, our task is to minimize $r_{\bar\varepsilon}$: this $s_{k+1}$ must be some $s \in \partial f(x + (1/u)d)$, with $u$ yielding $0 \in \partial r_{\bar\varepsilon}(u)$.

Theorem 3.1.2 Let $x$, $d \neq 0$ and $\varepsilon \geqslant 0$ be given. Suppose that $s \in \mathbb{R}^n$ satisfies one of the following two properties:

(i) either, for some $t > 0$, $s \in \partial f(x + td)$ and $e(x, x + td, s) = \varepsilon$;

(ii) or $s$ is a cluster point of a sequence $\{s(t) \in \partial f(x + td)\}_t$ with $t \to +\infty$ and

$$e(x, x + td, s(t)) \leqslant \varepsilon \quad \text{for all } t \geqslant 0.$$

Then $s \in \partial_\varepsilon f(x)$ and $\langle s, d \rangle = f'_\varepsilon(x, d)$.

PROOF. The transportation formula (Proposition XI.4.2.2) directly implies that the $s$ in (i) and the $s(t)$ in (ii) are in $\partial_\varepsilon f(x)$. Invoking the closedness of $\partial_\varepsilon f(x)$ for case (ii), we see that $s \in \partial_\varepsilon f(x)$ in both cases.

Now take an arbitrary $s' \in \partial_\varepsilon f(x)$, satisfying in particular

$$f(x + td) \geqslant f(x) + t\langle s', d \rangle - \varepsilon \quad \text{for all } t > 0. \qquad (3.1.2)$$

[Case (i)] Replacing $\varepsilon$ in (3.1.2) by $e(x, x + td, s) = f(x) - f(x + td) + t\langle s, d \rangle$, we obtain

$$0 \geqslant t\langle s', d \rangle - t\langle s, d \rangle.$$

Divide by $t > 0$: $\langle s, d \rangle = \sigma_{\partial_\varepsilon f(x)}(d)$ since $s'$ was arbitrary.

[Case (ii)] Add the subgradient inequality $f(x) \geqslant f(x + td) - t\langle s(t), d \rangle$ to (3.1.2):

$$0 \geqslant t\langle s', d \rangle - t\langle s(t), d \rangle - \varepsilon.$$

Divide by $t > 0$ and let $t \to +\infty$ to see that $\langle s, d \rangle \geqslant \langle s', d \rangle$. $\square$

In terms of Algorithm 2.3.1, we realize that the $s^\mu$ computed in Step 7 precisely aims at satisfying the conditions in Theorem 3.1.2:


(i) Suppose first that some $t_R > 0$ is generated (which implies $T_{\bar\varepsilon} \neq \emptyset$). When sufficiently many interpolations are made, $t_L$ and $t_R$ become close to their common limit, say $t_{\bar\varepsilon}$. Because $\partial f(\cdot)$ is outer semi-continuous, $s_L$ and $s_R$ are both close to $\partial f(x + t_{\bar\varepsilon} d)$. Their convex combination $s^\mu$ is also close to the convex set $\partial f(x + t_{\bar\varepsilon} d)$. Finally, $e(x, x + t_{\bar\varepsilon} d, s^\mu)$ is close to $\mu e_L + (1 - \mu) e_R = \bar\varepsilon$, by continuity of $f$. In other words, $s^\mu$ almost satisfies the assumptions of Theorem 3.1.2(i). This is illustrated in Fig. 3.1.1.

Fig. 3.1.1. Identifying the approximate directional derivative

(ii) If no $t_R > 0$ is generated and $t_L \to \infty$ (which happens when $T_{\bar\varepsilon} = \emptyset$), it is case (ii) that is satisfied by the corresponding $s_L = s^\mu$.

In a word, Algorithm 2.3.1 becomes a realization of Algorithm IX.3.3.1 if we let the interpolations [resp. the extrapolations] drive $t_R - t_L$ small enough [resp. $t_L$ large enough] before exiting from Step 7 to Step 9. It is then not difficult to adapt the various tolerances so that the approximation is good enough, namely $\langle s^\mu, d_k \rangle$ is close enough to $\sigma_S(d_k)$.
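The following Python sketch renders this interpretation of the line-search as an approximate support-function oracle. It is our own simplified reading of the mechanism - the tolerances, the doubling/bisection details, and the assumption $\bar\varepsilon > 0$ are ours - not a transcription of Algorithm 2.3.1.

```python
import numpy as np

def support_oracle(f, subgrad, x, d, eps_bar, tol=1e-10, t0=1.0, it_max=200):
    """Line-search reinterpreted as the Step-2 mechanism of Algorithm
    IX.3.3.1 (simplified sketch, eps_bar > 0 assumed).

    e(t, s) = f(x) - f(x + t d) + t <s, d> is the linearization error at x
    of s in df(x + t d); t is dispatched as t_L when e < eps_bar and as t_R
    otherwise.  Bisection drives t_R - t_L to 0, and the convex combination
    s_mu is adjusted so that its error equals eps_bar, as in Thm 3.1.2(i)."""
    e = lambda t, s: f(x) - f(x + t * d) + t * (s @ d)
    tL = tR = None
    t = t0
    for _ in range(it_max):
        s = subgrad(x + t * d)
        if e(t, s) < eps_bar:
            tL, sL = t, s                    # still an eps_bar-subgradient
            t = 2.0 * t if tR is None else 0.5 * (tL + tR)
        else:
            tR, sR = t, s                    # overshoot: interpolate
            t = 0.5 * t if tL is None else 0.5 * (tL + tR)
        if tL is not None and tR is not None and tR - tL <= tol:
            break
    if tL is None:          # immediate overshoot (eps_bar essentially 0)
        return sR
    if tR is None:          # f keeps decreasing: case (ii) of Theorem 3.1.2
        return sL
    eL, eR = e(tL, sL), e(tR, sR)
    mu = (eps_bar - eR) / (eL - eR)          # mu*eL + (1 - mu)*eR = eps_bar
    return mu * sL + (1.0 - mu) * sR         # <s_mu, d> ~ sigma_S(d)
```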

These ideas have been used for the numerical experiments of Figs. IX.3.3.1 and IX.3.3.2. It is the above variant that was the first form of Algorithm IX.3.3.1, alluded to at the end of §IX.3.3. In the case of TR48 (Example IX.2.2.6), $\partial_{\bar\varepsilon} f(x)$ was identified at $x = 0$, a significantly non-optimal point: $f(0) = -464816$, while $\bar f = -638565$. All generated $s_k$ had roughly comparable norms. In the example MAXQUAD of VIII.3.3.3, this was far from being the case: the norms of the subgradients were fairly dispersed; remember Table VIII.3.3.1.

Remark 3.1.3 An interesting question is whether this variant produces a minimizing sequence. The answer is no: take the example of (IX.1.1), where the objective function was

$$f(y) = \max\{-y_1,\; y_1 + 2y_2,\; y_1 - 2y_2\}.$$

Start from the initial $x = (2, 1)$ and take $\varepsilon = \bar\varepsilon = 4$. We leave it as an exercise to see how Algorithm 1.3.1 proceeds: the first iteration, along the direction $(-1, -2)$, produces $y_2$ anywhere on the half-line $\{(2 - t, 1 - 2t) : t \geqslant 1/2\}$ and $s_2$ is unambiguously $(1, -2)$. Then $d_2$ is collinear to $(-1, 0)$, $y_3 = (-1, 1)$ and $s_3 = (-1/3, 2/3)$ (for this last calculation, take $s_3 = \alpha(-1, 0) + (1 - \alpha)(1, 2)$, a convex combination of subgradients at $y_3$, and adjust $\alpha$ so that the corresponding linearization error $e(x, y_3, s_3)$ is equal to $\bar\varepsilon = 4$). Finally $d_3 = 0$,

3 Putting the Mgorithm in Perspective 219

although neither $y_2$ nor $y_3$ is even close to the minimum $(0, 0)$. These operations are illustrated by Fig. 3.1.2.

Note in this example that $d_3 = 0$ is a convex combination of $s_2$ and $s_3$: $s_1$ plays no role. Thus 0 is on the boundary of $\operatorname{co}\{s_1, s_2, s_3\} = \partial_4 f(x)$. If $\varepsilon$ is decreased, $s_2$ and $s_3$ are slightly perturbed and $0 \notin \operatorname{co}\{s_1, s_2, s_3\}$: $d_3$ is no longer zero (although it stays small). The property $0 \in \operatorname{bd} \partial_4 f(x)$ does not hold by chance but was proved in Proposition XI.1.3.5: $\bar\varepsilon = 4$ is the critical value of $\varepsilon$, and the normal cone to $\partial_4 f(x)$ at 0 is not reduced to $\{0\}$. $\square$

Fig. 3.1.2. Three iterations of the separating algorithm
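The arithmetic of this remark is easy to check numerically. The snippet below assumes the max-of-affine form of (IX.1.1) written above and verifies that $e(x, y_3, s_3) = 4$ and that 0 is indeed a convex combination of $s_2$ and $s_3$.

```python
import numpy as np

f = lambda y: max(-y[0], y[0] + 2 * y[1], y[0] - 2 * y[1])

x = np.array([2.0, 1.0])                  # f(x) = 4, so eps_bar = 4
s2 = np.array([1.0, -2.0])                # subgradient found along d1
s3 = np.array([-1.0 / 3.0, 2.0 / 3.0])    # adjusted so that e(x, y3, s3) = 4

y3 = np.array([-1.0, 1.0])
e3 = f(x) - f(y3) - s3 @ (x - y3)         # linearization error
print(e3)                                 # 4.0

print(0.25 * s2 + 0.75 * s3)              # [0. 0.]: hence d3 = 0
```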

3.2 A Totally Static Minimization Algorithm

Our second variant is also a sort of separation Algorithm IX.3.3.1. It also performs one single descent iteration, the starting point of the line-searches never being updated. It differs from the previous example, though: it is really aimed at minimizing $f$, and $\varepsilon$ is no longer fixed but depends on the direction-index $k$. Specifically, for fixed $x = x_1$ and $d \neq 0$ in Algorithm 2.3.1, we set

$$\varepsilon = \varepsilon(x, d) := f(x) - \inf_{t > 0} f(x + td). \qquad (3.2.1)$$

A second peculiarity of our variant is therefore that $\varepsilon$ is not known; and also, it may be 0 (for simplicity, we assume $\varepsilon(x, d) < +\infty$). Nevertheless, the method can still be made implementable in this case. In fact, a careful study of Algorithm 2.3.1 shows that the actual value of $\varepsilon$ is needed in three places:

- for the exit-test from Step 7 to Step 11; with the $\varepsilon$ of (3.2.1), this test is never passed and can be suppressed, just as in §3.1;

- in Step 6, to dispatch $t$ as a $t_L$ or $t_R$; Proposition 3.2.1 below shows that the precise value (3.2.1) is useless for this particular operation;

- in Step 6 again, to compute $\mu$; it is here that the lack of knowledge of $\varepsilon$ has the main consequences.

Proposition 3.2.1 For given $x$, $d \neq 0$ in $\mathbb{R}^n$ and $\varepsilon$ of (3.2.1), the minimizers of $t \mapsto f(x + td)$ over $\mathbb{R}_+$ are also minimizers of $q_\varepsilon$ - knowing that $q_0$ is the ordinary difference quotient, with $q_0(0) = f'(x, d)$. The converse is true if $\varepsilon > 0$.


PROOF. Let $\bar t$ minimize $f(x + \cdot d)$ over $\mathbb{R}_+$. If $\bar t = 0$, then $\varepsilon(x, d) = 0$ and $q_{\varepsilon(x,d)}(0) = f'(x, d) \leqslant q_{\varepsilon(x,d)}(t)$ for all $t \geqslant 0$ (monotonicity of the difference quotient); the proof is finished.

If $\bar t > 0$, the definition

$$f(x + \bar t d) = f(x) - \varepsilon(x, d)$$

combined with the minimality condition

$$0 \in \langle \partial f(x + \bar t d), d \rangle$$

shows that $\bar t$ satisfies the minimality condition of Lemma 1.2.2(i) for $q_{\varepsilon(x,d)}$. Now let $\{\tau_k\} \subset \mathbb{R}_+$ be a minimizing sequence for $f(x + \cdot d)$:

$$f(x + \tau_k d) \to f(x) - \varepsilon,$$

where $\varepsilon$ is the $\varepsilon(x, d)$ of (3.2.1). If $\varepsilon > 0$, then 0 cannot be a cluster point of $\{\tau_k\}$: we can write

$$\frac{f(x + \tau_k d) - f(x) + \varepsilon}{\tau_k} \to 0,$$

and this implies $f'_\varepsilon(x, d) \leqslant 0$. If $t_\varepsilon$ minimizes $q_\varepsilon$, we have

$$\frac{f(x + t_\varepsilon d) - f(x) + \varepsilon}{t_\varepsilon} = f'_\varepsilon(x, d) \leqslant 0.$$

Thus

$$f(x + t_\varepsilon d) \leqslant f(x) - \varepsilon,$$

and $t_\varepsilon$ does minimize $f(x + \cdot d)$ (and $f'_\varepsilon(x, d)$ is actually zero). $\square$
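Proposition 3.2.1 is easy to visualize on a one-dimensional toy problem. In the following sketch, the example $f(y) = |y - 1|$ and all numerical values are our own choice; the minimizer of $t \mapsto f(x + td)$ and that of the $\varepsilon$-difference quotient $q_\varepsilon$ coincide, as the proposition asserts.

```python
from scipy.optimize import minimize_scalar

# Toy illustration of Proposition 3.2.1: f(y) = |y - 1|, x = 0, d = 1.
f = lambda y: abs(y - 1.0)
x, d = 0.0, 1.0

t_star = minimize_scalar(lambda t: f(x + t * d), bounds=(0.0, 10.0),
                         method="bounded").x         # ~1
eps = f(x) - f(x + t_star * d)                       # eps(x, d) = 1

q = lambda t: (f(x + t * d) - f(x) + eps) / t        # eps-difference quotient
t_q = minimize_scalar(q, bounds=(1e-9, 10.0), method="bounded").x
print(t_star, t_q)                                   # both close to 1
```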

It is therefore a good idea to minimize $f(x + td)$ with respect to $t \geqslant 0$: the implicitly given function $q_{\varepsilon(x,d)}$ will by necessity be minimized - and it is a much more natural task. A first consequence is that the decision to dispatch $t$ is even simpler than before, namely:

if $\langle s, d \rangle < 0$, $t$ must become a $t_L$ ($f$ seems to be decreasing), (3.2.2)

if $\langle s, d \rangle > 0$, $t$ must become a $t_R$ ($f$ seems to be increasing), (3.2.3)

knowing that $\langle s, d \rangle = 0$ terminates the line-search, of course. However, the object of the line-search is now to obtain $t$ and $s^+ \in \partial f(x + td)$ satisfying $\langle s^+, d \rangle = 0$; the property $s^+ \in \partial_{\varepsilon(x,d)} f(x)$ will follow automatically. The choice of $\mu$ in Step 6 can then be adapted accordingly. The rationale for $\mu$ in Algorithm 2.3.1 was based on forcing $s^\mu$ to be a priori in $\partial_\varepsilon f(x)$, and it was only eventually that $\langle s^\mu, d \rangle$ became close to zero, i.e. large enough in that context. Here, it is rather the other way round: we do not know in which $\varepsilon(x, d)$-subdifferential the subgradient $s^\mu$ should lie, but we do know that we want to obtain $\langle s^\mu, d \rangle = 0$. Accordingly, it becomes suitable to compute $\mu$ in Step 6 so that $\langle \mu s_L + (1 - \mu) s_R, d \rangle = 0$, i.e.

$$\mu := \frac{\langle s_R, d \rangle}{\langle s_R, d \rangle - \langle s_L, d \rangle} \qquad (3.2.4)$$

when $t_R > 0$ (otherwise, take $\mu = 1$ as before). Observe that, fortunately, this is in harmony with the dispatching (3.2.2), (3.2.3): provided that $s_R$ exists, $\mu \in [0, 1]$.
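In code, the choice (3.2.4) is one line; the small helper below (our own naming) also records why the dispatching guarantees $\mu \in [0, 1]$.

```python
def s_mu(sL, sR, d):
    """Convex combination given by (3.2.4): zeroes the slope along d.

    By the dispatching (3.2.2)-(3.2.3), <sL, d> < 0 < <sR, d>, so both
    numerator and denominator below are positive and mu lies in [0, 1];
    the returned s satisfies <s, d> = 0 exactly."""
    pL, pR = sL @ d, sR @ d
    mu = pR / (pR - pL)
    return mu * sL + (1.0 - mu) * sR
```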

It remains to find a suitable stopping criterion for the line-search.


- First, if no $t_R > 0$ is ever found ($f$ decreasing along $d$), there is nothing to change: in view of Lemma 1.2.2(ii), $s^\mu = s(x + t_L d) \in \partial_{\varepsilon(x,d)} f(x)$ for all $t_L$; furthermore, Lemma 1.2.4(ii) still applies and eventually $\langle s^\mu, d \rangle \geqslant -m'\|d\|^2$.

- The other case is when a true bracket $[t_L, t_R]$ is eventually found; then Lemma 2.2.3 implies $s^\mu \in \partial_{\bar\varepsilon} f(x)$ where

$$\bar\varepsilon := \mu\, e(x, x + t_L d, s_L) + (1 - \mu)\, e(x, x + t_R d, s_R) \qquad (3.2.5)$$

and $\mu$ is defined by (3.2.4). As for $\varepsilon(x, d)$, we obviously have

$$\varepsilon(x, d) \geqslant \underline\varepsilon := \max\{f(x) - f(x + t_L d),\; f(x) - f(x + t_R d)\}. \qquad (3.2.6)$$

Continuity of $f$ ensures that $\bar\varepsilon$ and $\underline\varepsilon$ have the common limit $\varepsilon(x, d)$ when $t_L$ and $t_R$ both tend to the optimal (and finite) $\bar t \geqslant 0$. Then, because of the outer semi-continuity of $\varepsilon \mapsto \partial_\varepsilon f(x)$, $s^\mu$ is close to $\partial_{\varepsilon(x,d)} f(x)$ when $t_R$ and $t_L$ are close together.

In a word, without entering into the hair-splitting details of the implementation, the following variant of Algorithm 2.3.1 is reasonable.

Algorithm 3.2.2 (Totally Static Algorithm - Sketch) The initial point $x$ is given, together with the tolerances $\eta > 0$, $\delta > 0$, $m' \in\ ]0, 1[$. Set $k = 1$, $s_1 = s(x)$.

STEP 1. Compute $d_k := -\operatorname{Proj} 0/\{s_1, \dots, s_k\}$.

STEP 2. If $\|d_k\| \leqslant \delta$ stop.

STEP 3. By an approximate minimization of $t \mapsto f(x + td_k)$ over $t \geqslant 0$, find $t_k > 0$ and $s_{k+1}$ such that

$$s_{k+1} \in \partial_{e_k} f(x)$$

and

$$-m'\|d_k\|^2 \leqslant \langle s_{k+1}, d_k \rangle \leqslant 0, \qquad (3.2.7)$$

where $e_k := f(x) - f(x + t_k d_k) + \eta$.

STEP 4. Replace $k$ by $k + 1$ and loop to Step 1. $\square$
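For concreteness, here is a rough Python rendering of this sketch. All numerical choices are ours: the projection of Step 1 is delegated to a small QP over the unit simplex, and Step 3 simply takes a subgradient at an approximate minimizer of $t \mapsto f(x + td_k)$, which enforces (3.2.7) only approximately.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def totally_static(f, subgrad, x, eta=1e-3, delta=1e-4, t_max=1e3, k_max=500):
    """Illustrative rendering of Algorithm 3.2.2; x is never updated."""
    S = [subgrad(x)]                          # s_1 = s(x)
    best = (f(x), x)
    for _ in range(k_max):
        A = np.array(S)
        k = len(S)
        # Step 1: d_k = -Proj 0/{s_1, ..., s_k}, as a QP over the simplex
        res = minimize(lambda a: (a @ A) @ (a @ A), np.full(k, 1.0 / k),
                       bounds=[(0.0, 1.0)] * k, method="SLSQP",
                       constraints=[{"type": "eq",
                                     "fun": lambda a: a.sum() - 1.0}])
        d = -res.x @ A
        if np.linalg.norm(d) <= delta:        # Step 2
            return best[1], S
        # Step 3: approximate minimization of t -> f(x + t d) over t >= 0
        t = minimize_scalar(lambda t: f(x + t * d),
                            bounds=(0.0, t_max), method="bounded").x
        y = x + t * d
        if f(y) < best[0]:
            best = (f(y), y)                  # track x_k of Theorem 3.2.3
        S.append(subgrad(y))                  # s_{k+1}; Step 4: loop
    return best[1], S
```

Note how the bundle $\{s_1, \dots, s_k\}$ is the only thing that evolves; this is precisely what makes the method "totally static", and what the ballistic criticism after Theorem 3.2.3 is about.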

Here, there is no longer any descent test, but a new tolerance $\eta$ is introduced, to make sure that $\bar\varepsilon - \underline\varepsilon$ of (3.2.5), (3.2.6) is small. Admitting that Step 3 terminates for each $k$, this algorithm is convergent:

Theorem 3.2.3 In Algorithm 3.2.2, let

$$x_k = x + t_k d_k \in \operatorname{Argmin}\{f(x + t_i d_i) : i = 1, \dots, k\}$$

be one of the best points found during the successive iterations. If $f$ is bounded from below, the stop of Step 2 occurs for some finite $k$, and then

$$f(y) \geqslant f(x_k) - \eta - \delta\|y - x\| \quad \text{for all } y \in \mathbb{R}^n. \qquad (3.2.8)$$


PROOF. Corresponding to the best point $x_k$, set

$$\varepsilon_k := f(x) - f(x_k) + \eta. \qquad (3.2.9)$$

The sequence $\{\varepsilon_k\}$ is increasing. By construction, $e_k \leqslant \varepsilon_k$, and it follows that $s_{k+1} \in \partial_{\varepsilon_k} f(x)$ for each $k$. If $\varepsilon_k \to +\infty$ when $k \to +\infty$, then $f(x_k) \to -\infty$. Otherwise, $\{\varepsilon_k\}$ is bounded, and $\{s_{k+1}\}$ is therefore bounded as well. In view of (3.2.7), Lemma IX.2.1.1 applies: the algorithm must stop for some iteration $k$. At this $k$,

$$s_i \in \partial_{\varepsilon_i} f(x) \subset \partial_{\varepsilon_k} f(x) \quad \text{for } i = 1, \dots, k,$$

hence $-d_k \in \partial_{\varepsilon_k} f(x)$. Using (3.2.9), we have for all $y \in \mathbb{R}^n$:

$$f(y) \geqslant f(x) - \langle d_k, y - x \rangle - \varepsilon_k = f(x_k) - \langle d_k, y - x \rangle - \eta;$$

since $\|d_k\| \leqslant \delta$ at the stop, (3.2.8) follows. $\square$

From a practical point of view, this algorithm is of course not appropriate: it is not a sound idea to start each line-search from the same initial $x$. Think of a ballistic comparison, where each direction is a gun: to shoot towards a target $\bar x$, it should make sense to place the gun as close as possible to $\bar x$ - and not at the very first iterate, probably the worst of all! This is reflected in the minimality condition: write (3.2.8) with an optimal $y = \bar x$ (the only interesting $y$, actually); admitting that $\|\bar x - x\|$ is unduly large, an unduly small value of $\delta$ - i.e. of $\|d_k\|$ - is required to obtain any useful information about the optimality of $x_k$. Remembering our analysis in §IX.2.2, this means an unduly large number of iterations.

The first inequality in (3.2.7) would suffice to let the algorithm run. The reason for the second inequality is to illustrate another aspect of the approach. Consider again the value $\bar\varepsilon = f(x) - \bar f$ of (3.1.1). Obviously, in Algorithm 3.2.2,

$$s_{k+1} \in \partial_{e_k} f(x) \subset \partial_{\bar\varepsilon + \eta} f(x)$$

for each $k$. Coming back to the problem of separating $\partial_{\bar\varepsilon + \eta} f(x) =: S$ and $\{0\}$, we see that Algorithm 3.2.2 is the second form of Algorithm IX.3.3.1 that was used for Figs. IX.3.3.1 and IX.3.3.2: for each direction $d$, its black box generates $s^+ \in S$ with $\langle s^+, d \rangle \leqslant 0$ (instead of computing the support function). It is therefore a convenient instrument for illustrating the point made at the end of §IX.3.3.