Exam (2015) solutions - Stanford University


CS229T/STATS231 (Fall 2018–2019). Note: please do not copy or distribute.

SUNet ID: Name:

Collaborators: none

1. This exam is open notes and open laptop, but you must turn off the wireless.
2. Please write clearly and give rigorous justification for each non-trivial step.
3. You may cite results from the course notes without proof.
4. When we ask for an upper bound, to get full credit, in addition to being mathematically correct, you must have the right dependencies on all the problem-dependent quantities (e.g., if it is possible to get $d$, then $d \log d$ is not acceptable, but $2d$ is acceptable).


1. Online non-convex optimization (20 points)

Recall that in online convex optimization, nature plays a sequence of convex functions $f_1, \dots, f_T$, and at each time step $t$, the learner outputs $w_t \in S$ which depends only on $f_1, \dots, f_{t-1}$. Convexity is crucial both for algorithmic efficiency and for the regret analysis to work out. But many models/loss functions in practice are non-convex. In this problem, we will obtain a regret analysis for non-convex problems, for the moment deferring issues of computational efficiency.

In the following, assume each $f_t : S \to \mathbb{R}$, where $S = \{w \in \mathbb{R}^d : \|w\|_\infty \le B\}$ is the set of weight vectors where each component is in the interval $[-B, B]$.

a. (5 points) Show that for any online convex optimization algorithm, there exists a sequence of non-convex functions $f_1, \dots, f_T$ that results in $\text{Regret} = T$, even if the range of $f_t$ is constrained to $[0, 1]$.

Solution: Since $f_t$ can depend on $w_t$, we can simply choose $f_t(w) = \mathbb{I}[w = w_t]$. After $T$ steps, the learner attains a loss of $T$. But there are still an infinite number of values of $w \in S$ (namely, those that do not equal any of $w_1, \dots, w_T$) that obtain zero loss. Thus, the regret is exactly $T$.
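This adversarial construction is easy to check numerically. Below is a small sketch (the helper names `indicator_loss` and `adversarial_regret` are made up for the demo; the learner plays uniformly at random, but any strategy suffers the same fate, since $f_t$ is chosen after seeing $w_t$):

```python
import random

def indicator_loss(w, w_t):
    """f_t(w) = I[w = w_t]: loss 1 exactly when w equals the learner's play."""
    return 1.0 if w == w_t else 0.0

def adversarial_regret(T, d=2, B=1.0, seed=0):
    rng = random.Random(seed)
    played = []
    learner_loss = 0.0
    for _ in range(T):
        w_t = tuple(rng.uniform(-B, B) for _ in range(d))  # any learner strategy
        learner_loss += indicator_loss(w_t, w_t)           # always 1
        played.append(w_t)
    # a comparator u in S that avoids every w_t has zero cumulative loss
    u = tuple(rng.uniform(-B, B) for _ in range(d))
    assert u not in played  # holds almost surely for a random draw
    comparator_loss = sum(indicator_loss(u, w_t) for w_t in played)
    return learner_loss - comparator_loss

print(adversarial_regret(100))  # regret = T = 100
```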


b. (5 points) To make progress, we clearly need to make some assumptions on the $f_t$'s. Let $g$ be a non-negative 1-Lipschitz (but not necessarily convex) function (that is, $|g(a) - g(b)| \le |a - b|$ for all $a, b \in \mathbb{R}$). Let $\mathcal{F}$ be the set of possible loss functions, where each $f \in \mathcal{F}$ has the form $f(w) = g(w \cdot x)$ for some $\|x\|_1 \le L$.

Our high-level strategy will be to show that by discretizing $S$, we can turn the problem into learning from expert advice. To start off, show that we can construct a set of $(2BL/\epsilon)^d$ points $C \subseteq \mathbb{R}^d$ such that each $w \in S$ is close to some $v \in C$ in the sense that $\sup_{f \in \mathcal{F}} |f(w) - f(v)| \le \epsilon$.

Solution: Discretize each dimension into $C_j = \{0, \epsilon', 2\epsilon', \dots\} \cap [0, B]$ and let $C = \prod_{j=1}^d (C_j \cup (-C_j))$. Then:

$\sup_{f \in \mathcal{F}} |f(w) - f(v)| = \sup_{\|x\|_1 \le L} |g(w \cdot x) - g(v \cdot x)|$  (1)
$\le \sup_{\|x\|_1 \le L} |w \cdot x - v \cdot x|$  [since $g$ is 1-Lipschitz]  (2)
$\le L \|w - v\|_\infty$  [Hölder's inequality]  (3)
$\le L \epsilon'$.  [by construction of $C$]  (4)

To upper bound the last line by $\epsilon$, it suffices to take $\epsilon' = \epsilon/L$. The size of the cover is $(2B/\epsilon')^d$. Plugging in $\epsilon' = \epsilon/L$ yields the result.
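A quick numerical sanity check of this cover (a sketch with made-up constants; `build_axis` and `nearest` are hypothetical helper names):

```python
import random

def build_axis(B, L, eps):
    """One-dimensional grid with spacing eps' = eps/L covering [-B, B].
    The full cover C is the d-fold product of this axis, of size ~ (2BL/eps)^d."""
    step = eps / L
    k = int(B / step) + 1
    return [i * step for i in range(-k, k + 1)]

def nearest(axis, w):
    """Round each coordinate of w to the closest grid point."""
    return [min(axis, key=lambda v: abs(v - c)) for c in w]

random.seed(0)
B, L, eps, d = 1.0, 2.0, 0.1, 3
axis = build_axis(B, L, eps)
for _ in range(1000):
    w = [random.uniform(-B, B) for _ in range(d)]
    v = nearest(axis, w)
    # ||w - v||_inf <= eps/L, so sup_f |f(w) - f(v)| <= L * ||w - v||_inf <= eps
    assert max(abs(a - b) for a, b in zip(w, v)) <= eps / L + 1e-12
print("cover check passed; axis size =", len(axis))
```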


c. (5 points) Recall that the exponentiated gradient (EG) algorithm maintains a distribution $p_t \in \Delta_m$ over $m$ experts. If $z_{t,j} \ge 0$ is the loss of expert $j$ at time step $t$ (for each $t = 1, \dots, T$ and $j = 1, \dots, m$), then the regret of the EG algorithm is upper bounded as follows for all $q \in \Delta_m$:

$\sum_{t=1}^T [p_t \cdot z_t - q \cdot z_t] \le \frac{\log m}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|z_t\|_\infty^2.$

Consider the cover $C = \{v_1, \dots, v_m\}$ from the previous part, and treat each element of the cover $v \in C$ as an expert. Show that the expected regret of the learner with respect to any $u \in S$ (not necessarily in $C$) is upper bounded:

$\text{Regret}(u) \overset{\text{def}}{=} \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \frac{d \log(2BL/\epsilon)}{\eta} + \left(2\eta (BL)^2 + \epsilon\right) T.$

Solution:

Let $a^* = \arg\min_{|a| \le BL} g(a)$. Let $z_t = [f_t(v_1) - g(a^*), \dots, f_t(v_m) - g(a^*)]$, and note that $z_t \ge 0$ because $a$ ranges over all possible arguments to $g$ that might come from $v \cdot x$.

Then $f_t(v) - g(a^*) = g(v \cdot x) - g(a^*) \le |v \cdot x - a^*| \le \|v\|_\infty \|x\|_1 + BL \le 2BL$, so $\|z_t\|_\infty \le 2BL$. Recall that $\log m = d \log(2BL/\epsilon)$. The EG regret bound allows us to compare the learner with the best expert in $C$. Specifically, for all $v \in C$:

$\sum_{t=1}^T [f_t(w_t) - f_t(v)] \le \frac{d \log(2BL/\epsilon)}{\eta} + 2\eta (BL)^2 T.$

Now, for any $u \in S$, part (b) says that we can find some $v \in C$ such that $|f_t(u) - f_t(v)| \le \epsilon$, since $f_t \in \mathcal{F}$. So the regret with respect to $u \in S$ incurs an additional $\epsilon T$ penalty:

$\text{Regret}(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$  (5)
$= \sum_{t=1}^T [f_t(w_t) - f_t(v) + f_t(v) - f_t(u)]$  (6)
$\le \frac{d \log(2BL/\epsilon)}{\eta} + 2\eta (BL)^2 T + \epsilon T.$  (7)
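As an illustration, here is a sketch of the algorithm this reduction produces: run multiplicative weights (EG with nonnegative losses) treating each cover point as an expert. The loss matrix and constants below are made up for the demo; `eg` is a hypothetical helper name.

```python
import math
import random

def eg(losses, eta):
    """Exponentiated gradient / Hedge over m experts.
    losses: list of T rounds, each a list of m nonnegative expert losses.
    Returns (learner's expected loss, loss of the best single expert)."""
    m = len(losses[0])
    weights = [1.0] * m
    learner = 0.0
    for z in losses:
        total = sum(weights)
        p = [w / total for w in weights]                 # p_t
        learner += sum(pi * zi for pi, zi in zip(p, z))  # p_t . z_t
        weights = [w * math.exp(-eta * zi) for w, zi in zip(weights, z)]
    best = min(sum(round_[j] for round_ in losses) for j in range(m))
    return learner, best

random.seed(0)
T, m = 2000, 16
losses = [[random.random() for _ in range(m)] for _ in range(T)]
eta = math.sqrt(2 * math.log(m) / T)       # balances the two terms of the bound
learner, best = eg(losses, eta)
regret = learner - best
bound = math.log(m) / eta + eta * T / 2    # z_t in [0,1], so ||z_t||_inf^2 <= 1
assert regret <= bound
print(f"regret = {regret:.1f} <= bound = {bound:.1f}")
```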


d. (5 points) Show that if $B = L = 1$, we can set the step size $\eta$ and discretization $\epsilon$ in some way to obtain:

$\text{Regret}(u) = O(\sqrt{dT \log T}).$

Note that you do not need to optimize $\eta$ and $\epsilon$; just show that you get the desired regret bound. Give the particular values of $\eta$ and $\epsilon$ that achieve this.

Solution: Each term should be on the order of the regret, so let us try setting

$\eta = \epsilon = \sqrt{\frac{d \log T}{T}}.$

Then the regret can be upper bounded by

$\text{Regret}(u) \le d \log\left(2\sqrt{\frac{T}{d \log T}}\right)\sqrt{\frac{T}{d \log T}} + 3\sqrt{dT \log T}$  (8)

$\le d\left(\log 2 + \frac{1}{2}\log T\right)\sqrt{\frac{T}{d \log T}} + 3\sqrt{dT \log T},$  (9)

where we dropped the $d \log T$ inside the log, which can only increase the bound. Simplification yields the result.
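A quick numeric check (a sketch; `regret_terms` is a hypothetical helper, and the constant 3 in the assertion is a generous slack) that with $\eta = \epsilon = \sqrt{d \log T / T}$ each of the three terms in the part (c) bound is $O(\sqrt{dT \log T})$:

```python
import math

def regret_terms(d, T):
    """Evaluate the three terms of the part-(c) bound at B = L = 1
    with eta = eps = sqrt(d log T / T)."""
    eta = eps = math.sqrt(d * math.log(T) / T)
    term_cover = d * math.log(2 / eps) / eta  # d log(2/eps) / eta
    term_eg = 2 * eta * T                     # 2 eta (BL)^2 T
    term_disc = eps * T                       # eps T
    return term_cover, term_eg, term_disc

d, T = 5, 10**6
target = math.sqrt(d * T * math.log(T))
for term in regret_terms(d, T):
    assert term <= 3 * target  # each term is O(sqrt(d T log T))
print("all three terms within 3 * sqrt(d T log T)")
```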


2. FTL vs. OGD for repeated linear regression (15 points)

In this problem, we will consider fixed-design linear regression under sequential, repeated measurements. Formally, there is a fixed matrix $X \in \mathbb{R}^{n \times d}$, with $n \ge d$ and full column rank. At each time $t = 1, \dots, T$:

• The learner plays parameters $w_t \in S$ ($S$ is the set of allowed weight vectors).
• Nature provides target values $Y_t \in \mathbb{R}^n$.
• The learner suffers loss according to the loss function

$f_t(w) = \frac{1}{2n}\|Xw - Y_t\|_2^2.$  (10)

Note that in this problem, $X$ does not change with time. Throughout this problem, we use $\sigma_i(A)$ to denote the $i$'th largest singular value of a matrix $A$. You might also find the following useful: for a real matrix $A$ and vector $b$, $\|Ab\|_2 \le \sigma_1(A)\|b\|_2$.

a. (3 points) Give an expression for computing $w_t$ according to FTL. For this part, assume that $S = \mathbb{R}^d$. (The final expression should not contain any max or min expressions.)

Solution: The FTL criterion is

$w_t = \operatorname{argmin}_w \frac{1}{2n}\sum_{s=1}^{t-1}\|Xw - Y_s\|_2^2,$

which is (strongly) convex. Taking the derivative of the RHS and setting it to 0, we have the solution

$w_t = (X^\top X)^{-1} X^\top \left(\frac{1}{t-1}\sum_{s=1}^{t-1} Y_s\right).$  (11)
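The closed form is easy to verify against a direct least-squares solve (a sketch with random data and made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 20, 3, 6
X = rng.normal(size=(n, d))        # fixed design; full column rank w.h.p.
Ys = rng.normal(size=(t - 1, n))   # targets Y_1, ..., Y_{t-1}

# Eq. (11): w_t = (X^T X)^{-1} X^T ( (1/(t-1)) sum_s Y_s )
w_ftl = np.linalg.solve(X.T @ X, X.T @ Ys.mean(axis=0))

# Directly minimize sum_s ||X w - Y_s||^2 by stacking the t-1 problems
w_direct = np.linalg.lstsq(np.vstack([X] * (t - 1)), Ys.reshape(-1), rcond=None)[0]

assert np.allclose(w_ftl, w_direct)
print("FTL closed form matches direct least squares")
```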


b. (5 points) Now assume that we have a norm bound on the target values: $\|Y_t\|_2 \le R$ for all $t$. Prove that, with each $w_t$ chosen according to FTL (as in part (a)), the regret is bounded above as

$\text{Regret} \le O\left(\frac{R^2 \log T}{n}\right).$  (12)

As in the previous part, assume $S = \mathbb{R}^d$, and take $w_1 = 0$.

Hints:
• It may be notationally helpful to define $\Pi = X(X^\top X)^{-1}X^\top$.
• $\Pi$ is a projection matrix.

Solution: For convenience, define $\Pi = X(X^\top X)^{-1}X^\top$, so that $Xw_t = \Pi Z_t$ where $Z_t = \frac{1}{t-1}\sum_{s=1}^{t-1} Y_s$ (with $Z_1 = 0$, matching $w_1 = 0$). This analysis conceptually extends that of mean estimation (Example 6 in the notes) to having a projection matrix $\Pi$, but requires more work.

We use the one-step lookahead lemma, and bound the regret by bounding the sum over $f_t(w_t) - f_t(w_{t+1})$. Note that $Z_{t+1} = (1 - 1/t)Z_t + (1/t)Y_t$. From here, there are many ways to bound the one-step lookahead sum, some more algebra-heavy than others and leading to different constants that end up hidden in the $O(\cdot)$ notation. Below is a very explicit, algebra-heavy derivation (intentionally using no clever tricks) that obtains the bound with a good constant factor, namely 4, which is certainly sufficient but stronger than necessary.

Note that for vectors $a, b, y$,

$\|a - y\|^2 - \|b - y\|^2 = \|a\|^2 - 2a^\top y + \|y\|^2 - \|b\|^2 + 2b^\top y - \|y\|^2$  (13)
$= \|a\|^2 - \|b\|^2 + 2y^\top(b - a).$  (14)

Applying this with $a = \Pi Z_t$, $b = \Pi Z_{t+1}$, $y = Y_t$, and using $\Pi^\top = \Pi$ and $\Pi^2 = \Pi$, the one-step lookahead difference is hence

$2n[f_t(w_t) - f_t(w_{t+1})]$
$= \|\Pi Z_t - Y_t\|^2 - \|\Pi Z_{t+1} - Y_t\|^2$  (15)
$= \|\Pi Z_t\|^2 - \|\Pi Z_{t+1}\|^2 + 2Y_t^\top \Pi(Z_{t+1} - Z_t)$  (16)
$= \|\Pi Z_t\|^2 - \left\|\left(1 - \tfrac{1}{t}\right)\Pi Z_t + \tfrac{1}{t}\Pi Y_t\right\|^2 + \tfrac{2}{t} Y_t^\top \Pi(Y_t - Z_t)$  (17)
$= \|\Pi Z_t\|^2 - \left(1 - \tfrac{1}{t}\right)^2\|\Pi Z_t\|^2 - \tfrac{1}{t^2}\|\Pi Y_t\|^2 - \tfrac{2}{t}\left(1 - \tfrac{1}{t}\right)Y_t^\top \Pi Z_t + \tfrac{2}{t} Y_t^\top \Pi(Y_t - Z_t)$  (18)
$= \left(1 - \left(1 - \tfrac{1}{t}\right)^2\right)\|\Pi Z_t\|^2 - \tfrac{1}{t^2}\|\Pi Y_t\|^2 - \tfrac{2}{t}\left(1 - \tfrac{1}{t}\right)Y_t^\top \Pi Z_t + \tfrac{2}{t} Y_t^\top \Pi(Y_t - Z_t)$  (19)
$\le \tfrac{2}{t}\|\Pi Z_t\|^2 - \tfrac{1}{t^2}\|\Pi Y_t\|^2 - \tfrac{2}{t}\left(1 - \tfrac{1}{t}\right)Y_t^\top \Pi Z_t + \tfrac{2}{t} Y_t^\top \Pi(Y_t - Z_t)$  (20)
$\le \tfrac{2}{t}\|\Pi Z_t\|^2 - \tfrac{2}{t}\left(1 - \tfrac{1}{t}\right)Y_t^\top \Pi Z_t + \tfrac{2}{t} Y_t^\top \Pi(Y_t - Z_t)$  (21)
$= \tfrac{2}{t}\|\Pi Z_t\|^2 - \tfrac{2}{t}\left(2 - \tfrac{1}{t}\right)Y_t^\top \Pi Z_t + \tfrac{2}{t} Y_t^\top \Pi Y_t$  (22)
$\le \tfrac{2}{t}\|\Pi Z_t\|^2 + \tfrac{4}{t}\left|Y_t^\top \Pi Z_t\right| + \tfrac{2}{t} Y_t^\top \Pi Y_t$  (23)
$\le \tfrac{2}{t} R^2 + \tfrac{4}{t} R^2 + \tfrac{2}{t} R^2 = \tfrac{8}{t} R^2,$  (24)

where (20) uses $1 - (1 - 1/t)^2 = 2/t - 1/t^2 \le 2/t$, (21) drops the non-positive term $-\frac{1}{t^2}\|\Pi Y_t\|^2$, (23) uses $0 \le 2 - 1/t \le 2$, and (24) uses $\|\Pi Z_t\| \le \sigma_1(\Pi)\|Z_t\| \le \|Z_t\| \le R$ (since $Z_t$ is an average of the $Y_s$, each of norm at most $R$) together with Cauchy–Schwarz, and similarly $Y_t^\top \Pi Y_t = \|\Pi Y_t\|^2 \le R^2$. Dividing both sides by $2n$ and summing over $t$ (with $\sum_{t=1}^T 1/t \le \log(T) + 1$) then gives

$\text{Regret} \le \frac{4R^2(\log(T) + 1)}{n}.$  (25)
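Empirically, FTL's regret in this setting indeed stays bounded by an $O(R^2 \log T / n)$ envelope. A simulation sketch (`ftl_regret` is a hypothetical helper; the assertion uses a generous constant):

```python
import numpy as np

def ftl_regret(T, n=10, d=3, R=1.0, seed=0):
    """Simulate FTL on f_t(w) = ||Xw - Y_t||^2 / (2n) with ||Y_t||_2 <= R."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    P = X @ np.linalg.solve(X.T @ X, X.T)   # projection Pi onto col(X)
    Ys = rng.normal(size=(T, n))
    norms = np.linalg.norm(Ys, axis=1, keepdims=True)
    Ys *= R / np.maximum(norms, R)          # enforce ||Y_t||_2 <= R
    loss, Zsum = 0.0, np.zeros(n)
    for t in range(T):
        Zt = Zsum / max(t, 1)               # Z_1 = 0, i.e. w_1 = 0
        loss += np.sum((P @ Zt - Ys[t]) ** 2) / (2 * n)  # FTL predicts Pi Z_t
        Zsum += Ys[t]
    # best fixed w in hindsight predicts Pi Z_bar, the projected overall mean
    Zbar = Zsum / T
    best = np.sum((P @ Zbar - Ys) ** 2) / (2 * n)
    return loss - best

T, n, R = 2000, 10, 1.0
r = ftl_regret(T, n=n, R=R)
assert r <= 4 * R**2 * (np.log(T) + 1) / n  # O(R^2 log T / n), generous constant
print(f"regret = {r:.4f}")
```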

