Exam (2013) solutions

CS229T/STATS231 (Fall 2018–2019). Note: please do not copy or distribute.

1. Online learning with regularization (18 points)

Online gradient descent corresponds to optimizing linearizations of a function. In this problem, we will see that if the loss functions have the form
\[
f_t(w) = g_t(w) + h(w),
\]
then you can linearize around $g_t$ only to obtain a loss bound that does not depend on $h$. We will assume that $h$ is convex, $0 \in \arg\min_w h(w)$, and each $g_t$ is convex and differentiable.

Suppose that our set of experts is a convex set $S$. Consider the update equation
\[
w_{t+1} = \arg\min_{w \in S} \left[ \sum_{i=1}^{t} \nabla g_i(w_i) \cdot w + t\,h(w) + \tfrac{1}{2}\|w\|_2^2 \right]. \tag{1}
\]
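For concreteness, the sketch below (not part of the original exam) implements update (1) in the special case $S = \mathbb{R}^d$ and $h(w) = \lambda\|w\|_1$; both choices, and all function names, are illustrative assumptions. With this $h$, the minimizer has a closed form given by component-wise soft-thresholding of the running gradient sum.

```python
import numpy as np

# A minimal sketch of update (1), assuming S = R^d and h(w) = lam * ||w||_1
# (assumptions for illustration; they are not fixed by the problem).

def soft_threshold(z, c):
    """Component-wise solution of min_w  z*w + c*|w| + 0.5*w^2."""
    return np.sign(-z) * np.maximum(np.abs(z) - c, 0.0)

def regularized_ftl_update(grad_sum, t, lam):
    """w_{t+1} = argmin_w  grad_sum . w + t*lam*||w||_1 + 0.5*||w||_2^2."""
    return soft_threshold(grad_sum, t * lam)

# Example run with g_t(w) = 0.5*(w . x_t - y_t)^2, so grad g_t(w) = (w.x_t - y_t)*x_t.
rng = np.random.default_rng(0)
d, T, lam = 5, 50, 0.1
w = np.zeros(d)                      # w_1 = argmin 0.5*||w||_2^2 = 0
grad_sum = np.zeros(d)
for t in range(1, T + 1):
    x, y = rng.normal(size=d), rng.normal()
    grad = (w @ x - y) * x           # gradient of g_t at the current iterate w_t
    grad_sum += grad
    w = regularized_ftl_update(grad_sum, t, lam)   # this is w_{t+1}
```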

a. (5 points) Show that (1) corresponds to Follow The Leader (FTL) on functions $\hat f_0, \hat f_1, \ldots, \hat f_T$, where
\[
\hat f_0(w) = \tfrac{1}{2}\|w\|_2^2
\quad\text{and}\quad
\hat f_t(w) = \nabla g_t(w_t) \cdot w + h(w).
\]

Solution: Recall that Follow the Leader on functions $\hat f_0, \hat f_1, \ldots, \hat f_T$ with a convex set of experts $S$ corresponds to the update
\[
w_{t+1} = \arg\min_{w \in S} \sum_{i=0}^{t} \hat f_i(w). \tag{2}
\]
For $\hat f_i$ given above, we have
\begin{align*}
\sum_{i=0}^{t} \hat f_i(w)
&= \hat f_0(w) + \sum_{i=1}^{t} \hat f_i(w) \\
&= \tfrac{1}{2}\|w\|_2^2 + \sum_{i=1}^{t} \bigl( \nabla g_i(w_i) \cdot w + h(w) \bigr) \\
&= \sum_{i=1}^{t} \nabla g_i(w_i) \cdot w + t\,h(w) + \tfrac{1}{2}\|w\|_2^2.
\end{align*}

Substituting into (2) yields (1).


b. (5 points) Apply the standard regret bound for FTL from class to obtain the following (feel free to cite the bound without proof as long as you give the bound in its general form and indicate how it applies in this situation):
\[
\sum_{t=1}^{T} \bigl[ \nabla g_t(w_t) \cdot (w_t - u) + h(w_t) - h(u) \bigr]
\le \tfrac{1}{2}\|u\|_2^2 + \sum_{t=1}^{T} \bigl[ \nabla g_t(w_t) \cdot (w_t - w_{t+1}) + h(w_t) - h(w_{t+1}) \bigr].
\]

Solution: The bound in class was
\[
\sum_{t=0}^{T} \bigl[ \hat f_t(w_t) - \hat f_t(u) \bigr] \le \sum_{t=0}^{T} \bigl[ \hat f_t(w_t) - \hat f_t(w_{t+1}) \bigr]. \tag{3}
\]
Applying it in this situation yields
\begin{align*}
\tfrac{1}{2}\bigl(\|w_0\|_2^2 - \|u\|_2^2\bigr) + \sum_{t=1}^{T} \bigl( \nabla g_t(w_t) \cdot (w_t - u) + h(w_t) - h(u) \bigr)
\le{}& \tfrac{1}{2}\bigl(\|w_0\|_2^2 - \|w_1\|_2^2\bigr) \\
&+ \sum_{t=1}^{T} \bigl( \nabla g_t(w_t) \cdot (w_t - w_{t+1}) + h(w_t) - h(w_{t+1}) \bigr).
\end{align*}
Re-arranging yields
\[
\sum_{t=1}^{T} \bigl( \nabla g_t(w_t) \cdot (w_t - u) + h(w_t) - h(u) \bigr)
\le \tfrac{1}{2}\|u\|_2^2 - \tfrac{1}{2}\|w_1\|_2^2 + \sum_{t=1}^{T} \bigl( \nabla g_t(w_t) \cdot (w_t - w_{t+1}) + h(w_t) - h(w_{t+1}) \bigr). \tag{4}
\]
Observing that $-\tfrac{1}{2}\|w_1\|_2^2 \le 0$ yields the desired inequality.


c. (5 points) Based on (b), show that:
\[
\sum_{t=1}^{T} \bigl[ f_t(w_t) - f_t(u) \bigr]
\le \tfrac{1}{2}\|u\|_2^2 + \sum_{t=1}^{T} \bigl[ \nabla g_t(w_t) \cdot (w_t - w_{t+1}) \bigr].
\]

Solution: First, note that
\[
w_1 = \arg\min_{w \in S} \tfrac{1}{2}\|w\|_2^2 = 0.
\]
Now, if we apply the bound from the previous part and note that $\sum_{t=1}^{T} h(w_t) - h(w_{t+1})$ telescopes, we obtain
\begin{align*}
\sum_{t=1}^{T} \bigl[ \nabla g_t(w_t) \cdot (w_t - u) + h(w_t) - h(u) \bigr]
&\le \tfrac{1}{2}\|u\|_2^2 + \sum_{t=1}^{T} \bigl[ \nabla g_t(w_t) \cdot (w_t - w_{t+1}) + h(w_t) - h(w_{t+1}) \bigr] \\
&= \tfrac{1}{2}\|u\|_2^2 + h(w_1) - h(w_{T+1}) + \sum_{t=1}^{T} \nabla g_t(w_t) \cdot (w_t - w_{t+1}) \\
&= \tfrac{1}{2}\|u\|_2^2 + h(0) - h(w_{T+1}) + \sum_{t=1}^{T} \nabla g_t(w_t) \cdot (w_t - w_{t+1}) \\
&\le \tfrac{1}{2}\|u\|_2^2 + \sum_{t=1}^{T} \nabla g_t(w_t) \cdot (w_t - w_{t+1}),
\end{align*}
where the last inequality follows from the fact that the global minimum of $h(w)$ is at $w = 0$, so $h(0) - h(w_{T+1}) \le 0$. Finally, since each $g_t$ is convex, $g_t(w_t) - g_t(u) \le \nabla g_t(w_t) \cdot (w_t - u)$, and hence $f_t(w_t) - f_t(u) \le \nabla g_t(w_t) \cdot (w_t - u) + h(w_t) - h(u)$. Combining this with the chain of inequalities above yields the desired result.
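As a numerical sanity check (not part of the original solution), the sketch below runs update (1) on a small synthetic stream and verifies the inequality from part (c); it again assumes $S = \mathbb{R}^d$, squared-loss $g_t$, and $h(w) = \lambda\|w\|_1$, all of which are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, lam = 5, 100, 0.1

def soft_threshold(z, c):
    # argmin_w  z.w + c*||w||_1 + 0.5*||w||_2^2, component-wise
    return np.sign(-z) * np.maximum(np.abs(z) - c, 0.0)

def f(w, x, y):
    # f_t(w) = g_t(w) + h(w), with g_t(w) = 0.5*(w.x - y)^2 and h(w) = lam*||w||_1
    return 0.5 * (w @ x - y) ** 2 + lam * np.abs(w).sum()

xs, ys = rng.normal(size=(T, d)), rng.normal(size=T)
u = rng.normal(size=d)               # any fixed comparator

w, grad_sum = np.zeros(d), np.zeros(d)   # w_1 = 0
regret, penalty = 0.0, 0.0
for t in range(T):
    grad = (w @ xs[t] - ys[t]) * xs[t]            # grad of g_t at w_t
    grad_sum += grad
    w_next = soft_threshold(grad_sum, (t + 1) * lam)  # update (1)
    regret += f(w, xs[t], ys[t]) - f(u, xs[t], ys[t])
    penalty += grad @ (w - w_next)                # grad g_t(w_t) . (w_t - w_{t+1})
    w = w_next

bound = 0.5 * u @ u + penalty
assert regret <= bound + 1e-9, (regret, bound)
print(f"regret = {regret:.3f} <= bound = {bound:.3f}")
```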

d. (3 points) Explain briefly why we should expect to typically get a smaller regret with this algorithm than if we run online gradient descent directly on the original functions $f_t$.

Solution: The key point is that we are dealing with $h$ exactly rather than with a linearization of $h$. If we had linearized $h$ as well, then the online gradient descent regret bound would incur a penalty based on the squared norm of $\nabla g_t(w_t) + \nabla h(w_t)$, which is typically larger than the squared norm of $\nabla g_t(w_t)$ alone.
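A concrete illustration (an assumed instance, not from the exam): take $h(w) = \tfrac{\lambda}{2}\|w\|_2^2$ with large $\lambda$.

```latex
% Illustration (assumed instance): h(w) = (lambda/2) ||w||_2^2 with lambda large.
% If h were linearized too, each round's gradient would be grad g_t(w_t) + lambda w_t,
% and the OGD regret penalty would scale with
\[
  \sum_{t=1}^{T} \bigl\| \nabla g_t(w_t) + \lambda w_t \bigr\|_2^2 ,
\]
% which for large lambda can far exceed
\[
  \sum_{t=1}^{T} \bigl\| \nabla g_t(w_t) \bigr\|_2^2 ,
\]
% whereas update (1) treats h exactly and, by part (c), pays only the
% \nabla g_t(w_t) \cdot (w_t - w_{t+1}) terms.
```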


2. Multi-task regression (20 points)

In multi-task regression, we have $K$ regression tasks. Each input is a pair $(i, x)$, where $i \in \{1, \ldots, K\}$ identifies the task and $x \in \mathcal{X}$ is an input; the output is $y \in \mathbb{R}$. We maintain a weight vector for each task:
\[
w = \begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} \in \mathbb{R}^{Kd},
\quad\text{where each } w_i \in \mathbb{R}^d.
\]
For each task $i$, define a feature map $\phi_i : \mathcal{X} \to \mathbb{R}^d$. The loss function on a single data point is:
\[
\ell((i, x, y), w) = (y - w_i \cdot \phi_i(x))^2.
\]

We believe the tasks are related, so we'd like to regularize the weights towards each other. Define a regularizer as
\[
\Omega(w) = \sum_{i=1}^{K} \|w_i\|_2^2 + \sum_{i < j} \|w_i - w_j\|_2^2.
\]
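A minimal sketch (not from the exam) of the per-example loss and the regularizer above; the `(K, d)` weight layout, the `phi` callback, and the pairwise sum over $i < j$ are all illustrative assumptions.

```python
import numpy as np

# Sketch of the multi-task objective pieces. Assumptions (for illustration only):
# w is stored as a (K, d) array, and phi(i, x) returns the feature vector phi_i(x).

def loss(w, i, x, y, phi):
    """Squared loss on a single example (i, x, y): (y - w_i . phi_i(x))^2."""
    return (y - w[i] @ phi(i, x)) ** 2

def regularizer(w):
    """sum_i ||w_i||_2^2 + sum_{i<j} ||w_i - w_j||_2^2."""
    K = w.shape[0]
    norm_term = np.sum(w ** 2)
    pair_term = sum(np.sum((w[i] - w[j]) ** 2)
                    for i in range(K) for j in range(i + 1, K))
    return norm_term + pair_term

# Example usage with the trivial feature map phi_i(x) = x.
K, d = 3, 4
rng = np.random.default_rng(0)
w = rng.normal(size=(K, d))
x, y = rng.normal(size=d), 1.0
print(loss(w, 0, x, y, lambda i, x: x), regularizer(w))
```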