
Logistic Regression: From Binary to Multi-Class

Shuiwang Ji

Texas A&M University College Station, TX 77843

sji@tamu.edu

Yaochen Xie

Texas A&M University College Station, TX 77843

ethanycx@tamu.edu

1 Introduction

This document aims to provide a complete, self-contained, and easy-to-understand introduction to multi-class logistic regression (LR). We start with a quick review of binary LR and then generalize it to the multi-class case. We further discuss the connections between binary LR and multi-class LR. This document is based on lecture notes by Shuiwang Ji and compiled by Yaochen Xie at Texas A&M University. It can be used for undergraduate- and graduate-level classes.

2 Binary Logistic Regression

The binary LR predicts the label $y_i \in \{-1, +1\}$ for a given sample $x_i$ by estimating the probability $P(y|x_i)$ and comparing it with a pre-defined threshold.

Recall the sigmoid function is defined as

$$\sigma(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}, \qquad (1)$$

where $s \in \mathbb{R}$ and $\sigma$ denotes the sigmoid function. The sigmoid maps any value in $\mathbb{R}$ to a number in $(0, 1)$ and meanwhile preserves the order of any two input numbers, as $\sigma(\cdot)$ is a monotonically increasing function.

The probability is thus represented by

$$P(y|x) = \begin{cases} \sigma(w^T x) & \text{if } y = +1 \\ 1 - \sigma(w^T x) & \text{if } y = -1. \end{cases}$$

This can also be expressed compactly as

$$P(y|x) = \sigma(y w^T x), \qquad (2)$$

due to the fact that $\sigma(-s) = 1 - \sigma(s)$. Note that in the binary case, we only need to estimate one probability, as the probabilities for $+1$ and $-1$ sum to one.
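As a concrete illustration, below is a minimal NumPy sketch of the binary LR prediction rule based on Equations (1) and (2). The weight vector w, the sample x, and the threshold value of 0.5 are illustrative assumptions, not values from the notes.

import numpy as np

def sigmoid(s):
    """Sigmoid function from Equation (1): maps any real s to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def predict_binary(w, x, threshold=0.5):
    """Predict a label in {-1, +1} by comparing P(y = +1 | x) with a threshold."""
    p_positive = sigmoid(w @ x)          # P(y = +1 | x) = sigma(w^T x)
    return +1 if p_positive >= threshold else -1

# Illustrative example (w and x are made-up values).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.2])
print(predict_binary(w, x))              # prints +1 here, since sigma(0.6) > 0.5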

3 Multi-Class Logistic Regression

The binary logistic regression assumes that the label $y_i \in \{-1, +1\}$ ($i = 1, \ldots, N$), while in the multi-class case there are more than two classes, i.e., $y_i \in \{1, 2, \ldots, K\}$ ($i = 1, \ldots, N$), where $K$ is the number of classes and $N$ is the number of samples. In this case, we need to estimate the probability for each of the $K$ classes. The hypothesis in binary LR is hence generalized to the multi-class case as

$$h_w(x) = \begin{bmatrix} P(y = 1|x; w) \\ P(y = 2|x; w) \\ \vdots \\ P(y = K|x; w) \end{bmatrix}. \qquad (3)$$

A critical assumption here is that there is no ordinal relationship between the classes. So we will need one linear signal for each of the $K$ classes, which should be independent conditioned on $x$. As a result, in the multi-class LR, we compute $K$ linear signals by the dot product between the input $x$ and $K$ independent weight vectors $w_k$, $k = 1, \ldots, K$, as

$$\begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix}. \qquad (4)$$

So far, the only thing left to obtain the hypothesis is to map the $K$ linear outputs (as a vector in $\mathbb{R}^K$) to the $K$ probabilities (as a probability distribution over the $K$ classes).

3.1 Softmax

In order to accomplish such a mapping, we introduce the softmax function, which is generalized from the sigmoid function and defined as below. Given a $K$-dimensional vector $v = [v_1, v_2, \ldots, v_K]^T \in \mathbb{R}^K$,

$$\mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^{K} e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}. \qquad (5)$$

It is easy to verify that the softmax maps a vector in $\mathbb{R}^K$ to $(0, 1)^K$. All elements in the output vector of softmax sum to 1 and their order is preserved. Thus the hypothesis in (3) can be written as

$$h_w(x) = \begin{bmatrix} P(y = 1|x; w) \\ P(y = 2|x; w) \\ \vdots \\ P(y = K|x; w) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{w_k^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_K^T x} \end{bmatrix}. \qquad (6)$$

In Section 4, we will further discuss the connection between the softmax function and the sigmoid function by showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when $K = 2$.
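To make Equations (5) and (6) concrete, here is a minimal NumPy sketch of the softmax function and the multi-class hypothesis. Stacking the K weight vectors as rows of a matrix W, as well as the particular values of W and x, are illustrative assumptions.

import numpy as np

def softmax(v):
    """Softmax from Equation (5): maps a vector in R^K to a distribution in (0, 1)^K."""
    exp_v = np.exp(v)
    return exp_v / exp_v.sum()

def hypothesis(W, x):
    """Multi-class hypothesis from Equation (6): softmax of the K linear signals w_k^T x."""
    signals = W @ x                      # K linear signals from Equation (4), one per class
    return softmax(signals)

# Illustrative example with K = 3 classes and 2 features (made-up values).
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])
x = np.array([0.6, 1.2])
probs = hypothesis(W, x)
print(probs, probs.sum())                # K probabilities summing to 1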

3.2 Cross Entropy

We optimize the multi-class LR by minimizing a loss (cost) function that measures the error between the predictions and the true labels, as we did in the binary LR. To this end, we introduce the cross entropy in Equation (7), which measures the distance between two probability distributions.

The cross entropy is defined by

$$H(P, Q) = -\sum_{i=1}^{K} p_i \log(q_i), \qquad (7)$$

where $P = (p_1, \ldots, p_K)$ and $Q = (q_1, \ldots, q_K)$ are two probability distributions. In multi-class LR, the two probability distributions are the true distribution and the predicted vector in Equation (3), respectively.

Here the true distribution refers to the one-hot encoding of the label. For label $k$ (where $k$ is the correct class), the one-hot encoding is defined as a vector whose element at index $k$ is 1 and whose elements are 0 everywhere else.
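As a quick sketch, the cross entropy in Equation (7) between a one-hot true distribution and a predicted distribution can be computed as below. The example vectors and the eps guard are illustrative assumptions, not part of the notes.

import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(P, Q) = -sum_i p_i log(q_i) from Equation (7).
    eps guards against log(0); it is a numerical convenience only."""
    return -np.sum(p * np.log(q + eps))

# One-hot encoding for a 3-class problem with the correct class at index 1 (0-based in code).
p = np.array([0.0, 1.0, 0.0])
q = np.array([0.2, 0.7, 0.1])            # an illustrative predicted distribution
print(cross_entropy(p, q))               # equals -log(0.7) since p is one-hot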


3.3 Loss Function

Now the loss for a training sample $x$ in class $c$ is given by

$$\begin{aligned}
\mathrm{loss}(x, y; w) &= H(y, \hat{y}) \qquad (8) \\
&= -\sum_{k} y_k \log \hat{y}_k \qquad (9) \\
&= -\log \hat{y}_c \qquad (10) \\
&= -\log \frac{e^{w_c^T x}}{\sum_{k=1}^{K} e^{w_k^T x}}, \qquad (11)
\end{aligned}$$

where $y$ denotes the one-hot vector and $\hat{y}$ is the predicted distribution $h_w(x)$. The loss on all samples $\{(x_i, y_i)\}_{i=1}^{N}$ is

$$\mathrm{loss}(X, Y; w) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \log \frac{e^{w_k^T x_i}}{\sum_{j=1}^{K} e^{w_j^T x_i}}. \qquad (12)$$
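Below is a minimal sketch of the total loss in Equation (12) over a toy dataset. The arrays X, y, and W are made-up values, and labels are taken as integers in {0, ..., K-1} for indexing convenience, which differs from the notes' 1-based class labels.

import numpy as np

def total_loss(W, X, y):
    """Cross-entropy loss from Equation (12), summed over all N samples.
    W: (K, d) weight matrix, X: (N, d) samples, y: (N,) integer labels in {0, ..., K-1}."""
    signals = X @ W.T                                     # (N, K) linear signals w_k^T x_i
    exp_s = np.exp(signals)
    probs = exp_s / exp_s.sum(axis=1, keepdims=True)      # softmax row by row
    return -np.sum(np.log(probs[np.arange(len(y)), y]))   # -sum_i log p_{i, y_i}

# Illustrative toy data: N = 4 samples, d = 2 features, K = 3 classes.
X = np.array([[0.5, 1.0], [1.5, -0.2], [-1.0, 0.3], [0.0, 2.0]])
y = np.array([0, 2, 1, 0])
W = np.zeros((3, 2))                      # with zero weights, every class gets probability 1/3
print(total_loss(W, X, y))                # equals 4 * log(3)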

4 Properties of Multi-class LR

4.1 Shift-invariance in Parameters

The softmax function in multi-class LR has an invariance property with respect to shifting the parameters. Given the weights $w = (w_1, \ldots, w_K)$, if we subtract the same vector $u$ from each of the $K$ weight vectors, the outputs of the softmax function remain the same.

Proof. To prove this, let us denote $w' = \{w'_i\}_{i=1}^{K}$, where $w'_i = w_i - u$. We have

$$\begin{aligned}
P(y = k|x; w') &= \frac{e^{(w_k - u)^T x}}{\sum_{i=1}^{K} e^{(w_i - u)^T x}} \qquad (13) \\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\sum_{i=1}^{K} e^{w_i^T x} e^{-u^T x}} \qquad (14) \\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\left(\sum_{i=1}^{K} e^{w_i^T x}\right) e^{-u^T x}} \qquad (15) \\
&= \frac{e^{w_k^T x}}{\sum_{i=1}^{K} e^{w_i^T x}} \qquad (16) \\
&= P(y = k|x; w), \qquad (17)
\end{aligned}$$

which completes the proof.
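The shift-invariance can be checked numerically, and it also underlies the common trick of subtracting the maximum signal before exponentiating for numerical stability. The sketch below uses made-up signal values, and the shift constant plays the role of u^T x.

import numpy as np

def softmax(v):
    return np.exp(v) / np.exp(v).sum()

# Shifting every linear signal by the same constant leaves the softmax output unchanged.
signals = np.array([2.0, -1.0, 0.5])      # illustrative values of w_k^T x
shift = 3.7                               # plays the role of u^T x
print(np.allclose(softmax(signals), softmax(signals - shift)))       # True

# The same property justifies the numerically stable form often used in practice:
def stable_softmax(v):
    v = v - v.max()                       # subtract the max signal; output is unchanged
    return np.exp(v) / np.exp(v).sum()

print(np.allclose(softmax(signals), stable_softmax(signals)))        # True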

4.2 Equivalence to Sigmoid

Having proved the shift-invariance, we are able to show that when $K = 2$, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.


Proof.

$$\begin{aligned}
h_w(x) &= \frac{1}{e^{w_1^T x} + e^{w_2^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \end{bmatrix} \qquad (18) \\
&= \frac{1}{e^{(w_1 - w_1)^T x} + e^{(w_2 - w_1)^T x}} \begin{bmatrix} e^{(w_1 - w_1)^T x} \\ e^{(w_2 - w_1)^T x} \end{bmatrix} \qquad (19) \\
&= \begin{bmatrix} \frac{1}{1 + e^{(w_2 - w_1)^T x}} \\ \frac{e^{(w_2 - w_1)^T x}}{1 + e^{(w_2 - w_1)^T x}} \end{bmatrix} \qquad (20) \\
&= \begin{bmatrix} \frac{1}{1 + e^{-\hat{w}^T x}} \\ \frac{e^{-\hat{w}^T x}}{1 + e^{-\hat{w}^T x}} \end{bmatrix} \qquad (21) \\
&= \begin{bmatrix} \frac{1}{1 + e^{-\hat{w}^T x}} \\ 1 - \frac{1}{1 + e^{-\hat{w}^T x}} \end{bmatrix}
= \begin{bmatrix} h_{\hat{w}}(x) \\ 1 - h_{\hat{w}}(x) \end{bmatrix}, \qquad (22)
\end{aligned}$$

where $\hat{w} = w_1 - w_2$. This completes the proof.
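A small numerical check of the K = 2 equivalence, assuming illustrative weight vectors: the two-class softmax output with weights w1, w2 matches the sigmoid of (w1 - w2)^T x and its complement.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(v):
    return np.exp(v) / np.exp(v).sum()

# Illustrative two-class weights and a sample (made-up values).
w1 = np.array([0.4, -1.2, 0.7])
w2 = np.array([-0.3, 0.5, 1.0])
x  = np.array([1.0, 0.2, -0.5])

two_class = softmax(np.array([w1 @ x, w2 @ x]))          # softmax-based hypothesis, K = 2
w_hat = w1 - w2
binary = sigmoid(w_hat @ x)                              # sigmoid-based hypothesis

print(np.allclose(two_class, [binary, 1.0 - binary]))    # True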

4.3 Relations between binary and multi-class LR

In the assignment, we have already proved that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes. We show the proof again below.

Proof.

$$\begin{aligned}
\arg\min_{w} E_{\mathrm{in}}(w) &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right) \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\sigma(y_n w^T x_n)} \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n|x_n)} \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ \mathbb{I}[y_n = +1] \ln \frac{1}{P(y_n|x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{P(y_n|x_n)} \right] \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ \mathbb{I}[y_n = +1] \ln \frac{1}{h(x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{1 - h(x_n)} \right] \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ p_n \log \frac{1}{q_n} + (1 - p_n) \log \frac{1}{1 - q_n} \right] \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} H(\{p_n, 1 - p_n\}, \{q_n, 1 - q_n\}),
\end{aligned}$$

where $p_n = \mathbb{I}[y_n = +1]$, $q_n = h(x_n)$, and $h(x) = \sigma(w^T x)$. This completes the proof.

The equivalence between the logistic regression loss and the cross-entropy loss, as proved above, shows that minimizing the two losses always yields identical weights $w$. The equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that binary logistic regression is a particular case of multi-class logistic regression when $K = 2$.
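The per-sample equivalence above can also be verified numerically. The sketch below checks that ln(1 + e^{-y w^T x}) equals the binary cross entropy with p = I[y = +1] and q = sigma(w^T x); the weights, sample, and label are made-up values.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Illustrative sample and weights (made-up values).
w = np.array([0.8, -0.4])
x = np.array([1.5, 2.0])
y = -1                                        # label in {-1, +1}

logistic_loss = np.log(1.0 + np.exp(-y * (w @ x)))

p = 1.0 if y == +1 else 0.0                   # p = I[y = +1]
q = sigmoid(w @ x)                            # q = h(x)
cross_entropy = -(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

print(np.allclose(logistic_loss, cross_entropy))   # True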

5 Derivative of multi-class LR

To optimize the multi-class LR by gradient descent, we now derive the derivatives of the softmax and the cross entropy. The derivative of the loss function can then be obtained by the chain rule.


5.1 Derivative of softmax

Let $p_i$ denote the $i$-th element of $\mathrm{softmax}(a)$. Then for $j = i$, we have

$$\begin{aligned}
\frac{\partial p_i}{\partial a_j} = \frac{\partial p_i}{\partial a_i} &= \frac{\partial}{\partial a_i} \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \qquad (23) \\
&= \frac{e^{a_i} \sum_{k=1}^{K} e^{a_k} - e^{2 a_i}}{\left(\sum_{k=1}^{K} e^{a_k}\right)^2} \qquad (24) \\
&= \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \cdot \frac{\sum_{k=1}^{K} e^{a_k} - e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \qquad (25) \\
&= p_i (1 - p_i) \qquad (26) \\
&= p_i (1 - p_j). \qquad (27)
\end{aligned}$$

And for $j \neq i$,

$$\begin{aligned}
\frac{\partial p_i}{\partial a_j} &= \frac{\partial}{\partial a_j} \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \qquad (28) \\
&= \frac{0 - e^{a_i} e^{a_j}}{\left(\sum_{k=1}^{K} e^{a_k}\right)^2} \qquad (29) \\
&= -\frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \cdot \frac{e^{a_j}}{\sum_{k=1}^{K} e^{a_k}} \qquad (30) \\
&= -p_i p_j. \qquad (31)
\end{aligned}$$

If we unify the two cases with the Kronecker delta, we will have

$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j), \quad \text{where} \quad \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$
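A minimal sketch of the softmax Jacobian, with entries p_i(delta_ij - p_j), checked against a finite-difference approximation; the input vector a is an illustrative assumption.

import numpy as np

def softmax(a):
    return np.exp(a) / np.exp(a).sum()

def softmax_jacobian(a):
    """Jacobian of softmax: J[i, j] = p_i * (delta_ij - p_j)."""
    p = softmax(a)
    return np.diag(p) - np.outer(p, p)

# Finite-difference check on an illustrative input (made-up values).
a = np.array([0.3, -1.0, 2.0])
eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[j] += eps
    a_minus[j] -= eps
    numeric[:, j] = (softmax(a_plus) - softmax(a_minus)) / (2 * eps)

print(np.allclose(softmax_jacobian(a), numeric, atol=1e-6))   # True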

5.2 Derivative of cross entropy loss with softmax

The cross-entropy loss is given by

$$L = -\sum_{i} y_i \log(p_i),$$

where $p_i = \mathrm{softmax}_i(a) = \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}}$ and $y_i$ denotes the $i$-th element of the one-hot vector. The derivative of the cross entropy is

$$\begin{aligned}
\frac{\partial L}{\partial a_k} &= -\sum_{i} y_i \frac{\partial \log(p_i)}{\partial a_k} \qquad (32) \\
&= -\sum_{i} y_i \frac{\partial \log(p_i)}{\partial p_i} \cdot \frac{\partial p_i}{\partial a_k} \qquad (33) \\
&= -\sum_{i} y_i \frac{1}{p_i} \cdot \frac{\partial p_i}{\partial a_k} \qquad (34) \\
&= -\sum_{i} y_i \frac{1}{p_i} \cdot p_i (\delta_{ki} - p_k) \qquad (35) \\
&= -y_k (1 - p_k) + \sum_{i \neq k} y_i p_k \qquad (36) \\
&= p_k \sum_{i=1}^{K} y_i - y_k \qquad (37) \\
&= p_k - y_k. \qquad (38)
\end{aligned}$$

