Logistic Regression: From Binary to Multi-Class

Yaochen Xie

Texas A&M University

College Station, TX 77843

ethanycx@tamu.edu

Shuiwang Ji

Texas A&M University

College Station, TX 77843

sji@tamu.edu

1 Introduction

This document aims to provide a complete, self-contained, and easy-to-understand introduction to multi-class logistic regression (LR). We start with a quick review of binary LR and then generalize it to the multi-class case. We further discuss the connections between binary LR and multi-class LR. This document is based on lecture notes by Shuiwang Ji and compiled by Yaochen Xie at Texas A&M University. It can be used for undergraduate- and graduate-level classes.

2 Binary Logistic Regression

The binary LR predicts the label $y_i \in \{-1, +1\}$ for a given sample $x_i$ by estimating a probability $P(y|x_i)$ and comparing it with a pre-defined threshold.

Recall the sigmoid function is defined as
$$\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}, \qquad (1)$$
where $s \in \mathbb{R}$ and $\theta$ denotes the sigmoid function. $\theta$ maps any value in $\mathbb{R}$ to a number in $(0, 1)$ and meanwhile preserves the order of any two input numbers, as $\theta(\cdot)$ is a monotonically increasing function.

The probability is thus represented by
$$P(y|x) = \begin{cases} \theta(w^T x) & \text{if } y = +1, \\ 1 - \theta(w^T x) & \text{if } y = -1. \end{cases}$$
This can also be expressed compactly as
$$P(y|x) = \theta(y w^T x), \qquad (2)$$
due to the fact that $\theta(-s) = 1 - \theta(s)$. Note that in the binary case, we only need to estimate one probability, as the probabilities for $+1$ and $-1$ sum to one.
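As a concrete illustration, the following minimal sketch (assuming NumPy; the weight vector, input, and function names are hypothetical) computes $P(y=+1|x) = \theta(w^T x)$ and thresholds it at 0.5 to produce a label in $\{-1, +1\}$.

```python
import numpy as np

def sigmoid(s):
    """theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def binary_lr_predict(w, x, threshold=0.5):
    """Return (P(y=+1|x), predicted label in {-1, +1})."""
    p_pos = sigmoid(w @ x)              # theta(w^T x) = P(y = +1 | x)
    label = 1 if p_pos >= threshold else -1
    return p_pos, label

# Toy weight vector and input (hypothetical values).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.2, 0.3])
print(binary_lr_predict(w, x))
```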

3 Multi-Class Logistic Regression

The binary logistic regression assumes that the label $y_i \in \{-1, +1\}$ $(i = 1, \cdots, N)$, while in the multi-class case there are more than two classes, i.e., $y_i \in \{1, 2, \cdots, K\}$ $(i = 1, \cdots, N)$, where $K$ is the number of classes and $N$ is the number of samples. In this case, we need to estimate the probability for each of the $K$ classes. The hypothesis in binary LR is hence generalized to the multi-class case as

$$h_w(x) = \begin{bmatrix} P(y = 1|x; w) \\ P(y = 2|x; w) \\ \vdots \\ P(y = K|x; w) \end{bmatrix}. \qquad (3)$$

A critical assumption here is that there is no ordinal relationship between the classes. So we will

need one linear signal for each of the K classes, which should be independent conditioned on x. As

a result, in the multi-class LR, we compute K linear signals by the dot product between the input x

and $K$ independent weight vectors $w_k$, $k = 1, \cdots, K$, as

$$\begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix}. \qquad (4)$$

So far, the only thing left to obtain the hypothesis is to map the $K$ linear outputs (as a vector in $\mathbb{R}^K$) to the $K$ probabilities (as a probability distribution among the $K$ classes).
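For illustration, the $K$ linear signals in (4) can be computed in a single matrix-vector product by stacking the $K$ weight vectors as the rows of a matrix $W \in \mathbb{R}^{K \times d}$. A minimal NumPy sketch with hypothetical toy sizes:

```python
import numpy as np

K, d = 3, 4                      # number of classes and input dimension (toy sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))      # row k holds w_k^T
x = rng.normal(size=d)

signals = W @ x                  # the vector [w_1^T x, ..., w_K^T x], shape (K,)
print(signals)
```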

3.1 Softmax

In order to accomplish such a mapping, we introduce the softmax function, which is generalized from the sigmoid function and defined as below. Given a $K$-dimensional vector $v = [v_1, v_2, \cdots, v_K]^T \in \mathbb{R}^K$,
$$\mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^K e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}. \qquad (5)$$

It is easy to verify that the softmax maps a vector in $\mathbb{R}^K$ to $(0, 1)^K$. All elements in the output vector of softmax sum to 1 and their orders are preserved. Thus the hypothesis in (3) can be written as

$$h_w(x) = \begin{bmatrix} P(y = 1|x; w) \\ P(y = 2|x; w) \\ \vdots \\ P(y = K|x; w) \end{bmatrix} = \frac{1}{\sum_{k=1}^K e^{w_k^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_K^T x} \end{bmatrix}. \qquad (6)$$

We will further discuss the connection between the softmax function and the sigmoid function by

showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when K = 2

in Section 4.
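A minimal sketch of the softmax in (5) and the hypothesis in (6), assuming NumPy. Subtracting the maximum entry before exponentiating is a common numerical-stability trick; it does not change the result because the softmax is invariant to adding the same constant to every input (closely related to the shift-invariance property discussed in Section 4.1).

```python
import numpy as np

def softmax(v):
    """Map a vector in R^K to a probability distribution over K classes."""
    z = v - np.max(v)            # stability shift; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

def hypothesis(W, x):
    """h_w(x) in Equation (6): softmax of the K linear signals."""
    return softmax(W @ x)

W = np.array([[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]])   # toy weights, K=3, d=2
x = np.array([0.4, -1.2])
p = hypothesis(W, x)
print(p, p.sum())                # the probabilities sum to 1
```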

3.2 Cross Entropy

We optimize the multi-class LR by minimizing a loss (cost) function that measures the error between predictions and the true labels, as we did in binary LR. To this end, we introduce the cross entropy in Equation (7), which measures the distance between two probability distributions.

The cross entropy is defined by
$$H(P, Q) = -\sum_{i=1}^K p_i \log(q_i), \qquad (7)$$

where $P = (p_1, \cdots, p_K)$ and $Q = (q_1, \cdots, q_K)$ are two probability distributions. In multi-class LR, the two probability distributions are the true distribution and the predicted vector in Equation (3), respectively.

Here the true distribution refers to the one-hot encoding of the label. For label $k$ (i.e., $k$ is the correct class), the one-hot encoding is defined as a vector whose element at index $k$ is 1 and whose elements are 0 everywhere else.
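The sketch below (assuming NumPy; the helper names are hypothetical) builds the one-hot encoding of a label and evaluates the cross entropy (7) against a predicted distribution.

```python
import numpy as np

def one_hot(k, K):
    """One-hot encoding of class k among K classes (labels are 1, ..., K)."""
    y = np.zeros(K)
    y[k - 1] = 1.0
    return y

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_i p_i log q_i; eps guards against log(0)."""
    return -np.sum(p * np.log(q + eps))

y = one_hot(2, K=3)              # the true class is 2
q = np.array([0.1, 0.7, 0.2])    # a predicted distribution
print(cross_entropy(y, q))       # approximately -log(0.7)
```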


3.3 Loss Function

Now the loss for a training sample x in class c is given by

$$\begin{aligned}
\mathrm{loss}(x, y; w) &= H(y, \hat{y}) && (8) \\
&= -\sum_k y_k \log \hat{y}_k && (9) \\
&= -\log \hat{y}_c && (10) \\
&= -\log \frac{e^{w_c^T x}}{\sum_{k=1}^K e^{w_k^T x}}, && (11)
\end{aligned}$$
where $y$ denotes the one-hot vector and $\hat{y}$ is the predicted distribution $h(x)$. The loss on all samples $(x_i, y_i)_{i=1}^N$ is
$$\mathrm{loss}(X, Y; w) = -\sum_{i=1}^N \sum_{k=1}^K \mathbb{I}[y_i = k] \log \frac{e^{w_k^T x_i}}{\sum_{k'=1}^K e^{w_{k'}^T x_i}}. \qquad (12)$$
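Putting the pieces together, here is a minimal sketch of the total loss (12), assuming NumPy; $X$ stores one sample per row, the labels in $y$ take values in $\{1, \cdots, K\}$, and all names and toy values are hypothetical.

```python
import numpy as np

def multiclass_lr_loss(W, X, y, eps=1e-12):
    """Cross-entropy loss (12); W is (K, d), X is (N, d), y holds labels in {1, ..., K}."""
    logits = X @ W.T                              # (N, K) matrix of w_k^T x_i
    logits -= logits.max(axis=1, keepdims=True)   # stability shift; the loss is unchanged
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # row-wise softmax
    correct = probs[np.arange(len(y)), y - 1]     # predicted probability of the true class
    return -np.sum(np.log(correct + eps))

# Toy data: N=4 samples, d=2 features, K=3 classes.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))
X = rng.normal(size=(4, 2))
y = np.array([1, 3, 2, 1])
print(multiclass_lr_loss(W, X, y))
```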

4 Properties of Multi-class LR

4.1 Shift-invariance in Parameters

The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights $w = (w_1, \cdots, w_K)$, if we subtract the same vector $u$ from each of the $K$ weight vectors, the outputs of the softmax function remain the same.

Proof. To prove this, let us denote $w' = \{w'_i\}_{i=1}^K$, where $w'_i = w_i - u$. We have
$$\begin{aligned}
P(y = k|x; w') &= \frac{e^{(w_k - u)^T x}}{\sum_{i=1}^K e^{(w_i - u)^T x}} && (13) \\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\sum_{i=1}^K e^{w_i^T x} e^{-u^T x}} && (14) \\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\left(\sum_{i=1}^K e^{w_i^T x}\right) e^{-u^T x}} && (15) \\
&= \frac{e^{w_k^T x}}{\sum_{i=1}^K e^{w_i^T x}} && (16) \\
&= P(y = k|x; w), && (17)
\end{aligned}$$
which completes the proof.
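This invariance is easy to check numerically, and it is also why implementations can safely subtract a constant (e.g., the largest logit) before exponentiating. A small sketch, assuming NumPy and hypothetical toy values:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))          # K = 4 weight vectors of dimension 3
x = rng.normal(size=3)
u = rng.normal(size=3)               # an arbitrary shift vector

p_original = softmax(W @ x)
p_shifted = softmax((W - u) @ x)     # subtract the same u from every w_k
print(np.allclose(p_original, p_shifted))   # True
```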

4.2 Equivalence to Sigmoid

Once we have proved the shift-invariance, we are able to show that when $K = 2$, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.


Proof.
$$\begin{aligned}
h_w(x) &= \frac{1}{e^{w_1^T x} + e^{w_2^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \end{bmatrix} && (18) \\
&= \frac{1}{e^{(w_1 - w_1)^T x} + e^{(w_2 - w_1)^T x}} \begin{bmatrix} e^{(w_1 - w_1)^T x} \\ e^{(w_2 - w_1)^T x} \end{bmatrix} && (19) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{(w_2 - w_1)^T x}} \\[2ex] \dfrac{e^{(w_2 - w_1)^T x}}{1 + e^{(w_2 - w_1)^T x}} \end{bmatrix} && (20) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{-\tilde{w}^T x}} \\[2ex] \dfrac{e^{-\tilde{w}^T x}}{1 + e^{-\tilde{w}^T x}} \end{bmatrix} && (21) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{-\tilde{w}^T x}} \\[2ex] 1 - \dfrac{1}{1 + e^{-\tilde{w}^T x}} \end{bmatrix}
= \begin{bmatrix} h_{\tilde{w}}(x) \\ 1 - h_{\tilde{w}}(x) \end{bmatrix}, && (22)
\end{aligned}$$
where $\tilde{w} = w_1 - w_2$. This completes the proof.
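The equivalence can also be checked numerically: with $K = 2$, the two-class softmax hypothesis matches the sigmoid evaluated at $\tilde{w} = w_1 - w_2$. A small sketch, assuming NumPy and hypothetical toy values:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(3)
w1, w2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

p_softmax = softmax(np.array([w1 @ x, w2 @ x]))   # two-class softmax hypothesis
p_sigmoid = sigmoid((w1 - w2) @ x)                # sigmoid with w_tilde = w1 - w2
print(np.allclose(p_softmax, [p_sigmoid, 1.0 - p_sigmoid]))   # True
```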

4.3 Relations between binary and multi-class LR

In the assignment, we have already proved that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes. We show the proof again below.

Proof.
$$\begin{aligned}
\arg\min_w E_{\mathrm{in}}(w) &= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln\left(1 + e^{-y_n w^T x_n}\right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{\theta(y_n w^T x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{P(y_n|x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \left( \mathbb{I}[y_n = +1] \ln \frac{1}{P(y_n|x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{P(y_n|x_n)} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \left( \mathbb{I}[y_n = +1] \ln \frac{1}{h(x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{1 - h(x_n)} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \left( p_n \log \frac{1}{q_n} + (1 - p_n) \log \frac{1}{1 - q_n} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N H(\{p_n, 1 - p_n\}, \{q_n, 1 - q_n\}),
\end{aligned}$$
where $p_n = \mathbb{I}[y_n = +1]$ and $q_n = h(x_n)$. This completes the proof.

The equivalence between logistic regression loss and the cross-entropy loss, as proved above, shows

that we always obtain identical weights w by minimizing the two losses. The equivalence between

the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that

the binary logistic regression is a particular case of multi-class logistic regression when K = 2.
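The equality of the per-sample losses (not just their minimizers) can be verified numerically for labels in $\{-1, +1\}$, with $h(x) = \theta(w^T x)$. A small sketch, assuming NumPy and hypothetical toy values:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(4)
w = rng.normal(size=3)
X = rng.normal(size=(5, 3))
y = rng.choice([-1, 1], size=5)

# Logistic regression loss: mean of ln(1 + exp(-y_n w^T x_n)).
lr_loss = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))

# Cross-entropy loss with binary outcomes p_n = I[y_n = +1], q_n = h(x_n).
h = sigmoid(X @ w)
p = (y == 1).astype(float)
ce_loss = np.mean(-p * np.log(h) - (1.0 - p) * np.log(1.0 - h))

print(np.allclose(lr_loss, ce_loss))   # True
```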

5 Derivative of multi-class LR

To optimize the multi-class LR by gradient descent, we now derive the derivative of softmax and

cross entropy. The derivative of the loss function can thus be obtained by the chain rule.


5.1 Derivative of softmax

Let $p_i$ denote the $i$-th element of $\mathrm{softmax}(a)$. Then for $j = i$, we have

$$\begin{aligned}
\frac{\partial p_i}{\partial a_j} = \frac{\partial p_i}{\partial a_i} &= \frac{\partial}{\partial a_i} \frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}} && (23) \\
&= \frac{e^{a_i} \sum_{k=1}^K e^{a_k} - e^{2 a_i}}{\left(\sum_{k=1}^K e^{a_k}\right)^2} && (24) \\
&= \frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}} \cdot \frac{\sum_{k=1}^K e^{a_k} - e^{a_i}}{\sum_{k=1}^K e^{a_k}} && (25) \\
&= p_i (1 - p_i) && (26) \\
&= p_i (1 - p_j). && (27)
\end{aligned}$$
And for $j \neq i$,
$$\begin{aligned}
\frac{\partial p_i}{\partial a_j} &= \frac{\partial}{\partial a_j} \frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}} && (28) \\
&= \frac{0 - e^{a_i} e^{a_j}}{\left(\sum_{k=1}^K e^{a_k}\right)^2} && (29) \\
&= -\frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}} \cdot \frac{e^{a_j}}{\sum_{k=1}^K e^{a_k}} = -p_i p_j. && (30)
\end{aligned}$$
If we unify the two cases with the Kronecker delta, we will have
$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j), \qquad (31)$$
where
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$
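In matrix form, (31) says the Jacobian of the softmax is $\mathrm{diag}(p) - p p^T$. The sketch below (assuming NumPy; names are hypothetical) builds this Jacobian and checks it against central finite differences.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    """Analytic Jacobian: J[i, j] = p_i * (delta_ij - p_j) = diag(p) - p p^T."""
    p = softmax(a)
    return np.diag(p) - np.outer(p, p)

# Central finite-difference check of the analytic Jacobian on a random input.
a = np.random.default_rng(5).normal(size=4)
J = softmax_jacobian(a)
eps = 1e-6
J_num = np.zeros_like(J)
for j in range(len(a)):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[j] += eps
    a_minus[j] -= eps
    J_num[:, j] = (softmax(a_plus) - softmax(a_minus)) / (2 * eps)   # column j = dp/da_j
print(np.allclose(J, J_num, atol=1e-6))   # True
```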

5.2 Derivative of cross entropy loss with softmax

The cross entropy loss is given by
$$L = -\sum_i y_i \log(p_i),$$
where $p_i = \mathrm{softmax}_i(a) = \frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}}$ and $y_i$ denotes the $i$-th element of the one-hot vector. The derivative of the cross entropy is

$$\begin{aligned}
\frac{\partial L}{\partial a_k} &= -\sum_i y_i \frac{\partial \log(p_i)}{\partial a_k} && (32) \\
&= -\sum_i y_i \frac{\partial \log(p_i)}{\partial p_i} \cdot \frac{\partial p_i}{\partial a_k} && (33) \\
&= -\sum_i y_i \frac{1}{p_i} \cdot \frac{\partial p_i}{\partial a_k} && (34) \\
&= -\sum_i y_i \frac{1}{p_i} \cdot p_i (\delta_{ki} - p_k) && (35) \\
&= -y_k (1 - p_k) + \sum_{i \neq k} y_i p_k && (36) \\
&= p_k \sum_{i=1}^K y_i - y_k && (37) \\
&= p_k - y_k. && (38)
\end{aligned}$$
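The simple form $p_k - y_k$ of the gradient with respect to the logits makes a gradient-descent implementation of multi-class LR short: by the chain rule with $a = Wx$, the gradient with respect to $w_k$ is $\sum_i (p_{ik} - y_{ik}) x_i$. The sketch below assumes NumPy; the function names, step size, and iteration count are hypothetical choices rather than part of the notes.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # stability shift per row
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_multiclass_lr(X, y, K, lr=0.1, n_iters=500):
    """Gradient descent on the mean cross-entropy loss; X is (N, d), y in {1, ..., K}."""
    N, d = X.shape
    Y = np.eye(K)[y - 1]                     # one-hot targets, shape (N, K)
    W = np.zeros((K, d))
    for _ in range(n_iters):
        P = softmax_rows(X @ W.T)            # predicted distributions, shape (N, K)
        grad = (P - Y).T @ X / N             # gradient of the mean loss w.r.t. W
        W -= lr * grad
    return W

# Toy usage with random data (N=20, d=3, K=3).
rng = np.random.default_rng(6)
X = rng.normal(size=(20, 3))
y = rng.integers(1, 4, size=20)
W = train_multiclass_lr(X, y, K=3)
print(softmax_rows(X @ W.T)[:3])             # class probabilities for the first 3 samples
```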

