Logistic Regression: From Binary to Multi-Class

Yaochen Xie

Texas A&M University

College Station, TX 77843

ethanycx@tamu.edu

Shuiwang Ji

Texas A&M University

College Station, TX 77843

sji@tamu.edu

1 Introduction

This document aims to provide a complete, self-contained, and easy-to-understand introduction to multi-class logistic regression (LR). We start with a quick review of binary LR and then generalize it to the multi-class case. We further discuss the connections between binary LR and multi-class LR. This document is based on lecture notes by Shuiwang Ji and compiled by Yaochen Xie at Texas A&M University. It can be used for undergraduate- and graduate-level classes.

2 Binary Logistic Regression

The binary LR predicts the label $y_i \in \{-1, +1\}$ for a given sample $x_i$ by estimating a probability $P(y|x_i)$ and comparing it with a pre-defined threshold.

Recall the sigmoid function is defined as
$$\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}, \qquad (1)$$
where $s \in \mathbb{R}$ and $\theta$ denotes the sigmoid function. $\theta$ maps any value in $\mathbb{R}$ to a number in $(0, 1)$ and meanwhile preserves the order of any two input numbers, as $\theta(\cdot)$ is a monotonically increasing function.

The probability is thus represented by
$$P(y|x) = \begin{cases} \theta(w^T x) & \text{if } y = +1, \\ 1 - \theta(w^T x) & \text{if } y = -1. \end{cases}$$
This can also be expressed compactly as
$$P(y|x) = \theta(y w^T x), \qquad (2)$$
due to the fact that $\theta(-s) = 1 - \theta(s)$. Note that in the binary case, we only need to estimate one probability, as the probabilities for $+1$ and $-1$ sum to one.
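As a concrete illustration, the following minimal sketch (assuming NumPy; the weight vector, input, and function names are hypothetical) computes $P(y=+1|x) = \theta(w^T x)$ and thresholds it at 0.5 to produce a label in $\{-1, +1\}$.

```python
import numpy as np

def sigmoid(s):
    """theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def binary_lr_predict(w, x, threshold=0.5):
    """Return (P(y=+1|x), predicted label in {-1, +1})."""
    p_pos = sigmoid(w @ x)              # theta(w^T x) = P(y = +1 | x)
    label = 1 if p_pos >= threshold else -1
    return p_pos, label

# Toy weight vector and input (hypothetical values).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.2, 0.3])
print(binary_lr_predict(w, x))
```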

3 Multi-Class Logistic Regression

The binary logistic regression assumes that the label $y_i \in \{-1, +1\}$ $(i = 1, \cdots, N)$, while in the multi-class case there are more than two classes, i.e., $y_i \in \{1, 2, \cdots, K\}$ $(i = 1, \cdots, N)$, where $K$ is the number of classes and $N$ is the number of samples. In this case, we need to estimate the probability for each of the $K$ classes. The hypothesis in binary LR is hence generalized to the multi-class case as

$$h_w(x) = \begin{bmatrix} P(y = 1|x; w) \\ P(y = 2|x; w) \\ \vdots \\ P(y = K|x; w) \end{bmatrix}. \qquad (3)$$

A critical assumption here is that there is no ordinal relationship between the classes. So we will

need one linear signal for each of the K classes, which should be independent conditioned on x. As

a result, in the multi-class LR, we compute K linear signals by the dot product between the input x

and $K$ independent weight vectors $w_k$, $k = 1, \cdots, K$, as

$$\begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix}. \qquad (4)$$

So far, the only thing left to obtain the hypothesis is to map the $K$ linear outputs (as a vector in $\mathbb{R}^K$) to the $K$ probabilities (as a probability distribution among the $K$ classes).
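For illustration, the $K$ linear signals in (4) can be computed in a single matrix-vector product by stacking the $K$ weight vectors as the rows of a matrix $W \in \mathbb{R}^{K \times d}$. A minimal NumPy sketch with hypothetical toy sizes:

```python
import numpy as np

K, d = 3, 4                      # number of classes and input dimension (toy sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))      # row k holds w_k^T
x = rng.normal(size=d)

signals = W @ x                  # the vector [w_1^T x, ..., w_K^T x], shape (K,)
print(signals)
```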

3.1 Softmax

In order to accomplish such a mapping, we introduce the softmax function, which is generalized from the sigmoid function and defined as below. Given a $K$-dimensional vector $v = [v_1, v_2, \cdots, v_K]^T \in \mathbb{R}^K$,
$$\mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^K e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}. \qquad (5)$$

It is easy to verify that the softmax maps a vector in $\mathbb{R}^K$ to $(0, 1)^K$. All elements in the output vector of softmax sum to 1 and their orders are preserved. Thus the hypothesis in (3) can be written as

$$h_w(x) = \begin{bmatrix} P(y = 1|x; w) \\ P(y = 2|x; w) \\ \vdots \\ P(y = K|x; w) \end{bmatrix} = \frac{1}{\sum_{k=1}^K e^{w_k^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_K^T x} \end{bmatrix}. \qquad (6)$$

We will further discuss the connection between the softmax function and the sigmoid function by

showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when K = 2

in Section 4.
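A minimal sketch of the softmax in (5) and the hypothesis in (6), assuming NumPy. Subtracting the maximum entry before exponentiating is a common numerical-stability trick; it does not change the result because the softmax is invariant to adding the same constant to every input (closely related to the shift-invariance property discussed in Section 4.1).

```python
import numpy as np

def softmax(v):
    """Map a vector in R^K to a probability distribution over K classes."""
    z = v - np.max(v)            # stability shift; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

def hypothesis(W, x):
    """h_w(x) in Equation (6): softmax of the K linear signals."""
    return softmax(W @ x)

W = np.array([[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]])   # toy weights, K=3, d=2
x = np.array([0.4, -1.2])
p = hypothesis(W, x)
print(p, p.sum())                # the probabilities sum to 1
```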

3.2 Cross Entropy

We optimize the multi-class LR by minimizing a loss (cost) function that measures the error between predictions and the true labels, as we did in binary LR. To this end, we introduce the cross entropy in Equation (7), which measures the distance between two probability distributions.

The cross entropy is defined by
$$H(P, Q) = -\sum_{i=1}^K p_i \log(q_i), \qquad (7)$$

where $P = (p_1, \cdots, p_K)$ and $Q = (q_1, \cdots, q_K)$ are two probability distributions. In multi-class LR, the two probability distributions are the true distribution and the predicted vector in Equation (3), respectively.

Here the true distribution refers to the one-hot encoding of the label. For label $k$ (i.e., $k$ is the correct class), the one-hot encoding is defined as a vector whose element at index $k$ is 1 and whose elements are 0 everywhere else.
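The sketch below (assuming NumPy; the helper names are hypothetical) builds the one-hot encoding of a label and evaluates the cross entropy (7) against a predicted distribution.

```python
import numpy as np

def one_hot(k, K):
    """One-hot encoding of class k among K classes (labels are 1, ..., K)."""
    y = np.zeros(K)
    y[k - 1] = 1.0
    return y

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_i p_i log q_i; eps guards against log(0)."""
    return -np.sum(p * np.log(q + eps))

y = one_hot(2, K=3)              # the true class is 2
q = np.array([0.1, 0.7, 0.2])    # a predicted distribution
print(cross_entropy(y, q))       # approximately -log(0.7)
```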


3.3 Loss Function

Now the loss for a training sample x in class c is given by

$$\begin{aligned}
\mathrm{loss}(x, y; w) &= H(y, \hat{y}) && (8) \\
&= -\sum_k y_k \log \hat{y}_k && (9) \\
&= -\log \hat{y}_c && (10) \\
&= -\log \frac{e^{w_c^T x}}{\sum_{k=1}^K e^{w_k^T x}}, && (11)
\end{aligned}$$
where $y$ denotes the one-hot vector and $\hat{y}$ is the predicted distribution $h(x)$. The loss on all samples $(x_i, y_i)_{i=1}^N$ is
$$\mathrm{loss}(X, Y; w) = -\sum_{i=1}^N \sum_{k=1}^K \mathbb{I}[y_i = k] \log \frac{e^{w_k^T x_i}}{\sum_{k'=1}^K e^{w_{k'}^T x_i}}. \qquad (12)$$
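Putting the pieces together, here is a minimal sketch of the total loss (12), assuming NumPy; $X$ stores one sample per row, the labels in $y$ take values in $\{1, \cdots, K\}$, and all names and toy values are hypothetical.

```python
import numpy as np

def multiclass_lr_loss(W, X, y, eps=1e-12):
    """Cross-entropy loss (12); W is (K, d), X is (N, d), y holds labels in {1, ..., K}."""
    logits = X @ W.T                              # (N, K) matrix of w_k^T x_i
    logits -= logits.max(axis=1, keepdims=True)   # stability shift; the loss is unchanged
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # row-wise softmax
    correct = probs[np.arange(len(y)), y - 1]     # predicted probability of the true class
    return -np.sum(np.log(correct + eps))

# Toy data: N=4 samples, d=2 features, K=3 classes.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))
X = rng.normal(size=(4, 2))
y = np.array([1, 3, 2, 1])
print(multiclass_lr_loss(W, X, y))
```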

4 Properties of Multi-class LR

4.1 Shift-invariance in Parameters

The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights $w = (w_1, \cdots, w_K)$, if we subtract the same vector $u$ from each of the $K$ weight vectors, the outputs of the softmax function remain the same.

Proof. To prove this, let us denote $w' = \{w'_i\}_{i=1}^K$, where $w'_i = w_i - u$. We have
$$\begin{aligned}
P(y = k|x; w') &= \frac{e^{(w_k - u)^T x}}{\sum_{i=1}^K e^{(w_i - u)^T x}} && (13) \\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\sum_{i=1}^K e^{w_i^T x} e^{-u^T x}} && (14) \\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\left(\sum_{i=1}^K e^{w_i^T x}\right) e^{-u^T x}} && (15) \\
&= \frac{e^{w_k^T x}}{\sum_{i=1}^K e^{w_i^T x}} && (16) \\
&= P(y = k|x; w), && (17)
\end{aligned}$$
which completes the proof.
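This invariance is easy to check numerically, and it is also why implementations can safely subtract a constant (e.g., the largest logit) before exponentiating. A small sketch, assuming NumPy and hypothetical toy values:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))          # K = 4 weight vectors of dimension 3
x = rng.normal(size=3)
u = rng.normal(size=3)               # an arbitrary shift vector

p_original = softmax(W @ x)
p_shifted = softmax((W - u) @ x)     # subtract the same u from every w_k
print(np.allclose(p_original, p_shifted))   # True
```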

4.2 Equivalence to Sigmoid

Once we have proved the shift-invariance, we are able to show that when $K = 2$, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.


Proof.
$$\begin{aligned}
h_w(x) &= \frac{1}{e^{w_1^T x} + e^{w_2^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \end{bmatrix} && (18) \\
&= \frac{1}{e^{(w_1 - w_1)^T x} + e^{(w_2 - w_1)^T x}} \begin{bmatrix} e^{(w_1 - w_1)^T x} \\ e^{(w_2 - w_1)^T x} \end{bmatrix} && (19) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{(w_2 - w_1)^T x}} \\[2ex] \dfrac{e^{(w_2 - w_1)^T x}}{1 + e^{(w_2 - w_1)^T x}} \end{bmatrix} && (20) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{-\tilde{w}^T x}} \\[2ex] \dfrac{e^{-\tilde{w}^T x}}{1 + e^{-\tilde{w}^T x}} \end{bmatrix} && (21) \\
&= \begin{bmatrix} \dfrac{1}{1 + e^{-\tilde{w}^T x}} \\[2ex] 1 - \dfrac{1}{1 + e^{-\tilde{w}^T x}} \end{bmatrix}
= \begin{bmatrix} h_{\tilde{w}}(x) \\ 1 - h_{\tilde{w}}(x) \end{bmatrix}, && (22)
\end{aligned}$$
where $\tilde{w} = w_1 - w_2$. This completes the proof.
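The equivalence can also be checked numerically: with $K = 2$, the two-class softmax hypothesis matches the sigmoid evaluated at $\tilde{w} = w_1 - w_2$. A small sketch, assuming NumPy and hypothetical toy values:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(3)
w1, w2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

p_softmax = softmax(np.array([w1 @ x, w2 @ x]))   # two-class softmax hypothesis
p_sigmoid = sigmoid((w1 - w2) @ x)                # sigmoid with w_tilde = w1 - w2
print(np.allclose(p_softmax, [p_sigmoid, 1.0 - p_sigmoid]))   # True
```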

4.3 Relations between binary and multi-class LR

In the assignment, we have already proved that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes. We show the proof again below.

Proof.
$$\begin{aligned}
\arg\min_w E_{\mathrm{in}}(w) &= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln\left(1 + e^{-y_n w^T x_n}\right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{\theta(y_n w^T x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{P(y_n|x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \left( \mathbb{I}[y_n = +1] \ln \frac{1}{P(y_n|x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{P(y_n|x_n)} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \left( \mathbb{I}[y_n = +1] \ln \frac{1}{h(x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{1 - h(x_n)} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \left( p_n \log \frac{1}{q_n} + (1 - p_n) \log \frac{1}{1 - q_n} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N H(\{p_n, 1 - p_n\}, \{q_n, 1 - q_n\}),
\end{aligned}$$
where $p_n = \mathbb{I}[y_n = +1]$ and $q_n = h(x_n)$. This completes the proof.

The equivalence between logistic regression loss and the cross-entropy loss, as proved above, shows

that we always obtain identical weights w by minimizing the two losses. The equivalence between

the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that

the binary logistic regression is a particular case of multi-class logistic regression when K = 2.
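The equality of the per-sample losses (not just their minimizers) can be verified numerically for labels in $\{-1, +1\}$, with $h(x) = \theta(w^T x)$. A small sketch, assuming NumPy and hypothetical toy values:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(4)
w = rng.normal(size=3)
X = rng.normal(size=(5, 3))
y = rng.choice([-1, 1], size=5)

# Logistic regression loss: mean of ln(1 + exp(-y_n w^T x_n)).
lr_loss = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))

# Cross-entropy loss with binary outcomes p_n = I[y_n = +1], q_n = h(x_n).
h = sigmoid(X @ w)
p = (y == 1).astype(float)
ce_loss = np.mean(-p * np.log(h) - (1.0 - p) * np.log(1.0 - h))

print(np.allclose(lr_loss, ce_loss))   # True
```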

5 Derivative of multi-class LR

To optimize the multi-class LR by gradient descent, we now derive the derivative of softmax and

cross entropy. The derivative of the loss function can thus be obtained by the chain rule.


5.1 Derivative of softmax

Let $p_i$ denote the $i$-th element of $\mathrm{softmax}(a)$. Then for $j = i$, we have

$$\begin{aligned}
\frac{\partial p_i}{\partial a_j} = \frac{\partial p_i}{\partial a_i} &= \frac{\partial}{\partial a_i} \frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}} && (23) \\
&= \frac{e^{a_i} \sum_{k=1}^K e^{a_k} - e^{2 a_i}}{\left(\sum_{k=1}^K e^{a_k}\right)^2} && (24) \\
&= \frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}} \cdot \frac{\sum_{k=1}^K e^{a_k} - e^{a_i}}{\sum_{k=1}^K e^{a_k}} && (25) \\
&= p_i (1 - p_i) && (26) \\
&= p_i (1 - p_j). && (27)
\end{aligned}$$
And for $j \neq i$,
$$\begin{aligned}
\frac{\partial p_i}{\partial a_j} &= \frac{\partial}{\partial a_j} \frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}} && (28) \\
&= \frac{0 - e^{a_i} e^{a_j}}{\left(\sum_{k=1}^K e^{a_k}\right)^2} && (29) \\
&= -\frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}} \cdot \frac{e^{a_j}}{\sum_{k=1}^K e^{a_k}} = -p_i p_j. && (30)
\end{aligned}$$
If we unify the two cases with the Kronecker delta, we will have
$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j), \qquad (31)$$
where
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$
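In matrix form, (31) says the Jacobian of the softmax is $\mathrm{diag}(p) - p p^T$. The sketch below (assuming NumPy; names are hypothetical) builds this Jacobian and checks it against central finite differences.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    """Analytic Jacobian: J[i, j] = p_i * (delta_ij - p_j) = diag(p) - p p^T."""
    p = softmax(a)
    return np.diag(p) - np.outer(p, p)

# Central finite-difference check of the analytic Jacobian on a random input.
a = np.random.default_rng(5).normal(size=4)
J = softmax_jacobian(a)
eps = 1e-6
J_num = np.zeros_like(J)
for j in range(len(a)):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[j] += eps
    a_minus[j] -= eps
    J_num[:, j] = (softmax(a_plus) - softmax(a_minus)) / (2 * eps)   # column j = dp/da_j
print(np.allclose(J, J_num, atol=1e-6))   # True
```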

5.2 Derivative of cross entropy loss with softmax

The cross entropy loss is given by
$$L = -\sum_i y_i \log(p_i),$$
where $p_i = \mathrm{softmax}_i(a) = \frac{e^{a_i}}{\sum_{k=1}^K e^{a_k}}$ and $y_i$ denotes the $i$-th element of the one-hot vector. The derivative of the cross entropy is

$$\begin{aligned}
\frac{\partial L}{\partial a_k} &= -\sum_i y_i \frac{\partial \log(p_i)}{\partial a_k} && (32) \\
&= -\sum_i y_i \frac{\partial \log(p_i)}{\partial p_i} \cdot \frac{\partial p_i}{\partial a_k} && (33) \\
&= -\sum_i y_i \frac{1}{p_i} \cdot \frac{\partial p_i}{\partial a_k} && (34) \\
&= -\sum_i y_i \frac{1}{p_i} \cdot p_i (\delta_{ki} - p_k) && (35) \\
&= -y_k (1 - p_k) + \sum_{i \neq k} y_i p_k && (36) \\
&= p_k \sum_{i=1}^K y_i - y_k && (37) \\
&= p_k - y_k. && (38)
\end{aligned}$$
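The simple form $p_k - y_k$ of the gradient with respect to the logits makes a gradient-descent implementation of multi-class LR short: by the chain rule with $a = Wx$, the gradient with respect to $w_k$ is $\sum_i (p_{ik} - y_{ik}) x_i$. The sketch below assumes NumPy; the function names, step size, and iteration count are hypothetical choices rather than part of the notes.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # stability shift per row
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_multiclass_lr(X, y, K, lr=0.1, n_iters=500):
    """Gradient descent on the mean cross-entropy loss; X is (N, d), y in {1, ..., K}."""
    N, d = X.shape
    Y = np.eye(K)[y - 1]                     # one-hot targets, shape (N, K)
    W = np.zeros((K, d))
    for _ in range(n_iters):
        P = softmax_rows(X @ W.T)            # predicted distributions, shape (N, K)
        grad = (P - Y).T @ X / N             # gradient of the mean loss w.r.t. W
        W -= lr * grad
    return W

# Toy usage with random data (N=20, d=3, K=3).
rng = np.random.default_rng(6)
X = rng.normal(size=(20, 3))
y = rng.integers(1, 4, size=20)
W = train_multiclass_lr(X, y, K=3)
print(softmax_rows(X @ W.T)[:3])             # class probabilities for the first 3 samples
```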

