Logistic Regression: From Binary to Multi-Class
Shuiwang Ji
Texas A&M University College Station, TX 77843
sji@tamu.edu
Yaochen Xie
Texas A&M University College Station, TX 77843
ethanycx@tamu.edu
1 Introduction
This document aims to provide a complete, self-contained, and easy-to-understand introduction to multi-class logistic regression (LR). We start with a quick review of binary LR and then generalize it to the multi-class case. We further discuss the connections between binary LR and multi-class LR. This document is based on lecture notes by Shuiwang Ji and compiled by Yaochen Xie at Texas A&M University. It can be used for undergraduate- and graduate-level classes.
2 Binary Logistic Regression
The binary LR predicts the label $y_i \in \{-1, +1\}$ for a given sample $x_i$ by estimating a probability $P(y|x_i)$ and comparing it with a pre-defined threshold.
Recall the sigmoid function is defined as
$$\sigma(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}, \tag{1}$$
where $s \in \mathbb{R}$ and $\sigma$ denotes the sigmoid function. The sigmoid maps any value in $\mathbb{R}$ to a number in $(0, 1)$ and meanwhile preserves the order of any two inputs, since $\sigma(\cdot)$ is a monotonically increasing function.
The probability is thus represented by
$$P(y|x) = \begin{cases} \sigma(w^T x) & \text{if } y = +1 \\ 1 - \sigma(w^T x) & \text{if } y = -1. \end{cases}$$
This can also be expressed compactly as
$$P(y|x) = \sigma(y w^T x), \tag{2}$$
due to the fact that $\sigma(-s) = 1 - \sigma(s)$. Note that in the binary case, we only need to estimate one probability, as the probabilities for $+1$ and $-1$ sum to one.
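As an illustration (not part of the original notes), a minimal NumPy sketch of this binary prediction rule is given below; the names `sigmoid` and `predict_binary` are our own, and the 0.5 threshold is one common choice for the pre-defined threshold mentioned above.

```python
import numpy as np

def sigmoid(s):
    """Sigmoid function of Equation (1)."""
    return 1.0 / (1.0 + np.exp(-s))

def predict_binary(w, x, threshold=0.5):
    """Predict a label in {-1, +1} from P(y = +1 | x) = sigmoid(w^T x)."""
    p_positive = sigmoid(w @ x)          # P(y = +1 | x)
    return +1 if p_positive >= threshold else -1

# Equation (2): P(y | x) = sigmoid(y * w^T x); the two labels' probabilities sum to one.
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(predict_binary(w, x))
print(sigmoid(+1 * (w @ x)) + sigmoid(-1 * (w @ x)))  # 1.0
```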
3 Multi-Class Logistic Regression
The binary logistic regression assumes that the label $y_i \in \{-1, +1\}$ ($i = 1, \ldots, N$), while in the multi-class case there are more than two classes, i.e., $y_i \in \{1, 2, \ldots, K\}$ ($i = 1, \ldots, N$), where $K$ is the number of classes and $N$ is the number of samples. In this case, we need to estimate
the probability for each of the K classes. The hypothesis in binary LR is hence generalized to the
multi-class case as
$$h_w(x) = \begin{bmatrix} P(y = 1|x; w) \\ P(y = 2|x; w) \\ \vdots \\ P(y = K|x; w) \end{bmatrix}. \tag{3}$$
A critical assumption here is that there is no ordinal relationship between the classes. So we will need one linear signal for each of the $K$ classes, which should be independent conditioned on $x$. As a result, in the multi-class LR, we compute $K$ linear signals by the dot product between the input $x$ and $K$ independent weight vectors $w_k$, $k = 1, \ldots, K$, as
$$\begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix}. \tag{4}$$
So far, the only thing left to obtain the hypothesis is to map the $K$ linear outputs (as a vector in $\mathbb{R}^K$) to the $K$ probabilities (as a probability distribution among the $K$ classes).
3.1 Softmax
In order to accomplish such a mapping, we introduce the softmax function, which is generalized from the sigmoid function and defined as below. Given a $K$-dimensional vector $v = [v_1, v_2, \ldots, v_K]^T \in \mathbb{R}^K$,
$$\mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^{K} e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}. \tag{5}$$
It is easy to verify that the softmax maps a vector in $\mathbb{R}^K$ to $(0, 1)^K$. All elements in the output vector of softmax sum to 1, and their order is preserved. Thus the hypothesis in (3) can be written as
$$h_w(x) = \begin{bmatrix} P(y = 1|x; w) \\ P(y = 2|x; w) \\ \vdots \\ P(y = K|x; w) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{w_k^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_K^T x} \end{bmatrix}. \tag{6}$$
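To make Equations (4)-(6) concrete, the following is a small NumPy sketch of the multi-class hypothesis (our own illustration; `softmax` and `hypothesis` are hypothetical names). Subtracting the maximum entry before exponentiating does not change the output and avoids overflow; it is a scalar instance of the shift-invariance discussed later in Section 4.1.

```python
import numpy as np

def softmax(v):
    """Softmax of Equation (5): maps a vector in R^K to a probability vector."""
    e = np.exp(v - np.max(v))   # shift by max(v) for numerical stability
    return e / np.sum(e)

def hypothesis(W, x):
    """Multi-class LR hypothesis of Equation (6).

    W has shape (K, d), one weight vector w_k per row; x has shape (d,).
    Returns the K class probabilities P(y = k | x; w).
    """
    signals = W @ x             # the K linear signals of Equation (4)
    return softmax(signals)

# Tiny example with K = 3 classes and d = 2 features.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
x = np.array([2.0, -1.0])
p = hypothesis(W, x)
print(p, p.sum())               # K probabilities summing to 1
```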
We will further discuss the connection between the softmax function and the sigmoid function by showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when K = 2 in Section 4.
3.2 Cross Entropy
We optimize the multi-class LR by minimizing a loss (cost) function that measures the error between predictions and the true labels, as we did in the binary LR. To this end, we introduce the cross entropy in Equation (7) to measure the distance between two probability distributions.
The cross entropy is defined by
$$H(P, Q) = -\sum_{i=1}^{K} p_i \log(q_i), \tag{7}$$
where $P = (p_1, \ldots, p_K)$ and $Q = (q_1, \ldots, q_K)$ are two probability distributions. In multi-class LR, the two probability distributions are the true distribution and the predicted vector in Equation (3), respectively.
Here the true distribution refers to the one-hot encoding of the label. For label $k$ (where $k$ is the correct class), the one-hot encoding is defined as a vector whose element at index $k$ is 1 and whose elements everywhere else are 0.
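As a small illustration (our own sketch, not from the original notes; `one_hot` and `cross_entropy` are hypothetical names), the cross entropy of Equation (7) between a one-hot label and a predicted distribution reduces to the negative log of the predicted probability of the correct class:

```python
import numpy as np

def one_hot(k, K):
    """One-hot encoding of label k (0-indexed) among K classes."""
    y = np.zeros(K)
    y[k] = 1.0
    return y

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_i p_i log(q_i), Equation (7)."""
    return -np.sum(p * np.log(q + eps))

# With a one-hot P, the cross entropy is just -log of q at the true class.
q = np.array([0.1, 0.7, 0.2])   # predicted distribution
y = one_hot(1, 3)               # true class is index 1
print(cross_entropy(y, q), -np.log(0.7))
```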
3.3 Loss Function
Now the loss for a training sample x in class c is given by
$$\begin{aligned}
\mathrm{loss}(x, y; w) &= H(y, \hat{y}) & (8) \\
&= -\sum_{k} y_k \log \hat{y}_k & (9) \\
&= -\log \hat{y}_c & (10) \\
&= -\log \frac{e^{w_c^T x}}{\sum_{k=1}^{K} e^{w_k^T x}}, & (11)
\end{aligned}$$
where $y$ denotes the one-hot vector and $\hat{y}$ is the predicted distribution $h_w(x)$. The loss on all samples $\{(x_i, y_i)\}_{i=1}^{N}$ is
$$\mathrm{loss}(X, Y; w) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \log \frac{e^{w_k^T x_i}}{\sum_{j=1}^{K} e^{w_j^T x_i}}. \tag{12}$$
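A vectorized sketch of Equation (12) might look as follows (our own illustration; `multiclass_lr_loss` is a hypothetical name, and labels are assumed to be 0-indexed integers):

```python
import numpy as np

def multiclass_lr_loss(W, X, y):
    """Cross-entropy loss of Equation (12).

    W: (K, d) weight matrix, X: (N, d) samples, y: (N,) integer labels in {0, ..., K-1}.
    """
    signals = X @ W.T                                   # (N, K) linear signals
    signals = signals - signals.max(axis=1, keepdims=True)   # stability shift
    log_probs = signals - np.log(np.exp(signals).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].sum()       # -sum_i log p_{y_i}

# Example with N = 4 samples, d = 2 features, K = 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
X = rng.normal(size=(4, 2))
y = np.array([0, 2, 1, 2])
print(multiclass_lr_loss(W, X, y))
```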
4 Properties of Multi-class LR
4.1 Shift-invariance in Parameters
The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights $w = (w_1, \ldots, w_K)$, if we subtract the same vector $u$ from each of the $K$ weight vectors, the outputs of the softmax function remain the same.
Proof. To prove this, let us denote $w' = \{w_i'\}_{i=1}^{K}$, where $w_i' = w_i - u$. We have
$$\begin{aligned}
P(y = k|x; w') &= \frac{e^{(w_k - u)^T x}}{\sum_{i=1}^{K} e^{(w_i - u)^T x}} & (13) \\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\sum_{i=1}^{K} e^{w_i^T x} e^{-u^T x}} & (14) \\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\left(\sum_{i=1}^{K} e^{w_i^T x}\right) e^{-u^T x}} & (15) \\
&= \frac{e^{w_k^T x}}{\sum_{i=1}^{K} e^{w_i^T x}} & (16) \\
&= P(y = k|x; w), & (17)
\end{aligned}$$
which completes the proof.
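The shift-invariance can also be verified numerically. The sketch below (our own, reusing an illustrative `softmax`) subtracts the same vector $u$ from every weight vector and checks that the class probabilities are unchanged.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

# Shift every weight vector w_k by the same vector u: probabilities are unchanged.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))         # K = 4 classes, d = 3 features
u = rng.normal(size=3)
x = rng.normal(size=3)

p_original = softmax(W @ x)
p_shifted = softmax((W - u) @ x)    # subtract u from each row w_k
print(np.allclose(p_original, p_shifted))  # True
```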
4.2 Equivalence to Sigmoid
Having proved the shift-invariance, we can show that when $K = 2$, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two models are equivalent.
Proof.
$$\begin{aligned}
h_w(x) &= \frac{1}{e^{w_1^T x} + e^{w_2^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \end{bmatrix} & (18) \\
&= \frac{1}{e^{(w_1 - w_1)^T x} + e^{(w_2 - w_1)^T x}} \begin{bmatrix} e^{(w_1 - w_1)^T x} \\ e^{(w_2 - w_1)^T x} \end{bmatrix} & (19) \\
&= \begin{bmatrix} \frac{1}{1 + e^{(w_2 - w_1)^T x}} \\[4pt] \frac{e^{(w_2 - w_1)^T x}}{1 + e^{(w_2 - w_1)^T x}} \end{bmatrix} & (20) \\
&= \begin{bmatrix} \frac{1}{1 + e^{-\hat{w}^T x}} \\[4pt] \frac{e^{-\hat{w}^T x}}{1 + e^{-\hat{w}^T x}} \end{bmatrix} & (21) \\
&= \begin{bmatrix} \frac{1}{1 + e^{-\hat{w}^T x}} \\[4pt] 1 - \frac{1}{1 + e^{-\hat{w}^T x}} \end{bmatrix} = \begin{bmatrix} h_{\hat{w}}(x) \\ 1 - h_{\hat{w}}(x) \end{bmatrix}, & (22)
\end{aligned}$$
where $\hat{w} = w_1 - w_2$. This completes the proof.
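This equivalence can likewise be checked numerically. In the sketch below (our own illustration), the first entry of the $K = 2$ softmax hypothesis matches the sigmoid applied to $\hat{w} = w_1 - w_2$.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

# K = 2 softmax hypothesis vs. sigmoid with w_hat = w1 - w2.
rng = np.random.default_rng(2)
w1, w2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

p_softmax = softmax(np.array([w1 @ x, w2 @ x]))   # [P(y = 1 | x), P(y = 2 | x)]
p_sigmoid = sigmoid((w1 - w2) @ x)                # P(y = 1 | x) in binary LR
print(np.allclose(p_softmax[0], p_sigmoid))       # True
```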
4.3 Relations between binary and multi-class LR
In the assignment, we have already proved that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes. We show the proof again below.
Proof.
$$\begin{aligned}
\arg\min_{w} E_{\mathrm{in}}(w) &= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right) \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\sigma(y_n w^T x_n)} \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n|x_n)} \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ \mathbb{I}[y_n = +1] \ln \frac{1}{P(y_n|x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{P(y_n|x_n)} \right] \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ \mathbb{I}[y_n = +1] \ln \frac{1}{h(x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{1 - h(x_n)} \right] \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} \left[ p \log \frac{1}{q} + (1 - p) \log \frac{1}{1 - q} \right] \\
&= \arg\min_{w} \frac{1}{N} \sum_{n=1}^{N} H(\{p, 1 - p\}, \{q, 1 - q\}),
\end{aligned}$$
where $p = \mathbb{I}[y_n = +1]$ and $q = h(x_n)$. This completes the proof.
The equivalence between logistic regression loss and the cross-entropy loss, as proved above, shows that we always obtain identical weights w by minimizing the two losses. The equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that the binary logistic regression is a particular case of multi-class logistic regression when K = 2.
5 Derivative of multi-class LR
To optimize the multi-class LR by gradient descent, we now derive the derivative of softmax and cross entropy. The derivative of the loss function can thus be obtained by the chain rule.
5.1 Derivative of softmax
Let $p_i$ denote the $i$-th element of $\mathrm{softmax}(a)$. Then for $j = i$, we have
$$\begin{aligned}
\frac{\partial p_i}{\partial a_j} = \frac{\partial p_i}{\partial a_i} &= \frac{\partial}{\partial a_i} \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} & (23) \\
&= \frac{e^{a_i} \sum_{k=1}^{K} e^{a_k} - e^{2 a_i}}{\left(\sum_{k=1}^{K} e^{a_k}\right)^2} & (24) \\
&= \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \cdot \frac{\sum_{k=1}^{K} e^{a_k} - e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} & (25) \\
&= p_i (1 - p_i) & (26) \\
&= p_i (1 - p_j). & (27)
\end{aligned}$$
And for $j \neq i$,
$$\begin{aligned}
\frac{\partial p_i}{\partial a_j} &= \frac{\partial}{\partial a_j} \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} & (28) \\
&= \frac{0 - e^{a_i} e^{a_j}}{\left(\sum_{k=1}^{K} e^{a_k}\right)^2} & (29) \\
&= -\frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \cdot \frac{e^{a_j}}{\sum_{k=1}^{K} e^{a_k}} & (30) \\
&= -p_i p_j. & (31)
\end{aligned}$$
If we unify the two cases with the Kronecker delta, we will have
$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j), \qquad \text{where } \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$
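The unified expression gives the full Jacobian of the softmax in one line of NumPy. The sketch below (our own; `softmax_jacobian` is a hypothetical name) also checks it against a finite-difference approximation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def softmax_jacobian(a):
    """Jacobian dp_i/da_j = p_i * (delta_ij - p_j) of the softmax."""
    p = softmax(a)
    return np.diag(p) - np.outer(p, p)

# Check against a central finite-difference approximation.
a = np.array([0.2, -1.0, 0.5])
eps = 1e-6
numeric = np.column_stack([
    (softmax(a + eps * np.eye(3)[j]) - softmax(a - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(softmax_jacobian(a), numeric, atol=1e-6))  # True
```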
5.2 Derivative of cross entropy loss with softmax
The cross entropy loss is given by
$$L = -\sum_{i} y_i \log(p_i),$$
where $p_i = \mathrm{softmax}_i(a) = \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}}$ and $y_i$ denotes the $i$-th element of the one-hot vector. The derivative of the cross entropy with respect to $a_k$ is
$$\begin{aligned}
\frac{\partial L}{\partial a_k} &= -\sum_{i} y_i \frac{\partial \log(p_i)}{\partial a_k} & (32) \\
&= -\sum_{i} y_i \frac{\partial \log(p_i)}{\partial p_i} \cdot \frac{\partial p_i}{\partial a_k} & (33) \\
&= -\sum_{i} y_i \frac{1}{p_i} \cdot \frac{\partial p_i}{\partial a_k} & (34) \\
&= -\sum_{i} y_i \frac{1}{p_i} \cdot p_i (\delta_{ki} - p_k) & (35) \\
&= -y_k (1 - p_k) + \sum_{i \neq k} y_i p_k & (36) \\
&= p_k \sum_{i=1}^{K} y_i - y_k & (37) \\
&= p_k - y_k. & (38)
\end{aligned}$$