Logistic Regression: From Binary to Multi-Class
Yaochen Xie
Texas A&M University
College Station, TX 77843
ethanycx@tamu.edu
Shuiwang Ji
Texas A&M University
College Station, TX 77843
sji@tamu.edu
1 Introduction
This document aims to provide a complete, self-contained, and easy-to-understand introduction to multi-class logistic regression (LR). We start with a quick review of binary LR and then generalize it to the multi-class case. We further discuss the connections between binary LR and multi-class LR. This document is based on lecture notes by Shuiwang Ji and compiled by Yaochen Xie at Texas A&M University. It can be used for undergraduate- and graduate-level classes.
2 Binary Logistic Regression
Binary LR predicts the label $y_i \in \{-1, +1\}$ for a given sample $x_i$ by estimating a probability $P(y|x_i)$ and comparing it with a pre-defined threshold.
Recall that the sigmoid function is defined as
$$\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}, \tag{1}$$
where $s \in \mathbb{R}$ and $\theta$ denotes the sigmoid function. The sigmoid maps any value in $\mathbb{R}$ to a number in $(0, 1)$ and meanwhile preserves the order of any two inputs, since $\theta(\cdot)$ is a monotonically increasing function.
The probability is thus represented by
$$P(y|x) = \begin{cases} \theta(w^T x) & \text{if } y = +1 \\ 1 - \theta(w^T x) & \text{if } y = -1. \end{cases}$$
This can also be expressed compactly as
$$P(y|x) = \theta(y w^T x), \tag{2}$$
due to the fact that $\theta(-s) = 1 - \theta(s)$. Note that in the binary case, we only need to estimate one probability, as the probabilities for $+1$ and $-1$ sum to one.
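To make Equation (2) concrete, here is a minimal NumPy sketch; the weight vector and sample below are made-up values for illustration only.

```python
import numpy as np

def sigmoid(s):
    """Sigmoid function theta(s) = 1 / (1 + e^{-s}), as in Equation (1)."""
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical weight vector and sample, for illustration only.
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.4, 2.0])

# P(y | x) = theta(y * w^T x), using the compact form in Equation (2).
for y in (+1, -1):
    print(f"P(y={y:+d} | x) = {sigmoid(y * np.dot(w, x)):.4f}")
# The two printed probabilities sum to one, since theta(-s) = 1 - theta(s).
```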
3 Multi-Class Logistic Regression
Binary logistic regression assumes the label $y_i \in \{-1, +1\}$ ($i = 1, \cdots, N$), while in the multi-class case there are more than two classes, i.e., $y_i \in \{1, 2, \cdots, K\}$ ($i = 1, \cdots, N$), where $K$ is the number of classes and $N$ is the number of samples. In this case, we need to estimate the probability for each of the $K$ classes. The hypothesis in binary LR is hence generalized to the multi-class case as
$$h_w(x) = \begin{bmatrix} P(y = 1 \mid x; w) \\ P(y = 2 \mid x; w) \\ \vdots \\ P(y = K \mid x; w) \end{bmatrix}. \tag{3}$$
A critical assumption here is that there is no ordinal relationship among the classes. We therefore need one linear signal for each of the $K$ classes, independent of one another conditioned on $x$. As a result, in multi-class LR, we compute $K$ linear signals by taking the dot product between the input $x$ and $K$ independent weight vectors $w_k$, $k = 1, \cdots, K$, as
$$\begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix}. \tag{4}$$
So far, the only thing left to obtain the hypothesis is to map the $K$ linear outputs (a vector in $\mathbb{R}^K$) to the $K$ probabilities (a probability distribution over the $K$ classes).
3.1 Softmax
In order to accomplish such a mapping, we introduce the softmax function, which generalizes the sigmoid function and is defined as follows. Given a $K$-dimensional vector $v = [v_1, v_2, \cdots, v_K]^T \in \mathbb{R}^K$,
$$\mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^{K} e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}. \tag{5}$$
It is easy to verify that the softmax maps a vector in $\mathbb{R}^K$ to $(0, 1)^K$. All elements in the output vector of softmax sum to 1 and their order is preserved. Thus the hypothesis in (3) can be written as
$$h_w(x) = \begin{bmatrix} P(y = 1 \mid x; w) \\ P(y = 2 \mid x; w) \\ \vdots \\ P(y = K \mid x; w) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{w_k^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_K^T x} \end{bmatrix}. \tag{6}$$
In Section 4, we further discuss the connection between the softmax and sigmoid functions by showing that the sigmoid in binary LR is equivalent to the softmax in multi-class LR when $K = 2$.
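A minimal NumPy sketch of Equation (5) follows; the input values are made up. Subtracting the maximum before exponentiating is a common numerical-stability trick, and it leaves the output unchanged by the same argument as the shift-invariance property in Section 4.1; it is our addition, not part of the definition.

```python
import numpy as np

def softmax(v):
    """Softmax of a vector v in R^K, as in Equation (5).

    Subtracting max(v) does not change the output (see the
    shift-invariance property in Section 4.1) but avoids overflow.
    """
    e = np.exp(v - np.max(v))
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1])                  # made-up linear signals w_k^T x
p = softmax(v)
print(p, p.sum())                              # probabilities, summing to 1
print(np.all(np.argsort(v) == np.argsort(p)))  # order preserved: True
```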
3.2 Cross Entropy
As in binary LR, we optimize the multi-class LR by minimizing a loss (cost) function that measures the error between the predictions and the true labels. To this end, we introduce the cross entropy in Equation (7), which measures the distance between two probability distributions.
The cross entropy is defined by
$$H(P, Q) = -\sum_{i=1}^{K} p_i \log(q_i), \tag{7}$$
where $P = (p_1, \cdots, p_K)$ and $Q = (q_1, \cdots, q_K)$ are two probability distributions. In multi-class LR, the two distributions are the true distribution and the predicted vector in Equation (3), respectively.

Here the true distribution refers to the one-hot encoding of the label. For label $k$ ($k$ is the correct class), the one-hot encoding is defined as a vector whose element is 1 at index $k$ and 0 everywhere else.
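As a concrete illustration, the following sketch builds the one-hot vector and evaluates Equation (7); the class index and predicted distribution are made up, and the small `eps` guard against $\log(0)$ is our addition.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_i p_i log q_i, as in Equation (7).

    eps guards against log(0); it is a practical addition,
    not part of the mathematical definition.
    """
    return -np.sum(p * np.log(q + eps))

K = 4
k = 2                                   # made-up correct class
y = np.zeros(K)
y[k] = 1.0                              # one-hot encoding of label k
q = np.array([0.1, 0.2, 0.6, 0.1])      # made-up predicted distribution
print(cross_entropy(y, q))              # equals -log(0.6), since y is one-hot
```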
3.3 Loss Function
Now the loss for a training sample $x$ in class $c$ is given by
\begin{align*}
\mathrm{loss}(x, y; w) &= H(y, \hat{y}) \tag{8}\\
&= -\sum_k y_k \log \hat{y}_k \tag{9}\\
&= -\log \hat{y}_c \tag{10}\\
&= -\log \frac{e^{w_c^T x}}{\sum_{k=1}^{K} e^{w_k^T x}}, \tag{11}
\end{align*}
where $y$ denotes the one-hot vector and $\hat{y}$ is the predicted distribution $h(x_i)$. The loss on all samples $\{(x_i, y_i)\}_{i=1}^{N}$ is
$$\mathrm{loss}(X, Y; w) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \log \frac{e^{w_k^T x_i}}{\sum_{k'=1}^{K} e^{w_{k'}^T x_i}}. \tag{12}$$
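The following vectorized NumPy sketch evaluates Equation (12). It is a minimal illustration on randomly generated data; labels are 0-indexed here for convenience, and the stability shift is justified by the property in Section 4.1.

```python
import numpy as np

def multiclass_lr_loss(X, y, W):
    """Total loss in Equation (12).

    X: (N, d) samples; y: (N,) integer labels in {0, ..., K-1};
    W: (d, K) weight vectors w_1, ..., w_K stacked as columns.
    """
    S = X @ W                              # (N, K) linear signals w_k^T x_i
    S -= S.max(axis=1, keepdims=True)      # stability shift (Section 4.1)
    log_p = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].sum()

rng = np.random.default_rng(0)             # made-up data, for illustration
X = rng.normal(size=(5, 3))
y = rng.integers(0, 4, size=5)
W = rng.normal(size=(3, 4))
print(multiclass_lr_loss(X, y, W))
```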
4 Properties of Multi-class LR

4.1 Shift-invariance in Parameters
The softmax function in multi-class LR is invariant to shifts in the parameters: given the weights $w = (w_1, \cdots, w_K)$, if we subtract the same vector $u$ from each of the $K$ weight vectors, the outputs of the softmax function remain the same.
Proof. To prove this, let us denote $w' = \{w'_i\}_{i=1}^{K}$, where $w'_i = w_i - u$. We have
\begin{align*}
P(y = k \mid x; w') &= \frac{e^{(w_k - u)^T x}}{\sum_{i=1}^{K} e^{(w_i - u)^T x}} \tag{13}\\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\sum_{i=1}^{K} e^{w_i^T x} e^{-u^T x}} \tag{14}\\
&= \frac{e^{w_k^T x} e^{-u^T x}}{\left(\sum_{i=1}^{K} e^{w_i^T x}\right) e^{-u^T x}} \tag{15}\\
&= \frac{e^{w_k^T x}}{\sum_{i=1}^{K} e^{w_i^T x}} \tag{16}\\
&= P(y = k \mid x; w), \tag{17}
\end{align*}
which completes the proof.
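This property is easy to confirm numerically. In the sketch below, the weights, sample, and shift vector `u` are arbitrary made-up values; subtracting the same `u` from every weight vector leaves the predicted distribution unchanged.

```python
import numpy as np

def predict(W, x):
    """Class probabilities softmax(W^T x), with columns w_1, ..., w_K."""
    s = W.T @ x
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))      # K = 4 weight vectors of dimension 3
x = rng.normal(size=3)
u = rng.normal(size=3)           # arbitrary shift vector

p  = predict(W, x)
p2 = predict(W - u[:, None], x)  # subtract the same u from every w_k
print(np.allclose(p, p2))        # True: softmax is shift-invariant
```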
4.2 Equivalence to Sigmoid
Having proved shift-invariance, we can show that when $K = 2$, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two models are equivalent.
Proof.
\begin{align*}
h_w(x) &= \frac{1}{e^{w_1^T x} + e^{w_2^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \end{bmatrix} \tag{18}\\
&= \frac{1}{e^{(w_1 - w_1)^T x} + e^{(w_2 - w_1)^T x}} \begin{bmatrix} e^{(w_1 - w_1)^T x} \\ e^{(w_2 - w_1)^T x} \end{bmatrix} \tag{19}\\
&= \begin{bmatrix} \dfrac{1}{1 + e^{(w_2 - w_1)^T x}} \\[8pt] \dfrac{e^{(w_2 - w_1)^T x}}{1 + e^{(w_2 - w_1)^T x}} \end{bmatrix} \tag{20}\\
&= \begin{bmatrix} \dfrac{1}{1 + e^{-\tilde{w}^T x}} \\[8pt] \dfrac{e^{-\tilde{w}^T x}}{1 + e^{-\tilde{w}^T x}} \end{bmatrix} \tag{21}\\
&= \begin{bmatrix} h_{\tilde{w}}(x) \\ 1 - h_{\tilde{w}}(x) \end{bmatrix}, \tag{22}
\end{align*}
where $\tilde{w} = w_1 - w_2$. This completes the proof.
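The equivalence can also be checked numerically. In the sketch below, with two made-up weight vectors and a made-up sample, the two-class softmax output matches the sigmoid applied to $(w_1 - w_2)^T x$.

```python
import numpy as np

rng = np.random.default_rng(2)
w1, w2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

s = np.array([w1 @ x, w2 @ x])
p = np.exp(s) / np.exp(s).sum()           # softmax with K = 2

w_tilde = w1 - w2
sig = 1.0 / (1.0 + np.exp(-w_tilde @ x))  # sigmoid of (w1 - w2)^T x

print(np.allclose(p, [sig, 1.0 - sig]))   # True, as in Equation (22)
```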
4.3 Relations between Binary and Multi-class LR
In the assignment, we have already proved that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes. We show the proof again below.
Proof.
\begin{align*}
\arg\min_w E_{\mathrm{in}}(w) &= \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n w^T x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{I}[y_n = +1] \ln \frac{1}{P(y_n \mid x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{P(y_n \mid x_n)} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{I}[y_n = +1] \ln \frac{1}{h(x_n)} + \mathbb{I}[y_n = -1] \ln \frac{1}{1 - h(x_n)} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \left( p_n \log \frac{1}{q_n} + (1 - p_n) \log \frac{1}{1 - q_n} \right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^{N} H(\{p_n, 1 - p_n\}, \{q_n, 1 - q_n\}),
\end{align*}
where $p_n = \mathbb{I}[y_n = +1]$ and $q_n = h(x_n)$. This completes the proof.
The equivalence between the logistic regression loss and the cross-entropy loss, as proved above, shows that we always obtain identical weights $w$ by minimizing either loss. This equivalence, together with the equivalence between sigmoid and softmax, leads to the conclusion that binary logistic regression is a particular case of multi-class logistic regression with $K = 2$.
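As a numerical sanity check of this loss equivalence, the sketch below (with made-up weights and a made-up sample) confirms that the per-sample logistic loss $\ln(1 + e^{-y w^T x})$ coincides with the binary cross entropy between $\{p, 1-p\}$ and $\{q, 1-q\}$.

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=3)
x = rng.normal(size=3)

for y in (+1, -1):
    logistic = np.log1p(np.exp(-y * (w @ x)))        # logistic loss
    q = 1.0 / (1.0 + np.exp(-(w @ x)))               # h(x) = theta(w^T x)
    p = 1.0 if y == +1 else 0.0                      # p = I[y = +1]
    ce = -(p * np.log(q) + (1 - p) * np.log(1 - q))  # binary cross entropy
    print(np.isclose(logistic, ce))                  # True for both labels
```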
5 Derivative of Multi-class LR
To optimize the multi-class LR by gradient descent, we now derive the derivatives of the softmax and the cross entropy. The derivative of the loss function can then be obtained by the chain rule.
5.1 Derivative of Softmax
Let $p_i$ denote the $i$-th element of $\mathrm{softmax}(a)$. Then for $j = i$, we have
\begin{align*}
\frac{\partial p_i}{\partial a_j} = \frac{\partial p_i}{\partial a_i} &= \frac{\partial}{\partial a_i} \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \tag{23}\\
&= \frac{e^{a_i} \sum_{k=1}^{K} e^{a_k} - e^{2 a_i}}{\left(\sum_{k=1}^{K} e^{a_k}\right)^2} \tag{24}\\
&= \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \cdot \frac{\sum_{k=1}^{K} e^{a_k} - e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \tag{25}\\
&= p_i (1 - p_i) = p_i (1 - p_j). \tag{26}
\end{align*}
And for $j \neq i$,
\begin{align*}
\frac{\partial p_i}{\partial a_j} &= \frac{\partial}{\partial a_j} \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \tag{27}\\
&= \frac{0 - e^{a_i} e^{a_j}}{\left(\sum_{k=1}^{K} e^{a_k}\right)^2} \tag{28}\\
&= -\frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}} \cdot \frac{e^{a_j}}{\sum_{k=1}^{K} e^{a_k}} = -p_i p_j. \tag{29}
\end{align*}
If we unify the two cases with the Kronecker delta, we have
$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j), \tag{30}$$
where
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases} \tag{31}$$
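The closed form in Equation (30) can be cross-checked against finite differences. The sketch below builds the full Jacobian from Equation (30) and compares it numerically; the input vector `a` is a made-up example.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def softmax_jacobian(a):
    """Jacobian dp_i/da_j = p_i (delta_ij - p_j), as in Equation (30)."""
    p = softmax(a)
    return np.diag(p) - np.outer(p, p)

a = np.array([1.0, 2.0, 0.5])            # made-up input
J = softmax_jacobian(a)

# Central-difference check: column j approximates dp/da_j.
eps = 1e-6
J_num = np.stack(
    [(softmax(a + eps * np.eye(3)[j]) - softmax(a - eps * np.eye(3)[j]))
     / (2 * eps) for j in range(3)],
    axis=1,
)
print(np.allclose(J, J_num, atol=1e-6))  # True
```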
5.2 Derivative of Cross Entropy Loss with Softmax
The cross entropy loss is given by
$$L = -\sum_i y_i \log(p_i),$$
where $p_i = \mathrm{softmax}_i(a) = \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}}$ and $y_i$ denotes the $i$-th element of the one-hot vector. The derivative of the cross entropy is
\begin{align*}
\frac{\partial L}{\partial a_k} &= -\sum_i y_i \frac{\partial \log(p_i)}{\partial a_k} \tag{32}\\
&= -\sum_i y_i \frac{\partial \log(p_i)}{\partial p_i} \cdot \frac{\partial p_i}{\partial a_k} \tag{33}\\
&= -\sum_i y_i \cdot \frac{1}{p_i} \frac{\partial p_i}{\partial a_k} \tag{34}\\
&= -\sum_i y_i \cdot \frac{1}{p_i} \, p_i (\delta_{ki} - p_k) \tag{35}\\
&= -y_k (1 - p_k) + \sum_{i \neq k} y_i p_k \tag{36}\\
&= p_k \sum_{i=1}^{K} y_i - y_k \tag{37}\\
&= p_k - y_k. \tag{38}
\end{align*}
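The compact gradient $p_k - y_k$ is what makes gradient-descent training of multi-class LR so simple. Below is a minimal end-to-end training sketch on synthetic data; the problem size, learning rate, and iteration count are arbitrary choices for illustration, and labels are 0-indexed for convenience.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with the usual stability shift."""
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
N, d, K = 200, 3, 4                     # made-up problem size
X = rng.normal(size=(N, d))
W_true = rng.normal(size=(d, K))        # made-up ground-truth weights
y = np.argmax(X @ W_true, axis=1)       # labels 0..K-1
Y = np.eye(K)[y]                        # one-hot targets

W = np.zeros((d, K))
lr = 0.5
for _ in range(500):
    P = softmax_rows(X @ W)             # predicted distributions
    grad = X.T @ (P - Y) / N            # average gradient, via Equation (38)
    W -= lr * grad

# Training accuracy; should be high since the labels come from a linear model.
print((np.argmax(X @ W, axis=1) == y).mean())
```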