
Chapter 1

Sub-Gaussian Random Variables

1.1 GAUSSIAN TAILS AND MGF

Recall that a random variable $X \in \mathbb{R}$ has Gaussian distribution iff it has a density $p$ with respect to the Lebesgue measure on $\mathbb{R}$ given by

$$ p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\Big( -\frac{(x-\mu)^2}{2\sigma^2} \Big) , \qquad x \in \mathbb{R} , $$

where $\mu = \mathbb{E}(X) \in \mathbb{R}$ and $\sigma^2 = \mathrm{var}(X) > 0$ are the mean and variance of $X$. We write $X \sim \mathcal{N}(\mu, \sigma^2)$. Note that $X = \sigma Z + \mu$ for $Z \sim \mathcal{N}(0, 1)$ (called standard Gaussian), where the equality holds in distribution. Clearly, this distribution has unbounded support, but it is well known that it has almost bounded support in the following sense: $\mathbb{P}(|X - \mu| \le 3\sigma) \approx 0.997$. This is due to the fast decay of the tails of $p$ as $|x| \to \infty$ (see Figure 1.1). This decay can be quantified using the following proposition (Mills inequality).

Proposition 1.1. Let $X$ be a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. Then for any $t > 0$, it holds

$$ \mathbb{P}(X - \mu > t) \le \frac{\sigma}{t} \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{t^2}{2\sigma^2}} . $$

By symmetry we also have

$$ \mathbb{P}(X - \mu < -t) \le \frac{\sigma}{t} \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{t^2}{2\sigma^2}} , $$

and

$$ \mathbb{P}(|X - \mu| > t) \le \sqrt{\frac{2}{\pi}} \, \frac{\sigma}{t} \, e^{-\frac{t^2}{2\sigma^2}} . $$

Figure 1.1. Probabilities of falling within 1, 2, and 3 standard deviations of the mean in a Gaussian distribution.

Proof. Note that it is sufficient to prove the theorem for $\mu = 0$ and $\sigma^2 = 1$ by simple translation and rescaling. We get for $Z \sim \mathcal{N}(0, 1)$,

\begin{align*}
\mathbb{P}(Z > t) &= \frac{1}{\sqrt{2\pi}} \int_t^\infty e^{-\frac{x^2}{2}} \, \mathrm{d}x \\
&\le \frac{1}{\sqrt{2\pi}} \int_t^\infty \frac{x}{t} \, e^{-\frac{x^2}{2}} \, \mathrm{d}x \\
&= \frac{1}{t\sqrt{2\pi}} \exp(-t^2/2) .
\end{align*}

The second inequality follows from symmetry and the last one using the union bound:

$$ \mathbb{P}(|Z| > t) = \mathbb{P}\big( \{Z > t\} \cup \{Z < -t\} \big) \le \mathbb{P}(Z > t) + \mathbb{P}(Z < -t) = 2\,\mathbb{P}(Z > t) . $$
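As a quick sanity check, the following Python sketch (assuming numpy and scipy are available, neither of which appears in the original notes; the chosen values of t are illustrative) compares the exact standard Gaussian tail with the Mills bound $e^{-t^2/2}/(t\sqrt{2\pi})$.

# Numerical sanity check of Mills' inequality (Proposition 1.1) for the
# standard Gaussian: P(Z > t) <= exp(-t^2/2) / (t * sqrt(2*pi)).
import numpy as np
from scipy.stats import norm

for t in [0.5, 1.0, 2.0, 3.0, 4.0]:
    exact = norm.sf(t)                                    # exact tail P(Z > t)
    mills = np.exp(-t**2 / 2) / (t * np.sqrt(2 * np.pi))  # Mills bound
    print(f"t={t:3.1f}  exact={exact:.3e}  Mills bound={mills:.3e}")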

The fact that a Gaussian random variable Z has tails that decay to zero exponentially fast can also be seen in the moment generating function (MGF)

$$ M : s \mapsto M(s) = \mathbb{E}[\exp(sZ)] . $$


Indeed in the case of a standard Gaussian random variable, we have

\begin{align*}
M(s) = \mathbb{E}[\exp(sZ)] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{sz} \, e^{-\frac{z^2}{2}} \, \mathrm{d}z \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{(z-s)^2}{2} + \frac{s^2}{2}} \, \mathrm{d}z \\
&= e^{\frac{s^2}{2}} .
\end{align*}

It follows that if $X \sim \mathcal{N}(\mu, \sigma^2)$, then

$$ \mathbb{E}[\exp(sX)] = \exp\Big( s\mu + \frac{\sigma^2 s^2}{2} \Big) . $$
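For a concrete illustration, here is a minimal Monte Carlo sketch (Python with numpy is an assumption of this example, as are the particular values of $\mu$, $\sigma$ and $s$) checking the closed-form MGF against an empirical average of $\exp(sX)$.

# Monte Carlo check of the Gaussian MGF: for X ~ N(mu, sigma^2),
# E[exp(sX)] = exp(s*mu + sigma^2 * s^2 / 2).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, s = 1.0, 2.0, 0.3
X = rng.normal(mu, sigma, size=1_000_000)

empirical = np.exp(s * X).mean()
closed_form = np.exp(s * mu + sigma**2 * s**2 / 2)
print(empirical, closed_form)   # the two values should agree to a few digits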

1.2 SUB-GAUSSIAN RANDOM VARIABLES AND CHERNOFF BOUNDS

Definition and first properties

Gaussian tails are practical when controlling the tail of an average of independent random variables. Indeed, recall that if $X_1, \ldots, X_n$ are i.i.d. $\mathcal{N}(\mu, \sigma^2)$, then $\bar X = \frac{1}{n}\sum_{i=1}^n X_i \sim \mathcal{N}(\mu, \sigma^2/n)$. Using Lemma 1.3 below for example, we get

$$ \mathbb{P}(|\bar X - \mu| > t) \le 2 \exp\Big( -\frac{nt^2}{2\sigma^2} \Big) . $$

Equating the right-hand side with some confidence level $\delta > 0$, we find that with probability at least¹ $1 - \delta$,

$$ \mu \in \Big[ \bar X - \sigma\sqrt{\frac{2\log(2/\delta)}{n}} \,,\; \bar X + \sigma\sqrt{\frac{2\log(2/\delta)}{n}} \Big] . \qquad (1.1) $$

This is almost the confidence interval that you used in introductory statistics. The only difference is that we used an approximation for the Gaussian tail whereas statistical tables or software use a much more accurate computation. Figure 1.2 shows the ratio of the width of the confidence interval (1.1) to that of the confidence interval computed by the software R. It turns out that intervals of the same form can also be derived for non-Gaussian random variables as long as they have sub-Gaussian tails.
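The same comparison can be sketched numerically. The snippet below (Python with scipy, used here in place of the R computation mentioned above; the listed values of $\delta$ are arbitrary) computes the ratio of the half-width from (1.1) to the exact half-width based on the normal quantile.

# Ratio of the width of the sub-Gaussian interval (1.1) to the width of the
# exact Gaussian interval obtained from the normal quantile.  Both intervals
# are of the form Xbar +/- half_width, and the ratio does not depend on n or sigma.
import numpy as np
from scipy.stats import norm

n, sigma = 1.0, 1.0
for delta in [0.10, 0.05, 0.01, 0.001]:
    hw_subgauss = sigma * np.sqrt(2 * np.log(2 / delta) / n)   # from (1.1)
    hw_exact = sigma * norm.ppf(1 - delta / 2) / np.sqrt(n)    # exact quantile
    print(f"delta={delta:5.3f}  ratio={hw_subgauss / hw_exact:.3f}")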

Definition 1.2. A random variable $X \in \mathbb{R}$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E}[X] = 0$ and its moment generating function satisfies

$$ \mathbb{E}[\exp(sX)] \le \exp\Big( \frac{\sigma^2 s^2}{2} \Big) , \qquad \forall\, s \in \mathbb{R} . \qquad (1.2) $$

In this case we write $X \sim \mathrm{subG}(\sigma^2)$. Note that $\mathrm{subG}(\sigma^2)$ denotes a class of distributions rather than a distribution. Therefore, we abuse notation when writing $X \sim \mathrm{subG}(\sigma^2)$.

¹We will often omit the statement "at least" for brevity.

Figure 1.2. Width of confidence intervals from exact computation in R (red dashed) and (1.1) (solid black).

More generally, we can talk about sub-Gaussian random vectors and matrices. A random vector $X \in \mathbb{R}^d$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E}[X] = 0$ and $u^\top X$ is sub-Gaussian with variance proxy $\sigma^2$ for any unit vector $u \in \mathcal{S}^{d-1}$. In this case we write $X \sim \mathrm{subG}_d(\sigma^2)$. A random matrix $X \in \mathbb{R}^{d \times T}$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E}[X] = 0$ and $u^\top X v$ is sub-Gaussian with variance proxy $\sigma^2$ for any unit vectors $u \in \mathcal{S}^{d-1}$, $v \in \mathcal{S}^{T-1}$. In this case we write $X \sim \mathrm{subG}_{d \times T}(\sigma^2)$.

This property can equivalently be expressed in terms of bounds on the tail of the random variable X.

Lemma 1.3. Let $X \sim \mathrm{subG}(\sigma^2)$. Then for any $t > 0$, it holds

$$ \mathbb{P}[X > t] \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big) , \qquad \text{and} \qquad \mathbb{P}[X < -t] \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big) . \qquad (1.3) $$

Proof. Assume first that $X \sim \mathrm{subG}(\sigma^2)$. We will employ a very useful technique called the Chernoff bound that allows us to translate a bound on the moment generating function into a tail bound. Using Markov's inequality, we have for any $s > 0$,

$$ \mathbb{P}(X > t) \le \mathbb{P}\big( e^{sX} > e^{st} \big) \le \frac{\mathbb{E}\big[ e^{sX} \big]}{e^{st}} . $$

Next we use the fact that $X$ is sub-Gaussian to get

$$ \mathbb{P}(X > t) \le e^{\frac{\sigma^2 s^2}{2} - st} . $$


The above inequality holds for any $s > 0$, so to make it the tightest possible, we minimize with respect to $s > 0$. Solving $\phi'(s) = 0$, where $\phi(s) = \frac{\sigma^2 s^2}{2} - st$, we find that $\inf_{s>0} \phi(s) = -\frac{t^2}{2\sigma^2}$. This proves the first part of (1.3). The second inequality in this equation follows in the same manner (recall that (1.2) holds for any $s \in \mathbb{R}$).
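The minimization in this proof is elementary, and the following small Python sketch (the use of scipy.optimize and the particular values of $\sigma$ and $t$ are assumptions of the example) confirms numerically that the minimizer is $s = t/\sigma^2$ and the minimum is $-t^2/(2\sigma^2)$.

# Numerical check of the Chernoff optimization in the proof of Lemma 1.3:
# inf_{s>0} (sigma^2 s^2 / 2 - s t) = -t^2 / (2 sigma^2), attained at s = t / sigma^2.
from scipy.optimize import minimize_scalar

sigma, t = 1.5, 2.0

def phi(s):
    # Exponent appearing in the Chernoff bound
    return sigma**2 * s**2 / 2 - s * t

res = minimize_scalar(phi, bounds=(1e-8, 100.0), method="bounded")
print(res.x, t / sigma**2)                 # numerical minimizer vs. t / sigma^2
print(res.fun, -t**2 / (2 * sigma**2))     # numerical minimum vs. -t^2 / (2 sigma^2)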

Moments

Recall that the absolute moments of $Z \sim \mathcal{N}(0, \sigma^2)$ are given by

$$ \mathbb{E}[|Z|^k] = \frac{1}{\sqrt{\pi}} (2\sigma^2)^{k/2} \, \Gamma\Big( \frac{k+1}{2} \Big) , $$

where $\Gamma(\cdot)$ denotes the Gamma function defined by

$$ \Gamma(t) = \int_0^\infty x^{t-1} e^{-x} \, \mathrm{d}x , \qquad t > 0 . $$

The next lemma shows that the tail bounds of Lemma 1.3 are sufficient to show that the absolute moments of $X \sim \mathrm{subG}(\sigma^2)$ can be bounded by those of $Z \sim \mathcal{N}(0, \sigma^2)$ up to multiplicative constants.

Lemma 1.4. Let $X$ be a random variable such that

$$ \mathbb{P}[|X| > t] \le 2 \exp\Big( -\frac{t^2}{2\sigma^2} \Big) , $$

then for any positive integer $k \ge 1$,

$$ \mathbb{E}[|X|^k] \le (2\sigma^2)^{k/2} \, k \, \Gamma(k/2) . $$

In particular,

$$ \big( \mathbb{E}[|X|^k] \big)^{1/k} \le \sigma e^{1/e} \sqrt{k} , \qquad k \ge 2 , $$

and $\mathbb{E}[|X|] \le \sigma\sqrt{2\pi}$.

Proof.

\begin{align*}
\mathbb{E}[|X|^k] &= \int_0^\infty \mathbb{P}(|X|^k > t) \, \mathrm{d}t \\
&= \int_0^\infty \mathbb{P}(|X| > t^{1/k}) \, \mathrm{d}t \\
&\le \int_0^\infty 2 \exp\Big( -\frac{t^{2/k}}{2\sigma^2} \Big) \mathrm{d}t \\
&= (2\sigma^2)^{k/2} \, k \int_0^\infty e^{-u} u^{k/2 - 1} \, \mathrm{d}u \qquad \Big( u = \frac{t^{2/k}}{2\sigma^2} \Big) \\
&= (2\sigma^2)^{k/2} \, k \, \Gamma(k/2) .
\end{align*}


The second statement follows from $\Gamma(k/2) \le (k/2)^{k/2}$ and $k^{1/k} \le e^{1/e}$ for any $k \ge 2$. It yields

$$ \big( (2\sigma^2)^{k/2} \, k \, \Gamma(k/2) \big)^{1/k} \le k^{1/k} \sqrt{2\sigma^2 \, \frac{k}{2}} \le e^{1/e} \sigma \sqrt{k} . $$

Moreover, for $k = 1$, we have $\sqrt{2\sigma^2} \, \Gamma(1/2) = \sigma\sqrt{2\pi}$.
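Since a Gaussian $Z \sim \mathcal{N}(0, \sigma^2)$ satisfies the two-sided tail condition of Lemma 1.4, its exact absolute moments must lie below the lemma's bound. A minimal numerical check is sketched below (Python with scipy.special.gamma; the range of $k$ and the value of $\sigma$ are arbitrary choices).

# For Z ~ N(0, sigma^2), the tail condition of Lemma 1.4 holds, so the bound
# (2 sigma^2)^{k/2} k Gamma(k/2) must dominate the exact absolute moments
# E|Z|^k = (2 sigma^2)^{k/2} Gamma((k+1)/2) / sqrt(pi).
import numpy as np
from scipy.special import gamma

sigma = 1.0
for k in range(1, 9):
    exact = (2 * sigma**2) ** (k / 2) * gamma((k + 1) / 2) / np.sqrt(np.pi)
    bound = (2 * sigma**2) ** (k / 2) * k * gamma(k / 2)
    print(f"k={k}  E|Z|^k={exact:8.3f}  Lemma 1.4 bound={bound:9.3f}")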

Using moments, we can prove the following reciprocal to Lemma 1.3.

Lemma 1.5. If (1.3) holds, then for any $s > 0$, it holds

$$ \mathbb{E}[\exp(sX)] \le e^{4\sigma^2 s^2} . $$

As a result, we will sometimes write $X \sim \mathrm{subG}(\sigma^2)$ when it satisfies (1.3).

Proof. We use the Taylor expansion of the exponential function as follows. Observe that by the dominated convergence theorem,

\begin{align*}
\mathbb{E}\big[ e^{sX} \big] &\le 1 + \sum_{k=2}^\infty \frac{s^k \mathbb{E}[|X|^k]}{k!} \\
&\le 1 + \sum_{k=2}^\infty \frac{(2\sigma^2 s^2)^{k/2} \, k \, \Gamma(k/2)}{k!} \\
&= 1 + \sum_{k=1}^\infty \frac{(2\sigma^2 s^2)^{k} \, 2k \, \Gamma(k)}{(2k)!} + \sum_{k=1}^\infty \frac{(2\sigma^2 s^2)^{k+1/2} \, (2k+1) \, \Gamma(k+1/2)}{(2k+1)!} \\
&\le 1 + \Big( 2 + \sqrt{2\sigma^2 s^2} \Big) \sum_{k=1}^\infty \frac{(2\sigma^2 s^2)^{k} \, k!}{(2k)!} \\
&\le 1 + \Big( 1 + \sqrt{\frac{\sigma^2 s^2}{2}} \Big) \sum_{k=1}^\infty \frac{(2\sigma^2 s^2)^{k}}{k!} \qquad \big( 2(k!)^2 \le (2k)! \big) \\
&= e^{2\sigma^2 s^2} + \sqrt{\frac{\sigma^2 s^2}{2}} \big( e^{2\sigma^2 s^2} - 1 \big) \\
&\le e^{4\sigma^2 s^2} .
\end{align*}

From the above Lemma, we see that sub-Gaussian random variables can be equivalently defined from their tail bounds and their moment generating functions, up to constants.

Sums of independent sub-Gaussian random variables

Recall that if $X_1, \ldots, X_n$ are i.i.d. $\mathcal{N}(0, \sigma^2)$, then for any $a \in \mathbb{R}^n$,

$$ \sum_{i=1}^n a_i X_i \sim \mathcal{N}\big( 0, |a|_2^2 \sigma^2 \big) . $$


If we only care about the tails, this property is preserved for sub-Gaussian random variables.

Theorem 1.6. Let $X = (X_1, \ldots, X_n)$ be a vector of independent sub-Gaussian random variables that have variance proxy $\sigma^2$. Then, the random vector $X$ is sub-Gaussian with variance proxy $\sigma^2$.

Proof. Let $u \in \mathcal{S}^{n-1}$ be a unit vector, then

$$ \mathbb{E}\big[ e^{s u^\top X} \big] = \prod_{i=1}^n \mathbb{E}\big[ e^{s u_i X_i} \big] \le \prod_{i=1}^n e^{\frac{\sigma^2 s^2 u_i^2}{2}} = e^{\frac{\sigma^2 s^2 |u|_2^2}{2}} = e^{\frac{\sigma^2 s^2}{2}} . $$

Using a Chernoff bound, we immediately get the following corollary.

Corollary 1.7. Let $X_1, \ldots, X_n$ be $n$ independent random variables such that $X_i \sim \mathrm{subG}(\sigma^2)$. Then for any $a \in \mathbb{R}^n$, we have

$$ \mathbb{P}\Big[ \sum_{i=1}^n a_i X_i > t \Big] \le \exp\Big( -\frac{t^2}{2\sigma^2 |a|_2^2} \Big) , $$

and

$$ \mathbb{P}\Big[ \sum_{i=1}^n a_i X_i < -t \Big] \le \exp\Big( -\frac{t^2}{2\sigma^2 |a|_2^2} \Big) . $$

Of special interest is the case where $a_i = 1/n$ for all $i$. Then, we get that the average $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies

$$ \mathbb{P}(\bar X > t) \le e^{-\frac{nt^2}{2\sigma^2}} \qquad \text{and} \qquad \mathbb{P}(\bar X < -t) \le e^{-\frac{nt^2}{2\sigma^2}} , $$

just like for the Gaussian average.
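A small simulation illustrates this bound for averages of bounded variables. The sketch below (Python with numpy; the choices of $n$, $t$, and the Rademacher distribution, which is $\mathrm{subG}(1)$ as discussed with Hoeffding's lemma below, are assumptions of this example) compares the empirical tail of $\bar X$ with $e^{-nt^2/2}$.

# Monte Carlo illustration of the tail bound for the average of independent
# subG(1) variables: P(Xbar > t) <= exp(-n t^2 / 2).  Rademacher variables
# (P(X = 1) = P(X = -1) = 1/2) are subG(1); see Hoeffding's lemma below.
import numpy as np

rng = np.random.default_rng(1)
n, reps, t = 50, 200_000, 0.3
X = rng.choice([-1.0, 1.0], size=(reps, n))
Xbar = X.mean(axis=1)

empirical = (Xbar > t).mean()
bound = np.exp(-n * t**2 / 2)
print(empirical, bound)   # empirical tail probability vs. sub-Gaussian bound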

Hoeffding's inequality

The class of sub-Gaussian random variables is actually quite large. Indeed, Hoeffding's lemma below implies that all random variables that are bounded uniformly are actually sub-Gaussian with a variance proxy that depends on the size of their support.

Lemma 1.8 (Hoeffding's lemma (1963)). Let $X$ be a random variable such that $\mathbb{E}(X) = 0$ and $X \in [a, b]$ almost surely. Then, for any $s \in \mathbb{R}$, it holds

$$ \mathbb{E}[e^{sX}] \le e^{\frac{s^2 (b-a)^2}{8}} . $$

In particular, $X \sim \mathrm{subG}\big( \frac{(b-a)^2}{4} \big)$.


Proof. Define $\psi(s) = \log \mathbb{E}[e^{sX}]$, and observe that we can readily compute

$$ \psi'(s) = \frac{\mathbb{E}[X e^{sX}]}{\mathbb{E}[e^{sX}]} , \qquad \psi''(s) = \frac{\mathbb{E}[X^2 e^{sX}]}{\mathbb{E}[e^{sX}]} - \Big( \frac{\mathbb{E}[X e^{sX}]}{\mathbb{E}[e^{sX}]} \Big)^2 . $$

Thus $\psi''(s)$ can be interpreted as the variance of the random variable $X$ under the probability measure $\mathrm{d}\mathbb{Q} = \frac{e^{sX}}{\mathbb{E}[e^{sX}]} \, \mathrm{d}\mathbb{P}$. But since $X \in [a, b]$ almost surely, we have, under any probability,

$$ \mathrm{var}(X) = \mathrm{var}\Big( X - \frac{a+b}{2} \Big) \le \mathbb{E}\Big[ \Big( X - \frac{a+b}{2} \Big)^2 \Big] \le \frac{(b-a)^2}{4} . $$

The fundamental theorem of calculus yields

$$ \psi(s) = \int_0^s \int_0^{\mu} \psi''(\rho) \, \mathrm{d}\rho \, \mathrm{d}\mu \le \frac{s^2 (b-a)^2}{8} , $$

using $\psi(0) = \log 1 = 0$ and $\psi'(0) = \mathbb{E}X = 0$.
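Because the MGF of a centered Bernoulli variable is available in closed form, Hoeffding's lemma can be checked exactly on a grid of values of $s$; the following Python sketch does so (the choice $p = 0.3$ and the grid are arbitrary assumptions of the example).

# Exact check of Hoeffding's lemma for a centered Bernoulli(p) variable
# X = B - p, which takes values in [a, b] = [-p, 1-p] and has mean zero:
# its MGF should be bounded by exp(s^2 (b - a)^2 / 8) = exp(s^2 / 8).
import numpy as np

p = 0.3
s_grid = np.linspace(-5, 5, 101)
mgf = p * np.exp(s_grid * (1 - p)) + (1 - p) * np.exp(-s_grid * p)
bound = np.exp(s_grid**2 / 8)
print(bool(np.all(mgf <= bound)))   # True: the bound holds on the whole grid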

Using a Chernoff bound, we get the following (extremely useful) result.

Theorem 1.9 (Hoeffding's inequality). Let $X_1, \ldots, X_n$ be $n$ independent random variables such that almost surely,

$$ X_i \in [a_i, b_i] , \qquad \forall\, i . $$

Let $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$, then for any $t > 0$,

$$ \mathbb{P}\big( \bar X - \mathbb{E}(\bar X) > t \big) \le \exp\Big( -\frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big) , $$

and

$$ \mathbb{P}\big( \bar X - \mathbb{E}(\bar X) < -t \big) \le \exp\Big( -\frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big) . $$
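The following Monte Carlo sketch (Python with numpy; the Uniform[0, 1] example and the values of $n$ and $t$ are assumptions of this illustration) compares the empirical tail of $\bar X - \mathbb{E}(\bar X)$ with the Hoeffding bound, which here reduces to $e^{-2nt^2}$.

# Monte Carlo illustration of Hoeffding's inequality (Theorem 1.9) for i.i.d.
# Uniform[0, 1] variables: a_i = 0, b_i = 1, so the bound is exp(-2 n t^2).
import numpy as np

rng = np.random.default_rng(2)
n, reps, t = 100, 200_000, 0.1
X = rng.uniform(0.0, 1.0, size=(reps, n))
dev = X.mean(axis=1) - 0.5            # Xbar - E(Xbar)

empirical = (dev > t).mean()
bound = np.exp(-2 * n * t**2)
print(empirical, bound)   # empirical tail probability vs. Hoeffding bound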

Note that Hoeffding's lemma holds for any bounded random variable. For example, if one knows that $X$ is a Rademacher random variable, then

$$ \mathbb{E}(e^{sX}) = \frac{e^s + e^{-s}}{2} = \cosh(s) \le e^{\frac{s^2}{2}} . $$

Note that $2$ is the best possible constant in the above approximation. For such variables, $a = -1$, $b = 1$, $\mathbb{E}(X) = 0$, so Hoeffding's lemma yields

$$ \mathbb{E}(e^{sX}) \le e^{\frac{s^2}{2}} . $$

Hoeffding's inequality is very general but there is a price to pay for this generality. Indeed, if the random variables have small variance, we would like to see it reflected in the exponential tail bound (like in the Gaussian case), but the variance does not appear in Hoeffding's inequality. We need a more refined inequality.
