Chapter 11
Convergence in Distribution

1. Weak convergence in metric spaces
2. Weak convergence in R
3. Tightness and subsequences
4. Metrizing weak convergence
5. Characterizing weak convergence in spaces of functions
Chapter 11
Convergence in Distribution
1 Weak convergence in metric spaces
Suppose that (M,d) is a metric space, and let M denote the Borel sigma-field (the sigma-field generated by the open sets in M). Let Cb(M) denote the set of all real-valued, bounded continuous functions on M, and let Cu(M) denote the set of all real-valued, bounded uniformly continuous functions on M.

Definition 1.1 (weak convergence) If {Pn}, P are probability measures on (M,M) satisfying
∫ f dPn → ∫ f dP as n → ∞ for all f ∈ Cb(M),
then we say that Pn converges in distribution (or law) to P, or that Pn converges weakly to P, and we write Pn →d P or Pn ⇒ P. Similarly, if {Xn} are random elements in M (i.e. measurable maps from some probability space(s) (Ω,A,Pr) (or (Ωn,An,Prn)) to (M,M)) with
Ef(Xn) → Ef(X) for all f ∈ Cb(M),
then we write Xn →d X or Xn ⇒ X.

Definition 1.2 (boundary and P-continuity set) For any set B ∈ M, the boundary of B is ∂B ≡ B̄ \ B°, where B̄ is the closure of B and B° is the interior of B, i.e. the largest open set contained in B. A set B is called a continuity set of P if P(∂B) = 0.
Definition 1.3 (Bounded Lipschitz functions) A real-valued function f on a metric space (M,d) is said to satisfy a Lipschitz condition if there exists a finite constant K for which
|f(x) − f(y)| ≤ K d(x,y) for all x,y ∈ M.
We write BL(M) for the vector space of all bounded Lipschitz functions on M.

We can characterize the space BL(M) in terms of a norm ‖f‖BL defined for all real-valued functions f on M as follows:
‖f‖BL ≡ max{K1(f), 2K2(f)}
where
K1(f) ≡ sup_{x≠y} |f(x) − f(y)|/d(x,y),  K2(f) ≡ sup_x |f(x)|.
Here we have followed Pollard (2002), who deviates from the usual definition of ‖f‖BL in order to obtain the following nice inequality:
|f(x) − f(y)| ≤ ‖f‖BL {1 ∧ d(x,y)} for all x,y ∈ M.
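As a quick numerical sanity check (a sketch, not part of the original notes: the function f and the random test points below are illustrative choices), the inequality can be verified directly for a concrete f ∈ BL(R):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative f on (R, |.|): f(x) = (1 - |x|)^+ has Lipschitz constant
# K1(f) = 1 and sup-norm K2(f) = 1, so ||f||_BL = max{K1, 2 K2} = 2.
f = lambda x: np.maximum(0.0, 1.0 - np.abs(x))
bl_norm = max(1.0, 2 * 1.0)

x, y = rng.uniform(-3, 3, size=(2, 100_000))
lhs = np.abs(f(x) - f(y))
rhs = bl_norm * np.minimum(1.0, np.abs(x - y))
print("inequality holds everywhere:", bool(np.all(lhs <= rhs + 1e-12)))
```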
Definition 1.4 (Lower and upper semicontinuous functions) A function f : M → R is said
to be lower semicontinuous (or LSC) if {x : f(x) > t} is an open set for each fixed t. A function
f is said to be upper semicontinuous (or USC) if {x : f(x) < t} is open for each fixed t.
Thus f is USC if and only if −f is LSC. If f is both USC and LSC then it is continuous. The
basic example of a lower semicontinuous function is the indicator function 1B of an open set B;
the basic example of an upper semicontinuous function is the indicator function 1B of a closed set
B. Our first theorem will use the following result connecting lower semicontinuous functions to
functions in BL(M).
Lemma 1.1 (LSC approximation) Let g be a lower semicontinuous function bounded from below on a metric space M. Then there exists a sequence {fm}m≥1 ⊂ BL(M) satisfying fm(x) ↑ g(x) for each x ∈ M.

Proof. We may assume that g ≥ 0 without loss of generality (if not, replace g by g + supx(−g(x))). For each t > 0 the set Bt ≡ {x : g(x) ≤ t} is closed. The functions fk,t(x) ≡ t ∧ (k d(x,Bt)) for k ∈ N are in BL(M) and satisfy fk,t(x) ↑ t 1_{Bt^c}(x) = t 1_{[g(x)>t]} as k → ∞, since d(x,Bt) > 0 if and only if g(x) > t.

Now consider the countable collection G ≡ ∪_{k∈N} ∪_{t∈Q⁺} {fk,t}, where Q⁺ is the set of positive rationals. The pointwise supremum of G is g. If we enumerate G as {g1, g2, . . .} and then define fm ≡ max_{j≤m} gj, it follows that fm is in BL(M) for each m and fm ↑ g. □
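To make the construction concrete, here is a small numerical sketch (all choices are illustrative assumptions: g is taken to be the indicator of the open interval (0,1) in R, which is LSC): the functions fk,t = t ∧ (k d(·, Bt)) increase pointwise to t·1[g > t] as k grows.

```python
# For g = 1_{(0,1)} and 0 < t < 1, the set B_t = {x : g(x) <= t} is the
# closed set (-inf, 0] ∪ [1, inf), and d(x, B_t) = max(0, min(x, 1 - x)).
def dist_to_Bt(x):
    return max(0.0, min(x, 1.0 - x))

t = 0.9
for x0 in (0.01, 0.3):
    # f_{k,t}(x0) = t ∧ (k d(x0, B_t)) increases to t = t·1[g(x0) > t].
    vals = [round(min(t, k * dist_to_Bt(x0)), 3) for k in (1, 10, 100, 1000)]
    print(f"x = {x0}: f_(k,t)(x) for k = 1, 10, 100, 1000 -> {vals} (limit {t})")
```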
Our first result gives a number of equivalences to the definition of weak convergence given in
Definition 1.1.
Theorem 1.1 (portmanteau theorem) For probability measures {Pn}, P on (M,M) the following are equivalent:
(i) ∫ f dPn → ∫ f dP for all f ∈ Cb(M); i.e. Pn →d P.
(ii) ∫ f dPn → ∫ f dP for all f ∈ Cu(M).
(iii) ∫ f dPn → ∫ f dP for all f ∈ BL(M).
(iv) lim sup_{n→∞} ∫ f dPn ≤ ∫ f dP for every upper semicontinuous f bounded from above.
(v) lim inf_{n→∞} ∫ f dPn ≥ ∫ f dP for every lower semicontinuous f bounded from below.
(vi) lim sup_{n→∞} Pn(B) ≤ P(B) for all closed sets B ∈ M.
(vii) lim inf_{n→∞} Pn(B) ≥ P(B) for all open sets B ∈ M.
(viii) lim_{n→∞} Pn(B) = P(B) for all P-continuity sets B ∈ M.
(ix) lim_{n→∞} ∫ f dPn = ∫ f dP for all bounded measurable functions f with P(Cf) = 1, where Cf denotes the set of continuity points of f.
Proof. Clearly (i) implies (ii) and (ii) implies (iii) since BL(M) ⊂ Cu(M) ⊂ Cb(M). We also note that (iv) and (v) are equivalent since −f is lower semicontinuous and bounded from below if f is upper semicontinuous and bounded from above. Similarly, (vi) and (vii) are equivalent by taking complements. Since the indicator function of an open set is lower semicontinuous and bounded from below, (v) implies (vii) (and similarly, (iv) implies (vi)).
Now we use Lemma 1.1 to show that (iii) implies (v): suppose that (iii) holds, and let g be a LSC function bounded from below. By Lemma 1.1 there exists a sequence {fm} in BL(M) with fm ↑ g pointwise. Then, for each fixed m we have
lim inf_n ∫ g dPn ≥ lim inf_n ∫ fm dPn = ∫ fm dP
since ∫ fm dPn → ∫ fm dP by (iii). Take the supremum over m; by the monotone convergence theorem the right side in the last display converges to ∫ g dP, and thus (v) holds.

To see that (vi) and (vii) imply (viii), let B be a P-continuity set. Then since B° is open and B̄ is closed,
P(B°) ≤ lim inf Pn(B°) ≤ lim inf Pn(B) ≤ lim sup Pn(B) ≤ lim sup Pn(B̄) ≤ P(B̄).
Since B is a P-continuity set, P(∂B) = 0 and P(B̄) = P(B°), so the extreme terms in the last display are equal and hence lim Pn(B) = P(B).

Next we show that (viii) implies (vi): Let B be a closed set and suppose that (viii) holds. Since ∂{x : d(x,B) ≤ δ} ⊂ {x : d(x,B) = δ}, the boundaries are disjoint for different δ > 0, and hence at most countably many of them can have positive P-measure. Therefore for some sequence δk → 0 the sets Bk ≡ {x : d(x,B) < δk} are P-continuity sets, and Bk ↓ B since B is closed. It follows that
lim sup_n Pn(B) ≤ lim sup_n Pn(Bk) = P(Bk) since Pn(Bk) → P(Bk)
by (viii). Letting k → ∞ yields (vi).
Now we show that (vi) implies (i). Suppose that (vi) holds and fix f ∈ Cb(M). Without loss of generality we can transform f so that 0 < f(x) < 1 for all x ∈ M. Fix k ≥ 1 and define the closed sets
Bj ≡ {x ∈ M : j/k ≤ f(x)} for j = 0, . . . , k.
Then it follows that
∑_{j=1}^k ((j−1)/k) P(B_{j−1} ∩ B_j^c) ≤ ∫ f dP ≤ ∑_{j=1}^k (j/k) P(B_{j−1} ∩ B_j^c).
Rewriting the sum on the right side and summing by parts gives
∑_{j=1}^k (j/k){P(B_{j−1}) − P(B_j)} = 1/k + (1/k) ∑_{j=1}^k P(B_j),
which, together with a similar summation by parts on the left side, yields
(1/k) ∑_{j=1}^k P(B_j) ≤ ∫ f dP ≤ 1/k + (1/k) ∑_{j=1}^k P(B_j).
Since the sets Bj are closed, it follows from the last display (also used with P replaced by Pn throughout) and (vi) that
lim sup_n ∫ f dPn ≤ lim sup_n {1/k + (1/k) ∑_{j=1}^k Pn(Bj)} ≤ 1/k + (1/k) ∑_{j=1}^k P(Bj) ≤ 1/k + ∫ f dP.
Letting k → ∞ gives
lim sup_n ∫ f dPn ≤ ∫ f dP.
Applying this last conclusion to −f yields
lim inf_n ∫ f dPn ≥ ∫ f dP.
Combining these last two displays yields (i).
Since (ix) implies (viii) by taking f = 1B, it remains only to show that (iv) (and (v), since (iv) and (v) are equivalent) implies (ix). Suppose that f is a bounded measurable function and suppose that (iv) holds; without loss of generality we may assume that 0 ≤ f ≤ 1. Define the lower semicontinuous function f̊ and the upper semicontinuous function f̄ by
f̊ ≡ sup{g : g ≤ f, g LSC},  f̄ ≡ inf{g : g ≥ f, g USC}.
Note that this notation is sensible: if we take f = 1B for a Borel set B, then (1B)̊ = 1_{B°} and (1B)‾ = 1_{B̄}. Also note that f̊ ≤ f ≤ f̄. We claim that
(a)  Ef ≡ {x : f̊(x) = f̄(x)} = {x : f is continuous at x} ≡ Cf.
At any x for which f̊(x) = f(x), the set {y : f̊(y) > f(x) − ε} is an open neighborhood of x, and on this neighborhood f(y) > f(x) − ε. Similarly, if f̄(x) = f(x), there exists a neighborhood of x on which f(y) < f(x) + ε. Putting these together shows that f is continuous at each point of {x : f̄(x) = f̊(x)}; i.e. Ef ⊂ Cf. To see the reverse inclusion, note that if f is continuous at x, then for each ε > 0 there is an open set G containing x for which |f(y) − f(x)| < ε for all y ∈ G. Then it follows that
(f(x) − ε) 1_G(y) − 2·1_{G^c}(y) ≤ f(y) ≤ (f(x) + ε) 1_G(y) + 2·1_{G^c}(y),
where the lower bound is LSC, the upper bound is USC, and the two bounds differ by 2ε at x. This shows that f̄(x) − f̊(x) ≤ 2ε, and since ε > 0 was arbitrary, f̄(x) = f̊(x). This shows that Ef ⊃ Cf and completes the proof of (a).

Now by (a) together with (iv) and (v) we have (using the abbreviated notation Pf ≡ ∫ f dP)
P f̊ ≤ lim inf_n Pn f̊ ≤ lim inf_n Pn f ≤ lim sup_n Pn f ≤ lim sup_n Pn f̄ ≤ P f̄.
Since P(Cf) = 1 by hypothesis, it follows from (a) that P f̊ = Pf = P f̄. We thus conclude that (ix) holds. □
The last part of the portmanteau theorem, part (ix), has an important consequence: weak convergence is preserved under a map T to another metric space (M′,d′) which is continuous at a sufficiently large set of points with respect to the limit measure P. This is the Mann-Wald or continuous mapping theorem.

Theorem 1.2 (Continuous mapping) Suppose that T is an M\M′-measurable mapping from (M,d) into another metric space (M′,d′) with Borel sigma-field M′. Suppose that T is continuous at each point of a measurable subset CT ⊂ M. If P(CT) = 1, then Pn ∘ T⁻¹ →d P ∘ T⁻¹; equivalently, if Xn ∼ Pn, X ∼ P are random elements in (M,d), then T(Xn) →d T(X) in (M′,d′) provided P(X ∈ CT) = 1.

Proof. Let g ∈ Cb(M′). Then
∫ g d(Pn ∘ T⁻¹) = ∫ g(T) dPn,
where g(T) = g ∘ T : M → R is bounded and continuous a.e. P since P(CT) = 1. It therefore follows from (ix) of the portmanteau theorem that
∫ g d(Pn ∘ T⁻¹) = ∫ g(T) dPn → ∫ g(T) dP = ∫ g d(P ∘ T⁻¹). □
2 Weak convergence in R and Rk
Weak convergence in R
When the metric space M is R, further equivalences can be added to those given in the portmanteau theorem, Theorem 1.1. In particular we can add smoothness restrictions to the functions f involved (restrictions that only make sense for functions defined on R). The following proposition is one such result in this direction.
Proposition 2.1 Suppose that {X, Xn} are real-valued random variables, and suppose further that Ef(Xn) → Ef(X) for each f ∈ C∞(R), the class of all bounded functions with bounded derivatives of all orders. Then Xn →d X.

Proof. Let Z ∼ N(0,1). For a fixed f ∈ BL(R) and σ > 0, define a smoothed function fσ by convolution:
fσ(x) = Ef(x + σZ) = (1/(√(2π)σ)) ∫_{−∞}^{∞} exp(−(x − y)²/(2σ²)) f(y) dy.
Note that fσ ∈ C∞(R) (since we can justify repeated differentiation under the integral sign via the dominated convergence theorem), and fσ converges uniformly to f since
|fσ(x) − f(x)| ≤ E|f(x + σZ) − f(x)| ≤ ‖f‖BL E{1 ∧ σ|Z|} → 0
as σ ↘ 0 by the dominated convergence theorem.

Suppose that ε > 0 is given. Fix σ > 0 so that sup_x |fσ(x) − f(x)| ≤ ε. Then
|Ef(Xn) − Ef(X)| ≤ |Efσ(Xn) − Efσ(X)| + 2ε,
so that
lim sup_n |Ef(Xn) − Ef(X)| ≤ 2ε
since fσ ∈ C∞(R) and hence Efσ(Xn) → Efσ(X) by hypothesis. Since ε was arbitrary, Ef(Xn) → Ef(X) for every f ∈ BL(R), and Xn →d X follows from part (iii) of the portmanteau theorem. □
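Here is a small Monte Carlo sketch of the smoothing step (illustrative assumptions: f(x) = |x| ∧ 1 as the bounded Lipschitz test function, and a simple sample-average approximation of fσ):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.minimum(np.abs(x), 1.0)   # f in BL(R): Lipschitz 1, bounded by 1
z = rng.standard_normal(200_000)           # draws of Z ~ N(0,1)
x = np.linspace(-3, 3, 121)

for sigma in (1.0, 0.3, 0.1):
    # f_sigma(x) = E f(x + sigma Z), approximated by a sample average;
    # sup_x |f_sigma - f| <= ||f||_BL E{1 ∧ sigma|Z|} -> 0 as sigma -> 0.
    f_sigma = np.array([f(xi + sigma * z).mean() for xi in x])
    print(f"sigma = {sigma:3.1f}: sup|f_sigma - f| ≈ {np.abs(f_sigma - f(x)).max():.4f}")
```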
Here is another proposition of this type giving further equivalences:
Proposition 2.2 Suppose that {X, Xn} are real-valued random variables. Then the following are equivalent:
(i) Fn(x) = P(Xn ≤ x) → P(X ≤ x) = F(x) for all x with P(X = x) = 0 (i.e. all P-continuity sets of the form (−∞, x]).
(ii) Xn →d X; i.e. Ef(Xn) → Ef(X) for all f ∈ Cb(R).
(iii) Ef(Xn) → Ef(X) for all f ∈ C3(R).
(iv) Ef(Xn) → Ef(X) for all f ∈ C∞(R).
(v) E exp(itXn) → E exp(itX) for all t ∈ R.

Proof. We have proved that (iv) implies (ii), and the reverse implication is trivially true. Since C∞(R) ⊂ C3(R) ⊂ Cb(R), the equivalences with (iii) follow easily. For the equivalence of (i) and (ii) see Exercise xx. The equivalence of (v) and (ii) will be established in Chapter 12. □

On the real line R we can metrize weak convergence in terms of the distribution functions: the metric that does this is the Lévy metric λ.

Proposition 2.3 (Lévy metric) For any distribution functions F and G define
λ(F,G) ≡ inf{ε > 0 : F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε for all x ∈ R}.
Then λ is a metric. Moreover, the set of all distribution functions under λ is a complete separable metric space. Also Fn →d F as n → ∞ if and only if λ(Fn, F) → 0 as n → ∞.

Proof. See Problem 6.5. □
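The Lévy metric is easy to approximate numerically straight from the definition; the following sketch (the bisection tolerance, the grid, and the two normal distribution functions are illustrative assumptions) computes λ(F,G) for two normal cdfs:

```python
import numpy as np
from scipy.stats import norm

def levy_distance(F, G, lo=-10.0, hi=10.0, tol=1e-4):
    """Approximate λ(F,G) = inf{eps > 0 : F(x-eps)-eps <= G(x) <= F(x+eps)+eps for all x}."""
    x = np.linspace(lo, hi, 4001)
    def ok(eps):
        return np.all(F(x - eps) - eps <= G(x)) and np.all(G(x) <= F(x + eps) + eps)
    lo_e, hi_e = 0.0, 1.0     # λ <= 1 always, and ok() is monotone in eps, so bisect
    while hi_e - lo_e > tol:
        mid = (lo_e + hi_e) / 2
        lo_e, hi_e = (lo_e, mid) if ok(mid) else (mid, hi_e)
    return hi_e

F, G = norm(0.0, 1.0).cdf, norm(0.5, 1.0).cdf
print(f"λ(F, G) ≈ {levy_distance(F, G):.4f}")
```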
Our goal now is to use part (iii) of Proposition 2.2 to prove several basic central limit theorems
using the method of Lindeberg. The proofs will use the following “replacement inequality”.
Proposition 2.4 (Lindeberg replacement inequality) Suppose that X and Y are independent random variables with E|Y|³ < ∞, and suppose that W is another random variable independent of X with E|W|³ < ∞. Suppose further that EY = EW and EY² = EW². Then for f ∈ C3(R)
|Ef(X + Y) − Ef(X + W)| ≤ C (E|Y|³ + E|W|³)
where C = (1/6) sup_x |f′′′(x)|. In particular, when W ∼ N(µ, σ²),
|Ef(X + Y) − Ef(X + W)| ≤ C1 E|Y|³
where C1 ≡ (5 + 4E|Z|³)C ≐ (11.3831 . . .)C with Z ∼ N(0,1) and
E|Z|³ = 2(2π)^{−1/2} ∫_0^∞ z³ e^{−z²/2} dz = 4(2π)^{−1/2} ≐ 1.59577 . . . .

Proof. Fix f ∈ C3(R); by Taylor's theorem
f(x + y) = f(x) + y f′(x) + (1/2) y² f′′(x) + R(x,y)
where R(x,y) = y³ f′′′(x*)/6 for some x* satisfying |x* − x| ≤ |y|. Therefore it follows that
(a)  |R(x,y)| ≤ C|y|³ for all x, y.
Thus for any two random variables X and Y
Ef(X + Y) = Ef(X) + E(Y f′(X)) + (1/2) E(Y² f′′(X)) + E R(X,Y).
Using independence of X and Y and the bound (a), it follows that
|Ef(X + Y) − Ef(X) − E(Y)E(f′(X)) − (1/2) E(Y²)E(f′′(X))| ≤ C E|Y|³.
Since the same inequality holds with Y replaced by W for another random variable W independent of X with E|W|³ < ∞, if Y and W have E(Y) = E(W) and E(Y²) = E(W²), then we can subtract and, via cancellation of the first and second moment terms, conclude that
(b)  |Ef(X + Y) − Ef(X + W)| ≤ C (E|Y|³ + E|W|³).
When W ∼ N(µ, σ²) we can further bound E|W|³: since Z ≡ (W − µ)/σ ∼ N(0,1) we can write W = µ + σZ. Then by the Cr-inequality (with r = 3)
E|W|³ ≤ 2^{3−1}{|µ|³ + σ³ E|Z|³}
≤ 4{|E(Y)|³ + {E(Y²)}^{3/2} E|Z|³}
≤ 4{E|Y|³ + E|Y|³ E|Z|³} = (4 + 4E|Z|³) E|Y|³
where the last inequality follows from Jensen's inequality used twice. Combining the last display with (b) yields the second inequality of the proposition. □
Now suppose that ξ1, . . . , ξk are independent random variables with
µi ≡ Eξi,  σi² ≡ Var(ξi),  E|ξi|³ < ∞.
Suppose that {ηi} are independent and independent of the collection {ξi} with ηi ∼ N(µi, σi²) for i = 1, . . . , k. Define
Sk = ξ1 + · · · + ξk,  Tk = η1 + · · · + ηk.
Note that Tk ∼ N(E(Tk), Var(Tk)) = N(∑_{j=1}^k µj, ∑_{j=1}^k σj²). Now we set up notation to apply Proposition 2.4: we define, for each i,
Xi ≡ ξ1 + · · · + ξ_{i−1} + η_{i+1} + · · · + ηk,
Yi ≡ ξi,
Wi ≡ ηi.
By independence of the 2k random variables {ξi} and {ηi} it follows that Xi, Yi, and Wi are independent for each i. From the second bound of Proposition 2.4 it follows that
|Ef(Xi + Yi) − Ef(Xi + Wi)| ≤ C1 E|ξi|³,  1 ≤ i ≤ k.
Also note that the definitions yield Xk + Yk = Sk and X1 + W1 = Tk. Each replacement of a Yi by a Wi gives sums Xi + Yi and Xi + Wi with one more normal random variable ηi, and taken together the k replacements result in replacing all the non-Gaussian variables ξi by the Gaussian random variables ηi to get Tk. The total change in expected value is therefore bounded by a sum of third moment terms. Here are the details: since Xj + Wj = X_{j−1} + Y_{j−1} for j = 2, . . . , k,
(1)  |Ef(Sk) − Ef(Tk)| = |Ef(Xk + Yk) − Ef(X1 + W1)|
= |∑_{j=1}^k (Ef(Xj + Yj) − Ef(Xj + Wj))|
≤ ∑_{j=1}^k |Ef(Xj + Yj) − Ef(Xj + Wj)|
≤ C1 (E|ξ1|³ + · · · + E|ξk|³).

We will state the resulting theorem in terms of a triangular array of row-wise independent random variables {ξn,i : i = 1, . . . , kn, n ∈ N} where n ↦ kn is non-decreasing:
ξ1,1, ξ1,2, . . . , ξ1,k1
ξ2,1, ξ2,2, . . . , ξ2,k2
ξ3,1, ξ3,2, . . . , ξ3,k3
. . .
We assume that the random variables in each row are independent, but nothing is assumed about relationships between different rows. As we will see, this formulation is convenient for dealing with centering and scaling constants.
Theorem 2.1 (Basic triangular array CLT) Suppose that {ξn,i : i = 1, . . . , kn}, n ≥ 1, is a triangular array of row-wise independent random variables such that:
(i) ∑_{i=1}^{kn} Eξn,i → µ where µ ∈ R is finite.
(ii) ∑_{i=1}^{kn} Var(ξn,i) → σ² < ∞.
(iii) ∑_{i=1}^{kn} E|ξn,i|³ → 0.
Then
∑_{i=1}^{kn} ξn,i →d N(µ, σ²).

Proof. Fix f ∈ C3(R). Application of the inequality (1) yields
|Ef(∑_{i=1}^{kn} ξn,i) − Ef(Tn)| ≤ C1 ∑_{i=1}^{kn} E|ξn,i|³ → 0
where Tn ∼ N(µn, σn²) and where µn → µ, σn² → σ² by (i) and (ii). Since this implies that Tn →d N(µ, σ²) (see Exercise 6.2), it follows that
Ef(∑_{i=1}^{kn} ξn,i) → Ef(N(µ, σ²)) = Ef(µ + σZ)
where Z ∼ N(0,1), and this yields the conclusion of the theorem in view of Proposition 2.2. □
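A quick Monte Carlo sketch of Theorem 2.1 (the distributional choices, row length, and replication count below are illustrative assumptions): centered, rescaled exponential rows satisfy (i)-(iii) with µ = 0 and σ² = 1, and the row sums look standard normal.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2)
n, reps = 2000, 5000      # row length k_n = n, number of simulated row sums

# xi_{n,i} = (E_i - 1)/sqrt(n) with E_i ~ Exponential(1):
# (i) sum of means = 0, (ii) sum of variances = 1,
# (iii) sum of E|xi|^3 = E|E_1 - 1|^3 / sqrt(n) -> 0.
rows = (rng.exponential(1.0, size=(reps, n)) - 1.0) / np.sqrt(n)
sums = rows.sum(axis=1)

print("sample mean, variance:", sums.mean().round(3), sums.var().round(3))
print("KS test against N(0,1):", kstest(sums, "norm"))
```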
The basic central limit theorem for triangular arrays, Theorem 2.1, can be extended to cover sums of independent random variables without third moment hypotheses via truncation arguments. Our next result, the classical (Lindeberg) central limit theorem for independent identically distributed random variables with finite variances, is a good example of the technique.

Theorem 2.2 (Classical CLT) Suppose that X1, X2, . . . are i.i.d. random variables with E(Xi) = 0 and E(Xi²) = 1. Then
(1/√n)(X1 + · · · + Xn) = √n(X̄n − 0) →d Z ∼ N(0,1).
In fact, for f ∈ C3(R),
|Ef(n^{1/2} X̄n) − Ef(Z)| ≤ C1 E{X1² (1 ∧ |X1|/√n)} + ‖f‖BL{2 + 2E|Z|} E{X1² 1[|X1| > √n]}
where C1 ≡ (5 + 4E|Z|³)C ≐ (11.3831 . . .)C and C ≡ sup_x |f′′′(x)|/6.
Corollary 1 (Berry-Esseen type bound) Suppose that X1, X2, . . . are i.i.d. random variables with E(Xi) = 0, E(Xi²) = 1, and E|Xi|³ < ∞. Then, for f ∈ C3(R),
|Ef(n^{1/2} X̄n) − Ef(Z)| ≤ Kf E|X1|³/√n
where Kf ≡ C1 + 2‖f‖BL(1 + E|Z|).

Proof. The argument proceeds by applying Theorem 2.1 to the truncated and rescaled variables
ξn,i = Xi 1[|Xi| ≤ √n]/√n,  i = 1, . . . , n.
We compute
(a)  µn ≡ ∑_{i=1}^n Eξn,i = n Eξn,1 = −n E{X1 1[|X1| > √n]}/√n
since E(X1) = 0, and this yields
|µn| ≤ √n E{|X1| 1[|X1| > √n]} ≤ E{|X1|² 1[|X1| > √n]} → 0
by the dominated convergence theorem. For the sum of variances we have
σn² ≡ ∑_{i=1}^n Var(ξn,i) = E{X1² 1[|X1| ≤ √n]} − n(Eξn,1)² → 1
since Eξn,1 = µn/n = o(1/n) and by using the dominated convergence theorem again. In fact, we can also conclude that
(b)  |σn² − 1| ≤ E{X1² 1[|X1| > √n]} + n(Eξn,1)² ≤ 2 E{X1² 1[|X1| > √n]}
by (a) and Jensen's inequality. Finally the sum of third moments is controlled by
∑_{i=1}^n E|ξn,i|³ ≤ (n/n^{3/2}) E{|X1|³ 1[|X1| ≤ √n]} ≤ E{X1² (1 ∧ |X1|/√n)} → 0
again by the dominated convergence theorem. In fact this argument shows that
|Ef(∑_{i=1}^n ξn,i) − Ef(Tn)| ≤ C1 E{X1² (1 ∧ |X1|/√n)}.
To conclude the proof we need to show that, for f ∈ C3(R),
Ef(n^{1/2} X̄n) − Ef(∑_{i=1}^n ξn,i) → 0.
But since C3(R) ⊂ BL(R), the bounded Lipschitz inequality |f(x) − f(y)| ≤ ‖f‖BL{1 ∧ |x − y|} yields
|Ef(n^{1/2} X̄n) − Ef(∑_{i=1}^n ξn,i)| ≤ ‖f‖BL E|(1/√n) ∑_{i=1}^n Xi − (1/√n) ∑_{i=1}^n Xi 1[|Xi| ≤ √n]|
≤ ‖f‖BL (n/√n) E{|X1| 1[|X1| > √n]}
≤ ‖f‖BL E{|X1|² 1[|X1| > √n]} → 0.
This completes the proof of the first claim of the theorem. To finish the proof of the second claim, it remains to bound Ef(Tn) − Ef(Z) = Ef(µn + σn Z) − Ef(Z) where Tn ∼ N(µn, σn²) and Z ∼ N(0,1). Again, for f ∈ C3(R) the bounded Lipschitz inequality yields
|Ef(µn + σn Z) − Ef(Z)| ≤ ‖f‖BL E|µn + (σn − 1)Z|
≤ ‖f‖BL{|µn| + |σn − 1| E|Z|}
≤ ‖f‖BL{E{|X1|² 1[|X1| > √n]} + E|Z| |σn² − 1|/(σn + 1)}
≤ ‖f‖BL{1 + 2E|Z|} E{|X1|² 1[|X1| > √n]}
by (a) and (b). Collecting the bounds yields the second conclusion of the theorem. □
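The O(1/√n) rate in Corollary 1 can be seen in simulation; in this sketch (the test function f = tanh, the centered exponential distribution of X1, and the Monte Carlo sizes are all illustrative assumptions) the error |Ef(√n X̄n) − Ef(Z)| shrinks roughly like 1/√n:

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.tanh                                       # smooth bounded test function
Ef_Z = f(rng.standard_normal(1_000_000)).mean()   # Ef(Z), Z ~ N(0,1)

reps = 10_000
for n in (10, 100, 1000):
    x = rng.exponential(1.0, size=(reps, n)) - 1.0   # X_i: mean 0, variance 1
    err = abs(f(np.sqrt(n) * x.mean(axis=1)).mean() - Ef_Z)
    print(f"n = {n:4d}: |Ef(sqrt(n) Xbar_n) - Ef(Z)| ≈ {err:.4f}"
          f"   (1/sqrt(n) = {1/np.sqrt(n):.4f})")
```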
To prove the direct half of the classical Lindeberg-Feller central limit theorem, we will use the following lemma.

Lemma 2.1 Suppose that ∆n(ε) → 0 as n → ∞ for each fixed ε > 0. Then there exists a sequence εn → 0 such that ∆n(εn) → 0.

Proof. For each positive integer k there is an integer nk such that |∆n(1/k)| < 1/k for n ≥ nk. We may assume, without loss of generality, that n1 < n2 < · · ·. Set εn ≡ 1/2 if n < n1, and εn ≡ 1/k if nk ≤ n < nk+1. Then for n ≥ n1 it follows that εn = 1/kn where kn satisfies n_{kn} ≤ n < n_{kn+1}. Note that kn → ∞ as n → ∞, and for n ≥ n1, |∆n(εn)| < 1/kn → 0 as n → ∞. □

Our next theorem gives the forward half of the Lindeberg-Feller central limit theorem.

Theorem 2.3 (Lindeberg-Feller) Suppose that {Xn,i : 1 ≤ i ≤ n; n ∈ N} is a triangular array of (row-wise independent) random variables with E(Xn,i) = 0 for all i and n ∈ N and ∑_{i=1}^n E(Xn,i²) = 1. Then the following are equivalent:
(i) ∑_{i=1}^n Xn,i →d Z ∼ N(0,1) and max_{1≤i≤n} E(Xn,i²) → 0;
(ii) Ln(ε) ≡ ∑_{i=1}^n E{Xn,i² 1[|Xn,i| > ε]} → 0 for each ε > 0.
Proof. Here we show that the Lindeberg condition (ii) implies (i). By (ii) it follows that ∆n(ε) ≡ Ln(ε)/ε² → 0 for each ε > 0. By Lemma 2.1 we can find εn → 0 slowly enough that ∆n(εn) → 0. Now we truncate the Xn,i's at εn: define a new triangular array {ξn,i} by ξn,i ≡ Xn,i 1[|Xn,i| ≤ εn]. Note that
P(ξn,i ≠ Xn,i for some i) ≤ ∑_{i=1}^n P(|Xn,i| > εn) ≤ Ln(εn)/εn² → 0.
Thus it suffices to show that ∑_{i=1}^n ξn,i →d Z. To do this we use Theorem 2.1. Since the Xn,i have mean zero,
|∑_{i=1}^n E(ξn,i)| = |−∑_{i=1}^n E{Xn,i 1[|Xn,i| > εn]}| ≤ Ln(εn)/εn = εn Ln(εn)/εn² → 0.
Furthermore,
∑_{i=1}^n Var(ξn,i) = ∑_{i=1}^n E{Xn,i² 1[|Xn,i| ≤ εn]} − ∑_{i=1}^n (−E{Xn,i 1[|Xn,i| > εn]})²
= ∑_{i=1}^n E(Xn,i²) − Ln(εn) − o(1) = 1 − Ln(εn) − o(1) → 1.
For the third moments we compute
∑_{i=1}^n E|ξn,i|³ ≤ εn ∑_{i=1}^n E(Xn,i²) → 0.
Thus the hypotheses of Theorem 2.1 hold and we conclude that ∑_{i=1}^n ξn,i →d Z. To complete the proof that (ii) implies (i) we need to show that max_{1≤i≤n} E(Xn,i²) → 0. But
E(Xn,i²) = E(Xn,i² 1[|Xn,i| ≤ εn]) + E(Xn,i² 1[|Xn,i| > εn]) ≤ εn² + Ln(εn),
and hence
max_{1≤i≤n} E(Xn,i²) ≤ εn² + Ln(εn) → 0.
We will prove that (i) implies (ii) in Chapter 10 (PfS, lecture notes version; Chapter 13, PfS (2000)). □
A Converse CLT
Proposition 2.5 (Converse CLT) Suppose that X1, . . . , Xn are i.i.d., and let Sn ≡ n^{−1/2} ∑_{i=1}^n Xi. If Sn = Op(1), then E(X1²) < ∞ and E(X1) = 0.
Our proof of Proposition 2.5 will rely on the following three lemmas.
Lemma 2.2 (Symmetrization) For independent rv's X1, . . . , Xn and ε1, . . . , εn i.i.d. Rademacher rv's independent of the Xi's,
(2)  P(|n^{−1/2} ∑_{i=1}^n εi Xi| > 2t) ≤ 2 sup_n P(|n^{−1/2} ∑_{i=1}^n Xi| > t).
Proof. By conditioning on the Rademachers we see that
P(n^{−1/2}|∑_{i=1}^n εi Xi| > 2t) ≤ P(n^{−1/2}|∑_{i: εi=1} εi Xi| + n^{−1/2}|∑_{i: εi=−1} εi Xi| > 2t)
≤ Eε PX(n^{−1/2}|∑_{i: εi=1} Xi| > t) + Eε PX(n^{−1/2}|∑_{i: εi=−1} Xi| > t)
≤ 2 sup_{k≤n} P(n^{−1/2}|∑_{i=1}^k Xi| > t)
≤ 2 sup_{k≤n} P(k^{−1/2}|∑_{i=1}^k Xi| > t)
≤ 2 sup_{1≤k<∞} P(k^{−1/2}|∑_{i=1}^k Xi| > t);
i.e. (2) holds. □
Lemma 2.3 (Khinchine's inequalities) There exist constants Ap, Bp such that, for a = (a1, . . . , an) ∈ Rn and p ≥ 1,
Ap {∑_{i=1}^n ai²}^{p/2} ≤ E|∑_{i=1}^n ai εi|^p ≤ Bp {∑_{i=1}^n ai²}^{p/2}.
Recall that we proved this for p = 1 and found that A1 = 1/√3 and B1 = 1 work.
Lemma 2.4 (Paley-Zygmund inequality) Suppose that Y is a non-negative random variable with mean EY and second moment E(Y²) = ‖Y‖2². Then
(3)  P(Y > t) ≥ ((EY − t)⁺/‖Y‖2)².

Proof. We have
E(Y) = E(Y 1[Y ≤ t]) + E(Y 1[Y > t]) ≤ t + √(E(Y²) P(Y > t))
by the Cauchy-Schwarz inequality. Rearranging this inequality yields (3). □
Proof. (Proposition 2.5) The following proof is from Giné and Zinn (1994). Lemma 2.2 yields
sup_n P(|n^{−1/2} ∑_{i=1}^n εi Xi| > 2t) ≤ 2 sup_n P(|n^{−1/2} ∑_{i=1}^n Xi| > t).
Thus tightness of {Sn} implies that {n^{−1/2} ∑_{i=1}^n εi Xi} is tight. By Khinchine's inequality (Lemma 2.3), regarding the Xi's as fixed (conditioning on the Xi's), we find that
Eε|n^{−1/2} ∑_{i=1}^n εi Xi| ≥ A1 (n^{−1} ∑_{i=1}^n Xi²)^{1/2} ≡ c[Sn],
where c ≡ A1 and [Sn] ≡ (n^{−1} ∑_{i=1}^n Xi²)^{1/2}.
Thus by the Paley-Zygmund inequality (Lemma 2.4) applied with Y = |n^{−1/2} ∑_{i=1}^n εi Xi| and the Xi's held fixed (conditioning on the Xi's), and noting that Eε(Y²) = [Sn]²,
Pε(|n^{−1/2} ∑_{i=1}^n εi Xi| > t) ≥ ((Eε Y − t)⁺/(Eε(Y²))^{1/2})² ≥ ((c[Sn] − t)⁺/[Sn])²
= c²((1 − t/(c[Sn]))⁺)² ≥ (c²/4) 1[[Sn] > 2t/c].
Taking expectations across this inequality with respect to the Xi's yields
P(|n^{−1/2} ∑_{i=1}^n εi Xi| > t) ≥ (c²/4) P([Sn] > 2t/c).
It follows that the sequence {[Sn]} is tight. Now for fixed M ∈ (0, ∞)
(1/n) ∑_{i=1}^n Xi² 1[Xi² ≤ M] →a.s. E(X1² 1[X1² ≤ M]) as n → ∞.
Thus in particular this convergence holds in probability and in distribution. Therefore, by the portmanteau theorem (Theorem 1.1(vii), applied to the open set (t, ∞)),
1[E(X1² 1[X1² ≤ M]) > t] ≤ lim inf_{n→∞} P((1/n) ∑_{i=1}^n Xi² 1[Xi² ≤ M] > t) ≤ sup_n P((1/n) ∑_{i=1}^n Xi² 1[Xi² ≤ M] > t),
so it follows that
sup_{M>0} 1[E(X1² 1[X1² ≤ M]) > t] ≤ sup_{M>0} sup_n P((1/n) ∑_{i=1}^n Xi² 1[Xi² ≤ M] > t)
≤ sup_n P((1/n) ∑_{i=1}^n Xi² > t) = sup_n P([Sn]² > t).
By the tightness of {[Sn]}, we can make the right side of the last display as small as we please; in particular there exists a number t0 < ∞ such that the right side is less than 1/2. But this implies that for this t0 the indicator on the left side of the inequality must be zero, uniformly in M; i.e.
sup_{M>0} E(X1² 1[X1² ≤ M]) ≤ t0.
But the last supremum is just E(X1²), and hence we have E(X1²) ≤ t0 < ∞.
To complete the proof, note that E(X1²) < ∞ implies that E|X1| < ∞, and hence by the strong law of large numbers we have
n^{−1} ∑_{i=1}^n Xi →a.s. E(X1).
But the hypothesis n^{−1/2} ∑_{i=1}^n Xi = Op(1) implies that
n^{−1} ∑_{i=1}^n Xi →p 0.
Combining these two displays yields E(X1) = 0. □
Giné and Zinn (1994) use similar methods to establish the corresponding theorem for U-statistics.

Theorem (Giné and Zinn, 1994). If the sequence {n^{m/2} Un(h)}_{n≥1} is tight (stochastically bounded), then Eh²(X1, . . . , Xm) < ∞ and Eh(X1, x2, . . . , xm) = 0 for almost every (x2, . . . , xm) ∈ X^{m−1}.

Reference: Giné, E. and Zinn, J. (1994). A remark on convergence in distribution of U-statistics. Ann. Probability 22, 117-125.
Weak convergence in Rk
The next step is to extend the results for M = R to M = Rk. We first state a set of equivalences
for →d in Rk.
Proposition 2.6 Suppose that {X, Xn} are random vectors with values in Rk, and let Fn(x) ≡ P(Xn ≤ x) and F(x) ≡ P(X ≤ x) for x ∈ Rk. Then the following are equivalent:
(i) Fn(x) → F(x) for all x ∈ CF ≡ {y ∈ Rk : F is continuous at y}.
(ii) Xn →d X; i.e. Ef(Xn) → Ef(X) for all f ∈ Cb(Rk).
(iii) Ef(Xn) → Ef(X) for all f ∈ C∞(Rk).
(iv) E exp(it′Xn) → E exp(it′X) for all t ∈ Rk.
In Proposition 2.6 the equivalence of (ii) and (iii) depends on the equivalence of (i) and (iii) in
Theorem 1.1 and then a generalization of Proposition 2.1 to Rk; see Exercise 6.6.
The replacement techniques of Lindeberg can be extended in a straightforward way to random
vectors; see Exercises 6.7 and 6.7 for the start of this. One concrete result in this direction is the
following central limit theorem for sums of independent random vectors.
Theorem 2.4 (Classical multivariate CLT) Suppose that X1, . . . , Xn are i.i.d. random vectors in Rk with E(X1) = µ and E(|X1|²) < ∞. Then
n^{−1/2}(X1 + · · · + Xn − nµ) = √n(X̄n − µ) →d Y ∼ Nk(0, Σ)
where Σ ≡ E{(X1 − µ)(X1 − µ)′} = (Cov(X1j, X1j′))_{j,j′=1}^k.

On the other hand, the usual approach to deriving limit theorems of this type is via the result of Cramér and Wold (1936) characterizing convergence in distribution of random vectors in terms of the convergence of linear combinations in R.

Proposition 2.7 (Cramér-Wold device) Let Xn, X be random vectors in Rk. Then Xn →d X in Rk if and only if a′Xn →d a′X in R for each a ∈ Rk.

Proof. Suppose that Xn →d X in Rk and let a ∈ Rk. Then g(x) = a′x is a continuous function on Rk, and hence by the continuous mapping theorem a′Xn = g(Xn) →d g(X) = a′X.
To prove the reverse implication we use part (iv) of Proposition 2.6. Suppose that a′Xn →d a′X for every a ∈ Rk. Then by part (v) of Proposition 2.2 it follows that
E exp(it(a′Xn)) → E exp(it(a′X))
for all t ∈ R, and this holds for every a ∈ Rk. In particular, when t = 1 we have
φ_{Xn}(a) = E exp(ia′Xn) → E exp(ia′X) = φ_X(a)
for every a ∈ Rk. But then by (iv) of Proposition 2.6 this implies that Xn →d X in Rk. □
Walther (1997) gives a proof of the result of Cramér and Wold without use of characteristic
functions, and notes that related results were established by Radon (1917).
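To see the device in action numerically (a sketch; the bivariate distribution, the tested directions a, and the sample sizes are illustrative assumptions), one can check that each projection a′Sn of a normalized sum of i.i.d. random vectors is approximately N(0, a′Σa):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(4)
n, reps = 500, 4000
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
L = np.linalg.cholesky(Sigma)

# X_i = L e_i with e_i having i.i.d. centered exponential coordinates,
# so E X_1 = 0 and Cov(X_1) = Sigma; S_n = n^{-1/2} (X_1 + ... + X_n).
e = rng.exponential(1.0, size=(reps, n, 2)) - 1.0
S = (e @ L.T).sum(axis=1) / np.sqrt(n)

for a in ([1.0, 0.0], [1.0, -1.0], [2.0, 3.0]):
    a = np.array(a)
    sd = np.sqrt(a @ Sigma @ a)   # a'Y ~ N(0, a' Sigma a) for Y ~ N_2(0, Sigma)
    pval = kstest(S @ a / sd, "norm").pvalue
    print(f"a = {a}: KS p-value for a'S_n / sd vs N(0,1): {pval:.3f}")
```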
3 Tightness and subsequences
It is often useful to argue along subsequences in proofs involving convergence in distribution.
The following basic proposition gives a starting point for our discussion:
Proposition 3.1 If Pn and P are distributions (probability measures) on (M,M) such that for every subsequence {Pn′} with {n′} ⊂ N there is a further subsequence {Pn′′} such that Pn′′ →d P, then Pn →d P.

Proof. Suppose not. Then for some f ∈ Cb(M) we have Pnf ↛ Pf. Thus for some ε > 0 and some subsequence {n′} it follows that |Pn′f − Pf| > ε for all n′ ∈ {n′}. But then there is no further subsequence {n′′} for which Pn′′f → Pf, contradicting the hypothesis. □
To be able to extract convergent subsequences in general requires some appropriate notion of
compactness. Here the right idea is to rule out “escape of mass”. On the real line this “escape”
is possible only toward ±∞, but in more complicated spaces it can happen in many ways. The
following definitions are aimed at ruling out the “escape of mass” in quite general settings.
Definition 3.1 (Tightness) A probability measure P on M is said to be tight if for each ε > 0 there exists a compact set K = Kε such that P(Kε) > 1 − ε.
The basic result concerning tightness of individual measures P is due to Ulam.
Theorem 3.1 (Ulam’s theorem) If M is separable and complete, then each P on (M,M) is
tight.
Proof. Let ε > 0. By the separability of M, for each m ≥ 1 there is a sequence Am1, Am2, . . . of open spheres of radius 1/m covering M. Choose im so that P(∪_{i≤im} Ami) > 1 − ε/2^m. Now the set B ≡ ∩_{m=1}^∞ ∪_{i≤im} Ami is totally bounded in M: for each δ > 0 it has a finite δ-net (i.e. a set of points {xk} with d(x,xk) < δ for some xk for each x ∈ B). By completeness of M, the closure B̄ is complete, and B̄ ≡ K is compact. Since
P(K^c) = P(B̄^c) ≤ P(B^c) ≤ ∑_{m=1}^∞ P{(∪_{i≤im} Ami)^c} < ∑_{m=1}^∞ ε/2^m = ε,
the conclusion follows. □
Definition 3.2 (Uniform tightness) If P is a set of probability measures on a metric space (M,d), then P is called uniformly tight if and only if for every ε > 0 there is a compact set K ⊂ M such that P(K) > 1 − ε for all P ∈ P.
In the case of a sequence of measures {Pn} it is convenient to relax the requirement in Definition
3.2 slightly.
Definition 3.3 (Asymptotic tightness (of a sequence)) If {Pn} is a sequence of probability measures on (M,d), then {Pn} is called asymptotically tight if and only if for every ε > 0 there is a compact set K = Kε such that lim sup_n Pn(G^c) < ε for every open set G containing Kε.
The main result for an asymptotically tight sequence is the following theorem due to Prohorov
(1956) and Le Cam (1957).
Theorem 3.2 (Prohorov, 1956; Le Cam, 1957) Suppose that {Pn} on (M,M) is asymptot-
ically tight. Then there exists a subsequence {Pn′} that satisfies Pn′ →d (some) P where P is
tight.
Pollard (2001) relaxes the definition of uniform tightness for a sequence still further, and proves
the same result for arbitrary metric spaces.
The proof of the Prohorov-LeCam Theorem 3.2 depends on the following auxiliary results. The first of these gives a correspondence between tight measures and tight linear functionals.

Theorem 3.3 (Correspondence theorem) A linear functional T : BL(M)⁺ → R⁺ with T1 = 1 defines a tight probability measure if and only if it is functionally tight: i.e. for each ε > 0 there exists a compact set Kε such that T(l) < ε for every l ∈ BL(M)⁺ for which l ≤ 1_{Kε^c}.

Up to inconsequential constant multiples, asymptotic tightness is equivalent to: for each ε > 0 there exists Kε such that
lim sup_{n→∞} Pn l < 2ε for every l ∈ BL(M)⁺ with 0 ≤ l ≤ 1_{Kε^c}.
To see that asymptotic tightness implies this, note that for such a function l the set Gε ≡ {l < ε} is open and Gε ⊃ Kε. Then
Pn(l) ≤ ε + Pn(Gε^c) < 2ε
eventually.
The second analytic result we will use is:
Proposition 3.2 (Continuous partition of unity) For each δ > 0, ε > 0, and each compact set K, there exists a finite collection G = {g0, g1, . . . , gk} ⊂ BL(M)⁺ such that:
(i) g0(x) + g1(x) + · · · + gk(x) = 1 for each x ∈ M;
(ii) diam[gi > 0] ≤ δ for i ≥ 1, where diam(A) ≡ sup{d(x,y) : x,y ∈ A};
(iii) g0 < ε on K.

Proof. Let x1, . . . , xk be the centers of open balls of radius δ/4 whose union covers K. Define the functions f0 ≡ ε/2 and fi(x) ≡ (1 − 2d(x,xi)/δ)⁺ for i ≥ 1, so that fj ∈ BL(M)⁺ for j = 0, . . . , k. Also note that fi(x) = 0 if d(x,xi) > δ/2; thus the set {fi > 0} has diameter at most δ for i ≥ 1. The function F(x) ≡ ∑_{j=0}^k fj(x) is everywhere greater than ε/2 and is in BL(M)⁺. The non-negative functions gi ≡ fi/F are bounded by 1 and satisfy a Lipschitz condition:
|gi(x) − gi(y)| = |F(y)fi(x) − F(x)fi(y)|/(F(x)F(y))
≤ |fi(x) − fi(y)|/F(x) + |F(y) − F(x)| fi(y)/(F(x)F(y))
≤ ‖fi‖BL d(x,y)/(ε/2) + ‖F‖BL d(x,y)/(ε/2).
For each x ∈ K there is an i for which d(x,xi) < δ/4. For this i, fi(x) > 1/2 and g0(x) ≤ f0(x)/fi(x) < (ε/2)/(1/2) = ε. Thus the functions gi satisfy (i)-(iii). □
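Here is a numerical sketch of the construction on (R, |·|) (the choices K = [0,1], δ = 1/2, and ε = 1/10 are illustrative assumptions): the gj sum to one everywhere, the supports of g1, . . . , gk have diameter at most δ, and g0 < ε on K.

```python
import numpy as np

delta, eps = 0.5, 0.1
centers = np.arange(0.0, 1.0 + 1e-9, delta / 4)     # delta/4-net of K = [0,1]

x = np.linspace(-1.0, 2.0, 1201)
f0 = np.full_like(x, eps / 2)
fi = [np.maximum(0.0, 1.0 - 2.0 * np.abs(x - c) / delta) for c in centers]
F = f0 + sum(fi)
g = [fj / F for fj in [f0] + fi]                    # g_j = f_j / F

in_K = (x >= 0.0) & (x <= 1.0)
print("sum of g_j is identically 1:", bool(np.allclose(sum(g), 1.0)))
print("max of g_0 on K:", float(g[0][in_K].max()), "(< eps =", eps, ")")
diams = [float(np.ptp(x[fj > 0])) for fj in fi]
print("max support diameter among g_i, i >= 1:", max(diams), "(<= delta =", delta, ")")
```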
Proof. (Prohorov-LeCam theorem). Write Ki for the compact set corresponding to ε = 1/i, i ≥ 1. Write Gi for the finite collection of functions in BL(M)⁺ constructed in Proposition 3.2 with δ = ε = 1/i and K = Ki. The collection G ≡ ∪_{i∈N} Gi is countable.
For each g ∈ G the sequence of real numbers Png is bounded, so it has a convergent subsequence. Via the Cantor diagonalization argument we can construct a single subsequence N1 ⊂ N for which lim_{n′∈N1} Pn′g exists for every g ∈ G. The approximation properties of G will allow us to show that T(l) ≡ lim_{n′∈N1} Pn′l exists for every l ∈ BL(M)⁺. Without loss of generality, suppose that ‖l‖BL ≤ 1. Given ε > 0, choose an i > 1/ε, and write Gi = {g0, g1, . . . , gk} for the finite collection guaranteed by Proposition 3.2. The open set Gi ≡ {g0 < ε} contains Ki, which implies that lim sup_{n→∞} Pn(Gi^c) < ε. For each 1 ≤ j ≤ k = k(i), let xj be any point at which gj(xj) > 0. If x is any other point with gj(x) > 0, then
|l(x) − l(xj)| ≤ d(x,xj) ≤ ε.
It follows that for every x ∈ M
|l(x) − ∑_{j=1}^k l(xj)gj(x)| ≤ l(x)g0(x) + ∑_{j=1}^k |l(x) − l(xj)|gj(x) ≤ (ε + 1_{Gi^c}(x)) + ε,
and this integrates to give
|Pn l − ∑_{j=1}^k l(xj)Pn(gj)| ≤ Pn(Gi^c) + 2ε.
Since lim_{n′∈N1} Pn′gj exists for each j, it follows that
lim sup_{n′∈N1} Pn′l − lim inf_{n′∈N1} Pn′l ≤ 6ε.
This shows that T(l) ≡ lim_{n′∈N1} Pn′l exists for each l ∈ BL(M)⁺.
Note that T(1) = 1 easily, and T inherits functional tightness from asymptotic tightness of {Pn}. From the correspondence Theorem 3.3 the functionally tight linear functional T corresponds to a tight probability measure P to which {Pn′ : n′ ∈ N1} converges weakly. □
Definition 3.4 (Relative compactness) Let P be a set of probability measures on (M,M).
We say that P is relatively compact if every sequence {Pn} ⊂ P contains a weakly convergent
subsequence. Thus every {Pn} ⊂ P contains a subsequence {Pn′} with Pn′ →d some Q (not
necessarily in P).
Proposition 3.3 Let (M,d) be a separable metric space.
(i) (Le Cam). If Pn →d P , then {Pn} is uniformly tight.
(ii) If Pn →d P , then {Pn} is relatively compact.
(iii) If {Pn} is relatively compact and the set of limit points is just the single point P , then
Pn →d P .
Theorem 3.4 (Prohorov’s theorem) Let P be a collection of probability measures on (M,M).
(i) If P is uniformly tight, then it is relatively compact.
(ii) Suppose that (M,d) is separable and complete. If P is relatively compact it is uniformly tight.
4 Metrizing weak convergence
The Lévy metric on distribution functions defined in Proposition 2.3 extends in a nice way to give a metric for →d more generally. For any set B ∈ M and ε > 0 define
B^ε ≡ {y ∈ M : d(x,y) < ε for some x ∈ B}.

Definition 4.1 (Prohorov metric) For P, Q two probability measures on (M,M), the Prohorov distance ρ(P,Q) between P and Q is defined by
ρ(P,Q) ≡ inf{ε > 0 : P(B) ≤ Q(B^ε) + ε for all B ∈ M}.
Another very useful metric on P is defined in terms of the bounded Lipschitz functions BL(M)
defined in Section 1.
Definition 4.2 (Bounded Lipschitz metric) For P, Q two probability measures on (M,M), the bounded Lipschitz distance β(P,Q) between P and Q is defined by
β(P,Q) ≡ sup{|∫ f dP − ∫ f dQ| : ‖f‖BL ≤ 1}.
Proposition 4.1 Both ρ and β are metrics on P ≡ {all probability measures on (M,M)}.

Proof. See Exercise 6.10. □
The following theorem says that both ρ and β metrize →d just as the Lévy metric metrized
convergence of distribution functions on R.
Theorem 4.1 For any separable metric space (M,d) and Borel probability measures {Pn}, P on (M,M) the following are equivalent:
(i) Pn →d P.
(ii) ∫ f dPn → ∫ f dP for all f ∈ BL(M).
(iii) β(Pn,P) → 0.
(iv) ρ(Pn,P) → 0.
Proof. We prove the result under the additional assumption that M is complete. The equivalence of (i) and (ii) has been proved in Theorem 1.1. Now we show that (ii) implies (iii): by Ulam's Theorem 3.1, for any ε > 0 we can choose K compact so that P(K) > 1 − ε. The set of functions E ≡ {f ∈ BL(M) : ‖f‖BL ≤ 1}, restricted to K, is compact for ‖·‖∞ (by the Arzelà-Ascoli theorem; see e.g. Billingsley (1968), page 221). Thus for some finite k there are f1, . . . , fk ∈ BL(M) such that for any f ∈ E there is an fj with sup_{x∈K} |f(x) − fj(x)| ≤ ε. Then, since f, fj ∈ BL(M),
sup_{x∈K^ε} |f(x) − fj(x)| ≤ 3ε.
Let g(x) ≡ max{0, 1 − d(x,K)/ε}; then g ∈ BL(M) and 1_K ≤ g ≤ 1_{K^ε}. For n sufficiently large we have
Pn(K^ε) ≥ ∫ g dPn > 1 − 2ε,
and hence for any f ∈ E, with n large enough (depending only on ε and the fixed finite set f1, . . . , fk),
|∫ f dPn − ∫ f dP| ≤ |∫ (f − fj) dPn| + |∫ (f − fj) dP| + |∫ fj d(Pn − P)|
≤ (3ε + 2·2ε) + (3ε + 2ε) + ε = 13ε,
where we used |f − fj| ≤ 2, the bound sup_{K^ε}|f − fj| ≤ 3ε, Pn((K^ε)^c) < 2ε, and P(K^c) < ε. Hence (iii) holds.
Now we show that (iii) implies (iv): given a Borel set B and ε > 0, let fε(x) ≡ max{0, 1 − d(x,B)/ε}. Then fε ∈ BL(M), ‖fε‖BL ≤ 2 ∨ ε^{−1}, and 1_B ≤ fε ≤ 1_{B^ε}. Therefore, for any P and Q on M we have
Q(B) ≤ ∫ fε dQ ≤ ∫ fε dP + (2 ∨ ε^{−1})β(P,Q) ≤ P(B^ε) + (2 ∨ ε^{−1})β(P,Q),
and it follows that
ρ(P,Q) ≤ max{ε, (2 ∨ ε^{−1})β(P,Q)}.
Hence if β(P,Q) ≤ ε² with ε ≤ 1, then
ρ(P,Q) ≤ max{ε, (2 ∨ ε^{−1})ε²} ≤ max{ε, 2ε²} ≤ ε(1 + 2ε) ≤ 3ε.
Hence for all P, Q we have ρ(P,Q) ≤ 3√β(P,Q). Thus (iii) implies (iv). [It can also be shown that 2^{−1}β(P,Q) ≤ ρ(P,Q); see e.g. Dudley (1976), RAP, Corollary 11.6.5, page 411.]
Finally we show that (iv) implies (i): Suppose that (iv) holds, let B be a P-continuity set, and let ε > 0. Since P(∂B) = 0, for 0 < δ < ε small enough, P(B^δ \ B) < ε and P((B^c)^δ \ B^c) < ε. For n large enough that ρ(Pn,P) < δ,
Pn(B) ≤ P(B^δ) + δ ≤ P(B) + 2ε
and
Pn(B^c) ≤ P((B^c)^δ) + δ ≤ P(B^c) + 2ε;
combining these yields
|Pn(B) − P(B)| ≤ 2ε
and hence Pn(B) → P(B). By the portmanteau theorem (Theorem 1.1) this yields (i). □
More Metrics on P
There are other useful metrics on P that metrize topologies other than weak convergence. It is
frequently useful to relate these to the Prohorov and bounded Lipschitz metrics ρ and β we have
introduced earlier in this section.
Definition 4.3 For probability measures P,Q on (M,M), the total variation distance from P to
Q is defined by
dTV (P,Q) ≡ sup{|P(A) −Q(A)| : A ∈M} .
Proposition 4.2 The total variation distance dTV(P,Q) is given by
dTV(P,Q) = (1/2) ∫ |p − q| dµ = 1 − ∫ (p ∧ q) dµ
where p = dP/dµ, q = dQ/dµ, and µ is any measure dominating both P and Q (e.g. P + Q).

Proof. See Exercise 6.11. □
Definition 4.4 The Hellinger distance H(P,Q) is defined by
H²(P,Q) ≡ (1/2) ∫ (√p − √q)² dµ = 1 − ∫ √(pq) dµ,
where p = dP/dµ, q = dQ/dµ, and µ is any measure dominating both P and Q.
It is not hard to show (see Exercise 6.12) that H(P,Q) does not depend on the choice of the
dominating measure µ.
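Both integral formulas are easy to evaluate numerically; this sketch (the two normal densities and the integration grid are illustrative assumptions, with µ taken to be Lebesgue measure) computes dTV and H and checks the inequalities of Theorem 4.2(ii) below:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-12.0, 12.0, 100_001)     # grid for mu = Lebesgue measure on R
p = norm(0.0, 1.0).pdf(x)
q = norm(1.0, 1.5).pdf(x)

d_tv = 0.5 * np.trapz(np.abs(p - q), x)                  # (1/2) ∫ |p - q| dmu
H2 = 0.5 * np.trapz((np.sqrt(p) - np.sqrt(q)) ** 2, x)   # H^2 = 1 - ∫ sqrt(pq) dmu
H = np.sqrt(H2)

print(f"dTV = {d_tv:.4f}, H = {H:.4f}")
print("H^2 <= dTV <= H sqrt(2 - H^2):", bool(H2 <= d_tv <= H * np.sqrt(2.0 - H2)))
```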
Here is a theorem relating these metrics to each other and to the Prohorov and bounded Lipschitz
metrics.
Theorem 4.2 For P, Q probability measures on (M,M) with (M,d) separable, the following inequalities hold:
(i) 2^{−1}β(P,Q) ≤ ρ(P,Q) ≤ 3√β(P,Q).
(ii) H²(P,Q) ≤ dTV(P,Q) ≤ H(P,Q){2 − H²(P,Q)}^{1/2}.
(iii) ρ(P,Q) ≤ dTV(P,Q).
For distribution functions F, G on R (or on Rk) we have:
(iv) λ(F,G) ≤ ρ(F,G) ≤ dTV(F,G).
(v) λ(F,G) ≤ dK(F,G) ≤ dTV(F,G),
where dK(F,G) ≡ ‖F − G‖∞ ≡ sup_x |F(x) − G(x)|.

Proof. The right side of (i) was proved in the course of the proof of Theorem 4.1. For the left side, see Dudley (1976), section 18.6. We leave the remaining inequalities as exercises. (For (ii), note that √(pq) ≥ p ∧ q gives the left inequality, while the Cauchy-Schwarz inequality applied to (1/2)∫|√p − √q||√p + √q| dµ gives the right.) □
Wasserstein metrics
These metrics, often denoted by Wp(P,Q) or Wr(P,Q), are also called Kantorovich distances, Monge-Kantorovich distances, the Mallows metric, or Wasserstein transport distances. To define these metrics, suppose that (M,d) is a separable metric space, and let P(M,d) be the collection of all Borel probability measures on (M,d). For r ≥ 1 let Pr(M,d) ≡ Pr(M) be the collection of all probability measures P ∈ P(M,d) such that
∫_M d(x,x0)^r dP(x) < ∞ for some x0 ∈ M
(and then equivalently for all x0 ∈ M). For P, Q ∈ Pr(M,d) define Wr(P,Q) by
Wr^r(P,Q) ≡ inf{∫_M ∫_M d(x,y)^r dπ(x,y) : π on (M×M, M×M) has marginals P and Q}.
Thus π(A×M) = P(A) for all A ∈ M and π(M×B) = Q(B) for all B ∈ M. The most important values of r are 1, 2, and ∞.
Here are statements of several important results concerning the Wasserstein metrics. The first
result shows that W1 is closely related to the bounded Lipschitz metric β.
Theorem 4.3 (Kantorovich duality theorem) If (M,d) is separable, then for all P, Q ∈ P1(M)
W1(P,Q) = sup_{f: ‖f‖L ≤ 1} |∫_M f dP − ∫_M f dQ|
where ‖f‖L ≡ sup_{x≠y} |f(x) − f(y)|/d(x,y).
The second result characterizes convergence in the Wr metrics.
Theorem 4.4 (Convergence in Wr) Let 1 ≤ r < ∞. Suppose that P ∈ Pr(M,d) and {Pn}_{n≥1} ⊂ Pr(M,d), where (M,d) is a complete separable metric space (or, slightly more generally, M is Polish). Then the following are equivalent:
(i) Wr(Pn,P) → 0 as n → ∞.
(ii) Pn →d P and for some x0 ∈ M
∫_M d(x,x0)^r dPn(x) → ∫_M d(x,x0)^r dP(x).
(iii) For any continuous f : M → R satisfying, for some x0 ∈ M and C < ∞,
|f(x)| ≤ C(1 + d(x,x0))^r for all x ∈ M,
we have ∫_M f dPn → ∫_M f dP.
Thus Wr(Pn,P) → 0 is stronger than Pn →d P.
The next result relates W∞(P,Q) to ρ(P,Q).

Theorem 4.5 (Wr metrics dominate ρ) For all P, Q ∈ Pr(M,d) and r ≥ 1,
ρ(P,Q) ≤ Wr(P,Q)^{r/(r+1)}.
In particular,
ρ(P,Q) ≤ W∞(P,Q) ≡ lim_{r→∞} Wr(P,Q) = sup_{r≥1} Wr(P,Q) = inf_π ‖d‖_{L∞(π)} = inf_π ess sup_π d(x,y).
Another way to state this connection is in terms of the Ky Fan metric α for convergence in probability.

Definition 4.5 For a separable metric space (M,d), a probability space (Ω,A,P), and X, Y ∈ L0(Ω,M), let
α(X,Y) ≡ inf{ε > 0 : P(d(X,Y) > ε) < ε}.
This is called the Ky Fan metric for convergence in probability; see Dudley, RAP, section 9.2.
Theorem 4.6 (α metrizes convergence in probability) On L0(Ω,M), α is a metric for convergence
in probability. That is, α(Xn,X) → 0 if and only if Xn → X in probability (or equivalently,
d(Xn,X) →p 0).
Proposition 4.3 (Connecting ρ and α) For any separable metric space, laws P and Q on M, and ε > 0 (or ε = 0 if P and Q are tight), there is a probability space (Ω,A,µ) and random variables X, Y on Ω with
α(X,Y) ≤ ρ(P,Q) + ε where X ∼ P, Y ∼ Q.
Thus
ρ(P,Q) = inf{α(X,Y) : X ∼ P, Y ∼ Q}.
When the metric space (M,d) is (R, |·|), the Wr metrics take a special form:

Theorem 4.7 (Prohorov) Let P, Q be probability measures in Pr(R), r ≥ 1, with respective distribution functions F and G. Then
Wr^r(P,Q) = ∫_0^1 |F^{−1}(t) − G^{−1}(t)|^r dt.
In particular,
W1(P,Q) = ∫_0^1 |F^{−1} − G^{−1}| dt = ∫_{−∞}^∞ |F(x) − G(x)| dx.
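Both formulas for W1 can be checked numerically; in this sketch (the choice P = N(0,1) and Q = Exponential(1), and the grids, with t truncated away from 0 and 1 to avoid infinite quantiles, are illustrative assumptions) the quantile form and the cdf form agree:

```python
import numpy as np
from scipy.stats import norm, expon

F, G = norm(0.0, 1.0), expon()     # P = N(0,1), Q = Exponential(1)

t = np.linspace(0.0005, 0.9995, 20_001)
w1_quantile = np.trapz(np.abs(F.ppf(t) - G.ppf(t)), t)   # ∫_0^1 |F^{-1} - G^{-1}| dt

x = np.linspace(-15.0, 25.0, 200_001)
w1_cdf = np.trapz(np.abs(F.cdf(x) - G.cdf(x)), x)        # ∫ |F - G| dx

print(f"quantile form: {w1_quantile:.4f}, cdf form: {w1_cdf:.4f}")
```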
To understand the convergence properties of empirical measures with respect to the Wr metrics we first need a theorem of Varadarajan (1958).

Theorem 4.8 (Varadarajan) Suppose that (M,d) is separable and that X1, X2, . . . are i.i.d. P on (M,M), and let Pn denote the empirical measure of X1, . . . , Xn. Then Pr(Pn →d P) = 1. Equivalently, ρ(Pn,P) →a.s. 0 and β(Pn,P) →a.s. 0.

This leads naturally to the following theorem:

Theorem 4.9 (Glivenko-Cantelli theorem with respect to Wr) Suppose that (M,d) is separable and that X1, . . . , Xn are i.i.d. P ∈ Pr(M) with r ≥ 1. Then
Wr(Pn,P) →a.s. 0 as n → ∞.
Proof. By Varadarajan's theorem and the characterization of convergence in Wr given in Theorem 4.4, it suffices to show that
∫ d(x,x0)^r dPn(x) = (1/n) ∑_{i=1}^n d(Xi,x0)^r →a.s. ∫_M d(x,x0)^r dP(x) = E{d(X,x0)^r}.
But this last convergence holds by the strong law of large numbers. □
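A simulation sketch of Theorem 4.9 on (R, |·|) (the sampling distribution, sample sizes, and the large reference sample standing in for P are illustrative assumptions; scipy's wasserstein_distance computes W1 between two samples):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(5)
ref = rng.standard_normal(200_000)   # large sample standing in for P = N(0,1)

for n in (100, 1_000, 10_000, 100_000):
    x = rng.standard_normal(n)       # X_1, ..., X_n i.i.d. P; empirical measure P_n
    print(f"n = {n:6d}: W1(P_n, P) ≈ {wasserstein_distance(x, ref):.4f}")
```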
Some references:
• Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist. 9, 1196-1217.
• Bobkov, S. and Ledoux, M. (2017). One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Memoirs of the American Mathematical Society, AMS, Providence, to appear.
• Dudley, R. M. (2002). Real Analysis and Probability, 2nd edition. Cambridge University Press.
• Dümbgen, L., Samworth, R., and Schuhmacher, D. (2011). Approximation by log-concave distributions with application to regression. Ann. Statist. 39, 702-730.
• Mallows, C. (1972). A note on joint asymptotic normality. Ann. Math. Statist. 43, 508-515.
• Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Society Graduate Studies in Mathematics 58, AMS, Providence.
• Villani, C. (2009). Optimal Transport. Springer, Berlin.
5 Characterizing weak convergence in spaces of functions
Suppose that T is a set, and suppose that Xn(t), t ∈ T, are stochastic processes indexed by the set T; that is, Xn(t) : Ω → R is a measurable map for each t ∈ T and n ∈ N. Assume that the processes Xn have bounded sample functions almost surely (or have versions with bounded sample paths almost surely). Then Xn(·) ∈ ℓ∞(T) almost surely, where ℓ∞(T) is the space of all bounded real-valued functions on T. The space ℓ∞(T) with the sup norm ‖·‖T is a Banach space; it is separable only if T is finite. Hence we will not assume that the processes Xn induce tight Borel probability laws on ℓ∞(T).

Now suppose that X(t), t ∈ T, is a sample bounded process that does induce a tight Borel probability measure on ℓ∞(T). Then we say that Xn converges weakly to X (or, informally, Xn converges in law to X uniformly in t ∈ T), and write
Xn ⇒ X in ℓ∞(T)
if
E*H(Xn) → EH(X)
for all bounded continuous functions H : ℓ∞(T) → R. Here E* denotes outer expectation.
It follows immediately from the preceding definition that weak convergence is preserved by continuous functions: if g : ℓ∞(T) → D for some metric space (D,d) where g is continuous and Xn ⇒ X in ℓ∞(T), then g(Xn) ⇒ g(X) in (D,d). (The condition of continuity of g can be relaxed slightly; see e.g. Van der Vaart and Wellner (1996), Theorem 1.3.6, page 20.) While this is not a deep result, it is one of the reasons that the concept of weak convergence is important.
The following example shows why the outer expectation in the definition of ⇒ is necessary.
Example 5.1 Suppose that U is a Uniform(0,1) random variable, and let X(t) = 1{U ≤ t} = 1_{[0,t]}(U) for t ∈ T = [0,1]. If we assume the axiom of choice, then there exists a nonmeasurable subset A of [0,1]. For this subset A, define FA ≡ {1_{[0,·]}(s) : s ∈ A} ⊂ ℓ∞(T). Since FA is a discrete set for the sup norm, it is closed in ℓ∞(T). But {X ∈ FA} = {U ∈ A} is not measurable, and therefore the law of X does not extend to a Borel probability measure on ℓ∞(T).

On the other hand, the following proposition gives a description of the sample bounded processes X that do induce a tight Borel measure on ℓ∞(T).
Proposition 5.1 (de la Peña and Giné (1999), Lemma 5.1.1; van der Vaart and Wellner (1996), Lemma 1.5.9) Let X(t), t ∈ T, be a sample bounded stochastic process. Then the finite-dimensional distributions of X are those of a tight Borel probability measure on ℓ∞(T) if and only if there exists a pseudometric ρ on T for which (T,ρ) is totally bounded and such that X has a version with almost all its sample paths uniformly continuous for ρ.

Proof. Suppose that the induced probability measure of X on ℓ∞(T) is a tight Borel measure PX. Let Km, m ∈ N, be an increasing sequence of compact sets in ℓ∞(T) such that PX(∪_{m=1}^∞ Km) = 1, and let K ≡ ∪_{m=1}^∞ Km. Then we will show that the pseudometric ρ on T defined by
ρ(s,t) ≡ ∑_{m=1}^∞ 2^{−m}(1 ∧ ρm(s,t)),
where
ρm(s,t) ≡ sup{|x(s) − x(t)| : x ∈ Km},
makes (T,ρ) totally bounded. To show this, let ε > 0, choose k so that ∑_{m=k+1}^∞ 2^{−m} < ε/4, and let x1, . . . , xr be a finite subset of ∪_{m=1}^k Km = Kk that is ε/4-dense in Kk for the supremum norm; i.e. for each x ∈ ∪_{m=1}^k Km there is an integer i ≤ r such that ‖x − xi‖T ≤ ε/4. Such a finite set exists by compactness. The subset A of Rr defined by A ≡ {(x1(t), . . . , xr(t)) : t ∈ T} is bounded (note that ∪_{m=1}^k Km is compact and hence bounded). Therefore A is totally bounded and hence there exists a finite set Tε = {tj : 1 ≤ j ≤ N} such that, for each t ∈ T, there is a j ≤ N for which max_{1≤s≤r} |xs(t) − xs(tj)| ≤ ε/4. It is easily seen that Tε is ε-dense in T for the pseudometric ρ: if t and tj are as above, then for m ≤ k it follows that
ρm(t,tj) = sup_{x∈Km} |x(t) − x(tj)| ≤ max_{s≤r} |xs(t) − xs(tj)| + ε/2 ≤ 3ε/4,
and hence
ρ(t,tj) ≤ ε/4 + ∑_{m=1}^k 2^{−m} ρm(t,tj) ≤ ε.
Thus we have proved that (T,ρ) is totally bounded. Furthermore, the functions x ∈ K are uniformly ρ-continuous since, if x ∈ Km, then |x(s) − x(t)| ≤ ρm(s,t) ≤ 2^m ρ(s,t) for all s,t ∈ T with 2^m ρ(s,t) < 1. Since PX(K) = 1, the identity function of (ℓ∞(T), B, PX) yields a version of X with almost all of its sample paths in K, hence in Cu(T,ρ), the space of bounded uniformly ρ-continuous functions on T. This proves the direct half of the proposition.
Conversely, suppose that X(t), t ∈ T, is a stochastic process with a version whose sample functions are almost all in Cu(T,ρ) for a metric or pseudometric ρ on T for which (T,ρ) is totally bounded. We will continue to use X to denote the version with these properties. We can clearly assume that all the sample functions are uniformly continuous. If (Ω,A,P) is the probability space where X is defined, then the map X : Ω → Cu(T,ρ) is Borel measurable because the random vectors (X(t1), . . . , X(tk)), ti ∈ T, k ∈ N, are measurable and the Borel σ-algebra of Cu(T,ρ) is generated by the "finite-dimensional sets" {x ∈ Cu(T,ρ) : (x(t1), . . . , x(tk)) ∈ A} for all Borel sets A of Rk, ti ∈ T, k ∈ N. Therefore the induced probability law PX of X is a tight Borel measure on Cu(T,ρ) by Ulam's theorem; see e.g. Billingsley (1968), Theorem 1.4, page 10, or Dudley (1989), Theorem 7.1.4, page 176. But the inclusion of Cu(T,ρ) into ℓ∞(T) is continuous, so PX is also a tight Borel measure on ℓ∞(T). □
Exhibiting convenient metrics ρ for which total boundedness and continuity hold is more involved. It can be shown (see e.g. Hoffmann-Jørgensen (1984), (1991); Andersen (1985); Andersen and Dobric (1987)) that if any pseudometric works, then the pseudometric
ρ0(s,t) = E arctan|X(s) − X(t)|
will do the job. However, ρ0 may not be the most natural or convenient pseudometric for a particular problem. In particular, for the frequent situation in which the process X is Gaussian, the pseudometrics ρr defined by
ρr(s,t) = (E|X(s) − X(t)|^r)^{1/(r∨1)}
for 0 < r < ∞ are often more convenient, especially ρ2 in the Gaussian case; see Van der Vaart and Wellner (1996), Lemma 1.5.9, and the following discussion.
Proposition 5.1 motivates our next result which characterizes weak convergence Xn ⇒ X in
terms of asymptotic equicontinuity and convergence of finite-dimensional distributions.
Theorem 5.1 The following are equivalent:
(i) All the finite-dimensional distributions of the sample bounded processes Xn converge in law, and there exists a pseudometric ρ on T such that both:
(a) (T,ρ) is totally bounded, and
(b) the processes Xn are asymptotically equicontinuous in probability with respect to ρ; that is, for all ε > 0,
(1)  lim_{δ→0} lim sup_{n→∞} Pr*{sup_{ρ(s,t)≤δ} |Xn(s) − Xn(t)| > ε} = 0.
(ii) There exists a process X with tight Borel probability distribution on ℓ∞(T) such that
Xn ⇒ X in ℓ∞(T).
If (i) holds, then the process X in (ii) (which is completely determined by the limiting finite-dimensional distributions of {Xn}) has a version with sample paths in Cu(T,ρ), the space of all ρ-uniformly continuous real-valued functions on T. If X in (ii) has sample functions in Cu(T,γ) for some pseudometric γ for which (T,γ) is totally bounded, then (i) holds with the pseudometric ρ taken to be γ.
Proof. Suppose that (i) holds. Let T∞ be a countable ρ-dense subset of T, and let Tk, k ∈ N, be finite subsets of T satisfying Tk ↗ T∞. (Such sets exist by virtue of the hypothesis that (T,ρ) is totally bounded.) The limiting distributions of the processes Xn are consistent, and thus define a stochastic process X on T. Furthermore, by the portmanteau theorem for finite-dimensional convergence in distribution,
Pr{max_{ρ(s,t)≤δ, s,t∈Tk} |X(s) − X(t)| > ε} ≤ lim inf_{n→∞} Pr{max_{ρ(s,t)≤δ, s,t∈Tk} |Xn(s) − Xn(t)| > ε}
≤ lim inf_{n→∞} Pr*{sup_{ρ(s,t)≤δ, s,t∈T∞} |Xn(s) − Xn(t)| > ε}.
Taking the limit in the last display as k → ∞ and then using the asymptotic equicontinuity condition (1), it follows that there is a sequence δm ↘ 0 such that
Pr{sup_{ρ(s,t)≤δm, s,t∈T∞} |X(s) − X(t)| > 2^{−m}} ≤ 2^{−m}.
Hence it follows by Borel-Cantelli that there exists m(ω) < ∞ a.s. such that
sup_{ρ(s,t)≤δm, s,t∈T∞} |X(s,ω) − X(t,ω)| ≤ 2^{−m}
for all m > m(ω). Therefore X(t,ω) is a ρ-uniformly continuous function of t ∈ T∞ for almost every ω. The extension to T by uniform continuity of the restriction of X to T∞ yields a version of X with sample paths all in Cu(T,ρ); note that it suffices to consider only the set of ω's upon which X is uniformly continuous. It then follows from Proposition 5.1 that the law of X exists as a tight Borel measure on ℓ∞(T).
Our proof of convergence will be based on the following fact (see Exercise 6.16): if H : ℓ∞(T) → R is bounded and continuous, and K ⊂ ℓ∞(T) is compact, then for every ε > 0 there exists τ > 0 such that
(a)  if x ∈ K and y ∈ ℓ∞(T) with ‖x − y‖T < τ, then |H(x) − H(y)| < ε.
Now we are ready to prove the weak convergence part of (ii). Since (T,ρ) is totally bounded, for every δ > 0 there exists a finite set of points t1, . . . , t_{N(δ)} that is δ-dense in (T,ρ); i.e. T ⊂ ∪_{i=1}^{N(δ)} B(ti,δ), where B(t,δ) is the open ball with center t and radius δ. Thus, for each t ∈ T we can choose πδ(t) ∈ {t1, . . . , t_{N(δ)}} so that ρ(πδ(t),t) < δ. Then we can define processes Xn,δ, n ∈ N, and Xδ by
Xn,δ(t) ≡ Xn(πδ(t)),  Xδ(t) ≡ X(πδ(t)),  t ∈ T.
Note that Xn,δ and Xδ are approximations of the processes Xn and X respectively that can take on at most N(δ) different values. Convergence of the finite-dimensional distributions of Xn to those of X implies that
(b)  Xn,δ ⇒ Xδ in ℓ∞(T).
Furthermore, uniform continuity of the sample paths of X yields
(c)  lim_{δ→0} ‖X − Xδ‖T = 0 a.s.
Let H : ℓ∞(T) → R be bounded and continuous. Then it follows that
|E*H(Xn) − EH(X)| ≤ |E*H(Xn) − EH(Xn,δ)| + |EH(Xn,δ) − EH(Xδ)| + |EH(Xδ) − EH(X)|
≡ In,δ + IIn,δ + IIIδ.
To show the convergence part of (ii) we need to show that lim_{δ→0} lim sup_{n→∞} of each of these three terms is 0. This follows for IIn,δ by (b). Now we show that lim_{δ→0} IIIδ = 0. Given ε > 0, let K ⊂ ℓ∞(T) be a compact set such that Pr{X ∈ K^c} < ε/(6‖H‖∞), let τ > 0 be such that (a) holds for K and ε/6, and let δ1 > 0 be such that Pr{‖Xδ − X‖T ≥ τ} < ε/(6‖H‖∞) for all δ < δ1; this can be done by virtue of (c). Then it follows that
|EH(Xδ) − EH(X)| ≤ 2‖H‖∞ Pr{[X ∈ K^c] ∪ [‖Xδ − X‖T ≥ τ]} + sup{|H(x) − H(y)| : x ∈ K, ‖x − y‖T < τ}
≤ 2‖H‖∞ (ε/(6‖H‖∞) + ε/(6‖H‖∞)) + ε/6 < ε,
so that lim_{δ→0} IIIδ = 0 holds.
To show that lim_{δ→0} lim sup_{n→∞} In,δ = 0, choose ε, τ, and K as above. Then we have
(d)  |E*H(Xn) − EH(Xn,δ)| ≤ 2‖H‖∞{Pr*{‖Xn − Xn,δ‖T ≥ τ/2} + Pr{Xn,δ ∈ (K^{τ/2})^c}} + sup{|H(x) − H(y)| : x ∈ K, ‖x − y‖T < τ},
where K^{τ/2} is the open τ/2-neighborhood of the set K for the sup norm. The inequality in the previous display can be checked as follows: if Xn,δ ∈ K^{τ/2} and ‖Xn − Xn,δ‖T < τ/2, then there exists x ∈ K such that ‖x − Xn,δ‖T < τ/2 and ‖x − Xn‖T < τ. Now the asymptotic equicontinuity hypothesis implies that there is a δ2 such that
lim sup_{n→∞} Pr*{‖Xn,δ − Xn‖T ≥ τ/2} < ε/(6‖H‖∞)
for all δ < δ2, and finite-dimensional convergence yields
lim sup_{n→∞} Pr{Xn,δ ∈ (K^{τ/2})^c} ≤ Pr{Xδ ∈ (K^{τ/2})^c} ≤ ε/(6‖H‖∞).
Hence we conclude from (d) that, for δ < δ1 ∧ δ2,
lim sup_{n→∞} |E*H(Xn) − EH(Xn,δ)| < ε,
and this completes the proof that (i) implies (ii).
The converse implication is an easy consequence of the “closed set” part of the portmanteau
theorem: if Xn ⇒ X in `∞(T), then, as for usual convergence in law,
lim sup
n→∞
Pr∗{Xn ∈ F}≤ Pr{X ∈ F}
for every closed set F ⊂ `∞(T); see e.g. Van der Vaart and Wellner (1996), page 18. If (ii) holds,
then by Proposition 5.1 there is a pseudometric ρ on T which makes (T,ρ) totally bounded and
such that X has (a version with) sample paths in Cu(T,ρ). Thus for the closed set F = F_{ε,δ} defined
by
    F_{ε,δ} = {x ∈ ℓ∞(T) : sup_{ρ(s,t)≤δ} |x(s) − x(t)| ≥ ε} ,
we have
    lim sup_{n→∞} Pr∗{ sup_{ρ(s,t)≤δ} |Xn(s) − Xn(t)| ≥ ε } = lim sup_{n→∞} Pr∗{Xn ∈ F_{ε,δ}}
        ≤ Pr{X ∈ F_{ε,δ}} = Pr{ sup_{ρ(s,t)≤δ} |X(s) − X(t)| ≥ ε } .
Taking limits across the resulting inequality as δ → 0 yields the asymptotic equicontinuity in view
of the ρ-uniform continuity of the sample paths of X. Thus (ii) implies (i). □
We conclude this section by stating an obvious corollary of Theorem 5.1 for the empirical process
Gn indexed by a class F of measurable real-valued functions on the probability space (X ,A,P). Let
ρP be the pseudometric on F defined by ρ_P²(f,g) = Var_P(f(X) − g(X)) = P(f − g)² − [P(f − g)]².
Corollary 1 Let F be a class of measurable functions on (X ,A). Then the following are equivalent:
(i) F is P-Donsker: Gn ⇒ G in ℓ∞(F).
(ii) (F,ρP) is totally bounded and Gn is asymptotically equicontinuous with respect to ρP in
probability; i.e.
    lim_{δ↘0} lim sup_{n→∞} Pr∗{ sup_{f,g∈F: ρP(f,g)<δ} |Gn(f) − Gn(g)| > ε } = 0    (2)
for all ε > 0.
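To make the objects in Corollary 1 concrete, here is a small simulation sketch (ours, not from the
text). It takes the class F = {1_{[0,t]} : t in a grid} and P = Uniform(0, 1), so that Gn(f) reduces
to the uniform empirical process of Example 5.3 below; the grid and seed are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    X = rng.uniform(size=n)                    # an i.i.d. sample from P
    ts = np.linspace(0.05, 0.95, 19)           # parametrizes f_t = 1_[0,t] in F
    Pn_f = (X[:, None] <= ts).mean(axis=0)     # P_n f_t, the empirical measure of f_t
    P_f = ts                                   # P f_t for the uniform law
    Gn = np.sqrt(n) * (Pn_f - P_f)             # the empirical process G_n(f_t)
    print(Gn.round(3))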
We close this section with another equivalent formulation of the asymptotic equicontinuity
condition in terms of partitions of the set T.
A sequence {Xn} in ℓ∞(T) is said to be asymptotically tight if for every ε > 0 there exists a
compact set K ⊂ ℓ∞(T) such that
    lim inf_{n→∞} P∗(Xn ∈ K^δ) ≥ 1 − ε for every δ > 0 .
Here K^δ = {y ∈ ℓ∞(T) : d(y,K) < δ} is the “δ-enlargement” of K.
Theorem 5.2 The sequence {Xn} in ℓ∞(T) is asymptotically tight if and only if Xn(t) is asymp-
totically tight in R for every t ∈ T and, for every ε > 0, η > 0, there exists a finite partition
T = ∪_{i=1}^k Ti such that
    lim sup_n P∗( sup_{1≤i≤k} sup_{s,t∈Ti} |Xn(s) − Xn(t)| > ε ) < η .
Proof. See Van der Vaart and Wellner (1996), Theorem 1.5.6, page 36. □
Example 5.2 (Partial sum process) Suppose that X1,X2, . . . are i.i.d. random variables with
E(X1) = 0, Var(X1) = 1. The partial sum process Sn is defined by
    Sn(t) ≡ (1/√n) ∑_{i=1}^{⌊nt⌋} Xi for 0 ≤ t < ∞ .
We will consider the process {Sn(t) : 0 ≤ t ≤ 1}. Note that Sn takes values in D[0, 1] since it has
jumps of size Xi/√n at the points t = i/n, i = 1, . . . ,n. The linearly interpolated version S̄n of the
process Sn is given by S̄n(k/n) = Sn(k/n) and
    S̄n(t) = Sn(k/n) + √n (t − k/n) X_{k+1} , k/n ≤ t ≤ (k + 1)/n .
Note that S̄n takes values in C[0, 1], and that
    ‖S̄n − Sn‖∞ ≤ n^{−1/2} max_{1≤i≤n} |Xi| →a.s. 0 ,    (3)
since E(X1²) < ∞.
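A minimal simulation sketch (ours, not part of the example; the function names are hypothetical)
of Sn, its interpolated version S̄n, and the bound (3):

    import numpy as np

    def S_n(x, t):
        # Partial sum process: S_n(t) = n^{-1/2} * sum of the first floor(nt) X_i.
        n = len(x)
        return x[:int(np.floor(n * t))].sum() / np.sqrt(n)

    def S_bar_n(x, t):
        # Linear interpolation of S_n between the grid points k/n.
        n = len(x)
        k = int(np.floor(n * t))
        base = x[:k].sum() / np.sqrt(n)
        if k < n:
            base += np.sqrt(n) * (t - k / n) * x[k]   # x[k] is X_{k+1}
        return base

    rng = np.random.default_rng(2)
    x = rng.standard_normal(1000)
    # The sup distance is at most n^{-1/2} max |X_i|, as in (3):
    grid = np.linspace(0, 1, 2001)
    print(max(abs(S_bar_n(x, t) - S_n(x, t)) for t in grid))
    print(np.abs(x).max() / np.sqrt(len(x)))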
To show that the finite-dimensional distributions of S̄n converge in distribution, we will show that
the finite-dimensional distributions of Sn converge in distribution; by (3) the same will then hold
for S̄n. Let 0 < t1 < · · · < tk ≤ 1, and consider the random vectors Yn ≡ (Sn(t1), . . . ,Sn(tk)) in R^k.
Define g : R^k → R^k by g(y) = (y1, y2 − y1, y3 − y2, . . . , yk − y_{k−1}). Then
    g(Yn) = (Sn(t1), Sn(t2) − Sn(t1), . . . , Sn(tk) − Sn(t_{k−1}))
has components which are independent (by independence of the Xi’s), and
    Sn(t_j) − Sn(t_{j−1}) = (1/√n) ∑_{i=⌊nt_{j−1}⌋+1}^{⌊nt_j⌋} Xi
        = ( √(⌊nt_j⌋ − ⌊nt_{j−1}⌋) / √n ) · ( 1/√(⌊nt_j⌋ − ⌊nt_{j−1}⌋) ) ∑_{i=⌊nt_{j−1}⌋+1}^{⌊nt_j⌋} Xi
        →d √(t_j − t_{j−1}) Z_j =d S(t_j) − S(t_{j−1}) ∼ N(0, t_j − t_{j−1}) , j = 1, . . . ,k ,
where Z = (Z1, . . . ,Zk) is a vector of independent N(0, 1) random variables. Thus it follows that
g(Yn) →d g(Y ), where Y ≡ (S(t1), . . . ,S(tk)) and g(Y ) = (S(t1), S(t2) − S(t1), . . . , S(tk) − S(t_{k−1})).
Now g^{−1} = h where h : R^k → R^k given by h(x) ≡ (x1, x1 + x2, . . . , x1 + · · · + xk) is continuous.
Hence by the continuous mapping theorem Yn = h(g(Yn)) →d h(g(Y )) = Y ; i.e.
    (Sn(t1), . . . ,Sn(tk)) = Yn →d Y =d (S(t1), . . . ,S(tk)) .
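The finite-dimensional conclusion can be checked by simulation; the following sketch (ours)
estimates the covariance matrix of (Sn(t1),Sn(t2),Sn(t3)) over many replications and compares it
with the Brownian covariance ti ∧ tj:

    import numpy as np

    rng = np.random.default_rng(3)
    n, reps = 500, 10_000
    ts = np.array([0.2, 0.5, 0.9])
    X = rng.standard_normal((reps, n))
    csum = X.cumsum(axis=1) / np.sqrt(n)        # S_n(i/n) along each replication
    Yn = csum[:, (n * ts).astype(int) - 1]      # (S_n(t_1), S_n(t_2), S_n(t_3))
    print(np.cov(Yn, rowvar=False).round(2))    # approximately min(t_i, t_j)
    print(np.minimum.outer(ts, ts))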
Since ([0, 1], | · |) is clearly totally bounded, it remains to verify the asymptotic equicontinuity
condition:
    lim_{δ↘0} lim sup_{n→∞} P( sup_{|t−s|≤δ} |Sn(t) − Sn(s)| > ε ) = 0 for every ε > 0 .
To do this, let tj = jδ, j = 0, . . . ,k ≡ k(δ), and t_{k+1} = 1, where k is the largest integer strictly
less than 1/δ; i.e. k = ⌈1/δ⌉ − 1. Then tj − t_{j−1} ≤ δ, j = 1, . . . ,k + 1, and, letting tj(t) denote the
largest point tj to the left of t ∈ [0, 1], we find that
    sup_{|t−s|≤δ} |Sn(t) − Sn(s)|
        = sup_{|t−s|≤δ} |Sn(t) − Sn(tj(t)) + Sn(tj(t)) − Sn(tj′(s)) + Sn(tj′(s)) − Sn(s)|
        ≤ max_{0≤j≤k} { sup_{tj≤t≤tj+1} |Sn(t) − Sn(tj)| + |Sn(tj+1) − Sn(tj)| + sup_{tj≤s≤tj+1} |Sn(s) − Sn(tj)| }
        ≤ 3 max_{0≤j≤k} sup_{tj≤t≤tj+1} |Sn(t) − Sn(tj)| .
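As a numerical sanity check (ours, not part of the proof), the next sketch evaluates the modulus
sup_{|t−s|≤δ}|Sn(t) − Sn(s)| on the grid {i/n} and confirms that it is at most three times the largest
block oscillation, as in the display above:

    import numpy as np

    rng = np.random.default_rng(4)
    n, delta = 2000, 0.1
    s = np.concatenate(([0.0], rng.standard_normal(n).cumsum() / np.sqrt(n)))
    m = int(round(delta * n))                  # block length in grid steps
    # Modulus of continuity over grid pairs with |t - s| <= delta:
    modulus = max(np.abs(s[i + 1:i + m + 1] - s[i]).max() for i in range(n))
    # Largest oscillation of S_n over a block [t_j, t_{j+1}], t_j = j * delta:
    osc = max(np.abs(s[j * m:(j + 1) * m + 1] - s[j * m]).max()
              for j in range(n // m))
    print(modulus, 3 * osc, modulus <= 3 * osc)   # last entry is always True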
Therefore, choosing δ so that √δ < ε/12 and using the Ottaviani–Skorohod inequality, it follows
that
    P( sup_{|t−s|≤δ} |Sn(t) − Sn(s)| > ε )
        ≤ P( max_{0≤j≤k} sup_{tj≤t≤tj+1} |Sn(t) − Sn(tj)| > ε/3 )
        ≤ ∑_{j=0}^{k} P( sup_{tj≤t≤tj+1} |Sn(t) − Sn(tj)| > ε/3 )
        = ∑_{j=0}^{k} P( max_{1≤l≤⌊nt_{j+1}⌋−⌊nt_j⌋} | ∑_{i=⌊nt_j⌋+1}^{⌊nt_j⌋+l} Xi | > (ε/3)√n )
        ≤ ∑_{j=0}^{k} 2 P( | ∑_{i=1}^{⌊nt_{j+1}⌋−⌊nt_j⌋} Xi | > (ε/6)√n )
        ≤ ∑_{j=0}^{k} 2 P( | ∑_{i=1}^{⌊nt_{j+1}⌋−⌊nt_j⌋} Xi | ≥ (ε/6)√n ) .
Since
    ( 1/√(⌊nt_{j+1}⌋ − ⌊nt_j⌋) ) ∑_{i=1}^{⌊nt_{j+1}⌋−⌊nt_j⌋} Xi →d N(0, 1) ,
it follows from the portmanteau theorem that
    lim sup_{n→∞} P( sup_{|t−s|≤δ} |Sn(t) − Sn(s)| > ε )
        ≤ 2 (2/δ) P( |Z| ≥ (ε/12) δ^{−1/2} )
        ≤ (4/δ) (12√δ/ε) φ( (ε/12) δ^{−1/2} ) by Mills’ ratio
        = ( 48/(ε√(2π)) ) δ^{−1/2} exp( −(ε²/288) δ^{−1} )
        → 0 as δ ↘ 0 .
It follows from Theorem 5.1 that S̄n ⇒ S in C[0, 1]. We can also conclude, via (3), that Sn ⇒ S in
D[0, 1].
Example 5.3 (Uniform empirical process) Suppose that ξ1,ξ2, . . . ,ξn, . . . are i.i.d. Uniform[0, 1]
random variables. Let Gn(t) = n^{−1} ∑_{i=1}^{n} 1_{[0,t]}(ξi) for 0 ≤ t ≤ 1 be the empirical distribution
function. Then Un(t) = √n (Gn(t) − t) for 0 ≤ t ≤ 1 is the uniform empirical process.
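A minimal simulation sketch (ours) of Gn and Un; the last line prints ‖Un‖∞, which is the
Kolmogorov statistic:

    import numpy as np

    def U_n(xi, ts):
        # Uniform empirical process: U_n(t) = sqrt(n) * (G_n(t) - t).
        n = len(xi)
        Gn = (xi[:, None] <= ts).mean(axis=0)   # empirical df G_n at each t
        return np.sqrt(n) * (Gn - ts)

    rng = np.random.default_rng(5)
    xi = rng.uniform(size=2000)
    ts = np.linspace(0, 1, 501)
    print(np.abs(U_n(xi, ts)).max())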
Now Un →f.d. U where U is a standard Brownian bridge process on [0, 1] (i.e. U is a mean 0
Gaussian process with E{U(s)U(t)} = s ∧ t − st for 0 ≤ s,t ≤ 1). That is, for 0 < t1 < · · · < tk < 1,
    (Un(t1), . . . , Un(tk))′ = (1/√n) ∑_{i=1}^{n} ( 1_{[0,t1]}(ξi) − t1, . . . , 1_{[0,tk]}(ξi) − tk )′
        →d (U(t1), . . . , U(tk))′ ∼ Nk( 0, (tj ∧ tj′ − tj tj′) )
by the multivariate central limit theorem.
To show that Un ⇒ U in ℓ∞([0, 1]), we need to show that Un is asymptotically equicontinuous
in probability; i.e. that
    lim_{δ↘0} lim sup_{n→∞} P( sup_{|t−s|≤δ} |Un(t) − Un(s)| > ε ) = 0
for every ε > 0. Just as we argued in the case of the partial sum process Sn,
    sup_{|t−s|≤δ} |Un(t) − Un(s)| ≤ 3 max_{0≤j≤k} sup_{tj≤t≤tj+1} |Un(t) − Un(tj)|
where again tj = jδ, j = 0, . . . ,k(δ), and k ≡ k(δ) = ⌈1/δ⌉ − 1. Thus, using a Bennett type
exponential bound obtained via Doob’s maximal inequality and the martingale {Un(t)/(1 − t) :
0 ≤ t < 1},
    P( sup_{|t−s|≤δ} |Un(t) − Un(s)| > ε )
        ≤ P( max_{0≤j≤k} sup_{tj≤t≤tj+1} |Un(t) − Un(tj)| > ε/3 )
        ≤ ∑_{j=0}^{k} P( sup_{tj≤t≤tj+1} |Un(t) − Un(tj)| > ε/3 )
        = (k + 1) P( sup_{0≤t≤δ} |Un(t)| > ε/3 )
        ≤ 4(k + 1) exp( −((ε²/9)/(2δ)) (1 − δ) ψ( ε(1 − δ)/(3δ√n) ) ) .
Hence it follows, using ψ(0) = 1, that
    lim sup_{n→∞} P( sup_{|t−s|≤δ} |Un(t) − Un(s)| > ε )
        ≤ (8/δ) exp( −(ε²/(18δ)) (1 − δ) ψ(0) )
        = (8/δ) exp( −(ε²/(18δ)) (1 − δ) )
        → 0 as δ ↘ 0 .
Thus the asymptotic equicontinuity condition in probability holds, and Un ⇒ U in ℓ∞([0, 1]).
6 Problems and Complements
Exercise 6.1 Prove the equivalence of (i) and (ii) in Proposition 2.2.
Exercise 6.2 Suppose that µn → µ and σn² → σ² where both µ and σ² are finite. Suppose that
Z ∼ P0 on R.
(a) Show that Xn =d µn + σnZ →d µ + σZ =d X.
(b) Show that for f ∈ BL(R),
    |Ef(Xn) − Ef(X)| ≤ ‖f‖BL E{1 ∧ (|µn − µ| + |σn − σ||Z|)} .
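The bound in (b) is easy to check by Monte Carlo. A sketch (ours; it assumes P0 = N(0, 1) and
takes f(x) = 1 ∧ |x|, for which K1 = K2 = 1 and hence ‖f‖BL = max{K1, 2K2} = 2):

    import numpy as np

    rng = np.random.default_rng(6)
    Z = rng.standard_normal(1_000_000)
    mu_n, sigma_n, mu, sigma = 0.3, 1.2, 0.0, 1.0
    f = lambda x: np.minimum(1.0, np.abs(x))            # ||f||_BL = 2
    lhs = abs(f(mu_n + sigma_n * Z).mean() - f(mu + sigma * Z).mean())
    rhs = 2 * np.minimum(1.0, abs(mu_n - mu) + abs(sigma_n - sigma) * np.abs(Z)).mean()
    print(lhs, rhs, lhs <= rhs)                         # the bound holds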
Exercise 6.3 Suppose that Xn ∼ N(µn,σn²) and Xn →d (some rv) X. Show that µ ≡ limn µn
and σ² ≡ limn σn² must exist as finite limits, and that X ∼ N(µ,σ²). Hint: with P the law of X,
choose M with P({M}) = P({−M}) = 0 and P[−M,M] > 3/4. Then show that if |µn| > M or if
σn is large enough, then P(|Xn| > M) ≥ 1/2. Show that all convergent subsequences of {(µn,σn)}
must converge to the same limit.
Exercise 6.4 Give a direct proof of the equivalence of (i) and (iv) in Proposition 2.2. Hint:
Consider the functions ψε(y) = ψ(y/ε) where ψ is defined as follows: ψ(y) = 1 if y ≤ 0, ψ(y) = 0 if
y ≥ 1, and
    ψ(y) = ( ∫_y^1 exp(−1/(u(1 − u))) du ) / ( ∫_0^1 exp(−1/(u(1 − u))) du ) for 0 ≤ y ≤ 1 .
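A numerical version of ψ (ours; the two integrals are approximated by Riemann sums on a fine
grid, and psi_eps is the scaled function ψε of the hint):

    import numpy as np

    _u = np.linspace(1e-6, 1 - 1e-6, 100_001)
    _w = np.exp(-1.0 / (_u * (1.0 - _u)))      # the integrand exp(-1/(u(1-u)))
    _du = _u[1] - _u[0]
    _Z = _w.sum() * _du                        # denominator: integral over [0, 1]

    def psi(y):
        # psi = 1 on (-inf, 0], 0 on [1, inf), and smooth in between.
        if y <= 0.0:
            return 1.0
        if y >= 1.0:
            return 0.0
        return _w[_u >= y].sum() * _du / _Z    # numerator: integral over [y, 1]

    def psi_eps(y, eps):
        return psi(y / eps)                    # the scaled version from the hint

    print(psi(0.0), round(psi(0.5), 3), psi(1.0))   # 1.0, 0.5, 0.0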
Exercise 6.5 Prove Proposition 2.3.
Exercise 6.6 Formulate and prove an extension of Proposition 2.1 to Rk.
Exercise 6.7 Suppose that X and Y are independent random vectors, and that W is another
random vector independent of X with E(Y ) = E(W) and Cov(Y ) = Cov(W) and satisfying
E|Y|³ < ∞ and E|W|³ < ∞. Show that if f ∈ C³(R^k) (define carefully what you mean by this
latter class of functions), then
    |Ef(X + Y ) − Ef(X + W)| ≤ C( E|Y|³ + E|W|³ )
where C is a constant depending only on the third derivatives of f.
Exercise 6.8 Let Y be a random vector in R^k with µ = E(Y ) and
    Σ = Cov(Y ) = E{(Y − µ)(Y − µ)′} .
Thus we can write Σ = AΛ²A′ where A is an orthogonal matrix (so AA′ = I) and Λ is diagonal with
each diagonal entry non-negative. Define B = AΛ. Let Z be a random vector with independent
N(0, 1) coordinates; thus Z ∼ Nk(0,I).
(a) Show that |µ| ≤ E|Y|. Hint: Note that u′Y ≤ |Y| for all unit vectors u, and in particular for
u = µ/|µ|.
(b) Show that E|BZ|³ = E( ∑_{i=1}^k λi² Zi² )^{3/2} ≤ (trace(Σ))^{3/2} E|Z1|³.
(c) Show that E|µ + BZ|³ ≤ 8E|Y|³ + 8(E|Y|²)^{3/2} E|Z1|³. Can the factor 8 be improved to 4?
Exercise 6.9 Use the Cramér–Wold device to prove the multivariate CLT from the classical CLT
in R, Theorem 2.2.
Exercise 6.10 Prove Proposition 4.1.
Exercise 6.11 Prove Proposition 4.2.
Exercise 6.12 Prove that the Hellinger distance H(P,Q) does not depend on the choice of the
dominating measure µ.
Exercise 6.13 Show that (ii) of Theorem 4.2 holds.
Exercise 6.14 Show that (iii) of Theorem 4.2 holds.
Exercise 6.15 (Statistical interpretation of the total variation metric) Consider testing P
versus Q. Find the test that minimizes the sum of the error probabilities, and show that the mini-
mum sum of errors is ‖P ∧ Q‖ ≡ ∫ p ∧ q dµ, in the notation of Proposition 4.2. Note that P and
Q are orthogonal as measures if and only if dTV(P,Q) = 1, if and only if ‖P ∧ Q‖ = 0, if and only
if ∫ √(pq) dµ ≡ ∫ √(dP dQ) = 0.
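A small numeric check (ours; it takes P = N(0, 1), Q = N(1, 1), and Lebesgue measure as µ) of
the identity ‖P ∧ Q‖ = 1 − dTV(P,Q) and of the Hellinger affinity ∫ √(pq) dµ:

    import numpy as np

    x = np.linspace(-10, 10, 200_001)
    dx = x[1] - x[0]
    p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # density of P = N(0, 1)
    q = np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)    # density of Q = N(1, 1)
    overlap = np.minimum(p, q).sum() * dx               # ||P /\ Q|| = int p /\ q dmu
    tv = 0.5 * np.abs(p - q).sum() * dx                 # d_TV(P, Q)
    affinity = np.sqrt(p * q).sum() * dx                # int sqrt(pq) dmu
    print(overlap, 1 - tv, affinity)                    # overlap equals 1 - tv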
Exercise 6.16 Show the basic fact used in the proof that (i) implies (ii) in Theorem 5.1: i.e. if
H : ℓ∞(T) → R is bounded and continuous, and K ⊂ ℓ∞(T) is compact, then for every ε > 0 there
is a δ > 0 such that: if x ∈ K and y ∈ ℓ∞(T) with ‖y − x‖T < δ, then |H(x) − H(y)| < ε.