Created: 2026-03-06 07:53:04
Updated: 2026-03-06 07:53:04
12.1 The Method of Types
The AEP lets us focus on a small set of typical sequences; the method of types lets us group together sequences that share the same empirical distribution. Under this refinement we can derive bounds on the number of sequences with a particular empirical distribution, and on the probability of each sequence in that class.
The type $P_{\mathbf{x}}$ of a sequence $x_1,\dots,x_n$ is the relative proportion of occurrences of each symbol of $\mathscr{H}$ in the sequence $\mathbf{x}$:
$$P_{\mathbf{x}}(a) = \frac{N(a\mid \mathbf{x})}{n}, \quad \forall a\in \mathscr{H},$$
where $N(a\mid \mathbf{x})$ is the number of times the symbol $a$ occurs in $\mathbf{x}\in\mathscr{H}^n$.
Let $\mathscr{P}_n$ denote the set of all types with denominator $n$. For example, if $\mathscr{H}=\{0,1\}$, there are $n+1$ types in total:
$$\mathscr{P}_{n} = \left\{(P(0),P(1)) : \left(\frac{0}{n},\frac{n}{n}\right), \left(\frac{1}{n},\frac{n-1}{n}\right), \dots, \left(\frac{n}{n},\frac{0}{n}\right)\right\}$$
Definition: If $P\in\mathscr{P}_n$, the set of all sequences of length $n$ with type $P$ is called the type class of $P$, denoted $T(P)$:
$$T(P) = \left\{\mathbf{x}\in\mathscr{H}^n : P_{\mathbf{x}} = P\right\}$$
For example, if $\mathbf{x} = 11321$ and $\mathscr{H}=\{1,2,3\}$, then $P_{\mathbf{x}}(1)=\frac{3}{5}$, $P_{\mathbf{x}}(2)=\frac{1}{5}$, $P_{\mathbf{x}}(3)=\frac{1}{5}$, and the size of the type class is
$$|T(P)| = \frac{5!}{3!\,1!\,1!} = 20$$
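As a quick illustration (my own sketch, not from the text), the type of a sequence and the exact size of its type class can be computed directly:

```python
from collections import Counter
from math import factorial

def type_of(x):
    """Empirical distribution P_x: symbol -> relative frequency."""
    n = len(x)
    return {a: c / n for a, c in Counter(x).items()}

def type_class_size(x):
    """|T(P_x)| = n! / (product over symbols a of N(a|x)!)."""
    size = factorial(len(x))
    for c in Counter(x).values():
        size //= factorial(c)
    return size

x = [1, 1, 3, 2, 1]
print(type_of(x))          # {1: 0.6, 3: 0.2, 2: 0.2}
print(type_class_size(x))  # 20
```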
Theorem 12.1.1:
$$|\mathscr{P}_{n}| \leq (n+1)^{|\mathscr{H}|}$$
Proof: $|\mathscr{P}_n|$ is the number of ways to write $n$ as an ordered sum of $|\mathscr{H}|$ natural numbers, and each of those numbers can take at most $n+1$ values ($0,1,\dots,n$).
Theorem 12.1.2: If $X_1,\dots,X_n$ are drawn i.i.d. according to $Q(x)$, the probability of $\mathbf{x}$ depends only on its type $P_{\mathbf{x}}$ and is given by
$$Q^n(\mathbf{x}) = 2^{-n(H(P_{\mathbf{x}}) + D(P_{\mathbf{x}}\mid\mid Q))}$$
Remark: $\mathbf{x}$ is the sequence under consideration, with type $P_{\mathbf{x}}$, and $Q$ is the true distribution. By the AEP, $2^{-nH(Q)}$ is the probability of each typical sequence, and the probability above is "maximized" when $P_{\mathbf{x}} = Q$.
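The identity behind the theorem is $H(P_{\mathbf{x}}) + D(P_{\mathbf{x}}\mid\mid Q) = -\sum_a P_{\mathbf{x}}(a)\log Q(a)$. A small numerical check (my own, with an arbitrarily chosen $Q$):

```python
from math import log2, prod

Q = {1: 0.5, 2: 0.25, 3: 0.25}   # assumed example distribution
x = [1, 1, 3, 2, 1]
n = len(x)

# direct product probability Q^n(x)
direct = prod(Q[a] for a in x)

# via the type: H(P_x) + D(P_x||Q) = -sum_a P_x(a) log Q(a)
Px = {a: x.count(a) / n for a in set(x)}
exponent = -sum(p * log2(Q[a]) for a, p in Px.items())
via_type = 2 ** (-n * exponent)

assert abs(direct - via_type) < 1e-12
print(direct)  # 0.0078125 = 0.5^3 * 0.25 * 0.25
```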
Corollary: If $\mathbf{x}$ is in the type class of $Q$, then $Q^n(\mathbf{x}) = 2^{-nH(Q)}$, since $\mathbf{x}\in T(Q) \implies P_{\mathbf{x}} = Q$.

Theorem 12.1.3 (Size of type class $T(P)$): For any type $P\in\mathscr{P}_n$,
$$\frac{1}{(n+1)^{|\mathscr{H}|}}\, 2^{nH(P)} \leq |T(P)| \leq 2^{nH(P)}$$
It follows that the probability of each type class satisfies
$$\frac{2^{-nD(P_{\mathbf{x}}\mid\mid Q)}}{(n+1)^{|\mathscr{H}|}} \leq |T(P_{\mathbf{x}})|\, Q^n(\mathbf{x}) \leq 2^{-nD(P_{\mathbf{x}}\mid\mid Q)}$$
(note that $|T(P_{\mathbf{x}})|\,Q^n(\mathbf{x}) = Q^n(T(P_{\mathbf{x}}))$, since every sequence in the class has the same probability).
This bound is useful for estimating the probability of large deviations of the empirical statistics from their expected values, and gives sharper estimates than the law of large numbers.
The exact expression is:
$$|T(P)| = \frac{n!}{(nP(a_1))!\,(nP(a_2))!\cdots(nP(a_{|\mathscr{H}|}))!}$$
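A sanity check of these bounds (my own sketch) against the exact binomial count for a binary alphabet:

```python
from math import comb, log2

def entropy(P):
    return -sum(p * log2(p) for p in P if p > 0)

n, k = 20, 7
P = (k / n, (n - k) / n)
exact = comb(n, k)                    # |T(P)| when H = {0, 1}
upper = 2 ** (n * entropy(P))
lower = upper / (n + 1) ** 2          # (n+1)^{|H|} with |H| = 2

assert lower <= exact <= upper
print(f"{lower:.0f} <= {exact} <= {upper:.0f}")
```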
Proof:
$$\begin{align}
1 &\geq P^n(T(P)) \\
&= \sum_{\mathbf{x}\in T(P)} P^n(\mathbf{x}) \\
&= \sum_{\mathbf{x}\in T(P)} 2^{-nH(P)} \\
&= |T(P)|\, 2^{-nH(P)}
\end{align}$$
so $|T(P)| \leq 2^{nH(P)}$.
Before proving the lower bound, we first show that under the distribution $P$, the type class $T(P)$ has the highest probability among all type classes:
$$P^n(T(P)) \geq P^n(T(\hat{P})), \quad \forall \hat{P}\in\mathscr{P}_n$$
$$\begin{align}
\frac{P^n(T(P))}{P^n(T(\hat{P}))} &= \frac{|T(P)| \prod_{a\in\mathscr{H}} P(a)^{nP(a)}}{|T(\hat{P})| \prod_{a\in\mathscr{H}} P(a)^{n\hat{P}(a)}} \\
&= \frac{\binom{n}{nP(a_1),\, nP(a_2),\, \dots,\, nP(a_{|\mathscr{H}|})} \prod_{a\in\mathscr{H}} P(a)^{nP(a)}}{\binom{n}{n\hat{P}(a_1),\, n\hat{P}(a_2),\, \dots,\, n\hat{P}(a_{|\mathscr{H}|})} \prod_{a\in\mathscr{H}} P(a)^{n\hat{P}(a)}} \\
&= \prod_{a\in\mathscr{H}} \frac{(n\hat{P}(a))!}{(nP(a))!}\, P(a)^{n(P(a)-\hat{P}(a))}
\end{align}$$
Using the inequality $\frac{m!}{n!} \geq n^{m-n}$,
$$\begin{align}
\frac{P^n(T(P))}{P^n(T(\hat{P}))} &\geq \prod_{a\in\mathscr{H}} (nP(a))^{n\hat{P}(a)-nP(a)}\, P(a)^{n(P(a)-\hat{P}(a))} \\
&= \prod_{a\in\mathscr{H}} n^{n(\hat{P}(a)-P(a))} \\
&= n^{n\left(\sum_{a\in\mathscr{H}}\hat{P}(a) - \sum_{a\in\mathscr{H}}P(a)\right)} \\
&= n^{n(1-1)} \\
&= 1
\end{align}$$
Hence
$$\begin{align}
1 &= \sum_{Q\in\mathscr{P}_n} P^n(T(Q)) \\
&\leq \sum_{Q\in\mathscr{P}_n} \max_{\hat{Q}\in\mathscr{P}_n} P^n(T(\hat{Q})) \\
&= \sum_{Q\in\mathscr{P}_n} P^n(T(P)) \\
&\leq (n+1)^{|\mathscr{H}|}\, P^n(T(P)) \\
&= (n+1)^{|\mathscr{H}|} \sum_{\mathbf{x}\in T(P)} P^n(\mathbf{x}) \\
&= (n+1)^{|\mathscr{H}|} \sum_{\mathbf{x}\in T(P)} 2^{-nH(P)} \\
&= (n+1)^{|\mathscr{H}|}\, |T(P)|\, 2^{-nH(P)}
\end{align}$$
which gives the lower bound $|T(P)| \geq \frac{1}{(n+1)^{|\mathscr{H}|}}\, 2^{nH(P)}$. $\square$
12.4 Large Deviation Theory
Let $E$ be a subset of the set of probability mass functions. For example, $E$ may be the set of probability mass functions with mean $\mu$. With a slight abuse of notation, we write
$$Q^n(E) = Q^n(E\cap\mathscr{P}_n) = \sum_{\mathbf{x}: P_{\mathbf{x}}\in E\cap\mathscr{P}_n} Q^n(\mathbf{x})$$
If $E$ contains a relative entropy neighborhood of $Q$, then by the weak law of large numbers, $Q^n(E)\to 1$. On the other hand, if $E$ does not contain $Q$ or a neighborhood of $Q$, then by the weak law of large numbers, $Q^n(E)\to 0$ exponentially fast. We will use the method of types to calculate the exponent.

Example: suppose we observe that the sample mean of $g(X)$ is at least $\alpha$:
$$\frac{1}{n}\sum_i g(X_i) \geq \alpha$$
This is equivalent to $P_{\mathbf{x}}\in E\cap\mathscr{P}_n$, where
$$E = \left\{P : \sum_{a\in\mathscr{H}} g(a)P(a) \geq \alpha\right\}$$
Sanov's theorem: Let $X_1,X_2,\dots,X_n$ be i.i.d. $\sim Q(x)$, and let $E\subset\mathscr{P}$ be a set of probability distributions. Then
$$Q^n(E) = Q^n(E\cap\mathscr{P}_n) \leq (n+1)^{|\mathscr{H}|}\, 2^{-nD(P^*\mid\mid Q)}$$
where
$$P^* = \arg\min_{P\in E} D(P\mid\mid Q)$$
is the distribution in $E$ that is closest to $Q$ in relative entropy. If, in addition, the set $E$ is the closure of its interior, then
$$\frac{1}{n}\log Q^n(E) \to -D(P^*\mid\mid Q)$$
Example: suppose we wish to find $\Pr\left\{\frac{1}{n}\sum_{i=1}^n g_j(X_i) \geq \alpha_j,\ j=1,2,\dots,k\right\}$. To find the distribution in $E$ closest to $Q$, we use Lagrange multipliers and construct the functional
$$J(P) = \sum_x P(x)\log\frac{P(x)}{Q(x)} + \sum_i \lambda_i \sum_x P(x)\, g_i(x) + \nu\sum_x P(x)$$
Differentiating and solving gives the closest distribution to $Q$:
$$P^*(x) \propto Q(x)\, e^{\sum_i \lambda_i g_i(x)}$$
If $Q$ is uniform, then $P^*$ is the maximum entropy distribution subject to the constraints.

Example: a die is rolled $n$ times; roughly what is the probability that the average of the rolls is at least 4?
$$Q^n(E) \doteq 2^{-nD(P^*\mid\mid Q)}$$
where $P^*$ is the distribution satisfying $\sum_{i=1}^6 i P_i \geq 4$ that minimizes $D(P\mid\mid Q)$. From the discussion above, $P^*$ has the form
$$P^*(x) = \frac{2^{\lambda x}}{\sum_{i=1}^6 2^{\lambda i}}$$
where $\lambda$ is chosen so that the constraint holds with equality. Solving numerically gives
$$\begin{align}
P^* &= (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468) \\
D(P^*\mid\mid Q) &= 0.0624 \text{ bits}
\end{align}$$
Thus the probability that the average of $n$ die rolls is at least 4 is approximately $2^{-0.0624n}$.
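The numbers above can be reproduced with a short script (my own sketch): bisect on the tilting parameter $\lambda$ until the mean of $P^*$ is exactly 4, then evaluate $D(P^*\mid\mid Q)$ against the uniform $Q$.

```python
from math import log2

def tilted(lam):
    """P*(x) proportional to 2^(lam * x) on faces 1..6."""
    w = [2 ** (lam * x) for x in range(1, 7)]
    Z = sum(w)
    return [wi / Z for wi in w]

def mean(P):
    return sum((i + 1) * p for i, p in enumerate(P))

# the mean is increasing in lam, so bisection works
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if mean(tilted(mid)) < 4:
        lo = mid
    else:
        hi = mid

P_star = tilted(lo)
D = sum(p * log2(6 * p) for p in P_star)  # D(P* || uniform) in bits

print([round(p, 4) for p in P_star])  # ≈ (0.1031, ..., 0.2468) as above
print(round(D, 4))                    # ≈ 0.0624
```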
Example 2: the probability of at least 700 heads in 1000 tosses of a fair coin. Here $P^* = (0.7, 0.3)$ and
$$D(P^*\mid\mid Q) = 1 - H(P^*) = 0.119$$
$$\Pr(\bar{X}_n \geq 0.7) \doteq 2^{-nD(P^*\mid\mid Q)}$$
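Sanov's estimate can be compared with the exact binomial tail (my own check; the agreement is only to first order in the exponent, since the theorem ignores polynomial factors):

```python
from math import comb, log2

n, k = 1000, 700
# exact Pr(at least 700 heads in 1000 fair tosses)
exact = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

p = k / n
D = p * log2(2 * p) + (1 - p) * log2(2 * (1 - p))  # D((0.7,0.3) || (0.5,0.5))

print(round(D, 4))                 # 0.1187
print(round(-log2(exact) / n, 4))  # close to D; the gap is a small polynomial correction
```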
Example 3 (Mutual dependence): Let $Q(x,y)$ be a given joint distribution and let $Q_0(x,y) = Q(x)Q(y)$ be the product of its marginals. We wish to know the likelihood that a sample drawn according to $Q_0$ will appear to be jointly distributed according to $Q$. Accordingly, let $(X_i, Y_i)$ be i.i.d. $\sim Q_0(x,y)$. Define joint typicality: $(x^n,y^n)$ is jointly typical with respect to a joint distribution $Q(x,y)$ iff the sample entropies are close to their true values,
$$\begin{align}
\left|-\frac{1}{n}\log Q(x^n) - H(X)\right| &\leq \epsilon \\
\left|-\frac{1}{n}\log Q(y^n) - H(Y)\right| &\leq \epsilon
\end{align}$$
and
$$\left|-\frac{1}{n}\log Q(x^n,y^n) - H(X,Y)\right| \leq \epsilon$$
We wish to compute the probability of seeing such a pair. By the definitions above, $(x^n,y^n)$ is jointly typical with respect to $Q(x,y)$ iff $P_{x^n,y^n}\in E\cap\mathscr{P}_n(X,Y)$, where
$$\begin{aligned}
E = \Big\{P(x,y) :\ &\Big|-\sum_{x,y} P(x,y)\log Q(x) - H(X)\Big| \leq \epsilon, \\
&\Big|-\sum_{x,y} P(x,y)\log Q(y) - H(Y)\Big| \leq \epsilon, \\
&\Big|-\sum_{x,y} P(x,y)\log Q(x,y) - H(X,Y)\Big| \leq \epsilon\Big\}
\end{aligned}$$
By Sanov's theorem, the probability is
$$Q_0^n(E) \doteq 2^{-nD(P^*\mid\mid Q_0)}$$
where $P^*$ is the distribution in $E$ closest to $Q_0$ in relative entropy. In this problem $P^*$ is the joint distribution $Q$ itself, while $Q_0$ is the product distribution, so the probability is $2^{-nD(Q(x,y)\mid\mid Q(x)Q(y))} = 2^{-nI(X;Y)}$, the same result as obtained with the AEP.
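The identity used here, $D(Q(x,y)\mid\mid Q(x)Q(y)) = I(X;Y)$, is easy to verify numerically (my own check, with an arbitrarily chosen joint distribution):

```python
from math import log2

# assumed example joint distribution Q(x, y)
Q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Qx = {x: sum(p for (a, _), p in Q.items() if a == x) for x in (0, 1)}
Qy = {y: sum(p for (_, b), p in Q.items() if b == y) for y in (0, 1)}

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# relative entropy between the joint and the product of marginals
D = sum(p * log2(p / (Qx[x] * Qy[y])) for (x, y), p in Q.items())
I = H(Qx) + H(Qy) - H(Q)   # I(X;Y) via entropies

assert abs(D - I) < 1e-12
print(round(D, 4))  # ≈ 0.2781
```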
12.6 The Conditional Limit Theorem
We have shown that the probability that the type of a sample falls in a set is determined by the element of the set closest to $Q$; to first order in the exponent this probability is $2^{-nD^*}$, where
$$D^* = \min_{P\in E} D(P\mid\mid Q)$$
This is because the probability of the set is the sum of the probabilities of its elements, which is bounded by the number of terms times the largest term. Since the number of types is only polynomial in the sequence length, the sum has the same first-order exponent as its largest term.
We now show the following. Theorem 12.6.1: For a closed convex set $E\subset\mathscr{P}$ and a distribution $Q\notin E$, let $P^*\in E$ be the distribution that achieves the minimum distance to $Q$. Then
$$D(P\mid\mid Q) \geq D(P\mid\mid P^*) + D(P^*\mid\mid Q)$$
for all $P\in E$.
Proof: Let $P\in E$ and $P_\lambda = \lambda P + (1-\lambda)P^*$. Then
$$D_\lambda = D(P_\lambda\mid\mid Q) = \sum_x P_\lambda(x)\log\frac{P_\lambda(x)}{Q(x)}$$
Since $E$ is convex, $P_\lambda\in E$ for all $\lambda\in[0,1]$, and since $P_0 = P^*$ minimizes $D_\lambda$, the derivative at $\lambda=0$ is nonnegative:
$$\begin{align}
0 &\leq \left(\frac{dD_\lambda}{d\lambda}\right)_{\lambda=0} \\
&= \sum_x (P(x)-P^*(x))\log\frac{P^*(x)}{Q(x)} \\
&= \sum_x P(x)\log\frac{P^*(x)}{Q(x)} - \sum_x P^*(x)\log\frac{P^*(x)}{Q(x)} \\
&= \sum_x P(x)\log\frac{P(x)}{Q(x)}\frac{P^*(x)}{P(x)} - \sum_x P^*(x)\log\frac{P^*(x)}{Q(x)} \\
&= D(P\mid\mid Q) - D(P\mid\mid P^*) - D(P^*\mid\mid Q) \qquad\square
\end{align}$$
Note that the relative entropy $D(P\mid\mid Q)$ behaves like the square of the Euclidean distance.
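The "Pythagorean" inequality of Theorem 12.6.1 can be checked numerically (my own sketch) for the half-space $E=\{P : P(1)\geq 0.7\}$ on a binary alphabet with $Q=(0.5,0.5)$; since $D((1-p,p)\mid\mid Q) = 1-H(p)$ is increasing for $p>0.5$, the minimizer is $P^*=(0.3,0.7)$.

```python
from math import log2

def D(P, Q):
    """Relative entropy D(P||Q) in bits for distributions given as tuples."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

Q = (0.5, 0.5)
P_star = (0.3, 0.7)   # closest point of E = {P : P(1) >= 0.7} to Q

for p1 in (0.7, 0.8, 0.9, 0.99):
    P = (1 - p1, p1)
    lhs = D(P, Q)
    rhs = D(P, P_star) + D(P_star, Q)
    assert lhs >= rhs - 1e-12   # equality holds at the boundary point P = P*
    print(round(lhs, 4), ">=", round(rhs, 4))
```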
Theorem 12.6.2 (Conditional limit theorem): Let $E$ be a closed convex subset of $\mathscr{P}$ and let $Q$ be a distribution not in $E$. Let $X_1,X_2,\dots,X_n$ be discrete random variables drawn i.i.d. $\sim Q$, and let $P^*$ achieve $\min_{P\in E} D(P\mid\mid Q)$. Then
$$\Pr(X_1 = a \mid P_{X^n}\in E) \to P^*(a)$$
in probability as $n\to\infty$; that is, the conditional distribution of $X_1$, given that the type of the sequence is in $E$, is close to $P^*$ for large $n$.
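A Monte Carlo sketch (my own; $n$ is finite here, so the match is only approximate) of the conditional limit theorem for the die example: conditioned on the sample mean being at least 4, the conditional law of $X_1$ shifts from the uniform $Q$ toward the tilted $P^* = (0.1031,\dots,0.2468)$ of the die example above.

```python
import random
from collections import Counter

random.seed(0)
n, trials = 50, 100_000
counts, accepted = Counter(), 0

for _ in range(trials):
    xs = [random.randint(1, 6) for _ in range(n)]
    if sum(xs) >= 4 * n:        # condition: the type lies in E (mean >= 4)
        counts[xs[0]] += 1      # record the first coordinate
        accepted += 1

cond = {a: round(counts[a] / accepted, 3) for a in sorted(counts)}
print(cond)  # larger faces come out noticeably more likely than 1/6
```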