Created: 2026-03-06 07:53:04
Updated: 2026-03-06 07:53:04
12.1 The Method of Types
The AEP lets us focus on a small set of typical sequences; the method of types lets us group together sequences that share the same empirical distribution. Under this refinement we can derive bounds on the number of sequences with a particular empirical distribution, and on the probability of each sequence in that class.
The type $P_{\mathbf{x}}$ of a sequence $x_1,\dots,x_n$ is the relative proportion of occurrences of each symbol of $\mathscr{H}$ in the sequence $\mathbf{x}$:
$$P_{\mathbf{x}}(a) = \frac{N(a\mid \mathbf{x})}{n}, \quad \forall a\in \mathscr{H},$$
where $N(a\mid \mathbf{x})$ is the number of times the symbol $a$ occurs in $\mathbf{x}\in\mathscr{H}^n$.
Let $\mathscr{P}_n$ denote the set of all types with denominator $n$. For example, if $\mathscr{H}=\{0,1\}$, there are $n+1$ types in total:
$$\mathscr{P}_{n} = \left\{(P(0),P(1)) : \left(\frac{0}{n},\frac{n}{n}\right), \left(\frac{1}{n},\frac{n-1}{n}\right), \dots, \left(\frac{n}{n},\frac{0}{n}\right)\right\}$$
Definition: If $P\in\mathscr{P}_n$, the set of all sequences of length $n$ with type $P$ is called the type class of $P$, denoted $T(P)$:
$$T(P) = \left\{\mathbf{x}\in\mathscr{H}^n : P_{\mathbf{x}} = P\right\}$$
For example, if $\mathbf{x} = 11321$ and $\mathscr{H}=\{1,2,3\}$, then $P_{\mathbf{x}}(1)=\frac{3}{5}$, $P_{\mathbf{x}}(2)=\frac{1}{5}$, $P_{\mathbf{x}}(3)=\frac{1}{5}$, and the size of the type class is
$$|T(P)| = \frac{5!}{3!\,1!\,1!} = 20$$
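As a quick illustration (my own sketch, not from the text), the type of a sequence and the exact size of its type class can be computed directly:

```python
from collections import Counter
from math import factorial

def type_of(x):
    """Empirical distribution P_x: symbol -> relative frequency."""
    n = len(x)
    return {a: c / n for a, c in Counter(x).items()}

def type_class_size(x):
    """|T(P_x)| = n! / (product over symbols a of N(a|x)!)."""
    size = factorial(len(x))
    for c in Counter(x).values():
        size //= factorial(c)
    return size

x = [1, 1, 3, 2, 1]
print(type_of(x))          # {1: 0.6, 3: 0.2, 2: 0.2}
print(type_class_size(x))  # 20
```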
Theorem 12.1.1:
$$|\mathscr{P}_{n}| \leq (n+1)^{|\mathscr{H}|}$$
Proof: $|\mathscr{P}_n|$ is the number of ways to write $n$ as an ordered sum of $|\mathscr{H}|$ natural numbers, and each of those numbers can take at most $n+1$ values ($0,1,\dots,n$).
Theorem 12.1.2: If $X_1,\dots,X_n$ are drawn i.i.d. according to $Q(x)$, the probability of $\mathbf{x}$ depends only on its type $P_{\mathbf{x}}$ and is given by
$$Q^n(\mathbf{x}) = 2^{-n(H(P_{\mathbf{x}}) + D(P_{\mathbf{x}}\mid\mid Q))}$$
Remark: $\mathbf{x}$ is the sequence under consideration, with type $P_{\mathbf{x}}$, and $Q$ is the true distribution. By the AEP, $2^{-nH(Q)}$ is the probability of each typical sequence, and the probability above is "maximized" when $P_{\mathbf{x}} = Q$.
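The identity behind the theorem is $H(P_{\mathbf{x}}) + D(P_{\mathbf{x}}\mid\mid Q) = -\sum_a P_{\mathbf{x}}(a)\log Q(a)$. A small numerical check (my own, with an arbitrarily chosen $Q$):

```python
from math import log2, prod

Q = {1: 0.5, 2: 0.25, 3: 0.25}   # assumed example distribution
x = [1, 1, 3, 2, 1]
n = len(x)

# direct product probability Q^n(x)
direct = prod(Q[a] for a in x)

# via the type: H(P_x) + D(P_x||Q) = -sum_a P_x(a) log Q(a)
Px = {a: x.count(a) / n for a in set(x)}
exponent = -sum(p * log2(Q[a]) for a, p in Px.items())
via_type = 2 ** (-n * exponent)

assert abs(direct - via_type) < 1e-12
print(direct)  # 0.0078125 = 0.5^3 * 0.25 * 0.25
```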
Corollary: If $\mathbf{x}$ is in the type class of $Q$, then $Q^n(\mathbf{x}) = 2^{-nH(Q)}$, since $\mathbf{x}\in T(Q) \implies P_{\mathbf{x}} = Q$.

Theorem 12.1.3 (Size of type class $T(P)$): For any type $P\in\mathscr{P}_n$,
$$\frac{1}{(n+1)^{|\mathscr{H}|}}\, 2^{nH(P)} \leq |T(P)| \leq 2^{nH(P)}$$
It follows that the probability of each type class satisfies
$$\frac{2^{-nD(P_{\mathbf{x}}\mid\mid Q)}}{(n+1)^{|\mathscr{H}|}} \leq |T(P_{\mathbf{x}})|\, Q^n(\mathbf{x}) \leq 2^{-nD(P_{\mathbf{x}}\mid\mid Q)}$$
(note that $|T(P_{\mathbf{x}})|\,Q^n(\mathbf{x}) = Q^n(T(P_{\mathbf{x}}))$, since every sequence in the class has the same probability).
This bound is useful for estimating the probability of large deviations of the empirical statistics from their expected values, and gives sharper estimates than the law of large numbers.
The exact expression is:
$$|T(P)| = \frac{n!}{(nP(a_1))!\,(nP(a_2))!\cdots(nP(a_{|\mathscr{H}|}))!}$$
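A sanity check of these bounds (my own sketch) against the exact binomial count for a binary alphabet:

```python
from math import comb, log2

def entropy(P):
    return -sum(p * log2(p) for p in P if p > 0)

n, k = 20, 7
P = (k / n, (n - k) / n)
exact = comb(n, k)                    # |T(P)| when H = {0, 1}
upper = 2 ** (n * entropy(P))
lower = upper / (n + 1) ** 2          # (n+1)^{|H|} with |H| = 2

assert lower <= exact <= upper
print(f"{lower:.0f} <= {exact} <= {upper:.0f}")
```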
Proof:
$$\begin{align}
1 &\geq P^n(T(P)) \\
&= \sum_{\mathbf{x}\in T(P)} P^n(\mathbf{x}) \\
&= \sum_{\mathbf{x}\in T(P)} 2^{-nH(P)} \\
&= |T(P)|\, 2^{-nH(P)}
\end{align}$$
so $|T(P)| \leq 2^{nH(P)}$.
Before proving the lower bound, we first show that under the distribution $P$, the type class $T(P)$ has the highest probability among all type classes:
$$P^n(T(P)) \geq P^n(T(\hat{P})), \quad \forall \hat{P}\in\mathscr{P}_n$$
$$\begin{align}
\frac{P^n(T(P))}{P^n(T(\hat{P}))} &= \frac{|T(P)| \prod_{a\in\mathscr{H}} P(a)^{nP(a)}}{|T(\hat{P})| \prod_{a\in\mathscr{H}} P(a)^{n\hat{P}(a)}} \\
&= \frac{\binom{n}{nP(a_1),\, nP(a_2),\, \dots,\, nP(a_{|\mathscr{H}|})} \prod_{a\in\mathscr{H}} P(a)^{nP(a)}}{\binom{n}{n\hat{P}(a_1),\, n\hat{P}(a_2),\, \dots,\, n\hat{P}(a_{|\mathscr{H}|})} \prod_{a\in\mathscr{H}} P(a)^{n\hat{P}(a)}} \\
&= \prod_{a\in\mathscr{H}} \frac{(n\hat{P}(a))!}{(nP(a))!}\, P(a)^{n(P(a)-\hat{P}(a))}
\end{align}$$
Using the inequality $\frac{m!}{n!} \geq n^{m-n}$,
$$\begin{align}
\frac{P^n(T(P))}{P^n(T(\hat{P}))} &\geq \prod_{a\in\mathscr{H}} (nP(a))^{n\hat{P}(a)-nP(a)}\, P(a)^{n(P(a)-\hat{P}(a))} \\
&= \prod_{a\in\mathscr{H}} n^{n(\hat{P}(a)-P(a))} \\
&= n^{n\left(\sum_{a\in\mathscr{H}}\hat{P}(a) - \sum_{a\in\mathscr{H}}P(a)\right)} \\
&= n^{n(1-1)} \\
&= 1
\end{align}$$
Hence
$$\begin{align}
1 &= \sum_{Q\in\mathscr{P}_n} P^n(T(Q)) \\
&\leq \sum_{Q\in\mathscr{P}_n} \max_{\hat{Q}\in\mathscr{P}_n} P^n(T(\hat{Q})) \\
&= \sum_{Q\in\mathscr{P}_n} P^n(T(P)) \\
&\leq (n+1)^{|\mathscr{H}|}\, P^n(T(P)) \\
&= (n+1)^{|\mathscr{H}|} \sum_{\mathbf{x}\in T(P)} P^n(\mathbf{x}) \\
&= (n+1)^{|\mathscr{H}|} \sum_{\mathbf{x}\in T(P)} 2^{-nH(P)} \\
&= (n+1)^{|\mathscr{H}|}\, |T(P)|\, 2^{-nH(P)}
\end{align}$$
which gives the lower bound $|T(P)| \geq \frac{1}{(n+1)^{|\mathscr{H}|}}\, 2^{nH(P)}$. $\square$
12.4 Large Deviation Theory
Let $E$ be a subset of the set of probability mass functions. For example, $E$ may be the set of probability mass functions with mean $\mu$. With a slight abuse of notation, we write
$$Q^n(E) = Q^n(E\cap\mathscr{P}_n) = \sum_{\mathbf{x}: P_{\mathbf{x}}\in E\cap\mathscr{P}_n} Q^n(\mathbf{x})$$
If $E$ contains a relative entropy neighborhood of $Q$, then by the weak law of large numbers, $Q^n(E)\to 1$. On the other hand, if $E$ does not contain $Q$ or a neighborhood of $Q$, then by the weak law of large numbers, $Q^n(E)\to 0$ exponentially fast. We will use the method of types to calculate the exponent.

Example: suppose we observe that the sample mean of $g(X)$ is at least $\alpha$:
$$\frac{1}{n}\sum_i g(X_i) \geq \alpha$$
This is equivalent to $P_{\mathbf{x}}\in E\cap\mathscr{P}_n$, where
$$E = \left\{P : \sum_{a\in\mathscr{H}} g(a)P(a) \geq \alpha\right\}$$
Sanov's theorem: Let $X_1,X_2,\dots,X_n$ be i.i.d. $\sim Q(x)$, and let $E\subset\mathscr{P}$ be a set of probability distributions. Then
$$Q^n(E) = Q^n(E\cap\mathscr{P}_n) \leq (n+1)^{|\mathscr{H}|}\, 2^{-nD(P^*\mid\mid Q)}$$
where
$$P^* = \arg\min_{P\in E} D(P\mid\mid Q)$$
is the distribution in $E$ that is closest to $Q$ in relative entropy. If, in addition, the set $E$ is the closure of its interior, then
$$\frac{1}{n}\log Q^n(E) \to -D(P^*\mid\mid Q)$$
Example: suppose we wish to find $\Pr\left\{\frac{1}{n}\sum_{i=1}^n g_j(X_i) \geq \alpha_j,\ j=1,2,\dots,k\right\}$. To find the distribution in $E$ closest to $Q$, we use Lagrange multipliers and construct the functional
$$J(P) = \sum_x P(x)\log\frac{P(x)}{Q(x)} + \sum_i \lambda_i \sum_x P(x)\, g_i(x) + \nu\sum_x P(x)$$
Differentiating and solving gives the closest distribution to $Q$:
$$P^*(x) \propto Q(x)\, e^{\sum_i \lambda_i g_i(x)}$$
If $Q$ is uniform, then $P^*$ is the maximum entropy distribution subject to the constraints.

Example: a die is rolled $n$ times; roughly what is the probability that the average of the rolls is at least 4?
$$Q^n(E) \doteq 2^{-nD(P^*\mid\mid Q)}$$
where $P^*$ is the distribution satisfying $\sum_{i=1}^6 i P_i \geq 4$ that minimizes $D(P\mid\mid Q)$. From the discussion above, $P^*$ has the form
$$P^*(x) = \frac{2^{\lambda x}}{\sum_{i=1}^6 2^{\lambda i}}$$
where $\lambda$ is chosen so that the constraint holds with equality. Solving numerically gives
$$\begin{align}
P^* &= (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468) \\
D(P^*\mid\mid Q) &= 0.0624 \text{ bits}
\end{align}$$
Thus the probability that the average of $n$ die rolls is at least 4 is approximately $2^{-0.0624n}$.
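The numbers above can be reproduced with a short script (my own sketch): bisect on the tilting parameter $\lambda$ until the mean of $P^*$ is exactly 4, then evaluate $D(P^*\mid\mid Q)$ against the uniform $Q$.

```python
from math import log2

def tilted(lam):
    """P*(x) proportional to 2^(lam * x) on faces 1..6."""
    w = [2 ** (lam * x) for x in range(1, 7)]
    Z = sum(w)
    return [wi / Z for wi in w]

def mean(P):
    return sum((i + 1) * p for i, p in enumerate(P))

# the mean is increasing in lam, so bisection works
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if mean(tilted(mid)) < 4:
        lo = mid
    else:
        hi = mid

P_star = tilted(lo)
D = sum(p * log2(6 * p) for p in P_star)  # D(P* || uniform) in bits

print([round(p, 4) for p in P_star])  # ≈ (0.1031, ..., 0.2468) as above
print(round(D, 4))                    # ≈ 0.0624
```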
Example 2: the probability of at least 700 heads in 1000 tosses of a fair coin. Here $P^* = (0.7, 0.3)$ and
$$D(P^*\mid\mid Q) = 1 - H(P^*) = 0.119$$
$$\Pr(\bar{X}_n \geq 0.7) \doteq 2^{-nD(P^*\mid\mid Q)}$$
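Sanov's estimate can be compared with the exact binomial tail (my own check; the agreement is only to first order in the exponent, since the theorem ignores polynomial factors):

```python
from math import comb, log2

n, k = 1000, 700
# exact Pr(at least 700 heads in 1000 fair tosses)
exact = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

p = k / n
D = p * log2(2 * p) + (1 - p) * log2(2 * (1 - p))  # D((0.7,0.3) || (0.5,0.5))

print(round(D, 4))                 # 0.1187
print(round(-log2(exact) / n, 4))  # close to D; the gap is a small polynomial correction
```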
Example 3 (Mutual dependence): Let $Q(x,y)$ be a given joint distribution and let $Q_0(x,y) = Q(x)Q(y)$ be the product of its marginals. We wish to know the likelihood that a sample drawn according to $Q_0$ will appear to be jointly distributed according to $Q$. Accordingly, let $(X_i, Y_i)$ be i.i.d. $\sim Q_0(x,y)$. Define joint typicality: $(x^n,y^n)$ is jointly typical with respect to a joint distribution $Q(x,y)$ iff the sample entropies are close to their true values,
$$\begin{align}
\left|-\frac{1}{n}\log Q(x^n) - H(X)\right| &\leq \epsilon \\
\left|-\frac{1}{n}\log Q(y^n) - H(Y)\right| &\leq \epsilon
\end{align}$$
and
$$\left|-\frac{1}{n}\log Q(x^n,y^n) - H(X,Y)\right| \leq \epsilon$$
We wish to compute the probability of seeing such a pair. By the definitions above, $(x^n,y^n)$ is jointly typical with respect to $Q(x,y)$ iff $P_{x^n,y^n}\in E\cap\mathscr{P}_n(X,Y)$, where
$$\begin{aligned}
E = \Big\{P(x,y) :\ &\Big|-\sum_{x,y} P(x,y)\log Q(x) - H(X)\Big| \leq \epsilon, \\
&\Big|-\sum_{x,y} P(x,y)\log Q(y) - H(Y)\Big| \leq \epsilon, \\
&\Big|-\sum_{x,y} P(x,y)\log Q(x,y) - H(X,Y)\Big| \leq \epsilon\Big\}
\end{aligned}$$
By Sanov's theorem, the probability is
$$Q_0^n(E) \doteq 2^{-nD(P^*\mid\mid Q_0)}$$
where $P^*$ is the distribution in $E$ closest to $Q_0$ in relative entropy. In this problem $P^*$ is the joint distribution $Q$ itself, while $Q_0$ is the product distribution, so the probability is $2^{-nD(Q(x,y)\mid\mid Q(x)Q(y))} = 2^{-nI(X;Y)}$, the same result as obtained with the AEP.
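The identity used here, $D(Q(x,y)\mid\mid Q(x)Q(y)) = I(X;Y)$, is easy to verify numerically (my own check, with an arbitrarily chosen joint distribution):

```python
from math import log2

# assumed example joint distribution Q(x, y)
Q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Qx = {x: sum(p for (a, _), p in Q.items() if a == x) for x in (0, 1)}
Qy = {y: sum(p for (_, b), p in Q.items() if b == y) for y in (0, 1)}

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# relative entropy between the joint and the product of marginals
D = sum(p * log2(p / (Qx[x] * Qy[y])) for (x, y), p in Q.items())
I = H(Qx) + H(Qy) - H(Q)   # I(X;Y) via entropies

assert abs(D - I) < 1e-12
print(round(D, 4))  # ≈ 0.2781
```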
12.6 The Conditional Limit Theorem
We have shown that the probability that the type of a sample falls in a set is determined by the element of the set closest to $Q$; to first order in the exponent this probability is $2^{-nD^*}$, where
$$D^* = \min_{P\in E} D(P\mid\mid Q)$$
This is because the probability of the set is the sum of the probabilities of its elements, which is bounded by the number of terms times the largest term. Since the number of types is only polynomial in the sequence length, the sum has the same first-order exponent as its largest term.
We now show the following. Theorem 12.6.1: For a closed convex set $E\subset\mathscr{P}$ and a distribution $Q\notin E$, let $P^*\in E$ be the distribution that achieves the minimum distance to $Q$. Then
$$D(P\mid\mid Q) \geq D(P\mid\mid P^*) + D(P^*\mid\mid Q)$$
for all $P\in E$.
Proof: Let $P\in E$ and $P_\lambda = \lambda P + (1-\lambda)P^*$. Then
$$D_\lambda = D(P_\lambda\mid\mid Q) = \sum_x P_\lambda(x)\log\frac{P_\lambda(x)}{Q(x)}$$
Since $E$ is convex, $P_\lambda\in E$ for all $\lambda\in[0,1]$, and since $P_0 = P^*$ minimizes $D_\lambda$, the derivative at $\lambda=0$ is nonnegative:
$$\begin{align}
0 &\leq \left(\frac{dD_\lambda}{d\lambda}\right)_{\lambda=0} \\
&= \sum_x (P(x)-P^*(x))\log\frac{P^*(x)}{Q(x)} \\
&= \sum_x P(x)\log\frac{P^*(x)}{Q(x)} - \sum_x P^*(x)\log\frac{P^*(x)}{Q(x)} \\
&= \sum_x P(x)\log\frac{P(x)}{Q(x)}\frac{P^*(x)}{P(x)} - \sum_x P^*(x)\log\frac{P^*(x)}{Q(x)} \\
&= D(P\mid\mid Q) - D(P\mid\mid P^*) - D(P^*\mid\mid Q) \qquad\square
\end{align}$$
Note that the relative entropy $D(P\mid\mid Q)$ behaves like the square of the Euclidean distance.
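The "Pythagorean" inequality of Theorem 12.6.1 can be checked numerically (my own sketch) for the half-space $E=\{P : P(1)\geq 0.7\}$ on a binary alphabet with $Q=(0.5,0.5)$; since $D((1-p,p)\mid\mid Q) = 1-H(p)$ is increasing for $p>0.5$, the minimizer is $P^*=(0.3,0.7)$.

```python
from math import log2

def D(P, Q):
    """Relative entropy D(P||Q) in bits for distributions given as tuples."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

Q = (0.5, 0.5)
P_star = (0.3, 0.7)   # closest point of E = {P : P(1) >= 0.7} to Q

for p1 in (0.7, 0.8, 0.9, 0.99):
    P = (1 - p1, p1)
    lhs = D(P, Q)
    rhs = D(P, P_star) + D(P_star, Q)
    assert lhs >= rhs - 1e-12   # equality holds at the boundary point P = P*
    print(round(lhs, 4), ">=", round(rhs, 4))
```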
Theorem 12.6.2 (Conditional limit theorem): Let $E$ be a closed convex subset of $\mathscr{P}$ and let $Q$ be a distribution not in $E$. Let $X_1,X_2,\dots,X_n$ be discrete random variables drawn i.i.d. $\sim Q$, and let $P^*$ achieve $\min_{P\in E} D(P\mid\mid Q)$. Then
$$\Pr(X_1 = a \mid P_{X^n}\in E) \to P^*(a)$$
in probability as $n\to\infty$; that is, the conditional distribution of $X_1$, given that the type of the sequence is in $E$, is close to $P^*$ for large $n$.
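A Monte Carlo sketch (my own; $n$ is finite here, so the match is only approximate) of the conditional limit theorem for the die example: conditioned on the sample mean being at least 4, the conditional law of $X_1$ shifts from the uniform $Q$ toward the tilted $P^* = (0.1031,\dots,0.2468)$ of the die example above.

```python
import random
from collections import Counter

random.seed(0)
n, trials = 50, 100_000
counts, accepted = Counter(), 0

for _ in range(trials):
    xs = [random.randint(1, 6) for _ in range(n)]
    if sum(xs) >= 4 * n:        # condition: the type lies in E (mean >= 4)
        counts[xs[0]] += 1      # record the first coordinate
        accepted += 1

cond = {a: round(counts[a] / accepted, 3) for a in sorted(counts)}
print(cond)  # larger faces come out noticeably more likely than 1/6
```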