2.1 Entropy
Coding: a code must be uniquely decodable, with no ambiguity (e.g., Huffman coding). Prefix-free codes: a set of codewords $c_{1},c_{2},\dots,c_{n}$ such that for all $c_{i},c_{j}$ ($i\neq j$), neither is a prefix of the other. Prefix codes are uniquely decodable. Kraft inequality: let the source alphabet be
$$
S=\{s_{1},s_{2},\dots,s_{n}\}
$$
If it is encoded over an alphabet of size $r$ into a uniquely decodable code with codeword lengths $\ell_{1},\ell_{2},\dots,\ell_{n}$, then
$$
\sum_{i=1}^n r^{-\ell_{i}}\leq 1
$$
Definition: The entropy $H(X)$ of a discrete random variable $X$ is defined by
$$
H(X) = -\sum_{x\in \mathscr{H}}p(x)\log p(x)= E_{p}\log \frac{1}{p(X)}
$$
where $E_{p}$ denotes the expectation of the function that follows, with $X$ distributed according to $p$:
$$
E_{p}g(X)=\sum_{x \in \mathscr{H}}g(x)p(x)
$$
Lemma:
$$
H(X)\geq 0
$$
$$
H_{b}(X)=(\log_{b}a)H_{a}(X)
$$
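As a quick sanity check (not part of the original notes), here is a minimal Python sketch that computes the entropy of a hand-picked dyadic distribution and verifies the base-change identity $H_{b}(X)=(\log_{b}a)H_{a}(X)$; the distribution `p` is an arbitrary example.

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a probability vector, with the convention 0*log(0) = 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]    # a made-up dyadic distribution
H2 = entropy(p, base=2)          # in bits
He = entropy(p, base=math.e)     # in nats
print(H2)                        # 1.75 bits exactly for this p
print(abs(H2 - math.log2(math.e) * He) < 1e-12)  # base-change identity holds
```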
2.2 Joint Entropy, Conditional Entropy
Definition: Joint Entropy
$$
H(X,Y)=-\sum_{x\in\mathscr{H}}\sum_{y\in\mathscr{Y}}p(x,y)\log p(x,y)=-E\log p(X,Y)
$$
Conditional Entropy: If $(X,Y)\sim p(x,y)$, then the conditional entropy $H(Y|X)$ is defined as:
$$
\begin{align}
H(Y|X) & =\sum_{x\in\mathscr{H}}p(x)H(Y|X=x) \\
 & =-\sum_{x\in \mathscr{H}}p(x)\sum_{y\in \mathscr{Y}}p(y|x)\log p(y|x) \\
 & =-\sum_{x\in \mathscr{H}}\sum_{y\in \mathscr{Y}}p(x,y)\log p(y|x) \\
 & = -E_{p(x,y)}\log p(Y|X)
\end{align}
$$
Theorem (Chain rule):
$$
H(X,Y) = H(X)+H(Y|X)
$$
Corollary:
$$
H(X,Y|Z)=H(X|Z)+H(Y|X,Z)
$$
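To make the chain rule concrete, the following sketch (an illustrative example, not from the notes) verifies $H(X,Y)=H(X)+H(Y|X)$ on a made-up 2×2 joint distribution.

```python
import math

def H(probs):
    """Entropy of an iterable of probabilities, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# hand-picked joint distribution p(x, y)
pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
px = {x: sum(v for (a, _), v in pxy.items() if a == x) for x in (0, 1)}

H_joint = H(pxy.values())
H_x = H(px.values())
# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x)
H_y_given_x = -sum(v * math.log2(v / px[x]) for (x, _), v in pxy.items() if v > 0)
print(abs(H_joint - (H_x + H_y_given_x)) < 1e-12)  # True: chain rule holds
```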
2.3 Relative Entropy and Mutual Information
Relative Entropy definition:
$$
\begin{align}
D(p\mid \mid q) & = \sum_{x\in\mathscr{H}}p(x)\log \frac{p(x)}{q(x)} \\
 & =E_{p}\log \frac{p(X)}{q(X)}
\end{align}
$$
We use the conventions $0\log \frac{0}{q}=0$ and $p\log \frac{p}{0}=\infty$.
Properties:
Nonnegativity: $D(p\mid\mid q)\geq 0$, with $D(p\mid\mid q)=0\Leftrightarrow p=q$.
Asymmetry: $D$ is not symmetric and does not satisfy the triangle inequality.
Mutual information: relative entropy between the joint distribution and the product distribution
$$
I(X;Y)=\sum_{x\in \mathscr{H}}\sum_{y\in \mathscr{Y}}p(x,y)\log \frac{p(x,y)}{p(x)p(y)}
$$
Example: when $X,Y$ are independent, $p(x,y)=p(x)p(y)$, so $I(X;Y)=0$.
Relation between entropy and mutual information (easy to verify):
$$
I(X;Y)=H(Y)-H(Y|X)
$$
By $H(X,Y) = H(X)+H(Y|X)$ from the previous section,
$$
I(X;Y)=H(X)+H(Y)-H(X,Y)
$$
When $Y=X$, $I(X;X)=H(X)+H(X)-H(X,X)=H(X)$.
Relative entropy is not a true distance metric. First, it is asymmetric: $D(p_{X_{1}}\mid\mid p_{X_{2}})\neq D(p_{X_{2}}\mid\mid p_{X_{1}})$. Moreover, when $\exists x$ such that $p_{X_{1}}(x)\neq 0$ and $p_{X_{2}}(x)=0$, the relative entropy diverges. Nevertheless, we often treat it as a measure of the distance between probability distributions, and it is closely tied to genuine distance metrics. For example (Pinsker's inequality),
$$
D(p_{X}\mid\mid p_{Y})\geq \frac{1}{2\ln 2} \mid\mid p_{X}-p_{Y}\mid\mid_{1}^2
$$
where $\mid\mid p_{X}-p_{Y}\mid\mid_{1}=\sum_{a}\mid p_{X}(a)-p_{Y}(a)\mid$.
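The following sketch illustrates both points numerically on two made-up binary distributions: $D$ is asymmetric, yet it dominates the squared $L_1$ distance as in the inequality above.

```python
import math

def D(p, q):
    """Relative entropy D(p||q) in bits, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.4, 0.6]  # arbitrary example distributions
q = [0.7, 0.3]
l1 = sum(abs(pi - qi) for pi, qi in zip(p, q))
print(D(p, q), D(q, p))                        # the two values differ: asymmetry
print(D(p, q) >= l1 ** 2 / (2 * math.log(2)))  # True: Pinsker's bound holds
```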
2.5 Chain Rules for Entropy, Relative Entropy and Mutual Information
Chain rule for entropy: if $X_{1},X_{2},\dots,X_{n}$ have joint distribution $p(x_{1},\dots,x_{n})$, then
$$
H(X_{1},X_{2},\dots,X_{n})=\sum_{i=1}^{n}H(X_{i}|X_{i-1},\dots,X_{1})
$$
Conditional mutual information:
$$
I(X;Y|Z)=H(X|Z)-H(X|Y,Z)
$$
Chain rule for mutual information:
$$
I(X_{1},X_{2},\dots,X_{n};Y)=\sum_{i=1}^nI(X_{i};Y|X_{i-1},X_{i-2},\dots,X_{1})
$$
Conditional relative entropy:
$$
D(p(y|x)\mid\mid q(y|x))=\sum_{x}p(x)\sum_{y}p(y|x)\log \frac{p(y|x)}{q(y|x)}
$$
Chain rule for relative entropy:
$$
D(p(x,y)\mid\mid q(x,y))=D(p(x)\mid\mid q(x))+D(p(y|x)\mid\mid q(y|x))
$$
Proof:
$$
\begin{align}
D(p(x,y)\mid\mid q(x,y)) & =\sum_{x}\sum_{y}p(x,y)\log \frac{p(x,y)}{q(x,y)} \\
 & =\sum_{x}\sum_{y}p(x,y) \log \frac{p(x)p(y\mid x)}{q(x)q(y\mid x)} \\
 & = \sum_{x}\sum_{y}p(x,y)\log \frac{p(x)}{q(x)} + \sum_{x}\sum_{y}p(x,y)\log \frac{p(y\mid x)}{q(y\mid x)} \\
 & = D(p(x)\mid \mid q(x) )+ D(p(y|x)\mid\mid q(y|x))
\end{align}
$$
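A numeric check of this chain rule, using two hand-picked joint distributions on a 2×2 alphabet (the numbers are arbitrary):

```python
import math

log2 = math.log2
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}      # p(x, y)
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}  # q(x, y)

px = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}  # marginals
qx = {x: q[(x, 0)] + q[(x, 1)] for x in (0, 1)}

D_joint = sum(v * log2(v / q[k]) for k, v in p.items() if v > 0)
D_marg = sum(v * log2(v / qx[x]) for x, v in px.items() if v > 0)
# conditional relative entropy D(p(y|x)||q(y|x)), weighted by p(x)
D_cond = sum(v * log2((v / px[x]) / (q[(x, y)] / qx[x]))
             for (x, y), v in p.items() if v > 0)
print(abs(D_joint - (D_marg + D_cond)) < 1e-12)  # True
```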
2.6 Jensen's Inequality and its Consequences
A convex function satisfies $f''(x)>0$; a concave function satisfies $f''(x)<0$.
Here "convex" means bulging downward; by analogy with a parabola, it opens upward.
Weighted Jensen inequality:
Let $f(x)$ be a concave function on an interval $I$, and let positive reals $\lambda_{1},\lambda_{2},\dots,\lambda_{n}$ satisfy $\sum_{i=1}^n\lambda_{i}=1$. Then for any $x_{1},x_{2}\dots x_{n}\in I$,
$$
f\left( \sum_{i=1}^n \lambda_{i}x_{i} \right) \geq \sum_{i=1}^n \lambda_{i}f(x_{i})
$$
Proof that relative entropy is nonnegative:
Let nonnegative $p_{i},q_{i},i=1,2,\dots,n$ satisfy $\sum_{i}p_{i}=\sum_{i}q_{i}=1$; then the relative entropy $D(p\mid\mid q)=\sum_{i}p_{i}\log \frac{p_{i}}{q_{i}}\geq 0$.
Since $f(x)=\log(x)$ is concave, Jensen's inequality gives
$$
\sum_{i}p_{i}\log\left( \frac{q_{i}}{p_{i}} \right)\leq \log \sum_{i}p_{i}\cdot \frac{q_{i}}{p_{i}}=\log \sum_{i}q_{i}=0
$$
Hence $D(p\mid\mid q)\geq 0$.
Corollary (nonnegativity of mutual information): $I(X;Y)=D(p(x,y)\mid\mid p(x)p(y))\geq 0$, with equality iff $X,Y$ are independent.
Corollary: $D(p(y|x)\mid\mid q(y|x))\geq 0$, with equality iff $p(y|x)=q(y|x)$ for all $x,y$ with $p(x)>0$.
Corollary: $I(X;Y|Z)\geq 0$, with equality iff $X,Y$ are conditionally independent given $Z$.
Next we show that the uniform distribution on the sample space $\mathscr{H}$ has maximal entropy.
Theorem 2.6.4: $H(X)\leq \log|\mathscr{H}|$, where $|\mathscr{H}|$ is the number of values $X$ can take, with equality iff $X$ is uniformly distributed on $\mathscr{H}$.
Proof: let $u(x)=\frac{1}{|\mathscr{H}|}$ and let $p(x)$ be the probability mass function of $X$. Then
$$
D(p\mid\mid u)=\sum p(x)\log \frac{p(x)}{u(x)} = \log |\mathscr{H}|-H(X)\geq 0
$$
which proves the claim.
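The identity $D(p\mid\mid u)=\log|\mathscr{H}|-H(X)$ used in the proof is easy to confirm numerically; the sketch below does so for an arbitrary 4-point distribution.

```python
import math

def H(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.7, 0.1, 0.1, 0.1]  # arbitrary distribution on 4 outcomes
u = [0.25] * 4            # uniform distribution
D_pu = sum(pi * math.log2(pi / ui) for pi, ui in zip(p, u) if pi > 0)
print(abs(D_pu - (math.log2(4) - H(p))) < 1e-12)    # identity holds
print(H(p) <= math.log2(4), H(u) == math.log2(4))   # True True: uniform is maximal
```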
Theorem 2.6.5 (conditioning reduces entropy):
$$
H(X|Y)\leq H(X)
$$
with equality iff $X,Y$ are independent.
Proof: $0\leq I(X;Y)=H(X)-H(X|Y)$.
Theorem 2.6.6 (independence bound): $H(X_{1},X_{2},\dots,X_{n})\leq \sum_{i=1}^n H(X_{i})$, with equality iff the $X_{i}$ are mutually independent.
2.7 The Log Sum Inequality and its Applications
Theorem 2.7.1 (Log sum inequality): for nonnegative $a_{1},a_{2},\dots,a_{n}$ and $b_{1},b_{2},\dots,b_{n}$,
$$
\sum_{i=1}^n a_{i}\log \frac{a_{i}}{b_{i}} \geq \left( \sum_{i=1}^n a_{i} \right)\log \frac{ \sum_{i=1}^n a_{i} }{\sum_{i=1}^n b_{i}}
$$
with equality iff $\frac{a_{i}}{b_{i}}=\text{const}$.
Proof: $f(t)= t\log t$ is strictly convex, so
$$
\sum_{i}\alpha_{i}t_{i}\log t_{i} \geq\left( \sum_{i}\alpha_{i}t_{i} \right)\log \sum_{i}\alpha_{i}t_{i}
$$
for $\alpha_{i}\geq 0$ with $\sum_{i}\alpha_{i}=1$. Setting $\alpha_{i}=\frac{b_{i}}{\sum_{j}b_{j}},t_{i}=\frac{a_{i}}{b_{i}}$ completes the proof. The log sum inequality can be used to prove many of the results that follow from convexity, such as the nonnegativity of relative entropy above.
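A quick numeric illustration of the log sum inequality on arbitrary nonnegative vectors:

```python
import math

a = [1.0, 2.0, 3.0]  # arbitrary nonnegative vectors
b = [2.0, 1.0, 4.0]
lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs >= rhs)  # True; equality would require a_i/b_i to be constant
```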
Theorem 2.7.2: $D(p\mid\mid q)$ is convex in the pair $(p,q)$. That is, if $(p_{1},q_{1}),(p_{2},q_{2})$ are two pairs of probability mass functions, then
$$
D(\lambda p_{1}+(1-\lambda)p_{2}\mid\mid\lambda q_{1}+(1-\lambda)q_{2})\leq\lambda D(p_{1}\mid\mid q_{1})+(1-\lambda)D(p_{2}\mid\mid q_{2})
$$
for all $0\leq\lambda\leq 1$.
Proof: apply the log sum inequality to the left-hand side with $a_{1}=\lambda p_{1},a_{2}=(1-\lambda)p_{2},b_{1}=\lambda q_{1},b_{2}=(1-\lambda)q_{2}$:
$$
\begin{aligned}
& \left(\lambda p_1(x)+(1-\lambda) p_2(x)\right) \log \frac{\lambda p_1(x)+(1-\lambda) p_2(x)}{\lambda q_1(x)+(1-\lambda) q_2(x)} \\
& \quad \leq \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)}+(1-\lambda) p_2(x) \log \frac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)}
\end{aligned}
$$
Summing over $x$ completes the proof.
Theorem 2.7.3 (Concavity of entropy): $H(p)$ is a concave function of $p$.
Proof:
$$
H(p)= \log|\mathscr{H}|-D(p\mid\mid u)
$$
where $u$ is the uniform distribution on $|\mathscr{H}|$ outcomes. The concavity of $H$ then follows directly from the convexity of $D$.
Alternatively: take random variables $X_{1},X_{2}$ with $X_{1}\sim p_{1}(x),X_{2}\sim p_{2}(x)$, and let $Z=X_{\theta}$, where the random variable $\theta$ has distribution
$$
\theta = \begin{cases}
1 & \text{with probability } \lambda \\
2 & \text{with probability } 1-\lambda
\end{cases}
$$
Then $Z\sim\lambda p_{1}+(1-\lambda)p_{2}$. Since conditioning reduces entropy,
$$
H(Z)\geq H(Z|\theta)
$$
or equivalently,
$$
H(\lambda p_{1}+(1-\lambda)p_{2})\geq\lambda H(p_{1})+(1-\lambda)H(p_{2})
$$
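The mixture argument can be checked numerically; the sketch below compares the entropy of a mixture to the mixture of entropies for two hand-picked binary distributions.

```python
import math

def H(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p1, p2, lam = [0.9, 0.1], [0.2, 0.8], 0.3  # arbitrary example
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
print(H(mix) >= lam * H(p1) + (1 - lam) * H(p2))  # True: entropy is concave
```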
Theorem 2.7.4: let $(X,Y)\sim p(x,y) = p(x)p(y\mid x)$. The mutual information $I(X;Y)$ is a concave function of $p(x)$ for fixed $p(y\mid x)$, and a convex function of $p(y\mid x)$ for fixed $p(x)$.
Proof:
$I(X;Y)=H(Y)-H(Y|X)=H(Y)-\sum_{x}p(x)H(Y\mid X=x)$. If $p(y\mid x)$ is fixed, then $p(y)$ is a linear function of $p(x)$, so $H(Y)$ is a concave function of $p(x)$, while the second term is linear in $p(x)$. Hence $I$ is concave in $p(x)$.
Now fix $p(x)$ and consider two conditional distributions $p_{1}(y\mid x),p_{2}(y\mid x)$. The corresponding joint distributions are $p_{i}(x,y)=p(x)p_{i}(y\mid x),i=1,2$, with marginals $p(x),p_{1}(y)$ and $p(x),p_{2}(y)$. Consider the mixture of the two conditional distributions:
$$
p_{\lambda}(y\mid x)=\lambda p_{1}(y\mid x)+(1-\lambda)p_{2}(y\mid x)
$$
The joint distribution is then also a mixture, $p_{\lambda}(x,y)=\lambda p_{1}(x,y)+(1-\lambda)p_{2}(x,y)$, and so is the distribution of $Y$: $p_{\lambda}(y)=\lambda p_{1}(y)+(1-\lambda)p_{2}(y)$.
If we let $q_{\lambda}(x,y)=p(x)p_{\lambda}(y)$ be the product of the marginals, then $q_{\lambda}(x,y)=\lambda q_{1}(x,y)+(1-\lambda)q_{2}(x,y)$.
The mutual information is the relative entropy between $p_{\lambda}$ and $q_{\lambda}$, i.e. $I(X;Y)=D(p_{\lambda}\mid\mid q_{\lambda})$, and relative entropy is convex in the pair $(p,q)$; therefore $I(X;Y)$ is convex in $p(y\mid x)$ under these conditions.
2.8 Data Processing Inequality
The data processing inequality can be used to show that no manipulation of the data can improve the inferences that can be drawn from it.
Definition: random variables $X,Y,Z$ are said to form a Markov chain in that order (written $X\to Y\to Z$) if the conditional distribution of $Z$ depends only on $Y$ and the conditional distribution of $Y$ depends only on $X$, i.e. $p(x,y,z)=p(x)p(y\mid x)p(z\mid y)$. Some simple consequences:
$X\to Y\to Z$ iff $X,Z$ are conditionally independent given $Y$, i.e. $p(x,z\mid y)=p(x\mid y)p(z\mid y)$
$X\to Y\to Z$ implies $Z\to Y\to X$
$Z=f(Y)\implies X\to Y\to Z$
We can now prove an important and useful theorem: no processing of $Y$, deterministic or random, can increase the information that $Y$ contains about $X$.
Theorem 2.8.1: If $X\to Y\to Z$, then $I(X;Y)\geq I(X;Z)$.
Proof: by the chain rule,
$$
\begin{align}
I(X;Y,Z)&=I(X;Z)+I(X;Y\mid Z) \\
&=I(X;Y)+I(X;Z\mid Y)
\end{align}
$$
By property 1 above, $I(X;Z|Y)=0$, and since $I(X;Y|Z)\geq 0$, it follows that $I(X;Y)\geq I(X;Z)$.
Corollary: if $Z=g(Y)$, then $I(X;Y)\geq I(X;g(Y))$. It suffices to note that $X\to Y\to g(Y)$.
Corollary 2: $X\to Y\to Z\implies I(X;Y\mid Z)\leq I(X;Y)$.
Note that when $X,Y,Z$ do not form a Markov chain it is possible that $I(X;Y|Z)>I(X;Y)$. For example, let $X,Y$ be the outcomes of two independent fair coin flips and let $Z=X+Y$. Then $I(X;Y)=0$ ($X,Y$ are independent), but
$$
I(X;Y\mid Z)=H(X\mid Z)-H(X\mid Y,Z)=\frac{1}{2}\ \text{bit}
$$
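The coin-flip counterexample can be verified directly; the sketch below builds the joint law of $(X,Y,Z)$ with $Z=X+Y$ and computes both mutual informations from entropies.

```python
import math
from itertools import product

# joint law of (X, Y, Z) for two independent fair bits and Z = X + Y
pxyz = {(x, y, x + y): 0.25 for x, y in product((0, 1), repeat=2)}

def H(joint):
    """Entropy of a dict of probabilities, in bits."""
    return -sum(v * math.log2(v) for v in joint.values() if v > 0)

def marginal(joint, keep):
    """Marginalize a dict keyed by (x, y, z) onto the given coordinates."""
    out = {}
    for k, v in joint.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0.0) + v
    return out

px, py, pz = marginal(pxyz, (0,)), marginal(pxyz, (1,)), marginal(pxyz, (2,))
pxy, pxz, pyz = marginal(pxyz, (0, 1)), marginal(pxyz, (0, 2)), marginal(pxyz, (1, 2))

I_xy = H(px) + H(py) - H(pxy)                     # I(X;Y)
I_xy_given_z = H(pxz) + H(pyz) - H(pz) - H(pxyz)  # I(X;Y|Z)
print(I_xy, I_xy_given_z)                         # 0.0 0.5
```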
2.9 The Second Law of Thermodynamics
One statement of the second law of thermodynamics is that the entropy of an isolated system never decreases. In statistical physics, entropy is usually defined as the logarithm of the number of microstates of the system, which agrees with our definition of entropy if every state is equally likely. But why should the entropy of a system increase? We can explain this by modeling the system as a Markov chain: we assume that, given the current state, the future of the system is independent of its past.
Let $\mu_{n},\mu_{n}'$ be two distributions on the state space of the Markov chain at time $n$, with mass functions $p$ and $q$. Then $p(x_{n},x_{n+1})=p(x_{n})r(x_{n+1}\mid x_{n})$ and $q(x_{n},x_{n+1})=q(x_{n})r(x_{n+1}\mid x_{n})$, where $r$ is the common transition kernel, so
$$
\begin{aligned}
& D\left(p\left(x_n, x_{n+1}\right) \| q\left(x_n, x_{n+1}\right)\right) \\
& \quad=D\left(p\left(x_n\right) \| q\left(x_n\right)\right)+D\left(p\left(x_{n+1} \mid x_n\right) \| q\left(x_{n+1} \mid x_n\right)\right) \\
& \quad=D\left(p\left(x_{n+1}\right) \| q\left(x_{n+1}\right)\right)+D\left(p\left(x_n \mid x_{n+1}\right) \| q\left(x_n \mid x_{n+1}\right)\right)
\end{aligned}
$$
In the first expansion the conditional term vanishes (both chains use the same transition kernel $r$), and relative entropy is nonnegative, so $D(\mu_{n}\mid\mid \mu_{n}')\geq D(\mu_{n+1}\mid\mid \mu_{n+1}')$.
The relative entropy to the stationary distribution decreases monotonically. Taking $\mu_{n}'$ to be a stationary distribution $\mu$, we get $D(\mu_n\mid\mid \mu)\geq D(\mu_{n+1}\mid\mid \mu)$: as the chain evolves, its state distribution keeps approaching the stationary one.
If the stationary distribution is uniform, then the entropy increases. In general, a decrease in relative entropy does not imply an increase in entropy; the typical counterexample is a Markov chain whose stationary distribution is non-uniform. But if the stationary distribution is uniform, then
$$
D(\mu_{n}\mid\mid \mu)=\log|\mathscr{H}|-H(\mu_{n})=\log|\mathscr{H}|-H(X_{n})
$$
and the monotone decrease of the relative entropy means the entropy increases monotonically.
Definition (doubly stochastic transition matrix): the probability transition matrix $[P_{ij}]$, $P_{ij} = \Pr\{X_{n+1}=j\mid X_{n}=i\}$, is doubly stochastic if
$$
\begin{align}
\sum_{i}P_{i j} = 1, & \quad j=1,2,\dots \\
\sum_{j}P_{i j }= 1, & \quad i=1,2,\dots
\end{align}
$$
A stochastic process is stationary if its joint distributions are unaffected by a time shift. Remark: The uniform distribution is a stationary distribution of $P$ if and only if the probability transition matrix is doubly stochastic.
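A small simulation (an assumed 3-state chain, not from the notes) illustrating both facts: the transition matrix below is doubly stochastic, so the uniform law is stationary, $D(\mu_{n}\mid\mid u)$ decreases monotonically, and hence $H(X_{n})$ increases toward $\log 3$.

```python
import math

P = [[0.5, 0.3, 0.2],   # each row AND each column sums to 1 (doubly stochastic)
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
mu = [1.0, 0.0, 0.0]    # start far from uniform
u = [1 / 3] * 3         # uniform stationary distribution

def step(mu, P):
    """One step of the chain: mu_{n+1}[j] = sum_i mu_n[i] * P[i][j]."""
    return [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]

def D(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

prev = D(mu, u)
for n in range(10):
    mu = step(mu, P)
    cur = D(mu, u)
    assert cur <= prev + 1e-12  # relative entropy to the stationary law falls
    prev = cur
print(prev)  # close to 0, i.e. H(X_n) approaches log 3
```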
The conditional entropy $H(X_{n}|X_{1})$ increases with $n$ for a stationary Markov process. If a Markov process is stationary, then $H(X_{n})$ is constant; in particular the entropy is nonincreasing. However, we will prove that $H(X_{n}\mid X_{1})$ increases with $n$: the conditional uncertainty about the future increases.
Proof:
$$
\begin{aligned}
H\left(X_n \mid X_1\right) & \geq H\left(X_n \mid X_1, X_2\right) & & \text{(conditioning reduces entropy)} \\
& =H\left(X_n \mid X_2\right) & & \text{(by Markovity)} \\
& =H\left(X_{n-1} \mid X_1\right) & & \text{(by stationarity)}
\end{aligned}
$$
Shuffles increase entropy. If $T$ is a shuffle of a deck of cards and $X$ is the initial random position of the cards in the deck, and if the choice of the shuffle $T$ is independent of $X$, then $H(TX)\geq H(X)$, where $TX$ is the permutation of the deck induced by the shuffle $T$ on the initial permutation $X$.
2.10 Sufficient Statistics
In this section we see the power of the data processing inequality in clarifying an important problem in statistics. Suppose we have a family of probability mass functions $\{f_{\theta}(x)\}$ indexed by $\theta$, and let $X$ be a sample from one of them. Let $T(X)$ be a statistic of $X$, such as the sample mean or sample variance. Then $\theta\to X\to T(X)$, and by the data processing inequality,
$$
I(\theta ;T(X))\leq I(\theta;X)
$$
Definition: a function $T(X)$ is called a sufficient statistic relative to the family $\{f_{\theta}(X)\}$ if $X$ is independent of $\theta$ given $T(X)$; in other words, $\theta\to T(X)\to X$ forms a Markov chain.
Examples:
Let $X_{1},X_{2},\dots,X_{n}$ be an i.i.d. sequence of coin flips with unknown parameter $\theta=\Pr(X_{i}=1)$. Given $n$, the number of 1's is a sufficient statistic for $\theta$. We can show that, given $T$, all sequences with the same number of 1's are equally likely, independently of $\theta$:
$$
\Pr\left\{ (X_{1},X_{2},\dots,X_{n})=(x_{1},x_{2},\dots x_{n})\mid \sum_{i=1}^n X_{i}=k \right\}=\begin{cases}
\frac{1}{\binom{n}{k}} & \text{if } \sum_{i=1}^n x_{i}=k \\
0 & \text{otherwise}
\end{cases}
$$
If $X$ is normally distributed with mean $\theta$ and variance 1, i.e.
$$
f_{\theta}(x)=\frac{1}{\sqrt{ 2\pi }}\exp\left(-\frac{(x-\theta)^2}{2}\right)=\mathscr{N}(\theta,1),
$$
and $X_{1},X_{2},\dots,X_{n}$ are independent with this distribution, then the sample mean $\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^nX_{i}$ is a sufficient statistic for $\theta$.
If $f_{\theta}\sim U(\theta,\theta+1)$, then a sufficient statistic for $\theta$ is $T(X_{1},X_{2},\dots,X_{n})=(\max \{X_{1},X_{2},\dots X_{n}\},\min\{X_{1},X_{2},\dots,X_{n}\})$.
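The first example is easy to verify by brute force; the sketch below (with assumed values $n=4$, $k=2$) enumerates all sequences with $k$ ones and confirms that their conditional law is uniform, whatever $\theta$ is.

```python
from itertools import product
from math import comb

def cond_law(theta, n=4, k=2):
    """Conditional law of an i.i.d. Bernoulli(theta) sequence given sum = k."""
    probs = {}
    for seq in product((0, 1), repeat=n):
        if sum(seq) == k:
            probs[seq] = theta ** k * (1 - theta) ** (n - k)
    total = sum(probs.values())
    return {s: p / total for s, p in probs.items()}

for theta in (0.2, 0.5, 0.9):
    law = cond_law(theta)
    # every admissible sequence has probability 1/C(4,2), whatever theta is
    assert all(abs(p - 1 / comb(4, 2)) < 1e-12 for p in law.values())
print("conditional law is uniform, independent of theta")
```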
A minimal sufficient statistic is a sufficient statistic that is a function of all other sufficient statistics. Definition: a statistic $T(X)$ is a minimal sufficient statistic relative to $\{f_{\theta}(X)\}$ if it is a function of every other sufficient statistic $U$. In the language of the data processing inequality, this means $\theta\to T(X)\to U(X)\to X$.
$T(X)$ maximally compresses the information about $\theta$ in the sample; other sufficient statistics may contain additional irrelevant information. For example, for a normal distribution with mean $\theta$, the pair consisting of the mean of the odd-indexed samples and the mean of the even-indexed samples is a sufficient statistic, but not a minimal one.
2.11 Fano's Inequality
Suppose we know a random variable $Y$ and wish to guess the value of a correlated random variable $X$. Fano's inequality relates the probability of guessing correctly to the conditional entropy $H(X\mid Y)$. As shown in the problems of this chapter, $H(X\mid Y)=0 \Leftrightarrow X=f(Y)$; hence we can guess $X$ from $Y$ with zero probability of error iff $H(X\mid Y)=0$. Extending this argument, if the conditional entropy is small but nonzero, we expect to guess $X$ correctly with high probability. Suppose $X\sim p(x)$ and $Y$ is related to $X$ through the conditional distribution $p(y\mid x)$. From $Y$ we compute $g(Y)=\hat{X}$, an estimate of $X$, and we wish to bound the probability that $\hat{X}\neq X$. Note that $X\to Y\to \hat{X}$. Define $P_{e}=\Pr\{\hat{X}\neq X\}$.
Theorem (Fano's Inequality):
$$
H(P_{e})+P_{e}\log(|\mathscr{H}|-1)\geq H(X\mid Y)
$$
This can be weakened to $1+P_{e}\log|\mathscr{H}|\geq H(X|Y)$.
Proof: define
$$
E=\begin{cases}
1, & \hat{X}=X \\
0, & \hat{X}\neq X
\end{cases}
$$
Expanding $H(E,X\mid Y)$ by the chain rule in two ways:
$$
\begin{align}
H(E,X\mid Y) & =H(X\mid Y)+H(E\mid X,Y)=H(X\mid Y) \\
H(E,X\mid Y) & = H(E\mid Y)+H(X\mid E,Y) \leq H(P_{e})+P_{e}\log(|\mathscr{H}|-1)
\end{align}
$$
Here $H(E\mid X,Y)=0$ since $E$ is a function of $X$ and $\hat{X}=g(Y)$; $H(E\mid Y)\leq H(E)=H(P_{e})$; and $H(X\mid E,Y)\leq P_{e}\log(|\mathscr{H}|-1)$, since given $E=1$ we have $X=g(Y)$, while given $E=0$ (which has probability $P_{e}$) $X$ ranges over at most $|\mathscr{H}|-1$ values. Combining the two expansions proves the theorem.
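A numeric sanity check of Fano's inequality on a made-up 3×3 joint distribution, using the MAP estimator $g(y)=\arg\max_{x} p(x\mid y)$:

```python
import math

# assumed joint distribution p(x, y) on a 3x3 alphabet (entries sum to 1)
pxy = [[0.20, 0.05, 0.05],
       [0.05, 0.20, 0.05],
       [0.05, 0.05, 0.30]]
py = [sum(pxy[x][y] for x in range(3)) for y in range(3)]

# H(X|Y) = -sum_{x,y} p(x,y) log p(x|y)
H_x_given_y = -sum(pxy[x][y] * math.log2(pxy[x][y] / py[y])
                   for x in range(3) for y in range(3))
# MAP guess g(y) = argmax_x p(x, y) and its error probability
Pe = 1 - sum(max(pxy[x][y] for x in range(3)) for y in range(3))
H_Pe = -Pe * math.log2(Pe) - (1 - Pe) * math.log2(1 - Pe)
print(H_Pe + Pe * math.log2(3 - 1) >= H_x_given_y)  # True: Fano's bound holds
```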