Created: 2026-03-06 07:53:04
Updated: 2026-03-06 07:53:04
For an unconstrained discrete random variable $X$ taking values in the positive integers, consider the following two questions:
(No upper bound) Does there exist a sequence $X_n$ whose entropies diverge: $\lim_{n \to \infty} H(X_n) = \infty$?
Does there exist a single distribution with $H(X) = \infty$?
Both answers are affirmative. The first is obvious (for instance, the uniform distribution on $\{1,\dots,n\}$ has entropy $\ln n \to \infty$). For the second, we show below that $p(n) \sim \frac{1}{n\ln^2 n}$ (for $n \ge 2$, suitably normalized) is a probability distribution satisfying the condition.
$$\begin{align}
\sum_{n=2}^\infty p(n) & \sim \sum_{n=2}^\infty \frac{1}{n\ln^2 n} \sim \int_{2}^\infty \frac{1}{x\ln^2 x}\, dx = \text{Const} \\
\sum_{n=2}^\infty p(n) \ln p(n) & \sim \sum_{n} \frac{1}{n\ln^2 n} \ln \frac{1}{n\ln^2 n} \\
& = \sum_{n} \frac{-\ln n - 2\ln \ln n}{n\ln^2 n} \\
& \sim \sum_{n} -\frac{1}{n\ln n} \\
& \sim \int_{2}^\infty -\frac{1}{x\ln x} \, dx \to -\infty
\end{align}$$
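A quick numerical sanity check (my addition, not part of the original notes): truncating $p(n) \propto \frac{1}{n\ln^2 n}$ to $2 \le n \le N$ and normalizing, the entropy of the truncated distribution keeps increasing with $N$, consistent with $H(X) = \infty$.

```python
import math

def truncated_entropy(N):
    """Entropy (nats) of p(n) proportional to 1/(n ln^2 n), truncated to 2 <= n <= N."""
    w = [1.0 / (n * math.log(n) ** 2) for n in range(2, N + 1)]
    Z = sum(w)  # normalizing constant of the truncated distribution
    return -sum((x / Z) * math.log(x / Z) for x in w)

# The entropy keeps growing (roughly like ln ln N) as the cutoff increases,
# consistent with the full distribution having infinite entropy.
for N in (10**2, 10**4, 10**6):
    print(N, truncated_entropy(N))
```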
Here $\sim$ denotes an estimate: the two sides may differ by a constant factor, or by a bounded additive term; in either case the convergence or divergence is unaffected. We also used the Cauchy integral test:
[!theorem] Cauchy integral test
For a positive, decreasing function $f(x)$ defined on $[1,\infty)$, the series $\sum_{n=1}^\infty f(n)$ and the integral $\int_1^\infty f(x)\, dx$ have the same convergence behavior.
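As a concrete illustration of the test (a sketch I added, using the standard examples $1/x^2$ and $1/x$):

```python
import math

# Integral-test illustration: f(x) = 1/x^2 has a convergent integral on [1, inf),
# so sum 1/n^2 converges (to pi^2/6); f(x) = 1/x has a divergent integral,
# and the harmonic partial sums grow like ln N without bound.
N = 100000
s2 = sum(1 / n**2 for n in range(1, N + 1))
s1 = sum(1 / n for n in range(1, N + 1))
print(s2, math.pi**2 / 6)  # close: the series converges
print(s1, math.log(N))     # partial sum tracks ln N (offset ~0.577, Euler's constant)
```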
Now add a constraint: the random variable has mean $\mu$. What is the maximum-entropy distribution in this case? A simple variational argument gives:
$$p_i \sim C e^{-\lambda i}$$
For any other distribution $g_i$ with the same mean, the relative entropy argument again shows
$$H(g \,\|\, p) = H(p) - H(g) \geq 0$$
so the exponential-form (geometric) distribution is the maximum-entropy distribution under this constraint.
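To illustrate numerically (my sketch; the truncation size and target mean are arbitrary choices): fix the mean, fit $\lambda$ by bisection, and compare the entropy of $p_i \propto e^{-\lambda i}$ with a perturbed distribution of the same mean; the perturbation can only lower the entropy.

```python
import math

N = 200  # truncated support {1, ..., N}, large enough to approximate the full distribution

def geometric(lam):
    w = [math.exp(-lam * i) for i in range(1, N + 1)]
    Z = sum(w)
    return [x / Z for x in w]

def mean(p):
    return sum(i * pi for i, pi in enumerate(p, start=1))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Solve for lambda giving mean 3 by bisection (mean is decreasing in lambda).
lo, hi = 1e-6, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(geometric(mid)) > 3.0:
        lo = mid
    else:
        hi = mid
p = geometric(lo)

# Perturb in a direction preserving both normalization and the mean:
# d = (eps, -2*eps, eps, 0, ...) has sum 0 and sum(i * d_i) = 0.
eps = 1e-3
q = p[:]
q[0] += eps; q[1] -= 2 * eps; q[2] += eps

print(entropy(p), entropy(q))  # the geometric-form distribution has higher entropy
```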
Lecture 9: Fisher Information and Cramér-Rao Inequality
Parameter estimation problems are common in probability and statistics. There are several criteria for judging an estimator, such as unbiasedness.

Unbiased Parameter Estimation. Consider an i.i.d. sample $X = (X_1,\dots,X_n)$ with joint density $f(X;\theta) = \prod_{i=1}^n f(X_i;\theta)$. We want to estimate $\theta$ from $X$ via an estimator $\phi: X \to \theta$. We would like the estimate to be unbiased, i.e. its expectation equals the true parameter: $E[\phi(X)] = \theta$. At the same time, we would like the variance of the estimate to be as small as possible.
Fisher Information
[!definition] Score function
For a sample $X = (X_1,\dots,X_n)$, let $f(\cdot\,;\theta)$ be the probability density function. The score function is defined as $$S(X;\theta) = \frac{\partial}{\partial\theta} \ln f(X;\theta)$$
It is easy to show that $E[S(X;\theta)] = 0$:
\fcolorbox{yellow}{}{Proof}
$$\begin{align}
\mathbb{E} S & = \int f(x;\theta)\, S(x;\theta)\, dx \\
& = \int \frac{\partial}{\partial\theta} f(x;\theta)\, dx \\
& = \frac{\partial}{\partial\theta} \int f(x;\theta)\, dx \\
& = 0
\end{align}$$
[!definition] Fisher information
The Fisher information of a parameter $\theta$ with a sample $X$ is defined as $$I(\theta) = E[S^2(X;\theta)] = \mathrm{Var}[S(X;\theta)] = \int f(x;\theta) \left( \frac{\partial}{\partial\theta} \ln f(x;\theta) \right)^2 dx$$
It is also easy to show that $I(\theta) = -E\left[ \frac{\partial^2 \ln f(X;\theta)}{\partial\theta^2} \right]$.
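As a concrete check (my example, not in the original notes): for a single Bernoulli($\theta$) observation, $\ln f(x;\theta) = x\ln\theta + (1-x)\ln(1-\theta)$, so the score is $S = x/\theta - (1-x)/(1-\theta)$, its mean is zero, and both formulas for $I(\theta)$ give $1/(\theta(1-\theta))$.

```python
theta = 0.3

# Score S(x; theta) = d/dtheta ln f(x; theta) for Bernoulli(theta), x in {0, 1}.
def score(x):
    return x / theta - (1 - x) / (1 - theta)

# Exact expectation over x in {0, 1} with weights f(x; theta).
def expect(g):
    return (1 - theta) * g(0) + theta * g(1)

E_S = expect(score)                      # should be 0
I_var = expect(lambda x: score(x) ** 2)  # I(theta) = E[S^2]
# Second derivative of ln f is -x/theta^2 - (1-x)/(1-theta)^2.
I_hess = -expect(lambda x: -x / theta**2 - (1 - x) / (1 - theta)**2)

print(E_S, I_var, I_hess, 1 / (theta * (1 - theta)))
```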
[!theorem] Cramér-Rao Inequality
For any unbiased estimator $\phi: X \to \mathbb{R}$ of $\theta$, we have $$\mathrm{Var}(\phi(X)) \geq \frac{1}{I(\theta)}$$
[!Proof]
The claim is equivalent to $\mathrm{Var}(\phi(X))\,\mathrm{Var}(S(X;\theta)) \geq 1$, which follows from the Cauchy-Schwarz inequality. Define the inner product $(u,v) = \sum_i u(x_i)v(x_i)f(x_i)$, so that $(u,u)(v,v) \geq |(u,v)|^2$. Then, using $E[S]=0$, unbiasedness, and normalization in the last step,
$$\begin{align}
\mathrm{Var}(\phi)\,\mathrm{Var}(S) & = \sum_i S^2(x_i;\theta)\, f(x_i) \sum_i (\phi(x_i)-\theta)^2 f(x_i) \\
& \geq \Big| \sum_i S(x_i;\theta)\,(\phi(x_i)-\theta)\, f(x_i) \Big|^2 \\
& = \Big| \sum_i \frac{\partial f(x_i;\theta)}{\partial\theta} (\phi(x_i)-\theta) \Big|^2 \\
& = \left| \partial\theta/\partial\theta - \theta\, \partial 1/\partial\theta \right|^2 = 1
\end{align}$$
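A Monte Carlo sanity check (my sketch with arbitrarily chosen parameters): for $n$ i.i.d. $N(\theta, \sigma^2)$ observations with known $\sigma$, $I(\theta) = n/\sigma^2$, and the sample mean is an unbiased estimator whose variance $\sigma^2/n$ attains the Cramér-Rao bound.

```python
import random, statistics

random.seed(0)
theta, sigma, n = 2.0, 1.5, 20
trials = 20000

# Empirical variance of the sample mean across many simulated samples.
means = [statistics.fmean(random.gauss(theta, sigma) for _ in range(n))
         for _ in range(trials)]
var_hat = statistics.variance(means)

cr_bound = sigma**2 / n  # = 1 / I(theta) with I(theta) = n / sigma^2
print(var_hat, cr_bound)  # close: the sample mean attains the bound
```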
Fisher information for multiple parameters. Define the score function as a random vector:
$$\vec{S}(X;\vec{\theta}) = \nabla_{\vec{\theta}} \ln f(X;\vec{\theta})$$
It is easy to show that $E[\vec{S}(X;\vec{\theta})] = \vec{0}$. Define the Fisher information as the covariance matrix of $\vec{S}$:
$$I(\vec{\theta}) = E[\vec{S}(X;\vec{\theta})\, \vec{S}(X;\vec{\theta})^T] = \mathrm{Cov}(\vec{S}(X;\vec{\theta}))$$
In the multivariate case, the second-derivative form of the Fisher information is replaced by the Hessian matrix of the log-density:
$$I(\vec{\theta}) = -E\left[ \nabla^2 \ln f(X;\vec{\theta}) \right]$$
The covariance matrix $\mathrm{Cov}(\phi(X))$ of an estimator is positive semidefinite, and the Cramér-Rao inequality becomes:
$$\mathrm{Cov}(\phi(X)) \geq I(\vec{\theta})^{-1}$$
Here $\geq$ is not the usual order on real numbers but the partial order induced by positive semidefiniteness: for symmetric matrices $A, B$, $A \geq B \iff A - B \geq 0$, i.e. $A - B$ is positive semidefinite.
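The matrix ordering can be checked numerically (my sketch; the Fisher matrix below is the standard result for $n$ i.i.d. $N(\mu,\sigma^2)$ samples with parameters $(\mu,\sigma)$, while the estimator covariance is hypothetical):

```python
# Check A >= B in the PSD order for symmetric 2x2 matrices:
# A - B is PSD iff trace(A - B) >= 0 and det(A - B) >= 0.
def psd_geq(A, B):
    d00, d01, d11 = A[0][0] - B[0][0], A[0][1] - B[0][1], A[1][1] - B[1][1]
    return (d00 + d11) >= 0 and (d00 * d11 - d01 * d01) >= 0

# Fisher information for n i.i.d. N(mu, sigma^2) samples, parameters (mu, sigma):
n, sigma = 10, 2.0
I_inv = [[sigma**2 / n, 0.0], [0.0, sigma**2 / (2 * n)]]

# Hypothetical unbiased estimator with some extra variance in each component:
Cov = [[sigma**2 / n + 0.05, 0.01], [0.01, sigma**2 / (2 * n) + 0.05]]

print(psd_geq(Cov, I_inv))  # Cov >= I^{-1} in the PSD order
```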
4.18 Channel Coding
The goal of source coding is to remove redundancy from the source symbols. During transmission, the channel introduces noise, so errors occur with some probability. The goal of channel coding is to add redundancy in a highly structured way so that the distance between codewords is as large as possible. If the channel may corrupt up to $t$ bits, then the (Hamming) distance between any two codewords must be at least $2t+1$.
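A minimal illustration (my sketch, not from the notes): the 3-fold repetition code has minimum distance $3 = 2\cdot 1 + 1$, so it corrects any single bit error per block by majority vote.

```python
def encode(bits):
    """3-fold repetition code: minimum distance 3 = 2*1 + 1, corrects t = 1 error per block."""
    return [b for b in bits for _ in range(3)]

def decode(coded):
    # Majority vote over each block of 3; survives any single flipped bit per block.
    return [1 if sum(coded[i:i+3]) >= 2 else 0 for i in range(0, len(coded), 3)]

msg = [1, 0, 1, 1]
tx = encode(msg)
tx[4] ^= 1  # channel flips one bit in the second block
print(decode(tx))  # the message is recovered despite the error
```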