Created: 2026-03-06 07:53:04
Updated: 2026-03-06 07:53:04

2.1 Entropy

Coding: a code must be uniquely decodable, with no ambiguity.
Huffman coding
Prefix-free codes: a set of code words $c_{1},c_{2},\dots,c_{n}$ such that for every pair $c_{i},c_{j}$ ($i\neq j$), neither of $c_{i},c_{j}$ is a prefix of the other. Prefix codes are uniquely decodable.
Kraft inequality: let the set of source symbols be

$$
S=\{s_{1},s_{2},\dots,s_{n}\}
$$

^ee60fa

If these symbols are encoded over an alphabet of size $r$ into a uniquely decodable code with codeword lengths $\ell_{1},\ell_{2},\dots,\ell_{n}$, then

$$
\sum_{i=1}^n r^{-\ell_{i}}\leq 1
$$
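As a quick sanity check of the inequality, here is a minimal sketch in Python (the function name `kraft_sum` is my own):

```python
def kraft_sum(lengths, r=2):
    """Sum of r^(-l) over the codeword lengths; <= 1 for any uniquely decodable code."""
    return sum(r ** -l for l in lengths)

# Binary prefix code {0, 10, 110, 111}: lengths 1, 2, 3, 3
print(kraft_sum([1, 2, 3, 3]))  # 1.0 -> holds with equality (a complete code)
# Lengths 1, 1, 2 violate the bound: no uniquely decodable binary code exists
print(kraft_sum([1, 1, 2]))     # 1.25 > 1
```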

Definition: The entropy $H(X)$ of a discrete random variable $X$ is defined by

$$
H(X) = -\sum_{x\in \mathscr{X}}p(x)\log p(x)= E_{p}\log \frac{1}{p(X)}
$$

where $E_{p}$ denotes the expectation of the given function when $X$ has distribution $p$:

$$
E_{p}g(X)=\sum_{x \in \mathscr{X}}g(x)p(x)
$$

Lemma:

  1. $H(X)\geq 0$
  2. $H_{b}(X)=(\log_{b}a)H_{a}(X)$
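A minimal Python transcription of the definition (the helper name is my own); the last line illustrates the base-change lemma with $a=2$, $b=4$:

```python
import math

def entropy(p, base=2):
    """H(X) = -sum p(x) log p(x), with the convention 0 log 0 = 0."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

p = [0.5, 0.25, 0.25]
print(entropy(p))                                       # 1.5 bits
print(entropy(p, base=4), math.log(2, 4) * entropy(p))  # both 0.75
```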

2.2 Joint Entropy, Conditional Entropy

Definition: Joint Entropy

$$
H(X,Y)=-\sum_{x\in\mathscr{X}}\sum_{y\in\mathscr{Y}}p(x,y)\log p(x,y)=-E\log p(X,Y)
$$

Conditional Entropy:
If $(X,Y)\sim p(x,y)$, then the conditional entropy $H(Y|X)$ is defined as:

$$
\begin{align}
H(Y|X) & =\sum_{x\in\mathscr{X}}p(x)H(Y|X=x) \\
& =-\sum_{x\in \mathscr{X}}p(x)\sum_{y\in \mathscr{Y}}p(y|x)\log p(y|x) \\
& =-\sum_{x\in \mathscr{X}}\sum_{y\in \mathscr{Y}}p(x,y)\log p(y|x) \\
& = -E_{p(x,y)}\log p(Y|X)
\end{align}
$$

Theorem (Chain rule):

$$
H(X,Y) = H(X)+H(Y|X)
$$

Corollary:

$$
H(X,Y|Z)=H(X|Z)+H(Y|X,Z)
$$
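A numeric check of the chain rule on an arbitrary joint pmf (the `entropy` helper from above is redefined compactly so the snippet runs standalone):

```python
import math

def entropy(p):
    """H in bits, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Arbitrary joint distribution p(x, y) on {0,1} x {0,1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}

# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x)
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())
print(entropy(joint.values()), entropy(px.values()) + H_Y_given_X)  # both ~1.846
```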

2.3 Relative Entropy and Mutual Information

Relative Entropy definition:

$$
\begin{align}
D(p\mid \mid q) & = \sum_{x\in\mathscr{X}}p(x)\log \frac{p(x)}{q(x)} \\
& =E_{p}\log \frac{p(X)}{q(X)}
\end{align}
$$

We use the conventions $0\log \frac{0}{q}=0$ and $p\log \frac{p}{0}=\infty$.

Properties:

  1. Nonnegativity: $D(p\mid\mid q)\geq 0$, with $D(p\mid\mid q)=0\Leftrightarrow p=q$
  2. Asymmetric, and does not satisfy the triangle inequality
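A direct transcription of the definition and its conventions (the function name is my own); swapping the arguments exhibits the asymmetry:

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits, with 0 log(0/q) = 0 and p log(p/0) = infinity."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # ~0.737 vs ~0.531: not symmetric
```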

Mutual information: the relative entropy between the joint distribution and the product of the marginal distributions

$$
I(X;Y)=\sum_{x\in \mathscr{X}}\sum_{y\in \mathscr{Y}}p(x,y)\log \frac{p(x,y)}{p(x)p(y)}
$$

Example: when $X,Y$ are independent, $p(x,y)=p(x)p(y)$, so $I(X;Y)=0$.

Relationship between entropy and mutual information:
it is easy to verify that

$$
I(X;Y)=H(Y)-H(Y|X)
$$

Combining this with $H(X,Y) = H(X)+H(Y|X)$ from the previous section gives

$$
I(X;Y)=H(X)+H(Y)-H(X,Y)
$$

Y=XY=X时,I(X;X)=H(X)+H(X)H(XX)=H(X)I(X;X)=H(X)+H(X)-H(X|X)=H(X)

Relative entropy is not a true distance metric. First, it is asymmetric: $D(p_{X_{1}}\mid\mid p_{X_{2}})\neq D(p_{X_{2}}\mid\mid p_{X_{1}})$. Moreover, when $\exists x \ \text{such that}\ p_{X_{1}}(x)\neq 0,p_{X_{2}}(x)=0$, the relative entropy diverges. Nevertheless, we often treat it as a reflection of the distance between probability distributions, and it is closely related to genuine distance metrics on distributions. For example,

$$
D(p_{X}\mid\mid p_{Y})\geq \frac{1}{2\ln 2} \mid\mid p_{X}-p_{Y}\mid\mid_{1}^2
$$

where $\mid\mid p_{X}-p_{Y}\mid\mid_{1}=\sum_{a}\mid p_{X}(a)-p_{Y}(a)\mid$
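A numeric spot-check of this bound (Pinsker's inequality), using the same hypothetical `kl` helper:

```python
import math

kl = lambda p, q: sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
l1 = sum(abs(a - b) for a, b in zip(p, q))    # ||p - q||_1 = 0.8
print(kl(p, q), l1 ** 2 / (2 * math.log(2)))  # ~0.737 >= ~0.462
```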

2.5 Chain Rules for Entropy, Relative Entropy and Mutual Information

  1. Chain rule for entropy:
    Let $X_{1},X_{2},\dots,X_{n}$ have joint distribution $p(x_{1},\dots,x_{n})$. Then

    $$
    H(X_{1},X_{2},\dots,X_{n})=\sum_{i=1}^{n}H(X_{i}|X_{i-1},\dots,X_{1})
    $$

    Conditional mutual information:

    $$
    I(X;Y|Z)=H(X|Z)-H(X|Y,Z)
    $$

  2. Chain rule for mutual information:

    $$
    I(X_{1},X_{2},\dots,X_{n};Y)=\sum_{i=1}^nI(X_{i};Y|X_{i-1},X_{i-2},\dots,X_{1})
    $$

    Conditional Relative Entropy:

    $$
    D(p(y|x)\mid\mid q(y|x))=\sum_{x}p(x)\sum_{y}p(y|x)\log \frac{p(y|x)}{q(y|x)}
    $$

  3. Chain rule for relative entropy:

$$
D(p(x,y)\mid\mid q(x,y))=D(p(x)\mid\mid q(x))+D(p(y|x)\mid\mid q(y|x))
$$

Proof:

$$
\begin{align}
D(p(x,y)\mid\mid q(x,y)) & =\sum_{x}\sum_{y}p(x,y)\log \frac{p(x,y)}{q(x,y)} \\
& =\sum_{x}\sum_{y}p(x,y) \log \frac{p(x)p(y\mid x)}{q(x)q(y\mid x)} \\
& = \sum_{x}\sum_{y}p(x,y)\log \frac{p(x)}{q(x)} + \sum_{x}\sum_{y}p(x,y)\log \frac{p(y\mid x)}{q(y\mid x)} \\
& = D(p(x)\mid \mid q(x))+ D(p(y|x)\mid\mid q(y|x))
\end{align}
$$
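A numeric check of this chain rule on two arbitrary joint pmfs (compact redefinition of the hypothetical `kl` helper):

```python
import math

kl = lambda p, q: sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

p_joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
q_joint = {k: 0.25 for k in p_joint}  # q(x, y) uniform
keys = list(p_joint)

p_x = {x: p_joint[(x, 0)] + p_joint[(x, 1)] for x in (0, 1)}
q_x = {x: q_joint[(x, 0)] + q_joint[(x, 1)] for x in (0, 1)}

D_joint = kl([p_joint[k] for k in keys], [q_joint[k] for k in keys])
D_marg = kl([p_x[x] for x in (0, 1)], [q_x[x] for x in (0, 1)])
# D(p(y|x)||q(y|x)) = sum_x p(x) * D(p(.|x) || q(.|x))
D_cond = sum(p_x[x] * kl([p_joint[(x, y)] / p_x[x] for y in (0, 1)],
                         [q_joint[(x, y)] / q_x[x] for y in (0, 1)])
             for x in (0, 1))
print(D_joint, D_marg + D_cond)  # both ~0.154
```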

2.6 Jensen's Inequality and its Consequences

Convex function: $f''(x)>0$
Concave function: $f''(x)<0$

Here convex means bulging downward; in the parabola analogy, the curve opens upward.
Weighted Jensen's inequality: let $f(x)$ be a concave function on an interval $I$, and let positive reals $\lambda_{1},\lambda_{2},\dots,\lambda_{n}$ satisfy $\sum_{i=1}^n\lambda_{i}=1$. Then for any $x_{1},x_{2}\dots x_{n}\in I$,

$$
f\left( \sum_{i=1}^n \lambda_{i}x_{i} \right) \geq \sum_{i=1}^n \lambda_{i}f(x_{i})
$$

Proof that relative entropy is nonnegative: if the nonnegative numbers $p_{i},q_{i},i=1,2,\dots,n$ satisfy $\sum_{i}p_{i}=\sum_{i}q_{i}=1$, then the relative entropy $D(p\mid\mid q)=\sum_{i}p_{i}\log \frac{p_{i}}{q_{i}}\geq 0$.
Since $f(x)=\log(x)$ is concave,

$$
\sum_{i}p_{i}\log\left( \frac{q_{i}}{p_{i}} \right)\leq \log \sum_{i}p_{i}\cdot \frac{q_{i}}{p_{i}}=\log \sum_{i}q_{i}=0
$$

and therefore $D(p\mid\mid q)\geq 0$.
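A numeric check of the weighted Jensen inequality for the concave function $f(x)=\ln x$ (arbitrary weights and points of my own choosing):

```python
import math

lam = [0.2, 0.3, 0.5]          # weights summing to 1
x = [1.0, 4.0, 9.0]
mean = sum(l * xi for l, xi in zip(lam, x))
print(math.log(mean))                                   # ~1.775
print(sum(l * math.log(xi) for l, xi in zip(lam, x)))   # ~1.514 <= left side
```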

Corollary (nonnegativity of mutual information): $I(X;Y)=D(p(x,y)\mid\mid p(x)p(y))\geq 0$, with equality if and only if $X,Y$ are independent.
Corollary: $D(p(y|x)\mid\mid q(y|x))\geq 0$, with equality if and only if $p(y|x)=q(y|x)$ for all $x,y$ with $p(x)>0$.
Corollary: $I(X;Y|Z)\geq 0$, with equality if and only if $X$ and $Y$ are conditionally independent given $Z$.

We now show that the uniform distribution over the alphabet $\mathscr{X}$ has the maximum entropy.
Theorem 2.6.4: $H(X)\leq \log|\mathscr{X}|$, where $|\mathscr{X}|$ denotes the number of values $X$ can take. Equality holds if and only if $X$ is uniformly distributed over $\mathscr{X}$.
Proof: let $u(x)=\frac{1}{|\mathscr{X}|}$ and let $p(x)$ be the probability mass function of $X$. Then

$$
D(p\mid\mid u)=\sum p(x)\log \frac{p(x)}{u(x)} = \log |\mathscr{X}|-H(X)\geq 0
$$

This completes the proof.
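For example, on a four-letter alphabet ($\log_2 4 = 2$ bits), with the hypothetical entropy helper:

```python
import math

entropy = lambda p: -sum(x * math.log2(x) for x in p if x > 0)

print(entropy([0.25] * 4))            # 2.0 bits: the uniform maximum
print(entropy([0.7, 0.1, 0.1, 0.1]))  # ~1.357 bits < 2
```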

Theorem 2.6.5 (conditioning reduces entropy):

$$
H(X|Y)\leq H(X)
$$

with equality if and only if $X$ and $Y$ are independent.
Proof: $0\leq I(X;Y)=H(X)-H(X|Y)$.

Theorem 2.6.6: $H(X_{1},X_{2},\dots,X_{n})\leq \sum_{i=1}^n H(X_{i})$, with equality if and only if the $X_{i}$ are mutually independent.

2.7 The Log Sum Inequality and its Applications

Theorem 2.7.1 (Log sum inequality): for nonnegative numbers $a_{1},a_{2},\dots,a_{n}$ and $b_{1},b_{2},\dots,b_{n}$,

$$
\sum_{i=1}^n a_{i}\log \frac{a_{i}}{b_{i}} \geq \left( \sum_{i=1}^n a_{i} \right)\log \frac{ \sum_{i=1}^n a_{i} }{\sum_{i=1}^n b_{i}}
$$

with equality if and only if $\frac{a_{i}}{b_{i}}=\text{const}$.
Proof: $f(t)= t\log t$ is strictly convex, so

$$
\sum_{i}\alpha_{i}t_{i}\log t_{i} \geq\left( \sum_{i}\alpha_{i}t_{i} \right)\log \sum_{i}\alpha_{i}t_{i}
$$

αi=aijbj,ti=aibi\alpha_{i}=\frac{a_{i}}{\sum_{j}b_{j}},t_{i}=\frac{a_{i}}{b_{i}}即可得证。
我们可以用log sum inequality 去证明很多凸性得到的结果,例如上面的”相对熵非负“这个定理。
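A quick numeric check of the log sum inequality on arbitrary nonnegative numbers:

```python
import math

a, b = [1.0, 2.0, 3.0], [2.0, 1.0, 1.0]
lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs, rhs)  # ~5.755 >= ~3.510
```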

Theorem 2.7.2: $D(p\mid \mid q)$ is convex in the pair $(p,q)$. In other words, if $(p_{1},q_{1})$ and $(p_{2},q_{2})$ are two pairs of probability mass functions, then

$$
D(\lambda p_{1}+(1-\lambda)p_{2}\mid\mid\lambda q_{1}+(1-\lambda)q_{2})\leq\lambda D(p_{1}\mid\mid q_{1})+(1-\lambda)D(p_{2}\mid\mid q_{2})
$$

for all $0\leq\lambda\leq 1$.

Proof:
Apply the log sum inequality to the left-hand side, taking $a_{1}=\lambda p_{1},a_{2}=(1-\lambda)p_{2},b_{1}=\lambda q_{1},b_{2}=(1-\lambda)q_{2}$:

$$
\begin{aligned}
&\left(\lambda p_1(x)+(1-\lambda) p_2(x)\right) \log \frac{\lambda p_1(x)+(1-\lambda) p_2(x)}{\lambda q_1(x)+(1-\lambda) q_2(x)} \\
&\qquad \leq \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)}+(1-\lambda) p_2(x) \log \frac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)}
\end{aligned}
$$

Summing over $x$ completes the proof.
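A numeric check of the convexity (compact redefinition of the hypothetical `kl` helper; the distributions and $\lambda$ are arbitrary choices of mine):

```python
import math

kl = lambda p, q: sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

p1, q1 = [0.5, 0.5], [0.9, 0.1]
p2, q2 = [0.2, 0.8], [0.4, 0.6]
lam = 0.3
mix = lambda u, v: [lam * a + (1 - lam) * b for a, b in zip(u, v)]

print(kl(mix(p1, p2), mix(q1, q2)))               # ~0.199
print(lam * kl(p1, q1) + (1 - lam) * kl(p2, q2))  # ~0.314 >= left side
```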

Theorem 2.7.3 (Concavity of entropy)
$H(p)$ is a concave function of $p$.
Proof:

$$
H(p)= \log|\mathscr{X}|-D(p\mid\mid u)
$$

where $u$ is the uniform distribution over the $|\mathscr{X}|$ outcomes. Thus the concavity of $H$ follows directly from the convexity of $D$.
Alternatively: take random variables $X_{1},X_{2}$ with $X_{1}\sim p_{1}(x)$, $X_{2}\sim p_{2}(x)$, and let $Z=X_{\theta}$, where the random variable $\theta$ has distribution

$$
\theta = \begin{cases} 1 & \text{with probability } \lambda \\ 2 & \text{with probability } 1-\lambda \end{cases}
$$

Then $Z\sim\lambda p_{1}+(1-\lambda)p_{2}$. Since conditioning reduces entropy,

$$
H(Z)\geq H(Z|\theta)
$$

or equivalently,

$$
H(\lambda p_{1}+(1-\lambda)p_{2})\geq\lambda H(p_{1})+(1-\lambda)H(p_{2})
$$
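And a matching numeric check of the concavity of $H$ (hypothetical helper redefined):

```python
import math

entropy = lambda p: -sum(x * math.log2(x) for x in p if x > 0)

lam, p1, p2 = 0.3, [0.5, 0.5], [0.1, 0.9]
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
print(entropy(mix))                                     # ~0.760
print(lam * entropy(p1) + (1 - lam) * entropy(p2))      # ~0.628 <= left side
```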

Theorem 2.7.4: Let $(X,Y)\sim p(x,y) = p(x)p(y\mid x)$. The mutual information $I(X;Y)$ is a concave function of $p(x)$ for fixed $p(y\mid x)$, and a convex function of $p(y\mid x)$ for fixed $p(x)$.
Proof:

  1. $I(X;Y)=H(Y)-H(Y|X)=H(Y)-\sum_{x}p(x)H(Y\mid X=x)$.
    If $p(y\mid x)$ is fixed, then $p(y)$ is a linear function of $p(x)$. Since $H(Y)$ is a concave function of $p(y)$, the first term is concave in $p(x)$, while the second term is linear in $p(x)$; hence $I$ is concave in $p(x)$.
  2. Fix $p(x)$ and consider two conditional distributions $p_{1}(y\mid x),p_{2}(y\mid x)$. The corresponding joint distributions are $p_{i}(x,y)=p(x)p_{i}(y\mid x),i=1,2$, with marginals $p(x),p_{1}(y)$ and $p(x),p_{2}(y)$.
    Consider the mixture of the two conditional distributions:
    $$
    p_{\lambda}(y\mid x)=\lambda p_{1}(y\mid x)+(1-\lambda)p_{2}(y\mid x)
    $$
    The joint distribution is then $p_{\lambda}(x,y)=\lambda p_{1}(x,y)+(1-\lambda)p_{2}(x,y)$, and the distribution of $Y$ is the same mixture:

    $$
    p_{\lambda}(y)=\lambda p_{1}(y)+(1-\lambda)p_{2}(y)
    $$

    If we let $q_{\lambda}(x,y)=p(x)p_{\lambda}(y)$ denote the product of the marginals, then

    $$
    q_{\lambda}(x,y)=\lambda q_{1}(x,y)+(1-\lambda)q_{2}(x,y)
    $$

    Since the mutual information is the relative entropy between $p_{\lambda}$ and $q_{\lambda}$, i.e. $I(X;Y)=D(p_{\lambda}\mid\mid q_{\lambda})$,
    and relative entropy is convex in the pair $(p,q)$, $I(X;Y)$ is convex in $p(y\mid x)$ in this setting.

2.8 Data Processing Inequality

The data-processing inequality can be used to show that no manipulation of the data can improve the inferences that can be drawn from the data.

Definition: random variables $X,Y,Z$ are said to form a Markov chain in that order (denoted $X\to Y\to Z$) if the conditional distribution of $Z$ depends only on $Y$ and the conditional distribution of $Y$ depends only on $X$; that is, $p(x,y,z)=p(x)p(y\mid x)p(z\mid y)$
Some simple consequences:

  1. $X\to Y\to Z$ if and only if $X,Z$ are conditionally independent given $Y$, i.e. $p(x,z\mid y)=p(x\mid y)p(z\mid y)$
  2. $X\to Y\to Z$ implies $Z\to Y\to X$
  3. $Z=f(Y)\implies X\to Y\to Z$

We can now prove an important and very useful theorem: no processing of $Y$, deterministic or random, can increase the information that $Y$ contains about $X$.
Theorem 2.8.1: If $X\to Y\to Z$, then $I(X;Y)\geq I(X;Z)$
Proof: by the chain rule,

$$
\begin{align}
I(X;Y,Z)&=I(X;Z)+I(X;Y\mid Z) \\
&=I(X;Y)+I(X;Z\mid Y)
\end{align}
$$

By property 1 above, $I(X;Z|Y)=0$, and since $I(X;Y|Z)\geq 0$, we conclude $I(X;Y)\geq I(X;Z)$.

Corollary: if $Z=g(Y)$, then $I(X;Y)\geq I(X;g(Y))$. It suffices to observe that $X\to Y\to g(Y)$.
Corollary 2: $X\to Y\to Z\implies I(X;Y\mid Z)\leq I(X;Y)$
Note that $I(X;Y|Z)>I(X;Y)$ is possible when $X,Y,Z$ do not form a Markov chain. For example, let $X,Y$ be the outcomes of independent fair coin flips and let $Z=X+Y$. Then $I(X;Y)=0$ (since $X,Y$ are independent), but

$$
I(X;Y\mid Z)=H(X\mid Z)-H(X\mid Y,Z)=\frac{1}{2}\ \text{bit}
$$
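The 1/2-bit value can be verified directly; note $H(X\mid Y,Z)=0$ because $X=Z-Y$:

```python
import math

# Joint distribution of (X, Z) for independent fair bits X, Y and Z = X + Y
p_xz = {(0, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25, (1, 2): 0.25}
p_z = {0: 0.25, 1: 0.5, 2: 0.25}

# I(X;Y|Z) = H(X|Z) - H(X|Y,Z), and the second term is 0
H_X_given_Z = -sum(p * math.log2(p / p_z[z]) for (_, z), p in p_xz.items())
print(H_X_given_Z)  # 0.5 bit
```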

2.9 The Second Law of Thermodynamics

One formulation of the second law of thermodynamics states that the entropy of an isolated system never decreases. In statistical physics, entropy is usually defined as the logarithm of the number of microstates of the system; this coincides with our definition of entropy if every state is equally probable. But why does the entropy of a system increase?
We can explain this by modeling the system as a Markov chain, assuming that, given the current state, the future of the system is independent of its past.

  1. The relative entropy $D(\mu_{n}\mid\mid \mu_{n}')$ decreases with $n$. Let $\mu_{n},\mu_{n}'$ be two distributions on the state space of the Markov chain at time $n$. Then $p(x_{n},x_{n+1})=p(x_{n})r(x_{n+1}\mid x_{n})$ and $q(x_{n},x_{n+1})=q(x_{n})r(x_{n+1}\mid x_{n})$, so

    $$
    \begin{aligned}
    & D(p(x_n, x_{n+1}) \mid\mid q(x_n, x_{n+1})) \\
    & \quad=D(p(x_n) \mid\mid q(x_n))+D(p(x_{n+1} \mid x_n) \mid\mid q(x_{n+1} \mid x_n)) \\
    & \quad=D(p(x_{n+1}) \mid\mid q(x_{n+1}))+D(p(x_n \mid x_{n+1}) \mid\mid q(x_n \mid x_{n+1}))
    \end{aligned}
    $$

    In the first expansion the conditional term is zero, since both chains share the same transition kernel $r$; in the second, the conditional term is nonnegative. Therefore
    $D(\mu_{n}\mid\mid \mu_{n}')\geq D(\mu_{n+1}\mid\mid \mu_{n+1}')$
  2. The relative entropy to a stationary distribution decreases. Taking $\mu_{n}'$ to be a stationary distribution $\mu$, we get $D(\mu_n\mid\mid \mu)\geq D(\mu_{n+1}\mid\mid \mu)$: as the chain evolves, the state distribution keeps moving closer to the stationary one.
  3. If the stationary distribution is uniform, then entropy increases. In general, a decreasing relative entropy does not imply an increasing entropy; the typical counterexample is a Markov chain whose stationary distribution is non-uniform. But if the stationary distribution is the uniform distribution, then

    $$
    D(\mu_{n}\mid\mid \mu)=\log|\mathscr{X}|-H(\mu_{n})=\log|\mathscr{X}|-H(X_{n})
    $$

    so a monotonically decreasing relative entropy means a monotonically increasing entropy.
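A small simulation of points 2 and 3 (the doubly stochastic transition matrix below is an arbitrary choice of mine; the hypothetical helpers are redefined compactly):

```python
import math

entropy = lambda p: -sum(x * math.log2(x) for x in p if x > 0)
kl = lambda p, q: sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

# Doubly stochastic transition matrix -> the uniform distribution is stationary
P = [[0.7, 0.2, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.2, 0.7]]
u = [1 / 3] * 3
mu = [0.9, 0.05, 0.05]  # arbitrary initial distribution
for n in range(5):
    print(n, kl(mu, u), entropy(mu))  # D(mu_n || u) decreases, H(mu_n) increases
    mu = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
```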

Definition (doubly stochastic transition matrix): a probability transition matrix $[P_{i j}]$, $P_{i j} = \text{Pr}\{X_{n+1}=j|X_{n}=i\}$, is doubly stochastic if

$$
\begin{align}
\sum_{i}P_{i j} = 1, & \quad j=1,2,\dots \\
\sum_{j}P_{i j }= 1, & \quad i=1,2,\dots
\end{align}
$$

A stochastic process is stationary if its joint distributions are unaffected by a time shift.
Remark: The uniform distribution is a stationary distribution of P if and only if the probability transition matrix is doubly stochastic.

  1. The conditional entropy $H(X_{n}|X_{1})$ increases with $n$ for a stationary Markov process (see the sketch after this list).
    If a Markov process is stationary, then $H(X_{n})$ is constant, so the entropy is nonincreasing. However, we will prove that $H(X_{n}\mid X_{1})$ increases with $n$; thus the conditional uncertainty about the future increases.
    Proof:

    $$
    \begin{aligned}
    H(X_n \mid X_1) & \geq H(X_n \mid X_1, X_2) && \text{(conditioning reduces entropy)} \\
    & =H(X_n \mid X_2) && \text{(by Markovity)} \\
    & =H(X_{n-1} \mid X_1) && \text{(by stationarity)}
    \end{aligned}
    $$

  2. Shuffles increase entropy. If $T$ is a shuffle of a deck of cards and $X$ is the initial random position of the cards in the deck, and if the choice of the shuffle $T$ is independent of $X$, then $H(TX)\geq H(X)$, where $TX$ is the permutation of the deck induced by the shuffle $T$ on the initial permutation $X$.
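A sketch of point 1 (my own small two-state chain; since $P$ below is doubly stochastic, the stationary distribution is uniform):

```python
import math

entropy = lambda p: -sum(x * math.log2(x) for x in p if x > 0)

P = [[0.9, 0.1], [0.1, 0.9]]
pi = [0.5, 0.5]                  # stationary (uniform) distribution
rows = [[1.0, 0.0], [0.0, 1.0]]  # P^(n-1): row i is p(x_n | x_1 = i)
for n in range(1, 6):
    # H(X_n | X_1) = sum_i pi(i) * H(row i of P^(n-1))
    print(n, sum(pi[i] * entropy(rows[i]) for i in range(2)))  # increases with n
    rows = [[sum(rows[i][k] * P[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
```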

2.10 Sufficient Statistics

In this section we will see the power of the data-processing inequality in clarifying an important problem in statistics.
Suppose we have a family of probability mass functions $\{f_{\theta}(x)\}$ indexed by $\theta$, and let $X$ be a sample from a distribution in this family. Let $T(X)$ be any statistic of $X$, such as the sample mean or sample variance. Then $\theta\to X\to T(X)$, and by the data-processing inequality,

$$
I(\theta ;T(X))\leq I(\theta;X)
$$

Definition: a function $T(X)$ is said to be a sufficient statistic relative to the family $\{f_{\theta}(x)\}$ if $X$ is independent of $\theta$ given $T(X)$; in other words, $\theta\to T(X)\to X$ forms a Markov chain.

Examples:

  1. Let $X_{1},X_{2},\dots,X_{n}$ be an i.i.d. sequence of coin flips with unknown parameter $\theta=\text{Pr}(X_{i}=1)$. Given $n$, the number of 1's is a sufficient statistic for $\theta$. One can show that, given $T$, all sequences with the same number of 1's are equally likely, independently of $\theta$ (see the sketch after these examples):

    $$
    \text{Pr}\left\{ (X_{1},X_{2},\dots,X_{n})=(x_{1},x_{2},\dots x_{n})\mid \sum_{i=1}^n X_{i}=k \right\}=\begin{cases} \frac{1}{C_{n}^k} & \text{if } \sum_{i=1}^n x_{i}=k \\ 0, & \text{otherwise} \end{cases}
    $$

  2. If $X$ is normally distributed with mean $\theta$ and unit variance, i.e.

    $$
    f_{\theta}(x)=\frac{1}{\sqrt{ 2\pi }}\exp\left( -\frac{(x-\theta)^2}{2} \right)=\mathscr{N}(\theta,1)
    $$

    and $X_{1},X_{2},\dots,X_{n}$ are drawn independently from this distribution, then the sample mean $\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^nX_{i}$ is a sufficient statistic for $\theta$.
  3. If $f_{\theta}\sim U(\theta,\theta+1)$, then a sufficient statistic for $\theta$ is

    $$
    T(X_{1},X_{2},\dots,X_{n})=(\max \{X_{1},X_{2},\dots X_{n}\},\min\{X_{1},X_{2},\dots,X_{n}\})
    $$

    A minimal sufficient statistic is a sufficient statistic that is a function of all other sufficient statistics.
    Definition: a statistic $T(X)$ is a minimal sufficient statistic relative to $\{f_{\theta}(x)\}$ if it is a function of every other sufficient statistic $U$. In the language of the data-processing inequality, this means

    $$
    \theta\to T(X)\to U(X)\to X
    $$

    $T(X)$ compresses the information about $\theta$ in the sample as much as possible; other sufficient statistics may carry additional irrelevant information. For example, for a normal distribution with mean $\theta$, the pair consisting of the mean of the odd-numbered samples and the mean of the even-numbered samples is a sufficient statistic, but not a minimal one.
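A quick empirical illustration of example 1 (my own construction): for two different values of $\theta$, the conditional distribution of the sequence given $\sum_{i} X_{i}=k$ is the same uniform distribution over the $C_{n}^{k}$ admissible sequences.

```python
from itertools import product

def cond_dist_given_sum(theta, n=4, k=2):
    """P(sequence | sum = k) under n i.i.d. Bernoulli(theta) flips."""
    probs = {seq: theta ** k * (1 - theta) ** (n - k)
             for seq in product([0, 1], repeat=n) if sum(seq) == k}
    total = sum(probs.values())
    return {s: p / total for s, p in probs.items()}

d1, d2 = cond_dist_given_sum(0.3), cond_dist_given_sum(0.8)
print(all(abs(d1[s] - d2[s]) < 1e-12 for s in d1))  # True: both uniform, 1/C(4,2)
```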

2.11 Fano's Inequality

Suppose we know a random variable $Y$ and wish to guess the value of a correlated random variable $X$. Fano's inequality relates the probability that this guess is wrong to the conditional entropy $H(X\mid Y)$. In the problems of this chapter we can show that $H(X\mid Y)=0\Leftrightarrow X=f(Y)$; hence we can determine $X$ from the value of $Y$ without error if and only if $H(X\mid Y)=0$.
Further, if the conditional entropy is nonzero but small, we expect to be able to guess $X$ correctly with high probability.
Let $X\sim p(x)$, and let $Y$ be related to $X$ through the conditional distribution $p(y\mid x)$. From $Y$ we compute $g(Y)=\hat{X}$, an estimate of $X$, and we wish to bound the probability that $\hat{X}\neq X$. Observe that $X\to Y\to \hat{X}$.
Define $P_{e}=\text{Pr}\{\hat{X}\neq X\}$.

Theorem (Fano's Inequality):

$$
H(P_{e})+P_{e}\log(|\mathscr{X}|-1)\geq H(X\mid Y)
$$

or in weakened form: $1+P_{e}\log|\mathscr{X}|\geq H(X|Y)$

Proof: define

$$
E=\begin{cases} 1, & \hat{X}=X \\ 0, & \hat{X}\neq X \end{cases}
$$

Then, expanding $H(E,X\mid Y)$ in two ways,

$$
\begin{align}
H(E,X\mid Y) & = H(X\mid Y)+H(E\mid X,Y)=H(X\mid Y) \\
H(E,X\mid Y) & = H(E\mid Y)+H(X\mid E,Y) \leq H(P_{e})+P_{e}\log(|\mathscr{X}|-1)
\end{align}
$$

Here $H(E\mid X,Y)=0$ because $E$ is determined by $X$ and $\hat{X}=g(Y)$; $H(E\mid Y)\leq H(E)=H(P_{e})$; and $H(X\mid E,Y)\leq (1-P_{e})\cdot 0+P_{e}\log(|\mathscr{X}|-1)$, since given $E=1$ we have $X=g(Y)$, while given $E=0$, $X$ ranges over at most $|\mathscr{X}|-1$ values. Combining the two expansions proves the inequality.
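A numeric sanity check of the bound for a binary symmetric setup where the best guess is $\hat{X}=Y$ (my own construction; since $|\mathscr{X}|=2$, the $\log(|\mathscr{X}|-1)$ term vanishes and the bound is tight here):

```python
import math

entropy = lambda p: -sum(x * math.log2(x) for x in p if x > 0)

eps = 0.1                     # X uniform on {0,1}; Y flips X with probability eps
Pe = eps                      # error probability of the guess X_hat = Y
H_Pe = entropy([Pe, 1 - Pe])
H_X_given_Y = entropy([eps, 1 - eps])  # by symmetry, H(X|Y) = H(eps)
print(H_Pe + Pe * math.log2(2 - 1), H_X_given_Y)  # equal: the bound holds with equality
```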
