Created: 2026-03-06 07:53:04
Updated: 2026-03-06 07:53:04

Variable-length code - Wikipedia

Codes & Extensions:

Let S,TS,T be two finite sets, called the source and target alphabets. A Code C:STC:S\to T^* is a total function mapping, and an extension of CC is a homomorphism: ext(C):SText(C):S^*\to T^*, which maps each sequence of source symbols to a sequence of target symbols.

Non-singular codes:指映射非单射

A code is non-singular if each source symbol is mapped to a different non-empty bit string, i.e. the mapping from source symbols to bit strings is injective(f(x1)=f(x2)    x1=x2f(x_{1})=f(x_{2})\implies x_{1}=x_{2}).

可唯一解码的编码C(Uniquely decodable codes):

对于有限长的编码TTT'\in T^*,都存在唯一的解码方式SSS'\in S使得S=ext(C)S'=ext(C)

A code CC is uniquely decodable if its extension ext(C)ext(C) is non singular. Whether a given code is uniquely decodable can be decided with the Sardinas-Patterson algorithm.

%%给定一个编码如何判断是否为可唯一解码的编码?假定某TTT'\in T^*有两种解码方式S1,S2SS_{1},S_{2}\in S^*S1={A1,A2,,Am}S_{1}=\{A_{1},A_{2},\dots,A_{m}\}S2={B1,B2,,Bm}S_{2}=\{B_{1},B_{2},\dots,B_m\},其中Ai,BiSA_{i},B_{i}\in S%%

Sardinas Patterson Algorithm

  1. Let S0=CS_{0}=C
  2. round i : let the set of dangling suffixes SiS_{i}
    • S1=S01S0{}S_{1}=S_{0}^{-1}S_{0} \{\}

![[294024f145e41457d784606e9b5742a6d7ca7ffe.svg]]:Singular
![[e7077bf9dbc7c975f4d5f510567a34ca5ff0e1d7.svg]]:Non-singular, its extension will generate a lossless coding ,which will be useful fo general data transmission.
Note: it is not necessary for the non-singular code to be more compact than the source.

Consider M2M_{2} again. It's not uniquely decodable, since the string 011101110011 can be interpreted as cdb or babe. However, such a code is useful when the set of all possiblesource symbols is completely known and finite, or when there are restrictions (for example a formal syntax) that determine if source elements of this extension are acceptable. Such restrictions permit the decoding of the original message by checking which of the possible source symbols mapped the same are valid under the restrictions.

前缀码

C:ST=(c1,c2,cn)C:S\to T^*=(c_{1},c_{2},\dots c_{n}):指的是任意iji\neq j,有ciprefix(cj)c_{i}\neq \text{prefix}(c_{j})性质的编码
设可唯一解码的编码CC构成集合AA,前缀码/即时码的集合是BB
我们来研究具有最小平均编码长度(min average code length)的集合。我们指出,对任意cAB\vec{ c}\in A-B具有最短编码长度,总能找到对应的cB\vec{c'}\in B,使得二者编码长度相等。

Proof:

  1. 对于任意可唯一解码的编码CC,它总满足Kraft Inequality。证明见下面。
  2. 给定满足Kraft Inequality的一组码字长度,我们可以构造满足该条件的前缀码。构造方式如下:
    重新安排码字长度lil_{i}使得l1l2lnl_{1}\leq l_{2}\leq\dots\leq l_{n}.定义第ii个码字CiC_{i},使得它是

    j=1i1Dlj\sum_{j=1}^{i-1}D^{-l_{j}}

    DD进制表示的前lil_{i}个数位。对于第一个l1l_{1}C1=0C_{1}=0.对于这样构造的编码,一定有任意两个码字之间不互为前缀,因而是满足条件的前缀码。

对随机变量X,P=(p1,p2,,pn)X,P=(p_{1},p_{2},\dots,p_{n})对应了X=xkX=x_{k}对应的概率。如何找到最小的prefix-free编码c1,c2,cn,c_{1},c_{2},\dots c_{n},使得平均码长E(l)=ipiciE(l)=\sum_{i}p_{i}|c_{i}|最小?

但prefix-free这个性质并不数学!如何改得更加数学化?
一个直觉是,prefix-free的限制使得cic_{i}不可能都很短!如何把这种直觉定量化?这就是Kraft Inequality描述的内容:

Kraft Inequality:

If c1,c2,,cnc_{1},c_{2},\dots,c_{n} are prefix-free codes, let l1,l2,,lnl_{1},l_{2},\dots,l_{n} be the code length; then

i=1n2li1\sum_{i=1}^n 2^{-l_{i}}\leq 1

可以把这个prefix-free codes看成一个二叉树,这时每个码字就是一个叶节点,而prefix-free的性质要求了所有编码都不在二叉树的内部。
取等的情形对应满二叉树的情形,这时才能有可能取到最短的平均码长。

对于唯一可译编码的情形,可以给出一个很巧妙的证明:

[!theorem] Theorem 2(McMillan)
对于DD元的字母表SS构成的唯一可译码,其码字长度lkl_{k}必然满足:iDli1\sum_{i}D^{-l_{i}}\leq 1

[!tip] Proof

xXDl(x)1\sum_{x\in \mathscr{X}}D^{-l(x)}\leq 1

作k次方处理:

(xXDl(x))k=xiX,i=1,,nDi=1nl(xi)=xkXkDl(xk)=m=1klmaxa(m)Dmklmax\left(\sum_{x\in \mathscr{X}} D^{-l(x)}\right)^k = \sum_{x_{i}\in \mathscr{X},i=1,\dots,n}D^{-\sum_{i=1}^nl(x_{i})}=\sum_{x^k\in \mathscr{X^k}}D^{-l(x^k)}=\sum_{m=1}^{kl_{\max }}a(m)D^{-m}\leq kl_{\max }

其中a(m)a(m)为编码长m的个数。由非奇异编码可知,a(m)Dma(m)\leq D^m,于是

xXDl(x)(klmax)1/k\sum_{x\in \mathscr{X}}D^{-l(x)}\leq (kl_{\max })^{1/k}

kk\to \infty即得证。

1

这时我们的问题变为:

minipiliwheni2li=1\\min \sum_{i}p_{i}l_{i}\qquad \text{when} \sum_{i}2^{-l_{i}}=1

qi=2liq_{i}= 2^{-l_{i}}li=logqil_{i}=-\log q_{i},于是问题变为:

minipilogqiwheniqi=1\\min \sum_{i}-p_{i}\log{q_{i}}\qquad \text{when} \sum_{i}q_{i}=1

这正是之前证明过的定理,当qi=pi,li=log21piq_{i}=p_{i},l_{i} = \log_{2} \frac{1}{p_{i}}时取得最小值,最小码长为ipilogpi\sum_{i} -p_{i}\log p_{i}。当然,这里未考虑到lil_{i}是整数这个约束。

熵的加性:设有随机变量X,Y,ZX,Y,Z, XX有分布(p1,p2,,pn)(p_{1},p_{2},\dots,p_{n}),当X=xnX=x_{n}p(Y=y1)=q1,p(Y=y2)=q2p(Y=y_{1})=q_{1},p(Y=y_{2})=q_{2},Z的分布为(p1,p2,,pn1,pnq1,pnq2)(p_{1},p_{2},\dots,p_{n-1},p_{n}q_{1},p_{n}q_{2}),则三者熵的关系为:

H(X)+pnH(Y)=H(Z)H(X)+p_{n}H(Y)=H(Z)

Optimal code

对于随机变量X及其分布(p1,p2,,pn)(p_{1},p_{2},\dots,p_{n}),有p1p2pnp_{1}\geq p_{2}\geq\dots\geq p_{n},求一组编码c1,c2,,cn{0,1}nc_{1},c_{2},\dots,c_{n}\in \{0,1\}^n,使得平均码长最短。

Optimal Code应具有的性质为:

  1. c1c2cn1cn|c_{1}|\leq |c_{2}|\leq\dots\leq|c_{n-1}|\leq |c_{n}|
  2. i=1n2ci=1\sum_{i=1}^n 2^{-|c_{i}|}=1
  3. Recursive:合并两个最小概率节点,形成有n1n-1个message的code时,这个code也是optimal的
  4. cn=cn1|c_{n}|=|c_{n-1}|

Leave a Comment

captcha
Fontsize