Plain text considerations: Rate-of-language | page 4 of 12 |
Rate-of-language: The amount of new
information contributed by each successive letter of a message.
English prose, for example, turns out to contain something
like 1.3 bits of entropy (information) per letter. This
might seem an outrageous claim -- after all, 2^(1.3) is only
about 2.5, and English has 26 letters! But the catch is that
some letters occur far more often than others, and that pairs
(digraphs) and triplets (trigraphs) of letters cluster
together as well. The rate of English doesn't depend just on
the alphabet, but on the patterns in the whole text. The low
rate of English prose is what makes it such a good
compression candidate.
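The idea can be sketched in a few lines of Python. This is a minimal illustration, not something from the text itself: the sample string, the single-letter entropy estimate, and the zlib-based compression bound are all assumptions chosen to show why naive letter counting overestimates the true rate.

```python
import math
import zlib
from collections import Counter

# Illustrative English sample (lowercase letters and spaces only).
SAMPLE = (
    "the rate of a language measures how much new information "
    "each successive letter of a message carries english prose "
    "has so much structure that each letter adds far less than "
    "the alphabet size alone would suggest which is exactly why "
    "plain english text compresses so well"
)

def unigram_entropy(text: str) -> float:
    """Shannon entropy per character from single-letter frequencies.

    This ignores digraph/trigraph clustering, so it lands around
    4 bits/letter -- well above the ~1.3 bits/letter of real prose.
    """
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def compressed_rate(text: str) -> float:
    """Bits per character after zlib compression.

    A general-purpose compressor exploits letter, digraph, and
    trigraph patterns alike, so this gives a rough upper bound on
    the rate of the text (short inputs carry fixed overhead).
    """
    return len(zlib.compress(text.encode("ascii"))) * 8 / len(text)
```

On a short sample like this, `unigram_entropy` reports roughly 4 bits per character, while `compressed_rate` comes in lower; only by exploiting longer and longer patterns, as the best compressors and human guessers do, does the estimate approach 1.3 bits per letter.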