[Simplified Spelling Society Newsletter Spring 1986/1 pp5-13 Later designated Journal 2]
[Part 1 of this long article is on another page.]

Information Theory and its Implications for Spelling Reform. Part 2.

Francis Knowles.


8.1 Zero Order Approximation.
Another statistical aspect of information theory is illustrated by the following sequence of letters:


This is a random extraction of letters from the English alphabet with absolutely no structure. Each letter had an equal chance of being extracted, so it represents what in Information Theory is called zero order approximation to the structure of English words.

8.2 First Order Approximation.
If on the other hand we choose letters with the frequency with which they would normally occur in text, the result is different. For every 1,000 letters used in English text, about 150 Es occur, but only about 3 or 4 Qs and Zs. If you take these statistics into account, you get a first order approximation, as here:
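A first-order sequence is easy to simulate: draw each letter independently, but weighted by its frequency. The sketch below uses a rounded frequency table of the kind commonly quoted for English (an assumption, not the table behind the article's own example).

```python
import random

# Rounded English letter frequencies per 1000 letters (assumed,
# illustrative values of the kind commonly quoted).
freq = {'E': 127, 'T': 91, 'A': 82, 'O': 75, 'I': 70, 'N': 67,
        'S': 63, 'H': 61, 'R': 60, 'D': 43, 'L': 40, 'C': 28,
        'U': 28, 'M': 24, 'W': 24, 'F': 22, 'G': 20, 'Y': 20,
        'P': 19, 'B': 15, 'V': 10, 'K': 8, 'J': 2, 'X': 2,
        'Q': 1, 'Z': 1}

def first_order(n):
    """Draw n letters independently, weighted by frequency."""
    letters = list(freq)
    weights = list(freq.values())
    return ''.join(random.choices(letters, weights=weights, k=n))

print(first_order(40))
```

Because each letter is drawn independently, the output respects letter frequencies but not letter sequence, which is exactly why it still does not look like English.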


This is still not very like English.

8.3 Second Order Approximation.
However, if we then take the frequency with which any letter occurs after any given preceding letter, we get a second order approximation, as here:


Just occasionally quite plausible strings of English-sounding syllables appear here: THING, ACT as a word, TH together, LY at the end of a word, GH as in high, ED as the end of a sort of verb, and so on. This arises from the frequency with which particular pairs of letters occur. Taking letters in pairs, this order of approximation draws on a set of 26 x 26 possible pairs.
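A second-order sequence can be generated by recording, for some source text, which letters follow which, and then sampling from those successor lists. The short source string below is a hypothetical stand-in for the long training text such a procedure would really use.

```python
import random
from collections import defaultdict

# A hypothetical training text; any long English text would do.
source = ("THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG AND THE "
          "CAT SAT ON THE MAT WITH THE HAT")

# Record which letters follow each letter (space included as a
# symbol, as in the article's tables).
follows = defaultdict(list)
for a, b in zip(source, source[1:]):
    follows[a].append(b)

def second_order(n, start='T'):
    """Generate n letters, each drawn according to the frequency
    with which it follows the previous letter in the source."""
    out = [start]
    for _ in range(n - 1):
        out.append(random.choice(follows[out[-1]]))
    return ''.join(out)

print(second_order(40))
```

Sampling from the successor lists reproduces the pair statistics automatically, because a pair that occurred twice in the source appears twice in the list and so is twice as likely to be drawn.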

8.4 Third Order Approximation.
The step after that gives a set of 26 x 26 x 26, from A to Z in triads, which looks much more plausible as English.


It doesn't mean anything, but if that had been a garbled telegram, one would have had no doubt that it was English that had been thus garbled. Although there is no intelligence in it, it does show the great stability of the statistics of language. If you then apply third-order statistics to French, the result indeed looks like French:

German looks as follows:



[Computer-generated approximation of Russian]

8.5 Fifth Order Approximation.
The final example is based on the frequency of words rather than of letters, and is derived rather like a guessing game. If you start with the word the, you then look for the first word that follows the in whatever long text you choose as your source; if the word you find after the happens to be head, you then look for whatever word follows the next occurrence of head, which happens to be and, and so on. Again, it doesn't mean anything, but it is indicative of a sort of structure, indeed that is the only point to this fifth order approximation to English syntax.
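The guessing-game procedure just described can be sketched directly: find the current word's next occurrence in the source, take the word that follows it, and repeat. The source text here is a hypothetical stand-in for "whatever long text you choose".

```python
# A hypothetical stand-in for a long source text.
source = ("the head and the shoulders of the man at the head of "
          "the table and in the head of the queue").split()

def word_chain(start, length):
    """Chain words: repeatedly find the next occurrence of the
    current word in the source and take the word after it."""
    out = [start]
    pos = 0
    for _ in range(length - 1):
        try:
            pos = source.index(out[-1], pos)
        except ValueError:
            pos = source.index(out[-1])  # wrap round to the start
        if pos + 1 >= len(source):
            break
        out.append(source[pos + 1])
        pos += 1
    return ' '.join(out)

print(word_chain('the', 8))
```

Like the letter-based approximations, the result is locally plausible (every adjacent pair of words genuinely occurs in the source) without meaning anything globally.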



9.1 Left to Right Branching in Spelling
Statisticians and computer scientists know that creating, say, an eighth order approximation to the behaviour of letters of the alphabet is a massive computing job, so guessing games are often used instead: subjects are presented with a certain sequence of letters and asked which letter they think should follow. If we start with the letters CONF, for instance, and have to bet which letter would occur next, we might say E if we were thinking of conference, or U if we were thinking of confusion. But let us enter the letter I (CONFI), which might suggest confidence or confiscate. This choice between several alternatives can be seen as a branching effect as we proceed from left to right through the word. If we add D (CONFID), then an A or an E could follow. E could be the end of the word (confide), in which case a space would follow. Alternatively, one could add N, then T, followed by either a space (confident) or IAL (confidential). That too could be the end of the word, or we could add ITY to produce CONFIDENTIALITY, or ITIES to produce CONFIDENTIALITIES.
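The branching effect can be made concrete with a toy word list, chosen here (hypothetically) to mirror the CONF- example:

```python
# A hypothetical word list mirroring the CONF- example.
words = ["conference", "confusion", "confidence", "confiscate",
         "confidant", "confide", "confident", "confidential",
         "confidentiality", "confidentialities"]

def next_letters(prefix):
    """The letters (or end-of-word, shown as '.') that can follow
    the given prefix in the word list."""
    choices = set()
    for w in words:
        if w.startswith(prefix):
            choices.add(w[len(prefix)] if len(w) > len(prefix) else '.')
    return sorted(choices)

print(next_letters("conf"))    # ['e', 'i', 'u']
print(next_letters("confid"))  # ['a', 'e']
print(next_letters("confide")) # ['.', 'n'] - end of word, or onward
```

At each position the set of possible continuations is what carries the information; where that set shrinks to a single choice, the next letter is predictable, which is the situation the following section turns to.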

9.2 No Choice = Redundant Information
This method can be used in the form of a very fine algorithm by the unintelligent computer to split orthographic words up automatically into their constituent morphemes, according to the statistics that govern each threshold, and it shows Information Theory can be applied not just to letters, but to fragments and segments of words. It is all very well to say T occurs so many times in the middle of words, but from the orthographical point of view we want to know exactly where, in what environments, since this is relevant to pronunciation and the tension that exists between speech and script. Shannon would conclude from this that redundancy arises whenever there is no choice as to the next letter in a word: where it is obligatory, it can be left out.

9.3 Cutting Spelling.
With these points in mind, let us consider three versions of a text. The first is in standard English but capitalized and using the dot as the separator between words:


The same text in Cut Spelling is obviously an improvement in Information Theory terms:


Then the same text with vowels removed as in Hebrew, but indicated by a mark (this certainly gives pause for thought as to how meaning is preserved in such a script):

 N.TH .S V T. N  N.Y ST RD Y. S.
 NT S V T.R N  G D S.CR M N  LS. ND.
 ND.TH . S .W S.M  N P L T D.BY.W  ST RN.


10.1 Binary Analysis.
Shannon basically approaches all Information Theory questions by requiring a yes-or-no answer. This binary approach, which modern schoolchildren are familiar with through their new maths, is the calculus that underlies his computation of redundancy figures. The basic principle he uses is that if something is highly probable, then its informativity is low, and if it is improbable, then its informativity is high. It took him some time to find the right sort of mathematical treatment for this, but he did it in the end, as I think is acknowledged by all workers in the field.

10.2 Single Letter Redundancy.
The table below gives the letters of the alphabet in frequency order as they occur in an extended version of the first text in 9.3 above, with dot, representing space, as the most common, and some calculations carried out with binary logarithms on these probabilities, summed in the right-hand column to yield a value which is explained underneath. Shannon says that ideally all symbols used should be equiprobable, in which case any deviation from equiprobability is some sort of redundancy. On that basis we can show that the source text contains almost 14% redundancy. Because some letters are used more than others, they don't carry an equal burden.

No.  Let  Prob     -LB(Prob)  Term     Sum
 1   .    0.16022  2.64190    0.42328  0.42328
 2   E    0.11881  3.07333    0.36513  0.78841
 3   T    0.08215  3.60568    0.29619  1.08460
 4   A    0.06517  3.93958    0.25675  1.34135
 5   S    0.06517  3.93958    0.25675  1.59810
 6   N    0.06449  3.95469    0.25505  1.85316
 7   I    0.06042  4.04881    0.24463  2.09779
 8   R    0.05703  4.13222    0.23565  2.33344
 9   O    0.05227  4.25776    0.22257  2.55601
10   D    0.03870  4.69165    0.18155  2.73756
11   H    0.03055  5.03269    0.15375  2.89131
12   C    0.02512  5.31509    0.13351  3.02482
13   U    0.02308  5.43708    0.12550  3.15032
14   M    0.02105  5.57035    0.11723  3.26755
15   L    0.02105  5.57035    0.11723  3.38478
16   P    0.02037  5.61765    0.11441  3.49919
17   V    0.01697  5.88069    0.09981  3.59900
18   Y    0.01494  6.06511    0.09059  3.68958
19   G    0.01290  6.27662    0.08096  3.77054
20   B    0.01222  6.35462    0.07765  3.84820
21   W    0.01222  6.35462    0.07765  3.92585
22   F    0.01154  6.43708    0.07429  4.00014
23   K    0.00611  7.35462    0.04494  4.04508
24   Z    0.00272  8.52454    0.02315  4.06823
25   J    0.00204  8.93958    0.01821  4.08643
26   Q    0.00136  9.52454    0.01293  4.09937
27   X    0.00136  9.52454    0.01293  4.11230

Maximum possible entropy for a set of 27 symbols: 4.75489 bits
Actual entropy for this source: 4.11230 bits
Relative entropy for this text-source: 86.48570%
Redundancy for this text-source: 13.51431%
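The calculation in the table can be reproduced mechanically. The sketch below computes the same quantities for any text: entropy as the sum of the Term column, and redundancy as the shortfall against the equiprobable maximum.

```python
import math
from collections import Counter

def redundancy(text):
    """First-order entropy and redundancy, computed as in the table
    above: -sum(p * log2 p) against the maximum log2(number of
    distinct symbols in the text)."""
    counts = Counter(text)
    total = len(text)
    entropy = -sum(c / total * math.log2(c / total)
                   for c in counts.values())
    maximum = math.log2(len(counts))
    return entropy, 1 - entropy / maximum

# Four equiprobable symbols: entropy 2.0 bits, redundancy zero.
h, r = redundancy("ABCD")
print(h, r)
```

With equiprobable symbols the redundancy is exactly zero; any skew in the frequencies, such as the dominance of space and E in the table, pushes it above zero.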

10.3 Redundancy of Letter-Pairs.
If you then apply the same technique to letter-pairs, redundancy goes up to nearly 24%, which is because the letter-pairs are not all equally exploited. Many are very familiar, such as TH, AN, IN. The 55 most common in the full version of the text, with their frequencies, were:

 1  E.  41     20  NT  17     38  .W  11
 2  .A  38     21  ST  17     39  OV  11
 3  .T  35     22  Y.  16     40  AS  10
 4  S.  33     23  VI  16     41  IS  10
 5  TH  30     24  .M  15     42  RS   9
 6  D.  30     25  .O  14     43  .B   9
 7  T.  29     26  .I  14     44  .H   9
 8  HE  27     27  ED  14     45  CE   9
 9  IN  26     28  TI  14     46  TA   8
10  EN  24     29  IE  13     47  WE   8
11  AN  24     30  .P  12     48  O.   8
12  ND  23     31  DE  12     49  OF   8
13  ER  22     32  AC  12     50  NO   8
14  N.  20     33  OR  12     51  EA   8
15  .S  20     34  ON  11     52  F.   8
16  RE  19     35  AR  11     53  ME   8
17  ES  18     36  SE  11     54  OU   8
18  TE  17     37  ET  11     55  LI   8

With letters in threes, or triads, redundancy exceeds 40% for this text. Many letter-combinations, like THQ, of course never occur, and that represents a sort of systemic constraint.
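The pair-based calculation can be sketched in the same way as the single-letter one: measure the entropy of the observed digrams against the maximum for all conceivable pairs of the alphabet in use. (The exact text behind the 24% and 40% figures is not reproduced here, so this is the method only, not those numbers.)

```python
import math
from collections import Counter

def pair_redundancy(text):
    """Redundancy over letter-pairs: entropy of the observed digrams
    against the maximum log2(N*N) for an N-symbol alphabet. Pairs
    that never occur, like THQ among triads, raise this figure."""
    pairs = [text[i:i+2] for i in range(len(text) - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    entropy = -sum(c / total * math.log2(c / total)
                   for c in counts.values())
    maximum = math.log2(len(set(text)) ** 2)  # all conceivable pairs
    return 1 - entropy / maximum

print(pair_redundancy("THE.CAT.SAT.ON.THE.MAT.THE.HAT."))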

10.4 Bits of Information.
Taking that approach, you can say that the standard orthography transmitted a certain number of 'bits' of information, 'bit' being a technical term in Information Theory and computer science, a stump word from 'binary digit'. Now a text in Cut Spelling contains the same linguistic information, because the message is the same, but it saves a number of bits of information in its transmission, and that number can be used as an index of improved efficiency. Comparative figures for the three versions of the full text are as follows:

Text 1 (t.o. with dots for spaces): 1473 symbols, information transmitted 380.26 bits.
Text 2 (Cut Spelling): 1352 symbols, information transmitted 345.59 bits.
Text 3 (vowels deleted): 1002 symbols, information transmitted 249.22 bits.
Ratios: Text 2 : Text 1 = 0.91; Text 3 : Text 2 = 0.72; Text 3 : Text 1 = 0.66.

But there are some problems there that I can't give a satisfactory explanation of, even as someone who has thought about these things for a long long time. All the pieces of the jigsaw don't easily fall into place.


11.1 Huffman Coding.
One ingenious method of handling Information Theory concepts came from Huffman, who said quite correctly that one has to accept as a fact of life that the symbols are used with different frequencies, as Samuel Morse knew when he developed the Morse code. Morse went into a printer's workshop and saw the compositor's trays full of slugs: he noticed the container for Es was massive, but only a few Zs were needed. On this basis he decided that the code elements for the common letters needed to be short, while those for the less frequent letters could be longer. Below is a Huffman coding for the English alphabet in descending order of letter-frequency (read down the columns, left to right); the less frequent the letter, the longer its coding:

.  111      D  11001    G  001001
E  011      H  01011    B  000111
T  1101     C  0010     W  001000
A  1010     U  00010    F  000110
S  1011     M  00001    K  0101010
N  1001     L  00000    Z  01010110
I  1000     P  110001   J  010101110
R  0100     V  110000   Q  0101011111
O  0011     Y  010101   X  0101011110
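Huffman's construction itself is simple to sketch: repeatedly merge the two least frequent nodes into one, prefixing 0 to one side's codes and 1 to the other's, so that the commonest symbols end up nearest the root with the shortest codes. The frequencies below are hypothetical.

```python
import heapq

def huffman(freqs):
    """Build a Huffman code by repeatedly merging the two least
    frequent nodes; common symbols receive short codes."""
    # Heap entries: (weight, tiebreak, {symbol: code-so-far}).
    heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, i, merged))
        i += 1
    return heap[0][2]

# Hypothetical counts: E commonest, Z and Q rarest.
codes = huffman({'E': 12, 'T': 9, 'A': 8, 'Z': 1, 'Q': 1})
print(codes)
```

The resulting code is prefix-free: no codeword is the beginning of another, so a stream of bits can be decoded unambiguously from left to right, just as the variable-length codes in the table above are meant to be.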

11.2 Equal Length Coding.
However this system gives rise to a serious problem for computer scientists, because they don't like codes of varying length. An experiment to reverse this was carried out by Mike Lynch of Sheffield University, with whom I worked for 10 years. He devised 256 codes, each with 8 digits, to represent all the letters of the alphabet, numbers from 0 to 9, and punctuation marks, as well as a large number of common letter-pairs, triads, foursomes and even a few combinations of short words. Here is a small selection of the codes:

*       00000001      6       00010001
?       00010110      A       00010111
Z       00110000      'S      00110010
AN      00111100      IS      01100100
NG      01111000      Q       10000111
CON     10110110      ING     10111110
TH      11011001      ATIO    11100001
FROM    11110100      WILL    11111000
, AND   11111010      TO THE  11111111

To convert our text into this system, we enclose in brackets whatever character or string of characters can be encoded by a single code-group, as shown here for the opening of the text:

( )(TA)(SS )(AT)(TA)(CK)(ED )(IN)(DE)(PE)(ND)(EN)
(T )(PE)(AC)(E )(AC)(TI)(VI)(ST)(S )(IN THE )
(SO)(VI)(E)(T )(UN)(IO)(N )(Y)(ES)(TER)(DA)(Y )

The brackets split up the text so that it can be stored in the computer as a continuous sequence of ones and zeros:


This kind of coding enables text to take up only half the space inside the computer that it would require if entered letter by letter. It is a very impressive piece of applied Information Theory. However it is now unnecessary, because computer memory, once so expensive, is now very cheap.
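The bracketing shown above amounts to a greedy longest-match pass over the text: at each position, take the longest string the table can encode as a single code-group. The codebook below is a hypothetical fragment standing in for Lynch's full 256-entry table, with codes shown as small integers rather than 8-bit patterns.

```python
# Hypothetical fragment of a Lynch-style codebook: single letters,
# letter-pairs and longer strings each map to one fixed-length code.
codebook = {' ': 0, 'A': 1, 'C': 2, 'D': 3, 'E': 4, 'H': 5, 'I': 6,
            'K': 7, 'N': 8, 'S': 9, 'T': 10,
            'TA': 20, 'SS ': 21, 'AT': 22, 'CK': 23, 'ED ': 24,
            'IN': 25, 'DE': 26, 'PE': 27, 'IN THE ': 30}

def encode(text):
    """Greedy longest-match tokenisation, as in the bracketed text
    above: at each position take the longest entry in the codebook."""
    longest = max(map(len, codebook))
    out = []
    i = 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 0, -1):
            if text[i:i+n] in codebook:
                out.append(codebook[text[i:i+n]])
                i += n
                break
        else:
            raise ValueError("no code for " + text[i])
    return out

print(encode("TASS ATTACKED IN THE "))  # [20, 21, 22, 20, 23, 24, 30]
```

Seven fixed-length code-groups thus replace twenty-one characters, which is the compression the scheme was designed to achieve.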

11.3 Braille.
The braille alphabet, which is a series of dots or non-dots on paper, is likewise a kind of binary code. It has some close analogies to Information Theory, even though it pre-dates it by a century; for instance it comes in both a contracted and a non-contracted form. To save the blind reader the effort of scanning too many cells with the fingertips, contracted braille uses a number of conventional dot configurations, which have to be learnt. The text is thereby enormously reduced, and the reading process speeded up. It might be of interest to the Simplified Spelling Society to examine what sort of stump spellings and conventions the RNIB and its equivalent in other countries use in their systems. Here is a selection of the braille dot configurations:

[Braille dot configurations]


12.1 Trends towards Text-Processing.
Already today computers are processing massive amounts of text - in fact that is the real growth area with computers. At the moment the ratio of numerical to textual material processed is 7:3, but by 1990 that ratio is expected to be reversed, in addition to the absolute growth that will occur anyway. A lot of time and money is being invested in intricate software for spelling correction, and one has to say that it is a shame such a thing is felt to be necessary.

12.2 Algorithms and Dictionary Look-up.
Computers like a diet of algorithms, which one could compare with the recipe for baking a cake: if you take a certain sequence of steps, correctly and in the right order, the desired result will emerge. Computers are highly adept at carrying out such sequences of steps, at high speed and repetitively. But with our present spelling, algorithms have to be supplemented with checklists, a dictionary look-up procedure to deal with all the exceptions. An example from outside spelling: in order to deal with a word like went in text you don't use an algorithm to relate it to go, you use a dictionary look-up procedure. But looking up individual words is a very untidy business as far as computing is concerned. However, one shouldn't worry too much about the effect of spelling irregularities on computing, because what the computer can't compute, it can look up, and vice versa.
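The division of labour between algorithm and dictionary look-up can be sketched very simply. The rules and the exception list below are hypothetical and deliberately minimal: regular forms are handled by rule, while went and its kin are handled by look-up.

```python
# Irregular forms must be looked up; the list here is a tiny
# hypothetical sample.
EXCEPTIONS = {'went': 'go', 'was': 'be', 'took': 'take'}

def base_form(word):
    """Relate an inflected word to its base form: dictionary look-up
    for exceptions, a simple rule-based algorithm for the rest."""
    if word in EXCEPTIONS:          # look-up: went -> go
        return EXCEPTIONS[word]
    if word.endswith('ied'):        # rule: carried -> carry
        return word[:-3] + 'y'
    if word.endswith('ed'):         # rule: walked -> walk
        return word[:-2]
    return word

print(base_form('went'))    # go   (look-up)
print(base_form('walked'))  # walk (rule)
```

Real morphological analysers need many more rules (consonant doubling, e-deletion and so on), but the shape is the same: the algorithm covers the regular cases, and what it cannot compute, the computer looks up.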


13.1 Savings by Cutting Redundant Letters.
Let me finish with some quotable statistics. Before the Russian Revolution, the Russian orthographical system included a lot of redundant letters, the most redundant being called the hard sign. One feature of the Russian language is that consonants can be palatalized or non-palatalized, and the distinction used to be marked by either a soft sign or a hard sign. The hard sign was used in particular phonetic contexts at the end of words wherever there wasn't a soft sign, and it was thus clearly redundant. Lenin himself largely got rid of it in the orthographical reform that took place shortly after the October Revolution, and as a result Anna Karenina became 35 pages shorter. The hard sign still exists but is restricted to some very unusual situations, where it acts as a separator to prevent a consonant being contaminated by a following vowel, where that is required.

13.2 Simplifying Doubled Consonants.
Another example, also from Russian, is the word for communist, which is of course very common. In Russian it is spelt with MM although only a single M is pronounced; but in other Slavonic languages, such as Polish (komunistyczny), the word is written with only one M. If one of the Ms were dropped in Russian, 2.35 tonnes of printing ink would be saved every year in the USSR.


Although the finer technical details of Information Theory are no doubt not central to spelling reform as such, nevertheless as a whole it is not entirely on the periphery, and those who are concerned with designing improved orthographies should perhaps have a general awareness of its implications. In Information Theory the concept of redundancy is chiefly applied to the achievement of maximum efficiency for machines; but in terms of the psychology of reading, it is clearly of the greatest importance to consider the question of efficiency of text-processing by the human brain, which, though it operates in a different way from machines and has different needs, nevertheless should ideally also be enabled to perform the functions of literacy as quickly, as easily and as accurately as possible.


J. Campbell, Grammatical Man, Penguin, 1984.
C. Cherry, On Human Communication, MIT, 1968.
J. R. Pierce, Symbols, Signals and Noise, Harper, 1961.
J. Singh, Great Ideas in Information Theory, Language and Cybernetics, Dover, 1966.
Braille Primer, RNIB, 1966.
