Electronics & Wireless World apr1988, p350
The seven-per-cent rule
Ivor Catt believes that what appears to be a new rule of linguistics could have consequences in data storage.
My discovery of the seven-per-cent rule demonstrates how new academic disciplines can grow out of mundane technological advances, in this case the reduction in the cost of memory and of data processing.
In the past, data compression techniques involved replacing short strings like ion, the, and by a short code. There was an underlying assumption, valid at the time, that a 20,000 word look-up dictionary of words for conversion into shorthand was impracticable. An August 1977 project report* suggested a dictionary of 200 strings, with ion, the, and and ation as examples of strings to be compressed. At that time, a 20,000 word look-up dictionary in 200,000 bytes of memory would perhaps have cost £100,000. Huffman, Shannon and Fano were associated with the underlying theory of compression of three- or four-character strings.
The relations between cost, speed and size of semiconductor rom and ram recently passed the point where a look-up dictionary of 20,000 words became practicable, its price dropping to £100.
Since a text of 100,000 words is virtually all encompassed by a dictionary of 20,000 words, then if the average word length is six characters, we can today consider storing codes of (less than) two characters rather than the original longer words. The earlier Huffman idea of character-by-character searching, looking for frequently recurring strings, seems to be obsolete.
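The arithmetic behind that claim is easily checked. A minimal sketch, using the article's own figures (a 20,000-word dictionary and six-character average words); the variable names are mine:

```python
import math

# A dictionary of 20,000 entries needs fixed codes of ceil(log2(20000)) bits,
# which is 15 bits - just under two 8-bit characters per word - whereas the
# words being replaced average six characters plus an inter-word space.
dictionary_size = 20_000
code_bits = math.ceil(math.log2(dictionary_size))  # 15 bits per code
word_bits = (6 + 1) * 8                            # six characters plus a space

print(code_bits)                        # 15
print(round(word_bits / code_bits, 1))  # about 3.7-fold compression
```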
The Lob Corpus** is a computer analysis of one million words of text from diverse sources, and is invaluable for our purpose. We find that it would be possible, lacking systematically developed information, to make major errors - for instance, to think that the average word is six characters long. Words in the Lob Corpus analysis are ranked in order of frequency of occurrence, and we find a number of interesting things. First, a form of text compression by word has already occurred in English, in that the most common words are generally substantially shorter than the less common words. Since these common words are so short (the most common 64 words averaging 2.6 characters***, the next most common 64 words averaging 3.9 characters), and since more than half the words in the text come from these 128 words, it follows that the saving of the space character which is implicit in text compression by word is one of the most compelling, but confusing, reasons for text compression by word. (With the inter-word space, the 2.6 becomes 3.6 and the 3.9 becomes 4.9.) Possibly part of our system of word recognition when we read is that shortness tells us that the word is common, and vice versa. A word's length may be part of its informational content.
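The distinction between the average length of a word as it occurs in the text and the average length of the word list itself is easy to confuse. A small sketch with invented frequencies (illustrative only, not Lob Corpus data) makes it concrete:

```python
# Invented word frequencies, for illustration only - not Lob Corpus figures.
freqs = {"the": 70, "of": 30, "and": 25, "to": 25, "that": 10, "dictionary": 1}

def text_weighted_average(freqs):
    """Average length of a word as it occurs in the text (frequency-weighted)."""
    total = sum(freqs.values())
    return sum(len(word) * n for word, n in freqs.items()) / total

def list_average(freqs):
    """Average length of the words in the list, each counted once."""
    return sum(len(word) for word in freqs) / len(freqs)

# The short, common words dominate the frequency-weighted figure, so it
# comes out well below the plain list average - the same effect as the
# 2.6 versus 3.1 characters reported for the most common 64 words.
print(round(text_weighted_average(freqs), 1))  # 2.8
print(list_average(freqs))                     # 4.0
```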
It appears that something like a 7-bit**** code for the most common 128 words, and one bit to indicate whether the code is 7-bit or 15-bit, would give efficient (×3 or so) text compression in a manner easy to effect with our hardware*****. It would also not significantly interfere with text retrieval, although it would somewhat upset character-by-character search. It is 20% more efficient than using a 15-bit code for everything: 100 words consume 1200 bits instead of 1500 bits. There is a further small improvement, down to 1100 bits for 100 words, if four (or eight) different code lengths are used, requiring, of course, a 2- or 3-bit number to specify the code length used. Increasing the number of code lengths improves the efficiency by reducing the pattern length for words, but the improvement is more or less exactly cancelled by the waste involved in recording which code length is used.
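The two-length scheme can be sketched as follows. This is a hypothetical illustration of the bit accounting; the word list and the sample text are assumptions of mine, not the project hardware:

```python
# One flag bit selects between a 7-bit code (for the 128 most common words)
# and a 15-bit code (for everything else), so common words cost 8 bits and
# uncommon words cost 16 bits each.
COMMON_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is"}  # first 8 of 128

def encoded_bits(words, common):
    """Total bits to encode a word sequence under the two-length scheme."""
    return sum(8 if word in common else 16 for word in words)

# With half of a 100-word text drawn from the common set - the article's
# estimate for the top 128 words - the cost is 50*8 + 50*16 = 1200 bits,
# against 1500 bits for a flat 15-bit code: the 20% saving quoted above.
sample = ["the", "of"] * 25 + ["compression", "dictionary"] * 25
print(encoded_bits(sample, COMMON_WORDS))  # 1200
print(len(sample) * 15)                    # 1500
```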
The fact that use of two code lengths gives only a 20% advantage indicates that storing the look-up dictionary in ram and altering the "most common" 128-word set in real time, to suit experience with texts, will probably not be highly worthwhile, given that ram is more expensive than rom. However, such tradeoffs need to be discussed at great length. One large computer company with $4 billion turnover sells $1 billion worth of disc files each year. Although some of the file space will be taken by data, it is possible that text occupies a significant percentage of that $1 billion worth of hardware. This indicates that there is a lot of money available for effecting more efficient text compression. A similar argument could be made for text compression down telecommunication lines, by satellite and otherwise.
For the future, we have a fact (consistent 7%) without a theory. We will see whether other languages give us a consistent 7%.
Dr Eugene Winter, a language expert, is critical of the Lob Corpus, and I feel that further analysis should be of other bodies of text. Analysis should be by batch, 100,000 words at a time, and all the results delivered to me for accumulation, both to check on the main results and also to get a feel for variability - standard deviation - between different bodies of text.
If this remarkable 7% figure remained unknown until I stumbled on it, this would point to a gulf between the humanities (linguistics) and computer technology. It would mean that we have been misled by all the noise about structuralism in literary analysis into thinking that computers were now being used, whereas quite clearly they are not. However, perhaps this article will lead many students to go for fashionable computer-based Ph.D. research projects. [In the event, it was ignored. Ivor Catt aug03.]
* Technical Memorandum No. CIR1087 by M. Sreetharan et al., "Text Compression with Property 1a", a report from a government-funded project at Brunel University Department of Electrical Engineering and Electronics to investigate the use of my computer architecture (UK Pat. 1 525 048) in text compression.
**K. Hofland and S. Johansson, Word Frequencies in British and American English, pub. The Norwegian Computing Centre for the Humanities, Bergen, 1982.
***This is, of course, the average length of a word in the text coming from the most common 64, rather than the average length of these 64 words (which is 3.1 characters!) - an important distinction, difficult to describe.
****This number seven can be varied between 5 and 10 with very little penalty - only 3% or so.
*****The average length of uncommon words is seven characters.
If the seven per cent rule were a law, then:
7% of the words in any long text would be the, the most common word in the language;
14% of the words in any long text would be the or of, the two most common words;
21% of the words in any long text would be the, of, and or to, the four most common words;
28% of the words would be the, of, and, to, a, in, that or is, the eight most common words.
Each time we doubled the catchment, we would account for 7% more of the words in the text. In 14 such steps we would have included 2 to the power 14 = 16384 words and accounted for 98% of the words in the long text.
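The idealized arithmetic above can be written out in a few lines. This is a sketch of the article's reasoning, not an empirical model, and it is capped at 100% since fifteen 7% increments would otherwise exceed the whole text:

```python
def rule_coverage(dictionary_words):
    """Predicted % of a long text covered by the most common n words,
    where n is a power of two, under the idealized seven-per-cent rule."""
    doublings = dictionary_words.bit_length() - 1  # doublings from one word
    return min(100, 7 * (doublings + 1))

for n in (1, 2, 4, 8, 16384):
    print(n, rule_coverage(n))  # 7, 14, 21, 28 ... saturating at 100
```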
In practice, the seven-per-cent rule only loosely controls the pattern initially, although you will see from the full table that, after the first four words, the 7% rule applies remarkably rigidly. [See Wireless World April 1988 for a fuller analysis]
Catchment    = number of words, starting with the most common.
Cumulative % = percentage of the text that these words represent.
Added %      = percentage of the text that the additional words at each step represent.
Avg length   = average length, in characters, of the additional words at each step.

Catchment:      1    2    4    8   16   32   64  128  256  512 1024 2048  4096  8192 16384
Cumulative %:   7   10   16   22   29   37   44   51   58   64   71   79  (86)  (93) (100)
Added %:        7    3    5    7    7    8    7    7    6    7    7    8   (7)   (7)   (7)
Avg length:     3    2  2.5  2.0  2.5  2.9  2.9  3.9    ?    ? (5.1)    ?  7.2     ?  7.6

Figures in brackets are estimates.