Electronics & Wireless World apr1988, p350
The seven-per-cent rule
Ivor Catt believes that what appears to be a new rule of linguistics could have consequences in data storage.
My discovery of the seven-per-cent rule demonstrates how new academic disciplines can grow out of mundane technological advances, in this case the reduction in the cost of memory and of data processing.
In the past, data compression techniques involved replacing short strings like ion, the, and by a short code. There was an underlying assumption, valid at the time, that a 20,000 word look-up dictionary of words for conversion into shorthand was impracticable. An August 1977 project report* suggested a dictionary of 200 strings, with ion, the, and and ation as examples of strings to be compressed. At that time, a 20,000 word look-up dictionary in 200,000 bytes of memory would perhaps have cost £100,000. Huffman, Shannon and Fano were associated with the underlying theory of compression of three- or four-character strings.
The relations between cost, speed and size of semiconductor rom and ram recently passed the point where a look-up dictionary of 20,000 words became practicable, its price dropping to £100.
Since a text of 100,000 words is virtually all encompassed by a dictionary of 20,000 words, then if the average word length is six characters, we can today consider storing codes of (less than) two characters rather than the original longer words. The earlier Huffman idea of character-by-character searching, looking for frequently recurring strings, seems to be obsolete.
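The arithmetic behind that claim is easily checked. A minimal sketch, using the article's own figures (a 20,000-word dictionary and six-character average words); the variable names are mine:

```python
import math

# A dictionary of 20,000 entries needs fixed codes of ceil(log2(20000)) bits,
# which is 15 bits - just under two 8-bit characters per word - whereas the
# words being replaced average six characters plus an inter-word space.
dictionary_size = 20_000
code_bits = math.ceil(math.log2(dictionary_size))  # 15 bits per code
word_bits = (6 + 1) * 8                            # six characters plus a space

print(code_bits)                        # 15
print(round(word_bits / code_bits, 1))  # about 3.7-fold compression
```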
The Lob Corpus** is a computer analysis of one million words of text from diverse sources, and is invaluable for our purpose. We find that it would be possible, lacking systematically developed information, to make major errors - for instance, to think that the average word is six characters long. Words in the Lob Corpus analysis are ranked in order of frequency of occurrence, and we find a number of interesting things. First, a form of text compression by word has already occurred in English, in that the most common words are generally substantially shorter than the less common words. Since these common words are so short (the most common 64 words averaging 2.6 characters***, the next most common 64 words averaging 3.9 characters), and since more than half the words in the text come from these 128 words, it follows that the saving of the space character which is implicit in text compression by word is one of the most compelling, but confusing, reasons for text compression by word. (With the inter-word space, the 2.6 becomes 3.6 and the 3.9 becomes 4.9.) Possibly part of our system of word recognition when we read is that shortness tells us that the word is common, and vice versa. A word's length may be part of its informational content.
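The distinction between the average length of a word as it occurs in the text and the average length of the word list itself is easy to confuse. A small sketch with invented frequencies (illustrative only, not Lob Corpus data) makes it concrete:

```python
# Invented word frequencies, for illustration only - not Lob Corpus figures.
freqs = {"the": 70, "of": 30, "and": 25, "to": 25, "that": 10, "dictionary": 1}

def text_weighted_average(freqs):
    """Average length of a word as it occurs in the text (frequency-weighted)."""
    total = sum(freqs.values())
    return sum(len(word) * n for word, n in freqs.items()) / total

def list_average(freqs):
    """Average length of the words in the list, each counted once."""
    return sum(len(word) for word in freqs) / len(freqs)

# The short, common words dominate the frequency-weighted figure, so it
# comes out well below the plain list average - the same effect as the
# 2.6 versus 3.1 characters reported for the most common 64 words.
print(round(text_weighted_average(freqs), 1))  # 2.8
print(list_average(freqs))                     # 4.0
```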
It appears that something like a 7-bit**** code for the most common 128 words, and one bit to indicate whether the code is 7-bit or 15-bit, would give efficient (×3 or so) text compression in a manner easy to effect with our hardware*****. It would also not significantly interfere with text retrieval, although it would somewhat upset character-by-character search. It is 20% more efficient than using a 15-bit code for everything: 100 words consume 1200 bits instead of 1500 bits. There is a further small improvement, down to 1100 bits for 100 words, if four (or eight) different code lengths are used, requiring, of course, a 2- or 3-bit number to specify the code length used. Increasing the number of code lengths improves the efficiency by reducing the pattern length for words, but the improvement is more or less exactly cancelled by the waste involved in recording which code length is used.
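The two-length scheme can be sketched as follows. This is a hypothetical illustration of the bit accounting; the word list and the sample text are assumptions of mine, not the project hardware:

```python
# One flag bit selects between a 7-bit code (for the 128 most common words)
# and a 15-bit code (for everything else), so common words cost 8 bits and
# uncommon words cost 16 bits each.
COMMON_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is"}  # first 8 of 128

def encoded_bits(words, common):
    """Total bits to encode a word sequence under the two-length scheme."""
    return sum(8 if word in common else 16 for word in words)

# With half of a 100-word text drawn from the common set - the article's
# estimate for the top 128 words - the cost is 50*8 + 50*16 = 1200 bits,
# against 1500 bits for a flat 15-bit code: the 20% saving quoted above.
sample = ["the", "of"] * 25 + ["compression", "dictionary"] * 25
print(encoded_bits(sample, COMMON_WORDS))  # 1200
print(len(sample) * 15)                    # 1500
```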
The fact that use of two code lengths gives only a 20% advantage indicates that storing the look-up dictionary in ram and altering the "most common" 128-word set in real time, to suit experience with texts, will probably not be highly worthwhile, given that ram is more expensive than rom. However, such tradeoffs need to be discussed at great length. One large computer company with $4 billion turnover sells $1 billion worth of disc files each year. Although some of the file space will be taken by data, it is possible that text occupies a significant percentage of that $1 billion worth of hardware. This indicates that there is a lot of money available for effecting more efficient text compression. A similar argument could be made for text compression down telecommunication lines, by satellite and otherwise.
For the future, we have a fact (consistent 7%) without a theory. We will see whether other languages give us a consistent 7%.
Dr Eugene Winter, a language expert, is critical of the Lob Corpus, and I feel that further analysis should be of other bodies of text. Analysis should be by batch, 100,000 words at a time, and all the results delivered to me for accumulation, both to check on the main results and also to get a feel for variability - standard deviation - between different bodies of text.
If this remarkable 7% figure remained unknown until I stumbled on it, this would point to a gulf between the humanities (linguistics) and computer technology. It would mean that we have been misled by all the noise about structuralism in literary analysis into thinking that computers were now being used, whereas quite clearly they are not. However, perhaps this article will lead many students to go for fashionable computer-based Ph.D. research projects. [In the event, it was ignored. Ivor Catt aug03.]
* Technical Memorandum No. CIR1087 by M. Sreetharan et al., "Text Compression with Property 1a", a report from a government-funded project at Brunel University Department of Electrical Engineering and Electronics to investigate the use of my computer architecture (UK Pat. 1 525 048) in text compression.
**K. Hofland and S. Johansson, Word Frequencies in British and American English, pub. The Norwegian Computing Centre for the Humanities, Bergen, 1982.
***This is, of course, the average length of a word in the text coming from the most common 64, rather than the average length of these 64 words (which is 3.1 characters!) - an important distinction, difficult to describe.
****This number seven can be varied between 5 and 10 with very little penalty - only 3% or so.
*****The average length of uncommon words is seven characters.
If the seven per cent rule were a law, then:
7% of the words in any long text would be the, the most common word in the language;
14% of the words in any long text would be the or of, the two most common words;
21% of the words in any long text would be the, of, and or to, the four most common words;
28% of the words would be the, of, and, to, a, in, that or is, the eight most common words.
Each time we doubled the catchment, we would account for 7% more of the words in the text. In 14 such steps we would have included 2 to the power 14 = 16384 words and accounted for 98% of the words in the long text.
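The idealized arithmetic above can be written out in a few lines. This is a sketch of the article's reasoning, not an empirical model, and it is capped at 100% since fifteen 7% increments would otherwise exceed the whole text:

```python
def rule_coverage(dictionary_words):
    """Predicted % of a long text covered by the most common n words,
    where n is a power of two, under the idealized seven-per-cent rule."""
    doublings = dictionary_words.bit_length() - 1  # doublings from one word
    return min(100, 7 * (doublings + 1))

for n in (1, 2, 4, 8, 16384):
    print(n, rule_coverage(n))  # 7, 14, 21, 28 ... saturating at 100
```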
In practice, the seven-per-cent rule only loosely controls the pattern initially, although you will see from the full table that, after the first four words, the 7% rule applies remarkably rigidly. [See Wireless World April 1988 for a fuller analysis]
Catchment    = number of words, starting with the most common.
Cumulative % = percentage of the text that these words represent.
Added %      = percentage of the text that the additional words at each step represent.
Avg length   = average length, in characters, of the additional words at each step.

Catchment:      1    2    4    8   16   32   64  128  256  512 1024 2048  4096  8192 16384
Cumulative %:   7   10   16   22   29   37   44   51   58   64   71   79  (86)  (93) (100)
Added %:        7    3    5    7    7    8    7    7    6    7    7    8   (7)   (7)   (7)
Avg length:     3    2  2.5  2.0  2.5  2.9  2.9  3.9    ?    ? (5.1)    ?  7.2     ?  7.6

Figures in brackets are estimates.