Do you know what the most frequently used word in the English language is?
According to an analysis of the British National Corpus, a 100-million-word collection of samples of written and spoken language from a wide range of sources, the most frequently used word in the English language is "the".
The word "the" accounts for nearly 6% of everything we say, read or write.
Source: screenshot from this cool site: Wordcount
The top 20 words are, in order: "the", "of", "and", "to", "a", "in", "is", "I", "that", "it", "for", "you", "was", "with", "on", "as", "have", "but", "be", "they".
It seems like fun trivia, but is there something more to it?
It turns out that it doesn't matter whether we analyse an entire language, just one book or one post: almost every time, an interesting pattern emerges.
Zipf's law
Plotted on a log-log graph, word frequency against rank follows a nice straight line: a power law.
This pattern is called Zipf's law. It states that, given a large enough sample of natural language, the frequency of any word is inversely proportional to its rank in the frequency table.
Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
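To make the "one over the rank" idea concrete, here is a minimal sketch that counts words in a text and puts the actual counts next to what an idealised Zipf curve would predict. The function names and the sample sentence are made up for illustration, and a text this tiny is far too small to test the law properly; a whole book would be a fairer input.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts, start=1)]

def zipf_prediction(top_count, rank):
    """Idealised Zipf count for a rank: the top word's count divided by the rank."""
    return top_count / rank

# Toy sample, invented for this example only.
sample = "the cat and the dog and the bird saw the fish"
table = rank_frequency(sample)
top_count = table[0][2]

# Print the observed count next to the idealised prediction for the top ranks.
for rank, word, count in table[:3]:
    print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

On real text the observed counts will wobble around the prediction rather than match it exactly, which is why the law is usually checked on a log-log plot instead of row by row.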
The law is named after the American linguist George K. Zipf (1902–1950), who popularized it and sought to explain it, though he did not claim to have originated it.
Zipf's law isn't limited to English. It has been observed in other languages too, in fact in nearly every one that has been studied.
Isn't it funny how something as complex and grandiose as language can be predicted in such a simple way?
And it isn't only language. Zipf-like patterns can also be found in:
- citations of scientific papers
- the cumulative distribution of the number of “hits” received by web sites
- copies of books sold
- magnitude of earthquakes
- intensity of solar flares
- wealth of the richest people
- protein sequences
80-20 Rule
The Zipf distribution is the discrete counterpart of the continuous Pareto distribution.
The Pareto principle states that, for many events, roughly 80% of the effects come from 20% of the causes.
Joseph M. Juran suggested the principle and named it after the Italian economist Vilfredo Pareto.
Pareto showed that approximately 80% of the land in Italy was owned by 20% of the population.
Pareto also observed that 20% of the pea pods in his garden contained 80% of the peas.
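A quick way to check an 80/20 claim on your own numbers is to sort them and see what share of the total the top 20% holds. The sketch below does that; `top_share` and the `holdings` list are hypothetical names, and the data is not real land or wealth data, just a Zipf-like list where the item at rank r holds 1/r units.

```python
def top_share(values, top_fraction=0.2):
    """Share of the total held by the largest `top_fraction` of the items."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))  # how many items count as "the top"
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical Zipf-like holdings: the item at rank r gets 1/r units.
holdings = [1 / r for r in range(1, 1001)]
print(f"Top 20% hold {top_share(holdings):.0%} of the total")
```

With 1,000 items the share comes out a little under 80%. The exact figure depends on how many items there are and how steep the distribution is, which is one more reason not to read "80/20" as an exact constant.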
What does it mean today?
Pareto's Principle can also be observed in our daily lives.
The 80/20 rule should not be taken too literally. It is merely a symbol of the interesting disproportions between cause and effect that occur in the world we create.
Examples of Pareto's Principle I've found interesting:
- 80% of word occurrences come from 20% of the words
- 80% of sales come from 20% of customers
- 80% of complaints come from 20% of issues
- 85% of Facebook’s visitors are looking at only 8% of overall images
- Most people spend 80% of their time with 20% of their friends
- 20% of activities produce 80% of results
Image Source - Health care expenses by percentile, U.S.
In 2002, Microsoft reported that 80% of crashes were caused by 20% of the bugs detected.
Possible Explanations
Although Zipf’s Law holds for most languages, we can't really tell why.
It may be explained to some extent by the statistical analysis of randomly generated texts.
The theory is that the rank distribution arises naturally from the fact that word length plays a part: long words tend not to be very common, whilst shorter words are.
But there are still cases that don't fit this hypothesis. Take word frequencies within narrow groups of words, for example: taboo words like "sex", or the names of planets, days and chemical elements. These are highly constrained by the natural world.
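The random-text idea is easy to try out. Here is a minimal sketch, assuming a made-up "typist" that hits one of four letters or the space bar with equal probability; every run of letters between spaces counts as a word. The function name, the alphabet and the fixed seed are my own choices for illustration.

```python
import random
from collections import Counter

def random_typing(n_chars, letters="abcd", seed=0):
    """Type n_chars random characters and split the result into 'words'."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    alphabet = letters + " "       # the space bar is just one more key
    text = "".join(rng.choices(alphabet, k=n_chars))
    return text.split()

words = random_typing(200_000)
ranked = Counter(words).most_common()

# Look at a few ranks: short words should sit at the top of the table.
for rank in (1, 5, 21):
    word, count = ranked[rank - 1]
    print(rank, repr(word), count)
```

The single-letter words take the top ranks, two-letter words come next, and so on, so frequency falls off with rank even though nothing here has any meaning. That is the sense in which random text can imitate the pattern.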
Statistical analysis doesn't explain that.
The principle of least effort is another possible explanation. Zipf himself proposed that word frequencies in language have something to do with the competing interests of speakers and listeners. Speakers tend to prefer using fewer words to express their ideas, while listeners prefer it when there are more words. On this view, Zipf's law is the result of a compromise between speakers and listeners over the number of words used.
Another approach is called preferential attachment.
For example, posts, videos or images that already have many views tend to get more views.
What happens is that some quantity, typically some form of wealth or credit, is distributed among a number of individuals or objects according to how much they already have, so that those who are already wealthy receive more than those who are not.
Once a word is used it is more likely to be used again.
But there doesn't need to be a conscious effort to do it. It also happens naturally.
Imagine having a pile of unconnected chain links.
By picking two out of the pile and linking them together, you would create a longer chain that is now more likely to be picked again at random, simply because it is longer. Repeating the process would tend to produce chain lengths that follow a Zipf-like pattern.
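The chain-link thought experiment can be simulated in a few lines. This is only a sketch under my own assumptions: "picking from the pile" is modelled as grabbing a random link, so a chain is chosen with probability proportional to its length, and the names `link_chains`, `n_links` and `n_merges` are hypothetical. It shows how uneven the lengths become; it does not prove that they follow Zipf's law exactly.

```python
import random

def link_chains(n_links, n_merges, seed=0):
    """Join chains n_merges times, picking each chain in proportion to its length."""
    if n_merges >= n_links:
        raise ValueError("n_merges must be smaller than n_links")
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    chains = [1] * n_links         # every chain starts as a single link
    for _ in range(n_merges):
        # Longer chains are more likely to be grabbed (weights = lengths).
        i = rng.choices(range(len(chains)), weights=chains)[0]
        first = chains.pop(i)
        j = rng.choices(range(len(chains)), weights=chains)[0]
        second = chains.pop(j)
        chains.append(first + second)
    return sorted(chains, reverse=True)

lengths = link_chains(n_links=2000, n_merges=1000)
print("longest chains:", lengths[:5])
print("chains still of length 1:", lengths.count(1))
```

Nobody in this simulation is trying to build a long chain. The lopsided result comes purely from the rule that whatever is already big gets picked more often.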
Conclusion
Zipf's law is one of those empirical rules that characterize a surprising range of real-world phenomena remarkably well. I find it interesting just how many things follow it.
Source: Steem Whitepaper
To finish, I'll leave you with a Steemit payout distribution graph, and you can guess which pattern it follows.
I hope you liked the topic.