Zipf's Law

George Kingsley Zipf was a philologist and linguist whose work focused primarily on the statistics of language, in English and in many other languages.

He observed that the second most used word in a book, essay, website, etc. very often appears about half as many times as the most used word, the third most used word about a third as many times, the fourth a quarter as many times, and so on.
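As a rough illustration of that proportion, here is a minimal sketch in Python. It assumes the idealized form of the law, where the word at a given rank appears `top_count / rank` times; the function name and the figure of 1200 occurrences are made up for the example, and real texts only approximate these values.

```python
def zipf_expected_counts(top_count, n_ranks):
    """Counts predicted by an idealized Zipf's Law: count(rank) = top_count / rank."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]


# Hypothetical book whose most used word appears 1200 times.
for rank, count in enumerate(zipf_expected_counts(1200, 4), start=1):
    print(rank, count)
```

With these invented numbers, the second word would be expected about 600 times, the third about 400 and the fourth about 300.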

Today a huge share of the world's articles and books is available on the Internet and collected in the online corpus of each language, so it is not difficult to count how many times each word appears, both in such a corpus and in Wikipedia, and the result is always similar.
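The counting itself can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the text is already loaded as a string, a "word" is any run of letters, and upper and lower case are treated as the same word. The function name `rank_words` and the sample sentence are invented for the example; a real corpus would need more careful tokenization.

```python
import re
from collections import Counter


def rank_words(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Assumption: a word is any run of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common()


sample = "la casa de la abuela y la casa de la playa"
ranked = rank_words(sample)
top_count = ranked[0][1]

# Print each word's rank, its observed count, and the idealized Zipf prediction.
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count, round(top_count / rank, 1))
```

A sentence this short is far too small to show the law; the pattern only emerges with long texts or whole corpora.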

Plotted as a graph, the result is almost identical in all languages. There are some differences among the first thousand most used words, but beyond the first thousand the frequency of use of the rest is usually the same, and even within the first thousand words the frequencies are very similar.

What is even more curious is that this rule applies to every language in the world, even to languages that have not yet been translated.

As a curiosity, here are the twenty-five most used words in Spanish (the language as a whole, not only that of Spain, since books and texts from Latin America are also counted):

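1. de (of)  2. la (the)  3. que (that)  4. el (the)  5. en (in)  6. y (and)  7. a (to)  8. los (the)  9. se (oneself)  10. del (of the)  11. las (the)  12. un (a)  13. por (by)  14. con (with)  15. no (no)  16. una (a)  17. su (his, her, their)  18. para (for)  19. es (is)  20. al (to the)  21. lo (it)  22. como (as)  23. más (more)  24. o (or)  25. pero (but)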

And while the word "no" is the fifteenth most used word in written Spanish, with a total of 1,465,503 occurrences, its opposite, the word "sí" (yes), is in position ninety-four, with a total of 108,631 occurrences: almost fourteen times fewer.

Furthermore, the first word with some meaning of its own is "all", in position thirty-seven, with 247,340 occurrences, followed by the word "years", in position forty-seven, with 203,027 occurrences.

ROYAL SPANISH ACADEMY: Data bank (CREA) [online]. Reference corpus of current Spanish. <http://www.rae.es> [2019-01-24]

In addition, in many articles and small web pages in Spanish, the 100 most used words make up about 50% of everything that is written, while most of the remaining words appear only once.
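Both figures can be checked on any text once its words have been counted. The following sketch assumes the counts are already held in a `collections.Counter`. The function name `coverage_stats` is invented for the example, and it reads "appear only once" as the share of distinct words with a count of exactly one, which is an interpretation rather than something the source states.

```python
from collections import Counter


def coverage_stats(counts, top_n):
    """Return (share of all text covered by the top_n words,
    share of distinct words that appear exactly once)."""
    total = sum(counts.values())
    top_total = sum(count for _, count in counts.most_common(top_n))
    once = sum(1 for count in counts.values() if count == 1)
    return top_total / total, once / len(counts)


# Toy example: 8 words in total, 4 distinct words.
toy_counts = Counter("a a a a b b c d".split())
top_share, once_share = coverage_stats(toy_counts, top_n=1)
print(top_share, once_share)
```

For a real article, `top_n=100` would give the share of the text covered by its 100 most used words.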

Zipf's Law is not confined to language: it also holds in the organization of protein sequences, the intensity of solar flares, the population of cities, the number of times web pages are visited, the frequency of surnames, the number of phone calls people make, the popularity of chess openings, and an endless number of other cases.

Figure: Zipf's Law as it appears when investigating the number of workers in the largest companies in the United States.
