The English and Arabic corpora of the Holy Quran are a rich source for statistical analysis. For instance, the two text corpora together hold roughly a quarter of a million words and many thousand distinct words. A rich dataset such as the Holy Quran, therefore, makes for an exciting journey of data exploration.

In this mini-research project, we perform various studies on the Holy Quran [English and Arabic corpora], aiming to discover interesting patterns. The study involves some technical details for the more scientific audience, but rest assured, you can follow along.

The Corpora

Corpora, the plural of corpus, are collections of texts. In our study, the corpora comprise two texts of the Holy Quran, one in Arabic and one in English. The Arabic text is a digital copy of the book in its original form. The English text is the famous translation of the Holy Quran made by Yusuf Ali about a century ago.

Necessary Filtering

The English text of the Holy Quran, just like any corpus of English text, contains stop words such as “the”, “and” and “a”. Stop words are also known as filler words, and removing them is general practice for data scientists performing natural language processing.

The rationale for removing stop words is to keep them from obscuring the text’s true message. For example, the word “the” appears many times in an English corpus, but it reveals little about the text itself. It is therefore important to filter such words out.

In addition to stop word filtering [using a large 2,400-word English and Arabic set], we also convert the text to lowercase [for English] and remove all punctuation [for English]. We retain the apostrophe, though, e.g. { God’s and Gods are different }.

You can skip the next few paragraphs, as they are meant for a more scientific audience. We include them so that others can reproduce the results for their own perusal and academic experiments. We will share the code on GitHub later.

We use the Natural Language Toolkit’s stop word sets for English and Arabic. We also make sure bogus words such as ‘p’ for the paragraph marker and ‘section’ for the “Surah” marker are removed. This ensures the frequency distributions are not inflated for no reason.

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))  # pass 'arabic' for the Arabic corpus

This chunk of code demonstrates the subsequent filtering: lowercase conversion, then punctuation and digit removal. This presumably affects the English corpus of the Holy Quran only. The rationale for such filtering is to keep tokenization from inflating the statistical results significantly.

# Filtering
import re

# file_read is assumed to hold the raw corpus text as a single string
filter_lowercase = file_read.lower()

# Strip punctuation; the apostrophe is deliberately absent from this set
regex = re.compile('[%s]' % re.escape("!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~"))
filter_lowercase_punctuation = regex.sub('', filter_lowercase)
# Drop every digit character
filter_lowercase_punctuation_digits = ''.join(i for i in filter_lowercase_punctuation if not i.isdigit())

# Convert corpus to a list
filter_lowercase_punctuation_digits_list = filter_lowercase_punctuation_digits.split()
# Remove stop words
quran_minus_filters_stopwords = [i for i in filter_lowercase_punctuation_digits_list if i not in stopset]
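
For readers who want to try the filtering without NLTK installed, here is a minimal sketch of the same steps wrapped in one function. It is not our exact pipeline: the function name `filter_corpus`, the tiny `demo_stop_words` set and the `demo_markers` set are hypothetical stand-ins, and `str.translate` is swapped in for the regex above. Also note that `string.digits` covers ASCII digits only, whereas the `isdigit()` check above catches other Unicode digits as well.

```python
import string

# Punctuation to strip: everything in string.punctuation except the apostrophe,
# so that "god's" and "gods" remain different words.
PUNCTUATION = string.punctuation.replace("'", "")
STRIP_TABLE = str.maketrans("", "", PUNCTUATION + string.digits)


def filter_corpus(text, stop_words, markers=()):
    """Lowercase, strip punctuation and ASCII digits, drop stop words and markers."""
    cleaned = text.lower().translate(STRIP_TABLE)
    blocked = set(stop_words) | set(markers)
    return [word for word in cleaned.split() if word not in blocked]


# Hypothetical stand-ins for the real stop word set and the corpus markers.
demo_stop_words = {"the", "and", "of"}
demo_markers = {"p", "section"}

tokens = filter_corpus("Section 1 p The signs of God, and God's mercy.",
                       demo_stop_words, demo_markers)
print(tokens)
```

Folding the markers into the same blocked set as the stop words keeps the final filtering to a single pass over the token list.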

The Analyses

This section is what you’ve come to this page for. Let’s start with the fundamentals: how many words are there in the Arabic and English texts of the Holy Quran? This is a true word count, not a token count. Recall that we removed punctuation and digits.

How many words are in the Holy Quran?


Total words in Quran: 170959
Total distinct words in Quran: 7186

Total words in Quran [stopwords removed]: 75285
Total distinct words in Quran [stop words removed]: 7060
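
Counts like these boil down to the length of the token list and the size of its set of unique tokens. A minimal sketch, assuming the corpus is already a list of filtered tokens; the helper name `word_counts` and the sample list are made up for illustration:

```python
def word_counts(tokens):
    """Return (total words, distinct words) for a list of tokens."""
    return len(tokens), len(set(tokens))


# Made-up sample; in practice this would be the filtered corpus list.
sample_tokens = ["god", "lord", "god", "day", "lord", "god"]

total, distinct = word_counts(sample_tokens)
print("Total words:", total)
print("Total distinct words:", distinct)
```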


For Arabic, we used a dedicated stop word set that takes [tajweed] into account. Tajweed is the standard way of pronouncing the Holy Quran. Common Arabic stop word sets wouldn’t have helped here, as they lack the custom characters.

Total words in Quran: 78376
Total distinct words in Quran: 17658

Total words in Quran [stopwords removed]: 59977
Total distinct words in Quran [stopwords removed]: 17307

As you might have noticed, the Holy Quran in Arabic employs 17,307 distinct words [post filtering], compared to 7,060 distinct words [post filtering] in the English text. We find this very interesting.

It seems the Arabic language has a richer vocabulary for getting the message across than the English language. Thanks to tajweed, Arabic is probably the best language to deliver God’s message, as it is highly effective. This is due to the syntactical power of tajweed.

Tajweed can change the meaning of the same word through the alteration of ‘pronunciation marks’ below or above it. Although the word might look the same to a non-Arabic reader, those little characters around it make a whole lot of difference. Sophistication it is.
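
To get a feel for how much those marks contribute to the distinct word count, here is a small sketch that strips them and checks how many forms remain. This is an illustration only, not part of our pipeline, which keeps the marks. The helper name `strip_marks` is made up, and the three sample forms are built from Unicode escapes so the example does not depend on how your editor renders Arabic.

```python
import unicodedata


def strip_marks(word):
    """Remove combining marks (Unicode category Mn), which include Arabic vowel signs."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# Alef, lam, lam + shadda + fatha, heh: the common stem of the word for "God".
STEM = "\u0627\u0644\u0644\u0651\u064E\u0647"
# The same stem ending in kasra, damma and fatha respectively.
forms = [STEM + "\u0650", STEM + "\u064F", STEM + "\u064E"]

print("Distinct forms with marks:", len(set(forms)))
print("Distinct forms without marks:", len({strip_marks(f) for f in forms}))
```

With the marks in place the three forms count as three different words, while without them they collapse into one.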

Word Frequency Distributions

A word frequency distribution answers the question: what are the most commonly occurring words in a corpus? For the Holy Quran, we calculate the word frequency distributions of the Arabic and English corpora. We use our fully filtered set, with all stop words removed.
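
A frequency distribution of this kind can be sketched with nothing but the Python standard library. We use `collections.Counter` here as a stand-in; NLTK's `FreqDist` offers a similar `most_common` method. The helper name `top_words` and the sample tokens are made up for illustration.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


# Made-up sample; in practice this would be the fully filtered corpus list.
sample_tokens = ["god", "lord", "god", "day", "lord", "god"]

for word, count in top_words(sample_tokens, 3):
    print([(word, count)])
```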


The plot above shows the word frequency distribution [non-cumulative] of the 50 most commonly occurring words in the English corpus of the Holy Quran. If that plot is hard for you to read, exercise your head and rotate it. If you are a touch lazier, we have enclosed the data dump below.

Top 100 most frequent words in Holy Quran [English]

Most commonly occurring words:
[('god', 3064)]
[('ye', 2010)]
[('lord', 950)]
[('thou', 771)]
[('say', 765)]
[('said', 712)]
[('shall', 657)]
[('thee', 611)]
[('one', 528)]
[('day', 521)]
[('thy', 514)]
[('people', 510)]
[('us', 470)]
[('believe', 468)]
[('o', 442)]
[('earth', 413)]
[('things', 375)]
[('men', 364)]
[('may', 347)]
[('signs', 340)]
[('sent', 324)]
[('truth', 319)]
[('indeed', 318)]
[('among', 311)]
[('verily', 309)]
[('hath', 300)]
[('would', 288)]
[('made', 288)]
[('fear', 273)]
[('good', 270)]
[('evil', 265)]
[('faith', 250)]
[('come', 247)]
[('give', 246)]
[('apostle', 240)]
[('see', 239)]
[('merciful', 230)]
[('penalty', 227)]
[('know', 223)]
[('life', 212)]
[('man', 209)]
[('turn', 209)]
[('therein', 205)]
[('book', 201)]
[('heavens', 200)]
[('fire', 197)]
[('make', 195)]
[('back', 192)]
[('behold', 192)]
[('power', 192)]
[('away', 186)]
[('unto', 180)]
[('knowledge', 179)]
[('reject', 177)]
[('gracious', 176)]
[('moses', 174)]
[('except', 174)]
[('right', 172)]
[('unbelievers', 172)]
[('truly', 169)]
[('deeds', 168)]
[('way', 163)]
[('let', 162)]
[('mercy', 161)]
[('doth', 159)]
[('even', 158)]
[('every', 157)]
[('reward', 157)]
[('take', 157)]
[('hearts', 157)]
[('name', 155)]
[('best', 153)]
[('created', 153)]
[('thus', 153)]
[('full', 152)]
[('like', 151)]
[('clear', 148)]
[('call', 148)]
[('never', 145)]
[("god's", 143)]
[('forth', 143)]
[('given', 139)]
[('judgment', 136)]
[('follow', 136)]
[('worship', 136)]
[('punishment', 133)]
[('came', 132)]
[('guidance', 131)]
[('well', 129)]
[('ask', 124)]
[('sura', 122)]
[('nothing', 122)]
[('wrong', 121)]
[('without', 119)]
[('nay', 118)]
[('two', 118)]
[('apostles', 118)]
[('message', 116)]
[('hereafter', 116)]
[('bring', 115)]

We are quite stunned at the power of the top 100 high-frequency words in the English corpus of the Holy Quran. If you read them in order, starting from the word “god” [the most frequent word], you will see that they summarize the message. It’s a very powerful discovery.


This is the Arabic plot. Notice how closely it resembles the English plot. They are visibly similar. Also, just as “god” tops the English list, forms of the word “God” (اللَّهِ, اللَّهُ, اللَّهَ) sit near the top of the Arabic list. Even though the two languages are different, we see the correlation as a way to build confidence in a translated text.

Most commonly occurring words:
[('فِي', 1185)]
[('اللَّهِ', 940)]
[('الَّذِينَ', 810)]
[('اللَّهُ', 733)]
[('إِلَّا', 662)]
[('إِنَّ', 609)]
[('اللَّهَ', 592)]
[('قَالَ', 416)]
[('يَا', 350)]
[('ثُمَّ', 337)]
[('كَانَ', 323)]
[('ذَلِكَ', 280)]
[('الَّذِي', 268)]
[('قُلْ', 263)]
[('آمَنُوا', 253)]
[('قَالُوا', 250)]
[('وَاللَّهُ', 239)]
[('كَانُوا', 229)]
[('الْأَرْضِ', 219)]
[('هَذَا', 190)]
[('كَفَرُوا', 189)]
[('كُنْتُمْ', 188)]
[('شَيْءٍ', 179)]
[('السَّمَاوَاتِ', 175)]
[('وَالَّذِينَ', 164)]
[('إِنَّا', 156)]
[('أَيُّهَا', 150)]
[('إِنَّهُ', 147)]
[('حَتَّى', 142)]
[('بِاللَّهِ', 139)]
[('أُولَئِكَ', 133)]
[('الرَّحْمَنِ', 133)]
[('إِنِّي', 131)]
[('مُوسَى', 129)]
[('كُلِّ', 123)]
[('الرَّحِيمِ', 118)]
[('لِلَّهِ', 116)]
[('رَبِّكَ', 116)]
[('الدُّنْيَا', 115)]
[('بِسْمِ', 115)]
[('إِنَّمَا', 113)]
[('مِمَّا', 111)]
[('يَشَاءُ', 108)]
[('وَالْأَرْضِ', 108)]
[('مِنْكُمْ', 105)]
[('رَبِّ', 101)]
[('فَلَمَّا', 101)]
[('عَلِيمٌ', 100)]
[('أَنَّ', 99)]
[('عِنْدَ', 98)]
[('النَّاسِ', 92)]
[('رَبِّي', 91)]
[('يُؤْمِنُونَ', 86)]
[('وَقَالَ', 85)]
[('وَكَانَ', 84)]
[('تَعْمَلُونَ', 83)]
[('دُونِ', 83)]
[('كَذَلِكَ', 83)]
[('بَعْدِ', 82)]
[('السَّمَاءِ', 81)]
[('وَإِنَّ', 81)]
[('يَعْلَمُونَ', 81)]
[('رَبِّهِمْ', 80)]
[('لِلَّذِينَ', 79)]
[('أَنْتُمْ', 78)]
[('خَيْرٌ', 78)]
[('الْكِتَابَ', 78)]
[('الْمُؤْمِنِينَ', 78)]
[('الْكِتَابِ', 77)]
[('عَذَابٌ', 77)]
[('شَيْئًا', 77)]
[('بِالْحَقِّ', 74)]
[('رَبَّنَا', 72)]
[('سَبِيلِ', 72)]
[('فَإِنَّ', 71)]
[('النَّارِ', 70)]
[('الْقِيَامَةِ', 70)]
[('الظَّالِمِينَ', 64)]
[('كُنَّا', 63)]
[('إِنَّهُمْ', 62)]
[('يَعْلَمُ', 62)]
[('الْعَالَمِينَ', 61)]
[('وَقَالُوا', 61)]
[('الصَّالِحَاتِ', 59)]
[('لَعَلَّكُمْ', 59)]
[('لِي', 59)]
[('رَحِيمٌ', 59)]
[('قَوْمِ', 58)]
[('خَلَقَ', 58)]
[('بِآيَاتِنَا', 57)]
[('وَلَكِنْ', 57)]
[('الَّتِي', 57)]
[('جَاءَ', 57)]
[('شَاءَ', 56)]
[('يَعْمَلُونَ', 56)]
[('رَبَّكَ', 55)]
[('أَنْتَ', 55)]
[('الْآخِرَةِ', 55)]
[('قَلِيلًا', 55)]

Of course, as we discussed earlier, to understand the true meaning of the Holy Quran, there is no alternative to reading it in the language of its delivery, Arabic. Below is an overlay of the word frequency distribution of the Arabic and English plots.

Longest Words

We're getting there. This is getting interesting. The longest words in the Holy Quran [English and Arabic texts] are given below. We are not so sure about the English words; they seem to have been run together, probably due to a typographical error.


The longest words:


The longest words:

Bigram Analysis

A bigram is a sequence of two words. If the sentence is "John is a good man", the bigrams from this small dataset would be "John is", "is a", "a good" and "good man". Since the Holy Quran is a large dataset of natural language, we thought it would be interesting to see the most frequently occurring bigrams for both the Arabic and the English text.
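
The bigram example can be reproduced in a few lines. This is a minimal sketch with made-up helper names (`bigrams`, `most_frequent_bigrams`), pairing each token with its successor via `zip` and counting the pairs with `collections.Counter`:

```python
from collections import Counter


def bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


def most_frequent_bigrams(tokens, n=10):
    """Return the n most frequent bigrams as ((first, second), count) pairs."""
    return Counter(bigrams(tokens)).most_common(n)


sentence = "John is a good man".split()
print(bigrams(sentence))
```

Because the pairs are built from the token list as given, running this on a list with stop words already removed pairs up words that were not adjacent in the original sentence.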


The most frequently occurring bigrams in the English corpus of the Holy Quran are shown in the plot above; for convenience, they are also listed below in a text-only format with their frequency counts.

Most frequent bi-grams:
[(('thy', 'lord'), 227)]
[(('heavens', 'earth'), 172)]
[(('god', 'gracious'), 162)]
[(('o', 'ye'), 130)]
[(('god', 'hath'), 128)]
[(('name', 'god'), 123)]
[(('gracious', 'merciful'), 118)]
[(('ye', 'believe'), 116)]
[(('said', 'o'), 115)]
[(('ye', 'may'), 110)]
[(('god', 'god'), 99)]
[(('fear', 'god'), 98)]
[(('thou', 'art'), 87)]
[(('god', 'apostle'), 82)]
[(('god', 'ye'), 80)]
[(('day', 'judgment'), 80)]
[(('ye', 'shall'), 77)]
[(('o', 'lord'), 69)]
[(('oftforgiving', 'merciful'), 69)]
[(('reject', 'faith'), 65)]
[(('believe', 'god'), 63)]
[(('o', 'people'), 60)]
[(('say', 'god'), 60)]
[(('clear', 'signs'), 59)]
[(('verily', 'god'), 58)]
[(('besides', 'god'), 58)]
[(('god', 'doth'), 57)]
[(('turn', 'away'), 55)]
[(('grievous', 'penalty'), 54)]
[(('god', 'oftforgiving'), 53)]
[(('people', 'book'), 49)]
[(('signs', 'god'), 47)]
[(('full', 'knowledge'), 47)]
[(('thou', 'hast'), 47)]
[(('glad', 'tidings'), 47)]
[(('god', 'said'), 47)]
[(('thou', 'wilt'), 45)]
[(('say', 'ye'), 43)]
[(('lord', 'ye'), 43)]
[(('moses', 'said'), 42)]
[(('god', 'lord'), 41)]
[(('children', 'israel'), 41)]
[(('shall', 'ye'), 41)]
[(('exalted', 'might'), 40)]
[(('exalted', 'power'), 40)]
[(('ye', 'know'), 39)]
[(('one', 'another'), 38)]
[(('god', 'exalted'), 38)]
[(('dwell', 'therein'), 36)]
[(('ye', 'deny'), 36)]
[(('god', 'one'), 34)]
[(('believe', 'work'), 34)]
[(('life', 'world'), 34)]
[(('god', 'full'), 34)]
[(('establish', 'regular'), 34)]
[(('power', 'things'), 34)]
[(('say', 'o'), 34)]
[(('sent', 'thee'), 33)]
[(('say', 'lord'), 33)]
[(('unto', 'thee'), 33)]
[(('turn', 'back'), 33)]
[(('god', 'knows'), 32)]
[(('god', 'say'), 32)]
[(('cause', 'god'), 31)]
[(('favours', 'lord'), 31)]
[(('day', 'shall'), 31)]
[(('hast', 'thou'), 31)]
[(('created', 'heavens'), 31)]
[(('ye', 'ye'), 30)]
[(('said', 'ye'), 30)]
[(('wilt', 'thou'), 30)]
[(('ask', 'thee'), 30)]
[(('know', 'god'), 30)]
[(('ye', 'god'), 30)]
[(('work', 'righteousness'), 29)]
[(('receive', 'admonition'), 29)]
[(('thou', 'dost'), 29)]
[(('god', 'knoweth'), 29)]
[(('seest', 'thou'), 29)]
[(('praise', 'god'), 28)]
[(('one', 'day'), 28)]
[(('celebrate', 'praises'), 28)]
[(('except', 'god'), 27)]
[(('men', 'women'), 27)]
[(('lord', 'worlds'), 27)]
[(('night', 'day'), 27)]
[(('god', 'loveth'), 27)]
[(('even', 'though'), 27)]
[(('order', 'may'), 27)]
[(('knoweth', 'things'), 26)]
[(('ye', 'would'), 26)]
[(('good', 'things'), 26)]
[(('thee', 'thy'), 26)]
[(('reject', 'signs'), 26)]
[(('verily', 'thy'), 26)]
[(('things', 'god'), 26)]
[(('said', 'lord'), 26)]
[(('regular', 'charity'), 26)]
[(('o', 'moses'), 25)]
[(('hath', 'power'), 25)]


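And now, the Arabic text. By the way, several of the top bigrams line up across the two languages: ('name', 'god') corresponds to ('بِسْمِ', 'اللَّهِ'), and ('gracious', 'merciful') to ('الرَّحْمَنِ', 'الرَّحِيمِ'). Just wow!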

Most frequent bi-grams:
[(('إِنَّ', 'اللَّهَ'), 205)]
[(('الَّذِينَ', 'آمَنُوا'), 184)]
[(('فِي', 'الْأَرْضِ'), 176)]
[(('يَا', 'أَيُّهَا'), 142)]
[(('الَّذِينَ', 'كَفَرُوا'), 134)]
[(('الرَّحْمَنِ', 'الرَّحِيمِ'), 116)]
[(('بِسْمِ', 'اللَّهِ'), 115)]
[(('اللَّهِ', 'الرَّحْمَنِ'), 114)]
[(('السَّمَاوَاتِ', 'وَالْأَرْضِ'), 95)]
[(('أَيُّهَا', 'الَّذِينَ'), 92)]
[(('إِنَّ', 'الَّذِينَ'), 84)]
[(('فِي', 'السَّمَاوَاتِ'), 71)]
[(('دُونِ', 'اللَّهِ'), 71)]
[(('كُلِّ', 'شَيْءٍ'), 69)]
[(('سَبِيلِ', 'اللَّهِ'), 69)]
[(('إِنَّ', 'فِي'), 57)]
[(('وَعَمِلُوا', 'الصَّالِحَاتِ'), 53)]
[(('فِي', 'ذَلِكَ'), 52)]
[(('أَنَّ', 'اللَّهَ'), 52)]
[(('آمَنُوا', 'وَعَمِلُوا'), 50)]
[(('فَإِنَّ', 'اللَّهَ'), 47)]
[(('فِي', 'سَبِيلِ'), 45)]
[(('عِنْدَ', 'اللَّهِ'), 44)]
[(('غَفُورٌ', 'رَحِيمٌ'), 41)]
[(('وَكَانَ', 'اللَّهُ'), 40)]
[(('يَا', 'قَوْمِ'), 38)]
[(('قَالَ', 'يَا'), 38)]
[(('إِلَهَ', 'إِلَّا'), 37)]
[(('السَّمَاوَاتِ', 'فِي'), 37)]
[(('وَالَّذِينَ', 'آمَنُوا'), 35)]
[(('كَانُوا', 'يَعْمَلُونَ'), 35)]
[(('الْحَيَاةِ', 'الدُّنْيَا'), 35)]
[(('السَّمَاوَاتِ', 'وَالْأَرْضَ'), 34)]
[(('تَحْتِهَا', 'الْأَنْهَارُ'), 34)]
[(('تَجْرِي', 'تَحْتِهَا'), 34)]
[(('عَذَابٌ', 'أَلِيمٌ'), 33)]
[(('بَنِي', 'إِسْرَائِيلَ'), 33)]
[(('شَيْءٍ', 'قَدِيرٌ'), 33)]
[(('فَبِأَيِّ', 'آلَاءِ'), 32)]
[(('قَالَ', 'رَبِّ'), 31)]
[(('رَبِّكُمَا', 'تُكَذِّبَانِ'), 31)]
[(('آلَاءِ', 'رَبِّكُمَا'), 31)]
[(('وَاتَّقُوا', 'اللَّهَ'), 30)]
[(('إِنَّهُ', 'كَانَ'), 29)]
[(('اللَّهَ', 'كَانَ'), 29)]
[(('إِنَّ', 'رَبَّكَ'), 28)]
[(('شَاءَ', 'اللَّهُ'), 28)]
[(('فِي', 'الدُّنْيَا'), 28)]
[(('كُنْتُمْ', 'صَادِقِينَ'), 28)]
[(('اللَّهَ', 'وَرَسُولَهُ'), 28)]
[(('قَالُوا', 'يَا'), 26)]
[(('أَنْزَلَ', 'اللَّهُ'), 26)]
[(('أُولَئِكَ', 'الَّذِينَ'), 26)]
[(('كُنْتُمْ', 'تَعْمَلُونَ'), 26)]
[(('الَّذِينَ', 'أُوتُوا'), 26)]
[(('بِكُلِّ', 'شَيْءٍ'), 25)]
[(('رَبِّ', 'الْعَالَمِينَ'), 25)]
[(('اللَّهَ', 'غَفُورٌ'), 25)]
[(('وَأَنَّ', 'اللَّهَ'), 25)]
[(('فِي', 'الْآخِرَةِ'), 24)]
[(('الَّذِينَ', 'قَبْلِهِمْ'), 24)]
[(('ذَلِكَ', 'لَآيَاتٍ'), 24)]
[(('وَاللَّهُ', 'عَلِيمٌ'), 24)]
[(('يَا', 'مُوسَى'), 24)]
[(('الْعَزِيزُ', 'الْحَكِيمُ'), 24)]
[(('الَّذِينَ', 'ظَلَمُوا'), 24)]
[(('خَلَقَ', 'السَّمَاوَاتِ'), 23)]
[(('بِآيَاتِ', 'اللَّهِ'), 23)]
[(('إِلَّا', 'قَلِيلًا'), 23)]
[(('هَذَا', 'إِلَّا'), 22)]
[(('جَنَّاتٍ', 'تَجْرِي'), 22)]
[(('وَقَالَ', 'الَّذِينَ'), 21)]
[(('الَّذِي', 'خَلَقَ'), 21)]
[(('إِنَّ', 'هَذَا'), 21)]
[(('الْحَمْدُ', 'لِلَّهِ'), 21)]
[(('كَانَ', 'عَاقِبَةُ'), 21)]
[(('اللَّهَ', 'يُحِبُّ'), 21)]
[(('أَيُّهَا', 'النَّاسُ'), 21)]
[(('إِلَّا', 'الَّذِينَ'), 21)]
[(('ذَلِكَ', 'لَآيَةً'), 20)]
[(('السَّمَاءِ', 'مَاءً'), 20)]
[(('الْأَنْهَارُ', 'خَالِدِينَ'), 20)]
[(('فِي', 'ضَلَالٍ'), 20)]
[(('اللَّهُ', 'الَّذِي'), 20)]
[(('فِي', 'الْحَيَاةِ'), 20)]
[(('وَالَّذِينَ', 'كَفَرُوا'), 20)]
[(('وَالْيَوْمِ', 'الْآخِرِ'), 20)]
[(('عِنْدَ', 'رَبِّهِمْ'), 19)]
[(('يَهْدِي', 'الْقَوْمَ'), 19)]
[(('اللَّهُ', 'الَّذِينَ'), 19)]
[(('فِي', 'قُلُوبِهِمْ'), 19)]
[(('كُلُّ', 'نَفْسٍ'), 19)]
[(('بِاللَّهِ', 'وَالْيَوْمِ'), 19)]
[(('مُلْكُ', 'السَّمَاوَاتِ'), 19)]
[(('اللَّهِ', 'وَاللَّهُ'), 18)]
[(('لِلَّذِينَ', 'آمَنُوا'), 18)]
[(('إِنَّا', 'كُنَّا'), 18)]
[(('أُوتُوا', 'الْكِتَابَ'), 18)]
[(('الَّذِينَ', 'يُؤْمِنُونَ'), 18)]
[(('فِي', 'السَّمَاءِ'), 18)]

Here is a bigram overlay for the English and Arabic corpora. Striking correlation right there. What does this mean anyway? We think it means God ensures that "His" word remains true, regardless of the centuries of translations. It is "His" guarantee, as specified in the Holy Quran.

Obviously, this assumption has to be tested against all available versions of the Holy Quran. However, we find this discovery very interesting and have developed a processing framework to scale this to larger datasets. [Feel free to contact me; this project could be huge, perhaps covering all religious books.]

Trigram Analysis

Like a bigram, a trigram is an n-gram from natural language processing, only that it spans three words instead of two. For example, if the sentence is "Then I asked him to go to school", the trigrams in this case would be "Then I asked", "I asked him" and so on.
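
Bigrams and trigrams are both special cases of n-grams, so one sliding-window helper covers them. A minimal sketch; the function name `ngrams` is our own choice here and the code is not taken from our processing framework:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Then I asked him to go to school".split()
trigrams = ngrams(sentence, 3)
for trigram in trigrams:
    print(" ".join(trigram))
```

A sentence of eight words yields six trigrams, since the window can start at any of the first six positions.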


And the text only.

Most frequent tri-grams:
[(('god', 'gracious', 'merciful'), 116)]
[(('name', 'god', 'gracious'), 114)]
[(('o', 'ye', 'believe'), 87)]
[(('god', 'oftforgiving', 'merciful'), 49)]
[(('said', 'o', 'lord'), 32)]
[(('lord', 'ye', 'deny'), 31)]
[(('favours', 'lord', 'ye'), 31)]
[(('created', 'heavens', 'earth'), 29)]
[(('verily', 'thy', 'lord'), 24)]
[(('said', 'o', 'people'), 24)]
[(('god', 'exalted', 'power'), 23)]
[(('god', 'full', 'knowledge'), 23)]
[(('god', 'last', 'day'), 23)]
[(('heareth', 'knoweth', 'things'), 19)]
[(('dominion', 'heavens', 'earth'), 19)]
[(('full', 'knowledge', 'wisdom'), 18)]
[(('ye', 'believe', 'ye'), 17)]
[(('jesus', 'son', 'mary'), 17)]
[(('establish', 'regular', 'prayers'), 17)]
[(('believe', 'god', 'last'), 17)]
[(('order', 'ye', 'may'), 17)]
[(('beneath', 'rivers', 'flow'), 16)]
[(('establish', 'regular', 'prayer'), 16)]
[(('thou', 'wilt', 'see'), 16)]
[(('exalted', 'power', 'wise'), 16)]
[(('hath', 'power', 'things'), 16)]
[(('heavens', 'earth', 'god'), 15)]
[(('fear', 'god', 'god'), 15)]
[(('exalted', 'might', 'wise'), 15)]
[(('thus', 'doth', 'god'), 15)]
[(('god', 'hath', 'power'), 15)]
[(('give', 'glad', 'tidings'), 14)]
[(('fear', 'god', 'ye'), 14)]
[(('gardens', 'beneath', 'rivers'), 14)]
[(('dwell', 'therein', 'ever'), 14)]
[(('things', 'heavens', 'earth'), 13)]
[(('said', 'o', 'moses'), 13)]
[(('god', 'well', 'acquainted'), 13)]
[(('practise', 'regular', 'charity'), 13)]
[(('believe', 'work', 'righteousness'), 13)]
[(('exalted', 'power', 'full'), 13)]
[(('god', 'sees', 'well'), 12)]
[(('men', 'path', 'god'), 12)]
[(('obey', 'god', 'apostle'), 12)]
[(('god', 'heareth', 'knoweth'), 12)]
[(('believe', 'work', 'righteous'), 12)]
[(('wilt', 'thou', 'find'), 12)]
[(('gracious', 'merciful', 'l'), 12)]
[(('exalted', 'might', 'merciful'), 12)]
[(('work', 'righteous', 'deeds'), 12)]
[(('seest', 'thou', 'god'), 11)]
[(('travel', 'earth', 'see'), 11)]
[(('fear', 'shall', 'grieve'), 11)]
[(('ask', 'thee', 'concerning'), 11)]
[(('god', 'doth', 'know'), 11)]
[(('apostles', 'clear', 'signs'), 11)]
[(('power', 'full', 'wisdom'), 11)]
[(('god', 'hath', 'revealed'), 11)]
[(('good', 'things', 'life'), 11)]
[(('thee', 'thy', 'lord'), 11)]
[(('promise', 'god', 'true'), 11)]
[(('earth', 'see', 'end'), 11)]
[(('right', 'hands', 'possess'), 11)]
[(('god', 'knoweth', 'well'), 10)]
[(('hast', 'thou', 'turned'), 10)]
[(('say', 'o', 'lord'), 10)]
[(('ye', 'brought', 'back'), 10)]
[(('gracious', 'merciful', 'o'), 10)]
[(('lord', 'knoweth', 'best'), 10)]
[(('whatever', 'heavens', 'earth'), 10)]
[(('god', 'one', 'god'), 10)]
[(('ye', 'fear', 'god'), 10)]
[(('woe', 'day', 'rejecters'), 10)]
[(('belongs', 'dominion', 'heavens'), 10)]
[(('ye', 'call', 'upon'), 10)]
[(('well', 'acquainted', 'ye'), 10)]
[(('fear', 'god', 'obey'), 10)]
[(('day', 'rejecters', 'truth'), 10)]
[(('god', 'exalted', 'might'), 10)]
[(('ah', 'woe', 'day'), 10)]
[(('turned', 'thy', 'vision'), 10)]
[(('merciful', 'o', 'ye'), 10)]
[(('dwell', 'therein', 'aye'), 10)]
[(('thou', 'turned', 'thy'), 10)]
[(('shall', 'fear', 'shall'), 10)]
[(('rivers', 'flow', 'dwell'), 9)]
[(('gardens', 'rivers', 'flowing'), 9)]
[(('worship', 'besides', 'god'), 9)]
[(('revealed', 'unto', 'thee'), 9)]
[(('full', 'knowledge', 'things'), 9)]
[(('work', 'deeds', 'righteousness'), 9)]
[(('god', 'hath', 'full'), 9)]
[(('god', 'power', 'things'), 9)]
[(('god', 'strict', 'punishment'), 9)]
[(('believe', 'god', 'apostle'), 9)]
[(('ye', 'believe', 'god'), 9)]
[(('mercy', 'thy', 'lord'), 9)]
[(('shall', 'ye', 'brought'), 9)]
[(('among', 'people', 'book'), 9)]
[(('worship', 'god', 'ye'), 9)]


And the Arabic text. Again, the top trigram in English directly matches one of the two Arabic trigrams tied at the top.

Most frequent tri-grams:
[(('بِسْمِ', 'اللَّهِ', 'الرَّحْمَنِ'), 114)]
[(('اللَّهِ', 'الرَّحْمَنِ', 'الرَّحِيمِ'), 114)]
[(('يَا', 'أَيُّهَا', 'الَّذِينَ'), 92)]
[(('أَيُّهَا', 'الَّذِينَ', 'آمَنُوا'), 89)]
[(('إِنَّ', 'فِي', 'ذَلِكَ'), 50)]
[(('آمَنُوا', 'وَعَمِلُوا', 'الصَّالِحَاتِ'), 50)]
[(('فِي', 'سَبِيلِ', 'اللَّهِ'), 44)]
[(('فِي', 'السَّمَاوَاتِ', 'فِي'), 37)]
[(('السَّمَاوَاتِ', 'فِي', 'الْأَرْضِ'), 37)]
[(('الَّذِينَ', 'آمَنُوا', 'وَعَمِلُوا'), 36)]
[(('تَجْرِي', 'تَحْتِهَا', 'الْأَنْهَارُ'), 34)]
[(('كُلِّ', 'شَيْءٍ', 'قَدِيرٌ'), 33)]
[(('فَبِأَيِّ', 'آلَاءِ', 'رَبِّكُمَا'), 31)]
[(('آلَاءِ', 'رَبِّكُمَا', 'تُكَذِّبَانِ'), 31)]
[(('فِي', 'السَّمَاوَاتِ', 'وَالْأَرْضِ'), 30)]
[(('فِي', 'ذَلِكَ', 'لَآيَاتٍ'), 24)]
[(('خَلَقَ', 'السَّمَاوَاتِ', 'وَالْأَرْضَ'), 22)]
[(('اللَّهَ', 'غَفُورٌ', 'رَحِيمٌ'), 22)]
[(('إِنَّ', 'اللَّهَ', 'كَانَ'), 22)]
[(('جَنَّاتٍ', 'تَجْرِي', 'تَحْتِهَا'), 21)]
[(('فِي', 'ذَلِكَ', 'لَآيَةً'), 20)]
[(('يَا', 'أَيُّهَا', 'النَّاسُ'), 20)]
[(('تَحْتِهَا', 'الْأَنْهَارُ', 'خَالِدِينَ'), 19)]
[(('قَالَ', 'يَا', 'قَوْمِ'), 19)]
[(('إِنَّ', 'اللَّهَ', 'يُحِبُّ'), 19)]
[(('مُلْكُ', 'السَّمَاوَاتِ', 'وَالْأَرْضِ'), 19)]
[(('بِاللَّهِ', 'وَالْيَوْمِ', 'الْآخِرِ'), 19)]
[(('فِي', 'الْحَيَاةِ', 'الدُّنْيَا'), 19)]
[(('إِنَّ', 'الَّذِينَ', 'كَفَرُوا'), 18)]
[(('وَلَكِنَّ', 'أَكْثَرَ', 'النَّاسِ'), 17)]
[(('بِكُلِّ', 'شَيْءٍ', 'عَلِيمٌ'), 16)]
[(('الَّذِينَ', 'أُوتُوا', 'الْكِتَابَ'), 16)]
[(('إِنَّ', 'الَّذِينَ', 'آمَنُوا'), 16)]
[(('اللَّهَ', 'إِنَّ', 'اللَّهَ'), 15)]
[(('إِنَّ', 'اللَّهَ', 'غَفُورٌ'), 14)]
[(('ذَلِكَ', 'لَآيَاتٍ', 'لِقَوْمٍ'), 14)]
[(('فِي', 'الدُّنْيَا', 'وَالْآخِرَةِ'), 13)]
[(('اللَّهِ', 'إِنَّ', 'اللَّهَ'), 13)]
[(('وَاللَّهُ', 'غَفُورٌ', 'رَحِيمٌ'), 13)]
[(('يَا', 'أَيُّهَا', 'النَّبِيُّ'), 13)]
[(('فِي', 'ضَلَالٍ', 'مُبِينٍ'), 13)]
[(('وَاللَّهُ', 'عَلِيمٌ', 'حَكِيمٌ'), 13)]
[(('الَّذِي', 'خَلَقَ', 'السَّمَاوَاتِ'), 12)]
[(('وَاعْلَمُوا', 'أَنَّ', 'اللَّهَ'), 12)]
[(('وَاللَّهُ', 'كُلِّ', 'شَيْءٍ'), 12)]
[(('عَلِيمٌ', 'بِذَاتِ', 'الصُّدُورِ'), 12)]
[(('وَاللَّهُ', 'يَهْدِي', 'الْقَوْمَ'), 12)]
[(('لِلَّهِ', 'فِي', 'السَّمَاوَاتِ'), 12)]
[(('وَقَالَ', 'الَّذِينَ', 'كَفَرُوا'), 12)]
[(('اللَّهَ', 'كُلِّ', 'شَيْءٍ'), 12)]
[(('يَا', 'أَهْلَ', 'الْكِتَابِ'), 12)]
[(('وَالَّذِينَ', 'آمَنُوا', 'وَعَمِلُوا'), 11)]
[(('أَكْثَرَ', 'النَّاسِ', 'يَعْلَمُونَ'), 11)]
[(('الرَّحْمَنِ', 'الرَّحِيمِ', 'يَا'), 10)]
[(('الرَّحِيمِ', 'يَا', 'أَيُّهَا'), 10)]
[(('فِي', 'قُلُوبِهِمْ', 'مَرَضٌ'), 10)]
[(('فَاتَّقُوا', 'اللَّهَ', 'وَأَطِيعُونِ'), 10)]
[(('أَصْحَابُ', 'النَّارِ', 'خَالِدُونَ'), 10)]
[(('يَهْدِي', 'الْقَوْمَ', 'الظَّالِمِينَ'), 10)]
[(('إِنَّ', 'اللَّهَ', 'عَلِيمٌ'), 10)]
[(('تَرَ', 'أَنَّ', 'اللَّهَ'), 10)]
[(('افْتَرَى', 'اللَّهِ', 'كَذِبًا'), 10)]
[(('الْحَمْدُ', 'لِلَّهِ', 'الَّذِي'), 10)]
[(('وَعْدَ', 'اللَّهِ', 'حَقٌّ'), 10)]
[(('آتَيْنَا', 'مُوسَى', 'الْكِتَابَ'), 9)]
[(('إِنَّ', 'اللَّهَ', 'كُلِّ'), 9)]
[(('وَلَكِنَّ', 'أَكْثَرَهُمْ', 'يَعْلَمُونَ'), 9)]
[(('وَكَانَ', 'اللَّهُ', 'غَفُورًا'), 9)]
[(('ذَلِكَ', 'الْفَوْزُ', 'الْعَظِيمُ'), 9)]
[(('اللَّهُ', 'غَفُورًا', 'رَحِيمًا'), 9)]
[(('اللَّهُ', 'الَّذِينَ', 'آمَنُوا'), 9)]
[(('أَظْلَمُ', 'مِمَّنِ', 'افْتَرَى'), 9)]
[(('كَذَلِكَ', 'يُبَيِّنُ', 'اللَّهُ'), 9)]
[(('اللَّهُ', 'إِلَهَ', 'إِلَّا'), 9)]
[(('الَّذِينَ', 'كَذَّبُوا', 'بِآيَاتِنَا'), 9)]
[(('رَحِيمٌ', 'يَا', 'أَيُّهَا'), 9)]
[(('يَبْسُطُ', 'الرِّزْقَ', 'يَشَاءُ'), 9)]
[(('تَدْعُونَ', 'دُونِ', 'اللَّهِ'), 9)]
[(('قَوْمِ', 'اعْبُدُوا', 'اللَّهَ'), 9)]
[(('وَاتَّقُوا', 'اللَّهَ', 'إِنَّ'), 9)]
[(('يَا', 'قَوْمِ', 'اعْبُدُوا'), 9)]
[(('فَانْظُرْ', 'كَانَ', 'عَاقِبَةُ'), 9)]
[(('مِمَّنِ', 'افْتَرَى', 'اللَّهِ'), 9)]
[(('اعْبُدُوا', 'اللَّهَ', 'إِلَهٍ'), 9)]
[(('اللَّهَ', 'إِلَهٍ', 'غَيْرُهُ'), 9)]
[(('فِي', 'الْأَرْضِ', 'جَمِيعًا'), 9)]
[(('اللَّهِ', 'إِلَهًا', 'آخَرَ'), 9)]
[(('الَّذِينَ', 'آمَنُوا', 'اتَّقُوا'), 8)]
[(('رَبُّ', 'السَّمَاوَاتِ', 'وَالْأَرْضِ'), 8)]
[(('إِنَّ', 'وَعْدَ', 'اللَّهِ'), 8)]
[(('اللَّهَ', 'شَدِيدُ', 'الْعِقَابِ'), 8)]
[(('كَانَ', 'أَكْثَرُهُمْ', 'مُؤْمِنِينَ'), 8)]
[(('لِلَّهِ', 'رَبِّ', 'الْعَالَمِينَ'), 8)]
[(('مُؤْمِنِينَ', 'وَإِنَّ', 'رَبَّكَ'), 8)]
[(('ذَلِكَ', 'لَآيَةً', 'كَانَ'), 8)]
[(('رَبَّكَ', 'الْعَزِيزُ', 'الرَّحِيمُ'), 8)]
[(('الَّذِينَ', 'يُؤْمِنُونَ', 'بِالْآخِرَةِ'), 8)]
[(('الصَّلَاةَ', 'وَآتُوا', 'الزَّكَاةَ'), 8)]
[(('أَكْثَرُهُمْ', 'مُؤْمِنِينَ', 'وَإِنَّ'), 8)]
[(('فَإِنَّ', 'اللَّهَ', 'غَفُور'))]

Like before, here is a trigram overlay of the two corpora. Again, striking correlation. We would go as far as to say it is the closest match of all the plots done so far. The trigram overlay shows a great similarity between the English and Arabic corpora of the Holy Quran, even though the two languages are different.
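
One way to put a number on how similar two overlaid curves are is the Pearson correlation coefficient of their counts, compared rank by rank. The overlays in this article were compared by eye, so treat the following as a possible sketch rather than the method we used. The helper name `pearson` is our own, and the two short lists hold the counts of the five most frequent trigrams copied from the dumps above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Counts of the five most frequent trigrams in each corpus, in rank order.
english_top5 = [116, 114, 87, 49, 32]
arabic_top5 = [114, 114, 92, 89, 50]

print(round(pearson(english_top5, arabic_top5), 3))
```

A value close to 1 means the two curves fall off in step with each other; a value close to 0 means they do not.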

Word Length

A word length frequency distribution shows the lengths of the words in the corpora and how many times each length occurs. This statistical analysis does not use stop word filters, but retains the other filters described earlier.
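
A word length distribution can be sketched by counting `len(word)` over the token list. The helper names `word_length_distribution` and `share` and the sample tokens are made up for illustration. Keep in mind that Python's `len()` counts every code point, so in fully vocalized Arabic text each pronunciation mark adds to a word's length.

```python
from collections import Counter


def word_length_distribution(tokens):
    """Map each word length to the number of tokens that have it."""
    return Counter(len(word) for word in tokens)


def share(distribution, length):
    """Percentage of all tokens that have the given length."""
    total = sum(distribution.values())
    return 100.0 * distribution[length] / total


# Made-up sample; stop words are kept for this analysis.
sample_tokens = ["god", "lord", "ye", "thou", "say", "o"]

lengths = word_length_distribution(sample_tokens)
for length, count in lengths.most_common():
    print([(length, count)])
print("Share of 3-letter words: %.1f %%" % share(lengths, 3))
```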


Most common word lengths:
[(3, 41869)]
[(4, 37017)]
[(2, 33338)]
[(5, 18445)]
[(6, 11003)]
[(7, 10342)]
[(8, 5608)]
[(1, 5330)]
[(9, 3818)]
[(10, 2328)]
[(11, 1065)]
[(12, 434)]
[(13, 271)]
[(14, 69)]
[(15, 10)]
[(16, 7)]
[(18, 2)]
[(17, 1)]
[(19, 1)]
[(20, 1)]

The most common word length in the English corpus of the Holy Quran is 3, accounting for about 24 % of the entire text.


For Arabic, the most common word length is 7, accounting for about 15 % of the entire text.

Most common word lengths:
[(7, 11449)]
[(6, 9681)]
[(8, 9425)]
[(9, 8499)]
[(10, 7623)]
[(5, 7386)]
[(4, 6812)]
[(11, 5807)]
[(3, 3545)]
[(12, 3097)]
[(13, 2454)]
[(14, 1383)]
[(15, 732)]
[(16, 235)]
[(17, 126)]
[(18, 43)]
[(2, 40)]
[(19, 23)]
[(20, 9)]
[(1, 6)]
[(21, 1)]

Dispersion Plots

We love dispersion plots. A dispersion plot maps the location of a word, i.e. how many words from the beginning it appears. This positional information is interesting because it visually represents a word's journey in a corpus of text.

Each stripe represents an instance of a word, and each row represents an entire text. For this experiment, we will use our fully filtered corpora, without the stop words and choose specific words for plotting.
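
The data behind a dispersion plot is simply the list of offsets at which a word occurs. Here is a minimal sketch with a crude text rendering in place of a real plot; NLTK's `Text.dispersion_plot` draws the graphical version. The helper names `word_offsets` and `dispersion_row` and the sample tokens are made up for illustration.

```python
def word_offsets(tokens, target):
    """Return the positions (counted from the start) at which target occurs."""
    return [i for i, word in enumerate(tokens) if word == target]


def dispersion_row(tokens, target, width=40):
    """Render one row of text: '|' where target occurs, '.' elsewhere."""
    row = ["."] * width
    for offset in word_offsets(tokens, target):
        # Scale the offset into one of `width` equally sized buckets.
        row[offset * width // len(tokens)] = "|"
    return "".join(row)


# Made-up sample; in practice this would be the fully filtered corpus list.
sample_tokens = ["moses", "said", "lord", "moses", "people", "jesus", "lord", "moses"]

for name in ("moses", "jesus"):
    print(name.ljust(6), dispersion_row(sample_tokens, name, width=8))
```

Each printed row plays the role of one row in the plot, and each `|` plays the role of one stripe.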


Let's look at the mentions of the various prophets using a dispersion plot. Muhammed (PBUH) appears 4 times, compared to Jesus, who appears 29 times. This is because God addresses Muhammed (PBUH) directly as "you" many times throughout the book.





Check back for part 2. We are interested in hearing your thoughts in the comments section below. Please share if you found this interesting, and don't forget to tell us what we should cover in part 2 of this short mini-research project. Regards.

About Ali Gajani

Hi. I am Ali Gajani. I started Mr. Geek in early 2012 as a result of my growing enthusiasm and passion for technology. I love sharing my knowledge and helping out the community by creating useful, engaging and compelling content. If you want to write for Mr. Geek, just PM me on my Facebook profile.