African languages and descriptional density

I have recently been having some fun with bibliographical data. Specifically, I have tried to determine a simple way to calculate the "descriptional density" for various African languages, especially with regard to grammar descriptions.

Descriptional density (a concept I’ve invented myself, I think) aims to determine how well-described any given language is in terms of existing grammar books and dictionaries. For instance, if a given language has only one grammar book written about it, and another language has fourteen grammar books written about it, then obviously the latter language is better described than the former. In other words, its descriptional density is higher.

There are no doubt many factors that should be taken into account when calculating something like descriptional density, such as number of publications or titles, size of description, number of authors involved, number of varieties described, availability of the grammar(s), and so on. However, many such factors are difficult to operationalize in simple ways. For instance, the size of a grammar book is not always related to its inherent usefulness, quality or even comprehensiveness. The availability of an item is difficult to determine easily (at least as a numerical value). Indeed, there are seemingly only two factors that can be handled without stumbling into major difficulties while still giving a reasonably informative result: number of titles or works (W) and time span (T). These can be worked into a formula as follows:

    DD = W1 + W2/3 + √T

In general, one grammar book equals a W value of 1. However, many grammar books appear in second, third, fourth, etc., editions. It seems unintuitive to give a second edition the same weight as a first edition. After all, it is still essentially the same book, albeit with some minor or major revisions. Hence it seems convenient to distinguish primary works (W1) from secondary works (W2). While primary works are given a value of 1, secondary works are given a value of 1/3 (a third).

T (time span) is the number of years between the publication of the earliest and the latest grammar. For instance, my bibliography includes 135 primary works (grammar books) for Swahili. The earliest of these was published in 1850, and the latest in 2006. This gives a time span of 156 years. In order for this number not to inflate the calculations unnecessarily, it needs to be whittled down a bit, which is why I use the square root of the actual time span in the formula.

By adding together the total number of primary works (W1), a third of the total number of secondary works (W2), and the square root of the time span (T), we get a total index value representing the descriptional density (DD) for any given language.
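As a rough illustration, the calculation can be sketched in a few lines of Python. The function name and parameter names are just placeholders I have chosen for this sketch. The Swahili figures are the ones from my bibliography (the secondary-work count of 78 appears in the table below).

```python
import math

def descriptional_density(primary_works, secondary_works, time_span_years):
    """Return DD = W1 + W2/3 + sqrt(T).

    primary_works    -- W1, first editions (weight 1 each)
    secondary_works  -- W2, later editions (weight 1/3 each)
    time_span_years  -- T, years between earliest and latest grammar
    """
    return primary_works + secondary_works / 3 + math.sqrt(time_span_years)

# Swahili: 135 primary works, 78 secondary works, published 1850 to 2006
swahili_dd = descriptional_density(135, 78, 2006 - 1850)
print(f"{swahili_dd:.2f}")  # 173.49
```

For Swahili this works out as 135 + 26 + 12.49, so most of the value comes from the works themselves and the time span contributes only a modest bonus.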

Here, then, is a list of fifteen of the largest Bantu languages spoken in Africa, ranked according to their DD (descriptional density) values:

    LANGUAGE        DD VALUE    W1 + W2     T (years)
    Swahili         173.49      135 + 78    156
    Zulu             70.53       42 + 48    157
    Kikongo          67.29       45 + 11    347
    Chewa/Nyanja     51.11       31 + 26    131
    Xhosa            42.15       20 + 27    173
    Shona            41.63       26 + 15    113
    Setswana         39.08       20 + 18    171
    Lingala          37.29       23 + 11    113
    Sesotho          31.45       16 + 9     155
    North Sotho      25.91       14 + 3     119
    Luba-Kasai       25.82       13 + 7     110
    Kirundi          24.23       12 + 7      98
    Kinyarwanda      21.54        9 + 9      91
    Sukuma           20.87       11 + 1      91
    Kikuyu           17.31        7 + 2      93

Notice how the ranking only roughly corresponds to the actual number of grammar descriptions (whether we look at primary works only or primary and secondary works jointly). Kikongo, for example, has more primary works than Zulu but ends up below it once secondary works are counted in. By taking time span into account as well, we get a somewhat more sophisticated picture of how well-described any given language is. As already mentioned, I have only looked at grammar descriptions. For a more comprehensive look, I would also need to look at dictionaries, but that is a project for another sleepless night.
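For anyone who wants to check how the rankings differ, here is a small Python sketch that sorts the top six languages from the table in three ways: by DD, by primary works only, and by primary plus secondary works. The helper names (`dd`, `ranking`) are my own for this sketch, and the figures are copied from the table.

```python
import math

# (language, W1, W2, T) for the top six rows of the table
rows = [
    ("Swahili", 135, 78, 156),
    ("Zulu", 42, 48, 157),
    ("Kikongo", 45, 11, 347),
    ("Chewa/Nyanja", 31, 26, 131),
    ("Xhosa", 20, 27, 173),
    ("Shona", 26, 15, 113),
]

def dd(w1, w2, t):
    """Descriptional density: W1 + W2/3 + sqrt(T)."""
    return w1 + w2 / 3 + math.sqrt(t)

def ranking(key):
    """Return language names sorted from highest to lowest by the given key."""
    return [name for name, *_ in sorted(rows, key=key, reverse=True)]

by_dd = ranking(lambda r: dd(r[1], r[2], r[3]))
by_primary = ranking(lambda r: r[1])        # W1 only
by_total = ranking(lambda r: r[1] + r[2])   # W1 + W2, unweighted

print("By DD:     ", by_dd)
print("By W1:     ", by_primary)
print("By W1 + W2:", by_total)
```

Counting primary works alone puts Kikongo ahead of Zulu and Shona ahead of Xhosa, while the unweighted total puts Chewa/Nyanja ahead of Kikongo. The DD ranking agrees with neither of the two simple counts.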

You can read more details about this here.