dianoigo blog

Wednesday, 23 November 2016

Word counts by book for Septuagint, New Testament, Apostolic Fathers and Justin Martyr

Religious studies meets data science. The result:

[Image: word counts by book for the Septuagint, New Testament, Apostolic Fathers and Justin Martyr]

The texts used for this exercise were as follows. The LXX text is taken from freely available text files from the Center for Computer Analysis of Texts at the University of Pennsylvania. The NT text is taken from freely available text files of the SBL GNT maintained by James Tauber. The Apostolic Fathers text is taken from the Logos software edition of Michael W. Holmes' critical text.¹ Justin Martyr's writings are taken from online Greek texts which are in turn based on Goodspeed's 1915 critical text (at least for the Apologies; no attribution is present for the Dialogue with Trypho). Text mining to obtain the word counts was conducted using R statistical software.
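
The R code behind these counts is not reproduced in this post. As a rough illustration of the kind of counting involved, here is a minimal sketch in Python (not R), assuming a plain-text file in which words are separated by whitespace and in which verse numbers and stray punctuation should not count as words. The function name `count_words` and the sample string are made up for the example.

```python
import unicodedata


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter."""
    count = 0
    for token in text.split():
        # Assumption: tokens consisting only of digits or punctuation
        # (e.g. verse numbers) are not words and are skipped.
        if any(unicodedata.category(ch).startswith("L") for ch in token):
            count += 1
    return count


# Hypothetical sample line with a leading verse number.
sample = "1 Ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν"
print(count_words(sample))  # 12
```

Real source files will usually need extra cleaning first (book and chapter headers, text-critical sigla, and so on), so counts from a sketch like this may differ slightly from those reported here.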

A few fun facts:

  • We have more words of Justin Martyr preserved (69741) than the entire Apostolic Fathers corpus (64757), thanks to the truly massive size of his Dialogue with Trypho.
  • Justin Martyr and the Apostolic Fathers combined (134498) are only slightly shorter than the New Testament (137554).
  • The Gospels and Acts make up over 60% of the New Testament by word count. The Pauline corpus makes up "only" 23.5%.
  • The whole of the LXX consists of 589013 words (based on the texts used here). Of this, 82% comes from books considered canonical by Protestants (albeit in Hebrew). An additional 13% (77806 words) comes from books considered canonical by Roman Catholics but not Protestants (1-2 Maccabees, Wisdom of Solomon, Sirach, Judith, Tobit, Baruch, Epistle of Jeremiah, Bel and the Dragon, Susanna).² The other 5% comes from books not considered canonical by Protestants or Roman Catholics (1 Esdras, 3-4 Maccabees, Odes, Psalms of Solomon).
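
The arithmetic behind these facts can be checked in a few lines. This is a small sketch in Python that uses only the counts quoted in the list above; the variable names are illustrative.

```python
# Word counts quoted in the list above.
justin = 69741
apostolic_fathers = 64757
new_testament = 137554
lxx_total = 589013
lxx_catholic_only = 77806  # canonical for Roman Catholics but not Protestants

combined = justin + apostolic_fathers
print(combined)                                        # 134498
print(justin - apostolic_fathers)                      # 4984
print(new_testament - combined)                        # 3056
print(round(100 * combined / new_testament, 1))        # 97.8
print(round(100 * lxx_catholic_only / lxx_total, 1))   # 13.2
```

So Justin Martyr and the Apostolic Fathers together come to about 97.8% of the length of the New Testament, and the 77806 words round to the 13% share given above.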

A couple of caveats. In cases where two quite divergent text families exist for a single book (e.g., Joshua, Judges, Daniel, Susanna, Bel and the Dragon, Tobit) I've just represented one of the texts. It should also be noted that some of the texts have lacunae (Epistle to Diognetus; Dialogue with Trypho) or lost endings (Gospel of Mark; Didache), so the original word count would have been larger than the one reported here. Other texts have portions extant only in Latin (Polycarp's Epistle to the Philippians; Shepherd of Hermas) which will also have slightly affected the word count since, for example, there is no article in Latin. For the Martyrdom of Polycarp I've only included chapters 1-20 since the epilogues in chapters 21-22 are obviously added by later hands.


  • ¹ Michael W. Holmes, The Apostolic Fathers: Greek Texts and English Translations (Grand Rapids: Baker, 2007).
  • ² The Greek additions to Esther, also considered canonical by Roman Catholics, are not included here since I didn't go to the trouble of counting these words separately.


Brian said...

Hi Tom. I have only just this moment discovered your blog. Is it still active? I have a question I’d like to ask you about the word count. For a start, I must point out that I have no academic qualifications. I’m just a general reader with an interest in Biblical languages.
I believe the four books of Kingdoms originally made up a single long book, and similarly the two books of Chronicles, and that in both cases the division into the existing “books” was simply a practical matter having to do with keeping each scroll short enough to be easily handled. Can you please confirm this? If so, I have a follow-up question I’d like to ask you.

Tom said...

Hi Brian.

Yes, the blog is still active in the sense that I monitor comments and am happy to engage with readers. The rate of new content being published has slowed almost to a standstill due to my offline life having become much busier. I'm considering launching a podcast and, if I do, that will probably more or less take the place of the blog in terms of new content.

In the Hebrew Bible, Samuel, Kings, and Chronicles are each one book. In the Septuagint, at least as transmitted by the early church, Samuel and Kings were "split and combined" into four books, with Samuel split into 1 & 2 Kingdoms and Kings into 3 & 4 Kingdoms. Chronicles was also split into two books.

I don't know why the books were split, but the explanation that it is due to the constraints of scroll length sounds plausible.

Narski said...

I was wondering whether you had also analyzed the number of unique lexemes in those books? I've tried to Google this for some time and looked at a few books about the Septuagint, but nowhere can I find an estimate of the number of unique words in each Septuagint book.

Tom said...

Hi Narski.

Raw Septuagint data can be found here:

If you're willing and able to do the necessary data wrangling, I think you should be able to get the number of unique lexemes per book.

It also looks like James Tauber is in the process of creating morphologically tagged versions of the Apostolic Fathers, as he previously did for the New Testament. See here:


Tom said...

A follow-up comment: there are morphologically tagged text files for the LXX in Unicode produced from the CATSS files by Nathan Smith here. These should be much easier to work with.
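
For anyone attempting this, counting unique lexemes per book from a tagged file might look roughly like the sketch below (Python). The line layout is an assumption, not the documented format of these files: each line is taken to be whitespace-separated, with a book identifier in the first field and the lemma in the last. The function name and sample lines are hypothetical, so check the actual column order before relying on it.

```python
from collections import defaultdict


def unique_lexemes_per_book(lines):
    """Return {book: number of distinct lemmas} for tagged lines.

    Assumed (hypothetical) layout: first field = book, last field = lemma.
    """
    lemmas = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        book, lemma = fields[0], fields[-1]
        lemmas[book].add(lemma)
    return {book: len(found) for book, found in lemmas.items()}


# Made-up sample in the assumed layout: book, surface form, lemma.
sample = [
    "Gen ἐν ἐν",
    "Gen ἀρχῇ ἀρχή",
    "Gen ἐποίησεν ποιέω",
    "Gen ὁ ὁ",
    "Gen θεὸς θεός",
    "Gen τὸν ὁ",
    "Exod ταῦτα οὗτος",
    "Exod τὰ ὁ",
]
print(unique_lexemes_per_book(sample))  # {'Gen': 5, 'Exod': 2}
```

Using a set per book means repeated lemmas (here the article, which appears twice in the Gen sample) are counted once.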

Narski said...

Thanks! This should make the process painless.