Since the Early English Books Online – Text Creation Partnership released its 32,853 Phase-I texts, I’ve been playing around with them in R. I wanted a simple set of techniques for searching the database, filtering it for my always-changing interests, and importing the texts into R for quantitative analysis. Basically, I wanted to build small- and medium-sized document collections on the fly and run some quick analyses (word frequency counts, KWIC concordances, and topic models using mallet) without starting from scratch every time. I also wanted to make something that would be useful to my future self, my students, and, maybe, other people. This summer, with the help of Travis Mullen and thanks to an ASPIRE I research grant from the University of South Carolina, we converted these functions into an R package, which we’re calling tei2r.

tei2r is still very much in development. It’s designed to be simple, so hopefully there won’t be too many glitches. It’s basically a wrapper for two existing R packages: XML, which parses XML documents, and mallet, which trains topic models. We’d love to have others test it out on their machines and hope it proves useful. Right now, it’s available as a repository on GitHub: https://github.com/michaelgavin/tei2r. (For installation instructions, see here.) Testing and feedback are very welcome!
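
To give a sense of the kind of work the XML package does, here is a minimal sketch. It is not tei2r's own code: the TEI fragment is made up for illustration, it omits the namespace declaration that real TEI files may carry, and the XPath expressions are assumptions about where the title and body text live. It pulls out the paragraphs of one tiny document and counts its words with base R.

```r
library(XML)

# A made-up, namespace-free TEI fragment (illustration only).
tei <- '<TEI>
  <teiHeader><fileDesc><titleStmt>
    <title>A sample tract</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>The King is restored to his throne.</p>
    <p>The people rejoice at the King.</p>
  </body></text>
</TEI>'

# Parse the string into an XML document.
doc <- xmlParse(tei, asText = TRUE)

# Pull the title and the body paragraphs out with XPath.
title <- xpathSApply(doc, "//titleStmt/title", xmlValue)
paras <- xpathSApply(doc, "//text//p", xmlValue)

# Lowercase, split on anything that is not a letter, drop empty strings.
words <- unlist(strsplit(tolower(paste(paras, collapse = " ")), "[^a-z]+"))
words <- words[words != ""]

# Word frequency table, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
print(title)
print(head(freq))
```

Real EEBO-TCP files are far larger and messier than this, which is the reason for wrapping the parsing and cleaning steps in a package.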

How to build a topic model from EEBO in 5 commands

Once you have tei2r installed and in your library, you can build a topic model and store its results in R, including easy-to-read tables showing the results in several ways, in just 5 commands. (In this example, I’m downloading, processing, and modeling 789 documents from the EEBO-TCP that were published in 1660.)

  1. Search the EEBO-TCP for the files you want
    results = tcpSearch(range = 1660, field = "Date", write = TRUE)
  2. Download the results
    tcpDownload(results)
  3. Initialize your collection in R
    dl = buildDocList(directory = getwd(), indexFile = "index.csv")
  4. Pre-process the texts
    dt = importTexts(dl)
  5. Train the model and store the results in R
    tmod = buildModel(dl = dl, dt = dt, tnum = 20)
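
The five steps above can also be wrapped in one function so that each collection gets its own folder. This is only a sketch under some assumptions: `build_eebo_model`, `project_dir`, and `topics` are hypothetical names, not part of tei2r, and the sketch assumes that `tcpSearch(write = TRUE)` and `tcpDownload()` save their output into the current working directory, which is where `buildDocList()` is told to look.

```r
# Hypothetical helper (not part of tei2r): run the five steps in a
# dedicated folder so different collections do not overwrite each other.
build_eebo_model <- function(year, project_dir, topics = 20) {
  # Create the project folder if needed and work inside it.
  dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)
  old_dir <- setwd(project_dir)
  # Restore the previous working directory when the function exits.
  on.exit(setwd(old_dir), add = TRUE)

  # 1. Search the EEBO-TCP index (assumed to write index.csv here).
  results <- tei2r::tcpSearch(range = year, field = "Date", write = TRUE)
  # 2. Download the matching files (assumed to land in this folder).
  tei2r::tcpDownload(results)
  # 3. Initialize the collection.
  dl <- tei2r::buildDocList(directory = getwd(), indexFile = "index.csv")
  # 4. Pre-process the texts.
  dt <- tei2r::importTexts(dl)
  # 5. Train the model and return it.
  tei2r::buildModel(dl = dl, dt = dt, tnum = topics)
}
```

With tei2r installed, a call such as `tmod = build_eebo_model(1660, "eebo-1660")` would then stand in for the five separate commands. The folder name here is only an example.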

 

The above commands are offered merely as a sample to introduce one piece of functionality. For an introduction to the tei2r package, including installation instructions and an overview of its primary functions, click here. Thanks!
