Creating a Digital Dissertation

Jeri E. Wieringa

2017-11-07 00:00

The process for creating a narrative dissertation in history is, on the whole, well-established. A researcher chooses a topic of study and a theoretical framework (or two), identifies the relevant archives, spends years in those archives reading and analyzing the materials, and then uses that evidence to construct an account of a particular time, place, person, or event that had some influence on the development of culture, politics, and the like. Because the genre is well-defined, little attention is generally paid to the process beyond the traditional acknowledgment in introductions and footnotes of the relevant theoretical frameworks and the archival sources consulted.

With the rise of digital technologies, many of the assumptions regarding the standard processes for archival research, analysis, and even publication are open to challenge and revision. Where the standard methodology for interpretation was reading and the application of theory and logic, computational algorithms enable new forms of data analysis to ground interpretive claims.¹ Additionally, the use of computers to process more traditional forms of historical evidence creates new challenges for how that research is evaluated and extended. Where scholarly arguments based on interpretation could be evaluated through a rereading of the historical objects and a consideration of the logic of the initial interpretation, evaluating work that relies on computation requires engagement with the source code. A reader cannot adequately assess arguments where the author notes only a source base and a general computational approach, since part of the logic of the analysis is embedded in the implementation and is visible only through the code.²

Additionally, the very process of using computational tools in the analysis of historical data and the crafting of an interpretive narrative expands what has traditionally been considered the work of an individual scholar into a work that is more obviously a composite of the intellectual work of multiple authors.³ Whereas the written dissertation weaves together evidence and theory with well-constructed prose in order to convince readers of a particular interpretation, digital work weaves together evidence, theory, software libraries, and display frameworks in creating the final result, where the software and display contribute intellectual work of their own in shaping the final product.

As a result of these concerns, the Department of History and Art History at George Mason University has included the requirement for a self-reflective process statement as part of digital dissertation projects submitted to the department. This statement, which follows, is required to give a “full accounting of the technical and analogue work that went into building the digital dissertation” as well as the “code and software employed to produce the final dissertation.”⁴ This process of documenting the technical structure of the project aids in its evaluation as a complex and multi-layered work of scholarship. Much of the information found here is also documented throughout the dissertation itself, as part of my argument is that the technical elements are as much a part of the intellectual work of the dissertation as the more traditional narrative prose.⁵ By providing an overview of the technical whole of the project in this space, readers can quickly orient themselves to the project.

The process statement provides a summary and accounting of the three primary layers of the dissertation: the data, the analysis, and the presentation interfaces. In it, I document the software used for developing each layer and provide the information necessary for running the different aspects of the project on a local machine. This provides a mechanism for future viewers to reconstruct the project should it cease to live online, to test elements as part of an evaluation of the work, or to extend parts of the project for other applications.

Data for Historical Analysis

As with any digital project, A Gospel of Health and Salvation is dependent on the availability of digital sources upon which to operate. Due to the SDA’s commitment to making their historical materials as widely available as possible, a large percentage of the denomination’s periodical literature has been digitized and is available through the church’s websites. At the time of writing, the periodical scans were hosted at the Online Archive of the SDA’s Office of Archives, Statistics, and Research. The files are also increasingly available through the Adventist Digital Library, a compilation archive for a range of historic SDA materials. A full listing of the included periodicals, with links to the original PDF files, is included in the online bibliography.

In Chapter Two of the dissertation I describe the processes by which I selected, evaluated, and cleaned the textual data from the digital files. That work is also documented in the Gather, Clean, and Preprocess Notebooks. I do not offer my own interface for accessing the digital files as part of this project, recommending instead that they be accessed and viewed through the infrastructure of the SDA. I can make the processed text that was used for the topic modeling phase available on request. However, that text should also be re-creatable using the original scans and the associated notebook files.

I used additional data sources to support the analysis of the periodicals. These included place and people names from the SDA’s Yearbooks, place names from USGS, and third-party spelling lists. The data that I compiled or generated are available for download through the site, while external data sources are documented in the online bibliography.

Computational Analysis

From automating the download of the denominational periodicals to visualizing the topic model, this project has relied on computational methods at each stage. The primary coding language for the project is Python, as documented in the included notebooks section of the dissertation. Additionally, I used AntConc for computing keyness values as part of my evaluation of the strength of the topic model, as I discuss in Chapter 3. Other than the work in AntConc and my initial pass at downloading the PDF documents and extracting the text, all the computational pieces of the dissertation were done using Python libraries rather than desktop or online software so that the work could be easily documented and reproduced.

The topic model was created using MALLET through the command-line interface, as documented in the Model notebook. Some preprocessing steps, such as connecting noun phrases and creating the stopword list, were completed using Gensim. I analyzed the model with Python, using the Pandas, Plotly, and pyLDAvis libraries for data analysis and visualization.
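For readers unfamiliar with running MALLET from the command line, the following is a minimal sketch of how such a call could be assembled from Python. It is not the command used in the Model notebook: the function name, the file paths, and the number of topics are hypothetical placeholders, and only the train-topics subcommand and its standard options come from MALLET itself.

```python
import shlex


def build_mallet_train_command(corpus_file, output_dir, num_topics, mallet_bin="mallet"):
    """Assemble (but do not run) a MALLET train-topics command.

    All arguments are illustrative; the settings used for the dissertation
    are documented in the Model notebook, not here.
    """
    return [
        mallet_bin, "train-topics",
        "--input", corpus_file,
        "--num-topics", str(num_topics),
        "--output-state", f"{output_dir}/topic-state.gz",
        "--output-doc-topics", f"{output_dir}/doc-topics.txt",
        "--output-topic-keys", f"{output_dir}/topic-keys.txt",
    ]


# Hypothetical paths and topic count, for illustration only.
command = build_mallet_train_command("corpus.mallet", "model", num_topics=50)
print(shlex.join(command))
```

Passing the resulting list to subprocess.run would execute the command, provided MALLET is installed and on the system path. Building the command as a list rather than a single string keeps each setting visible and easy to record alongside the model output.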

A list of the major software libraries utilized in this project is available in the bibliography.

Digital Interfaces

Selecting and developing the interfaces for the dissertation proved to be a challenge. While a single integrated application is the goal for future iterations of the project, the constraints of the dissertation led me to take a more modular approach to the different components of the project. There are four main interfaces for interacting with the dissertation. The first is the main project website, located at dissertation.jeriwieringa.com. The second is the topic model browser, located at browser.dissertation.jeriwieringa.com. The third is the set of Jupyter Notebooks, which serve the dual function of documenting the code used throughout the project and acting as executable documents, meaning that the provided notebook files can be used to execute the code locally. The final interface for the project is GitHub, with repositories for the main sites, the code notebooks, and a Python library I created for frequently used functions.

I created the main dissertation website using Nikola, a Python static site generator. This site generator enabled me to include a variety of formats within the single project, including notebook, HTML, Markdown, and reStructuredText files. The default styling for the site uses the Bootstrap framework, which I have adapted. The notebook pages are included within the body of the dissertation in a static format: they can be viewed as last run but not executed. Executing the notebook files requires a Jupyter server.

The main topic model browser is an application of Andrew Goldstone’s Topic Model Browser, which relies on D3.js. Using the output from MALLET and the scripts provided as part of the browser, I transformed the data so that it could be interacted with in this format. I manually created the labels for the topics, as I describe in Chapter 3.

Finally, I collected a number of the most frequently used functions into a Python library for ease and stability of use. This library is necessary for testing the code included in the project notebooks and offers examples for adoption and extension in other contexts. It can be downloaded from GitHub and installed locally using pip, as described in the library README file.

Archiving and Reconstructing A Gospel of Health and Salvation

Submitting a project such as this as a dissertation raised a myriad of questions regarding how to use existing processes and platforms for complex digital objects. The current system for archiving dissertations at George Mason relies on a DSpace repository, which is optimized for the collection and preservation of single files, preferably PDFs. Additionally, the formatting requirements for dissertations assume a textual final product, one that can be created with Microsoft Word. For the submission and archiving of this dissertation, I chose to pursue a hybrid strategy. This essay, along with the introduction, website overviews, and bibliography, makes up the “dissertation object” for this project: the primary object that is properly formatted, cataloged, and archived in the repository.

For archiving and preserving the digital aspects of the project, I pursued a two-tiered strategy. First, to preserve the appearance of the project at the point of submission, I captured the web interfaces, dissertation.jeriwieringa.com and browser.dissertation.jeriwieringa.com, using WebRecorder.io and submitted both sites to the Internet Archive. These interfaces are preserved within .warc (WebARChive) files and are viewable through a web archive player. Second, to capture the code and data of the project, I archived the code and data files for the different dissertation components and documented the required software and versions for running the code. These files can be downloaded and run on a local machine to test different aspects of the project or to modify them for other uses. This approach captures both the reading experience of the digital dissertation and its technical underpinnings. These collections are preserved in the George Mason University Archival Repository, MARS.

Reproducing the Research

The computational aspects of the dissertation are split into four components, two for computational processing, and two for presentation:

Python Library - https://github.com/jerielizabeth/text2topics
Jupyter Notebooks - https://github.com/jerielizabeth/Gospel-of-Health-Notebooks
Website - https://github.com/jerielizabeth/Gospel-of-Health-Website
Model Browser - https://github.com/jerielizabeth/dfr-browser

Because of this structure, there is some duplication of files between the repositories. I managed the moving of files using the DoIt Automation tool.

To recreate the computational work that underlies the dissertation, one will need three primary components:

The periodical files from the Adventist Archives,⁶
The Python library that contains my custom functions, and
The Jupyter notebooks that document my workflow for processing the text files and the resulting data.

To run the notebooks on a personal laptop, you will need Jupyter running locally and the supporting Python libraries installed. Those libraries are documented in the environment.yml file in the Notebooks directory. The notebook files can also be uploaded to a third-party Jupyter server, such as Microsoft Azure Notebooks or Google Colaboratory, for users who do not wish to set up a local Jupyter server.
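Before opening the notebooks, it can help to confirm that the supporting libraries are present in the active environment. The following is a minimal sketch of such a check, not part of the project code: the function name is hypothetical, and the short list of import names is drawn from the libraries named in this statement rather than from environment.yml, which remains the authoritative list.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


# Import names for libraries mentioned in this statement; environment.yml
# is the complete and authoritative list.
required = ["pandas", "plotly", "gensim", "pyLDAvis"]
missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All listed libraries are importable.")
```

The check only confirms that each library can be located; it does not compare installed versions against those recorded for the project.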

Together these components make it possible for the intellectual work of preparing and analyzing the text to be examined and duplicated.

Rebuilding the Websites

To recreate the website portions of the dissertation, one will need the files for both the main website and the model browser. The main website uses Nikola to create HTML pages from a collection of Markdown, reStructuredText, and notebook files. To run the site locally, use the nikola serve command from the root of the site directory, after generating the pages with nikola build if they have not already been built.

The model browser is a single-page JavaScript application, built using D3.js. To run it locally, run /bin/server from the root of the browser directory to launch a basic Python 3 webserver.
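If the server script is unavailable, a basic static webserver of the kind described above can be approximated with the Python 3 standard library. This is a sketch under stated assumptions rather than a copy of the browser's own script: the function name and the default port are hypothetical choices.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_static_server(directory, port=8888):
    """Create a local server for the files in `directory` (port is an assumed default)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Port 0 asks the operating system for any free port. The server is created
# and closed here without serving requests, only to show the call.
server = make_static_server(".", port=0)
print(f"Would serve on http://127.0.0.1:{server.server_port}/")
server.server_close()
```

To serve the browser, call make_static_server(".").serve_forever() from the root of the browser directory and open the printed address in a web browser. Binding to 127.0.0.1 keeps the site reachable only from the local machine.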

Both sites require Python 3. I recommend using a package and environment management system for running these elements locally. I used Miniconda for the dissertation.

Technical Support

When I started working on the main part of the dissertation, my programming experience consisted of the two required digital history courses at George Mason University, an additional introduction to programming course, some hands-on experience in web development through my research assistantship at the Roy Rosenzweig Center for History and New Media, and a Rails Girls workshop. In retrospect, embarking on a technical project from that starting point was a bit over-ambitious. My initial design of the project included network analysis from the denominational Yearbooks and geospatial analysis of people, publications, and ideas, along with text analysis. While I still think that these additional modes of analysis would help illuminate the development of this particular group of people, these ideas have been bracketed for future iterations of the project.

I have been overly committed in this project to doing my own computational work, both because I am committed to the idea that one needs to grapple with the assumptions and implementation of computational and historical analysis when bringing the two modes of inquiry together and because of the gender politics of the field. I am, however, deeply indebted to many people who have given their time and energy to help me understand and troubleshoot the technical aspects of this project. Chief among these are Fred Gibbs; the Experimental Humanities Group at the Iliff School of Theology; Lincoln Mullen, who consulted on the network analysis piece of the project that was unfortunately tabled due to time constraints; Amanda Regan; Taylor Arnold; and Lauren Tilton. The computational work in this project is primarily my own, aside from a myriad of snippets gleaned from Stack Overflow and the libraries and resources noted above. The one major external contribution was the workflow for moving model files from DigitalOcean, where I ran MALLET due to the size of the corpus, to Amazon S3 for storage and to my local machine for use. Jason Wieringa set up that workflow.

¹ Tanya A. Clement, “Where Is Methodology in Digital Humanities?” in Debates in the Digital Humanities (University of Minnesota Press, 2016), http://dhdebates.gc.cuny.edu/debates/text/65, argues that for digital humanities to be situated “within a humanist epistemological framework” it “must also entail an explicit articulation of … how our techniques are tied to theory.” She notes that “the hermeneutical methods associated with reading,” the default methodology in humanities studies, “remain largely unarticulated,” which further complicates the work of introducing new methods.
² This problem of source code is one of the core elements of Da’s critique of recent work in computational literary analysis. Nan Z. Da, “The Computational Case Against Computational Literary Studies,” Critical Inquiry 45, no. 3 (2019): 601–39, https://www.journals.uchicago.edu/doi/10.1086/702594. These concerns are not unique to history or the humanities. As computation and data science gain ground in the sciences as a mechanism for knowledge production, similar questions around reproducibility and code are increasingly of central concern.
³ Whether that myth of the individual author has ever been true is another question, as all intellectual work relies on a robust intellectual community.
⁴ Department of History and Art History, “Digital Dissertation Guidelines,” 2019, https://historyarthistory.gmu.edu/graduate/phd-history/digital-dissertation-guidelines.
⁵ Additionally, traditional dissertations may also benefit from such statements, particularly as software makes complex computational processes easier for non-technical users and such work is incorporated into traditional narrative prose.
⁶ A list of the titles I used for the dissertation is included in the bibliography of the project.