About the DfR Browser

Jeri E. Wieringa

2019-01-24 00:00

At the core of the dissertation is a 250-topic model of the periodical literature of the Seventh-day Adventist denomination. As I discuss throughout the dissertation, topic modeling is a form of unsupervised machine learning that enables the exploration of a corpus of literature based on the co-occurrence of words, clustered into units called “topics.” While for the computer, the topics represent the likelihood of word occurrences in different contexts, for the human reader topics can be used to track different themes or discourses within a corpus. The model provides a useful abstraction of a corpus of literature, highlighting different features within a collection of texts.

For this project, I used the DfR Browser, created by Andrew Goldstone. While not the only topic model interface available, the DfR Browser has a number of advantages. First, structurally the browser is a static, single-page website, which reduces the complexity of hosting and archiving the work. Second, the browser provides a number of useful interfaces into the topic model data, enabling the reader to explore from the level of the overall model and corpus to the level of individual words. Third, the methods used by Goldstone to calculate topic weights over time are similar to those I use throughout the rest of the dissertation, computing the percentage of total tokens, or words, assigned to each topic in a given year. This provides a consistent representation of the topics between the visualizations within the dissertation and the browser.
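The yearly topic weight described above, the share of a year's tokens assigned to each topic, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used by the DfR Browser or in the dissertation: the function name `topic_share_by_year`, the input layout (per-document token counts by topic, plus a year for each document), and the toy data are all hypothetical.

```python
from collections import defaultdict


def topic_share_by_year(doc_topic_counts, doc_years):
    """Return {year: {topic: share}}, where share is the fraction of all
    tokens in that year's documents that were assigned to the topic."""
    year_topic = defaultdict(lambda: defaultdict(int))
    year_total = defaultdict(int)

    for doc_id, counts in doc_topic_counts.items():
        year = doc_years[doc_id]
        for topic, n_tokens in counts.items():
            year_topic[year][topic] += n_tokens
            year_total[year] += n_tokens

    return {
        year: {topic: n / year_total[year] for topic, n in topics.items()}
        for year, topics in year_topic.items()
    }


# Hypothetical toy data: token counts per topic for three documents.
doc_topic_counts = {
    "doc_a": {0: 30, 1: 10},
    "doc_b": {0: 10, 2: 50},
    "doc_c": {1: 20, 2: 20},
}
doc_years = {"doc_a": 1880, "doc_b": 1880, "doc_c": 1881}

shares = topic_share_by_year(doc_topic_counts, doc_years)
for year in sorted(shares):
    print(year, shares[year])
```

Because the shares are computed from raw token counts rather than by averaging per-document proportions, long documents contribute more to a year's profile than short ones.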

To aid use of the browser, I provide here an overview of the different “views” to help orient the reader and clarify the ways I anticipate the model being used.¹ These views are the model view, the topic view, the document view, and the word view. Each provides a different perspective on the tokens within the corpus of Seventh-day Adventist periodicals and can be used to explore different aspects of the model and of the periodicals over time.

The Model View

The landing page of the model browser enables the user to view the overall structure of the topic model and provides four different views of all of the topics. The “Grid” view shows all the topics as circles, with the boundary widths indicating overall prominence, while the “Scaled” view (currently not functional) varies the size of the circles according to the prominence of the topic across the whole corpus. The “List” view shows the topics alphabetically with a chart of prevalence over time as well as overall proportion in the corpus and can be sorted along any of these columns. Finally, the “Stacked” view provides a streamgraph visualization of all of the topics, with the width of each band corresponding to either percentage or total word count, depending on the user selection.

The model view provides the user insight into the overall structure of the topic model. In these views, we can see that there is a large number of topics for the corpus, with both more generalized and focused topics. Because of the amount of data generated with this many topics, documents, and words, the model view provides suggestions of places for further inquiry but offers few clear suggestions of patterns in the overall corpus.

The Topic View

The topic view allows the user to explore each topic individually, showing the words that are most prominently associated with the topic, as well as the documents where the topic is most prevalent. Additionally, when a user selects a year in the graph showing the “Conditional Probability of Words in Topic,” the top documents list updates to reveal the documents with the greatest prevalence of the topic in that year. These features enable the user to explore which areas of the corpus are most strongly associated with a particular topic. This information, combined with the graph of topics over time, helps illuminate the language patterns identified by the model. I used this information in assigning interpretive labels to the topics, reading the context from the top documents and exploring particular years, with both low and high overall prevalence, to further refine my understanding of each topic.²

The Document View

The document view provides insight into how the topic modeling algorithm labeled the words within each document, showing each identified topic and the number and percentage of words associated with it. This breakdown of topic assignments provides something of a summary of the content of the document, indicating which general themes or discourses the document is likely engaged with. The view also provides some initial indication as to which topics might be linked to one another.
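The per-document breakdown described above, a count and percentage of words for each topic, can be approximated as follows. This is a hedged sketch rather than the browser's own implementation: it assumes the model output is available as one topic id per token in the document, and the function name `document_topic_breakdown` and the sample topic ids are hypothetical.

```python
from collections import Counter


def document_topic_breakdown(token_topics):
    """Given one topic id per token in a document, return a list of
    (topic, token_count, percent_of_document) tuples, largest first."""
    counts = Counter(token_topics)
    total = sum(counts.values())
    return [
        (topic, n, 100 * n / total)
        for topic, n in counts.most_common()
    ]


# Hypothetical topic assignments for a ten-token document.
token_topics = [12, 12, 12, 12, 12, 12, 87, 87, 87, 203]

breakdown = document_topic_breakdown(token_topics)
for topic, n, pct in breakdown:
    print(f"topic {topic}: {n} words ({pct:.1f}%)")
```

In this toy document, topic 203 receives a single word, the kind of low-count assignment that makes a long tail of topics in the document view.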

The document-level view can also be used to begin to evaluate the performance of the topic modeling algorithm. In the case of the first document, linked above, the long list of identified topics, many with only one word associated with them, suggests that the topic model could be further refined by encouraging fewer topic assignments within documents.³ Despite this weakness, the top three topics assigned to the document provide a good indication of the content, which features the personal testimony of the conversion experience of one of the students at Battle Creek College.

The Word View

The word level view has two associated interfaces: the first shows all of the words that appear in the topic model browser and the second shows the topics in which individual words appear. These two interfaces provide insight into the overall language use within the full set of documents, as well as the ways the topic model treated the individual words from the texts. The full list shows the most prominent words that appear in the corpus, providing a summary of the language of the denomination. The topic view for individual words reveals how words have been grouped by the algorithm, together with the interpretive label I assigned to that cluster of language. This interface provides another way to engage the corpus, working from key words of interest through the associated topics to the documents.

Conclusion

The DfR Browser is well designed to provide users with a high-level view of a topic model as well as the more detailed information needed for evaluating the strengths of the model and using the model to identify documents related to a subject of inquiry. The four views of the browser let users move back and forth between these different perspectives on the corpus and its language, a key feature for putting the model to use in interpretive work. While not the only topic model browser under development, this approach has significant advantages for research work in humanities contexts, as it grounds the interface in the documents, rather than the model itself. Additionally, the technical load of the browser is light, simplifying the processes involved in presenting and archiving the model interface.

My dissertation project has also brought to light some additional features that would enhance the browser for model exploration and interpretation. The default setup for the browser enables exploration of the topic model, but only limited manipulation of the model data. While the algorithm currently clusters topics based on time, enabling further grouping of documents and topics based on other metadata variables at the level of the user interface would enable additional exploration into the ways particular topics, or discourses, are distributed by publication type or geographic region.⁴ Additionally, the inclusion of the document text within the document view, with the topic assignment of the words indicated, would provide further useful information on how the model has characterized the language of the document. Finally, a reporting mechanism that returns a collection of documents fitting a particular set of parameters would expand the usefulness of the browser from the exploration of a model to leveraging the clustering work of the algorithm to identify related documents on a given topic.
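To make the last of these suggestions concrete, the following Python sketch shows one way such a reporting mechanism might select documents by topic, minimum topic share, and date range. It is purely illustrative and does not describe an existing DfR Browser feature: the `DocRecord` structure, the `find_documents` function, its parameters, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """Hypothetical per-document record: metadata plus topic proportions."""
    doc_id: str
    year: int
    topic_shares: dict = field(default_factory=dict)  # topic id -> share (0 to 1)


def find_documents(records, topic, min_share, start_year, end_year):
    """Return records from the given years in which `topic` accounts for at
    least `min_share` of the words, ordered from highest share to lowest."""
    matches = [
        r for r in records
        if start_year <= r.year <= end_year
        and r.topic_shares.get(topic, 0.0) >= min_share
    ]
    return sorted(
        matches,
        key=lambda r: r.topic_shares.get(topic, 0.0),
        reverse=True,
    )


# Hypothetical sample records.
records = [
    DocRecord("a", 1885, {5: 0.40}),
    DocRecord("b", 1890, {5: 0.10, 7: 0.60}),
    DocRecord("c", 1901, {5: 0.70}),
    DocRecord("d", 1888, {5: 0.55}),
]

report = find_documents(records, topic=5, min_share=0.25,
                        start_year=1880, end_year=1895)
report_ids = [r.doc_id for r in report]
print(report_ids)
```

A filter of this kind runs comfortably in memory for a small corpus; for a corpus the size of a denominational periodical archive, it is the sort of query that a database-backed version of the browser would be better placed to serve.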

A number of these changes push the interface beyond what can be achieved within a static site, as the memory load on the browser would become difficult to support. As such, it is also worth considering an adaptation of the browser that relies on a database for serving the corpus and model data. This modification would remove some of the advantages of the static-site approach but would make the browser a more robust tool for research.

Leveraging a topic model for historical research is a multifaceted problem, extending from the work of selection and preparation of a set of documents to the interfaces for interpreting and utilizing the resulting data. While experimental work in the digital humanities has focused largely on the process of running topic modeling algorithms on a set of documents, there is significant research to be done on the pre-processing and post-processing steps that make the running of the algorithm possible and legible. As both of these greatly shape the results of the computational analysis and its usefulness in historical research, they are vital areas for further scholarly attention from those working at the intersection of computation and the humanities.

¹ I provide details on the construction and computational use of the topic model within the dissertation and the dissertation notebooks.
² This suggests a future change to the browser view, where topic assignments under some threshold are dropped from the data prior to the steps of aggregating, smoothing, and normalizing the topic assignment data. In testing for the threshold, I would begin around 5% of the words in the document.
³ For a technical introduction to the various parameters involved in creating and optimizing a topic model, see Hanna M. Wallach, David M. Mimno, and Andrew McCallum, “Rethinking LDA: Why Priors Matter,” Neural Information Processing Systems 2009 22 (2009): 1973–81, https://papers.nips.cc/paper/3854-rethinking-lda-why-priors-matter. See also the appendix to Andrew Goldstone and Ted Underwood, “What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship?” Journal of Digital Humanities 2, no. 1 (2013), http://journalofdigitalhumanities.org/2-1/what-can-topic-models-of-pmla-teach-us-by-ted-underwood-and-andrew-goldstone/.
⁴ The current version of the browser does enable some customization on this front during setup, but not such fluid manipulation on the user end.