Downloading Corpus Files

The source base for A Gospel of Health and Salvation is the collection of scanned periodicals produced by the Office of Archives, Statistics, and Research of the Seventh-day Adventist Church (SDA). That this collection of documents is openly available on the web has been fundamental to the success of this project. One of the greatest challenges for historical scholarship that seeks to leverage large digital collections is access to the relevant materials. While projects such as Chronicling America and resources such as the Digital Public Library of America are indispensable, many specialized resources are available only through proprietary databases and library subscriptions that impose limits on the ways scholars can interact with their resources.[^1]

The SDA's decision to publish the digital periodicals on the open web made it unnecessary for me to navigate the firewalls (and legal land mines) of using text from major library databases, a major boon for the digital project.[^3] And, although the site does not provide an API for accessing the documents, the structure of the pages is regular, making the site a good candidate for web scraping. However, relying on an organization to provide its own historical documents raises challenges of its own. Because of the interests of the hosting organization, in this case the Seventh-day Adventist Church, the collection both is shaped by and shapes a particular narrative of the denomination's history and development. For example, issues of Good Health, which was published by John Harvey Kellogg, are (almost entirely) absent from the SDA's collection after 1907, which corresponds to the point when Kellogg was disfellowshipped from the denomination, even though Kellogg continued its publication into the 1940s.[^2] Such interests do not invalidate the usefulness of the collection, as all archives have limitations and goals, but they need to be acknowledged and taken into account in the analysis.

To determine the list of titles that applied to my time and regions of study, I browsed through all of the titles in the periodicals section of the site and compiled a list of titles that fit my geographic and temporal constraints. These are:

As this was my first technical task for the dissertation, my initial method for identifying the URLs of the documents I wanted to download was rather manual. I saved an .html file for each index page that listed documents I wanted to download. I then passed those .html files to a script (similar to that recorded here) that used BeautifulSoup to extract the PDF ids, reconstruct the URLs, and write the URLs to a new text file, scrapeList.txt. After manually deleting the URLs for any documents that were out of range, I passed the scrapeList.txt file to wget using the following syntax:[^4]

wget -i scrapeList.txt -w 2 --limit-rate=200k

I ran this process for each of the periodical titles included in this study. It took approximately a week to download all 13,000 files to my local machine. The resulting corpus takes up 27.19 GB of space.
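
The manual step described above (saved index page in, list of PDF URLs out) can be sketched roughly as follows. This is an illustration rather than the original script: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, it assumes the saved pages use the same `ms-vb-title` table cells that the later cells in this notebook look for, and the names `TitleCellParser`, `urls_from_index_html`, and `write_scrape_list` are hypothetical.

```python
from html.parser import HTMLParser


class TitleCellParser(HTMLParser):
    """Collect the text of the first link in each <td class="ms-vb-title">."""

    def __init__(self):
        super().__init__()
        self.in_title_cell = False
        self.cell_has_link = False
        self.in_link = False
        self.pdf_ids = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "ms-vb-title" in classes:
            self.in_title_cell = True
            self.cell_has_link = False
        elif tag == "a" and self.in_title_cell and not self.cell_has_link:
            self.in_link = True
            self.cell_has_link = True
            self.pdf_ids.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "td":
            self.in_title_cell = False

    def handle_data(self, data):
        if self.in_link:
            self.pdf_ids[-1] += data


def urls_from_index_html(html, base_url):
    """Return one PDF URL per document ID found in a saved index page."""
    parser = TitleCellParser()
    parser.feed(html)
    parser.close()
    ids = [pdf_id.strip() for pdf_id in parser.pdf_ids]
    # base_url is assumed to end with "/", as the SSQ root used below does.
    return [base_url + pdf_id + ".pdf" for pdf_id in ids if pdf_id]


def write_scrape_list(urls, path="scrapeList.txt"):
    """Write one URL per line, the input format expected by `wget -i`."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


# A tiny stand-in for a saved index page.
sample = (
    '<table><tr>'
    '<td class="ms-vb-title"><a href="#">SS18880101-01</a></td>'
    '<td class="ms-vb2">1888</td>'
    '</tr></table>'
)
urls = urls_from_index_html(sample, "http://documents.adventistarchives.org/SSQ/")
```

Calling `write_scrape_list(urls)` would then produce the text file passed to wget. In the wget command, `-w 2` waits two seconds between retrievals and `--limit-rate=200k` caps the download speed, both of which keep the load on the host server modest.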

This notebook reflects a more automated version of that process, created in 2017 to download missing documents. The example recorded here is for downloading the Sabbath School Quarterly collection, which I missed during my initial collection phase.

In these scripts I use the requests library to retrieve the HTML from the document directory pages and BeautifulSoup4 to locate the filenames. Finally, I use requests to download the files.

In [3]:
from bs4 import BeautifulSoup
from os.path import join
import re
import requests
In [4]:
def check_year(pdfID):
    """Use regex to check the year from the PDF filename.

    Args:
        pdfID (str): The filename of the PDF object, formatted as
            PREFIXYYYYMMDD-V00-00

    Returns:
        bool: True if the document was published before 1921.
    """
    split_title = pdfID.split('-')
    title_date = split_title[0]
    # The first run of digits is the YYYYMMDD date; keep the first four.
    date = re.findall(r'[0-9]+', title_date)
    year = date[0][:4]
    return int(year) < 1921


def filename_from_html(content):
    """Use BeautifulSoup to extract the PDF ids from the HTML page. 

    This script is customized to the structure of the archive pages at
    http://documents.adventistarchives.org/Periodicals/Forms/AllFolders.aspx.

    Args:
        content (str): Content is retrieved from a URL using the `get_html_page` 
            function.
    """
    soup = BeautifulSoup(content, "lxml")
    buttons = soup.find_all('td', class_="ms-vb-title")

    pdfIDArray = []

    for each in buttons:
        links = each.find('a')
        pdfID = links.get_text()
        pdfIDArray.append(pdfID)

    return pdfIDArray


def get_html_page(url):
    """Use the requests library to get HTML content from URL

    Args:
        url (str): URL of webpage with content to download.
    """
    r = requests.get(url)
    # Fail loudly on HTTP errors rather than parsing an error page.
    r.raise_for_status()

    return r.text
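
To make the ID format concrete, here is a small sketch of how an ID such as `SS18900415-02e1` breaks down. It is an illustrative variant of `check_year`, not part of the original notebook. The names `ID_PATTERN`, `parse_pdf_id`, and `in_study_period` are hypothetical, and the pattern assumes a letters-only prefix followed by an eight-digit date, which matches the IDs listed later in this notebook but may not hold for every title.

```python
import re

# Assumed shape: letters-only prefix, then YYYYMMDD. Anything after the
# date (volume, issue, "e" suffixes) is ignored.
ID_PATTERN = re.compile(
    r"^(?P<prefix>[A-Za-z]+)(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
)


def parse_pdf_id(pdf_id):
    """Return the prefix and date parts of an ID, or None if it does not match."""
    match = ID_PATTERN.match(pdf_id)
    if match is None:
        return None
    return {
        "prefix": match.group("prefix"),
        "year": int(match.group("year")),
        "month": int(match.group("month")),
        "day": int(match.group("day")),
    }


def in_study_period(pdf_id, cutoff_year=1921):
    """True if the ID parses and its year is before cutoff_year."""
    parsed = parse_pdf_id(pdf_id)
    return parsed is not None and parsed["year"] < cutoff_year


parsed = parse_pdf_id("SS18900415-02e1")
```

Unlike `check_year`, which raises an `IndexError` when an ID contains no digits, this variant returns `False` for link text that does not look like a document ID.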

The first step is to set the directory where I want to save the downloaded documents, as well as the root URL for the location of the PDF documents.

This example is set up for the Sabbath School Quarterly.

In [6]:
"""If running locally, you will need to create the `corpus` folder or 
update the path to the location of your choice.
"""
download_directory = "/Users/jeriwieringa/Desktop/test"
baseurl = "http://documents.adventistarchives.org/SSQ/"

My next step is to generate a list of the IDs for the documents I want to download.

Here I download the HTML from the index page URLs and extract the document IDs. To avoid downloading any files outside of my study, I check the year in the ID before adding the document ID to my list of documents to download.

In [7]:
index_page_urls = ["http://documents.adventistarchives.org/SSQ/Forms/AllItems.aspx?View={44c9b385-7638-47af-ba03-cddf16ec3a94}&SortField=DateTag&SortDir=Asc",
              "http://documents.adventistarchives.org/SSQ/Forms/AllItems.aspx?Paged=TRUE&p_SortBehavior=0&p_DateTag=1912-10-01&p_FileLeafRef=SS19121001-04%2epdf&p_ID=457&PageFirstRow=101&SortField=DateTag&SortDir=Asc&&View={44C9B385-7638-47AF-BA03-CDDF16EC3A94}"
             ]
In [8]:
docs_to_download = []

for url in index_page_urls: 
    content = get_html_page(url)
    pdfs = filename_from_html(content)
    
    for pdf in pdfs:
        # Keep only the documents published within the study period.
        if check_year(pdf):
            print("Adding {} to download list".format(pdf))
            docs_to_download.append(pdf)
Adding SS18880101-01 to download list
Adding SS18880701-03 to download list
Adding SS18890101-01 to download list
Adding SS18890701-03 to download list
Adding SS18891001-04 to download list
Adding SS18900104-01 to download list
Adding SS18900301-01e to download list
Adding SS18900405-02 to download list
Adding SS18900415-02e1 to download list
Adding SS18900501-02e2 to download list
Adding SS18900705-03 to download list
Adding SS18901004-04 to download list
Adding SS18910103-01 to download list
Adding SS18910201-02 to download list
Adding SS18910601-03 to download list
Adding SS18911001-04 to download list
Adding SS18920101-01 to download list
Adding SS18920401-02 to download list
Adding SS18920701-03 to download list
Adding SS18921001-04 to download list
Adding SS18930101-01 to download list
Adding SS18930401-02 to download list
Adding SS18930701-03 to download list
Adding SS18931001-04 to download list
Adding SS18940101-01 to download list
Adding SS18940401-02 to download list
Adding SS18940701-03 to download list
Adding SS18941001-04 to download list
Adding SS18950101-01 to download list
Adding SS18950401-02 to download list
Adding SS18950701-03 to download list
Adding SS18951001-04 to download list
Adding SS18960101-01 to download list
Adding SS18960401-02 to download list
Adding SS18960701-03 to download list
Adding SS18961001-04 to download list
Adding SS18970101-01 to download list
Adding SS18970402-02 to download list
Adding SS18970701-03 to download list
Adding SS18971001-04 to download list
Adding SS18980101-01 to download list
Adding SS18980401-02 to download list
Adding SS18980701-03 to download list
Adding SS18981001-04 to download list
Adding SS18990101-01 to download list
Adding SS18990401-02 to download list
Adding SS18990701-03 to download list
Adding SS18991001-04 to download list
Adding SS19000101-01 to download list
Adding SS19000401-02 to download list
Adding SS19000701-03 to download list
Adding SS19001001-04 to download list
Adding SS19010101-01 to download list
Adding SS19010401-02 to download list
Adding SS19010701-03 to download list
Adding SS19011001-04 to download list
Adding SS19020101-01 to download list
Adding SS19020401-02 to download list
Adding SS19020701-03 to download list
Adding SS19021001-04 to download list
Adding SS19030101-01 to download list
Adding SS19030401-02 to download list
Adding SS19030701-03 to download list
Adding SS19031001-04 to download list
Adding SS19040101-01 to download list
Adding SS19040401-02 to download list
Adding SS19040701-03 to download list
Adding SS19041001-04 to download list
Adding SS19050101-01 to download list
Adding SS19050401-02 to download list
Adding SS19050701-03 to download list
Adding SS19051001-04 to download list
Adding SS19060101-01 to download list
Adding SS19060401-02 to download list
Adding SS19060701-03 to download list
Adding SS19061001-04 to download list
Adding SS19070101-01 to download list
Adding SS19070401-02 to download list
Adding SS19070701-03 to download list
Adding SS19071001-04 to download list
Adding SS19080101-01 to download list
Adding SS19080401-02 to download list
Adding SS19080701-03 to download list
Adding SS19081001-04 to download list
Adding SS19090101-01 to download list
Adding SS19090401-02 to download list
Adding SS19090701-03 to download list
Adding SS19091001-04 to download list
Adding SS19100101-01 to download list
Adding SS19100401-02 to download list
Adding SS19100701-03 to download list
Adding SS19101001-04 to download list
Adding SS19110101-01 to download list
Adding SS19110401-02 to download list
Adding SS19110701-03 to download list
Adding SS19111001-04 to download list
Adding SS19120101-01 to download list
Adding SS19120401-02 to download list
Adding SS19120701-03 to download list
Adding SS19121001-04 to download list
Adding SS19130101-01 to download list
Adding SS19130401-02 to download list
Adding SS19130701-03 to download list
Adding SS19131001-04 to download list
Adding SS19140101-01 to download list
Adding SS19140401-02 to download list
Adding SS19140701-03 to download list
Adding SS19141001-04 to download list
Adding SS19150101-01 to download list
Adding SS19150401-02 to download list
Adding SS19150701-03 to download list
Adding SS19151001-04 to download list
Adding SS19160101-01 to download list
Adding SS19160401-02 to download list
Adding SS19160701-03 to download list
Adding SS19161001-04 to download list
Adding SS19170101-01 to download list
Adding SS19170401-02 to download list
Adding SS19170701-03 to download list
Adding SS19171001-04 to download list
Adding SS19180101-01 to download list
Adding SS19180401-02 to download list
Adding SS19180701-03 to download list
Adding SS19181001-04 to download list
Adding SS19190101-01 to download list
Adding SS19190401-02 to download list
Adding SS19190701-03 to download list
Adding SS19191001-04 to download list
Adding SS19200101-01 to download list
Adding SS19200401-02 to download list
Adding SS19200701-03 to download list
Adding SS19201001-04 to download list

Finally, I loop through all of the filenames, create the URL to the PDF, and use requests to download a copy of the document into my directory for processing.

In [11]:
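for doc_name in docs_to_download:
    # baseurl ends with "/", so simple formatting builds the URL.
    # os.path.join is meant for filesystem paths and would insert "\" on Windows.
    url = "{}{}.pdf".format(baseurl, doc_name)
    print(url)
    # Request first so that a failed download does not leave an empty file.
    response = requests.get(url)
    response.raise_for_status()
    with open(join(download_directory, "{}.pdf".format(doc_name)), "wb") as file:
        # write to file
        file.write(response.content)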
http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18880701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18890101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18890701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18891001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18900104-01.pdf
http://documents.adventistarchives.org/SSQ/SS18900301-01e.pdf
http://documents.adventistarchives.org/SSQ/SS18900405-02.pdf
http://documents.adventistarchives.org/SSQ/SS18900415-02e1.pdf
http://documents.adventistarchives.org/SSQ/SS18900501-02e2.pdf
http://documents.adventistarchives.org/SSQ/SS18900705-03.pdf
http://documents.adventistarchives.org/SSQ/SS18901004-04.pdf
http://documents.adventistarchives.org/SSQ/SS18910103-01.pdf
http://documents.adventistarchives.org/SSQ/SS18910201-02.pdf
http://documents.adventistarchives.org/SSQ/SS18910601-03.pdf
http://documents.adventistarchives.org/SSQ/SS18911001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18920101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18920401-02.pdf
http://documents.adventistarchives.org/SSQ/SS18920701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18921001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18930101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18930401-02.pdf
http://documents.adventistarchives.org/SSQ/SS18930701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18931001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18940101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18940401-02.pdf
http://documents.adventistarchives.org/SSQ/SS18940701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18941001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18950101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18950401-02.pdf
http://documents.adventistarchives.org/SSQ/SS18950701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18951001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18960101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18960401-02.pdf
http://documents.adventistarchives.org/SSQ/SS18960701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18961001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18970101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18970402-02.pdf
http://documents.adventistarchives.org/SSQ/SS18970701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18971001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18980101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18980401-02.pdf
http://documents.adventistarchives.org/SSQ/SS18980701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18981001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18990101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18990401-02.pdf
http://documents.adventistarchives.org/SSQ/SS18990701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18991001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19000101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19000401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19000701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19001001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19010101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19010401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19010701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19011001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19020101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19020401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19020701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19021001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19030101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19030401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19030701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19031001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19040101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19040401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19040701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19041001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19050101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19050401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19050701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19051001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19060101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19060401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19060701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19061001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19070101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19070401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19070701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19071001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19080101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19080401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19080701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19081001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19090101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19090401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19090701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19091001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19100101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19100401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19100701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19101001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19110101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19110401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19110701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19111001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19120101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19120401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19120701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19121001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19130101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19130401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19130701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19131001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19140101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19140401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19140701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19141001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19150101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19150401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19150701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19151001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19160101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19160401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19160701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19161001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19170101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19170401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19170701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19171001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19180101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19180401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19180701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19181001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19190101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19190401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19190701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19191001-04.pdf
http://documents.adventistarchives.org/SSQ/SS19200101-01.pdf
http://documents.adventistarchives.org/SSQ/SS19200401-02.pdf
http://documents.adventistarchives.org/SSQ/SS19200701-03.pdf
http://documents.adventistarchives.org/SSQ/SS19201001-04.pdf
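
The loop above downloads every file in one pass. A more cautious variant, sketched below, pauses between requests (mirroring the two-second wait of the earlier wget command) and skips files that already exist, so an interrupted run can be resumed. This is a sketch under stated assumptions rather than the code used for the project: `download_documents` and `fetch_bytes` are hypothetical names, and the default fetcher uses the standard library's `urllib` in place of `requests` so the example runs without extra packages.

```python
import os
import time
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_bytes(url):
    """Default fetcher (an assumption): urllib standing in for requests.get."""
    with urlopen(url, timeout=60) as response:
        return response.read()


def download_documents(doc_ids, base_url, out_dir, fetch=fetch_bytes, delay=2.0):
    """Save each PDF once, pausing `delay` seconds after every download.

    Returns the list of file paths written during this call.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for doc_id in doc_ids:
        filename = "{}.pdf".format(doc_id)
        target = os.path.join(out_dir, filename)
        if os.path.exists(target):
            # Already downloaded on an earlier run.
            continue
        # urljoin handles the slash between the root URL and the filename.
        data = fetch(urljoin(base_url, filename))
        with open(target, "wb") as handle:
            handle.write(data)
        saved.append(target)
        time.sleep(delay)
    return saved
```

With the variables defined earlier, a call might look like `download_documents(docs_to_download, baseurl, download_directory)`. Passing the fetcher as a parameter also makes it possible to try the loop against a stub function before pointing it at the live site.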
