Process-metadata-for-dfrbrowser

Metadata for browser:

  • doi: url
  • title: Expanded Name (e.g., Review and Herald (1.1) Dec. 12, 1870)
  • authors: blank
  • journal: expanded name
  • volume: number
  • issue: number
  • date: yyyy-mm-dd
  • page: page
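
As a rough illustration of how these fields end up in one CSV row per page, the sketch below writes a single row to an in-memory buffer. The values are assumptions derived by hand from the first document ID in the corpus, and the column order mirrors the `writer.writerow` call at the end of this notebook (document ID first, URL and abbreviation last) rather than the order of the list above.

```python
import csv
import io

# Illustrative values only, worked out by hand from one document ID.
row = [
    "ADV18981201-V02-01-page12.txt",                                # id
    "Training School Advocate (Vol. 02.01) Dec 01, 1898, page 12",  # title
    "NA",                                                           # authors (left blank)
    "Training School Advocate",                                     # journal
    "02",                                                           # volume
    "01",                                                           # issue
    "1898-12-01",                                                   # date as yyyy-mm-dd
    "12",                                                           # page
    "http://documents.adventistarchives.org/Periodicals/ADV/ADV18981201-V02-01.pdf",
    "ADV",                                                          # abbreviation
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)
line = buffer.getvalue()

# Reading the line back recovers the same ten fields.
parsed = next(csv.reader(io.StringIO(line)))
print(line)
```

Because the title contains commas, `csv.QUOTE_MINIMAL` wraps that one field in quotes and leaves the others bare.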
In [1]:
import csv
import datetime
from text2topics import utilities
import pandas as pd
import re
import urllib.parse
In [2]:
base_url = "http://documents.adventistarchives.org/Periodicals/"
In [3]:
title_keys = {"ADV": "Training School Advocate",
"AmSn": "American Sentinel",
"ARAI": "Advent Review and Sabbath Herald Anniversary Issue",
"CE": "Christian Education",
"CUV": "Welcome Visitor (Columbia Union Visitor)",
"EDU": "Christian Educator",
"GCB": "General Conference Bulletin",
"GH": "Gospel Herald",
"GOH": "Gospel of Health",
"GS": "Gospel Sickle",
"HM": "Home Missionary",
"HR": "Health Reformer",
"IR": "Indiana Reporter",
"LB": "Life Boat",
"LH": "Life and Health",
"LibM": "Liberty",
"LUH": "Lake Union Herald",
"NMN": "North Michigan News Sheet",
"PHJ": "Pacific Health Journal and Temperance Advocate",
"PTAR": "Present Truth (Advent Review)",
"PUR": "Pacific Union Recorder",
"RH": "Review and Herald",
"Sligo": "Sligonian",
"SOL": "Sentinel of Liberty",
"ST": "Signs of the Times",
"SUW": "Report of Progress, Southern Union Conference",
"TCOG": "The Church Officer's Gazette",
"TMM": "The Missionary Magazine",
"WMH": "West Michigan Herald",
"YI": "Youth's Instructor"}
In [4]:
df = pd.read_table('/Users/jeriwieringa/Dissertation/models/data/target_300_10_Sample.txt',
                  header=None,
                  names=['doc_id', 'label', 'content'])
In [5]:
df
Out[5]:
doc_id label content
0 ADV18981201-V02-01-page12.txt en uncommon thing member boarding school form cl...
1 ADV18981201-V02-01-page13.txt en lex truly lord able save uttermost bound long...
2 ADV18981201-V02-01-page15.txt en personal experience lord impressed written te...
3 ADV18981201-V02-01-page16.txt en pride heart thought foolish think thing isa p...
4 ADV18981201-V02-01-page20.txt en anew important department college college con...
5 ADV18990101-V01-01-page34.txt en advocate church school miss writes school jun...
6 ADV18990101-V01-01-page35.txt en the_advocate younger year old frightened thre...
7 ADV18990101-V01-01-page44.txt en the_advocate library reading_room furnished m...
8 ADV18990101-V01-01-page45.txt en the_advocate student desire work order reduce...
9 ADV18990201-V01-02-page54.txt en advocate weight pound ounce silver charger cu...
10 ADV18990301-V01-03-page41.txt en the_advocate power spirit saith lord host wor...
11 ADV18990301-V01-03-page60.txt en spring announcement library reading_room furn...
12 ADV18990301-V01-03-page61.txt en spring announcement point new student soon po...
13 ADV18990501-V01-05-page47.txt en advocate word work church school teacher beau...
14 ADV18990501-V01-05-page48.txt en the_advocate low isn queer took long think ou...
15 ADV18990501-V01-05-page49.txt en the_advocate foundation building foundation t...
16 ADV18990601-V01-06-page110.txt en advocate difficulty rose mountain high faith ...
17 ADV18990601-V01-06-page111.txt en the_advocate close school student entered can...
18 ADV18990601-V01-06-page114.txt en the_advocate little house cut basement school...
19 ADV18990601-V01-06-page115.txt en the_advocate bers kept coining present enrolm...
20 ADV18990601-V01-06-page116.txt en the_advocate mission playground little alley ...
21 ADV18990601-V01-06-page118.txt en the_advocate est increased daily subject disc...
22 ADV18990601-V01-06-page119.txt en pea advocate opened enrolled day member ship ...
23 ADV18990601-V01-06-page121.txt en the_advocate farming hour spend farmer outof ...
24 ADV18990901-V01-08-page41.txt en the_advocate duplicated country matter seen n...
25 ADV18991001-V01-09-page10.txt en advocate ion kellogg let consider aim purpose...
26 ADV18991001-V01-09-page11.txt en the_advocate hud grows swell swell finally un...
27 ADV18991001-V01-09-page12.txt en the_advocate digestion respect doe thinking t...
28 ADV18991001-V01-09-page13.txt en tee advocate teacher provide best possible co...
29 ADV18991001-V01-09-page14.txt en the_advocate school time come right line god ...
... ... ... ...
180814 YI19200106-V68-01-page6.txt en marquesan beauty woman slavery far land raid ...
180815 YI19200106-V68-01-page7.txt en january youth instructor correct thing conven...
180816 YI19200106-V68-01-page8.txt en the_youth young clean looking railroad conduc...
180817 YI19200106-V68-01-page9.txt en january youth instructor free bagging best po...
180818 YI19200113-V68-02-page10.txt en the_youth instructor january think alice smoo...
180819 YI19200113-V68-02-page11.txt en january youth instructor honestly know teach ...
180820 YI19200113-V68-02-page12.txt en instructor january missionary volunteer depar...
180821 YI19200113-V68-02-page13.txt en january youth instructor christ crucified lat...
180822 YI19200113-V68-02-page14.txt en outh gift clever speech wider flung wit sense...
180823 YI19200113-V68-02-page2.txt en belgium decided electrify railroad beginning ...
180824 YI19200113-V68-02-page3.txt en youth instructor vol lxvii takoma park washin...
180825 YI19200113-V68-02-page4.txt en the_youth america democratic president govern...
180826 YI19200113-V68-02-page5.txt en january youth instructor grow little new wood...
180827 YI19200113-V68-02-page6.txt en the_youth instructor january calmly confident...
180828 YI19200113-V68-02-page7.txt en january youth instructor sharp intrusion righ...
180829 YI19200113-V68-02-page8.txt en the_youth instructor january beginning day bi...
180830 YI19200113-V68-02-page9.txt en january youth instructor recital event dread ...
180831 YI19200120-V68-03-page10.txt en the_youth instructor january federacy prey un...
180832 YI19200120-V68-03-page11.txt en january youth instructor business principle g...
180833 YI19200120-V68-03-page12.txt en junior teaching harold gregg iome let said jo...
180834 YI19200120-V68-03-page13.txt en january youth instructor said dick hope borro...
180835 YI19200120-V68-03-page14.txt en the_youth instructor january issionary volunt...
180836 YI19200120-V68-03-page2.txt en sir william osler noted physician connected j...
180837 YI19200120-V68-03-page3.txt en youth instructor iii takoma park washington j...
180838 YI19200120-V68-03-page4.txt en the_youth instructor january yard away glimps...
180839 YI19200120-V68-03-page5.txt en january youth instructor tryphena tryphosa la...
180840 YI19200120-V68-03-page6.txt en the_youth instructor january gave aviary year...
180841 YI19200120-V68-03-page7.txt en january youth instructor going succeed druggi...
180842 YI19200120-V68-03-page8.txt en the_youth instructor january missionary land ...
180843 YI19200120-V68-03-page9.txt en january youth instructor echo history futile ...

180844 rows × 3 columns

In [6]:
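def get_split_id(doc_id):
    """Split an ID such as 'ADV18981201-V02-01-page12.txt' on hyphens."""
    return doc_id.split('-')

def construct_url(base_url, abrev, split_id):
    """Build the URL of the issue PDF by dropping the page segment of the ID."""
    return urllib.parse.urljoin(base_url, "{}/{}.pdf".format(abrev, "-".join(split_id[:-1])))

def get_date(split_id):
    """Return the digits of the first segment (e.g. '18981201')."""
    return re.search(r'[0-9]+', split_id[0]).group()

def get_volume(split_id):
    """Return the digits of the volume segment (e.g. '02' from 'V02')."""
    return re.search(r'[0-9]+', split_id[1]).group()

def get_issue(split_id):
    """Return the issue segment unchanged (e.g. '01')."""
    return split_id[2]

def get_page(split_id):
    """Return the digits of the page segment (e.g. '12' from 'page12.txt')."""
    return re.search(r'[0-9]+', split_id[3]).group()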
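
To make the helpers above concrete, here is a minimal worked example that applies the same splitting and regular expressions to the first document ID in the DataFrame. One step is an assumption: the abbreviation is taken to be the leading run of letters in the ID, whereas the notebook obtains it from `utilities.get_title`, whose implementation is not shown here.

```python
import re
import urllib.parse

base_url = "http://documents.adventistarchives.org/Periodicals/"

# Document ID copied from the first row of the DataFrame preview.
doc_id = "ADV18981201-V02-01-page12.txt"

# Assumption: the abbreviation is the leading run of letters in the ID.
# The notebook itself gets it from utilities.get_title, which is not shown.
abrev = re.match(r"[A-Za-z]+", doc_id).group()

# Same steps as get_split_id, get_date, get_volume, get_issue, and get_page.
split_id = doc_id.split("-")
date_digits = re.search(r"[0-9]+", split_id[0]).group()
volume = re.search(r"[0-9]+", split_id[1]).group()
issue = split_id[2]
page = re.search(r"[0-9]+", split_id[3]).group()

# Drop the page segment to get the name of the issue-level PDF.
url = urllib.parse.urljoin(
    base_url, "{}/{}.pdf".format(abrev, "-".join(split_id[:-1]))
)

print(abrev, date_digits, volume, issue, page)
print(url)
```

Because `base_url` ends with a slash, `urljoin` appends the relative path instead of replacing the last path component.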
In [7]:
with open('/Users/jeriwieringa/Dissertation/browser/data/meta.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for index, row in df.iterrows():
        _id = row['doc_id'].strip()
        abrev = utilities.get_title(_id)
        split_id = get_split_id(_id)
        date_data = get_date(split_id)
        url = construct_url(base_url, abrev, split_id)

        # IDs carry a year, a year and month, or a full date.
        try:
            if len(date_data) == 4:
                date = datetime.datetime.strptime(date_data, "%Y").date()
            elif len(date_data) < 7:
                date = datetime.datetime.strptime(date_data, "%Y%m").date()
            else:
                date = datetime.datetime.strptime(date_data, "%Y%m%d").date()
        except ValueError:
            # Fall back to year and month when the full string does not parse.
            date = datetime.datetime.strptime(date_data[:6], "%Y%m").date()

        # Use "XX" when the ID has no parsable volume or issue segment.
        try:
            volume = get_volume(split_id)
        except (IndexError, AttributeError):
            volume = "XX"

        try:
            issue = get_issue(split_id)
        except IndexError:
            issue = "XX"

        page = get_page(split_id)
        journal = title_keys[abrev]
        title = "{} (Vol. {}.{}) {}, page {}".format(journal, volume, issue, date.strftime('%b %d, %Y'), page)
        writer.writerow([_id, title, "NA", journal, volume, issue, date, page, url, abrev])
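
The date handling in the loop above can also be read as a small standalone function. The sketch below is one possible factoring, not part of the original notebook: the name `parse_issue_date` is hypothetical, and the invalid date in the usage lines is made up to exercise the fallback.

```python
import datetime

def parse_issue_date(digits):
    """Parse a YYYY, YYYYMM, or YYYYMMDD digit string into a date.

    Hypothetical helper mirroring the branching used in the loop:
    4 digits are a year, fewer than 7 are year and month, and
    anything longer is treated as a full date.
    """
    if len(digits) == 4:
        fmt = "%Y"
    elif len(digits) < 7:
        fmt = "%Y%m"
    else:
        fmt = "%Y%m%d"
    try:
        return datetime.datetime.strptime(digits, fmt).date()
    except ValueError:
        # Fall back to the first six digits, read as year and month.
        return datetime.datetime.strptime(digits[:6], "%Y%m").date()

full = parse_issue_date("18981201")        # full date from a real ID
year_only = parse_issue_date("1898")       # missing parts default to 1
bad_day = parse_issue_date("18990230")     # made-up invalid day, falls back
print(full.isoformat(), year_only.isoformat(), bad_day.isoformat())
```

When the month or day is missing or unusable, `strptime` defaults it to 1, so such rows are dated to the first of the month or year.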