ADV-OCR-Evaluation-and-Correction
OCR Evaluation, Correction, and Normalization for The Training School Advocate (ADV).¶
Summary¶
(Link to title overview for more information about The Training School Advocate.)
The layout and typesetting of The Training School Advocate make this a challenging title for Optical Character Recognition. In particular, this title suffers from line-ending problems, where the OCR engine was unable to recognize the hyphen that indicates that a word was split across a line break. Using my spelling_dictionary of verified words (link to discussion of creating that dictionary), the baseline overall average of correct tokens was 92.328%. By correcting regularly occurring OCR errors (removing special characters, addressing line endings, reconnecting split words, and identifying and reconstructing "burst" words), I improved the overall average to 96.56%. In addition, I sought to improve the overall accuracy, and to improve the dataset for clustering, by normalizing the corpus. This involves correcting some common spelling errors, some of which are caused by OCR misrecognition and some of which are editorial mistakes in the original publication. With normalization, I was able to improve the verified token rate to 96.88%. I examined the remaining documents with an error rate above 30%; these were all title, index, and advertisement pages, which offer little content value for my studies. As additional interventions were providing little marginal improvement in the verified OCR rate, I decided to end my correction and normalization efforts at this point.
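As a rough illustration of the metric used throughout this notebook, the sketch below computes the share of tokens that appear in a set of verified words. The function name `verified_rate` and the sample data are hypothetical; how `reports.overview_report` tokenizes text and averages across documents is not shown here, so this only approximates the idea.

```python
def verified_rate(tokens, dictionary):
    """Return the fraction of tokens found in the dictionary (case-insensitive)."""
    if not tokens:
        return 0.0
    verified = sum(1 for token in tokens if token.lower() in dictionary)
    return verified / len(tokens)


# Hypothetical sample: two of the four tokens are verified words.
sample_dictionary = {"the", "school"}
sample_tokens = ["The", "chil", "dren", "school"]
print(verified_rate(sample_tokens, sample_dictionary))
```

A word split at a line break counts as two errors under this measure, which is one reason the line-ending problem weighs so heavily on this title.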
%load_ext autoreload
%autoreload 2
from text2topics import reports
from text2topics import utilities
from text2topics import clean
import re
import os
from os import listdir
from os.path import isfile, join
import collections
%matplotlib inline
wordlist_dir = "/Users/jeriwieringa/Dissertation/drafts/data/word-lists"
wordlists = ["2016-12-07-SDA-last-names.txt",
"2016-12-07-SDA-place-names.txt",
"2016-12-08-SDA-Vocabulary.txt",
"2017-01-03-place-names.txt",
"2017-02-14-Base-Word-List-SCOWL&KJV.txt",
"2017-03-01-Additional-Approved-Words.txt",
"2017-02-14-Roman-Numerals.txt"]
spelling_dictionary = utilities.create_spelling_dictionary(wordlist_dir, wordlists)
title = "ADV"
base_dir = "/Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/{}/".format(title)
Establishing the baseline¶
cycle = 'baseline'
stats = reports.overview_report(join(base_dir, cycle), spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/baseline Average verified rate: 0.9237448782373405 Average of error rates: 0.09447501892505678 Total token count: 1293500
errors_summary = reports.get_errors_summary( stats )
reports.top_errors( errors_summary, 50 )[:50]
[('ò', 5257), ('ó', 5047), ('e', 3991), ('ñ', 2666), ('t', 2451), ('w', 2403), ('m', 1910), ('r', 1708), ('n', 1587), ('f', 1319), ('d', 1198), ('-', 1049), ('õ', 991), ('*', 953), ('tion', 815), ('g', 750), ('godõs', 682), ('u', 601), ('re', 571), ('ô', 528), ('”', 498), (')', 493), ('õõ', 459), ('k', 456), ('“', 442), ('^', 422), ("'", 419), ('co', 399), ('dren', 323), ('ex', 318), ('ment', 309), ('th', 306), ('educa', 301), ('chil', 294), ('x', 261), ('\ufeff', 254), ('(', 234), ('ç', 227), ('ers', 209), ('tions', 203), ('è', 200), ('¥', 176), ('edu', 163), ('ence', 162), ('lordõs', 155), ('teachersõ', 154), ('—', 152), ('pre', 144), ('christõs', 143), ('un', 142)]
Correction 1 -- Replace "õ" with "'"¶
The first correction is to replace "õ" (or "Õ") with "'" wherever it follows one or more word characters, as in "godõs" or "teachersõ".
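To see what this substitution does and does not touch, here is a small check on sample tokens. The pattern is the one used in the correction; the sample list is illustrative only.

```python
import re

# One or more word characters followed by õ or Õ.
pattern = re.compile(r"(\w+)(õ|Õ)")

samples = ["godõs", "teachersõ", "õ"]
fixed = [pattern.sub(r"\1'", s) for s in samples]
print(fixed)
```

Because the pattern requires at least one preceding word character, a free-standing "õ" is left unchanged and is only dealt with when special characters are removed later.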
# %load shared_elements/replace_accented_o.py
prev = "baseline"
cycle = "correction1"
directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))
for filename in corpus:
    with open(join(directories['prev'], filename)) as f:
        content = f.read()
    # Replace õ/Õ that follows word characters with an apostrophe
    content = re.sub(r"(\w+)(õ|Õ)", r"\1'", content)
    # The with-block closes the file, so no explicit close() is needed
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction1 Average verified rate: 0.9266633165829146 Average of error rates: 0.09163701741105224 Total token count: 1293500
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 50 )[:50]
[('ò', 5257), ('ó', 5047), ('e', 3991), ('ñ', 2666), ('t', 2451), ('w', 2403), ('m', 1910), ('r', 1708), ('n', 1587), ('f', 1319), ('d', 1198), ('-', 1049), ('õ', 969), ('*', 953), ('tion', 815), ('g', 750), ('u', 601), ('re', 571), ('ô', 528), ('”', 498), (')', 493), ("õ'", 477), ('k', 456), ('“', 442), ("'", 441), ('^', 422), ('co', 399), ('dren', 323), ('ex', 318), ('ment', 309), ('th', 306), ('educa', 301), ('chil', 294), ('x', 261), ('\ufeff', 254), ('(', 234), ('ç', 227), ('ers', 209), ('tions', 203), ('è', 200), ('¥', 176), ('edu', 163), ('ence', 162), ('—', 152), ('pre', 144), ('un', 142), ('ac', 131), ('«', 130), ('ôô', 130), ('mis', 129)]
The next step is to remove special characters. First, rather than assume that everything uses the English alphabet, I filter for tokens with special characters and sort them by frequency. This allows me to quickly gauge whether languages other than English are in regular use. If so, we need to preserve the characters they use. If not, we can remove all special characters.
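The filtering step can be sketched as follows. This is not the implementation of `reports.tokens_with_special_characters`; the helper name `special_character_tokens`, the definition of "special" (anything other than ASCII letters, digits, apostrophes, and hyphens), and the sample counts are all assumptions made for illustration.

```python
import re

# Assumption: a token is "special" if it contains any character other than
# ASCII letters, digits, apostrophes, or hyphens.
SPECIAL = re.compile(r"[^a-zA-Z0-9'\-]")


def special_character_tokens(error_counts):
    """Return (token, count) pairs containing special characters, most frequent first."""
    flagged = [(tok, n) for tok, n in error_counts.items() if SPECIAL.search(tok)]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Illustrative sample of error tokens and their frequencies.
sample_counts = {"tion": 815, "ò": 5257, "godõs": 682, "-": 1049, "in\xad": 53}
print(special_character_tokens(sample_counts))
```

Sorting by frequency puts systematic encoding problems at the top, so a genuinely non-English vocabulary would show up as recognizable words rather than isolated symbols.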
Review Special Characters¶
reports.tokens_with_special_characters(errors_summary)[:100]
[('ò', 5257), ('ó', 5047), ('ñ', 2666), ('õ', 969), ('*', 953), ('ô', 528), ('”', 498), (')', 493), ("õ'", 477), ('“', 442), ('^', 422), ('\ufeff', 254), ('(', 234), ('ç', 227), ('è', 200), ('¥', 176), ('—', 152), ('ôô', 130), ('«', 130), ('>', 126), ('òthe', 120), ('_', 108), ('|', 106), ('»', 100), ('=', 97), ('£', 69), ('¡', 68), ('’', 67), ('%', 66), (']', 61), ('<', 61), ('\\', 59), ('/', 58), ('#', 57), ('òi', 54), ('in\xad', 53), ('**', 45), ('©', 41), ('õs', 38), ('con\xad', 37), ("the}'", 37), ('óñ', 34), ('ñthe', 34), ('òwe', 33), ('~', 31), ('•', 31), ('ôôthe', 31), ('re\xad', 31), ('ôthe', 30), ('be\xad', 30), ('chil\xad', 29), ('á', 28), ('’’', 28), ('[the', 28), ('«fr', 28), ('õtis', 27), ('ôliving', 26), ('òa', 26), ('de\xad', 26), ('***', 25), ('(a', 25), ('♦', 25), ('com\xad', 24), ('(the', 24), ('òin', 23), ('ôi', 22), ('(a)', 21), ('(sketch', 21), ('à', 21), ('ex\xad', 20), ('ã', 20), ('\\v', 20), ('*ô', 18), ('òit', 18), ('the}-', 18), ('twenty=five', 17), ('[', 17), ('{poem)', 17), ('(see', 16), ("õ'ñ", 16), ('{', 16), ('lessonsó', 16), ('òif', 16), ('ña', 16), ('¤', 16), ('(b)', 16), ("*'", 15), ('i*', 15), ("'ô", 15), ('òall', 15), ('stu\xad', 15), ('pro\xad', 15), ('ñe', 15), ("'õ", 15), ('im\xad', 14), ('‘', 14), ('(poem)', 14), ('educa\xad', 14), ('j*', 14), ('òhe', 13)]
Scanning the results, there is no evidence that this title regularly uses a non-English vocabulary, so we can proceed with removing the special characters. Before we do, however, it is also wise to standardize our character set for the dash and for the apostrophe.
Correction 2 -- Normalize Characters¶
The second correction standardizes the dash and apostrophe characters and then replaces all remaining special characters with a space.
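A minimal sketch of the intended order of operations, wrapped in a hypothetical `normalize_characters` function and run on a made-up string. The dash and apostrophe variants are written as character classes so that each variant is matched individually; they are spelled with Unicode escapes only to keep the look-alike characters distinguishable.

```python
import re


def normalize_characters(text):
    # Em dash, en dash, and non-breaking hyphen become a plain hyphen.
    text = re.sub("[\u2014\u2013\u2011]", "-", text)
    # Curly and accent-style apostrophes become a straight apostrophe.
    text = re.sub("[\u2019\u2018\u201b\u00b4]", "'", text)
    # Anything outside the allowed character set becomes a space.
    text = re.sub(r"[^a-zA-Z0-9\s,.!?$:;\-&'\"]", " ", text)
    return text


sample = "teacher\u2019s \u00f2 well\u2014known"
print(normalize_characters(sample).split())
```

The standardizing substitutions have to run first; otherwise the final step would turn the typographic dashes and apostrophes into spaces and split the words that contain them.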
# %load shared_elements/normalize_characters.py
prev = cycle
cycle = "correction2"
directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))
for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)
    # Substitute for all other dashes (character class, so each variant matches on its own)
    content = re.sub(r"[—–‑]", r"-", content)
    # Substitute formatted apostrophes
    content = re.sub(r"[’‘‛´]", r"'", content)
    # Replace all special characters with a space (as these tend to occur at the end of lines)
    content = re.sub(r"[^a-zA-Z0-9\s,.!?$:;\-&\'\"]", r" ", content)
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction2 Average verified rate: 0.9469682115153845 Average of error rates: 0.06859084027252083 Total token count: 1274990
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 50 )[:50]
[('e', 4078), ('t', 2604), ('w', 2471), ('m', 1965), ('r', 1826), ('n', 1635), ('f', 1386), ('-', 1369), ("'", 1273), ('d', 1230), ('tion', 819), ('g', 778), ('u', 635), ('re', 614), ('k', 475), ('co', 409), ('th', 352), ('ex', 344), ('chil', 324), ('dren', 324), ('educa', 319), ('ment', 309), ('x', 285), ('ers', 210), ('tions', 206), ('edu', 173), ('ence', 162), ('pre', 157), ('un', 157), ('ac', 138), ('mis', 135), ('ple', 131), ('z', 131), ('tian', 131), ('ith', 129), ('tional', 128), ('q', 114), ('ful', 112), ('es', 105), ('al', 100), ('ap', 98), ('ments', 98), ('ent', 97), ('fr', 97), ('ber', 92), ('peo', 92), ('em', 89), ('prin', 86), ('ture', 85), ('ucation', 84)]
Correction 3 -- Correct Line Endings¶
Our third correction addresses line endings that the OCR engine did not rejoin into a single token. This is done by identifying the pattern "word- word" and transforming it to "wordword".
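The effect of the line-ending pattern can be checked on a few sample strings. The regular expression is the one used in this correction; the samples are illustrative.

```python
import re

# word, then a hyphen followed by whitespace, then a lowercase continuation
LINE_END = re.compile(r"(\w+)(\-\s{1,})([a-z]+)")

samples = ["educa- tion", "chil-\ndren", "Battle- Creek", "well-known"]
joined = [LINE_END.sub(r"\1\3", s) for s in samples]
print(joined)
```

The continuation must start with a lowercase letter and the hyphen must be followed by whitespace, so capitalized continuations and ordinary hyphenated compounds are left alone. The pattern does not consult the spelling dictionary, so a true compound broken at a line end would also be fused.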
# %load shared_elements/correct_line_endings.py
prev = cycle
cycle = "correction3"
directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))
for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)
    # Join "word- word" into "wordword" when the continuation is lowercase
    content = re.sub(r"(\w+)(\-\s{1,})([a-z]+)", r"\1\3", content)
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction3 Average verified rate: 0.9475463094540847 Average of error rates: 0.0680223315669947 Total token count: 1273714
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 50 )[:50]
[('e', 4078), ('t', 2595), ('w', 2469), ('m', 1958), ('r', 1825), ('n', 1635), ('f', 1383), ('-', 1353), ("'", 1273), ('d', 1230), ('tion', 812), ('g', 777), ('u', 634), ('re', 614), ('k', 474), ('co', 409), ('th', 351), ('ex', 344), ('chil', 324), ('dren', 322), ('educa', 319), ('ment', 303), ('x', 284), ('tions', 205), ('ers', 203), ('edu', 173), ('ence', 161), ('pre', 157), ('un', 157), ('ac', 139), ('mis', 135), ('ple', 131), ('z', 131), ('ith', 129), ('tional', 128), ('tian', 128), ('q', 114), ('ful', 112), ('es', 105), ('al', 100), ('ap', 98), ('ments', 98), ('fr', 97), ('ent', 96), ('ber', 92), ('peo', 92), ('em', 89), ('prin', 86), ('ucation', 84), ('ture', 82)]
Correction 4 -- Remove Extra Dashes¶
# %load shared_elements/remove_extra_dashes.py
prev = cycle
cycle = "correction4"
directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))
for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)
    text = re.sub(r"[0-9,!?$:;&]", " ", content)
    tokens = utilities.tokenize_text(text)
    replacements = []
    for token in tokens:
        # Compare by value; "is" tests object identity and is unreliable for strings
        if token.startswith("-"):
            replacements.append((token, token[1:]))
        elif token.endswith("-"):
            replacements.append((token, token[:-1]))
    if len(replacements) > 0:
        # print("{}: {}".format(filename, replacements))
        for replacement in replacements:
            content = clean.replace_pair(replacement, content)
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction4 Average verified rate: 0.9491645050465717 Average of error rates: 0.06563361090083271 Total token count: 1273736
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 20 )[:50]
[('e', 4097), ('t', 2611), ('w', 2484), ('m', 1970), ('r', 1838), ('n', 1645), ('f', 1392), ("'", 1287), ('d', 1240), ('tion', 812), ('g', 782), ('u', 640), ('re', 619), ('k', 476), ('co', 445), ('th', 352), ('ex', 345), ('chil', 325), ('dren', 322), ('educa', 322), ('ment', 303), ('x', 289), ('tions', 205), ('ers', 204), ('edu', 175), ('pre', 162), ('ence', 161), ('un', 160), ('ac', 139), ('mis', 135), ('z', 132), ('ple', 131), ('ith', 129), ('tian', 128), ('tional', 128), ('q', 115), ('ful', 112), ('es', 105), ('al', 101), ('ap', 98), ('ments', 98), ('fr', 97), ('ent', 96), ('peo', 93), ('ber', 92), ('em', 89), ('prin', 86), ('ucation', 84), ('ll', 83), ('ture', 82)]
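Correction 4 targets tokens that begin or end with a stray hyphen. The per-token rule can be sketched as below; `strip_extra_dash` is a hypothetical helper for illustration and is not part of the `clean` module.

```python
def strip_extra_dash(token):
    """Drop a single leading hyphen, or else a single trailing hyphen."""
    if token.startswith("-"):
        return token[1:]
    if token.endswith("-"):
        return token[:-1]
    return token


samples = ["-school", "teach-", "well-known", "church"]
print([strip_extra_dash(t) for t in samples])
```

Hyphens inside a token are preserved, so legitimate compounds are unaffected.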
Correction 5 -- Remove Extra Quotation Marks¶
# %load shared_elements/replace_extra_quotation_marks.py
prev = cycle
cycle = "correction5"
directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))
for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)
    text = re.sub(r"[0-9,!?$:;&]", " ", content)
    tokens = utilities.tokenize_text(text)
    corrections = []
    for token in tokens:
        if token.endswith("'"):
            # Keep a trailing apostrophe after "s" (plural possessive); strip it otherwise.
            # Note: `x is 's' or 'S'` is always true, so test membership instead.
            if len(token) > 1 and token[-2] not in ('s', 'S'):
                corrections.append((token, re.sub(r"'", r"", token)))
        elif token.startswith("'"):
            corrections.append((token, re.sub(r"'", r"", token)))
    if len(corrections) > 0:
        # print('{}: {}'.format(filename, corrections))
        for correction in corrections:
            content = clean.replace_pair(correction, content)
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction5 Average verified rate: 0.9493657966718064 Average of error rates: 0.06538493565480696 Total token count: 1273724
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 20 )[:50]
[('e', 4099), ('t', 2616), ('w', 2489), ('m', 1976), ('r', 1843), ('n', 1645), ('f', 1392), ("'", 1262), ('d', 1240), ('tion', 812), ('g', 783), ('u', 640), ('re', 620), ('k', 477), ('co', 445), ('th', 354), ('ex', 345), ('chil', 325), ('dren', 322), ('educa', 322), ('ment', 303), ('x', 291), ('tions', 205), ('ers', 205), ('edu', 175), ('pre', 162), ('ence', 161), ('un', 160), ('ac', 139), ('mis', 135), ('z', 132), ('ple', 131), ('ith', 129), ('tian', 128), ('tional', 128), ('q', 115), ('ful', 112), ('es', 107), ('al', 101), ('ap', 98), ('ments', 98), ('fr', 97), ('ent', 96), ('peo', 94), ('ber', 92), ('em', 89), ('prin', 86), ('ucation', 84), ('ll', 83), ('ture', 82)]
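The rule behind Correction 5 is easier to follow on individual tokens. The sketch below uses a hypothetical helper, `strip_extra_apostrophes`, and assumes the intent is to keep a trailing apostrophe only when it follows "s" (a plural possessive).

```python
def strip_extra_apostrophes(token):
    """Remove apostrophes from tokens that start or end with a stray one."""
    if token.endswith("'"):
        # A trailing apostrophe after s/S is treated as a plural possessive.
        if len(token) > 1 and token[-2] not in ("s", "S"):
            return token.replace("'", "")
        return token
    if token.startswith("'"):
        return token.replace("'", "")
    return token


samples = ["teachers'", "school'", "'the", "god's", "'"]
print([strip_extra_apostrophes(t) for t in samples])
```

A free-standing apostrophe is left alone by this rule, which is consistent with "'" remaining among the most frequent errors after this step.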
Correction 6 -- Rejoin Burst Words¶
# %load shared_elements/rejoin_burst_words.py
prev = cycle
cycle = "correction6"
directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))
for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)
    # A run of five or more one- or two-character tokens (raw string avoids invalid-escape warnings)
    pattern = re.compile(r"(\s(\w{1,2}\s){5,})")
    replacements = []
    clean.check_splits(pattern, spelling_dictionary, content, replacements)
    if len(replacements) > 0:
        # print('{}: {}'.format(filename, replacements))
        for replacement in replacements:
            content = clean.replace_pair(replacement, content)
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
# %load shared_elements/summary.py
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction6 Average verified rate: 0.9515400232590228 Average of error rates: 0.063221044663134 Total token count: 1268325
# %load shared_elements/top_errors.py
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 10 )[:50]
[('e', 3307), ('w', 2412), ('t', 2107), ('m', 1846), ('r', 1406), ('f', 1339), ('n', 1304), ("'", 1262), ('d', 926), ('tion', 812), ('g', 650), ('re', 617), ('u', 509), ('co', 438), ('k', 438), ('th', 353), ('ex', 345), ('chil', 325), ('dren', 322), ('educa', 322), ('ment', 303), ('x', 282), ('ers', 205), ('tions', 205), ('edu', 175), ('pre', 162), ('ence', 161), ('un', 160), ('ac', 139), ('mis', 135), ('ple', 131), ('ith', 129), ('z', 129), ('tian', 128), ('tional', 128), ('q', 112), ('ful', 112), ('es', 107), ('al', 101), ('ap', 98), ('ments', 98), ('fr', 97), ('ent', 96), ('peo', 94), ('ber', 92), ('em', 88), ('ucation', 86), ('prin', 86), ('ll', 83), ('ture', 82)]
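"Burst" words are words that the OCR engine broke into runs of one- or two-character fragments. The sketch below shows one way such runs could be detected and rejoined; it uses the same run pattern as Correction 6, but the helper `rejoin_burst_words` and its accept-only-if-in-dictionary rule are assumptions, and `clean.check_splits` may work differently.

```python
import re

# Whitespace followed by five or more 1-2 character fragments, each ending in whitespace.
BURST = re.compile(r"\s(?:\w{1,2}\s){5,}")


def rejoin_burst_words(text, dictionary):
    """Join a run of short fragments when the result is a verified word."""
    def _join(match):
        candidate = "".join(match.group(0).split())
        if candidate.lower() in dictionary:
            # Surrounding whitespace is normalized to single spaces.
            return " {} ".format(candidate)
        return match.group(0)

    return BURST.sub(_join, text)


sample = "in the s c h o o l year"
print(rejoin_burst_words(sample, {"school"}))
```

Runs that do not join into a known word are left as they are, which avoids gluing together sequences of legitimately short tokens.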
Correction 7 -- Rejoin Split Words¶
# %load shared_elements/rejoin_split_words.py
prev = cycle
cycle = "correction7"
directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))
for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)
    text = re.sub(r"[0-9,!?$:;&]", " ", content)
    tokens = utilities.tokenize_text(text)
    # Unverified tokens are candidates for being one half of a split word
    errors = reports.identify_errors(tokens, spelling_dictionary)
    replacements = clean.check_if_stem(errors, spelling_dictionary, tokens, get_prior=False)
    if len(replacements) > 0:
        # print('{}: {}'.format(filename, replacements))
        for replacement in replacements:
            content = clean.replace_split_words(replacement, content)
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
# %load shared_elements/summary.py
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction7 Average verified rate: 0.9620770901564865 Average of error rates: 0.05361544284632855 Total token count: 1260109
# %load shared_elements/top_errors.py
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 10 )[:50]
[('e', 3160), ('w', 2406), ('t', 2080), ('m', 1845), ('r', 1361), ('f', 1338), ('n', 1279), ("'", 1262), ('d', 912), ('g', 644), ('u', 506), ('k', 435), ('co', 353), ('x', 280), ('ment', 275), ('th', 230), ('ers', 191), ('tion', 153), ('re', 143), ('z', 129), ('tian', 128), ('ith', 126), ('q', 112), ('ence', 107), ('fr', 92), ('ful', 89), ('ex', 87), ('ucation', 86), ('ll', 82), ('tions', 79), ('ments', 76), ('ofthe', 74), ('ent', 73), ('struction', 73), ('pp', 71), ('hy', 67), ('lege', 64), ('ay', 63), ('ft', 61), ('ary', 59), ('ance', 59), ('ents', 57), ('io', 55), ('ure', 52), ('ber', 51), ('bers', 51), ('il', 49), ('ual', 48), ('tle', 46), ('ference', 46)]
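The idea behind rejoining split words is that an unverified fragment such as "educa" becomes a verified word when joined with its neighbor. The sketch below is a simplified stand-in, not the `clean.check_if_stem` implementation: the helper `rejoin_split_words` is hypothetical, and its `try_prior` flag is only a guess at the kind of behavior a `get_prior` option might control (also trying the preceding token).

```python
def rejoin_split_words(tokens, dictionary, try_prior=False):
    """Join an unverified token with a neighbor when the pair forms a verified word."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.lower() not in dictionary:
            # First try the following token.
            if i + 1 < len(tokens) and (tok + tokens[i + 1]).lower() in dictionary:
                out.append(tok + tokens[i + 1])
                i += 2
                continue
            # Optionally try the token already emitted just before this one.
            if try_prior and out and (out[-1] + tok).lower() in dictionary:
                out[-1] = out[-1] + tok
                i += 1
                continue
        out.append(tok)
        i += 1
    return out


words = {"education", "of", "children", "the", "teach", "teachers"}
print(rejoin_split_words(["educa", "tion", "of", "chil", "dren"], words))
print(rejoin_split_words(["the", "teach", "ers"], words, try_prior=True))
```

Looking backward matters for cases like "teach ers", where the first half is itself a verified word and only the suffix is flagged as an error.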
Correction 9 -- Rejoin Split Words II¶
Run through the joining functions again.
# %load shared_elements/rejoin_split_words.py
prev = cycle
cycle = "correction9"
directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))
for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)
    text = re.sub(r"[0-9,!?$:;&]", " ", content)
    tokens = utilities.tokenize_text(text)
    errors = reports.identify_errors(tokens, spelling_dictionary)
    # Second pass over the joining functions, this time with get_prior=True
    replacements = clean.check_if_stem(errors, spelling_dictionary, tokens, get_prior=True)
    if len(replacements) > 0:
        # print('{}: {}'.format(filename, replacements))
        for replacement in replacements:
            content = clean.replace_split_words(replacement, content)
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction9 Average verified rate: 0.9677998391667305 Average of error rates: 0.04839023467070402 Total token count: 1253472
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 10 )[:50]
[('e', 3145), ('t', 2054), ('w', 1969), ('m', 1494), ('r', 1353), ('f', 1337), ('n', 1268), ("'", 1262), ('d', 899), ('g', 636), ('u', 503), ('k', 432), ('co', 352), ('x', 280), ('th', 226), ('z', 129), ('q', 112), ('fr', 92), ('ment', 88), ('tion', 80), ('re', 75), ('ofthe', 74), ('pp', 71), ('ex', 68), ('ers', 57), ('ft', 56), ('io', 55), ('mo', 44), ('mt', 43), ('il', 42), ('ky', 41), ('si', 39), ('oi', 38), ('ol', 34), ('ucation', 34), ('--', 33), ('va', 32), ('es', 31), ('dren', 30), ('tbe', 30), ('bo', 29), ('tlie', 29), ('jt', 28), ('pa', 27), ('al', 26), ('ma', 26), ('op', 26), ('ou', 26), ('pm', 26), ('chil', 26)]
Review Remaining Errors¶
reports.docs_with_high_error_rate( summary, min_error_rate = .2 )
[('ADV18990101-V01-01-page1.txt', 0.75), ('ADV19021101-V04-11-page36.txt', 0.625), ('ADV18990601-V01-06-page3.txt', 0.583), ('ADV19020801-V04-08-page36.txt', 0.571), ('ADV19011201-V03-10-page34.txt', 0.561), ('ADV19000401-V02-04-page68.txt', 0.533), ('ADV19020101-V04-01-page35.txt', 0.524), ('ADV19001001-V02-10-page38.txt', 0.514), ('ADV19020301-V04-03-page35.txt', 0.51), ('ADV19000101-V02-01-page1.txt', 0.509), ('ADV18990601-V01-06-page38.txt', 0.5), ('ADV19000301-V02-03-page39.txt', 0.5), ('ADV19001001-V02-10-page2.txt', 0.473), ('ADV19010101-V03-01-page40.txt', 0.468), ('ADV19010201-V03-02-page3.txt', 0.445), ('ADV19001001-V02-10-page1.txt', 0.444), ('ADV18991101-V01-10-page2.txt', 0.444), ('ADV19000401-V02-04-page2.txt', 0.439), ('ADV19001201-V02-12-page2.txt', 0.427), ('ADV19020801-V04-08-page1.txt', 0.426), ('ADV19010601-V03-06-page3.txt', 0.425), ('ADV19010201-V03-02-page39.txt', 0.419), ('ADV19010301-V03-03-page41.txt', 0.419), ('ADV19000801-V02-08-page35.txt', 0.417), ('ADV19000201-V02-02-page1.txt', 0.417), ('ADV18990301-V01-03-page72.txt', 0.417), ('ADV19020201-V04-02-page1.txt', 0.417), ('ADV19020701-V04-07-page1.txt', 0.413), ('ADV19010401-V03-04-page3.txt', 0.408), ('ADV19001001-V02-10-page35.txt', 0.406), ('ADV19020901-V04-09-page1.txt', 0.405), ('ADV19010301-V03-03-page43.txt', 0.404), ('ADV19010501-V03-05-page39.txt', 0.4), ('ADV19000801-V02-08-page1.txt', 0.395), ('ADV19000801-V02-08-page38.txt', 0.395), ('ADV19000801-V02-08-page2.txt', 0.393), ('ADV19000501-V02-05-page2.txt', 0.391), ('ADV19010101-V03-01-page37.txt', 0.391), ('ADV19010101-V03-01-page3.txt', 0.386), ('ADV19000601-V02-06-page2.txt', 0.383), ('ADV19000201-V02-02-page2.txt', 0.378), ('ADV19020201-V04-02-page35.txt', 0.375), ('ADV18990201-V01-02-page67.txt', 0.375), ('ADV19031001-V05-10-page36.txt', 0.374), ('ADV19001101-V02-11-page2.txt', 0.372), ('ADV19020401-V04-04-page1.txt', 0.37), ('ADV19010501-V03-05-page1.txt', 0.367), ('ADV19010501-V03-05-page3.txt', 0.367), 
('ADV19010201-V03-02-page37.txt', 0.366), ('ADV19010301-V03-03-page3.txt', 0.361), ('ADV19000701-V02-07-page38.txt', 0.36), ('ADV19001201-V02-12-page35.txt', 0.357), ('ADV19010601-V03-06-page1.txt', 0.355), ('ADV19000601-V02-06-page35.txt', 0.352), ('ADV19001101-V02-11-page35.txt', 0.347), ('ADV18990401-V01-04-page61.txt', 0.346), ('ADV19020601-V04-06-page1.txt', 0.345), ('ADV19000701-V02-07-page35.txt', 0.341), ('ADV19001101-V02-11-page38.txt', 0.34), ('ADV18990901-V01-08-page2.txt', 0.337), ('ADV19000401-V02-04-page59.txt', 0.337), ('ADV19011001-V03-08-page2.txt', 0.335), ('ADV19010401-V03-04-page43.txt', 0.333), ('ADV19001201-V02-12-page38.txt', 0.333), ('ADV19000301-V02-03-page2.txt', 0.332), ('ADV18990201-V01-02-page63.txt', 0.328), ('ADV19020801-V04-08-page35.txt', 0.323), ('ADV19000501-V02-05-page35.txt', 0.323), ('ADV19010501-V03-05-page37.txt', 0.321), ('ADV19010801-V03-07-page36.txt', 0.321), ('ADV18990401-V01-04-page64.txt', 0.318), ('ADV19000201-V02-02-page39.txt', 0.318), ('ADV19010301-V03-03-page42.txt', 0.317), ('ADV19041001-V06-10-page20.txt', 0.317), ('ADV19010601-V03-06-page35.txt', 0.315), ('ADV19010401-V03-04-page44.txt', 0.311), ('ADV19010401-V03-04-page41.txt', 0.31), ('ADV18990101-V01-01-page50.txt', 0.308), ('ADV19020501-V04-05-page1.txt', 0.308), ('ADV18990601-V01-06-page4.txt', 0.3), ('ADV19011201-V03-10-page1.txt', 0.3), ('ADV18990901-V01-08-page45.txt', 0.3), ('ADV18990301-V01-03-page66.txt', 0.297), ('ADV19020101-V04-01-page1.txt', 0.296), ('ADV19000301-V02-03-page35.txt', 0.293), ('ADV19020301-V04-03-page1.txt', 0.289), ('ADV19020501-V04-05-page23.txt', 0.287), ('ADV19020601-V04-06-page36.txt', 0.286), ('ADV19020901-V04-09-page36.txt', 0.286), ('ADV19010801-V03-07-page1.txt', 0.286), ('ADV19000201-V02-02-page34.txt', 0.285), ('ADV19000101-V02-01-page2.txt', 0.276), ('ADV19041001-V06-10-page1.txt', 0.276), ('ADV19000701-V02-07-page2.txt', 0.273), ('ADV19000601-V02-06-page36.txt', 0.273), ('ADV19010801-V03-07-page2.txt', 0.273), 
('ADV19000801-V02-08-page37.txt', 0.271), ('ADV19000101-V02-01-page38.txt', 0.271), ('ADV18990201-V01-02-page4.txt', 0.269), ('ADV19000401-V02-04-page64.txt', 0.266), ('ADV19030401-V05-04-page21.txt', 0.263), ('ADV19011001-V03-08-page1.txt', 0.262), ('ADV19020401-V04-04-page19.txt', 0.26), ('ADV19000501-V02-05-page37.txt', 0.256), ('ADV19000401-V02-04-page60.txt', 0.255), ('ADV19030401-V05-04-page20.txt', 0.255), ('ADV19000501-V02-05-page31.txt', 0.254), ('ADV19010101-V03-01-page2.txt', 0.254), ('ADV18990701-V01-07-page2.txt', 0.25), ('ADV19010601-V03-06-page37.txt', 0.248), ('ADV19030401-V05-04-page34.txt', 0.246), ('ADV19020301-V04-03-page2.txt', 0.246), ('ADV19040501-V06-05-page1.txt', 0.244), ('ADV18990601-V01-06-page9.txt', 0.244), ('ADV19021201-V04-12-page36.txt', 0.24), ('ADV19021001-V04-10-page44.txt', 0.239), ('ADV19040401-V06-04-page20.txt', 0.238), ('ADV19000301-V02-03-page34.txt', 0.237), ('ADV19000101-V02-01-page37.txt', 0.231), ('ADV19000401-V02-04-page66.txt', 0.229), ('ADV19000101-V02-01-page31.txt', 0.228), ('ADV19000101-V02-01-page32.txt', 0.227), ('ADV19000401-V02-04-page61.txt', 0.226), ('ADV18990901-V01-08-page51.txt', 0.223), ('ADV19010101-V03-01-page39.txt', 0.222), ('ADV18991101-V01-10-page50.txt', 0.222), ('ADV19021001-V04-10-page1.txt', 0.22), ('ADV18991101-V01-10-page52.txt', 0.22), ('ADV19010601-V03-06-page36.txt', 0.219), ('ADV18990201-V01-02-page2.txt', 0.219), ('ADV18990701-V01-07-page67.txt', 0.218), ('ADV18990501-V01-05-page52.txt', 0.217), ('ADV18990101-V01-01-page54.txt', 0.217), ('ADV19010501-V03-05-page38.txt', 0.217), ('ADV19001001-V02-10-page37.txt', 0.217), ('ADV19000201-V02-02-page38.txt', 0.214), ('ADV19001001-V02-10-page34.txt', 0.213), ('ADV18990601-V01-06-page134.txt', 0.211), ('ADV19000601-V02-06-page38.txt', 0.208), ('ADV19020801-V04-08-page34.txt', 0.205), ('ADV19000601-V02-06-page37.txt', 0.204), ('ADV19000301-V02-03-page31.txt', 0.203), ('ADV19000301-V02-03-page36.txt', 0.203), ('ADV19000601-V02-06-page34.txt', 
0.201)]
ADV18990101-V01-01-page1.txt¶
- Cover page
- Available text was: "Published Monthly. Training School Advocate January, 1899. Battle Creek College, Battle Creek, Mich. Vol. 1. No. 1."
- OCR was " WvniwniiAr 'Cnn i m"
ADV18990601-V01-06-page3.txt¶
- Image
- Available text was: "The College (Main Building)"
- OCR was " Th e Co l l e g e Ma in Bu il d in g ."
ADV19011201-V03-10-page34.txt¶
- Advertisement page
- Lots of flourishes resulting in a very confused OCR transcription
ADV19001001-V02-10-page2.txt¶
- Table of contents page.
- Lots of split words
Summary¶
High-error pages tend to be cover pages, pages with images, advertisements, and tables of contents. Of these, advertisements have the most content that might be of interest for my study.
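One way to check this pattern beyond spot inspection is to tally the high-error pages by page number. The sketch below parses the file names with a hypothetical helper, `parse_page_name`; reading the fields as date, volume, issue number, and page is an assumption based on the naming pattern, and the four sample entries are taken from the listing above.

```python
import re
from collections import Counter

# Assumed layout: ADV + YYYYMMDD + -V<volume> + -<number> + -page<page>.txt
NAME = re.compile(r"^ADV(\d{4})(\d{2})(\d{2})-V(\d+)-(\d+)-page(\d+)\.txt$")


def parse_page_name(filename):
    """Split a page file name into integer fields, or return None if it does not match."""
    match = NAME.match(filename)
    if match is None:
        return None
    keys = ("year", "month", "day", "volume", "number", "page")
    return dict(zip(keys, (int(group) for group in match.groups())))


high_error = [
    ("ADV18990101-V01-01-page1.txt", 0.75),
    ("ADV19021101-V04-11-page36.txt", 0.625),
    ("ADV18990601-V01-06-page3.txt", 0.583),
    ("ADV19000101-V02-01-page1.txt", 0.509),
]
pages = Counter(parse_page_name(name)["page"] for name, _ in high_error)
print(pages.most_common())
```

Run over the full list, a tally like this would show whether errors concentrate on the first and last few pages of each issue, where covers, contents pages, and advertisements usually sit.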
reports.long_errors(errors_summary, min_length=15)
(['terlyreportabetterone', 'achristianschool', 'tobedrawnfromthischapter', 'rririj-ijirirurrt', 'itwillcontainlive', 'itbecomesthefirstdutyof', 'hssshsbshshshshieshsh', 'fromthepenofsister', 'patriarchsandprophets', 'vwvwvwwvwvwwwvwwvwwwvwwv', 'thewavesofthesea', 'andtoldmeicouldsay', 'sell-improvement', 'anychurchschoolhasarighttobecome', 'nrnjtjtjtjtjijmnjuuxrijt', 'thelessonsareinteresting', 'themarriagesapper', 'ishardlytobeconceived', 'wasfirstpublishedintheinterestsofa', 'iruxrijirutitjxrutjtj', 'verysensitivetotheinfluenceoflight', 'thepeachesbeallyourown', 'yousufelyneedatoncethelessonsandcounselcon', 'howmissionariesaremade', 'membersarestudentteachers', 'rliirirlrlnjxrvvuxrlrlruirijtjm', 'ingbythewordofgod', 'tolovetheebettereveryday', 'mxnruiixmuuuuxnrltu', 'thelittleboywill', 'rutrutrutnjrritijtjijtj', 'thisisoneofseveral', 'themotherasateacher', 'qrmrninjxnjrnjtjtj', 'thealedochurchschool', 'ofthosenowattending', 'forquarterendingjune', 'iritijxnjtjxmtruxru', 'reportsanattendance', 'aspecialfeatureoftheseptemberadvocate', 'dayarefortunateinbeing', 'swearetemptedtogowherewe', 'hlbiiiiibhuftmflllllbluiibillllbuhfhiii', 'jtjtjtrlrlruutjxrutjinjtjitlt', 'vruotjtxutruvtji', 'cambcrfsnutbuttermiib', 'itisasadfactthat', 'utjtjtjmjuutjtrl', 'foolsalonecompletetheireducation', 'isthetimesetfortheopening', 'novemberadvocate', 'affordtobewithoutthe', 'theunfailingprogressfromcause', 'thatthereadersoftheadvocate', 'swimtowardtheboat', 'onewhosedutyitistobringforwardthere', 'matterbearinguponthesubject', 'studentsowooooooooooooooooooooooooooo', 'ofthetestimonies', 'dayisafreshbeginning', 'providedbythestate', 'fifteencopiesofthejanuaryadvocate', 'trueeducationfor', 'njtxijxrixtjtjijtjtjxn', 'issomethingaboutthesmellof', 'ourdailyattendanceisnow', 'thought-producing', 'theheadsoffamilies', "union'conference", 'quotationsweregivenshow', 'whereshallwefindit', 'normalclassistaughthowtomakeablack', 'onlandwherethedeedofsaidlaud', 'didhehavetowaitsolong', 
'gladtoextendtothema', 'nooneatallfittedfortheplacewould', 'butifattheendofanhourtherewasa', 'cbebattlecreekcollegebookstandi', 'aryissueoftheadvocate', 'conservativeestimate', 'carefullystudied', 'ifyoufailtoarouseaninterestin', 'lxljtjxnjtjtajutjxri', 'rheadfulloffiguresandfacts', 'brighttacksandrustynailsmay', 'tjttutjixituuvrutrxjtrlnjvruv', 'yeofthelordraininthetimeofthelatter', 'itisagreatthingtoteach', 'inordertomaketheworkofthemost', 'aformerstudentofkeene', 'thetraining-school', 'andthehorsereplied', 'theywerenotredaswehadsupposed', 'ididnotknowbeforethatthebibleissuchaninterestingbo', 'iiiiliiiiiiiiiimiiiiin', "bralliar'sarticlesintheadvocate", "livingfountains'", 'hasneededandstillneeds', 'methodsofworkingthesoil', 'madebythechurchesofmichi', 'n-rijtjtrijtjt-rlrijijmrljttiitrirlrinjtjtrtjtjt', 'andtheirattentioniscalledtochristian', 'copyofthemayadvocate', 'mentofchristianschools', "every'fridaynight", 'asecond-gradecertificate', 'amountofmoneydoesthena', 'spiritual-minded', 'stillcontinuestoinvite', 'wewereattractedby', 'railroadagentintheunitedstatesorcanada', 'buttheiropportunity', 'lesshehimselfiseducated', 'thewholecolonywascen', 'chnjlruxtltulixrulnj', 'oftheprotestantbibleasarequiredexercise', 'knowledge-supplying', 'nrutjijarurrlrnrjxnjtjinn', 'becausewesellforcashonly', 'andisjustaswelloffinschoolasanywhere', 'goddirectsourwarfare', 'themindwillbeofthesame', 'character-forming', 'subscriptionpriceofthepaper', 'itisadelusiontoitspossessor', 'nxlrutjxrrjtjijxajxanjxn', 'theseproblemsaffectsevery', 'twoandahalfmiles', 'schoolsbeforethenewyear', 'thosesamequestionswould', 'helptobegivenourschools', 'biblehygieneandtreatmentofdiseases', 'thisisaspecialoffer', 'itisnotabodyweare', 'itissadtofindthat', 'thechurchatwolflake', 'tisbloomandgrace', 'andwhatdoesitmeanwhenap', 'thehaskellhometrainingschool', "fountains'withcare", 'rmjvanjtjtnnjtjajtjirlnj', 'ingmetbytheselessons', 'evhiggeusandothers', 'andifihaveallfaith', 'naturalscienceandmathematics', 
'takegreaterpainstoimprovetheirmethods', 'arguethatthechildislearningallthetime', 'comfortableliving', 'themshowshowtrulywelovehim', 'learnhowrtoworkforthesalvationofsouls', 'educationalsecretary', 'thebenumbingpowerof', 'schoolwasestablished', 'fyingtheworkoftheclasses', 'thedenseignorancedisplayed', 'uvuinnimruuinrumruuijinjtnnjvijvarinjtruuuuuvrinjimuxrirlrlruinjlnnnnnnnjijxtj', 'previouslessonsinthe', 'willcontinueourclub', 'himontheflooragain', 'correspondencehelponeto', 'irirlnjtjtriiijahjijtjtjijijtjtrlrlmijtrutnnjxrnij', 'toteachpropermethodstotheyouth', 'questionstogether', 'nowilaymedowntosleep', 'youthforimmediateserviceisthecallofthe', 'individualsthroughcollege', 'principalofthesouth', 'oneofourmissionariesinsouth', 'thingfortheadvocate', 'mentsofthosewhocounted', 'njinnjxnjvirijin', 'beinstilledintotheseyoung', 'morethanfillsthe', 'whoisteachingatenyart', 'innnjvnjirbirmjumnjtjtnnnjtjijximiti', 'everypageoftheadvocate', 'inalienableduties', 'dropathicinstitute', 'ibegantoexaminetheob', 'astheplantthatblooms', 'rxnrlnnjxjxruxruxn', 'knowingthatyouare', "tliis'arrangement", 'educatoroncesaid', 'rltltutjtjtjxnnjtjtjtjtrijx', 'ruxrijijtjxruiajttutjtjijtxiajijxriat', 'thesupremetestofamode', 'studentswhowinintheoperationofthe', 'wonderwhatplacethebibleshouldhold', 'theirdeliverance', 'perversionofthename', 'qjrnjxririnrlnjtrijtrinnnjtjtjijtjtjtnnjtmtjtnjtnnjtrutjtrij', 'utjirutjrrutjtjtjir', 'lmtjutjrutruirutixrinjxri', 'shouldbeestablishedatsomepoint', 'uijijxaruruutjtjutjt', 'properly-conducted', 'rlruijutrinjinjiji', 'manaanajmatutaaaj', 'ixutjijtjtnjtjlruirutjt-rlrlf', 'tiruutjtjxrurrutjijxrijijtjxrixuutjtjt', 'thecapacitytomakemoneyseemsthe', 'injxrmnjuonririnrumrmnrutrutjtnjanjtrinimruvrijtjmnjtrmrutn', 'thenletthemovementbegin', 'abookforlittlechildrenbible', 'lnjnjnrlruurjijnjtxlntuijtritiaixmuuitmjtnjtnju', 'paperissuedfromourpresses', "w'hiletellingthestoryaboutsam", 'acresweredeededtothe', 'agricultureshouldbe', 
'rutrltlrlrltlrijxrijxnjtxlnxvrlrumrtruxn', 'experiencesinthechi', 'coursearethefoundation', 'placeacopyinthehandsof', 'thepublicschools', 'well-replenished', 'theeducationalconference', 'receiving-housesfor', 'ofthenewvolumebymrs', 'aruiruxruonrurutnxutnjtj', 'tjuuu-ltltltltuutjtjtjtjtxltuuxrmjxnjtjtjmjtjxrltln', '----------------------------', 'njtjtjxrmjxnrltlrirlnjtjtn', 'uutjirtjtjnnjtjurnjaruirutjurnjarijixijtji', 'ru-lrumj-uurrutj-ln', 'biggledypiggledy', 'tothechurchschool', 'whatisfairtooneisfairtoanother', 'thesecretofachieve', 'medical-missionary', 'childrenandyouth', 'howshouldtheworkofthehomede', 'itcombinestheveryfeatures', 'centralizedschools', 'howthingsofnaturegrow', 'clublistinnovember', 'thattheeditorsoftheadvocate', "howmuchmyfather'sprayersat", 'andbessienicolahave', 'oldhewillnotdepartfromit', 'greatersimplicity', 'innersideofeverj', 'jack-in-the-pulpit', 'indianahaswritten', 'organofthechurchschools', 'andallouryouthshouldbepermitted', 'werelikewhatwearesometimes', 'educationalsystem', 'protestantssettled', 'individualcloserandclosertogodisthe', 'rmruttmitxifiruxriruxrmnruxitjtxinrirmruxrutjtjintmnruxnjiruxrinrlririnn', 'cotiscientiously', 'bethoroughlyfamiliar', 'andthebronzewasgettingthe', 'child-missionaries', 'didnottrustinhim', 'jecttomakethisapreparatoryschoolwhere', 'enceofthatspirit', 'njtjxiijxrlnnrutjtjtjtjtnj', 'rlnjrxruxarinntlnjtjttirm', 'dotoadvancethecause', 'tobeabletoaskquestionswhicharefull', 'self-preservation', 'takespleasureincallingthe', 'gravityofthesituation', 'locationwasnotatthattimedetermined', 'thelaborsoftheday', 'frfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfr', 'ifhehadbuthalfaminuteto', 'dutiesandqualificationsofeducationalsuperintendents', 'becausethesepages', 'ruxrmjtjtjxrirltutjxiormjt', 'achronologicalchart', 'theeditorsoftheadvocate', 'sonsmadethemselvesvile', 'thecollegetohave', 'lartnnrirxnnrinruannjtmituxanruti', 'thelittleboywantsto', 'hciilllbmtlillllfhwliinlihlfmilml', 
'szszsgszsasasaszshsesasasasaros', 'soldseveralcopiesofthe', 'orregisteredletter', 'placeforrepentance', 'stitutinglivingquestionsforsomeofthe', 'directoryofsabbathschool', 'departmentsoftheunit', 'christianeducationis', 'studyofmankindisgod', 'writesoftheyoung', 'church-fellowship', 'inexinexperienced', 'wherethenecessitiesofthefamilycallforthe', 'tjtjiruojtjtjtjtjt', 'couldstartlethelivingofall', 'aknowledgeofhiswotks', 'nivmrtjxrlnlntuiruiriritijiruijxrijtjtrjijijmrb', 'gladtobeabletosaythatwiscon', 'rmjixirmiinrinivvu', 'estedingardenwork', 'hasbeentryingtoset', 'politicalscience', 'jenniewillamanwrites', 'whilethemarchadvocate', 'hasmuchtosayonthework', 'thatreadsinitthinksitsuchadearbook', 'jmkkkkkkksmssskkkkigsikkki', 'readersforthechildren', 'cameintoexistencewhenastrongspirit', 'youwillfindaboxofchocolatedropsin', 'jtjtjtjtjtjtjtjtjtjtjtjtjtmlnjtjtjtjtjtjtjirut', 'tothelastnumberofthea', 'theexperienceoftheteacherandsizeoftheschool', 'thededicationofwoodland', 'liniifiiiiiiirriimniiiiimmu', 'utjijnjtjijtjinruijxrirutxlnjviitjtjtruanivinjtrinrinjuuijtjtjtjtjxm', 'ijtitjittnjrtrutr', 'andalsoseethateachchildthatisold', 'maybehoarywithage', "teachers'institute", 'voteshisenergiestothegoodofhumanity', 'churchatwolflake', 'sanitariumphysicians', 'itiswelltobearin', 'donotsendloosecoin', 'frfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfrfr', 'eeeeeeeeesseeeeeeeeeeeeeeeeeee', 'jhosewhoenterthis', 'theyareforalittlebc', 'helpspreadthegospelmethodsofeducation', 'andwhatisthepriceofthe', 'itisamostsacredlegacysealedforeverbydivine', 'laajltlmxruinnrltirlrlrij', 'gatherthechildren', 'farther-reaching', 'followingnuifirt', 'systematizingfaculty', 'willsuggestlivetopicsforseminars', 'ante-revolutionary', 'systemraisesaints', 'non-sabbath-keeping', "his'all-embracing", 'littlerulesweallshouldkeep', 'ifone-halfthetimew', 'assistanteditorsigns', 'hibitoneseesachartwhichshowstherela', 'awatch-springisbroken', 'hasbeencalledfrom', 
'properpreparationsaremadeforthechurchschool', 'astepinadvancewouldbemade', 'andhavingcometothesecond', 'westfollowingcities', 'ipiknewyouandyouknewme', 'ittoallourpeople', 'correlationinarithmetic', 'openedaschoolfor', 'ofstudentsduringthelast', 'experiencesofteachers', 'feelingsofloveand', 'strong-principled', 'sarytotheestablishmentofthework', 'ferencebulletinoneof', 'aresevenintheschool', 'm------------------', 'mjtjinjirutjtjxrutjtnrutnjtjtjtjtjt', 'isbornofthespirit', 'planandthuscreatethefundwhichmakes', 'insteadofreadingtheadvocate', 'noamountofargument', 'lrirlrmjuxrijijinjtrlnjtmutjtjtnj', 'cluboftenadvocates', 'everytinguaranteed', 'thereareseveralpointstobeconsidered', 'reseeeeeeeeeeeeeeeeeeeeeeeeeeek', 'forquarterendingdecember', 'jxnnjijtnrijtjtrijutjxnrutjtjtjtjijxnjtjtjtjtnjtjtnnjxnj', 'reasonsforindustrinischools', 'educationbeganwiththemother', 'subscribeforthedoublepurposeofbringingtheadvocate', 'schooloflikegrade', 'tjttlajuljtjtnjtjtjtjtjutj', 'librarywhatyouwish', "thatgod'slovedoth", 'tthoroughlyytrained', 'theteachersdutytowardparentsand', 'shouldhavethetraining', "samson's-victories", 'thelessonsofgeographywillbe', 'rtnruanjirlnjtrnjuiruttitmminjxrt', 'fromthephysiologyclass', 'areviewopthechurchschoolwork', 'andthedoglookedsolemn', 'presidentofthewest', "we'llhaveahotirontoclaprighton", 'inixruxrijtrijtjinj', 'tittlewildneighbors', 'readersoftheadvocate', 'butwhenthatwhichisperfectiscome', 'johnbrisbenwalker', 'whenthoushaltvow', 'onemustbewillingtodie', 'solongaswreremaininignoranceof', 'thebusyweekisalmostover', 'theseaandthedryland', 'howintenseistheeffortinthesedaysto', 'mat-weavingisexcellentforteaching', 'communicationsand', 'mentalarithmetic', 'lookingatthebrain', 'ifliehadbuthalfaminuteto', 'tohavetheadvocate', 'theschoolbuildings', 'eachnumberoftheadvo', 'itisthesamewithus', 'densely-illustrated', 'andwhichdevelopsaperfect', 'shouldbereadbyeverychris', 'purposesintroducing', 'recitation-rooms', 'eeeeeeeeexexexxxeeeeeeeexeeeeee', 
'noticethelabelonthewrapper', 'nearlydoubledifwehadhadteachers', 'andalsosendmoney', 'isastudentofhealdsburg', 'arenotthepeopleinneedofthetruths', 'multitudesstandreadytofollow', 'sight-destroying', 'uxrxrutjarirumjxnjtjtrimtminriruxnj', 'thecluboftenadvocates', 'tothechildlaborlawofger', 'theraindropsgoldardgems', 'self-abandonment', 'willfillalong-feltwant', 'hasbeenchosenpresidentofthewashing', 'whenastudentinandover', 'five-hundred-pound', 'thatisthesecretof', 'schoolforwestern', 'christianschools', 'questionswillbetheinevitable', 'thearticleintheadvocate', '-------------------------------------------------------------------------', 'acordialandactive', 'njtrutjtrutnnruxnj', 'chosensuperintendent', 'thepuritansaseducators', "summerschoolsandteachers'institutes", 'thepresshasagitatedthequestion', 'spanish-speaking', 'missedithrozelle', 'thighbonesofthebodiestheyhavedissected', 'climbingorfighting', 'study------------------', 'artofquestioning', 'thatisasentiment', 'thequestionwhichshoulddecidewhether', 'necessarilyattended', 'boybutoneatthejohnworthy', 'secretaryoffaculty', 'throughthecolumnsoftheadvocate', 'aretryingtofindtherightwayto', 'commencementaddressdeliveredatunioncollegeby', 'rijinjxrijinjtitjtjtjijtj', 'assecono-classmattm', 'leftsundayeveningfor', 'kkkkkkkxkkkkskkkkkh', 'thatthatisthewaytogo', 'nevercananirreligiousschool', 'forseveralyearsa', 'sometimesitisbutau', 'ruxruxnnnjxnjtjtrunrlmtjtjtj', 'theclassspiritin', 'doestheschoolmakethepeo', 'theclubofadvocates', 'isofgeneralinter', 'sentouthundredsofteacherstoconduct', 'rvirijanjmjtjtjtjtj', 'accjuamtancewithgod', "havingchargeoftheboys'home", 'hasbeenmadebysisters', 'makeopportunities', 'andsadlyhebeggedthem', 'anduntilweknowandexperience', "gideon'svictoryoverthemidianites", 'self-examination', 'qrinnjirmrinnjtrijtjxnru', '-----------------------------------------------------------', 'thenexttrainstarts', 'foryearsprincipal', 'rumru-uinnrulru-mru-mrulnnnrmruuuinruuuiru', 
'thancanbederivedfromgallons', 'right-about-face', 'affordsexcellentfacilitiesforyouraccomodation', 'educationalmatters', 'respondwiththeeditorofthe', "xrmixajxnjxruxnmuxnjuxnjrintirlnjtrlrtnan'lrlr", 'utjtjtjxrirlrijtjt', 'thechildrenimaginethethreethat', 'thisnumberthetrainino', 'thereisareformto', 'shouldbealivingtheme', 'tailoringdepartment', 'theindependentofjuly', 'addressingtheadvocate', 'zsesasaszsaszsaasaszszsasbseszsisesaszbgsesasasasaszszsasesaszszmk', 'thecourageandhopetoeducatethemtotill', 'njanjtruxiannrinjtrijnj', 'istheresultofthe', 'rjiriruuxririjrnjtjtjuuxrijtjtriruirijtrlrutjr', 'wasneveradaysomistyandgray', 'preparationofchurchesforchurchschools', "jrlruu-u-lrulrumj'uuuu'utjinjuvuiririjutjuuutjuuutjtjutj-ij-iju'ij", 'patriarchsatidprophets', 'tlkbattlecmkcollegebookstand', 'theracehassuffered', 'non-sabbath-keep', 'butthatweshouldobservethelawofcolors', 'buthethatstrivesin', 'training-schoolfor', 'cross-fertilization', 'fortheencouragementofmy', 'loveneverfaileth', 'afterthereadinglessonwasover', "andwatchedthebarber'sshears", 'swordshouldbethebasisofeveryedu', 'tjxnjinjtjtjtjtjrrmj', 'willbegivenforalimitednumberofstudents', 'asofmanyothersciences', 'ofeducationalprinciples', 'thecookingclasswereresponsiblefor', 'foroneyearandpamphlet', 'isroomforthemanwhocanset', 'whatareyougoingtodowiththem', 'theeditorofthelifeboathas', 'second-classmatter', 'thesmallestdetails', 'onthekansascityline', 'mostofthedepartmentshave', 'readwiththeclassverseseighteenand', "uinjtnruxnjtnjxrutj'jxruijxrti", "writingfrombyrd's", 'missionsduringtheyear', 'nnjtjxrltijtriruxnj', 'comestomesayingthat', 'keekkkeeeeeeeeseeeeeeeeeeeeee', 'jtjiruutjtjtjinnjtririj', 'andnumberlessotherlittle', 'whichismorehelpfulthantheadvocate', 'thepeopleofthenorth', 'gospelofhealthand', 'asthepeopleweredependent', 'ruuxruxruxrxruinrinruxruxruxmxruuxruiruxruxnrlm', 'whisperstohimwhatheoughttodo', 'therewasadiversityoi', "theeditorofthechicagoamericansaj's", 
'rmmxnjxmuxruxnjtnjulrutjxntlnjt', 'ijijinjtjtannjtjtarinjxnjitmmtjtru', 'fromthebeginningofhuman', 'goodhealthpublishingco', 'vvaauunntteetthh', 'cannotbehappywithoutit', 'startofmebyyears', 'calledustoapartintheeducationalwork', 'strawberry-growers', 'connectedwitheitherthefarmorthehome', 'thegreatobjectof', 'jtjtjxiijtrutjtjtjtjtfutjuxrtj', 'tionofacluboftenadvocates', 'ltumrimu-uuinnnnri', 'pobtofficeatbimrkn', 'towardoneanother', 'awaythatappealstothe', "teachers'sanitary", 'shtooiitwitfbeplacedin', 'thandmulberrysts', 'mearepreparedtofurnish', 'especiallyifitbealargeredappleor', 'itistoseewhatheisfittedtodoandhow', 'rinriruxnjtjijtnnj', 'thereissomethinginitakintomotherhood', 'ijijxrjntrxminrirlrijxrij-u', 'battlecreekcollegebookstand', 'addresstheadvocate', 'faszszseszszszszszszseszssszezszszseszszszszszszszizszszseseszszszszszszsetil', 'drenallthattheyneed', 'istheharmoniousdevelop', 'writesthatdefinitear', 'endurethallthings', 'jijxnjtjtjxru-ijtjt', 'onetwenty-fourth', 'taaaaaaaaaaaaaaaaaytaaiss', 'parentstocomeinandvisitusthatday', 'the--------------conference', 'inourdenomination', 'teachersontheisland', 'thesysteminaschoolwheretheprinciples', 'haveyoureadthecalendarnumberoftheadvocate', 'thebabyasfastasyoucan', 'topromotewickedness', 'weresenttoalargenumberofour', 'lakeunionconference', 'thisprinciplemorefullyinto', 'mammahastaughthimtopray', 'ituirlruxrmjxrinjtjijtjtr', 'iwanttosayafewwordstothereadersof', 'medico-psychological', 'tobeartestimonytothefactthatithasopenedmy', 'buildingisinreadinessinseptember', 'ourschoolisprogressingnicely', 'terestedintheadvocate', 'dayonehotfoodmaybecookedandserved', 'mxfinjtjtjvinjtjtixnjxri', 'thereasonswhyphysicaltrainingshould', 'eeeeeeeeeeeeeeeeskeeeeeeeeeeeeeee', 'awordofexplanationwillmakeitclearwhy', 'theformerstudentswere', 'directoryofeducational', 'makerandaffordsmorereliefi', 'bbbbbbbbbbbbbbbbbbbbbbbb', 'rrjijijtjtrlrijiixnjtxi', 'requiremuchsympathy', 'rijijijxnnnjijijin', 'thepublicschoolsoffrance', 
'thompson-anderson', '-----------------', 'thevoiceofnature', 'jackthe-giant-killer', 'editoroftheadvocate', 'firsttrainleavesat', '-------------------------------------', 'shesaysthattherewould', 'nature-study-work', 'rlruinndmnruvirumnjruiruuxrtnju', 'ifyoureceivetheadvocate', 'jtjutjtjtjxrijtjtjutjtxutjxrlr', 'takeacopyoftheadvocate', 'njvfxrutxijtjinjtjtjxrijxri', 'njtjrnnjinjixuxrm', 'achanceforateach', 'andwethinkittobethe', 'willteachtheplano', 'thebattlecreekcollegebookstand', 'ijixuxjtjnjtjtjiruntutjijvvmjuonjt', 'youngesfscholars', 'ihavetakentheadvocate', 'liedivineplanofteaching', 'theengineswilldif', 'asaleoffifty-oneadvocates', 'the-hundred-best-picture', 'utjojtjlitjljtxlru', "innnnru'jtjtjxru", 'umbmihluuciiiilbhitfl', 'andwehavehadaveryinteresting', 'anddoitupinaknot', 'isnoindexofcharactersosure', 'ruvmnnnnnlruxrtfmrtrii', 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 'rutjtjtjtjxrijtjxrlr', 'tothismanwillilook', 'andthreecopiesofthe', 'prayer-answering', 'prayforthespiritofrevelation', 'nnnnnnmtrumru-ltlruuumrmrlruuuuuuuuu', 'businessmanagerofwalla', 'isteachingthechurch', 'heartilyfavortheplan', 'wesatintheduskthatevening', 'readinthepublicschools', 'rltlfutjtjtjtjxruvfjtjtjtjinixruummum', 'iruxanruxnnnnjxnnnrln', 'jxrlrijijmrlnjijijxrirlrinnjmnjtjm', 'school-of-health', 'baldwinofhyannis', 'andinthatcityivisitedoneoftheparo', 'withusduringtheshorttimewespent', 'oeneraltrafficmanager', 'thisisasplendidpeach', 'ofthereadersoftheadvocatetoasample', 'rutjxnxlrutnrlnjtjtjtixrmjxr', 'njutjtxutrtjtrijtjxrijuxriititjxnjijxrtjxrtrijxnjxnjxixri', 'vantumjiixrijtjti', 'teachersthebenefitoftheadvocate', 'self-development', 'asseoond-ccassmatter', 'nrmjijtjtjtjtjinjtjtjxnjtjtjiarutj', 'camponahillsidenear', 'urjxixnjxrtruvrlrirlrin', 'ofboyswereskating', 'isjustthepaperforthisplace', 'andwifehavebegun', 'wemanufacturethefamous', 'makeobjectswithpeasandtoothpicks', 'thereadersoftheadvocate', 'whileworkingliaidasastone-mason', 'bemeasuredtoyouagain', 
"thecuttingwon'thurt", 'rijtjxnjxnjtjirixtiijxnjimirirjijtpjiruinnrux', 'soil-cultivation', 'toshowhowthisenduringand', 'mnjitltirlrrljirijiriruanjm', 'asyoudidinyourdockingof', 'isawtheneedofchris', 'single-heartedness', 'bythecourtesyofitspub', "mjinjtjtjvltij'mjlja", 'mutjtjtrjtnjtjiruttlm-lnjtjij-uu-uu', 'mtjitltltlruuoruanjtixnj', 'correctconclusionsfromwhathesees', 'rijtjxixrjttmrmjtj', 'annnnrmnjtjxrutjxnjijxnjtj', 'coursesinagricultureareto', 'whenthatwhichisperfectiscome', 'nnjtjxruxorinruinjnnnan', "ij'ijij'tjijlflrtllrtgu", 'andafterreadingityourself', 'divergentroadstoattainthesamegoal', 'comegoodworkersinanyline', 'readingtheadvocate', 'educationforstreetboys', 'thedailyinter-oceanofnov', 'whohashadnoexperience', 'willbeissuedthelastofau', 'ruorutitnruirlruxruxrutnrutjtjvn', 'hasbeenagreathelpto', 'tjtfljtrijtjtjtjtjtjlrutjtjtjtjljtjljtjtjtrutjtju', 'youmightwritealettertomymother', 'theyareabletomakerapidadvancrnent', 'whohashadchargeof', 'inthiswayhewillnotonlygetbetter', 'purchasefiardwarc', 'thisletterwaswrittenbyabra', 'wordhenterprises', 'ijtjtjutrlrijtrlrijxrutjxnjxnjtj', 'itisworthnothing', 'directoryofeducationalworkers', 'thejuniataindustrialschool', 'thereisamorbidaversiontosunburn', 'ofthedeathofmissfranceswright', 'ofourchurchschoolteacherswrites', 'terestinthecauseforwhiththead', 'umrltltltlruutjarlrltltltijut', 'utjajtjtxurruirmji', 'ifwegivealittleandgetagreatdeal', 'schoolshouldbeplantedina', 'writingfromlakeodessa', 'missionarycollegeopened', 'carefully-thought-out', 'ttuajutnrutjxixm', 'forquarterendingsept', 'postoffioeatberriensprings', "ptklcethandkerchief'", 'pleasealwaysmentiontheadvocate', 'character-growth', 'posedofinlessthantenminutes', 'forquarterendingmarch', 'mentionedintheadvocate', 'believemeifishould', 'itisessentialthat', 'ulrurnnnjijinjtjxixrlrlnjuub', 'uniformexaminationsforele', 'whichisnowapointof', 'robertcollegeisone', 'toforeigncountries', "breathedeeplyofgod'sgreatout-ofdoors", 'juvenilereadingofthehighest', 
'missjennielarmouth', 'ixinxinjxrtriruirun', 'asyoucannotaffordtomissthevisitsofthe', 'royalbluebrandcannedgoodsisthefinest', 'oneoftheteacherswho', 'nevertodoanythingwhich', 'andonlyincreasestwo', 'utjtjtjtnjtruijtjxrmntutiijtjxmtjtjtjtjinjtitnnnjtrij', 'fieldsecretaries', 'oronapieceofbrick-colored', 'anicelineofsheets', 'istingintheeducationalworldinthesouth', 'itwillteachyoutot', 'taughtinapublicschool', 'awordforthecaterpillar', 'fessorsutherlandinthead', 'childrenshouldbeleftasfreeas', 'thepromisemadeinisaiah', 'ittaketomakethetrip', 'booksforourschools', 'advocateshouldgointoeveryseventhmonths', 'letusfollowthemandsee', 'hissorrowandanger', 'vrlrlrutjinjtftjmnjtnjmji', 'wiseadministered', 'madenomistakeintheexampleinthegar', 'industrialacademy', 'tlianksgivingday', 'adirectoryofchristian', 'theynolongerspend', 'andknowallmysteriesand', 'emmanuelillmssionariecollege', 'planwouldyousuggesttohelpmatters', 'wearetodetermine', 'takenupwiththenew', 'kkkkkkskkkkkskkkkkkskkkkkkkkkkk', 'andneverseemtocare', 'self-forgetfulness', 'officewillbeopenfrom', 'spirituallessonofthepowerandworkof', 'knowledge-making', 'imnjitlruiruinjxajruanniixuxruxnixrumrtxirmivxiirxnjxrinruajuixinntinixjtrtruxrinnrufe', 'chebattlecreekcollegebookstand', 'unitedstateswhichhas', 'issueoftheadvocate', 'ijxriruijijutjtjtjirutjtjtjtrlrln', 'somestrongfeaturesofbiblereader', 'carepreparedtofurnishcollegesand', 'andtriedtogetloose', 'hewilllayuponusabur', 'keepwatchofthepassengers', 'amanthinkethinhisheart', 'industrialdepartment', 'thefieldsandextensiveplains', 'utxlruxruxruuxanjixuin', 'seesomeoftheletters', 'jxrrmjtjijttliijarlanjtjtjtrmrlriait', 'rrrutjmnrmruxnjvrirutjanjmrutnj', 'ofcorrespondence', 'jacksonboulevard', 'youshouldbewellinformed', 'wasthequietanswer', "school-inaster's", 'littlechildrentocomfort', 'canwefindsogoodameans', 'havefeltthatitwasourgreatest', 'whatissorareasadayinjune', 'njtjtnjtrutrijijmrmjtnjtnjtnnjtnjtr', 'thehomeandschool', 'whereyourtreasureis', 
'isalawinthestateofmichigan', 'theresourcefulness', 'jack-at-all-trades', 'church-schoolteachers', 'ijijijtjtjtxuuxrutrinjmjmjxnxi', 'theindispensable', 'comestotelltheadvocatefamilyofthe', 'thenisitnottimeforusto', 'writetothetraining', 'jtjxrmrlnjtrjinrlrijxrlrijijxrinrtririnji', 'inthejanuarynumberoftheadvocate', 'eeeeseeeeeeeeeeeeeeeeseeeeeeeeeee', 'ruxnjinjtjtjnrutjtjtjuvrmjtjtjtjtjtjtjxnjxnjtjtjt', 'vocalandinstrumentalmusic', 'comeswillseetoitthattheyare', 'utjtjtjtjtjxmuijijtjxnjijtj', 'tjulnjtjxrxnjijvljtjtjtjxnjtj', 'theindustrialschools', 'peachtreesshouldbepruned', 'mjnjijtjtjtjxrij', 'lnjrrunruxruxrianri', "children'scorner", 'butfewtext-books', 'training-schoolfo', 'andtherewasayoungman', 'didnothesitateto', 'xrvuvuxrtnnnivwvaantvrv', 'andtesttheabilityofeventheyoungerones', 'js-----------------------------------------------------------------------', 'theonlytruehappiness', 'joyisabroadintheworld', 'goodsstoreinbattlecreek', 'tyofconcentration', 'forblindchildren', "christless'graves", 'ihjuurnxltltltlrvjttlrltlftruruu', 'uxrxrvirijuinjxririj', 'aswewouldifwewereinheaven', 'cationbecounteracted', 'youareamemberofthechurch', 'sonotallthatappearsiswheat', 'ingconvincingreasonswhytheschoolgar', 'thewesternpartofthestateoforegon', 'theseptemberissueoftheadvocate', 'southlancasteracademy', 'astonishedgazeatthewriter', 'understandfigures', 'tothepublicschool', 'hastheardmyprayer', 'jxnrijutjxrutnjt', 'njirnirmjtjtjtjtjtnrutjxmrrmjirlrijtjtjtjarlarijtjaj', 'greatthingintheworld', 'notastepbeforeme', 'self-justification', 'wonfirstprizeinacom', 'thestreetwearsfinergar', 'promoteconstructive', 'andfeelsurethatourschools', 'blackboardillustration', 'thecharacterofeverystudenttrainedwithin', 'eeeeseeeeeeeeeeeeeeeeeseeeeeees', 'andshowbythepicturesthat', 'hasgeneraloversightof', 'rijimirutjxrutjiruutjij', 'whohathledwilllead', 'uiruuirultuu-uuirlnrxnjirirb', '-----------------------------------------', 'thefactthatrattle', "sisofworkinyoungpeople'ssocieties", 
'whenthewholechurchworktogetherin', 'jijijxrinjtjiajtjxrln', "impresses'eternal", 'readinthebiblereader', 'whatworkcanhedowithmewheniamin', 'thecapetownchurchschool', 'correspondencesolicited', 'pagesoftheadvocate', "gideon'squestion", "ijtjxruxrutjlj-ijxrixruxruuijtjirijtxuxruuxnxutnjxruuumi'u-lru-u", 'parkerjointlessfountainpen', 'bagasasasasasasaszsasasaszsaszggasasaszsaszsasasasasasaszsasaszsasasasa', 'isatideintheaffairsofmen', 'injtjtjtjtjxrmnjtjtjtjttirmnjti', 'out-of-theexpenses', 'berdellachatfield', 'andourworkisprogressing', 'theywillfinditaverygreatblessing', 'whichtheschoolsotherwisetendtoproduce', 'misslottiefarreelwrites', 'theyloglaughedout', 'clothestothelaundry', 'posiofficeatbcrhicnsibir', 'rlrutrintjtjtjtri', 'irixixutjtjtjiitjtjtjtjtjtruijtjxruutjtjxriitjtjtjiruxixnjiitnjijtjtnjtjxrijt', 'operationofparentsandteachers', 'edvrctrrioisirilw', 'thathehascomefarshort', 'utjtjtjtjxnjulrltltxnjxrlru', 'oflifeandwhichinspirelife', 'asystemofeducationwithoutfaithand', 'thehousewelivein', 'heislikeuntoamanbeholding', 'thetrainingschool', 'ikkkkkxxkkxkkkkkkkskkkkkkkkksi', 'youreceivetheadvocate', 'andtheawfuldangerofmakingthisourdwell', 'onebrotherhasgivenfiveacresof', '----------------', 'toremovemountains', 'aboybadafterkilling', 'hewillbeassistedbyprof', 'icannotseethatyouhave', 'wallawallacollege', 'foregoingcommand', 'mackaywasateacher', 'hasinthepastcalledthe', 'bencfitsiifchristianeducation', 'neverhadchildrenlearntoreadsoquick', 'outrageoushoodlumism', 'witsltthimlrprttrrtfthf', 'rfutjinitnrltlru-u', 'muchmoreoftheword', 'intheimageofgodcreatedhe', 'andtohimonewhohasnotpassed', 'andsimichiganave', 'somemanwhochangesposition', 'fourandahalfincheslongandthreeincheswide', 'thecityofwashington', 'addressingthecollege', 'ofthechurchesandschoolsinthediocese', 'wepayforschoolsnetsomuchoutof', 'epottothecollege', 'neartwenty-second', 'thecorrespondence-study', "parents'meetings", 'referencetobroadening', 'primarylanguagelessons', 
'xinruumnjtjtanjtjtjtjtjtjtjirurnnjtjtj-mnnjtjtj', 'ofthelastissueoftheadvocate', 'xnjuijijuxrijtitrutjtjtririjlruijtjtjxnjtjtjtj', 'whatimprovementmaywe', 'uptothepresenttimeourchurchmem', 'heisasmuchstrongerthanastu', 'unitedeffortsofthepupils', 'spanish-american', 'andpricesaremoderate', 'whatiftheyshould', 'spromisetoabraham', 'hasamissionoutside', 'drooping-branched', 'wehavehadthebiblelessons', 'themachineryinorder', 'whatthestatedoes', 'workertogivetheadvocate', 'vnrtrxrxmxruxnnruxruxruxrlnnruijuxixrinnrlrijxixrxrmnririjxnjxixnnruxnjxrlrlrtn', 'writingforthecent', 'themaynumberoftheadvocate', 'whenthechildrenhavelearned', 'pointwhereamancannotsharpenhisown', 'oftheunfortunatevictimsofdrughabits', 'andastheshadowsroundmecreep', 'thiswaythatmanualtrainingisplacedon', 'theaboveisfromtheopeningparagraph', 'rmjtjmjxrinruirixmjinxiaj', 'deliveredtheclosing', 'howmanyeducatedmothers', ...], 15)
Correction 10 -- Remove long error tokens
# %load shared_elements/remove-tokens-with-long-strings-of-characters.py
prev = cycle
cycle = "correction10"

directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])

# Regular (non-hidden) files from the previous correction cycle.
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))

for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)

    # Blank out digits and selected punctuation before tokenizing.
    text = re.sub(r"[0-9,!?$:;&]", " ", content)
    tokens = utilities.tokenize_text(text)

    # Character classes checked for long repeated runs.
    # "x|X" is listed twice, so a token that matches it is reported twice in the output.
    sub_list = ["e|E", "n|N", "u|U", "x|X", "t|T", "b|B", "a|A", "w|W", "s|S", "x|X", "k|K"]

    replacements = []
    for sub in sub_list:
        replacements.append(clean.check_for_repeating_characters(tokens, sub))

    # Flatten the per-pattern lists into one list of (token, replacement) pairs.
    replacements = [item for sublist in replacements for item in sublist]

    if len(replacements) > 0:
        print('{}: {}'.format(filename, replacements))

        for replacement in replacements:
            content = clean.replace_pair(replacement, content)

    # Write every file to the new cycle directory, changed or not.
    # The with block closes the file, so no explicit close() is needed.
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
ADV18991101-V01-10-page32.txt: [('Iknowhadneedoneeveningtopassbetween', ' ')]
ADV19000101-V02-01-page31.txt: [('uvuinnimruuinrumruuiJinjTnnJviJvarinjTruuuuuvrinjimuxrirLrLruinjLnnnnnnnjiJxtJ', ' '), ('uvuinnimruuinrumruuiJinjTnnJviJvarinjTruuuuuvrinjimuxrirLrLruinjLnnnnnnnjiJxtJ', ' ')]
ADV19000101-V02-01-page32.txt: [('imnjiTLruiruinjxajruanniixuxruxnixrumrTxirmivxiirxnJxrinruaJuixinnTiniXJtrtruxrinnrufe', ' ')]
ADV19000101-V02-01-page35.txt: [('KKKKKKKKKKKKK', ' '), ('KKKKSKKKKSKKKS', ' ')]
ADV19000201-V02-02-page39.txt: [('EEEESEEEEEEEEEEEEEEEEESEEEEEEES', ' '), ('EEEEEEEEEEEEEEEESKEEEEEEEEEEEEEEE', ' ')]
ADV19000301-V02-03-page35.txt: [("JrLruu-u-LruLrumj'uuuu'UTjinjuvuiririjutjuuutjuuutjtjutj-ij-iju'ij", ' '), ('uiruuiruLTuu-uuirLnrxnjirirb', ' ')]
ADV19000301-V02-03-page38.txt: [('iKKKKKXXKKXKKKKKKKSKKKKKKKKKSI', ' ')]
ADV19000301-V02-03-page39.txt: [('EEEESEEEEEEEEEEEEEEEESEEEEEEEEEEE', ' ')]
ADV19000401-V02-04-page57.txt: [('xratJTJTJTJTrin.TJuuu-ltltltltuutjtjtjtjtxltuuxrmjxnjTjTjmjTJxrLTLn.ruu', ' ')]
ADV19000401-V02-04-page60.txt: [('vnrtrxrxmxruxnnruxruxruxrLnnruiJuxixrinnrLriJxixrxrmnririJxnJxixnnruxnjxrLrLrTn.nirLr', ' '), ('LnJXTLnjXTLTXiLrUXrXrXnrLrUUXrUXrUXIXrXfTJXrXnrUXIXrUUXrLrUUXTJlJXrDi', ' ')]
ADV19000401-V02-04-page67.txt: [('KXKKKKKSSKSKKKKSSKSKKSSKKKKM', ' '), ('KXKKKKKSSKSKKKKSSKSKKSSKKKKM', ' ')]
ADV19000501-V02-05-page34.txt: [('nnnnnnmtrumru-LTLruuumrmrLruuuuuuuuu', ' ')]
ADV19000601-V02-06-page37.txt: [('KKKKKKKSKSKKKKKKKKSSIKKKSKKKKK', ' ')]
ADV19000601-V02-06-page38.txt: [('IKKKKKKKKKKSK', ' ')]
ADV19000701-V02-07-page35.txt: [('jxrLrmrLrinjxrLrLiinnjTJTJXixnjmnxmjTJTnjTjnnjTJirmnnrianiJ', ' ')]
ADV19000701-V02-07-page38.txt: [('KEEKKKEEEEEEEESEEEEEEEEEEEEEE', ' '), ('EEEEEEEEESSEEEEEEEEEEEEEEEEEEE', ' '), ('eeeeeeeseeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', ' ')]
ADV19000801-V02-08-page38.txt: [('KKKKKKSKKKKKSKKKKKKSKKKKKKKKKKK', ' ')]
ADV19001001-V02-10-page38.txt: [('EEEEEEEEEXEXEXXXEEEEEEEEXEEEEEE', ' '), ('EEEESEEXSEEEEEEXXEEESEEXXXEE', ' ')]
ADV19001101-V02-11-page38.txt: [('BBBBBBBBBBBBBBBBBBBBBBBB', ' ')]
ADV19001201-V02-12-page17.txt: [("ijTJxruxruTJlj-ijxRixruxruuiJTJirijTXUxruuxnxuTnjxruuumi'u-Lru-u", ' ')]
ADV19001201-V02-12-page35.txt: [('LartnnrirxnnrinruannJTmiTuxanruTi', ' ')]
ADV19010101-V03-01-page37.txt: [('wwwwwwwwwwwwt', ' ')]
ADV19010201-V03-02-page39.txt: [('ESXEEEEEEEEEEEEKEEEEEEEEEEEEEEE', ' '), ('EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE', ' ')]
ADV19010301-V03-03-page3.txt: [('nnjinnjinnrirm', ' '), ('iinnnnruinnnnnn', ' ')]
ADV19010401-V03-04-page41.txt: [('qjrnjxririnrLnjTriJTrinnnjTJTJiJTJTJTnnjTmTJTnjTnnjTruTJTrij', ' ')]
ADV19010401-V03-04-page43.txt: [('RESEEEEEEEEEEEEEEEEEEEEEEEEEEEK', ' ')]
ADV19010401-V03-04-page44.txt: [('I.AAAAAAAAAAAA', ' ')]
ADV19010501-V03-05-page3.txt: [('iJTrinnjiJTnnJVinnjuxnnjanjTrinjmanjxri', ' ')]
ADV19010501-V03-05-page39.txt: [('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', ' '), ('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', ' ')]
ADV19010501-V03-05-page40.txt: [('AAAAAAAAAAAAAXA', ' '), ('tAAAAAAAAAAAAAAAAAytAAisS', ' ')]
ADV19010601-V03-06-page34.txt: [('JMKKKKKKKSMSSSKKKKigSiKKKi', ' ')]
ADV19010601-V03-06-page37.txt: [('VWVWVWWVWVWWWVWWVWWWVWWV', ' ')]
ADV19020201-V04-02-page35.txt: [('BSSSSBSSBIBSSi', ' '), ('KKKKKKKXKKKKSKKKKKH', ' '), ('KiSKSKSKKKSKSKKSSKKXaEKKKKKMKHKSfrSS', ' ')]
ADV19040901-V06-09-page2.txt: [('rvrLnjruxruuxrtrtJirLrLrLruuxTLTLTtnruuuu', ' ')]
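The body of clean.check_for_repeating_characters is not reproduced in this notebook, so the following is only a minimal sketch of one way a "long run of the same letter" check could work. It is not the text2topics implementation: the function names, the 12-character minimum, and the 50% threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def looks_like_noise(token, min_length=12, max_share=0.5):
    """Return True when a single letter dominates a long token.

    Hypothetical helper: min_length and max_share are illustrative
    values, not thresholds taken from text2topics.
    """
    letters = [ch.lower() for ch in token if ch.isalpha()]
    if len(token) < min_length or not letters:
        return False
    # Count of the most frequent letter, ignoring case.
    _, top_count = Counter(letters).most_common(1)[0]
    return top_count / len(letters) >= max_share


def noise_replacements(tokens):
    """Pair each noisy token with a single space, mirroring the (token, ' ') output above."""
    return [(tok, ' ') for tok in tokens if looks_like_noise(tok)]
```

A rule like this flags strings such as 'KKKKKKKKKKKKK' while leaving run-together phrases such as 'patriarchsandprophets' alone. It would not catch mixed-letter noise (for example the long 'uvuinn...' strings), which would need a different test.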
# %load shared_elements/summary.py
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction10
Average verified rate: 0.9678376476557775
Average of error rates: 0.04819682059046178
Total token count: 1253422
# %load shared_elements/top_errors.py
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 10 )[:50]
[('e', 3145), ('t', 2054), ('w', 1969), ('m', 1494), ('r', 1353), ('f', 1337), ('n', 1268), ("'", 1262), ('d', 899), ('g', 636), ('u', 503), ('k', 432), ('co', 352), ('x', 280), ('th', 226), ('z', 129), ('q', 112), ('fr', 92), ('ment', 88), ('tion', 80), ('re', 75), ('ofthe', 74), ('pp', 71), ('ex', 68), ('ers', 57), ('ft', 56), ('io', 55), ('mo', 44), ('mt', 43), ('il', 42), ('ky', 41), ('si', 39), ('oi', 38), ('ol', 34), ('ucation', 34), ('--', 33), ('va', 32), ('es', 31), ('dren', 30), ('tbe', 30), ('bo', 29), ('tlie', 29), ('jt', 28), ('pa', 27), ('al', 26), ('ma', 26), ('op', 26), ('ou', 26), ('pm', 26), ('chil', 26)]
Correction 11 -- Correct common OCR character substitutions
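The idea behind this step is to try each known OCR confusion on an unverified token and keep the result only when it produces a word in the spelling dictionary. A minimal, self-contained sketch of that test follows. The substitution pairs and the small word set are hypothetical stand-ins (the contents of common_substitutions.txt are not shown here), and propose_corrections is an illustrative name, not a text2topics function.

```python
import re

# Hypothetical (pattern, replacement) pairs, standing in for common_substitutions.txt.
SUBSTITUTIONS = [("li", "h"), ("v", "y"), ("II", "H")]

# Tiny stand-in for the verified spelling dictionary.
KNOWN_WORDS = {"the", "by", "what", "with"}


def propose_corrections(token, substitutions, known_words):
    """Return (token, candidate) pairs whose candidate is a verified word."""
    # Skip tokens that are already verified or too short to correct safely.
    if token.lower() in known_words or len(token) <= 1:
        return []
    proposals = []
    for pattern, replacement in substitutions:
        candidate = re.sub(pattern, replacement, token)
        # Keep the candidate only if the substitution changed something
        # and the result is in the dictionary.
        if candidate != token and candidate.lower() in known_words:
            proposals.append((token, candidate))
    return proposals
```

With these stand-in values, propose_corrections("tlie", SUBSTITUTIONS, KNOWN_WORDS) yields the pair ('tlie', 'the'). Because acceptance depends only on dictionary membership, a substitution can also turn a genuine word that is missing from the dictionary into a different valid word, so the printed replacements are worth reviewing.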
# %load shared_elements/make_common_substitutions.py
prev = cycle
cycle = "correction11"

directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])

# Load tab-separated (pattern, replacement) pairs, skipping blank lines.
f = utilities.readfile('/Users/jeriwieringa/Dissertation/drafts/data', 'common_substitutions.txt')
sub_list = [line for line in f.split('\n') if line.strip()]
common_substitutions = [tuple(i.split('\t')) for i in sub_list]

corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))

for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)

    text = utilities.strip_punct(content)
    tokens = utilities.tokenize_text(text)
    errors = reports.identify_errors(tokens, spelling_dictionary)

    # Drop tokens whose lowercase form is already a verified word.
    errors_updated = []
    for error in errors:
        if error.lower() not in spelling_dictionary:
            errors_updated.append(error)

    # Try each substitution; keep it only if the result is a verified word.
    replacements = []
    for error in errors_updated:
        if len(error) > 1:
            for sub in common_substitutions:
                pattern = sub[0]
                if re.search(pattern, error):
                    test_sub = re.sub(pattern, sub[1], error)
                    if test_sub.lower() in spelling_dictionary:
                        replacements.append((error, test_sub))

    if len(replacements) > 0:
        print('{}: {}'.format(filename, replacements))

        for replacement in replacements:
            content = clean.replace_pair(replacement, content)

    # The with block closes the file, so no explicit close() is needed.
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
ADV18981201-V02-01-page11.txt: [('tlie', 'the')]
ADV18981201-V02-01-page14.txt: [('bv', 'by')]
ADV18981201-V02-01-page16.txt: [('wliat', 'what'), ('Ood', 'God')]
ADV18981201-V02-01-page17.txt: [('ve', 'ye')]
ADV18981201-V02-01-page18.txt: [('witli', 'with')]
ADV18981201-V02-01-page19.txt: [('vour', 'your')]
ADV18981201-V02-01-page26.txt: [('coine', 'come')]
ADV18981201-V02-01-page7.txt: [('bv', 'by')]
ADV18990101-V01-01-page30.txt: [('youtli', 'youth')]
ADV18990101-V01-01-page36.txt: [('Cedak', 'Cedar')]
ADV18990101-V01-01-page43.txt: [('OP', 'OF')]
ADV18990101-V01-01-page45.txt: [('Ollice', 'Office')]
ADV18990101-V01-01-page6.txt: [('liow', 'how')]
ADV18990201-V01-02-page13.txt: [('trutli', 'truth')]
ADV18990201-V01-02-page2.txt: [('DRUOS', 'DRUGS'), ('Mk', 'Mr')]
ADV18990201-V01-02-page23.txt: [('tlio', 'tho')]
ADV18990201-V01-02-page35.txt: [('tains', 'tams')]
ADV18990201-V01-02-page42.txt: [('liave', 'have')]
ADV18990201-V01-02-page47.txt: [('tlio', 'tho')]
ADV18990201-V01-02-page52.txt: [('witli', 'with')]
ADV18990201-V01-02-page54.txt: [('needetli', 'needeth')]
ADV18990201-V01-02-page57.txt: [('joung', 'young')]
ADV18990201-V01-02-page58.txt: [('Pkof', 'Prof')]
ADV18990201-V01-02-page66.txt: [('ik', 'ir')]
ADV18990301-V01-03-page17.txt: [('tlior', 'thor')]
ADV18990301-V01-03-page26.txt: [('cliurch-school', 'church-school')]
ADV18990301-V01-03-page33.txt: [('lield', 'held')]
ADV18990301-V01-03-page40.txt: [('ine', 'me')]
ADV18990301-V01-03-page48.txt: [('eartli', 'earth'), ('Ood', 'God')]
ADV18990301-V01-03-page55.txt: [('Tvro', 'Tyro')]
ADV18990301-V01-03-page7.txt: [('whP', 'whF')]
ADV18990401-V01-04-page18.txt: [('Aliab', 'Ahab')]
ADV18990401-V01-04-page19.txt: [('witli', 'with')]
ADV18990401-V01-04-page21.txt: [('tlio', 'tho')]
ADV18990401-V01-04-page31.txt: [('liis', 'his')]
ADV18990401-V01-04-page41.txt: [('teacli', 'teach'), ('arinor', 'armor')]
ADV18990401-V01-04-page57.txt: [('common-scliool', 'common-school')]
ADV18990401-V01-04-page62.txt: [('Por', 'For')]
ADV18990501-V01-05-page21.txt: [('li', 'h')]
ADV18990501-V01-05-page23.txt: [('ineals', 'meals')]
ADV18990501-V01-05-page26.txt: [('tliose', 'those')]
ADV18990501-V01-05-page27.txt: [('bearetli', 'beareth')]
ADV18990501-V01-05-page30.txt: [('tlie', 'the')]
ADV18990501-V01-05-page31.txt: [('bv', 'by'), ('tlio', 'tho')]
ADV18990501-V01-05-page37.txt: [('wav', 'way')]
ADV18990601-V01-06-page100.txt: [('sucli', 'such')]
ADV18990601-V01-06-page105.txt: [('witli', 'with')]
ADV18990601-V01-06-page106.txt: [('TIIE', 'THE')]
ADV18990601-V01-06-page122.txt: [('Oreek', 'Greek')]
ADV18990601-V01-06-page125.txt: [('liay', 'hay')]
ADV18990601-V01-06-page126.txt: [('vou', 'you')]
ADV18990601-V01-06-page128.txt: [('PAG', 'FAG')]
ADV18990601-V01-06-page2.txt: [('tlie', 'the')]
ADV18990601-V01-06-page28.txt: [('Crotliers', 'Crothers')]
ADV18990601-V01-06-page43.txt: [('hav', 'hay')]
ADV18990601-V01-06-page46.txt: [('backstitcliing', 'backstitching')]
ADV18990601-V01-06-page50.txt: [('TIIE', 'THE')]
ADV18990601-V01-06-page52.txt: [('TIIE', 'THE')]
ADV18990601-V01-06-page55.txt: [('wliat', 'what')]
ADV18990601-V01-06-page87.txt: [('churcli', 'church')]
ADV18990701-V01-07-page18.txt: [('church-scliool', 'church-school')]
ADV18990701-V01-07-page37.txt: [('mk', 'mr')]
ADV18990701-V01-07-page39.txt: [('ok', 'or')]
ADV18990701-V01-07-page42.txt: [('TIIE', 'THE')]
ADV18990701-V01-07-page48.txt: [('ve', 'ye')]
ADV18990701-V01-07-page58.txt: [('TIIE', 'THE')]
ADV18990701-V01-07-page59.txt: [('witli', 'with')]
ADV18991001-V01-09-page22.txt: [('tlie', 'the')]
ADV18991001-V01-09-page25.txt: [('akt', 'art')]
ADV18991001-V01-09-page26.txt: [('cin', 'cm')]
ADV18991001-V01-09-page36.txt: [('TIIE', 'THE')]
ADV18991001-V01-09-page42.txt: [('SIIASTA', 'SHASTA')]
ADV18991001-V01-09-page9.txt: [('IIint', 'Hint'), ('whirletli', 'whirleth')]
ADV18991101-V01-10-page10.txt: [('tliat', 'that')]
ADV18991101-V01-10-page12.txt: [('sclioolhouse', 'schoolhouse'), ('witli', 'with')]
ADV18991101-V01-10-page24.txt: [('ve', 'ye')]
ADV18991101-V01-10-page33.txt: [('bov', 'boy')]
ADV18991101-V01-10-page39.txt: [('Emina', 'Emma'), ('tliem', 'them')]
ADV18991101-V01-10-page4.txt: [('faitli', 'faith')]
ADV18991101-V01-10-page40.txt: [('Ou', 'Gu')]
ADV18991101-V01-10-page42.txt: [('tliat', 'that')]
ADV18991101-V01-10-page47.txt: [('PRUITS', 'FRUITS')]
ADV18991101-V01-10-page50.txt: [('OnU', 'GnU')]
ADV19000101-V02-01-page11.txt: [('OP', 'OF')]
ADV19000101-V02-01-page19.txt: [('coineth', 'cometh')]
ADV19000101-V02-01-page20.txt: [('ve', 'ye')]
ADV19000101-V02-01-page21.txt: [('waveretli', 'wavereth')]
ADV19000101-V02-01-page35.txt: [('alii', 'ahi'), ('Oen', 'Gen')]
ADV19000101-V02-01-page36.txt: [('Ok', 'Or')]
ADV19000101-V02-01-page8.txt: [('proceedetli', 'proceedeth')]
ADV19000101-V02-01-page9.txt: [('TIIE', 'THE')]
ADV19000201-V02-02-page12.txt: [('wortli', 'worth'), ('tains', 'tams')]
ADV19000201-V02-02-page14.txt: [('Bok', 'Bor')]
ADV19000201-V02-02-page19.txt: [('mv', 'my')]
ADV19000201-V02-02-page2.txt: [('ok', 'or')]
ADV19000201-V02-02-page21.txt: [('Islimael', 'Ishmael')]
ADV19000201-V02-02-page22.txt: [('Abraliam', 'Abraham')]
ADV19000201-V02-02-page26.txt: [('ve', 'ye')]
ADV19000201-V02-02-page28.txt: [('vear', 'year')]
ADV19000201-V02-02-page38.txt: [('RUGOLES', 'RUGGLES')]
ADV19000201-V02-02-page5.txt: [('Bok', 'Bor')]
ADV19000301-V02-03-page10.txt: [('hav', 'hay')]
ADV19000301-V02-03-page13.txt: [('cometli', 'cometh')]
ADV19000301-V02-03-page18.txt: [('ve', 'ye')]
ADV19000301-V02-03-page20.txt: [('Tj', 'Ty')]
ADV19000301-V02-03-page35.txt: [('Sav', 'Say')]
ADV19000301-V02-03-page4.txt: [('ork', 'orr')]
ADV19000401-V02-04-page15.txt: [('mucli', 'much')]
ADV19000401-V02-04-page18.txt: [('daj', 'day')]
ADV19000401-V02-04-page19.txt: [('copj', 'copy')]
ADV19000401-V02-04-page21.txt: [('thev', 'they'), ('poorlv', 'poorly'), ('Bv', 'By')]
ADV19000401-V02-04-page28.txt: [('simplv', 'simply')]
ADV19000401-V02-04-page33.txt: [('vou', 'you')]
ADV19000401-V02-04-page34.txt: [('vou', 'you'), ('vour', 'your')] ADV19000401-V02-04-page36.txt: [('vou', 'you')] ADV19000401-V02-04-page4.txt: [('tlieir', 'their')] ADV19000401-V02-04-page41.txt: [('praver', 'prayer')] ADV19000401-V02-04-page42.txt: [('praver', 'prayer')] ADV19000401-V02-04-page44.txt: [('Pra', 'Fra')] ADV19000401-V02-04-page50.txt: [('Marv', 'Mary'), ('Anatomv', 'Anatomy')] ADV19000401-V02-04-page51.txt: [('Oen', 'Gen')] ADV19000401-V02-04-page53.txt: [('mav', 'may')] ADV19000401-V02-04-page56.txt: [('machinerv', 'machinery')] ADV19000401-V02-04-page60.txt: [('tj', 'ty')] ADV19000401-V02-04-page63.txt: [('CP', 'CF'), ('UPA', 'UFA')] ADV19000401-V02-04-page66.txt: [('applv', 'apply'), ('tlie', 'the')] ADV19000501-V02-05-page15.txt: [('ine', 'me')] ADV19000501-V02-05-page16.txt: [('abidetli', 'abideth')] ADV19000501-V02-05-page18.txt: [('rightlj', 'rightly')] ADV19000501-V02-05-page21.txt: [('ve', 'ye'), ('liis', 'his')] ADV19000501-V02-05-page22.txt: [('tlie', 'the')] ADV19000501-V02-05-page28.txt: [('pavable', 'payable')] ADV19000501-V02-05-page3.txt: [('tlieir', 'their')] ADV19000501-V02-05-page38.txt: [('Citv', 'City')] ADV19000501-V02-05-page4.txt: [('TIIE', 'THE'), ('wealthj', 'wealthy')] ADV19000501-V02-05-page5.txt: [('hatli', 'hath')] ADV19000601-V02-06-page1.txt: [('Librarv', 'Library')] ADV19000601-V02-06-page12.txt: [('tlie', 'the')] ADV19000601-V02-06-page13.txt: [('TIIE', 'THE')] ADV19000601-V02-06-page16.txt: [('tliey', 'they')] ADV19000601-V02-06-page20.txt: [('ver', 'yer')] ADV19000601-V02-06-page28.txt: [('tlie', 'the')] ADV19000601-V02-06-page29.txt: [('liay', 'hay'), ('OP', 'OF')] ADV19000601-V02-06-page31.txt: [('OP', 'OF')] ADV19000601-V02-06-page37.txt: [('Cin', 'Cm'), ('li', 'h')] ADV19000601-V02-06-page5.txt: [('dav', 'day')] ADV19000601-V02-06-page6.txt: [('Melanchtlion', 'Melanchthon')] ADV19000701-V02-07-page12.txt: [('dav', 'day')] ADV19000701-V02-07-page22.txt: [("saj's", "say's"), ('liis', 'his')] 
ADV19000701-V02-07-page24.txt: [('ver', 'yer')] ADV19000701-V02-07-page3.txt: [('li', 'h')] ADV19000701-V02-07-page35.txt: [('mj', 'my')] ADV19000701-V02-07-page36.txt: [('Kinlev', 'Kinley')] ADV19000701-V02-07-page38.txt: [('Ok', 'Or')] ADV19000701-V02-07-page9.txt: [('manv', 'many'), ('mv', 'my')] ADV19000801-V02-08-page16.txt: [('alwajs', 'always'), ('tliem', 'them')] ADV19000801-V02-08-page17.txt: [('Seventli', 'Seventh')] ADV19000801-V02-08-page18.txt: [('mj', 'my')] ADV19000801-V02-08-page2.txt: [('ok', 'or')] ADV19000801-V02-08-page22.txt: [('scliool-boy', 'school-boy')] ADV19000801-V02-08-page24.txt: [('liis', 'his')] ADV19000801-V02-08-page36.txt: [('li', 'h')] ADV19000801-V02-08-page38.txt: [('Orand', 'Grand')] ADV19000801-V02-08-page8.txt: [('majbe', 'maybe')] ADV19001001-V02-10-page14.txt: [('ORAW', 'GRAW')] ADV19001001-V02-10-page15.txt: [('saitli', 'saith')] ADV19001001-V02-10-page22.txt: [('Bok', 'Bor')] ADV19001001-V02-10-page28.txt: [('Miclielson', 'Michelson'), ('Povsippi', 'Poysippi')] ADV19001001-V02-10-page29.txt: [('Wliat', 'What')] ADV19001001-V02-10-page30.txt: [('Nortliport', 'Northport')] ADV19001001-V02-10-page32.txt: [('necessarv', 'necessary')] ADV19001001-V02-10-page37.txt: [('Cin', 'Cm'), ('Citv', 'City')] ADV19001001-V02-10-page38.txt: [('Ok', 'Or'), ('Orand', 'Grand')] ADV19001001-V02-10-page6.txt: [('faitli', 'faith'), ('alwav', 'alway'), ('availetli', 'availeth')] ADV19001001-V02-10-page9.txt: [('tained', 'tamed')] ADV19001101-V02-11-page10.txt: [('ejre', 'eyre'), ('praj', 'pray')] ADV19001101-V02-11-page11.txt: [('ined', 'med')] ADV19001101-V02-11-page13.txt: [('sucli', 'such')] ADV19001101-V02-11-page15.txt: [('Studv-Books', 'Study-Books'), ('Patinos', 'Patmos')] ADV19001101-V02-11-page17.txt: [('tlieir', 'their')] ADV19001101-V02-11-page19.txt: [('wliat', 'what')] ADV19001101-V02-11-page21.txt: [('clumsv', 'clumsy'), ('iny', 'my'), ('liab', 'hab')] ADV19001101-V02-11-page31.txt: [('iness', 'mess')] 
ADV19001101-V02-11-page34.txt: [('necessarv', 'necessary')] ADV19001101-V02-11-page35.txt: [('Trv', 'Try')] ADV19001101-V02-11-page36.txt: [('Kinlev', 'Kinley')] ADV19001101-V02-11-page38.txt: [('Mk', 'Mr')] ADV19001101-V02-11-page4.txt: [('bv', 'by')] ADV19001101-V02-11-page40.txt: [('Tv', 'Ty')] ADV19001201-V02-12-page13.txt: [('vears', 'years')] ADV19001201-V02-12-page14.txt: [('li', 'h')] ADV19001201-V02-12-page2.txt: [('Citv', 'City')] ADV19001201-V02-12-page22.txt: [('drv', 'dry')] ADV19001201-V02-12-page23.txt: [("saj'S", "say'S")] ADV19001201-V02-12-page27.txt: [('inoral', 'moral')] ADV19001201-V02-12-page3.txt: [('ve', 'ye')] ADV19001201-V02-12-page31.txt: [('Bok', 'Bor')] ADV19001201-V02-12-page35.txt: [('PRUIT', 'FRUIT')] ADV19001201-V02-12-page36.txt: [('vou', 'you'), ('drv-goods', 'dry-goods'), ('anv', 'any'), ('kindlv', 'kindly'), ('factorv', 'factory')] ADV19001201-V02-12-page37.txt: [('CIk', 'CIr')] ADV19010101-V03-01-page14.txt: [('tlie', 'the')] ADV19010101-V03-01-page18.txt: [('sjstem', 'system')] ADV19010101-V03-01-page19.txt: [('tlie', 'the'), ('Tike', 'Tire')] ADV19010101-V03-01-page2.txt: [('Wliat', 'What')] ADV19010101-V03-01-page22.txt: [('tlie', 'the')] ADV19010101-V03-01-page23.txt: [('boj', 'boy')] ADV19010101-V03-01-page24.txt: [('OU', 'GU')] ADV19010101-V03-01-page25.txt: [('ve', 'ye')] ADV19010101-V03-01-page31.txt: [('Thej', 'They')] ADV19010101-V03-01-page34.txt: [('MAOAZINE', 'MAGAZINE')] ADV19010101-V03-01-page37.txt: [('fev', 'fey'), ('tv', 'ty')] ADV19010101-V03-01-page38.txt: [('stvle', 'style')] ADV19010101-V03-01-page40.txt: [('tliis', 'this')] ADV19010101-V03-01-page7.txt: [('ok', 'or')] ADV19010201-V03-02-page12.txt: [('coine', 'come')] ADV19010201-V03-02-page16.txt: [('teacli', 'teach')] ADV19010201-V03-02-page17.txt: [('tliat', 'that')] ADV19010201-V03-02-page18.txt: [('singetli', 'singeth'), ('liow', 'how')] ADV19010201-V03-02-page20.txt: [('Thev', 'They')] ADV19010201-V03-02-page21.txt: [('Peeze', 'Feeze'), ('tliat', 
'that')] ADV19010201-V03-02-page22.txt: [('ve', 'ye')] ADV19010201-V03-02-page23.txt: [("lie'll", "he'll")] ADV19010201-V03-02-page32.txt: [('illy', 'iffy'), ('hvgienic', 'hygienic')] ADV19010201-V03-02-page34.txt: [("boj's", "boy's")] ADV19010201-V03-02-page35.txt: [('pavable', 'payable')] ADV19010201-V03-02-page36.txt: [("Oen'l", "Gen'l")] ADV19010201-V03-02-page38.txt: [('stvle', 'style'), ('batli', 'bath'), ('tlie', 'the')] ADV19010201-V03-02-page40.txt: [('Tlie', 'The')] ADV19010201-V03-02-page9.txt: [('thj', 'thy')] ADV19010301-V03-03-page10.txt: [('Tlie', 'The'), ('liis', 'his'), ('Thev', 'They'), ('saj', 'say')] ADV19010301-V03-03-page11.txt: [('doin', 'dom'), ('bj', 'by')] ADV19010301-V03-03-page33.txt: [('liow', 'how')] ADV19010301-V03-03-page8.txt: [('OP', 'OF'), ('energj', 'energy')] ADV19010401-V03-04-page17.txt: [('tlie', 'the')] ADV19010401-V03-04-page40.txt: [('Mk', 'Mr'), ('RUGOLES', 'RUGGLES')] ADV19010401-V03-04-page41.txt: [('OP', 'OF')] ADV19010501-V03-05-page2.txt: [('bv', 'by')] ADV19010501-V03-05-page20.txt: [('teacli', 'teach')] ADV19010501-V03-05-page23.txt: [('closelj', 'closely')] ADV19010501-V03-05-page28.txt: [('Bj', 'By')] ADV19010501-V03-05-page37.txt: [('Olt', 'Glt')] ADV19010501-V03-05-page38.txt: [('IP', 'IF'), ('Bj', 'By')] ADV19010501-V03-05-page8.txt: [('eje', 'eye')] ADV19010601-V03-06-page24.txt: [('TIIE', 'THE')] ADV19010601-V03-06-page33.txt: [('Mulberrv', 'Mulberry')] ADV19010601-V03-06-page35.txt: [('ik', 'ir')] ADV19010801-V03-07-page11.txt: [('jou', 'you')] ADV19010801-V03-07-page12.txt: [('Haugliey', 'Haughey')] ADV19010801-V03-07-page13.txt: [('Bok', 'Bor')] ADV19010801-V03-07-page23.txt: [('ork', 'orr')] ADV19010801-V03-07-page24.txt: [('ork', 'orr'), ('liis', 'his')] ADV19010801-V03-07-page34.txt: [('tv', 'ty')] ADV19010801-V03-07-page6.txt: [('slialt', 'shalt')] ADV19011001-V03-08-page13.txt: [('tlie', 'the')] ADV19011001-V03-08-page15.txt: [('ek', 'er')] ADV19011001-V03-08-page16.txt: [('Cliron', 'Chron')] 
ADV19011001-V03-08-page26.txt: [('slialt', 'shalt')] ADV19011001-V03-08-page27.txt: [('illy', 'iffy')] ADV19011001-V03-08-page28.txt: [('tlie', 'the')] ADV19011001-V03-08-page36.txt: [('Tliis', 'This')] ADV19011001-V03-08-page8.txt: [('OP', 'OF')] ADV19011101-V03-09-page16.txt: [('Sabbath-scliools', 'Sabbath-schools')] ADV19011101-V03-09-page19.txt: [('scliool', 'school')] ADV19011101-V03-09-page21.txt: [('coineth', 'cometh')] ADV19011101-V03-09-page25.txt: [('Bok', 'Bor')] ADV19011101-V03-09-page3.txt: [('healtli-reform', 'health-reform')] ADV19011101-V03-09-page31.txt: [('Purdliam', 'Purdham')] ADV19011101-V03-09-page33.txt: [('TIIE', 'THE'), ('bv', 'by')] ADV19011101-V03-09-page36.txt: [('clotli', 'cloth')] ADV19011101-V03-09-page4.txt: [('bj', 'by')] ADV19011101-V03-09-page5.txt: [('teacli', 'teach')] ADV19011101-V03-09-page7.txt: [('praj', 'pray')] ADV19011201-V03-10-page24.txt: [('mucli', 'much')] ADV19011201-V03-10-page29.txt: [('Wliat', 'What')] ADV19011201-V03-10-page32.txt: [('ork', 'orr')] ADV19011201-V03-10-page33.txt: [('bv', 'by')] ADV19011201-V03-10-page34.txt: [('Orand', 'Grand')] ADV19020101-V04-01-page10.txt: [('commonlj', 'commonly')] ADV19020101-V04-01-page12.txt: [('liis', 'his')] ADV19020101-V04-01-page16.txt: [('tliy', 'thy')] ADV19020101-V04-01-page2.txt: [('fej', 'fey')] ADV19020101-V04-01-page21.txt: [('hav', 'hay')] ADV19020101-V04-01-page25.txt: [('andj', 'andy'), ('CIk', 'CIr')] ADV19020101-V04-01-page30.txt: [('Rutliven', 'Ruthven')] ADV19020101-V04-01-page33.txt: [('Marv', 'Mary')] ADV19020101-V04-01-page34.txt: [('SPRINOS', 'SPRINGS')] ADV19020101-V04-01-page8.txt: [('wliich', 'which')] ADV19020201-V04-02-page1.txt: [('beO', 'beG')] ADV19020201-V04-02-page13.txt: [('liappy', 'happy')] ADV19020201-V04-02-page15.txt: [('liimself', 'himself'), ('needetli', 'needeth')] ADV19020201-V04-02-page2.txt: [('ork', 'orr')] ADV19020201-V04-02-page21.txt: [('Slieaf', 'Sheaf')] ADV19020201-V04-02-page33.txt: [('ork', 'orr')] 
ADV19020201-V04-02-page4.txt: [('illy', 'iffy')] ADV19020201-V04-02-page5.txt: [('dev', 'dey')] ADV19020301-V04-03-page15.txt: [('tained', 'tamed')] ADV19020301-V04-03-page17.txt: [('Primarj', 'Primary')] ADV19020301-V04-03-page2.txt: [('Summarv', 'Summary')] ADV19020301-V04-03-page32.txt: [('througli', 'through'), ('hoj', 'hoy')] ADV19020301-V04-03-page6.txt: [('bej', 'bey'), ('buoj', 'buoy')] ADV19020301-V04-03-page9.txt: [('Bok', 'Bor')] ADV19020401-V04-04-page11.txt: [('ve', 'ye'), ('eacli', 'each')] ADV19020401-V04-04-page16.txt: [('ork', 'orr'), ('lio', 'ho'), ('liad', 'had')] ADV19020401-V04-04-page2.txt: [('awav', 'away'), ('Missionarv', 'Missionary')] ADV19020401-V04-04-page20.txt: [('iPs', 'iFs')] ADV19020401-V04-04-page29.txt: [('compreliend', 'comprehend'), ('tlieir', 'their')] ADV19020401-V04-04-page30.txt: [("daj's", "day's")] ADV19020401-V04-04-page33.txt: [('ZiO', 'ZiG')] ADV19020501-V04-05-page1.txt: [('inus', 'mus')] ADV19020501-V04-05-page12.txt: [('manj', 'many')] ADV19020501-V04-05-page16.txt: [('ork', 'orr')] ADV19020501-V04-05-page2.txt: [('ork', 'orr')] ADV19020501-V04-05-page24.txt: [('ake', 'are')] ADV19020501-V04-05-page26.txt: [('OP', 'OF')] ADV19020501-V04-05-page27.txt: [('Seventli-day', 'Seventh-day')] ADV19020501-V04-05-page34.txt: [('Secretarv', 'Secretary')] ADV19020501-V04-05-page7.txt: [('tov', 'toy'), ('tlie', 'the')] ADV19020601-V04-06-page2.txt: [('Countrv', 'Country'), ('dav', 'day'), ('tlie', 'the')] ADV19020601-V04-06-page26.txt: [('ve', 'ye')] ADV19020601-V04-06-page33.txt: [('Snvder', 'Snyder'), ('Marv', 'Mary'), ('Quiinby', 'Quimby'), ('Bailev', 'Bailey')] ADV19020601-V04-06-page34.txt: [('peka', 'pera'), ('Secretarv', 'Secretary'), ('IStli', 'ISth')] ADV19020701-V04-07-page19.txt: [('lov', 'loy')] ADV19020701-V04-07-page2.txt: [('jk', 'jr')] ADV19020701-V04-07-page29.txt: [('dormitorj', 'dormitory'), ('inat', 'mat')] ADV19020701-V04-07-page32.txt: [('kee', 'ree')] ADV19020701-V04-07-page34.txt: [('OP', 'OF')] 
ADV19020701-V04-07-page4.txt: [('ver', 'yer')] ADV19020701-V04-07-page7.txt: [('tlie', 'the'), ('bj', 'by')] ADV19020801-V04-08-page11.txt: [('studj', 'study')] ADV19020801-V04-08-page14.txt: [('CIk', 'CIr')] ADV19020801-V04-08-page19.txt: [('ork', 'orr')] ADV19020801-V04-08-page2.txt: [('Svstem', 'System')] ADV19020801-V04-08-page34.txt: [('OP', 'OF'), ('Plioenix', 'Phoenix')] ADV19020801-V04-08-page36.txt: [('SPRINO', 'SPRING')] ADV19020801-V04-08-page8.txt: [('tained', 'tamed')] ADV19020901-V04-09-page10.txt: [('ork', 'orr')] ADV19020901-V04-09-page11.txt: [('illy', 'iffy')] ADV19020901-V04-09-page12.txt: [('Faitli', 'Faith'), ('anjone', 'anyone')] ADV19020901-V04-09-page13.txt: [('Thv', 'Thy')] ADV19020901-V04-09-page2.txt: [('Sabbatli', 'Sabbath'), ('TIIE', 'THE')] ADV19020901-V04-09-page20.txt: [('ve', 'ye')] ADV19020901-V04-09-page22.txt: [('Wliat', 'What')] ADV19020901-V04-09-page31.txt: [('inutes', 'mutes')] ADV19021001-V04-10-page12.txt: [('tliat', 'that')] ADV19021001-V04-10-page20.txt: [('Bok', 'Bor')] ADV19021001-V04-10-page24.txt: [('DeGkaw', 'DeGraw')] ADV19021001-V04-10-page25.txt: [('hav', 'hay')] ADV19021001-V04-10-page32.txt: [('Tlie', 'The')] ADV19021001-V04-10-page35.txt: [('liis', 'his')] ADV19021001-V04-10-page39.txt: [('tliem', 'them')] ADV19021001-V04-10-page41.txt: [('OP', 'OF')] ADV19021001-V04-10-page8.txt: [('historj', 'history')] ADV19021101-V04-11-page12.txt: [('tlie', 'the')] ADV19021101-V04-11-page2.txt: [('Jk', 'Jr'), ('mj', 'my'), ('tj', 'ty')] ADV19021101-V04-11-page22.txt: [('li', 'h')] ADV19021101-V04-11-page27.txt: [('missionarj', 'missionary')] ADV19021101-V04-11-page3.txt: [('oatli', 'oath')] ADV19021101-V04-11-page33.txt: [('OENERAL', 'GENERAL')] ADV19021101-V04-11-page8.txt: [('tained', 'tamed')] ADV19021201-V04-12-page13.txt: [('Christinas', 'Christmas')] ADV19021201-V04-12-page15.txt: [('one-lialf', 'one-half')] ADV19021201-V04-12-page18.txt: [('jouth', 'youth')] ADV19021201-V04-12-page24.txt: [('Tlie', 'The')] 
ADV19021201-V04-12-page26.txt: [("daj'S", "day'S")] ADV19021201-V04-12-page35.txt: [('Nuin', 'Num'), ('Lvle', 'Lyle'), ('ku', 'ru')] ADV19021201-V04-12-page36.txt: [('oO', 'oG')] ADV19021201-V04-12-page6.txt: [('aluinni', 'alumni')] ADV19030101-V05-01-page14.txt: [('bv', 'by')] ADV19030101-V05-01-page20.txt: [('bv', 'by')] ADV19030101-V05-01-page27.txt: [('jou', 'you')] ADV19030101-V05-01-page3.txt: [('Jehosapliat', 'Jehosaphat')] ADV19030101-V05-01-page30.txt: [('schoolliouse', 'schoolhouse')] ADV19030101-V05-01-page34.txt: [('ork', 'orr'), ('tains', 'tams'), ('Flovd', 'Floyd'), ('OP', 'OF')] ADV19030101-V05-01-page35.txt: [('Editli', 'Edith'), ('li', 'h')] ADV19030201-V05-02-page12.txt: [('tlie', 'the')] ADV19030201-V05-02-page19.txt: [('sliall', 'shall')] ADV19030201-V05-02-page2.txt: [('iny', 'my')] ADV19030201-V05-02-page21.txt: [('li', 'h')] ADV19030201-V05-02-page3.txt: [('tlie', 'the')] ADV19030201-V05-02-page30.txt: [('Teacheks', 'Teachers')] ADV19030201-V05-02-page34.txt: [('ork', 'orr'), ('Oreek', 'Greek'), ('Secretarv', 'Secretary')] ADV19030201-V05-02-page4.txt: [('mein', 'mem')] ADV19030201-V05-02-page8.txt: [('ork', 'orr')] ADV19030201-V05-02-page9.txt: [('wlio', 'who')] ADV19030301-V05-03-page18.txt: [('Ood', 'God')] ADV19030301-V05-03-page27.txt: [('Bok', 'Bor')] ADV19030301-V05-03-page31.txt: [('Cheritli', 'Cherith')] ADV19030301-V05-03-page35.txt: [('reallv', 'really')] ADV19030301-V05-03-page7.txt: [('OP', 'OF')] ADV19030401-V05-04-page20.txt: [('IIO', 'HO')] ADV19030401-V05-04-page21.txt: [('li', 'h')] ADV19030501-V05-05-page13.txt: [('vid', 'yid'), ('tained', 'tamed')] ADV19030501-V05-05-page26.txt: [('Pid', 'Fid')] ADV19030501-V05-05-page27.txt: [('coinest', 'comest')] ADV19030501-V05-05-page32.txt: [('ku', 'ru')] ADV19030501-V05-05-page33.txt: [('Bj', 'By')] ADV19030501-V05-05-page5.txt: [('OP', 'OF')] ADV19030501-V05-05-page8.txt: [('ve', 'ye')] ADV19030501-V05-05-page9.txt: [('Bok', 'Bor')] ADV19030601-V05-06-page10.txt: [('liad', 'had')] 
ADV19030601-V05-06-page14.txt: [('bv', 'by')] ADV19030601-V05-06-page22.txt: [('mak', 'mar')] ADV19030601-V05-06-page24.txt: [('iness', 'mess')] ADV19030601-V05-06-page28.txt: [('bj', 'by'), ('IIow', 'How')] ADV19030601-V05-06-page3.txt: [('hav', 'hay')] ADV19030601-V05-06-page30.txt: [('slovd', 'sloyd')] ADV19030601-V05-06-page33.txt: [('nearbv', 'nearby')] ADV19030601-V05-06-page4.txt: [('centurj', 'century')] ADV19030601-V05-06-page7.txt: [('twentj-five', 'twenty-five')] ADV19030701-V05-07-page28.txt: [('li', 'h')] ADV19030701-V05-07-page29.txt: [('ve', 'ye')] ADV19030801-V05-08-page12.txt: [('saj', 'say')] ADV19030801-V05-08-page35.txt: [('onlv', 'only')] ADV19030801-V05-08-page6.txt: [('mav', 'may'), ('Sucli', 'Such')] ADV19030801-V05-08-page8.txt: [("boj's", "boy's")] ADV19030901-V05-09-page1.txt: [('Oo', 'Go')] ADV19030901-V05-09-page17.txt: [('wav', 'way')] ADV19030901-V05-09-page18.txt: [('slialt', 'shalt'), ('glorv', 'glory')] ADV19030901-V05-09-page19.txt: [('seemetli', 'seemeth')] ADV19030901-V05-09-page2.txt: [('lieed', 'heed')] ADV19030901-V05-09-page24.txt: [('liis', 'his')] ADV19030901-V05-09-page27.txt: [('Bok', 'Bor')] ADV19030901-V05-09-page28.txt: [('slovd', 'sloyd')] ADV19030901-V05-09-page29.txt: [('liis', 'his')] ADV19030901-V05-09-page4.txt: [('pajment', 'payment')] ADV19031001-V05-10-page15.txt: [('manj', 'many')] ADV19031001-V05-10-page17.txt: [('vin', 'yin')] ADV19031001-V05-10-page2.txt: [('iny', 'my')] ADV19031001-V05-10-page23.txt: [('li', 'h')] ADV19031001-V05-10-page25.txt: [('daj', 'day')] ADV19031001-V05-10-page30.txt: [('easj', 'easy')] ADV19031001-V05-10-page32.txt: [('studj', 'study')] ADV19031001-V05-10-page35.txt: [('Citv', 'City')] ADV19031101-V05-11-page2.txt: [('bv', 'by')] ADV19031101-V05-11-page27.txt: [('liogs', 'hogs')] ADV19031101-V05-11-page31.txt: [('onlj', 'only'), ('inuch', 'much'), ('ak', 'ar')] ADV19031101-V05-11-page35.txt: [('Citv', 'City')] ADV19031101-V05-11-page9.txt: [('Everj', 'Every')] 
ADV19031201-V05-12-page12.txt: [('bj', 'by')] ADV19031201-V05-12-page26.txt: [('ake', 'are')] ADV19031201-V05-12-page28.txt: [('ork', 'orr')] ADV19031201-V05-12-page32.txt: [('Thajer', 'Thayer')] ADV19031201-V05-12-page35.txt: [('tlie', 'the')] ADV19040101-V06-01-page10.txt: [('missionarj', 'missionary')] ADV19040101-V06-01-page12.txt: [('OP', 'OF')] ADV19040101-V06-01-page20.txt: [('tained', 'tamed')] ADV19040301-V06-03-page1.txt: [('oO', 'oG')] ADV19040301-V06-03-page10.txt: [('tlie', 'the')] ADV19040401-V06-04-page12.txt: [('quietlj', 'quietly')] ADV19040401-V06-04-page19.txt: [('PAC', 'FAC')] ADV19040401-V06-04-page3.txt: [('IP', 'IF')] ADV19040501-V06-05-page13.txt: [('tlie', 'the')] ADV19040501-V06-05-page3.txt: [('Gkaw', 'Graw')] ADV19040501-V06-05-page4.txt: [('OP', 'OF')] ADV19040601-V06-06-page11.txt: [('OP', 'OF')] ADV19040601-V06-06-page13.txt: [("boj's", "boy's")] ADV19040601-V06-06-page18.txt: [('jear', 'year')] ADV19040701-V06-07-page11.txt: [('Sabbath-scliool', 'Sabbath-school')] ADV19040701-V06-07-page16.txt: [('OP', 'OF')] ADV19040701-V06-07-page3.txt: [('Gkaw', 'Graw')] ADV19040701-V06-07-page8.txt: [("daj'S", "day'S")] ADV19040801-V06-08-page10.txt: [('twentv', 'twenty')] ADV19040801-V06-08-page2.txt: [('Dk', 'Dr')] ADV19040801-V06-08-page8.txt: [('Aint', 'Amt')] ADV19040901-V06-09-page15.txt: [('thein', 'them')] ADV19040901-V06-09-page2.txt: [('froin', 'from'), ('studv', 'study')] ADV19040901-V06-09-page3.txt: [('Dk', 'Dr'), ('Suthekland', 'Sutherland')] ADV19040901-V06-09-page8.txt: [('TIIE', 'THE')] ADV19041001-V06-10-page19.txt: [('ine', 'me'), ('ok', 'or')] ADV19041001-V06-10-page6.txt: [('illy', 'iffy')] ADV19041001-V06-10-page8.txt: [('thoroughlj', 'thoroughly')] ADV19041101-V06-11-page1.txt: [('ork', 'orr')] ADV19041101-V06-11-page18.txt: [('ine', 'me')] ADV19041101-V06-11-page8.txt: [('kands', 'rands')] ADV19050101-V07-01-page11.txt: [('Missionarj', 'Missionary')] ADV19050101-V07-01-page19.txt: [('ine', 'me')] 
ADV19050101-V07-01-page2.txt: [('ine', 'me'), ('wlio', 'who'), ('Ik', 'Ir')] ADV19050101-V07-01-page8.txt: [('iness', 'mess')]
# %load shared_elements/summary.py
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction11
Average verified rate: 0.9679373657943816
Average of error rates: 0.04783913701741106
Total token count: 1254139
# %load shared_elements/top_errors.py
errors_summary = reports.get_errors_summary( summary )
reports.top_errors( errors_summary, 10 )[:50]
[('e', 3207), ('t', 2060), ('w', 2041), ('m', 1502), ('r', 1375), ('f', 1338), ('n', 1287), ("'", 1262), ('d', 916), ('g', 637), ('u', 504), ('k', 432), ('co', 353), ('x', 280), ('th', 226), ('z', 129), ('q', 112), ('fr', 92), ('ment', 89), ('tion', 80), ('re', 79), ('ofthe', 74), ('pp', 71), ('ers', 68), ('ex', 68), ('ft', 56), ('io', 55), ('il', 47), ('ry', 47), ('mo', 44), ('mt', 43), ('ky', 41), ('si', 39), ('oi', 38), ('bo', 36), ('ol', 34), ('ucation', 34), ('--', 33), ('es', 33), ('va', 32), ('se', 31), ('tbe', 30), ('dren', 30), ('al', 28), ('jt', 28), ('ga', 28), ('fi', 27), ('pa', 27), ('ma', 26), ('pm', 26)]
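The two averages in the summary above do not sum to 1 (0.9679 + 0.0478 is roughly 1.016), so they are probably not complements of one another. One plausible reading, and it is only an assumption because the internals of `reports.overview_report` are not shown here, is that the verified rate is pooled over every token in the corpus while the error rate is averaged page by page, so that short, error-heavy pages (titles, indexes, advertisements) pull the second figure up. The sketch below illustrates that difference with toy counts; the function names and the `docs` structure are hypothetical and are not part of `text2topics`.

```python
# Toy illustration only: the counts and function names are invented for this
# sketch and do not come from the ADV corpus or from text2topics.

def pooled_verified_rate(docs):
    """Share of verified tokens across the whole corpus (every token weighs the same)."""
    total = sum(d["tokens"] for d in docs)
    errors = sum(d["errors"] for d in docs)
    return (total - errors) / total


def mean_error_rate(docs):
    """Average of the per-document error rates (every document weighs the same)."""
    return sum(d["errors"] / d["tokens"] for d in docs) / len(docs)


docs = [
    {"tokens": 1000, "errors": 20},  # a long article page with few errors
    {"tokens": 50, "errors": 20},    # a short index or advertisement page
]

pooled = pooled_verified_rate(docs)
per_doc = mean_error_rate(docs)
print(pooled, per_doc, pooled + per_doc)
```

Under this reading, the two numbers are expected to drift apart whenever page length and error rate are correlated, which matches the observation that the worst pages are short title, index, and advertisement pages.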
Correction 12 -- Separate Squashed Words
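The cell that follows calls `clean.infer_spaces(token, wordcost, maxword)` with a cost table built from frequency ranks. The body of that function is not shown in this notebook, so the following is a minimal sketch of one common way to do this kind of segmentation, assuming a dynamic-programming search that prefers frequent words. `build_wordcost` and this `infer_spaces` are illustrative stand-ins, not the `text2topics` implementation, and the small vocabulary is invented for the example.

```python
from math import log


def build_wordcost(words_by_freq):
    """Assign each word a cost that grows with its frequency rank (rarer = costlier).

    Mirrors the expression used in the notebook cell: log((rank + 1) * log(n)).
    """
    n = len(words_by_freq)
    return {w: log((rank + 1) * log(n)) for rank, w in enumerate(words_by_freq)}


def infer_spaces(s, wordcost, maxword):
    """Split a run-together string into the cheapest sequence of known words.

    Assumption: lookups are case-insensitive and the original casing is kept
    in the output. Unknown substrings get infinite cost.
    """
    lowered = s.lower()
    unknown = float("inf")

    def best_match(i):
        # Consider every candidate final word lowered[j:i], up to maxword long,
        # and return (total cost, word length) for the cheapest one.
        start = max(0, i - maxword)
        return min(
            (cost[j] + wordcost.get(lowered[j:i], unknown), i - j)
            for j in range(start, i)
        )

    # Forward pass: cost[i] is the cheapest segmentation of the first i characters.
    cost = [0.0]
    for i in range(1, len(s) + 1):
        cost.append(best_match(i)[0])

    # Backward pass: recover the words that produced the cheapest total.
    pieces = []
    i = len(s)
    while i > 0:
        _, length = best_match(i)
        pieces.append(s[i - length:i])
        i -= length
    return " ".join(reversed(pieces))


# Hypothetical vocabulary, most frequent first.
vocab = ["the", "of", "to", "in", "a", "is", "will", "be",
         "open", "from", "office", "advocate", "number"]
wordcost = build_wordcost(vocab)
maxword = max(len(w) for w in vocab)

print(infer_spaces("officewillbeopenfrom", wordcost, maxword))
```

In this sketch, a stretch with no known segmentation falls back to single characters, which is why a separate check on the resulting pieces is still needed before accepting a split.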
# %load shared_elements/separate_squashed_words.py
import pandas as pd
from math import log

prev = cycle
cycle = "correction12"

directories = utilities.define_directories(prev, cycle, base_dir)
if not os.path.exists(directories['cycle']):
    os.makedirs(directories['cycle'])

# First pass: collect every verified token from the previous cycle.
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))

verified_tokens = []
for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)
    clean.get_approved_tokens(content, spelling_dictionary, verified_tokens)

# Rank verified tokens by frequency, keeping those seen more than twice.
tokens_with_freq = dict(collections.Counter(verified_tokens))
words = pd.DataFrame(list(tokens_with_freq.items()), columns=['token','freq'])
words_sorted = words.sort_values('freq', ascending=False)
words_sorted_short = words_sorted[words_sorted.freq > 2]

sorted_list_of_words = list(words_sorted_short['token'])

# Cost of each word grows with its frequency rank; maxword bounds the search window.
wordcost = dict((k, log((i+1)*log(len(sorted_list_of_words)))) for i,k in enumerate(sorted_list_of_words))
maxword = max(len(x) for x in sorted_list_of_words)

# Second pass: split long unverified tokens and write the corrected files.
corpus = (f for f in listdir(directories['prev']) if not f.startswith('.') and isfile(join(directories['prev'], f)))

for filename in corpus:
    content = utilities.readfile(directories['prev'], filename)

    text = utilities.strip_punct(content)
    tokens = utilities.tokenize_text(text)

    replacements = []

    for token in tokens:
        # Only consider unverified tokens longer than 17 characters.
        if token.lower() not in spelling_dictionary and len(token) > 17:
            # Skip tokens containing hyphens, apostrophes, or quotation marks.
            if re.search(r"[\-\-\'\"]", token):
                continue
            split_string = clean.infer_spaces(token, wordcost, maxword)
            list_split_string = split_string.split()
            # Accept the split only if the pieces pass verification.
            if clean.verify_split_string(list_split_string, spelling_dictionary):
                replacements.append((token, split_string))

    if len(replacements) > 0:
        print("{}: {}".format(filename, replacements))
        for replacement in replacements:
            content = clean.replace_pair(replacement, content)

    # The with-block closes the file, so no explicit close() is needed.
    with open(join(directories['cycle'], filename), mode="w") as o:
        o.write(content)
ADV18981201-V02-01-page9.txt: [('comeswillseetoitthattheyare', 'comes will see to it that they are')] ADV18990101-V01-01-page13.txt: [('ThearticleintheAdvocate', 'The article in the Advocate'), ('fromthepenofSister', 'from the pen of Sister')] ADV18990101-V01-01-page21.txt: [('asifyouwereworkingforyourlifetosave', 'as if you were working for your life to save')] ADV18990101-V01-01-page45.txt: [('officewillbeopenfrom', 'office will be open from')] ADV18990101-V01-01-page50.txt: [('VocalandInstrumentalMusic', 'Vocal and In st ru mental Music'), ('SecretaryofFaculty', 'Secretary of Faculty')] ADV18990201-V01-02-page62.txt: [('AddresscorrespondencetoThe', 'Address correspondence to The')] ADV18990201-V01-02-page63.txt: [('BattleGreekPreparatorySchool', 'Battle Greek Preparatory School'), ('IndustrialDepartment', 'Industrial Department'), ('SecretaryofFaculty', 'Secretary of Faculty')] ADV18990201-V01-02-page65.txt: [('Westfollowingcities', 'West following cities')] ADV18990301-V01-03-page36.txt: [('yousitatthecloseofa', 'you sit at the close of a')] ADV18990301-V01-03-page57.txt: [('WhiletheMarchAdvocate', 'While the March Advocate'), ('hasmuchtosayonthework', 'has much to say on the work')] ADV18990301-V01-03-page58.txt: [('isdesignedespeciallytotrainyoung', 'is designed especially to train young'), ('andfollowinghisexample', 'and following his example')] ADV18990301-V01-03-page61.txt: [('officewillbeopenfrom', 'office will be open from')] ADV18990301-V01-03-page66.txt: [('VocalandInstrumentalMusic', 'Vocal and In st ru mental Music'), ('NaturalScienceandMathematics', 'Natural Science and Mathematics'), ('SecretaryofFaculty', 'Secretary of Faculty')] ADV18990401-V01-04-page39.txt: [('IntheJanuarynumberoftheAdvocate', 'In the January number of the Advocate')] ADV18990401-V01-04-page45.txt: [('previouslessonsinthe', 'previous lessons in the')] ADV18990401-V01-04-page61.txt: [('IndustrialDepartment', 'Industrial Department'), ('SecretaryofFaculty', 'Secretary of 
Faculty'), ('VocalandInstrumentalMusic', 'Vocal and In st ru mental Music')] ADV18990501-V01-05-page18.txt: [('amidacrowdthatthrongedthedailymart', 'amid a crowd that throng ed the daily mar t')] ADV18990501-V01-05-page20.txt: [('Inordertomaketheworkofthemost', 'In order to make the work of the most')] ADV18990501-V01-05-page42.txt: [('TheMaynumberoftheAdvocate', 'The May number of the Advocate')] ADV18990501-V01-05-page48.txt: [('TheAprilnumberoftheAdvocate', 'The April number of the Advocate')] ADV18990501-V01-05-page9.txt: [('whereshallwefindit', 'where shall we find it')] ADV18990601-V01-06-page118.txt: [('theonlytruehappiness', 'the only true happiness')] ADV18990601-V01-06-page122.txt: [('willbegivenforalimitednumberofstudents', 'will be given for a limited number of students'), ('iscalledtothefirstoftheAdvocate', 'is called to the first of the Advocate')] ADV18990601-V01-06-page127.txt: [('HaveYouReadtheCalendarNumberoftheAdvocate', 'Have You Read the Calendar Number of the Advocate')] ADV18990601-V01-06-page131.txt: [('Whatisfairtooneisfairtoanother', 'What is fair to one is fair to another')] ADV18990601-V01-06-page24.txt: [('topassagoodexaminationfor', 'to pass a good examination for')] ADV18990601-V01-06-page88.txt: [('totimeintheAdvocate', 'to time in the Advocate')] ADV18990701-V01-07-page10.txt: [('workandserviceofSatan', 'work and service of Satan')] ADV18990701-V01-07-page13.txt: [('reasonwhytherearesomany', 'reason why there are so many')] ADV18990701-V01-07-page14.txt: [('householdwillbeturnedtooneanother', 'household will be turned to one another')] ADV18990701-V01-07-page2.txt: [('PreparationofChurchesforChurchSchools', 'Preparation of Churches for Church Schools')] ADV18990701-V01-07-page20.txt: [('willsuggestlivetopicsforseminars', 'will suggest live topics for semi n a r s')] ADV18990701-V01-07-page23.txt: [('Theteachersdutytowardparentsand', 'The teachers duty toward parents and')] ADV18990701-V01-07-page27.txt: 
[('theyarewillingtostepbyfaith', 'they are willing to step by faith')] ADV18990701-V01-07-page29.txt: [('bemeasuredtoyouagain', 'be measured to you again')] ADV18990701-V01-07-page33.txt: [('writetotheTraining', 'write to the Training'), ('properpreparationsaremadeforthechurchschool', 'proper preparations are made for the church school')] ADV18990701-V01-07-page36.txt: [('theexperienceoftheteacherandsizeoftheschool', 'the experience of the teacher and size of the school')] ADV18990701-V01-07-page37.txt: [('Christianbusinessmanshouldlookafter', 'Christian business man should look after')] ADV18990701-V01-07-page38.txt: [('Nevercananirreligiousschool', 'Never can an ir religious school'), ('Icannotseethatyouhave', 'I cannot see that you have'), ('Itisamostsacredlegacysealedforeverbydivine', 'It is a most sacred legacy sealed forever by divine')] ADV18990701-V01-07-page43.txt: [('yeoftheLordraininthetimeofthelatter', 'ye of the Lord rain in the time of the latter')] ADV18990701-V01-07-page49.txt: [('iswisdomforhimwhoholdstheplow', 'is wisdom for him who holds the plow')] ADV18990701-V01-07-page5.txt: [('goodsstoreinBattleCreek', 'goods store in Battle Creek'), ('becausewesellforcashonly', 'because we sell for cash only')] ADV18990701-V01-07-page59.txt: [('seasonedwithashesandflavoredwithsmoke', 'season ed with ashes and flavor ed with smoke')] ADV18990701-V01-07-page62.txt: [('willbegivenforalimitednumberofstudentsto', 'will be given for a limited number of students to')] ADV18990701-V01-07-page63.txt: [('amonthlymagazinefilledwithuseful', 'a monthly magazine filled with useful')] ADV18990701-V01-07-page65.txt: [('Foolsalonecompletetheireducation', 'Fools alone complete their education')] ADV18990701-V01-07-page66.txt: [('ROYALBLUEBRANDCANNEDGOODSisthefinest', 'ROYAL BLUE BRAND CANNED GOODS is the finest'), ('Everytinguaranteed', 'Every tin guaranteed'), ('andpricesaremoderate', 'and prices are moderate')] ADV18990701-V01-07-page67.txt: 
[('Affordsexcellentfacilitiesforyouraccomodation', 'Affords excellent facilities for your a c com o d a t i o n')] ADV18990701-V01-07-page8.txt: [('Themindwillbeofthesame', 'The mind will be of the same'), ('Thequestionwhichshoulddecidewhether', 'The question which should decide whether')] ADV18991001-V01-09-page17.txt: [('reasonstheAdvocate', 'reasons the Advocate')] ADV18991001-V01-09-page23.txt: [('cameintoexistencewhenastrongspirit', 'came into existence when a strong spirit')] ADV18991001-V01-09-page24.txt: [('missionistospreadthetruthsofthat', 'mission is to spread the truths of that')] ADV18991001-V01-09-page38.txt: [('factthattheOctobernumberoftheAdvocate', 'fact that the October number of the Advocate')] ADV18991001-V01-09-page41.txt: [('delayintheappearanceofthisnumberoftheAdvocateis', 'delay in the appearance of this number of the Advocate is')] ADV18991101-V01-10-page23.txt: [('whatisthenextthingneeded', 'what is the next thing needed')] ADV18991101-V01-10-page33.txt: [('Christianeducationis', 'Christian education is')] ADV19000101-V02-01-page11.txt: [('Howmanyeducatedmothers', 'How many educated mothers')] ADV19000101-V02-01-page14.txt: [('InthefuturetheAdvocate', 'In the future the Advocate')] ADV19000101-V02-01-page17.txt: [('InthefuturetheAdvocate', 'In the future the Advocate')] ADV19000101-V02-01-page24.txt: [('verysensitivetotheinfluenceoflight', 'very sensitive to the influence of light'), ('bethoroughlyfamiliar', 'be thoroughly familiar'), ('lastnumberoftheAdvocate', 'last number of the Advocate')] ADV19000101-V02-01-page28.txt: [('parentstocomeinandvisitusthatday', 'parents to come in and visit us that day')] ADV19000101-V02-01-page29.txt: [('Noticethelabelonthewrapper', 'Notice the label on the wrapper'), ('ClubbedwiththeAdvocate', 'Club bed with the Advocate')] ADV19000201-V02-02-page14.txt: [('thisnumberoftheAdvocate', 'this number of the Advocate')] ADV19000201-V02-02-page19.txt: [('addressingtheCollege', 'addressing the College'), 
('somemanwhochangesposition', 'some man who changes position')] ADV19000201-V02-02-page20.txt: [('noeasywaysthatlead', 'no easy ways that lead'), ('Buthethatstrivesin', 'But he that strives in')] ADV19000201-V02-02-page22.txt: [('DidGodgivehimasign', 'Did God give him a sign'), ('didhehavetowaitsolong', 'did he have to wait so long')] ADV19000201-V02-02-page23.txt: [('dayarefortunateinbeing', 'day are fortunate in being')] ADV19000201-V02-02-page25.txt: [('isnothingIenjoyquitesomuch', 'is nothing I enjoy quite so much'), ('amnowintheplacewherethe', 'am now in the place where the'), ('powerwhichisfarbettercompen', 'power which is far better com pen')] ADV19000201-V02-02-page27.txt: [('madebythechurchesofMichi', 'made by the churches of Mich i')] ADV19000201-V02-02-page28.txt: [('toeveryoneinterestedintheprin', 'to everyone interested in the p r i n')] ADV19000201-V02-02-page29.txt: [('readersoftheAdvocate', 'readers of the Advocate')] ADV19000201-V02-02-page3.txt: [('childrenshouldbeleftasfreeas', 'children should be left as free as')] ADV19000201-V02-02-page6.txt: [('wherethenecessitiesofthefamilycallforthe', 'where the necessities of the family call for the')] ADV19000301-V02-03-page20.txt: [('aboybadafterkilling', 'a boy bad after killing')] ADV19000301-V02-03-page26.txt: [('donotrealizehowfartheirchildren', 'do not realize how far their children'), ('thestrongermyfaith', 'the stronger my faith'), ('thethoughtcomestome', 'the thought comes to me')] ADV19000301-V02-03-page28.txt: [('greatinterestinthislineof', 'great interest in this line of')] ADV19000301-V02-03-page29.txt: [('takespleasureincallingthe', 'takes pleasure in calling the')] ADV19000301-V02-03-page30.txt: [('PARKERJOINTLESSFOUNTAINPEN', 'PARKER JOINTLESS FOUNTAIN PEN'), ('Forfurtherinformationaddress', 'For further information address')] ADV19000301-V02-03-page36.txt: [('TailoringDepartment', 'Tailoring Department')] ADV19000301-V02-03-page37.txt: [('CbeBattleCreekCollegeBookstandi', 'C be Battle 
Creek College Bookstand i')] ADV19000401-V02-04-page10.txt: [('Churcheswhichhavehadschoolsinthe', 'Churches which have had schools in the')] ADV19000401-V02-04-page11.txt: [('organofthechurchschools', 'organ of the church schools')] ADV19000401-V02-04-page2.txt: [('NowILayMeDowntoSleep', 'Now I Lay Me Down to Sleep'), ('TheMotherasaTeacher', 'The Mother as a Teacher')] ADV19000401-V02-04-page20.txt: [('Andastheshadowsroundmecreep', 'And as the shadows round me creep')] ADV19000401-V02-04-page33.txt: [('requiremuchsympathy', 'require much sympathy')] ADV19000401-V02-04-page45.txt: [('publishersoftheAdvocate', 'publishers of the Advocate')] ADV19000401-V02-04-page46.txt: [('matterbearinguponthesubject', 'matter bearing upon the subject')] ADV19000401-V02-04-page47.txt: [('Forfurtherinformationaddress', 'For further information address')] ADV19000401-V02-04-page48.txt: [('Forfurtherinformationaddress', 'For further information address')] ADV19000401-V02-04-page49.txt: [('PhysiologyandHygiene', 'Physiology and Hygiene'), ('PrinciplesandMethodsofChristianWork', 'Principles and Methods of Christian Work')] ADV19000401-V02-04-page50.txt: [('BibleHygieneandTreatmentofDiseases', 'Bible Hygiene and Treatment of Diseases'), ('PrinciplesandMethodsofChristianWork', 'Principles and Methods of Christian Work')] ADV19000401-V02-04-page61.txt: [('CorrespondingSolicited', 'Correspond ing Solicited')] ADV19000401-V02-04-page62.txt: [('CorrespondenceSolicited', 'Correspondence Solicited')] ADV19000401-V02-04-page64.txt: [('Wemanufacturethefamous', 'We manufacture the famous')] ADV19000401-V02-04-page66.txt: [('STANDARDTYPEWRITERCOMPANY', 'STANDARD TYPEWRITER COMPANY')] ADV19000501-V02-05-page18.txt: [('ThechurchatWolfLake', 'The church at Wolf Lake')] ADV19000501-V02-05-page26.txt: [('Uptothepresenttimeourchurchmem', 'Up to the present time our church mem')] ADV19000501-V02-05-page29.txt: [('andwhatisthepriceofthe', 'and what is the price of the')] ADV19000501-V02-05-page31.txt: 
[('Forfurtherinformationaddress', 'For further information address')] ADV19000601-V02-06-page13.txt: [('BooksforChristianschools', 'Books for Christian schools')] ADV19000601-V02-06-page2.txt: [('AWordfortheCaterpillar', 'A Word for the C ater pillar')] ADV19000601-V02-06-page22.txt: [('Nowisthehightideoftheyear', 'Now is the high tide of the year')] ADV19000601-V02-06-page25.txt: [('ofthelastissueoftheAdvocate', 'of the last issue of the Advocate'), ('whenthewholechurchworktogetherin', 'when the whole church work together in')] ADV19000601-V02-06-page27.txt: [('leftSundayeveningfor', 'left Sunday evening for'), ('thededicationofWoodland', 'the dedication of Woodland'), ('ofthedeathofMissFrancesWright', 'of the death of Miss Frances Wright')] ADV19000601-V02-06-page28.txt: [('graduatesofthedepartmentof', 'graduates of the department of')] ADV19000601-V02-06-page29.txt: [('IfyoureceivetheAdvocate', 'If you receive the Advocate')] ADV19000601-V02-06-page31.txt: [('UnitedStateswhichhas', 'United States which has')] ADV19000601-V02-06-page37.txt: [('RailroadAgentintheUnitedStatesorCanada', 'Railroad Agent in the United States or Canada')] ADV19000601-V02-06-page6.txt: [('thatthatisthewaytogo', 'that that is the way to go')] ADV19000701-V02-07-page12.txt: [('Godhasentrustedtoourcarea', 'God has entrusted to our care a')] ADV19000701-V02-07-page14.txt: [('lesshehimselfiseducated', 'less he himself is educated')] ADV19000701-V02-07-page15.txt: [('individualcloserandclosertoGodisthe', 'individual closer and closer to God is the')] ADV19000701-V02-07-page23.txt: [('acresweredeededtothe', 'acres were deed ed to the')] ADV19000701-V02-07-page24.txt: [('theheartofthechild', 'the heart of the child')] ADV19000701-V02-07-page27.txt: [('PresidentoftheIllinois', 'President of the Illinois')] ADV19000701-V02-07-page28.txt: [('thattheeditorsoftheAdvocate', 'that the editors of the Advocate')] ADV19000701-V02-07-page30.txt: [('isjustthepaperforthisplace', 'is just the paper for this 
place')] ADV19000701-V02-07-page33.txt: [('CheBattleCreekCollegeBookstand', 'Che Battle Creek College Bookstand')] ADV19000701-V02-07-page8.txt: [('hasneededandstillneeds', 'has needed and still needs')] ADV19000801-V02-08-page12.txt: [('PrincipalofCentral', 'Principal of Central')] ADV19000801-V02-08-page25.txt: [('PrincipaloftheSouth', 'Principal of the South')] ADV19000801-V02-08-page29.txt: [('theSeptemberissueoftheAdvocate', 'the September issue of the Advocate'), ('youreceivetheAdvocate', 'you receive the Advocate')] ADV19000801-V02-08-page33.txt: [('resultofeducationalprinciplesP', 'result of educational principles P')] ADV19000801-V02-08-page35.txt: [('carepreparedtofurnishcollegesand', 'care prepared to furnish colleges and')] ADV19000801-V02-08-page37.txt: [('CbeBattleCreekCollegeBookstand', 'C be Battle Creek College Bookstand')] ADV19000801-V02-08-page4.txt: [('deedsareheldinhonor', 'deeds are held in honor')] ADV19000801-V02-08-page6.txt: [('theIndependentofJuly', 'the Independent of July')] ADV19001001-V02-10-page15.txt: [('TheeditoroftheAdvocate', 'The editor of the Advocate')] ADV19001001-V02-10-page16.txt: [('necessarilyattended', 'necessarily attended'), ('chosensuperintendent', 'chosen superintendent')] ADV19001001-V02-10-page19.txt: [('steppingstonetotheministry', 'steppingstone to the ministry')] ADV19001001-V02-10-page2.txt: [('BooksforOurSchools', 'Books for Our Schools')] ADV19001001-V02-10-page24.txt: [('whohathledwilllead', 'who hath led will lead')] ADV19001001-V02-10-page26.txt: [('Thereareseveralpointstobeconsidered', 'There are several points to be considered')] ADV19001001-V02-10-page29.txt: [('youreceivetheAdvocate', 'you receive the Advocate'), ('assistanteditorSigns', 'assistant editor Signs'), ('IhavetakentheAdvocate', 'I have taken the Advocate')] ADV19001001-V02-10-page31.txt: [('ThereadersoftheAdvocate', 'The readers of the Advocate')] ADV19001001-V02-10-page33.txt: [('ofeducationalprinciples', 'of educational principles')] 
ADV19001101-V02-11-page13.txt: [('butifattheendofanhourtherewasa', 'but if at the end of an hour there was a')] ADV19001101-V02-11-page18.txt: [('Whatareyougoingtodowiththem', 'What are you going to do with them')] ADV19001101-V02-11-page26.txt: [('inginmanywaystosecuregoodorderand', 'ing in many ways to secure good order and')] ADV19001101-V02-11-page29.txt: [('reportsanattendance', 'reports an attendance'), ('hasgeneraloversightof', 'has general oversight of')] ADV19001101-V02-11-page31.txt: [('youreceivetheAdvocate', 'you receive the Advocate'), ('addressingtheAdvocate', 'addressing the Advocate'), ('PleasealwaysmentiontheAdvocateif', 'Please always mention the Advocate if')] ADV19001101-V02-11-page4.txt: [('youthmaybemembersofthe', 'youth may be members of the')] ADV19001201-V02-12-page16.txt: [('contaminatinginfluences', 'contaminating influences')] ADV19001201-V02-12-page2.txt: [('WhatShalltheTeacherTeach', 'What Shall the Teacher Teach'), ('AReviewoptheChurchSchoolWork', 'A Review o p the Church School Work')] ADV19001201-V02-12-page26.txt: [('greatthingintheworld', 'great thing in the world')] ADV19001201-V02-12-page28.txt: [('comestotelltheAdvocatefamilyofthe', 'comes to tell the Advocate family of the')] ADV19001201-V02-12-page30.txt: [('Campwasobligedtogiveupher', 'Camp was obliged to give up her')] ADV19001201-V02-12-page32.txt: [('Advocateandarenotalready', 'Advocate and are not already')] ADV19001201-V02-12-page33.txt: [('oftheunfortunatevictimsofdrughabits', 'of the unfortunate victims of drug habits')] ADV19001201-V02-12-page35.txt: [('mearepreparedtofurnish', 'me are prepared to furnish')] ADV19001201-V02-12-page37.txt: [('BattleCreekCollegeBookstand', 'Battle Creek College Bookstand')] ADV19010101-V03-01-page14.txt: [('thisnumbertheTrainino', 'this number the Train in o')] ADV19010101-V03-01-page20.txt: [('theywillfinditaverygreatblessing', 'they will find it a very great blessing')] ADV19010101-V03-01-page3.txt: [('TheAledoChurchSchool', 'The 
Aledo Church School')] ADV19010101-V03-01-page30.txt: [('thatthereadersoftheAdvocate', 'that the readers of the Advocate')] ADV19010101-V03-01-page33.txt: [('forateachertoconduct', 'for a teacher to conduct'), ('editoroftheAdvocate', 'editor of the Advocate')] ADV19010101-V03-01-page34.txt: [('IfyoureceivetheAdvocate', 'If you receive the Advocate'), ('hasbeenmadebySisterS', 'has been made by Sister S'), ('respondwiththeeditorofthe', 'respond with the editor of the'), ('PleasealwaysmentiontheAdvocate', 'Please always mention the Advocate'), ('Onebrotherhasgivenfiveacresof', 'One brother has given five acres of')] ADV19010101-V03-01-page35.txt: [('shouldhavetheTraining', 'should have the Training'), ('ontheKansasCityline', 'on the Kansas City line')] ADV19010101-V03-01-page38.txt: [('PARKERJOINTLESSFOUNTAINPEN', 'PARKER JOINTLESS FOUNTAIN PEN'), ('theBattleCreekCollegeBookstand', 'the Battle Creek College Bookstand')] ADV19010101-V03-01-page7.txt: [('thelifeoftheindividual', 'the life of the individual'), ('Thereasonswhyphysicaltrainingshould', 'The reasons why physical training should')] ADV19010201-V03-02-page15.txt: [('ThereadersoftheAdvocate', 'The readers of the Advocate')] ADV19010201-V03-02-page24.txt: [('wasneveradaysomistyandgray', 'was never a day so mist y and gray')] ADV19010201-V03-02-page27.txt: [('aknowledgeofhiswotks', 'a knowledge of his w o t k s')] ADV19010201-V03-02-page28.txt: [('thylifebylossinsteadofgain', 'thy life by loss instead of gain')] ADV19010201-V03-02-page3.txt: [('andEducationasaMeansofReform', 'and Education as a Means of Reform'), ('AnotherBandofMercyBoy', 'Another B and of Mercy Boy'), ('TheSheridanIndustrial', 'The Sheridan Industrial')] ADV19010201-V03-02-page31.txt: [('areplanningtotakeaday', 'are planning to take a day')] ADV19010201-V03-02-page34.txt: [('coursesinagricultureareto', 'courses in agriculture are to')] ADV19010201-V03-02-page38.txt: [('makerandaffordsmorereliefi', 'maker and affords more relief i'), 
('thancanbederivedfromgallons', 'than can be derived from gallons')] ADV19010201-V03-02-page4.txt: [('paperissuedfromourpresses', 'paper issued from our presses')] ADV19010301-V03-03-page10.txt: [('Hehadthreesonsandagrandson', 'He had three sons and a grandson'), ('madenomistakeintheexampleinthegar', 'made no mistake in the example in the gar'), ('arguethatthechildislearningallthetime', 'argue that the child is learning all the time'), ('andisjustaswelloffinschoolasanywhere', 'and is just as well off in school as anywhere'), ('beinstilledintotheseyoung', 'be instilled into these young')] ADV19010301-V03-03-page12.txt: [('andmoreofthethreeR', 'and more of the three R')] ADV19010301-V03-03-page13.txt: [('hasbeentryingtoset', 'has been trying to set')] ADV19010301-V03-03-page15.txt: [('amanthinkethinhisheart', 'a man thinketh in his heart')] ADV19010301-V03-03-page16.txt: [('Andtohimonewhohasnotpassed', 'And to him one who has not passed'), ('coursearethefoundation', 'course are the foundation')] ADV19010301-V03-03-page18.txt: [('dentsareassignedtoupper', 'dents are assigned to upper'), ('havebeenacquaintedbeforeenteringthe', 'have been acquainted before entering the'), ('clothestothelaundry', 'clothes to the laundry'), ('thanthesystemoffightinginvoguewith', 'than the system of fighting in vogue with')] ADV19010301-V03-03-page2.txt: [('secureonewhilethesup', 'secure one while the sup')] ADV19010301-V03-03-page20.txt: [('Letusfollowthemandsee', 'Let us follow them and see'), ('theywerenotredaswehadsupposed', 'they were not red as we had supposed')] ADV19010301-V03-03-page22.txt: [('wordssomebeautifulstoriesaboutthegar', 'words some beautiful stories about the gar'), ('readintheBibleReader', 'read in the Bible Reader'), ('theyalltoldmetheywantedJesustocome', 'they all told me they wanted Jesus to come'), ('Afterthereadinglessonwasover', 'After the reading lesson was over'), ('thechurchschoolsforthem', 'the church schools for them'), ('Tellthemnottokilltheanimals', 'Tell 
them not to kill the animals'), ('butIhavenotroomenough', 'but I have not room enough')] ADV19010301-V03-03-page23.txt: [('eveningtowardthecloseof', 'evening toward the close of'), ('whileUnionsoldierslayin', 'while Union soldiers lay in'), ('camponahillsidenear', 'camp on a hillside near')] ADV19010301-V03-03-page24.txt: [('WhydidthebeastsloveAdam', 'Why did the beasts love Adam'), ('WhatdidGodmakeonthefifth', 'What did God make on the fifth'), ('WhydidGodmakethelight', 'Why did God make the light'), ('fourandahalfincheslongandthreeincheswide', 'four and a half inches long and three inches wide')] ADV19010301-V03-03-page25.txt: [('calledustoapartintheeducationalwork', 'called us to apart in the educational work')] ADV19010301-V03-03-page27.txt: [('normalclassistaughthowtomakeablack', 'normal class is taught how to make a black')] ADV19010301-V03-03-page28.txt: [('dotoadvancethecause', 'do to advance the cause')] ADV19010301-V03-03-page29.txt: [('ofthereadersoftheAdvocatetoasample', 'of the readers of the Advocate to a sample'), ('Thisisoneofseveral', 'This is one of several')] ADV19010301-V03-03-page3.txt: [('CauseofWeaknessintheSabbath', 'Cause of Weakness in the Sabbath'), ('TheJuniataIndustrialSchool', 'The Juniata Industrial School'), ('CedarLakeIndustrialSchool', 'Cedar Lake Industrial School')] ADV19010301-V03-03-page30.txt: [('ThisletterwaswrittenbyAbra', 'This letter was written by A bra'), ('astonishedgazeatthewriter', 'astonished gaze at the writer'), ('IsthereanythingIcandoforyou', 'Is there anything I can do for you'), ('Youmightwritealettertomymother', 'You might write a letter to my mother'), ('Manyadarkandcloudymombringsabrightand', 'Many a dark and cloudy mom brings a bright and')] ADV19010301-V03-03-page31.txt: [('SurelyiftheSpiritofthe', 'Surely if the Spirit of the')] ADV19010301-V03-03-page32.txt: [('ofstudentsduringthelast', 'of students during the last')] ADV19010301-V03-03-page33.txt: [('aresevenintheschool', 'are seven in the school'), 
('arithmeticsbeready', 'arithmetics be ready')] ADV19010301-V03-03-page34.txt: [('thatreadsinitthinksitsuchadearbook', 'that reads in it thinks it such a dear book'), ('Ourdailyattendanceisnow', 'Our daily attendance is now')] ADV19010301-V03-03-page35.txt: [('hassincebeenamuchbetterboy', 'has since been a much better boy'), ('Ourschoolisprogressingnicely', 'Our school is progressing nicely'), ('insteadofreadingtheAdvocate', 'instead of reading the Advocate')] ADV19010301-V03-03-page36.txt: [('inMississippiwantsachurch', 'in Mississippi wants a church'), ('LandischeapintheSouth', 'Land is cheap in the South')] ADV19010301-V03-03-page37.txt: [('boardinghousewhichwouldtoleratethe', 'boarding house which would to le rate the'), ('Takethehousingofthestudents', 'Take the ho using of the students'), ('readinthepublicschools', 'read in the public schools')] ADV19010301-V03-03-page38.txt: [('sendyouaclubofAdvocates', 'send you a club of Advocates')] ADV19010301-V03-03-page39.txt: [('foroneyearandpamphlet', 'for one year and pamphlet'), ('ReadersfortheChildren', 'Readers for the Children'), ('SendfourcentsinstampstotheTr', 'Send four cents in stamps to the T r'), ('andthreecopiesofthe', 'and three copies of the'), ('appearedinthecolumnsofthe', 'appeared in the columns of the'), ('Itcombinestheveryfeatures', 'It combine s the very features'), ('Placeacopyinthehandsof', 'Place a copy in the hands of'), ('AdvocateshouldgointoeverySeventhmonths', 'Advocate should go into every Seventh months'), ('subscriptionpriceofthepaper', 'subscription price of the paper'), ('asyoucannotaffordtomissthevisitsofthe', 'as you cannot afford to miss the visits of the')] ADV19010301-V03-03-page4.txt: [('paperissuedfromourpresses', 'paper issued from our presses')] ADV19010301-V03-03-page40.txt: [('AddresstheAdvocate', 'Address the Advocate')] ADV19010301-V03-03-page6.txt: [('thecourageandhopetoeducatethemtotill', 'the courage and hope to educate them to till'), ('methodsofworkingthesoil', 
'methods of working the soil')] ADV19010301-V03-03-page7.txt: [('SpecialTestimonies', 'Special Testimonies'), ('toteachpropermethodstotheyouth', 'to teach proper methods to the youth'), ('oneofthedepartmentsofagricultureandone', 'one of the departments of agriculture and one')] ADV19010301-V03-03-page8.txt: [('oldhewillnotdepartfromit', 'old he will not depart from it')] ADV19010301-V03-03-page9.txt: [('sonsmadethemselvesvile', 'sons made themselves vile')] ADV19010401-V03-04-page12.txt: [('inthePracticalEducator', 'in the Practical Educator')] ADV19010401-V03-04-page14.txt: [('studyofmankindisGod', 'study of mankind is God')] ADV19010401-V03-04-page18.txt: [('willconductaninstitute', 'will conduct an institute')] ADV19010401-V03-04-page20.txt: [('thepeopleofthenorth', 'the people of the north')] ADV19010401-V03-04-page27.txt: [('Ittookhimonlyamoment', 'It took him only a moment')] ADV19010401-V03-04-page34.txt: [('fifteencopiesoftheJanuaryAdvocate', 'fifteen copies of the January Advocate')] ADV19010401-V03-04-page37.txt: [('mentionedintheAdvocate', 'mentioned in the Advocate')] ADV19010401-V03-04-page38.txt: [('andtheirattentioniscalledtoChristian', 'and their attention is called to Christian')] ADV19010401-V03-04-page6.txt: [('theOnewhoisthetrueLight', 'the One who is the true Light')] ADV19010501-V03-05-page23.txt: [('firsttrainleavesat', 'first train leaves at'), ('thenexttrainstarts', 'the next train starts'), ('Keepwatchofthepassengers', 'Keep watch of the passengers')] ADV19010501-V03-05-page24.txt: [('wanttosendyouafewincidents', 'want to send you a few incidents')] ADV19010501-V03-05-page28.txt: [('butthatweshouldobservethelawofcolors', 'but that we should observe the law of colors')] ADV19010501-V03-05-page33.txt: [('thingfortheAdvocate', 'thing for the Advocate')] ADV19010501-V03-05-page34.txt: [('oftheHealdsburgCol', 'of the Healdsburg Col'), ('affordtobewithoutthe', 'afford to be without the'), ('aformerstudentofKeene', 'a former student of Keene')] 
ADV19010501-V03-05-page35.txt: [('weresenttoalargenumberofour', 'were sent to a large number of our')] ADV19010501-V03-05-page4.txt: [('walltoprintthischapter', 'wall to print this chapter')] ADV19010601-V03-06-page15.txt: [('Asystemofeducationwithoutfaithand', 'A system of education without faith and')] ADV19010601-V03-06-page21.txt: [('Youareamemberofthechurch', 'You are a member of the church')] ADV19010601-V03-06-page22.txt: [('andassignreadingsfor', 'and assign readings for')] ADV19010601-V03-06-page26.txt: [('theformationofpernicious', 'the formation of pernicious')] ADV19010601-V03-06-page3.txt: [('EducationalMatters', 'Educational Matter s')] ADV19010601-V03-06-page31.txt: [('theteacheroftheKan', 'the teacher of the Kan')] ADV19010601-V03-06-page32.txt: [('andBessieNicolahave', 'and Bessie Nicola have')] ADV19010601-V03-06-page33.txt: [('willbeissuedthelastofAu', 'will be issued the last of Au')] ADV19010601-V03-06-page8.txt: [('TothismanwillIlook', 'To this man will I look')] ADV19010801-V03-07-page11.txt: [('toreadonetextwhich', 'to read one text which')] ADV19010801-V03-07-page16.txt: [('prayforthespiritofrevelation', 'pray for the spirit of revelation')] ADV19010801-V03-07-page19.txt: [('befulloftheloveofGod', 'be full of the love of God')] ADV19010801-V03-07-page25.txt: [('tobuildthetabernacle', 'to build the tabernacle')] ADV19010801-V03-07-page26.txt: [('andallouryouthshouldbepermitted', 'and all our youth should be permitted')] ADV19010801-V03-07-page28.txt: [('mostofthedepartmentshave', 'most of the departments have'), ('beendroppedfromtheAdvocate', 'been dropped from the Advocate')] ADV19011001-V03-08-page11.txt: [('comestomesayingthat', 'comes to me saying that')] ADV19011001-V03-08-page16.txt: [('someofthewealthoflargefarmsbeturned', 'some of the wealth of large farms be turned')] ADV19011001-V03-08-page23.txt: [('individualsthroughcollege', 'individuals through college')] ADV19011001-V03-08-page29.txt: [('businessmanagerofWalla', 'business 
manager of Walla')] ADV19011001-V03-08-page32.txt: [('onlandwherethedeedofsaidlaud', 'on land where the deed of said laud')] ADV19011001-V03-08-page6.txt: [('CommencementAddressDeliveredatUnionCollegeby', 'Com men cement Address Deliver ed at Union College by')] ADV19011101-V03-09-page20.txt: [('thebabyasfastasyoucan', 'the baby as fast as you can')] ADV19011101-V03-09-page21.txt: [('spirituallessonofthepowerandworkof', 'spiritual lesson of the power and work of')] ADV19011101-V03-09-page32.txt: [('andourworkisprogressing', 'and our work is progressing'), ('thecluboftenAdvocates', 'the club often Advocates'), ('remitthepriceofthe', 'rem it the price of the')] ADV19011101-V03-09-page33.txt: [('issueoftheAdvocate', 'issue of the Advocate'), ('hasamissionoutside', 'has a mission outside')] ADV19011101-V03-09-page9.txt: [('educationbeganwiththemother', 'education began with the mother')] ADV19011201-V03-10-page12.txt: [('conservativeestimate', 'conservative estimate')] ADV19011201-V03-10-page28.txt: [('gravityofthesituation', 'gravity of the situation'), ('cluboftenAdvocates', 'club often Advocates'), ('livedinasecludedspot', 'lived in a secluded spot')] ADV19011201-V03-10-page29.txt: [('clubofSeptemberAdvocates', 'club of September Advocates')] ADV19011201-V03-10-page30.txt: [('fiftycentsforourcluboften', 'fifty cents for our club often'), ('willcontinueourclub', 'will continue our club')] ADV19011201-V03-10-page31.txt: [('EducationalSecretary', 'Educational Secretary'), ('wasfirstpublishedintheinterestsofa', 'was first published in the interests of a')] ADV19011201-V03-10-page32.txt: [('havebeenreceivedfromteacherswho', 'have been received from teachers who'), ('clublistinNovember', 'club list in November')] ADV19011201-V03-10-page35.txt: [('andincipienttuberculosiscured', 'and incipient tuberculosis cured')] ADV19020101-V04-01-page12.txt: [('EverypageoftheAdvocate', 'Every page of the Advocate')] ADV19020101-V04-01-page27.txt: [('thebusyweekisalmostover', 'the busy 
week is almost over')] ADV19020101-V04-01-page29.txt: [('makeththeheartsick', 'maketh the heart sick')] ADV19020101-V04-01-page31.txt: [('centsinpaymentfora', 'cents in payment for a')] ADV19020101-V04-01-page32.txt: [('EducationalSecretary', 'Educational Secretary'), ('isthetimesetfortheopening', 'is the time set for the opening')] ADV19020101-V04-01-page36.txt: [('EducationalDepartment', 'Educational Department')] ADV19020201-V04-02-page11.txt: [('amountofmoneydoesthena', 'amount of money does then a'), ('astepinadvancewouldbemade', 'a step in advance would be made')] ADV19020201-V04-02-page12.txt: [('anditshallbegivenhim', 'and it shall be given him')] ADV19020201-V04-02-page13.txt: [('WorkertogivetheAdvocate', 'Worker to give the Advocate')] ADV19020201-V04-02-page2.txt: [('CorrelationinArithmetic', 'Correlation in Arithmetic')] ADV19020201-V04-02-page25.txt: [('Wheretheschoolisfar', 'Where the school is far')] ADV19020201-V04-02-page30.txt: [('aresixchurchschoolsinthe', 'are six church schools in the')] ADV19020201-V04-02-page32.txt: [('EducationalSecretary', 'Educational Secretary')] ADV19020201-V04-02-page36.txt: [('Progressofonepupilisnotretardedbyothers', 'Progress of one pupil is not retarded by others')] ADV19020301-V04-03-page1.txt: [('PostoffioeatBerrienSprings', 'Post off i o eat Berrien Springs')] ADV19020301-V04-03-page15.txt: [('Inselectingofficers', 'In selecting officers')] ADV19020301-V04-03-page19.txt: [('ForquarterendingJune', 'For quarter ending June')] ADV19020301-V04-03-page2.txt: [('TheHaskellHomeTrainingSchool', 'The Haskell Home Training School')] ADV19020301-V04-03-page25.txt: [('whichismorehelpfulthantheAdvocate', 'which is more helpful than the Advocate'), ('inreadingtheAdvocate', 'in reading the Advocate')] ADV19020301-V04-03-page30.txt: [('posedofinlessthantenminutes', 'posed of in less than ten minutes')] ADV19020301-V04-03-page31.txt: [('TheclubofAdvocates', 'The club of Advocates')] ADV19020301-V04-03-page32.txt: 
[('takeacopyoftheAdvocate', 'take a copy of the Advocate'), ('PresidentoftheWest', 'President of the West'), ('andfeelsurethatourschools', 'and feel sure that our schools')] ADV19020301-V04-03-page33.txt: [('EducationalSecretary', 'Educational Secretary')] ADV19020401-V04-04-page14.txt: [('begivenintheAdvocate', 'be given in the Advocate')] ADV19020401-V04-04-page15.txt: [('anduntilweknowandexperience', 'and until we know and experience')] ADV19020401-V04-04-page22.txt: [('TheSeaandtheDryLand', 'The Sea and the Dry Land')] ADV19020401-V04-04-page27.txt: [('Norshouldthefirstlessonsbe', 'Nor should the first lessons be')] ADV19020401-V04-04-page31.txt: [('SouthLancasterAcademy', 'South Lancaster Academy'), ('nearlydoubledifwehadhadteachers', 'nearly doubled if we had had teachers')] ADV19020401-V04-04-page33.txt: [('pagesoftheAdvocate', 'pages of the Advocate'), ('arefriendstotheboysand', 'are friends to the boys and'), ('terestinthecauseforwhiththeAd', 'ter est in the cause for whit h the Ad'), ('sosurelytheAdvocate', 'so surely the Advocate'), ('InorderingtheAdvocate', 'In ordering the Advocate'), ('Awordofexplanationwillmakeitclearwhy', 'A word of explanation will make it clear why')] ADV19020401-V04-04-page34.txt: [('ofourbooksinthelibrary', 'of our books in the library')] ADV19020401-V04-04-page8.txt: [('taughtinapublicschool', 'taught in a public school')] ADV19020501-V04-05-page13.txt: [('throughthecolumnsoftheAdvocate', 'through the columns of the Advocate')] ADV19020501-V04-05-page16.txt: [('hewilllayuponusabur', 'he will lay upon us a bur')] ADV19020501-V04-05-page36.txt: [('Correspondenceisinvited', 'Correspondence is invited')] ADV19020601-V04-06-page15.txt: [('HOWSHOULDTHEWORKOFTHEHOMEDE', 'HOW SHOULD THE WORK OF THE HOME DE')] ADV19020601-V04-06-page2.txt: [('whenthatwhichisperfectiscome', 'when that which is perfect is come')] ADV19020601-V04-06-page26.txt: [('howshallIconductmyself', 'how shall I conduct myself')] ADV19020601-V04-06-page33.txt: 
[('Inclubsoftwoormoretooneaddress', 'In clubs of two or more to one address'), ('Toforeigncountries', 'To foreign countries')] ADV19020701-V04-07-page10.txt: [('thepublicschoolsofFrance', 'the public schools of France'), ('andthebronzewasgettingthe', 'and the bronze was getting the')] ADV19020701-V04-07-page11.txt: [('couldstartlethelivingofall', 'could start le the living of all'), ('Theeducationalproblemdoesnotbelong', 'The educational problem does not belong'), ('itisthefirstdutyofeveryparent', 'it is the first duty of every parent'), ('providedbythestate', 'provided by the state')] ADV19020701-V04-07-page13.txt: [('Careduringthefirstyear', 'Care during the first year'), ('theseproblemsaffectsevery', 'these problems affects every')] ADV19020701-V04-07-page15.txt: [('Nooneatallfittedfortheplacewould', 'No one at all fitted for the place would'), ('thathehascomefarshort', 'that he has come far short')] ADV19020701-V04-07-page16.txt: [('takegreaterpainstoimprovetheirmethods', 'take greater pains to improve their methods'), ('Andwhatdoesitmeanwhenap', 'And what does it mean when a p'), ('thosesamequestionswould', 'those same questions would'), ('iswhollyamistakenideathatquestions', 'is wholly a mistaken idea that questions'), ('oflifeandwhichinspirelife', 'of life and which inspire life'), ('questionswillbetheinevitable', 'questions will be the inevitable'), ('Tobeabletoaskquestionswhicharefull', 'To be able to ask questions which are full')] ADV19020701-V04-07-page17.txt: [('andtheawfuldangerofmakingthisourdwell', 'and the awful danger of making this our dwell')] ADV19020701-V04-07-page18.txt: [('PatriarchsandProphets', 'Patriarchs and Prophets')] ADV19020701-V04-07-page19.txt: [('PatriarchsandProphets', 'Patriarchs and Prophets'), ('Intheheartbringsintothehome', 'In the heart brings into the home')] ADV19020701-V04-07-page20.txt: [('themshowshowtrulywelovehim', 'them shows how truly we love him')] ADV19020701-V04-07-page21.txt: [('Whenthechildrenhavelearned', 
'When the children have learned'), ('fewsquaresofmattingpaper', 'few squares of mat ting paper'), ('wenttothedesertalone', 'went to the desert alone'), ('Mammahastaughthimtopray', 'Mamma has taught him to pray'), ('andasksJesustohelphimbe', 'and asks Jesus to help him be'), ('Butatbreakfasthedoesjustthe', 'But at breakfast he does just the'), ('notlistenbecausebisheart', 'not listen because bis heart'), ('thelittleboywantsto', 'the little boy wants to'), ('andshowbythepicturesthat', 'and show by the pictures that'), ('peringinhishearttherightway', 'per ing in his heart the right way'), ('andhavingcometothesecond', 'and having come to the second'), ('Brighttacksandrustynailsmay', 'Bright tacks and rust y nails may'), ('mucheasierlittlechildren', 'much easier little children')] ADV19020701-V04-07-page22.txt: [('thechildrenimaginethethreethat', 'the children imagine the three that'), ('WhenAbrahamfoundout', 'When Abraham found out'), ('terlyreportabetterone', 'ter l y report a better one')] ADV19020701-V04-07-page23.txt: [('ForquarterendingDecember', 'For quarter ending December')] ADV19020701-V04-07-page24.txt: [('Joyisabroadintheworld', 'Joy is abroad in the world'), ('theraindropsgoldardgems', 'the rain drops gold ard gems'), ('ifwegivealittleandgetagreatdeal', 'if we give a little and get a great deal'), ('Makeobjectswithpeasandtoothpicks', 'Make objects with peas and tooth pick s')] ADV19020701-V04-07-page25.txt: [('Cutuppicturesandputinboxes', 'Cut up pictures and put in boxes')] ADV19020701-V04-07-page27.txt: [('Andtherewasayoungman', 'And there was a young man')] ADV19020701-V04-07-page29.txt: [('istingintheeducationalworldintheSouth', 'is ting in the educational world in the South'), ('haverequiredtheSouthernTrainingSchool', 'have required the Southern Training School'), ('thisyeargoneoutfromtheschoolasteach', 'this year gone out from the school as teach'), ('underareducedteachingforce', 'under a reduced teaching force')] ADV19020701-V04-07-page30.txt: 
[('gladtobeabletosaythatWiscon', 'glad to be able to say that Wis con'), ('andwehavehadaveryinteresting', 'and we have had a very interesting')] ADV19020701-V04-07-page32.txt: [('deliveredtheclosing', 'delivered the closing'), ('addressbeforethestudentsofSouthLan', 'address before the students of South L an')] ADV19020701-V04-07-page33.txt: [('OneofourmissionariesinSouth', 'One of our missionaries in South'), ('Asthemenpasstotheirwork', 'As the men pass to their work'), ('JennieWillamanwrites', 'Jennie Will a man writes')] ADV19020701-V04-07-page4.txt: [('youthforimmediateserviceisthecallofthe', 'youth for immediate service is the call of the'), ('onewhosedutyitistobringforwardthere', 'one whose duty it is to bring forward there'), ('voteshisenergiestothegoodofhumanity', 'vote s his energies to the good of humanity')] ADV19020701-V04-07-page5.txt: [('Nomancanbecomealeaderofmen', 'No man can become a leader of men')] ADV19020701-V04-07-page6.txt: [('poundsasheretofore', 'pounds as heretofore'), ('andonlyincreasestwo', 'and only increases two'), ('pupilsofthehigherthanthoseoftheelementary', 'pupils of the higher than those of the elementary')] ADV19020701-V04-07-page7.txt: [('isthefactthatthepupilssustained', 'is the fact that the pupils sustained'), ('Industrialtrainingappealsto', 'Industrial training appeals to'), ('theindustrialschools', 'the industrial schools')] ADV19020701-V04-07-page8.txt: [('oldoaksnativetothesoil', 'old oaks native to the soil')] ADV19020801-V04-08-page12.txt: [('perintendentsarenotborn', 'per in ten dents are not born'), ('exactlyasaremeninotherlinesoflife', 'exactly as are men in other lines of life')] ADV19020801-V04-08-page18.txt: [('operationofParentsandTeachers', 'operation of Parents and Teachers')] ADV19020801-V04-08-page19.txt: [('paymentofacluboftheadvocates', 'payment of a club of the advocates'), ('Allordersfortheadvocate', 'All orders for the advocate')] ADV19020801-V04-08-page2.txt: [('ChurchSchoolsinNebraska', 'Church Schools 
in Nebraska'), ('andknowallmysteriesand', 'and know all mysteries and'), ('DirectoryofSabbathSchool', 'Directory of Sabbath School')] ADV19020801-V04-08-page21.txt: [('SUGGESTEDBLACKBOARD', 'SUGGESTED BLACKBOARD')] ADV19020801-V04-08-page26.txt: [('ittaketomakethetrip', 'it take to make the trip')] ADV19020801-V04-08-page34.txt: [('SouthernCalifornia', 'Southern California'), ('MissNanetteUnderwood', 'Miss Nanette Underwood')] ADV19020801-V04-08-page8.txt: [('foundedforthepurposeofteach', 'founded for the purpose of teach')] ADV19020901-V04-09-page12.txt: [('aretheonesmostdirectlycon', 'are the ones most directly con')] ADV19020901-V04-09-page2.txt: [('TheEducationalConference', 'The Educational Conference'), ('HowChicagoUniversity', 'How Chicago University'), ('TheCapeTownChurchSchool', 'The Cape Town Church School'), ('lieDivinePlanofTeaching', 'lie Divine Plan of Teaching')] ADV19020901-V04-09-page20.txt: [('tobedrawnfromthischapter', 'to be drawn from this chapter'), ('littlechildrentocomfort', 'little children to comfort'), ('placeforrepentance', 'place for repentance'), ('Showthechildrenthat', 'Show the children that'), ('SUGGESTEDBLACKBOARD', 'SUGGESTED BLACKBOARD'), ('Jacobbeforehisfather', 'Jacob before his father')] ADV19020901-V04-09-page27.txt: [('Andthedoglookedsolemn', 'And the dog looked solemn'), ('andtriedtogetloose', 'and tried to get loose'), ('andsadlyhebeggedthem', 'and sadly he begged them'), ('andthehorsereplied', 'and the horse replied'), ('asyoudidinyourdockingof', 'as you did in your doc king of'), ('sincehisawkwardthumbsaregone', 'since his awkward thumbs are gone'), ('Whenyouboundmefast', 'When you bound me fast'), ('andtrimmedmyearsdownclosetothetopofmyhead', 'and trim med my ears down close to the top of my head'), ('Andthecruelhorseandthedoglookon', 'And the cruel horse and the dog look on'), ('andneverseemtocare', 'and never seem to care')] ADV19020901-V04-09-page33.txt: [('teachersontheIsland', 'teachers on the Island'), 
('enjoythevisitsoftheAdvocate', 'enjoy the visits of the Advocate')] ADV19020901-V04-09-page34.txt: [('DirectoryofEducationalWorkers', 'Directory of Educational Workers'), ('DirectoryofSabbathSchool', 'Directory of Sabbath School')] ADV19020901-V04-09-page4.txt: [('fortheencouragementofmy', 'for the encouragement of my')] ADV19020901-V04-09-page7.txt: [('IntheimageofGodcreatedhe', 'In the image of God created he')] ADV19020901-V04-09-page8.txt: [('theresourcefulness', 'the resource fulness')] ADV19021001-V04-10-page11.txt: [('buttheiropportunity', 'but their opportunity')] ADV19021001-V04-10-page17.txt: [('Theteachermustbewhat', 'The teacher must be what'), ('ChristianEducationistheGospel', 'Christian Education is the Gospel')] ADV19021001-V04-10-page18.txt: [('Thechildfirstlearns', 'The child first learns')] ADV19021001-V04-10-page2.txt: [('DutiesandQualificationsofEducationalSuperintendents', 'Duties and Qualifications of Educational Superintendents'), ('TheChurchShouldEducateitsChildren', 'The Church Should Educate its Children'), ('DirectoryofEducational', 'Directory of Educational'), ('Toforeigncountries', 'To foreign countries'), ('butwhenthatwhichisperfectiscome', 'but when that which is perfect is come')] ADV19021001-V04-10-page23.txt: [('ProfessorCadyexplainedthe', 'Professor Cady explained the')] ADV19021001-V04-10-page24.txt: [('IntermediateSchool', 'Intermediate School'), ('Anychurchschoolhasarighttobecome', 'Any church school has aright to become')] ADV19021001-V04-10-page27.txt: [('Godhasblessedandprospered', 'God has blessed and prosper ed')] ADV19021001-V04-10-page28.txt: [('Aspiritofpermanencymust', 'A spirit of permanency must')] ADV19021001-V04-10-page3.txt: [('Wepayforschoolsnetsomuchoutof', 'We pay for schools net so much out of')] ADV19021001-V04-10-page34.txt: [('Hislifewassavedbygo', 'His life was saved by go')] ADV19021001-V04-10-page36.txt: [('BLACKBOARDILLUSTRATION', 'BLACKBOARD ILLUSTRATION')] ADV19021001-V04-10-page41.txt: 
[('LakeUnionConference', 'Lake Union Conference')] ADV19021001-V04-10-page42.txt: [('RecommendedbyBattleCreek', 'Recommended by Battle Creek'), ('Sanitariumphysicians', 'Sanitarium physicians'), ('beneficialindisease', 'beneficial in disease')] ADV19021001-V04-10-page44.txt: [('PrimaryLanguageLessons', 'Primary Language Lessons'), ('Forkeepingadailyrecordofattendance', 'For keeping a daily record of attendance'), ('Cashmustaccompanyordersforbooks', 'Cash must accompany orders for books'), ('Sendmoneybypostalorder', 'Send money by postal order'), ('orregisteredletter', 'or registered letter'), ('Donotsendloosecoin', 'Do not send loose coin')] ADV19021001-V04-10-page5.txt: [('Myfirstimpressions', 'My first impressions'), ('schoolwasestablished', 'school was established')] ADV19021001-V04-10-page7.txt: [('IsawtheneedofChris', 'I saw the need of Chris')] ADV19021101-V04-11-page12.txt: [('havefeltthatitwasourgreatest', 'have felt that it was our greatest')] ADV19021101-V04-11-page22.txt: [('ForquarterendingMarch', 'For quarter ending March')] ADV19021101-V04-11-page32.txt: [('seesomeoftheletters', 'see some of the letters')] ADV19021101-V04-11-page7.txt: [('nevertodoanythingwhich', 'never to do anything which')] ADV19021201-V04-12-page13.txt: [('mightbewellforthechurchtoawakento', 'might be well for the church to awaken to'), ('oftheProtestantBibleasarequiredexercise', 'of the Protestant Bible as a required exercise')] ADV19021201-V04-12-page2.txt: [('Toforeigncountries', 'To foreign countries'), ('ThePuritansasEducators', 'The Puritans as Educators'), ('andknowallmysteriesand', 'and know all mysteries and'), ('andifIhaveallfaith', 'and if I have all faith')] ADV19021201-V04-12-page22.txt: [('Asthepeopleweredependent', 'As the people were dependent'), ('Readwiththeclassverseseighteenand', 'Read with the class verses eighteen and')] ADV19021201-V04-12-page23.txt: [('Thesmallestdetails', 'The smallest details')] ADV19021201-V04-12-page25.txt: [('andtoldmeIcouldsay', 'and 
told me I could say')] ADV19021201-V04-12-page27.txt: [('Youwillfindaboxofchocolatedropsin', 'You will find a box of c ho col ate drops in'), ('thereissuchathingasmemory', 'there is such a thing as memory'), ('wonfirstprizeinacom', 'won first prize in a com'), ('awaythatappealstothe', 'away that appeals to the')] ADV19021201-V04-12-page29.txt: [('thecharacterofeverystudenttrainedwithin', 'the character of every student trained within')] ADV19021201-V04-12-page32.txt: [('readingtheadvocate', 'reading the advocate')] ADV19021201-V04-12-page35.txt: [('beneficialindisease', 'beneficial in disease'), ('GoodHealthPublishingCo', 'Good Health Publishing C o')] ADV19030101-V05-01-page15.txt: [('Thoughweareabusypeople', 'Though we are a busy people')] ADV19030101-V05-01-page31.txt: [('teacheroftheLincoln', 'teacher of the Lincoln')] ADV19030101-V05-01-page32.txt: [('isteachingthechurch', 'is teaching the church')] ADV19030201-V05-02-page17.txt: [('saidtomeafewdaysago', 'said to me a few days ago')] ADV19030201-V05-02-page19.txt: [('abundantsuggestion', 'abundant suggestion')] ADV19030201-V05-02-page22.txt: [('ForquarterendingJune', 'For quarter ending June')] ADV19030201-V05-02-page26.txt: [('experiencesofteachers', 'experiences of teachers')] ADV19030201-V05-02-page29.txt: [('Readersoftheadvocate', 'Readers of the advocate')] ADV19030201-V05-02-page30.txt: [('sonotallthatappearsiswheat', 'so not all that appears is wheat')] ADV19030201-V05-02-page31.txt: [('Ofthosenowattending', 'Of those now attending')] ADV19030201-V05-02-page4.txt: [('anditsKingrecognizebutonemethodof', 'and its King recognize but one method of'), ('themachineryinorder', 'the machinery in order')] ADV19030201-V05-02-page5.txt: [('Thenletthemovementbegin', 'Then let the movement begin')] ADV19030201-V05-02-page6.txt: [('drenallthattheyneed', 'dr en all that they need')] ADV19030201-V05-02-page7.txt: [('Anadequatemissionaryspiritwill', 'An adequate missionary spirit will')] ADV19030301-V05-03-page11.txt: 
[('andnumberlessotherlittle', 'and numberless other little'), ('Thepresshasagitatedthequestion', 'The press has agitated the question')] ADV19030301-V05-03-page18.txt: [('riseinthemorningveryweary', 'rise in the morning very weary')] ADV19030301-V05-03-page2.txt: [('tobeartestimonytothefactthatithasopenedmy', 'to bear testimony to the fact that it has opened my')] ADV19030301-V05-03-page24.txt: [('littlerulesweallshouldkeep', 'little rules we all should keep')] ADV19030301-V05-03-page7.txt: [('asofmanyothersciences', 'as of many other sciences'), ('perversionofthename', 'per version of the name')] ADV19030401-V05-04-page11.txt: [('withatleastonecopyof', 'with at least one copy of'), ('ThecurrentissueoftheAdvocate', 'The current issue of the Advocate'), ('copyoftheMayAdvocate', 'copy of the May Advocate')] ADV19030401-V05-04-page12.txt: [('letteraddressedtotheAdvocate', 'letter addressed to the Advocate')] ADV19030401-V05-04-page21.txt: [('ForquarterendingSept', 'For quarter ending Sept')] ADV19030401-V05-04-page25.txt: [('DirectoroftheOfficeofExperi', 'Director of the Office of E x p e r i')] ADV19030401-V05-04-page29.txt: [('terestedintheAdvocate', 'ter est ed in the Advocate'), ('issueoftheAdvocate', 'issue of the Advocate')] ADV19030401-V05-04-page7.txt: [('comegoodworkersinanyline', 'come good workers in any line')] ADV19030401-V05-04-page8.txt: [('EducationforStreetBoys', 'Education for Street Boys')] ADV19030501-V05-05-page11.txt: [('Astheplantthatblooms', 'As the plant that blooms')] ADV19030501-V05-05-page19.txt: [('andalsoseethateachchildthatisold', 'and also see that each child that is old'), ('aretryingtofindtherightwayto', 'are trying to find the right way to')] ADV19030501-V05-05-page31.txt: [('helpspreadthegospelmethodsofeducation', 'help spread the gospel methods of education')] ADV19030601-V05-06-page13.txt: [('bythecourtesyofitspub', 'by the courtesy of its pub')] ADV19030601-V05-06-page29.txt: [('Thattheeducationalsuperintendents', 'That the 
educational superintendents')] ADV19030601-V05-06-page30.txt: [('forthepurposeofestablishing', 'for the purpose of establishing')] ADV19030601-V05-06-page34.txt: [('forwhichmankindislooking', 'for which mankind is looking'), ('andwhichdevelopsaperfect', 'and which develops a perfect')] ADV19030701-V05-07-page13.txt: [('testoftheworthofalivingfaith', 'test of the worth of a living faith')] ADV19030701-V05-07-page24.txt: [('hemakesarainbowshine', 'he makes a rainbow shine'), ('climbingorfighting', 'climbing or fighting')] ADV19030701-V05-07-page29.txt: [('Thisisasplendidpeach', 'This is a splendid peach'), ('thepeachesbeallyourown', 'the peaches be all your own')] ADV19030701-V05-07-page8.txt: [('hasinthepastcalledthe', 'has in the past called the')] ADV19030701-V05-07-page9.txt: [('cannotbehappywithoutit', 'cannot be happy without it')] ADV19030801-V05-08-page10.txt: [('boybutoneattheJohnWorthy', 'boy but one at the John Worth y')] ADV19030801-V05-08-page22.txt: [('missionsduringtheyear', 'missions during the year')] ADV19030801-V05-08-page28.txt: [('UniformExaminationsforEle', 'Uniform Examinations for El e')] ADV19030801-V05-08-page29.txt: [('carriedonduringtheyear', 'carried on during the year')] ADV19030801-V05-08-page31.txt: [('TheeditorsoftheAdvocate', 'The editors of the Advocate')] ADV19030801-V05-08-page33.txt: [('principaloftheAnglo', 'principal of the Anglo')] ADV19030901-V05-09-page11.txt: [('ofChristianSchools', 'of Christian Schools')] ADV19030901-V05-09-page12.txt: [('IntheAugustissueof', 'In the August issue of')] ADV19030901-V05-09-page21.txt: [('werelikewhatwearesometimes', 'were like what we are sometimes'), ('himontheflooragain', 'him on the floor again'), ('withoutwhichnolessoncanbeimpressed', 'without which no lesson can be impressed')] ADV19030901-V05-09-page25.txt: [('andtesttheabilityofeventheyoungerones', 'and test the ability of even the younger ones')] ADV19030901-V05-09-page28.txt: [('Thatwhichcanbebreathedatone', 'That which can be 
breathed atone'), ('fromthePhysiologyClass', 'from the Physiology Class')] ADV19030901-V05-09-page33.txt: [('wonderwhatplacetheBibleshouldhold', 'wonder what place the Bible should hold'), ('youshouldbewellinformed', 'you should be well informed')] ADV19030901-V05-09-page6.txt: [('runningnorthandsouth', 'running north and south')] ADV19030901-V05-09-page9.txt: [('butforcertainchangesmadein', 'but for certain changes made in')] ADV19031001-V05-10-page14.txt: [('theheadsoffamilies', 'the heads of families'), ('stillcontinuestoinvite', 'still continues to invite')] ADV19031001-V05-10-page19.txt: [('GodDirectsourWarfare', 'God Direct sour War fare')] ADV19031001-V05-10-page30.txt: [('Findthecostoftherafters', 'Find the cost of the rafters')] ADV19031001-V05-10-page31.txt: [('Makeplansandestimatecostoffront', 'Make plans and estimate cost of front')] ADV19031001-V05-10-page33.txt: [('writesthatdefinitear', 'writes that definite ar')] ADV19031001-V05-10-page34.txt: [('TheeditoroftheLifeBoathas', 'The editor of the Life Boat has')] ADV19031101-V05-11-page13.txt: [('quotationsweregivenshow', 'quotations were given show')] ADV19031101-V05-11-page34.txt: [('HowMissionariesAreMade', 'How Missionaries Are Made')] ADV19031101-V05-11-page35.txt: [('ofthenewvolumebyMrs', 'of the new volume by Mrs')] ADV19031101-V05-11-page8.txt: [('pointwhereamancannotsharpenhisown', 'point where a man cannot sharpen his own')] ADV19031201-V05-12-page28.txt: [('RobertCollegeisone', 'Robert College is one')] ADV19031201-V05-12-page32.txt: [('laysholdonthepromise', 'lays hold on the promise')] ADV19031201-V05-12-page33.txt: [('WehavemissedtheAdvocate', 'We have missed the Advocate'), ('soldseveralcopiesofthe', 'sold several copies of the')] ADV19031201-V05-12-page8.txt: [('Thelessonsofgeographywillbe', 'The lessons of geography will be')] ADV19031201-V05-12-page9.txt: [('thewholecolonywascen', 'the whole colony was cen')] ADV19040101-V06-01-page10.txt: [('tothelastnumberoftheA', 'to the last number 
of the A')] ADV19040101-V06-01-page19.txt: [('schoolsbeforethenewyear', 'schools before the new year')] ADV19040101-V06-01-page2.txt: [('ThelessonsareInteresting', 'The lessons are Interesting'), ('IdidnotknowbeforethattheBibleissuchaninterestingbo', 'I did not know before that the Bible is such an interesting b o'), ('heartilyfavortheplan', 'heartily favor the plan'), ('Frompersonalcorrespondencewith', 'From personal correspondence with')] ADV19040101-V06-01-page20.txt: [('Theaboveisfromtheopeningparagraph', 'The above is from the opening paragraph'), ('ofthenewvolumebyMrs', 'of the new volume by Mrs'), ('Toshowhowthisenduringand', 'To show how this enduring and'), ('mayberealizedbytheyouth', 'may be realized by the youth'), ('Andafterreadingityourself', 'And after reading it yourself'), ('youwillwishtotalkwithyour', 'you will wish to talk with your'), ('Thisisaspecialoffer', 'This is a special offer')] ADV19040101-V06-01-page21.txt: [('inreligiousteaching', 'in religious teaching'), ('Fromthebeginningofhuman', 'From the beginning of human')] ADV19040101-V06-01-page3.txt: [('librarywhatyouwish', 'library what you wish'), ('buttheaddingofabooknowandthen', 'but the adding of a book now and then'), ('subscribeforthedoublepurposeofbringingtheADVOCATE', 'subscribe for the double purpose of bringing the ADVOCATE'), ('Weshallofferadditionalbooksaspremiumslater', 'We shall offer additional books as premium s later')] ADV19040101-V06-01-page4.txt: [('youthasameansofpreparingmissionariesfor', 'youth as a means of preparing missionaries for'), ('wecultivateadeephungerforthe', 'we cultivate a deep hunger for the')] ADV19040301-V06-03-page18.txt: [('writingfromLakeOdessa', 'writing from Lake O des s a')] ADV19040301-V06-03-page19.txt: [('PacificHealthJournalJ', 'Pacific Health Journal J')] ADV19040301-V06-03-page3.txt: [('istheharmoniousdevelop', 'is the harmonious develop')] ADV19040401-V06-04-page12.txt: [('Peachtreesshouldbepruned', 'P each trees should be pruned')] 
ADV19040401-V06-04-page17.txt: [('Theformerstudentswere', 'The former students were'), ('Heretoforetherehasbeenmoreorlesssep', 'Heretofore there has been more or less s e p')] ADV19040401-V06-04-page3.txt: [('dayisafreshbeginning', 'day is a fresh beginning')] ADV19040501-V06-05-page10.txt: [('Whatifyourboyormine', 'What if your boy or mine'), ('hilethedaysandtheyearsgohurryiugby', 'hi le the days and the years go hurry i u g b y'), ('IpIknewyouandyouknewme', 'I p I knew you and you knew me'), ('surethatwewoulddifferless', 'sure that we would differ less')] ADV19040501-V06-05-page11.txt: [('planwouldyousuggesttohelpmatters', 'plan would you suggest to help matters'), ('WherecanIfindastate', 'Where can I find a state'), ('sentouthundredsofteacherstoconduct', 'sent out hundreds of teachers to conduct')] ADV19040501-V06-05-page14.txt: [('isroomforthemanwhocanset', 'is room for the man who can set')] ADV19040501-V06-05-page15.txt: [('andothereducational', 'and other educational')] ADV19040501-V06-05-page2.txt: [('whenastudentinAndover', 'when a student in And over'), ('ifliehadbuthalfaminuteto', 'if lie had but half a minute to')] ADV19040501-V06-05-page3.txt: [('kingisthemanwhocan', 'king is the man who can')] ADV19040501-V06-05-page4.txt: [('Itbecomesthefirstdutyof', 'It becomes the first duty of')] ADV19040501-V06-05-page9.txt: [('Whereyourtreasureis', 'Where your treasure is')] ADV19040601-V06-06-page1.txt: [('Onemustbewillingtodie', 'One must be willing to die'), ('Thereissomethinginitakintomotherhood', 'There is something in it akin to mother hood')] ADV19040601-V06-06-page10.txt: [('Whatimprovementmaywe', 'What improvement may we'), ('haveanopportunitytoexhibit', 'have an opportunity to exhibit')] ADV19040601-V06-06-page16.txt: [('teachersthebenefitoftheAdvocate', 'teachers the benefit of the Advocate')] ADV19040601-V06-06-page2.txt: [('whenastudentinAndover', 'when a student in And over'), ('correspondencehelponeto', 'correspondence help one to'), 
('WehavehadtheBiblelessons', 'We have had the Bible lessons')] ADV19040601-V06-06-page3.txt: [('studentsarelackinginthefacul', 'students are lacking in the fac u l'), ('howthingsofnaturegrow', 'how things of nature grow'), ('Thesecretofachieve', 'The secret of achieve'), ('everyscrapofknowledge', 'every scrap of knowledge')] ADV19040601-V06-06-page5.txt: [('lookeduponaseducation', 'looked upon as education')] ADV19040601-V06-06-page7.txt: [('agricultureapartofeveryschool', 'agriculture apart of every school'), ('agricultureshouldbe', 'agriculture should be'), ('referencetobroadening', 'reference to broadening'), ('thestreetwearsfinergar', 'the street wears finer gar'), ('Thecapacitytomakemoneyseemsthe', 'The capacity to make money seems the')] ADV19040601-V06-06-page8.txt: [('tothechildlaborlawofGer', 'to the child labor law of Ger')] ADV19040701-V06-07-page12.txt: [('Sendmarkedcopiesofpapers', 'Send marked copies of papers'), ('Tolovetheebettereveryday', 'To love thee better everyday')] ADV19040701-V06-07-page16.txt: [('shouldbereadbyeveryChris', 'should be read by every Chris')] ADV19040701-V06-07-page17.txt: [('hasbeenagreathelpto', 'has been a great help to')] ADV19040801-V06-08-page2.txt: [('shouldbeestablishedatsomepoint', 'should be established at some point')] ADV19040801-V06-08-page4.txt: [('AspecialfeatureoftheSeptemberAdvocate', 'A special feature of the September Advocate')] ADV19040801-V06-08-page8.txt: [('lustratingyourpartofthecountry', 'lust rat ing your part of the country')] ADV19040901-V06-09-page1.txt: [('WhenIthinkofthetrialsthroughwhich', 'When I think of the trials through which'), ('Especiallystrongin', 'Especially strong in')] ADV19040901-V06-09-page10.txt: [('athirdgradeboyspendsamin', 'a third grade boy spends am in'), ('fortyminutesadayoroftentwicethat', 'forty minutes a day or often twice that'), ('juvenilereadingofthehighest', 'juvenile reading of the highest'), ('thecityofWashington', 'the city of Washington')] 
ADV19040901-V06-09-page11.txt: [('ingconvincingreasonswhytheschoolgar', 'ing convincing reasons why the school gar'), ('Thereisamorbidaversiontosunburn', 'There is a morbid a version to sun burn'), ('whichtheschoolsotherwisetendtoproduce', 'which the schools otherwise tend to produce'), ('promoteconstructive', 'promote constructive')] ADV19040901-V06-09-page12.txt: [('Thesupremetestofamode', 'The supreme test of a mode'), ('theunfailingprogressfromcause', 'the unfailing progress from cause'), ('topromotewickedness', 'to promote wickedness'), ('theracehassuffered', 'the race has suffered'), ('knowwhatissignified', 'know what is signified'), ('butusedasarecreation', 'but used as are creation')] ADV19040901-V06-09-page14.txt: [('thingabouttheBible', 'thing about the Bible'), ('acreswehavesecuredisgood', 'acres we have secured is good')] ADV19040901-V06-09-page15.txt: [('Portlandonthewestsideof', 'Portland on the west side of'), ('thewesternpartofthestateofOregon', 'the western part of the state of Oregon')] ADV19040901-V06-09-page16.txt: [('planandthuscreatethefundwhichmakes', 'plan and thus create the fund which makes'), ('whoselittlegirlhadattendedtheschoolfor', 'whose little girl had attended the school for'), ('theirchildhadattendedtheschool', 'their child had attended the school'), ('thisprinciplemorefullyinto', 'this principle more fully into'), ('Iwanttosayafewwordstothereadersof', 'I want to say a few words to the readers of')] ADV19040901-V06-09-page17.txt: [('andIdaSaltoncametoMaidenRock', 'and Ida Salt on came to Mai den Rock'), ('Theyareabletomakerapidadvancrnent', 'They are able to make rapid ad van c r n e n t'), ('thechildofparentswhomakenoprofess', 'the child of parents who make no profess'), ('MissJennieLarmouth', 'Miss Jennie L ar mouth'), ('Shesaysthattherewould', 'She says that there would')] ADV19040901-V06-09-page18.txt: [('neverhadchildrenlearntoreadsoquick', 'never had children learn to read so quick'), ('oneoftheteacherswho', 'one of the 
teachers who'), ('Theyareforalittlebc', 'They are for a little b c'), ('ShehastakenuptheLife', 'She has taken up the Life')] ADV19040901-V06-09-page19.txt: [('planofthefirsttwobooksintheseries', 'plan of the first two books in the series'), ('ABookforLittleChildrenBible', 'A Book for Little Children Bible')] ADV19040901-V06-09-page2.txt: [('WhenIthinkofthetrialsthroughwhich', 'When I think of the trials through which'), ('ingmetbytheselessons', 'ing met by these lessons'), ('theIllinoisCentral', 'the Illinois Central')] ADV19040901-V06-09-page3.txt: [('nootherformofeducation', 'no other form of education')] ADV19040901-V06-09-page4.txt: [('thesysteminaschoolwheretheprinciples', 'the system in a school where the principles'), ('Collegebearsanamewhich', 'College bears a name which'), ('connectedwitheitherthefarmorthehome', 'connected with either the farm or the home')] ADV19040901-V06-09-page5.txt: [('membersarestudentteachers', 'members are student teachers'), ('thiswaythatmanualtrainingisplacedon', 'this way that manual training is placed on'), ('thecookingclasswereresponsiblefor', 'the cooking class were responsible for'), ('ofmeansinthevariousdepartmentsonthe', 'of means in the various departments on the')] ADV19040901-V06-09-page6.txt: [('Theschoolbuildings', 'The school buildings'), ('hasbeenchosenpresidentoftheWashing', 'has been chosen president of the Washing')] ADV19040901-V06-09-page7.txt: [('thefieldsandextensiveplains', 'the fields and extensive plains')] ADV19040901-V06-09-page8.txt: [('DOESTHESCHOOLMAKETHEPEO', 'DOES THE SCHOOL MAKE THE P E O'), ('departmentsoftheUnit', 'departments of the Uni t'), ('whohashadnoexperience', 'who has had no experience')] ADV19040901-V06-09-page9.txt: [('thousanddollarswasexpendedinthe', 'thousand dollars was expended in the'), ('oneofthemostinteresting', 'one of the most interesting')] ADV19041001-V06-10-page11.txt: [('dayonehotfoodmaybecookedandserved', 'day one hot food may be cooked and served')] 
ADV19041001-V06-10-page19.txt: [('planofthefirsttwobo', 'plan of the first two b o')] ADV19041001-V06-10-page3.txt: [('canwefindsogoodameans', 'can we find so good a means')] ADV19041101-V06-11-page12.txt: [('worldhelptocarryonthework', 'world help to carry on the work')] ADV19041101-V06-11-page13.txt: [('wasabouttomoveaway', 'was about to move away')] ADV19041101-V06-11-page14.txt: [('withusduringtheshorttimewespent', 'with us during the short time we spent')] ADV19041101-V06-11-page15.txt: [('movedoffsmoothlyfrom', 'moved off smooth l y from')] ADV19041101-V06-11-page16.txt: [('Arenotthepeopleinneedofthetruths', 'Are not the people in need of the truths'), ('Inthiswayhewillnotonlygetbetter', 'In this way he will not only get better'), ('ofourchurchschoolteacherswrites', 'of our church school teachers writes'), ('allouryoungpeoplecouldbeastrue', 'all our young people could be as true'), ('thepromisemadeinIsaiah', 'the promise made in Isaiah')] ADV19041101-V06-11-page17.txt: [('locationwasnotatthattimedetermined', 'location was not at that time determined'), ('MissionaryCollegeopened', 'Missionary College opened')] ADV19041101-V06-11-page18.txt: [('SomeStrongFeaturesofBibleReader', 'Some Strong Features of Bible Reader')] ADV19041101-V06-11-page3.txt: [('issomethingaboutthesmellof', 'is something about the smell of'), ('Thebenumbingpowerof', 'The be num b ing power of'), ('Thelaborofthefarmis', 'The labor of the farm is')] ADV19041101-V06-11-page4.txt: [('ofthechurchesandschoolsinthediocese', 'of the churches and schools in the di o c e s e'), ('andinthatcityIvisitedoneoftheparo', 'and in that city I visited one of the par o'), ('Protestantssettled', 'Protestants settled')] ADV19041101-V06-11-page5.txt: [('Multitudesstandreadytofollow', 'Multitudes stand ready to follow')] ADV19041101-V06-11-page7.txt: [('Howintenseistheeffortinthesedaysto', 'How intense is the effort in these days to'), ('displaymadebythisschoolattheWorld', 'display made by this school at the 
World'), ('Whenthevolumeiscompleted', 'When the volume is completed'), ('unitedeffortsofthepupils', 'united efforts of the pupils'), ('hibitoneseesachartwhichshowstherela', 'hi bit one sees a chart which shows there la')] ADV19041101-V06-11-page8.txt: [('Shouldmenofmeansgive', 'Should men of means give')] ADV19050101-V07-01-page13.txt: [('correctconclusionsfromwhathesees', 'correct conclusions from what he sees')] ADV19050101-V07-01-page7.txt: [('purposesintroducing', 'purposes introducing')]
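The pairs listed above map run-together ("burst") strings to space-separated reconstructions. As a rough illustration of the general idea, and not the routine in `text2topics` that produced this output, the following is a minimal sketch of dictionary-based segmentation. The function name `segment`, the `max_word_len` parameter, and the toy vocabulary are all hypothetical; the real corrections rely on the project's `spelling_dictionary` and on logic not shown in this section.

```python
def segment(text, vocabulary, max_word_len=20):
    """Split run-together text into the fewest vocabulary words.

    Returns a list of words, or None when no full segmentation exists.
    Hypothetical helper for illustration only.
    """
    lowered = text.lower()
    n = len(lowered)
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            if best[start] is None:
                continue
            if lowered[start:end] in vocabulary:
                # Keep the original casing of the matched slice.
                candidate = best[start] + [text[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Toy vocabulary standing in for a verified word list (assumption).
toy_vocabulary = {"to", "foreign", "countries", "the", "industrial", "schools"}

print(segment("Toforeigncountries", toy_vocabulary))
print(segment("theindustrialschools", toy_vocabulary))
```

Preferring the fewest words is only one possible tie-breaker. A frequency-weighted scorer would likely handle cases such as `Peachtrees` or `everyday` differently, and end-of-line fragments such as `Wiscon` have no correct segmentation at all.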
# %load shared_elements/summary.py
summary = reports.overview_report(directories['cycle'], spelling_dictionary, title)
Directory: /Users/jeriwieringa/Dissertation/text/text/2017-01-31-corpus-with-utf8-split-into-titles-cleaning/ADV/correction12
Average verified rate: 0.9686531689441092
Average of error rates: 0.04687282361847085
Total token count: 1257990
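The verified rate and the average of error rates reported here do not sum to 1. One possible explanation, which is an assumption about how the report is computed rather than something confirmed in this notebook, is that the first figure pools tokens across the whole corpus while the second averages per-document rates, so short pages with many errors weigh more heavily in the second. The sketch below shows the difference on invented data; `verified_rates`, the file names, and the tokens are all hypothetical.

```python
def verified_rates(docs, vocabulary):
    """Return (pooled verified rate, mean of per-document error rates).

    docs maps a filename to its list of tokens. Hypothetical helper.
    """
    total_tokens = 0
    total_verified = 0
    per_doc_error_rates = []
    for tokens in docs.values():
        if not tokens:
            continue  # skip empty pages to avoid dividing by zero
        verified = sum(1 for token in tokens if token.lower() in vocabulary)
        total_tokens += len(tokens)
        total_verified += verified
        per_doc_error_rates.append(1 - verified / len(tokens))
    pooled = total_verified / total_tokens
    mean_error = sum(per_doc_error_rates) / len(per_doc_error_rates)
    return pooled, mean_error


# Invented example: one long clean page and one short noisy page.
docs = {
    "long-page.txt": ["the"] * 9 + ["tbe"],
    "short-page.txt": ["ofthe", "the"],
}
pooled, mean_error = verified_rates(docs, {"the"})
print(pooled, mean_error)
```

In this toy case the pooled verified rate is 10 of 12 tokens, while the per-document error rates are 0.1 and 0.5, so the two summary figures add up to more than 1.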
# %load shared_elements/top_errors.py
errors_summary = reports.get_errors_summary(summary)
reports.top_errors(errors_summary, 10)[:50]
[('e', 3215), ('t', 2066), ('w', 2042), ('m', 1502), ('r', 1381), ('f', 1338), ('n', 1292), ("'", 1262), ('d', 917), ('g', 638), ('u', 506), ('k', 433), ('co', 353), ('x', 281), ('th', 226), ('z', 129), ('q', 112), ('fr', 92), ('ment', 89), ('tion', 80), ('re', 79), ('ofthe', 74), ('pp', 71), ('ers', 68), ('ex', 68), ('ft', 56), ('io', 55), ('il', 47), ('ry', 47), ('mo', 44), ('mt', 43), ('ky', 41), ('si', 39), ('oi', 38), ('bo', 36), ('ol', 34), ('ucation', 34), ('--', 33), ('es', 33), ('va', 32), ('se', 31), ('tbe', 30), ('dren', 30), ('al', 28), ('jt', 28), ('ga', 28), ('fi', 27), ('pa', 27), ('ma', 26), ('pm', 26)]
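The list above ranks the unverified tokens that remain after correction. A tally of this kind can be sketched with `collections.Counter`. This is an illustration under stated assumptions, not the implementation of `reports.get_errors_summary` or `reports.top_errors`: the helper names are hypothetical, and treating the second argument as a minimum count is a guess.

```python
from collections import Counter


def tally_errors(docs, vocabulary):
    """Count every token that is not in the vocabulary, across all documents."""
    counts = Counter()
    for tokens in docs.values():
        counts.update(t for t in tokens if t.lower() not in vocabulary)
    return counts


def frequent_errors(counts, min_count):
    """Return (token, count) pairs at or above min_count, most frequent first."""
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]


# Invented tokens for illustration only.
docs = {
    "a.txt": ["the", "e", "e", "tbe"],
    "b.txt": ["e", "tbe", "ofthe"],
}
counts = tally_errors(docs, {"the"})
print(frequent_errors(counts, 2))
```

Single-letter entries such as `e`, `t`, and `w` dominate the real list, which is consistent with fragments left over from split or burst words, while entries such as `ofthe` and `tbe` point to residual joined words and character misrecognition.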
System Information
# %load ../../shared_elements/system_info.py
import IPython
print(IPython.sys_info())
!pip freeze
{'commit_hash': '5c9c918', 'commit_source': 'installation', 'default_encoding': 'UTF-8', 'ipython_path': '/Users/jeriwieringa/miniconda3/envs/dissertation2/lib/python3.5/site-packages/IPython', 'ipython_version': '5.1.0', 'os_name': 'posix', 'platform': 'Darwin-16.4.0-x86_64-i386-64bit', 'sys_executable': '/Users/jeriwieringa/miniconda3/envs/dissertation2/bin/python', 'sys_platform': 'darwin', 'sys_version': '3.5.2 |Continuum Analytics, Inc.| (default, Jul 2 2016, ' '17:52:12) \n' '[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]'} alabaster==0.7.10 anaconda-client==1.5.5 appnope==0.1.0 argh==0.26.1 Babel==2.3.4 beautifulsoup4==4.5.3 blinker==1.4 bokeh==0.12.4 boto==2.43.0 bz2file==0.98 chest==0.2.3 cleanOCR==0.1 cloudpickle==0.2.1 clyent==1.2.2 cycler==0.10.0 dask==0.12.0 datashader==0.4.0 datashape==0.5.2 decorator==4.0.11 docutils==0.13.1 doit==0.29.0 gensim==0.12.4 Ghost.py==0.2.3 ghp-import2==1.0.1 GoH==0.1 gspread==0.4.1 HeapDict==1.0.0 httplib2==0.9.2 husl==4.0.3 ijson==2.3 imagesize==0.7.1 ipykernel==4.5.2 ipython==5.1.0 ipython-genutils==0.1.0 ipywidgets==5.2.2 Jinja2==2.8 jsonschema==2.5.1 jupyter==1.0.0 jupyter-client==4.4.0 jupyter-console==5.0.0 jupyter-contrib-core==0.3.0 jupyter-contrib-nbextensions==0.2.2 jupyter-core==4.2.1 jupyter-highlight-selected-word==0.0.5 jupyter-latex-envs==1.3.5.4 jupyter-nbextensions-configurator==0.2.3 llvmlite==0.14.0 locket==0.2.0 Logbook==1.0.0 lxml==3.5.0 MacFSEvents==0.7 Mako==1.0.4 Markdown==2.6.7 MarkupSafe==0.23 matplotlib==2.0.0 memory-profiler==0.43 mistune==0.7.3 multipledispatch==0.4.9 natsort==4.0.4 nb-anacondacloud==1.2.0 nb-conda==2.0.0 nb-conda-kernels==2.0.0 nb-config-manager==0.1.3 nbbrowserpdf==0.2.1 nbconvert==4.2.0 nbformat==4.2.0 nbpresent==3.0.2 networkx==1.11 Nikola==7.7.7 nltk==3.2.2 notebook==4.2.3 numba==0.29.0 numpy==1.12.0 oauth2client==4.0.0 OCRreports==0.1 odo==0.5.0 pandas==0.19.2 partd==0.3.6 path.py==0.0.0 pathtools==0.1.2 pdfminer3k==1.3.1 pexpect==4.0.1 pickleshare==0.7.4 
Pillow==3.4.2 plotly==2.0.1 ply==3.10 prompt-toolkit==1.0.9 psutil==4.3.0 ptyprocess==0.5.1 py==1.4.32 pyasn1==0.1.9 pyasn1-modules==0.0.8 pycrypto==2.6.1 Pygments==2.1.3 pyparsing==2.1.10 PyPDF2==1.25.1 PyRSS2Gen==1.1 pyshp==1.2.10 pytest==3.0.6 python-dateutil==2.6.0 pytz==2016.10 pyxDamerauLevenshtein==1.4.1 PyYAML==3.12 pyzmq==16.0.2 qtconsole==4.2.1 requests==2.13.0 rsa==3.4.2 scipy==0.18.1 seaborn==0.7.1 simplegeneric==0.8.1 six==1.10.0 smart-open==1.3.5 snowballstemmer==1.2.1 Sphinx==1.5.1 sphinx-rtd-theme==0.2.0 terminado==0.6 textblob==0.11.1 toolz==0.8.1 tornado==4.4.2 traitlets==4.3.1 Unidecode==0.4.19 verifyOCR==0.1 watchdog==0.8.3 wcwidth==0.1.7 webassets==0.11.1 wget==2.2 widgetsnbextension==1.2.6 ws4py==0.3.4 xarray==0.8.2 Yapsy==1.11.223