Create Spell-Check Word Lists
Create a Word List for OCR verification
Goals
In order to calculate the rough accuracy of the OCR on the SDA periodicals, I created a word bank against which each token on each periodical page could be compared. While this approach does not capture recognition errors that result in a valid word (for example, "eye" misread as "eve"), it provides a generalized picture of the percentage of each text that is nonsensical and an entry point for determining the ways the OCR has failed.
# Creating a word list against which to evaluate the OCR
spelling_dictionary = []

import csv

# Function for pulling words from a CSV file
def load_from_csv(file_name, column_name):
    word_list = []
    with open(file_name, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            word_list.append(row[column_name])
    word_list = [w.lower() for w in word_list]
    return word_list

# Function for pulling words from a txt file
def load_from_txt(file_name):
    with open(file_name, "r") as txt:
        words = txt.read().splitlines()
    word_list = [w for w in words]
    return word_list

# Function for getting the words in a new list that are not already in the dictionary
def get_unique_words(word_list, existing_list):
    return set(word_list) - set(existing_list)

# Helper function to add words to the spelling dictionary
def add_to_list(word_list, dictionary):
    for each in word_list:
        dictionary.append(each)
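These helpers are used throughout the rest of the notebook. As a quick illustration of how they compose, assuming a hypothetical CSV file named custom-words.csv with a word column (both names are illustrative):

# Hypothetical usage; the file name and column name are illustrative
csv_words = load_from_csv('custom-words.csv', 'word')
new_words = get_unique_words(csv_words, spelling_dictionary)
add_to_list(list(new_words), spelling_dictionary)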
Sources
There are many word lists used in digital humanities work as a source against which to compare, but very little has been written on the sources from which those lists are drawn.
One of the most common is the NLTK word list, which is often the default in language processing work. The source of this word list and the broad scope of the words it includes, however, make the dataset problematic for my purposes here. As noted in the README file included with the "words" corpus, the list is the same as the list of words included by default on Unix operating systems (http://en.wikipedia.org/wiki/Words_(Unix)).
This suggests a uniformity in the words list included with all Unix systems. However, the words file included with macOS Sierra and the one included with Ubuntu 16.04.1 are, in fact, quite different. macOS uses a word list derived from the 2nd edition of Webster's International Dictionary (according to the README in /usr/share/dict/). This list is very generous, including all single letters and many uncommon words. Given that my goal here is to identify words that are misspelled and are likely OCR errors, such an extensive list is actually detrimental to the project. A comparison between the two lists reveals that the NLTK word list is, in fact, the list derived from Webster's International Dictionary with 6 additions (see compare_nltk_to_web2.ipynb).
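That comparison is documented in compare_nltk_to_web2.ipynb rather than here, but a minimal sketch of the idea, assuming the NLTK "words" corpus has been downloaded and that /usr/share/dict/web2 exists (as it does on macOS), looks roughly like this:

from nltk.corpus import words

nltk_words = set(w.lower() for w in words.words())
with open('/usr/share/dict/web2', 'r') as f:
    web2_words = set(w.lower() for w in f.read().splitlines())
# Entries in the NLTK list that do not appear in web2 (the handful of additions noted above)
print(nltk_words - web2_words)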
Ubuntu, by contrast, relies on the SCOWL (Spell Checking Oriented Word Lists) package, version 7.1 (2011), for its word lists (http://packages.ubuntu.com/source/precise/scowl). This package provides a series of word lists compiled by Kevin Atkinson, broken into different packages for creating and supporting spell-check software. These packages can be combined into a custom list through the web interface at http://app.aspell.net/create?defaults=en_US, which is what I used to generate the custom SCOWL list loaded below.
In addition, because of the rich biblical language with which Seventh-day Adventists understood and expressed their world, I have added a word list created from the text of the King James Bible. The process by which the text from the Christian Classics Ethereal Library was converted into a list of words is documented in /drafts/code/Create_Scriptural_Word_List.ipynb.
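That notebook is not reproduced here, but a minimal sketch of such a conversion, assuming a plain-text transcription of the KJV saved as kjv.txt (the file name is illustrative), might look like this:

import re
from nltk import word_tokenize

with open('kjv.txt', 'r') as f:
    bible_text = f.read()
# Strip digits and punctuation, tokenize, and keep the unique lowercased words
bible_text = re.sub(r"[0-9,.!?:;\"\(\)]", " ", bible_text)
kjv_words = sorted(set(w.lower() for w in word_tokenize(bible_text)))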
Processing the word lists
Adding the SCOWL Custom list
scowl = load_from_txt('/Users/jeriwieringa/Dissertation/drafts/data/word-lists/SCOWL-wl/words.txt')
len(scowl)
Filtering the SCOWL Custom List
import re

# Identify abbreviations and acronyms (tokens with capital letters after the
# first character), as these are not relevant to the SDA corpus.
regex_1 = re.compile(r'\w[A-Z]+')
scowl2 = [x for x in scowl if not regex_1.match(x)]
len(scowl2)
scowl2[:10]
# Drop one- and two-letter entries, which are too permissive for error detection
scowl3 = [x for x in scowl2 if len(x) > 2]
len(scowl3)
scowl3[:10]
scowl4 = [x.lower() for x in scowl3]
len(set(scowl4))
add_to_list(list(set(scowl4)), spelling_dictionary)
len(spelling_dictionary)
spelling_dictionary[:10]
Adding a word list created from the KJV translation of the Bible
This word list was created in /drafts/code/Create_Scriptural_Word_List.ipynb from a transcription of the KJV Bible.
biblical_language = load_from_txt("/Users/jeriwieringa/Dissertation/drafts/data/word-lists/kjv_bible_wordlist.txt")
len(biblical_language)
biblical_language[:10]
unique_biblical_words = get_unique_words(biblical_language, spelling_dictionary)
len(unique_biblical_words)
list(unique_biblical_words)[:10]
add_to_list(list(unique_biblical_words), spelling_dictionary)
len(spelling_dictionary)
spelling_dictionary[:10]
Checking the word bank
import nltk

# Sum the total number of occurrences of each error token in the document
def calculate_error_totals(errors_list, all_tokens):
    count = 0
    freq_distribution = nltk.FreqDist(all_tokens)
    for each in errors_list:
        frequency = freq_distribution[each]
        count = count + int(frequency)
    return count
import re
from nltk import word_tokenize

def check_words(text, file, spell_dictionary):
    text_cleaned = re.sub(r"[0-9,.!?$\"\(\)]", "", text)
    '''
    Making the choice, for spell-check purposes, to remove the '-' of hyphenated
    words. This allows me to check the value of each part of the combined word
    without having to expand the dictionary too much. It also allows for greater
    variability in the construction of hyphenated words (as was often the case
    in 19th-century writing).
    '''
    text_cleaned = re.sub(r"-", " ", text_cleaned)
    tokens = word_tokenize(text_cleaned)
    tokens_lower = [w.lower() for w in tokens]
    errors = set(tokens_lower) - set(spell_dictionary)
    error_total = calculate_error_totals(errors, tokens_lower)
    print(error_total)
    overview = {'doc_id': file,
                'num_tokens': len(tokens),
                'num_errors': error_total,
                'errors': ', '.join(list(errors))}
    return overview
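The returned overview dictionary makes it straightforward to compute the rough per-page error rate described in the Goals section; a small illustration (the call here is hypothetical):

stats = check_words(content, 'some-page.txt', spelling_dictionary)  # hypothetical call
error_rate = stats['num_errors'] / stats['num_tokens']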
def test_process(file):
    with open(input_dir + file, "r") as f:
        print(file)
        content = f.read()
        print(content)
        stats = check_words(content, file, spelling_dictionary)
        print("Errors: {}".format(stats['errors']))
input_dir = '/Users/jeriwieringa/Dissertation/text/text-current/2016-11-16-corpus-with-preliminary-cleaning/'
test_process('ADV19000601-V02-06-page13.txt')
Saving out the results for use and re-use
import datetime
outdir = "/Users/jeriwieringa/Dissertation/drafts/data/word-lists/"
with open("{}{}-Base-Word-List-SCOWL&KJV.txt".format(outdir, str(datetime.date.today())), 'w') as outfile:
for each in spelling_dictionary:
outfile.write("{}\n".format(each))
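Because the file name is dated, each run produces a traceable snapshot of the dictionary. A later notebook can reload the saved list with the load_from_txt helper defined above (the date in this example is illustrative, since it depends on when the list was generated):

# The date in the file name depends on when the list was generated
spelling_dictionary = load_from_txt(outdir + "2016-12-01-Base-Word-List-SCOWL&KJV.txt")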
System Info at time of last run
# %load shared_elements/system_info.py
import IPython
print(IPython.sys_info())
!pip freeze