Create SpellCheck Wordlists

Create a Word List for OCR Verification

Goals

In order to calculate the rough accuracy of the OCR on the SDA periodicals, I created a word bank against which each token in each periodical page could be compared. While this approach does not capture recognition errors that result in a valid word, it provides a generalized picture of the percentage of each text that is nonsensical and an entry point for determining the ways the OCR has failed.
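
In outline, the check works like this; a minimal sketch of the metric, not the exact code used below (the page_tokens and word_bank names are illustrative):

# Illustrative sketch: share of a page's tokens missing from the word bank
def error_rate(page_tokens, word_bank):
    misses = [t for t in page_tokens if t.lower() not in word_bank]
    return len(misses) / len(page_tokens)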

In [1]:
# Creating a Wordlist against which to evaluate the OCR
spelling_dictionary = []
In [2]:
# Function for pulling words from a CSV file

import csv

def load_from_csv(file_name, column_name):
    word_list = []
    with open(file_name, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            word_list.append(row[column_name])
    word_list = [w.lower() for w in word_list]
    return(word_list)
In [3]:
# Function for pulling words from a txt file

def load_from_txt(file_name):
    with open(file_name, "r") as txt:
        word_list = txt.read().splitlines()
    return(word_list)
In [4]:
# Function for getting the unique set of words when adding new list

def get_unique_words(word_list, existing_list):
    return(set(word_list)-set(existing_list))
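
A quick illustration of the helper with toy values:

get_unique_words(['amen', 'selah', 'amen'], ['amen'])  # returns {'selah'}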
In [5]:
# Helper function to add to spelling dictionary

def add_to_list(word_list, dictionary):
    for each in word_list:
        dictionary.append(each)

Sources

There are many word lists used in digital humanities work as a basis for comparison, but very little has been written on the sources from which those lists are drawn.

One of the most common is the NLTK word list, which is often the default in language processing work. The source of this word list and the broad scope of the words it includes, however, make the dataset problematic for my purposes here. As noted in the README file included with the "words" corpus, the list is the same as the list of words included by default with Unix operating systems (http://en.wikipedia.org/wiki/Words_(Unix)).
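
For reference, the NLTK list can be inspected directly (this assumes the "words" corpus has already been downloaded, e.g. with nltk.download('words')):

import nltk
# nltk.download('words')  # one-time download of the corpus
from nltk.corpus import words

nltk_words = words.words()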

This suggests a uniformity to the words list included with all Unix systems. However, the word lists included with macOS Sierra and with Ubuntu 16.04.1 are, in fact, quite different. macOS uses a word list derived from the 2nd edition of Webster's International Dictionary (according to the README in /usr/share/dict/). That list is very generous, including all single letters and many uncommon words. Given that the goal here is to identify misspelled words that are likely OCR errors, such an extensive list is actually detrimental to the project. A comparison between the two lists reveals that the NLTK word list is, in fact, the list derived from Webster's International Dictionary with 6 additions (see compare_nltk_to_web2.ipynb).
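
A minimal sketch of that comparison, assuming a macOS machine where the Webster's-derived list lives at /usr/share/dict/web2 (the full comparison is in compare_nltk_to_web2.ipynb):

from nltk.corpus import words

web2 = load_from_txt('/usr/share/dict/web2')  # Webster's 2nd ed. list shipped with macOS
print(set(words.words()) - set(web2))  # the handful of NLTK-only additions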

Ubuntu, by contrast, relies upon the SCOWL (Spell Checking Oriented Word Lists) package, version 7.1 (2011), for its word lists (http://packages.ubuntu.com/source/precise/scowl). This package provides a series of word lists compiled by Kevin Atkinson, broken into different packages for creating and supporting spell-check software. Custom lists can be generated from these packages through the SCOWL web interface:

http://app.aspell.net/create?defaults=en_US

In addition, because of the rich biblical language with which Seventh-day Adventists understood and expressed their world, I have added a word list created from the text of the King James Bible. The process by which the text from the Christian Classics Ethereal Library was converted into a list of words is documented in /drafts/code/Create_Scriptural_Word_List.ipynb.
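
In outline, producing that list looks something like the following sketch; 'kjv.txt' is a placeholder for the CCEL transcription, and the actual code is in Create_Scriptural_Word_List.ipynb:

import re

with open('kjv.txt', 'r') as f:  # placeholder path for the CCEL text
    text = f.read()

# Lowercase the text, keep only alphabetic runs (dropping verse numbers and
# punctuation), and reduce to the unique words
kjv_words = sorted(set(re.findall(r"[a-z]+", text.lower())))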

Processing the word lists

Adding the SCOWL Custom list

In [6]:
scowl = load_from_txt('/Users/jeriwieringa/Dissertation/drafts/data/word-lists/SCOWL-wl/words.txt')
In [7]:
len(scowl)
Out[7]:
171131

Filtering the SCOWL Custom List

In [8]:
import re

# Identify the abbreviations, as these are not relevant to the SDA corpus.

regex_1 = re.compile(r'\w[A-Z]+')
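
A quick sanity check on the pattern (illustrative calls): it matches any token whose second and later characters are upper-case, which flags abbreviations while leaving ordinary capitalized words and single letters alone.

regex_1.match('USA')     # match -> filtered out as an abbreviation
regex_1.match('Aachen')  # no match -> kept
regex_1.match('A')       # no match (the pattern needs two characters); short tokens are removed by the length filter below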
In [9]:
scowl2 = [x for x in scowl if not regex_1.match(x)]

len(scowl2)
Out[9]:
169996
In [10]:
scowl2[:10]
Out[10]:
['A',
 "A's",
 'Aachen',
 "Aachen's",
 'Aalborg',
 'Aalesund',
 'Aaliyah',
 "Aaliyah's",
 'Aalst',
 "Aalst's"]
In [11]:
scowl3 = [x for x in scowl2 if len(x) > 2]

len(scowl3)
Out[11]:
169543
In [12]:
scowl3[:10]
Out[12]:
["A's",
 'Aachen',
 "Aachen's",
 'Aalborg',
 'Aalesund',
 'Aaliyah',
 "Aaliyah's",
 'Aalst',
 "Aalst's",
 'Aalto']
In [13]:
scowl4 = [x.lower() for x in scowl3]
In [14]:
len(set(scowl4))
Out[14]:
166169
In [15]:
add_to_list(list(set(scowl4)), spelling_dictionary)
In [16]:
len(spelling_dictionary)
Out[16]:
166169
In [17]:
spelling_dictionary[:10]
Out[17]:
['orienteer',
 'olen',
 'legree',
 "spica's",
 'diageotropism',
 'eunuchs',
 'measurelessly',
 "columbus's",
 'carnotites',
 'pyramiding']

Adding word list created from the KJV translation of the Bible

Wordlist created in /drafts/code/Create_Scriptural_Word_List.ipynb from transcription of the KJV Bible.

In [18]:
biblical_language = load_from_txt("/Users/jeriwieringa/Dissertation/drafts/data/word-lists/kjv_bible_wordlist.txt")
In [19]:
len(biblical_language)
Out[19]:
14275
In [20]:
biblical_language[:10]
Out[20]:
['alexandrians',
 'chastised',
 'murdered',
 'ezri',
 'desire',
 'obadiah',
 'betonim',
 'knocketh',
 'disgrace',
 'lendest']
In [21]:
unique_biblical_words = get_unique_words(biblical_language, spelling_dictionary)
In [22]:
len(unique_biblical_words)
Out[22]:
5116
In [23]:
list(unique_biblical_words)[:10]
Out[23]:
['hakupha',
 'jeshua',
 'nebuzar',
 'izri',
 'hodaviah',
 'ephrathites',
 'shaashgaz',
 'patheus',
 'reasoneth',
 'phaldaius']
In [24]:
add_to_list(list(unique_biblical_words), spelling_dictionary)
In [25]:
len(spelling_dictionary)
Out[25]:
171285
In [26]:
spelling_dictionary[:10]
Out[26]:
['orienteer',
 'olen',
 'legree',
 "spica's",
 'diageotropism',
 'eunuchs',
 'measurelessly',
 "columbus's",
 'carnotites',
 'pyramiding']

Checking the word bank

In [27]:
import nltk

# Count the total occurrences of the error tokens, using a frequency
# distribution over all tokens on the page
def calculate_error_totals(errors_list, all_tokens):
    count = 0
    freq_distribution = nltk.FreqDist(all_tokens)
    for each in errors_list:
        frequency = freq_distribution[each]
        count = count + int(frequency)
    return(count)
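
A toy example of what this computes (illustrative values): the error token 'thc' occurs twice in the token list, so the function returns 2.

calculate_error_totals({'thc'}, ['the', 'thc', 'thc', 'dog'])  # 2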
In [28]:
import re
from nltk import word_tokenize

def check_words(text, file, spell_dictionary):

    text_cleaned = re.sub(r"[0-9,.!?$\"\(\)]", "", text)

    # Making the choice, for spell-check purposes, to remove the '-' of
    # hyphenated words. This allows me to check the value of each part of the
    # combined word without having to expand the dictionary too much, and it
    # allows for greater variability in the construction of hyphenated words
    # (as was often the case in 19th-century writing).
    text_cleaned = re.sub(r"[-]", " ", text_cleaned)

    tokens = word_tokenize(text_cleaned)
    tokens_lower = [w.lower() for w in tokens]

    errors = set(tokens_lower) - set(spell_dictionary)
    error_total = calculate_error_totals(errors, tokens_lower)

    print(error_total)

    overview = {'doc_id': file, 'num_tokens': len(tokens), 'num_errors': error_total, 'errors': ', '.join(list(errors))}

    return(overview)
In [29]:
def test_process(file):
    with open(input_dir + file, "r") as f:
        print(file)
        content = f.read()
        print(content)
        stats = check_words(content, file, spelling_dictionary)
        print("Errors: {}".format(stats['errors']))
In [30]:
input_dir = '/Users/jeriwieringa/Dissertation/text/text-current/2016-11-16-corpus-with-preliminary-cleaning/'
In [31]:
test_process('ADV19000601-V02-06-page13.txt')
ADV19000601-V02-06-page13.txt
 existence. Can you afford to postpone the starting of a church school ? If you do not know what has been written concerning Christian education, insist that the subject be presented. If there are church-school teachers on the grounds, question them.
Literature on the subject will be for sale, See that you have it all.
The words of Jesus are important: Ò Yet a little while is the light with you. Walk while ye have theiight, lest darkness come upon you.Ó
ONE might think that Seventh-day Adventists advocated the slow plod ding of the tortoise, and it may be that there are cases when the slow, steady movement will accomplish more than the rapid one; but whether that applies to educational reform or not, is a question. We are told to make a rush for the king dom of heaven ; and while some of those who counsel to Ò move slowly,Ó Ò take time to consider,Ó are arriving at a final decision, the children are growing to ma turity, and not only growing up, but growing away from the hotne and the church.
Solomon says there is Ò a time for every purpose under heaven.ÕÕ The time has come to start schools for the children. The Lord has told this in the most positive manner. Those who do not take up this duty now will awake some day to find that the work of warning the world has passed on to an other people. It will go to those who are willing to do an educational work, for the third angelÕs message is an educational re
form. There are to-day men of the world who recognize the evils of modern education, and who will take up this work if you let it pass. It is time to have a church school, and to understand why you have one.
TIIE ADVOCATE
187
THE OPPORTUNE TIME.
TEACHERSÕ CONFERENCE BULLETIN.
AFULL report of the proceedings of the TeachersÕ Conference will be issued under cover of the T r a in in g -
School Advocate. Two numbersinJuly
and the regular August issue will be
devoted to this matter. This will give
seventy-five or oue hundred pages of read
ing matter on the subject of Christian
education, which no one who is interested
in the subject can afford to miss. The time
of the Conference will be devoted to the
discussion of such subjects as Ò Educational
work the basis for all Christian growth ; Ó
ÔÔ Character and scope of work of the institu
tions belonging to the system of Christian
education ; Ó Ò Is it possible and practicable
for each church to maintain a school ? ÕÕ live matter in convenient form, see that
Ò Financial support of church schools;Ó they have the three special numbers of the ÒBooksforChristianschools;Ó ÒChange Advocate, knownastheTeachersÕCon of methods necessary in church schools ; Ó f e r e n c e B u l l e t i n . For price see Review 44Does a public school teacher require a and Herald. Send at once.
training in methods of Christian educa tion? Ó and many other subjects of equal importance.
Each subject will be opened by a paper. The names of Elder A. T. Jones, Wm. Covert, N. W. Kauble, and Drs. Kellogg, Paulson, Edwards, and Holden, are among those that appear on the program.
We believe that these papers, together with a stenographic report of the discus sions, will offer material which will prove inestimable.
Subscribers to the A d v o c a te will receive these special numbers without extra charge ; but if you wish others to obtain much in formation on Christian education which is.

75
Errors: òbooksforchristianschools, teachersõ, ôô, paulson, numbersinjuly, l, consideró, :, hotne, tions, youó, ó, sions, kauble, c, f, r, theiight, educa, institu, afull, õõ, w, tion, d, drs, g, wm, u, tiie, turity, òchange, n, slowlyó, ;, t, v, te, ò, knownastheteachersõcon, oue, heavenõõ, ma, e, angelõs, re

Saving out the results for use and re-use

In [32]:
import datetime
outdir = "/Users/jeriwieringa/Dissertation/drafts/data/word-lists/"
with open("{}{}-Base-Word-List-SCOWL&KJV.txt".format(outdir, str(datetime.date.today())), 'w') as outfile:
    for each in spelling_dictionary:
        outfile.write("{}\n".format(each))
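
The saved list can be read back in with the load_from_txt function defined above; the date stamp in the file name below is hypothetical:

reloaded = load_from_txt(outdir + "2016-12-01-Base-Word-List-SCOWL&KJV.txt")  # hypothetical date stamp
assert len(reloaded) == len(spelling_dictionary)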

System Info at time of last run

In [33]:
# %load shared_elements/system_info.py
import IPython
print (IPython.sys_info())
!pip freeze
{'commit_hash': '5c9c918',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': '/Users/jeriwieringa/miniconda3/envs/dissertation2/lib/python3.5/site-packages/IPython',
 'ipython_version': '5.1.0',
 'os_name': 'posix',
 'platform': 'Darwin-16.4.0-x86_64-i386-64bit',
 'sys_executable': '/Users/jeriwieringa/miniconda3/envs/dissertation2/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, '
                '17:52:12) \n'
                '[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]'}
anaconda-client==1.5.5
appnope==0.1.0
argh==0.26.1
beautifulsoup4==4.5.3
blinker==1.4
bokeh==0.12.4
boto==2.43.0
bz2file==0.98
chest==0.2.3
cleanOCR==0.1
cloudpickle==0.2.1
clyent==1.2.2
cycler==0.10.0
dask==0.12.0
datashader==0.4.0
datashape==0.5.2
decorator==4.0.10
docutils==0.12
doit==0.29.0
gensim==0.12.4
Ghost.py==0.2.3
ghp-import2==1.0.1
GoH==0.1
gspread==0.4.1
HeapDict==1.0.0
httplib2==0.9.2
husl==4.0.3
ijson==2.3
ipykernel==4.5.2
ipython==5.1.0
ipython-genutils==0.1.0
ipywidgets==5.2.2
Jinja2==2.8
jsonschema==2.5.1
jupyter==1.0.0
jupyter-client==4.4.0
jupyter-console==5.0.0
jupyter-contrib-core==0.3.0
jupyter-contrib-nbextensions==0.2.2
jupyter-core==4.2.1
jupyter-highlight-selected-word==0.0.5
jupyter-latex-envs==1.3.5.4
jupyter-nbextensions-configurator==0.2.3
llvmlite==0.14.0
locket==0.2.0
Logbook==1.0.0
lxml==3.5.0
MacFSEvents==0.7
Mako==1.0.4
Markdown==2.6.7
MarkupSafe==0.23
matplotlib==2.0.0
memory-profiler==0.43
mistune==0.7.3
multipledispatch==0.4.9
natsort==4.0.4
nb-anacondacloud==1.2.0
nb-conda==2.0.0
nb-conda-kernels==2.0.0
nb-config-manager==0.1.3
nbbrowserpdf==0.2.1
nbconvert==4.2.0
nbformat==4.2.0
nbpresent==3.0.2
networkx==1.11
Nikola==7.7.7
nltk==3.2.2
notebook==4.2.3
numba==0.29.0
numpy==1.12.0
oauth2client==4.0.0
OCRreports==0.1
odo==0.5.0
pandas==0.19.2
partd==0.3.6
path.py==0.0.0
pathtools==0.1.2
pexpect==4.0.1
pickleshare==0.7.4
Pillow==3.4.2
prompt-toolkit==1.0.9
psutil==4.3.0
ptyprocess==0.5.1
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycrypto==2.6.1
Pygments==2.1.3
pyparsing==2.1.10
PyPDF2==1.25.1
PyRSS2Gen==1.1
pyshp==1.2.10
python-dateutil==2.6.0
pytz==2016.10
PyYAML==3.12
pyzmq==16.0.2
qtconsole==4.2.1
requests==2.12.3
rsa==3.4.2
scipy==0.18.1
seaborn==0.7.1
simplegeneric==0.8.1
six==1.10.0
smart-open==1.3.5
terminado==0.6
textblob==0.11.1
toolz==0.8.1
tornado==4.4.2
traitlets==4.3.1
Unidecode==0.4.19
verifyOCR==0.1
watchdog==0.8.3
wcwidth==0.1.7
webassets==0.11.1
widgetsnbextension==1.2.6
ws4py==0.3.4
xarray==0.8.2
Yapsy==1.11.223