2018-02-CreateZipsOfCorpusSubsets
The output of this notebook is eight gzipped tar files: a sample and a holdout archive for each of four configurations of the corpus. These are used to test the stability of the MALLET model as well as the value added by controlling the quality of the documents used to create the models (per [cite]).
Variations are (summarized in the sketch after this list):
- Control - Random sample of documents
- Target - Includes only documents with at least 300 tokens and an error rate under 10%
- Test1 - Includes documents with at least 300 tokens, but ignores error rate
- Test2 - Includes documents with at least 300 tokens and an error rate under 25%
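For reference, the filter thresholds behind each variation can be summarized as a small lookup table. This is an illustrative sketch only; the names and structure are mine, not part of the text2topics library:
# Thresholds for each subset; None means the criterion is not applied.
SUBSET_FILTERS = {
    'control': {'min_tokens': None, 'max_error_rate': None},  # random 40% sample
    'target':  {'min_tokens': 300,  'max_error_rate': 0.10},
    'test1':   {'min_tokens': 300,  'max_error_rate': None},
    'test2':   {'min_tokens': 300,  'max_error_rate': 0.25},
}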
In [31]:
%load_ext autoreload
%autoreload 2
In [32]:
from text2topics import utilities
import math
import numpy as np
import os
import pandas as pd
import tarfile
In [2]:
fullCorpus = "/Users/jeriwieringa/Dissertation/text/text/2017-04-Final-Corpus.tar.gz"
statsDir = "/Users/jeriwieringa/Dissertation/drafts/data/module-3/2017-05-corpus-stats/"
corpusDir = "/Users/jeriwieringa/Dissertation/text/text/2018-02-CorpusSubSets/"
In [3]:
df = pd.read_csv(os.path.join(statsDir, '2017-05-Composite-OCR-statistics.csv'))
In [26]:
df
Out[26]:
In [5]:
fullCorpusObject = tarfile.open(fullCorpus)
Quick sanity check to make sure the file names in the tar archive match the doc_ids in the stats table.
In [6]:
tarPathNames = fullCorpusObject.getnames()[1:]  # skip the first entry (the archive's root directory)
tarFileNames = []
for path in tarPathNames:
    tarFileNames.append(os.path.basename(path))
In [7]:
statsFileNames = df['doc_id'].tolist()
In [14]:
# print(statsFileNames[:10])
# print(tarFileNames[:10])
print(tarPathNames[:10])
In [9]:
len(list(set(tarFileNames)-set(statsFileNames)))
Out[9]:
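The difference above is one-sided: it only catches tar files that are missing from the stats table. A symmetric check (a small sketch, not part of the original notebook) would also catch stats entries missing from the tar:
# Symmetric sanity check: both collections should contain exactly the same names.
assert set(tarFileNames) == set(statsFileNames), 'tar and stats file lists differ'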
Create Random Sample
In [10]:
sampleSize = math.ceil(.4 * len(statsFileNames))  # sample 40% of the corpus, rounded up
# print(sampleSize)
In [11]:
randomSample = np.random.choice(statsFileNames, sampleSize, replace=False).tolist()
# print(randomSample[:10])
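Note that no random seed is set, so the control sample will differ on every run. Seeding NumPy first (a suggestion, not in the original notebook) would make the sample reproducible:
np.random.seed(42)  # hypothetical seed; any fixed value makes the draw repeatable
randomSample = np.random.choice(statsFileNames, sampleSize, replace=False).tolist()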
In [25]:
#https://stackoverflow.com/questions/17616340/add-files-from-one-tar-into-another-tar-in-python
# randomSampleTar = tarfile.open(os.path.join(corpusDir, 'randomSample.tar.gz'), 'w:gz')
# randomHoldoutTar = tarfile.open(os.path.join(corpusDir, 'randomHoldout.tar.gz'), 'w:gz')
# for member in fullCorpusObject.getmembers()[1:]:
#     if os.path.basename(member.name) in randomSample:
#         randomSampleTar.addfile(member, fullCorpusObject.extractfile(member))
#     else:
#         randomHoldoutTar.addfile(member, fullCorpusObject.extractfile(member))
# randomSampleTar.close()
# randomHoldoutTar.close()
In [ ]:
# Abstracted this logic into the text2topics utilities library; use this version on subsequent runs.
# utilities.create_tar_files(corpusDir, 'random', fullCorpusObject, randomSample)
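The library version of the splitting logic is not shown in this notebook. Based on the inline code above and the calls below, a minimal sketch of such a helper might look like the following; the function name matches the calls, but the body and the Sample/Holdout file naming are my reconstruction, not the actual text2topics implementation:
def create_tar_files(out_dir, prefix, corpus_tar, keep_names):
    """Split corpus_tar into <prefix>Sample.tar.gz and <prefix>Holdout.tar.gz (assumed names)."""
    keep = set(keep_names)  # set membership is O(1), unlike a list
    sample = tarfile.open(os.path.join(out_dir, prefix + 'Sample.tar.gz'), 'w:gz')
    holdout = tarfile.open(os.path.join(out_dir, prefix + 'Holdout.tar.gz'), 'w:gz')
    for member in corpus_tar.getmembers()[1:]:  # skip the archive's root directory entry
        if os.path.basename(member.name) in keep:
            sample.addfile(member, corpus_tar.extractfile(member))
        else:
            holdout.addfile(member, corpus_tar.extractfile(member))
    sample.close()
    holdout.close()
With the prefix 'random', this naming would reproduce the randomSample.tar.gz and randomHoldout.tar.gz files created inline above.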
Create Target Subset
In [28]:
target_df = df[(df['num_tokens'] >= 300) & (df['error_rate'] < 0.1)]
In [29]:
len(target_df)
Out[29]:
In [34]:
# utilities.create_tar_files(corpusDir, 'target_300_10_', fullCorpusObject, target_df['doc_id'].tolist())
Create Test Set
For this test, I filtered only by the minimum number of tokens (at least 300), ignoring the error rate.
In [41]:
test_df = df[df['num_tokens'] >= 300]
In [42]:
testList1 = test_df['doc_id'].tolist()
In [43]:
utilities.create_tar_files(corpusDir, 'test_300_noMax_', fullCorpusObject, testList1)
Create Second Test Set
For the second test, I kept the 300-token minimum but raised the error-rate threshold to 25%.
In [44]:
test_df2 = df[(df['num_tokens'] >= 300) & (df['error_rate'] < 0.25)]
In [45]:
testList2 = test_df2['doc_id'].tolist()
In [47]:
utilities.create_tar_files(corpusDir, 'test_300_25_', fullCorpusObject, testList2)
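To confirm a split, the member counts of the two output archives can be checked against the full corpus. A quick sketch, assuming the Sample/Holdout naming from the reconstruction above:
# Hypothetical check: every document should land in exactly one of the two archives.
with tarfile.open(os.path.join(corpusDir, 'test_300_25_Sample.tar.gz')) as s, \
     tarfile.open(os.path.join(corpusDir, 'test_300_25_Holdout.tar.gz')) as h:
    assert len(s.getnames()) + len(h.getnames()) == len(tarPathNames)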