2018-10-18-Top-Docs-by-Topic-Year
In [1]:
import os
import pandas as pd
from text2topics import models
In [2]:
%load_ext autoreload
%autoreload 2
In [3]:
data_dir = "/Users/jeriwieringa/Dissertation/data/"
In [4]:
model = models.MalletModel(os.path.join(data_dir, 'model_outputs', 'target_300_10.18497.state.gz'))
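The .state.gz file is MALLET's gzipped sampling state: a header row naming the columns (#doc source pos typeindex type topic), two lines recording the alpha and beta hyperparameters, and then one row per token. To peek at it without text2topics:
import gzip
# Print the header lines and the first few token rows of the state file
with gzip.open(os.path.join(data_dir, 'model_outputs', 'target_300_10.18497.state.gz'), 'rt') as f:
    for _ in range(6):
        print(next(f).rstrip())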
In [5]:
# Parse the gzipped state file into a dataframe of token-topic assignments
df = model.model()
In [6]:
# Extract the hyperparameter values recorded in the state file header
params = model.extract_params()
In [7]:
# Load metadata
metadata = pd.read_csv(os.path.join(data_dir, 'corpus_metadata', 'meta.csv'), header=None).reset_index()
metadata.columns = ['doc_id', 'filename', 'citation', 'author',
                    'periodical_name', 'volume', 'issue',
                    'date', 'page', 'url', 'abrev']
metadata['date_formatted'] = pd.to_datetime(metadata['date'])
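pd.to_datetime raises on any date string it cannot parse; if the metadata is messy, a defensive variant (my addition, not the original) coerces failures to NaT so the offending rows can be inspected:
metadata['date_formatted'] = pd.to_datetime(metadata['date'], errors='coerce')
metadata[metadata['date_formatted'].isna()]  # rows whose dates failed to parse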
In [8]:
# Load Labels
import gspread
from oauth2client.service_account import ServiceAccountCredentials
# Load data from Google Doc
scope = ['https://spreadsheets.google.com/feeds']
secrets = "/Users/jeriwieringa/Dissertation/dev/code/secrets/dissertation-881847769b13.json"
credentials = ServiceAccountCredentials.from_json_keyfile_name(secrets, scope)
gc = gspread.authorize(credentials)
dts = gc.open('Topic Labels').sheet1
labels = pd.DataFrame(dts.get_all_records())
In [9]:
# Each row of the state file is a single token, so counting rows per
# (#doc, topic) pair gives the tokens assigned to each topic in each document
doc_topic = df.groupby(['#doc', 'topic'])['type'].count().reset_index(name="token_count")
In [10]:
doc_topic[:3]
Out[10]:
In [11]:
# Ensure the topic ids are numeric so they sort and merge correctly
doc_topic['topic'] = pd.to_numeric(doc_topic['topic'])
In [12]:
# Smooth the token counts with the alpha values (params[0]) and normalize
# each document's counts into topic proportions
dt = models.pivot_smooth_norm(doc_topic, params[0], '#doc', 'topic', 'token_count')
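text2topics is the dissertation's own library, so for readers without it, here is a minimal sketch of what pivot_smooth_norm appears to do, judging from its name and arguments. This is my reconstruction, not the library's code:
def pivot_smooth_norm_sketch(df, smoothing, rows, cols, values):
    # Pivot the long-format counts into a #doc x topic matrix
    matrix = df.pivot(index=rows, columns=cols, values=values).fillna(0)
    # Add the per-topic alpha values as a smoothing prior
    smoothed = matrix + smoothing
    # Normalize each row so a document's topic proportions sum to 1
    return smoothed.div(smoothed.sum(axis=1), axis=0)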
In [13]:
# Flatten the doc-by-topic matrix into long format:
# one row per (topic, #doc) pair with its topic proportion
docs = dt.unstack().reset_index(name="topic_proportion")
In [14]:
docs = docs.merge(metadata, how='left', left_on="#doc", right_on="doc_id")
In [15]:
docs[:3]
Out[15]:
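The topic labels pulled from the Google Sheet are not merged in this notebook, but attaching them would follow the same pattern as the metadata merge. A hypothetical sketch; the 'topic' and 'topic_label' column names are assumptions based on the commented-out line in the browsing function below, not confirmed from the sheet:
labels['topic'] = pd.to_numeric(labels['topic'])
docs = docs.merge(labels[['topic', 'topic_label']], how='left', on='topic')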
The question is how I want to distinguish "top" documents. For interpretation, I think I want those documents where the topic is prevalent, accounting for more than 20% of the content. The goal here is to surface the large markers for use in the historical interpretation, rather than the more subtle markers that can be used to show development over time.
Step 1, then, is to filter this frame to those topics that have a proportion >= 0.20 for each document.
In [16]:
top_doc_topics = docs[docs['topic_proportion'] >= 0.2]
In [17]:
top_doc_topics[:3]
Out[17]:
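As a quick sanity check on the 20% cutoff (my addition, not in the original notebook), counting how many documents clear the threshold for each topic shows which topics dominate whole documents and which never do:
# Documents per topic with topic_proportion >= 0.2, largest first
top_doc_topics.groupby('topic')['#doc'].count().sort_values(ascending=False)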
In [18]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
init_notebook_mode(connected=True)
cf.go_offline()
Browse Top Documents per Topic and Year(s)
View which documents have a greater than 20% prevalence of a given topic within a time range.
In [19]:
import ipywidgets as widgets
from ipywidgets import interactive
In [20]:
years = top_doc_topics.date_formatted.dt.year.sort_values().unique().tolist()
topic_ids = top_doc_topics['topic'].unique().tolist()

def top_docs(start_year='', end_year='', topic=''):
    # Filter to documents in the selected year range where the selected
    # topic accounts for at least 20% of the tokens
    docs = top_doc_topics[(top_doc_topics.date_formatted.dt.year >= start_year) &
                          (top_doc_topics.date_formatted.dt.year <= end_year) &
                          (top_doc_topics['topic'] == topic)]
    # display_df = topics[topics['topic', 'date_formatted', 'topic_proportion', 'topic_label']]
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(docs.sort_values('date_formatted'))
In [21]:
start = widgets.Select(options=years)
end = widgets.Select(options=years)
topic_id = widgets.Select(options=topic_ids)
interactive(top_docs, start_year=start, end_year=end, topic=topic_id)
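interactive() wires the three Select widgets to top_docs and renders the filtered table beneath them. The function can also be called directly; the year range and topic id here are placeholder values, not results from the model:
# Hypothetical direct call with placeholder values
top_docs(start_year=1890, end_year=1895, topic=0)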