2018-10-16-Cluster-Topics

The goal of this notebook is to cluster the topics from the generated topic model, using the word frequencies they share. This will assist with:

  • Gaining another view on the overall model
  • Identifying topics associated with those that I have selected through interpretive measures

A few decisions:

In [1]:
import pandas as pd
import os
In [2]:
data_dir = "/Users/jeriwieringa/Dissertation/data/model_analysis/"
In [3]:
# tw_target... generated in 
tw = pd.read_csv(os.path.join(data_dir, 'tw_target_300_10.18497.csv'))
In [4]:
tw[:10]
Out[4]:
topic type word_counts
0 0 aad 2
1 0 aand 3
2 0 abegg 9
3 0 ability 53
4 0 able 1123
5 0 abrams 1
6 0 absent 45
7 0 abundant 41
8 0 abundantly 38
9 0 accepted 168
In [5]:
# Keep the 50 highest-count words for each topic
top_topic_words = tw.groupby('topic').apply(lambda x: x.nlargest(50, 'word_counts')).reset_index(drop=True)
top_topic_words[:10]
Out[5]:
topic type word_counts
0 0 book 36113
1 0 canvasser 16637
2 0 order 15259
3 0 canvassing 13014
4 0 brother 10430
5 0 week 9752
6 0 sold 9408
7 0 field 8662
8 0 agent 7780
9 0 report 7049
In [6]:
import json

with open(os.path.join(data_dir, 'params_target_300_10.18497.json')) as json_data:
    d = json.load(json_data)

beta = d['beta']
In [7]:
# Load Labels

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Load data from Google Doc
scope = ['https://spreadsheets.google.com/feeds']
secrets = "/Users/jeriwieringa/Dissertation/dev/code/secrets/dissertation-881847769b13.json"
credentials = ServiceAccountCredentials.from_json_keyfile_name(secrets, scope)
gc = gspread.authorize(credentials)
dts = gc.open('Topic Labels').sheet1

labels = pd.DataFrame(dts.get_all_records())
In [8]:
tw_matrix = pd.pivot_table(top_topic_words, 
                           index='topic', 
                           columns="type", 
                           values='word_counts', 
                           fill_value=0)
In [9]:
# tw_matrix_smooth = tw_matrix + beta
# Normalize each topic (row) so that its word proportions sum to 1
tw_matrix_normed = tw_matrix.div(tw_matrix.sum(axis=1), axis=0)
In [10]:
tw_matrix_normed[:2]
Out[10]:
type aaron abdomen abdominal abel ability able abolished abraham abram academia ... yes york young young_girl young_man young_people young_woman youth zealand zion
topic
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2 rows × 3063 columns

In [11]:
tw_matrix_normed.sum(axis=1)[:2]
Out[11]:
topic
0    1.0
1    1.0
dtype: float64

Compute the distance between the rows (topics) in the matrix.
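
One way this step could look is sketched below. The notebook does not say which distance metric is used, so cosine distance is an assumption here, and `toy_normed` is a hypothetical stand-in for `tw_matrix_normed`.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for tw_matrix_normed: rows are topics,
# columns are words, and each row sums to 1.
toy_normed = pd.DataFrame(
    [[0.75, 0.25, 0.0],
     [0.50, 0.50, 0.0],
     [0.00, 0.00, 1.0]],
    index=pd.Index([0, 1, 2], name="topic"),
    columns=["book", "order", "health"],
)

# Pairwise distances between rows, in condensed (upper-triangle) form.
# The cosine metric is an assumption made for this sketch.
condensed = pdist(toy_normed.values, metric="cosine")

# Expand to a square topic-by-topic matrix labeled with topic ids.
dist = pd.DataFrame(squareform(condensed),
                    index=toy_normed.index,
                    columns=toy_normed.index)
print(dist.round(3))
```

The square form is convenient for looking up the topics nearest to a selected one, for example with `dist[0].sort_values()`. The condensed form is what `scipy.cluster.hierarchy.linkage` accepts as precomputed distances.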

In [12]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import plotly.figure_factory as ff

import cufflinks as cf

init_notebook_mode(connected=True)
cf.go_offline()