Topic modeling with MALLET

Operating details:

  • MALLET loads the entire corpus into memory, so it needs a machine with plenty of RAM. These runs were done on a DigitalOcean server.
  • The commands were run from the command line as a bash script and are copied here for ease of sharing.

Create corpus files for MALLET

  • Two corpora are imported: the main (sample) corpus and a holdout corpus of documents with low token counts and high error rates
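The split into sample and holdout files is not shown in this notebook. As a minimal sketch, assuming the one-document-per-line layout that `import-file` reads by default (name, label, then text) and a made-up cutoff of 4 tokens, low-token documents could be routed to the holdout file as below. `split_corpus` and `min_tokens` are hypothetical names, and the error-rate criterion is left out because the notebook does not say how it was measured.

```bash
#!/bin/bash
# Sketch only: split a MALLET-style corpus file by token count.
# Assumes one document per line: <name> <label> <text...>, whitespace separated.
# split_corpus and min_tokens are hypothetical, not part of MALLET.

split_corpus() {
  input=$1; sample=$2; holdout=$3; min_tokens=$4
  # Create both outputs up front so they exist even if one side stays empty
  : > "$sample"
  : > "$holdout"
  # NF counts every field; the first two are name and label, the rest are tokens
  awk -v min="$min_tokens" -v s="$sample" -v h="$holdout" '
    NF == 0         { next }               # ignore blank lines
    (NF - 2) >= min { print > (s); next }  # enough tokens: main corpus
                    { print > (h) }        # too short: holdout
  ' "$input"
}

# Small demonstration with made-up documents
demo_dir=$(mktemp -d)
printf '%s\n' \
  'doc1 x alpha beta gamma delta epsilon' \
  'doc2 x alpha beta' \
  'doc3 x alpha beta gamma delta' > "$demo_dir/all.txt"

split_corpus "$demo_dir/all.txt" "$demo_dir/sample.txt" "$demo_dir/holdout.txt" 4
```

In the demonstration, `doc1` and `doc3` go to the sample file and `doc2` goes to the holdout file.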
In [ ]:
%%bash
#!/bin/bash

# https://stackoverflow.com/questions/41218622/mallet-topic-inference
# http://mallet.cs.umass.edu/topics.php

corpus="test_300_noMax"
sample="test_300_noMax_Sample"
holdout="test_300_noMax_Holdout"
# Random tag for this import run. Note the value: the training cell needs it as trainingSeed.
seed=$RANDOM

# Import sample corpus
~/lib/mallet/bin/mallet import-file \
  --input "data/$sample.txt" \
  --output "data/output/$corpus.Sample.$seed.mallet" \
  --keep-sequence \
  --stoplist-file data/finalCorpus_filterList.txt

echo "Sample corpus imported"

# Import holdout corpus
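# --use-pipe-from reuses the Sample corpus's pipe so both files share one vocabulary
~/lib/mallet/bin/mallet import-file \
  --input "data/$holdout.txt" \
  --output "data/output/$corpus.Holdout.$seed.mallet" \
  --use-pipe-from "data/output/$corpus.Sample.$seed.mallet" \
  --keep-sequence \
  --stoplist-file data/finalCorpus_filterList.txt

echo "Holdout corpus imported"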

Create and apply topic model

  • Train model on main corpus
  • Use model to infer topics on holdout documents
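Training has to read the same `.mallet` files that the import step wrote, so the hard-coded `trainingSeed` must match the `$seed` embedded in those file names. A minimal sketch for recovering it from a file name instead of copying it by hand, assuming the `<corpus>.Sample.<seed>.mallet` naming used by the import step; `seed_from_path` is a hypothetical helper.

```bash
#!/bin/bash
# Sketch only: pull the seed out of a path like
#   data/output/<corpus>.Sample.<seed>.mallet
# seed_from_path is a hypothetical helper, not part of MALLET.

seed_from_path() {
  name=${1##*/}               # drop the directory part
  name=${name%.mallet}        # drop the .mallet suffix
  printf '%s\n' "${name##*.}" # keep what follows the last remaining dot
}

# Example using the file name pattern from the import step
seed_from_path "data/output/test_300_noMax.Sample.18040.mallet"
# prints 18040
```

The result could then be assigned with something like `trainingSeed=$(seed_from_path "$file")`, where `$file` is whichever imported corpus should be trained on.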
In [ ]:
%%bash
#!/bin/bash

# https://stackoverflow.com/questions/41218622/mallet-topic-inference
# http://mallet.cs.umass.edu/topics.php

corpus="test_300_noMax"
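# Seed embedded in the .mallet file names written by the import cell
trainingSeed=18040
# New seed for this training run; also tags the output files
seed=$RANDOM
topics=250
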
# Train model on Sample
~/lib/mallet/bin/mallet train-topics \
  --input "data/output/$corpus.Sample.$trainingSeed.mallet" \
  --num-topics "$topics" --optimize-interval 20 --optimize-burn-in 50 \
  --random-seed "$seed" --num-threads 8 \
  --output-state "data/output/$corpus.$seed.t.$topics.state.gz" \
  --output-model "data/output/$corpus.$seed.t.$topics.model" \
  --output-doc-topics "data/output/$corpus.$seed.t.$topics.docTopics.txt" \
  --output-topic-keys "data/output/$corpus.$seed.t.$topics.topicKeys.txt" \
  --diagnostics-file "data/output/$corpus.$seed.t.$topics.diagnostics.xml" \
  --inferencer-filename "data/output/$corpus.Sample.$seed.t.$topics.inferencer"

echo "Model trained"

# Infer topics on holdout documents
~/lib/mallet/bin/mallet infer-topics \
  --inferencer "data/output/$corpus.Sample.$seed.t.$topics.inferencer" \
  --input "data/output/$corpus.Holdout.$trainingSeed.mallet" \
  --output-doc-topics "data/output/$corpus.Holdout.$seed.t.$topics.docTopics.txt" \
  --random-seed "$seed"

echo "Holdouts classified"
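
Once the holdouts are classified, a quick way to inspect the result is to pull out the dominant topic for each document. The sketch below assumes the dense doc-topics layout that recent MALLET releases appear to write: tab-separated document index, document name, then one proportion per topic, with any header line starting with `#`. Older releases wrote sorted topic/proportion pairs instead, so check the file before relying on this. `top_topic` is a hypothetical helper, and the demonstration data is made up.

```bash
#!/bin/bash
# Sketch only: report the highest-weighted topic per document.
# Assumed layout (tab-separated): <doc index> <doc name> <p_topic0> <p_topic1> ...
# top_topic is a hypothetical helper, not part of MALLET.

top_topic() {
  awk -F '\t' '
    /^#/   { next }   # skip header or comment lines
    NF < 3 { next }   # skip lines without any proportions
    {
      best = 3
      for (i = 4; i <= NF; i++) if ($i + 0 > $best + 0) best = i
      # Topic ids are zero-based, so column 3 holds topic 0
      printf "%s\t%d\t%s\n", $2, best - 3, $best
    }
  ' "$1"
}

# Demonstration on a made-up three-topic file
demo=$(mktemp)
printf '#doc name topic proportion ...\n0\tdocA\t0.10\t0.70\t0.20\n1\tdocB\t0.55\t0.05\t0.40\n' > "$demo"
top_topic "$demo"
```

Each output line holds the document name, the zero-based topic id, and that topic's proportion.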