Topic-modeling-with-MALLET
Operating details:
- MALLET loads the entire corpus into memory, so it needs a machine with plenty of RAM. These runs used a DigitalOcean server.
- The scripts were run from the command line as bash scripts and are copied here for ease of sharing.
Create corpus files for MALLET
- Import two corpora: the main corpus and a holdout corpus (documents with low token counts and high error rates)
In [ ]:
%%bash
#!/bin/bash
# https://stackoverflow.com/questions/41218622/mallet-topic-inference
# http://mallet.cs.umass.edu/topics.php
corpus="test_300_noMax"
sample="test_300_noMax_Sample"
holdout="test_300_noMax_Holdout"
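# Random tag for the output filenames; note its value, since the training script
# needs it (as trainingSeed) to locate these .mallet files
seed=$RANDOM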
# Import sample corpus
~/lib/mallet/bin/mallet import-file --input data/$sample.txt --output data/output/$corpus.Sample.$seed.mallet --keep-sequence --stoplist-file data/finalCorpus_filterList.txt
echo "Sample corpus imported"
# Import holdout corpus
~/lib/mallet/bin/mallet import-file --input data/$holdout.txt --output data/output/$corpus.Holdout.$seed.mallet --use-pipe-from data/output/$corpus.Sample.$seed.mallet --keep-sequence --stoplist-file data/finalCorpus_filterList.txt
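The holdout corpus is described above as documents with low token counts and high error rates. The following is a minimal sketch of the token-count part of such a split, assuming MALLET's default one-document-per-line input (name, label, then text) and counting whitespace-separated tokens, which only approximates MALLET's own tokenizer. The function name `split_by_token_count`, the threshold of 5, and the demo file names are hypothetical, and the error-rate criterion is not modeled here.

```shell
#!/bin/bash
# Hypothetical helper: split a one-document-per-line corpus into a main file
# and a holdout file by whitespace token count.
# Usage: split_by_token_count INPUT MAIN_OUT HOLDOUT_OUT MIN_TOKENS
split_by_token_count() {
  local input="$1" main_out="$2" holdout_out="$3" min_tokens="$4"
  # Start from empty output files so reruns do not append to old results.
  : > "$main_out"
  : > "$holdout_out"
  awk -v min="$min_tokens" -v mainfile="$main_out" -v holdfile="$holdout_out" '
    {
      # Fields 1 and 2 are the document name and label; the rest is text.
      tokens = NF - 2
      if (tokens >= min) print >> mainfile
      else print >> holdfile
    }' "$input"
}

# Tiny demonstration on made-up documents (not data from the original runs).
demo_dir=$(mktemp -d)
printf '%s\n' \
  'doc1 x the quick brown fox jumps over the lazy dog' \
  'doc2 x too short' \
  'doc3 x another reasonably long document with enough tokens' \
  > "$demo_dir/corpus.txt"

split_by_token_count "$demo_dir/corpus.txt" \
  "$demo_dir/main.txt" "$demo_dir/holdout.txt" 5

# List the names of the documents routed to the holdout file.
cut -d' ' -f1 "$demo_dir/holdout.txt"
```

The two output files could then be passed to `import-file` in place of the sample and holdout inputs used above; whatever threshold is appropriate depends on the corpus.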
Create and apply topic model
- Train model on main corpus
- Use model to infer topics on holdout documents
In [ ]:
%%bash
#!/bin/bash
# https://stackoverflow.com/questions/41218622/mallet-topic-inference
# http://mallet.cs.umass.edu/topics.php
corpus="test_300_noMax"
# Seed tag taken from the filenames written by the import step
trainingSeed=18040
seed=$RANDOM
topics=250
# Train model on Sample
~/lib/mallet/bin/mallet train-topics \
  --input data/output/$corpus.Sample.$trainingSeed.mallet \
  --num-topics $topics \
  --optimize-interval 20 \
  --optimize-burn-in 50 \
  --random-seed $seed \
  --num-threads 8 \
  --output-state data/output/$corpus.$seed.t.$topics.state.gz \
  --output-model data/output/$corpus.$seed.t.$topics.model \
  --output-doc-topics data/output/$corpus.$seed.t.$topics.docTopics.txt \
  --output-topic-keys data/output/$corpus.$seed.t.$topics.topicKeys.txt \
  --diagnostics-file data/output/$corpus.$seed.t.$topics.diagnostics.xml \
  --inferencer-filename data/output/$corpus.Sample.$seed.t.$topics.inferencer
echo "Model trained"
# Infer topics on holdout documents
~/lib/mallet/bin/mallet infer-topics \
  --inferencer data/output/$corpus.Sample.$seed.t.$topics.inferencer \
  --input data/output/$corpus.Holdout.$trainingSeed.mallet \
  --output-doc-topics data/output/$corpus.Holdout.$seed.t.$topics.docTopics.txt \
  --random-seed $seed
echo "Holdouts classified"
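Once the holdout documents have been classified, a common next step is to pull out the dominant topic for each document. The sketch below assumes the doc-topics file is tab-separated with a dense layout (document index, document name, then one proportion per topic in topic order), possibly preceded by a `#` header line. Older MALLET versions wrote sparse topic/proportion pairs instead, so check the file before relying on this. The function name `top_topic_per_doc` and the demo data are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "name <TAB> topic id <TAB> proportion" for the
# highest-weighted topic of each document in a dense doc-topics file.
top_topic_per_doc() {
  awk -F'\t' '
    /^#/   { next }   # skip a header line if present
    NF < 3 { next }   # skip blank or malformed lines
    {
      best = 3
      for (i = 4; i <= NF; i++)
        if ($i + 0 > $best + 0) best = i
      # Column 3 holds topic 0, so the topic id is the column number minus 3.
      printf "%s\t%d\t%s\n", $2, best - 3, $best
    }' "$1"
}

# Tiny demonstration on made-up proportions (not output from the original runs).
demo_file=$(mktemp)
printf '#doc name topic proportion ...\n0\tdocA\t0.1\t0.7\t0.2\n1\tdocB\t0.6\t0.3\t0.1\n' \
  > "$demo_file"
top_topic_per_doc "$demo_file"
```

With the filenames used above, the function would be pointed at the `Holdout` docTopics file written by `infer-topics`, or at the training docTopics file for the main corpus.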