lightlda.sh: a simple wrapper script for LightLDA.

Daichi Mochihashi
The Institute of Statistical Mathematics
$Id: index.html,v 1.3 2018/10/23 03:09:46 daichi Exp $

Introduction

lightlda.sh is a small package containing a wrapper script that makes LightLDA (Yuan+, WWW 2015) easy to use.
LightLDA is a very useful program for efficiently training huge topic models, thanks to a number of mathematical and algorithmic techniques. However, it employs its own specific data formats for both input and output: lightlda.sh wraps them so that standard input and output formats can be used.

Content

Download: lightlda.sh-0.1.tar.gz

Install

Usage

Using lightlda.sh is simple:
% ./lightlda.sh
lightlda.sh: wrapper to execute LightLDA in a standard way.
usage: lightlda.sh K iters train model [alpha] [beta]
$Id: lightlda.sh,v 1.10 2018/10/23 02:57:28 daichi Exp $
General usage:
% lightlda.sh K iters train model [alpha] [beta]
Here, K is the number of topics, iters is the number of training iterations, train is the training data file, and model is the output directory; alpha and beta are optional Dirichlet hyperparameters (defaulting to 0.1 and 0.01, as seen in the example below). Currently, this script does not support distributed training of LightLDA: for advanced usage, please use the original program directly.
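For instance, to train 20 topics for 500 iterations with explicit hyperparameters (the values below are only illustrative), you would run:
% ./lightlda.sh 20 500 train model 0.1 0.01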

Example

For example, using the sample "train" data file included in this package, you can run lightlda.sh as follows:
% ./lightlda.sh 10 100 train model
alpha = 0.1 beta = 0.01 topics = 10 iters = 100
preparing data at model ..
There are totally 1324 words in the vocabulary
There are maximally totally 16054 tokens in the data set
The number of tokens in the output block is: 16054
Local vocab_size for the output block is: 1323
Elapsed seconds for dump blocks: 0.0150371
docs  = 200
vocab = 1400
size  = 1
executing LightLDA ..
[INFO] [2018-10-23 12:07:46] INFO: block = 0, the number of slice = 1
[INFO] [2018-10-23 12:07:46] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-10-23 12:07:46] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
...
[INFO] [2018-10-23 12:08:46] Server 0: Dump model...
[INFO] [2018-10-23 12:08:46] Server 0 closed.
[INFO] [2018-10-23 12:08:47] Rank 0/1: Multiverso closed successfully.
converting to standard parameters..
reading model..
saving model..
done.
finished.

Data formats

The training data format is almost the same as that of lda or SVMlight, but without a label at the beginning of each line: each line represents one document as a sequence of "word_id:count" pairs. A typical data file looks like this:
1:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
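
If you need to build such a file from raw text, a minimal Python sketch (not part of this package) might look as follows; it assumes word ids are assigned from 1 in order of first appearance, consistent with the sample above.

from collections import Counter

def write_corpus(docs, path):
    # docs is a list of documents, each a list of word strings.
    vocab = {}
    with open(path, "w") as f:
        for doc in docs:
            pairs = []
            for word, count in Counter(doc).items():
                # assign ids 1, 2, ... in order of first appearance
                wid = vocab.setdefault(word, len(vocab) + 1)
                pairs.append((wid, count))
            f.write(" ".join("%d:%d" % p for p in sorted(pairs)) + "\n")
    return vocab

vocab = write_corpus([["a", "b", "b"], ["b", "c"]], "train")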

The output is stored in the model directory; in addition to the raw outputs of LightLDA, a file "model" is created there for easy use.
"model" is gzipped, pickled data that can be loaded in Python as follows:

import gzip
try:
    import cPickle as pickle   # Python 2
except ImportError:
    import pickle              # Python 3

with gzip.open("model", "rb") as gf:
    model = pickle.load(gf)
Then "model['alpha']" is a scalar alpha parameter used for training, and "model['beta']" is VxK matrix of beta parameter, and "model['gamma']" is a NxK matrix of Dirichlet posteriors for each of the training documents.


daichi<at>ism.ac.jp
Last modified: Tue Oct 23 12:13:20 2018