## lda.py: LDA in Python.

Daichi Mochihashi

The Institute of Statistical Mathematics, Tokyo

lda.py is a Python/Cython implementation of standard Gibbs sampling
for latent Dirichlet allocation (Blei+, 2003).
This package is intended primarily for learning and extension; however,
since the core is written in Cython, it runs much faster than a pure Python
implementation and is thus amenable to medium-sized data.

This package is based in part on code by Ryan P. Adams (Princeton).
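For reference, the collapsed Gibbs sampler at the heart of such an implementation resamples the topic of each word token from a distribution proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta), where the n's are counts excluding the current token. A minimal pure-Python sketch of this update (illustrative only; not the package's actual Cython code, and variable names are my own):

```python
import numpy as np

def gibbs_lda(docs, K, V, iters=100, alpha=0.1, beta=0.01, rng=None):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs : list of documents, each a list of word ids in [0, V).
    Returns (theta, phi): doc-topic and topic-word distributions.
    """
    rng = rng or np.random.default_rng(0)
    z = [rng.integers(K, size=len(d)) for d in docs]  # topic of each token
    ndk = np.zeros((len(docs), K))                    # doc-topic counts
    nkw = np.zeros((K, V))                            # topic-word counts
    nk = np.zeros(K)                                  # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1  # remove token
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())            # resample topic
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1  # add it back
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)
    phi = (nkw + beta) / (nk[:, None] + V * beta)
    return theta, phi
```

The Cython version in the package follows the same counting scheme but avoids the Python-level inner loop, which is where the speedup comes from.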

### Download

### Requirements

- Python (developed with Python 3.9.13)
- Cython (developed with Cython 0.29.32)
- Numpy & Scipy

### Install

First, build the Cython module "ldac.so" by simply typing

```
% make
```

which executes "./setup.py build_ext --inplace". That's all!
### Usage

```
% ./lda.py -h
usage: % lda.py OPTIONS train model
OPTIONS
 -K topics  number of topics in LDA
 -N iters   number of Gibbs iterations (default 1)
 -a alpha   Dirichlet hyperparameter on topics (default auto)
 -b beta    Dirichlet hyperparameter on words (default auto)
 -h         displays this help
```

"train" is a data file in the same format as
lda or SVMlight (see below),
and "model" is the name of the model file to be generated.
### Getting started

You can start using lda.py as in the following example, with the
train data file enclosed in this package:

```
% ./lda.py -K 10 -N 100 train model
LDA: K = 10, iters = 100, alpha = 5, beta = 0.01
loading data.. documents = 100, lexicon = 1325, nwords = 16054
initializing..
Gibbs iteration [100/100] PPL = 712.9421
saving model..
done.
```
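The PPL figure printed at each iteration is the per-word training perplexity, i.e. the exponential of the negative average log likelihood per word. Given estimated parameters it can be computed as follows (a sketch under the usual LDA mixture formula; the package's exact bookkeeping may differ):

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Per-word perplexity: exp(-(1/N) * sum_w log p(w|d)).

    theta : (D, K) doc-topic proportions; phi : (K, V) topic-word probs.
    """
    loglik, nwords = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            # p(w|d) = sum_k theta[d,k] * phi[k,w]
            loglik += np.log(theta[d] @ phi[:, w])
            nwords += 1
    return np.exp(-loglik / nwords)
```

Lower is better; a uniform model over a V-word lexicon gives perplexity exactly V.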

This generates a model file "model" in gzipped pickle format.
It can be loaded as follows:

```python
import gzip
import pickle

with gzip.open("model", "rb") as gf:
    model = pickle.load(gf)
```

Then "model['alpha']" is the scalar alpha parameter
used for training, "model['beta']" is a VxK matrix of beta parameters,
and "model['theta']" is an NxK matrix of the estimated topic proportions
of each training document.
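For example, the most probable words of each topic can be read off the columns of the beta matrix. Here a small synthetic model dict stands in for a real loaded model file (key names and shapes follow the description above; the values are made up):

```python
import numpy as np

# synthetic stand-in for model = pickle.load(gf): V = 5 words, K = 2 topics
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(5), size=2).T   # V x K; each column sums to 1
model = {"alpha": 5.0, "beta": beta}

# top 3 word ids of topic k = largest entries of column k
for k in range(model["beta"].shape[1]):
    top = np.argsort(model["beta"][:, k])[::-1][:3]
    print("topic", k, ":", top.tolist())
```

With a real model, the word ids would be mapped back to strings through whatever lexicon was used to number the training data.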

In addition to the model file "model", it also writes a log file "model.log"
in the same directory.
### Data formats

The training data format is almost the same as that of
lda or SVMlight,
without a label at the head of each line.
A typical data file looks like this:

```
0:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
```

- Each line consists of pairs of <feature_id>:<count>, where
*feature_id* is an integer starting from 0 and
*count* is a positive integer.
- Feature ids may be left unused; however, probabilities are normalized
over ids up to the maximum feature id that appears.
- <feature_id>:<count> pairs are separated by (possibly
multiple) white spaces.
The program is coded to tolerate empty lines, but
it is preferable to avoid such unnecessary lines.
- For the complete specification, please refer to SVMlight's page.
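A reader for this format takes only a few lines of Python (a sketch for illustration; the package has its own loader):

```python
def load_svmlight(path):
    """Read id:count lines into a list of (word_id, count) lists."""
    docs = []
    with open(path) as f:
        for line in f:
            pairs = line.split()     # splits on any run of whitespace
            if not pairs:            # tolerate empty lines
                continue
            docs.append([(int(i), int(c))
                         for i, c in (p.split(":") for p in pairs)])
    return docs
```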

daichi<at>ism.ac.jp
Last modified: Tue Jun 20 20:14:21 2023