LDAb.py: Latent Dirichlet Allocation with a background distribution.

Daichi Mochihashi
The Institute of Statistical Mathematics, Tokyo
$Id: index.html,v 1.2 2020/08/03 12:07:07 daichi Exp $

LDAb.py is a Cython implementation of LDA with automatic estimation of a background distribution (i.e. function words), as described in [1] (note that [1] lacks the necessary sampling details).
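
As a rough sketch of the generative assumption (notation ours, not verbatim from [1]): each word token in document d carries a latent binary switch x that decides whether the token is drawn from a shared background distribution or from an ordinary LDA topic,

p(w = v | d) = p(x = 1 | d) p_bg(v) + p(x = 0 | d) \sum_k \theta_{dk} p(v|k)

so that function words common to all documents are absorbed by p_bg, leaving the topics to model content words.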


Requirements

All requirements are satisfied by a standard installation of Anaconda 3.



How to compile

Just type 'make' in this directory. After some warnings, the *.so files are generated by Cython.
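
If 'make' is unavailable, the same Cython compilation can be done with a minimal setup script; a sketch under the assumption that the sources are the *.pyx files in this directory (the actual Makefile may pass different options):

# setup.py (hypothetical; not part of this package)
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("*.pyx"))

Run it as:
% python setup.py build_ext --inplace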

How to start

Type 'ldab.py --help' for command-line usage.
% ldab.py --help
LDAb: LDA with a background distribution.
$Id: ldab.py,v 1.1 2020/08/02 14:03:49 daichi Exp
usage: % ldab.py OPTIONS train.txt model [stopwords]
 -K topics  number of topics in LDA
 -N iters   number of Gibbs iterations (default 1)
 -t freq    threshold of word frequency (default 10)
 -a alpha   Dirichlet hyperparameter on topics (default 50/K)
 -b beta    Dirichlet hyperparameter on words (default 0.01)
 -e eta     Dirichlet hyperparameter on background (default 0.005)
 -h         displays this help
To start with, try an initial run with the sample texts contained in this package (cran.txt in English and knbc.txt in Japanese):
% ldab.py -K 10 -N 50 cran.txt model stopwords.en
% ldab.py -K 20 -N 200 -t 5 knbc.txt model stopwords.ja
where "stopwords.en" or "stopwords.ja" are optional.

To see the learned model "model", use "viewtopic.py" as:

% viewtopic.py model
For complete usage, just type "viewtopic.py".
viewtopic.py lists words by their probability for the background distribution, and by normalized PMI [2] for ordinary topics. To change this behavior, take a look at the code of viewtopic.py.
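
For reference, the NPMI ranking of [2] for word v under topic k can be sketched as follows (illustrative only, not the actual code of viewtopic.py; pvk and pk are assumed inputs):

import numpy as np

def npmi(pvk, pk):
    # pvk: VxK matrix of p(v|k); pk: length-K vector of topic weights p(k)
    pv = pvk @ pk                        # marginal p(v)
    joint = pvk * pk                     # joint p(v,k) = p(v|k) p(k)
    pmi = np.log(joint / np.outer(pv, pk))
    return pmi / (-np.log(joint))        # NPMI = PMI / (-log p(v,k)), cf. [2]

# top 10 words of topic k:
# order = np.argsort(-npmi(pvk, pk)[:, k])[:10]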

Sample text

cran.txt (English)
Cranfield collection for information retrieval. http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
knbc.txt (Japanese)
Kyoto University Blog corpus. http://nlp.ist.i.kyoto-u.ac.jp/kuntt/

Data format

ldab.py is an unsupervised algorithm that requires only raw text of documents. Documents are separated by blank line(s), and words are separated by whitespace. It does not touch the underlying text: it neither lowercases words nor normalizes numerals. If you wish to handle these adequately, please preprocess the input beforehand.
If you have data where each line represents a document, double-space it by:
% sed G input.txt > output.txt
then use "output.txt" for learning.


"model" is a gzipped pickle file of Python dictionary, which have keys:
scalar Dirichlet prior \alpha over \theta
VxK matrix of the \beta parameter of LDA, i.e. {p(v|k)}
scalar Dirichlet prior \eta over the background distribution
NxK matrix of the \theta posterior for each document
Vx1 vector of the background distribution
Vx1 vector of indicators (1/0) of hinted background words
dictionary mapping each word to its id
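
Since "model" is a gzipped pickle, it can be inspected directly from Python (the actual key names are defined by ldab.py):

import gzip
import pickle

with gzip.open("model", "rb") as f:
    model = pickle.load(f)
print(model.keys())    # see the list above for what each key holds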


References

[1] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. "Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model". In Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 241-248.
[2] Gerlof Bouma. "Normalized (Pointwise) Mutual Information in Collocation Extraction". In Proceedings of GSCL 2009, pp. 31-40.

Last modified: Mon Aug 3 21:06:27 2020