NTT Communication Science Laboratories

$Id: lwlm.html,v 1.1 2010/03/19 10:15:06 daichi Exp $

lwlm is an exact, fully Bayesian implementation of the Latent Words Language
Model (Deschacht and Moens, 2009). It automatically learns synonymous words
and infers a context-dependent "latent word" for each word occurrence,
in a completely unsupervised fashion.

Technically, LWLM is a higher-order hidden Markov model over words in which the
latent dimension equals the size of the lexicon, or equivalently a "class-based"
n-gram model in which the number of classes equals the number of word types.
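As a sketch (notation mine, following the description above and the papers below), the generative process has two parts: a latent word is drawn from an n-gram model over latent words, and the observed word is then emitted from its latent word through a translation table:

```latex
% Generative process of LWLM (illustrative; notation is mine):
% h_i : latent word at position i,  w_i : observed word.
\begin{align}
  h_i &\sim p(h_i \mid h_{i-n+1}, \ldots, h_{i-1}) && \text{(latent-word $n$-gram model)} \\
  w_i &\sim p(w_i \mid h_i) && \text{(word translation table)}
\end{align}
```

Because each $h_i$ ranges over the whole lexicon, the hidden state space is as large as the vocabulary itself.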

This high dimensionality and higher-order model structure preclude ordinary
EM algorithms and necessitate a Gibbs sampler for learning.

In this toolkit, I further extended the model to be fully Bayesian, and newly
introduced a "sliced" Gibbs sampler to cope with the difficulty posed by the
high dimensionality of the latent words that must be considered.
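The idea behind slice sampling a discrete variable (Neal, 2003) can be sketched in a few lines. This is a generic illustration in Python, not the toolkit's actual C++ code, and the function name is mine:

```python
import random

def slice_gibbs_step(probs, current):
    """One slice-sampling update for a discrete latent variable.

    probs   : (unnormalized) probabilities over latent-word candidates
    current : index of the current latent word

    Instead of normalizing over the whole lexicon, draw a slice level
    u ~ Uniform(0, probs[current]) and resample uniformly from the
    candidates whose probability lies above the slice.
    """
    u = random.uniform(0.0, probs[current])
    candidates = [i for i, p in enumerate(probs) if p > u]
    return random.choice(candidates)
```

This listing still scans all candidates to find those above the slice; the point of the real implementation is that, with a suitable per-node data structure, candidates below the slice can be skipped without ever scoring them.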

Additionally, lwlm includes a complete implementation of the hierarchical Pitman-Yor language model (HPYLM) (Teh, 2006) as a submodel.

- GNU C++ compiler (developed with g++ 4.1.1).
- Note that the implementation uses <ext/hash_map>, which appears to have been removed in recent versions of g++.

- Google sparsehash must be installed beforehand (built with sparsehash-1.6).

**Source:** lwlm-0.1.tar.gz

**Precomputed Models:**

- New York Times model:
- model-nyt.n3.tar.gz (*recommended*) [4.9MB] - A model learned from 100K random sentences of the New York Times, January 2006 (about 2.1 million words). Penn tokenization. For details, see the enclosed README.

- Japanese "Mainichi" model:
- model-mai2007.n3.tar.gz (*recommended*) [4.7MB] - A model built from the 2007 texts of the Mainichi Newspaper, a standard Japanese newspaper; it contains 2.4 million words in 100K sentences. For details, see the enclosed README.
- FSNLP "austen" model:
- model-austen.tar.gz [3.2MB] - A model built from the text of Jane Austen's novels, available from the FSNLP (Foundations of Statistical Natural Language Processing, MIT Press) site for the language modeling chapter. It includes both the original text and the model learned after 400 iterations of the Gibbs sampler. Note that this model was built on a very small text.

- Japanese "Kyoto" model:
- model-kyoto.tar.gz [2.2MB] - A model built from the "Kyoto corpus", a standard Japanese corpus containing 1 million words in 38,400 sentences. For copyright reasons, only the learned model is included in this file (the original text and learned latent words are excluded). Note that this model is still small, but it suffices to demonstrate the ability of LWLM for natural language processing.

- Install Google sparsehash if necessary.
- Take a glance at the Makefile and edit it accordingly.
- Type "make all". This builds both "lwlm" (the learner) and "lwlm-decode" (the decoder).
- If any problem occurs, please inform the author.

% ./lwlm -n 3 -N 10 persuation.txt model.austen
latent ngram order = 3, alpha = 0.01
reading persuation.txt.. done.
vocabulary = 5756
parsing 1005/ 1005 ..done.
[10] : sampling 84149/ 84148.. ETA: 0:00:00 (6678 words/sec) PPL = 88.1006
done.
saving model.. done.

Or, just download the precomputed model from above and unpack it:

% tar xvfz model-austen.tar.gz

To decode latent words in a text, invoke lwlm-decode as (in zsh):

% ./lwlm-decode -K 5 -o out =(tail -10 persuation.txt) model.austen

This yields a file "out" which contains the decoded latent word distributions.

For complete command-line options, try "lwlm -h" and "lwlm-decode -h" or see the texts below.

`lwlm` works on any vanilla text file. Each line of the text represents a sentence, consisting of words separated by (possibly multiple) white spaces. Empty lines are ignored.

Each model directory has the following file structure:

+ model --+-- lexicon   Words beginning from internal id = 0.
          +-- latent    Sampled latent words corresponding to the training text.
          +-- table     Learned table of hidden -> observed translations.
          +-- param     Model parameters (order and alpha)
          +-- log       Log file during computation, esp. PPL (see below)
          +-- lm        HPYLM
                +-- nodes   Nodes and CRP restaurants
                +-- tree    Dependencies between nodes
                +-- param   HPYLM parameters (order, d, theta, lexicon)

% ./lwlm -h
lwlm, The Latent Words Language Model.
$Id: lwlm.html,v 1.1 2010/03/19 10:15:06 daichi Exp $
Copyright (C) 2010 Daichi Mochihashi, All rights reserved.
usage: lwlm OPTIONS train model
OPTIONS
 -n    n-gram order (default 3, trigram)
 -a    Dirichlet hyperparameter of translation table (default 0.01)
 -N    number of Gibbs iterations
 -h    help

- -n [n]
- n-gram order of latent words (default 3). Generally, a larger n requires more data for plausible inference.
- -a [a]
- Dirichlet pseudo count for the word translation matrix. Smaller counts induce peaky latent word distributions, while larger counts explore more possibilities for them.
- -N [N]
- Number of Gibbs iterations. Generally, several hundred iterations suffice for approximate convergence. For assessing convergence, see the section below.
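The role of "-a" can be seen from the standard Dirichlet-multinomial posterior mean for a translation probability. This is a sketch of the usual smoothing formula, not code lifted from the lwlm source:

```python
def translation_prob(count, total, alpha, vocab_size):
    """Posterior mean of a Dirichlet-multinomial translation
    probability p(observed | latent): the raw count is smoothed by
    the pseudo count alpha, spread over vocab_size word types."""
    return (count + alpha) / (total + alpha * vocab_size)
```

With a small alpha (e.g. 0.01) observed counts dominate and the distribution stays peaky; with a large alpha the probabilities shrink toward the uniform value 1/vocab_size.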

% ./lwlm-decode -h
lwlm-decoder, The Latent Words Language Model decoder.
$Id: lwlm.html,v 1.1 2010/03/19 10:15:06 daichi Exp $
Copyright (C) 2010 Daichi Mochihashi, All rights reserved.
usage: lwlm-decode OPTIONS test model
OPTIONS:
 -B [b]    burn-in iterations (default 100)
 -N [n]    number of iterations to collect samples (default 100)
 -I [i]    interval to collect posterior samples (default 1)
 -K [K]    number of possible candidates (default 5)
 -e [eps]  threshold for candidate probability (default 0)
 -f [fmt]  printf format of candidates (default "%s (%.03f) ")
 -o [file] filename for output
 -h        this help

`lwlm-decode` finds latent words for the input text by running Gibbs iterations to collect posterior samples. The possible parameters are shown above; if the "-o" option is not specified, decoded results are printed to stdout.

**Note:** to change the output format, e.g. to create a data file
for another processing task, use the "-f format" option.
This format takes two arguments,
a latent word (char *) and its probability (double); thus '-f "%s:%.04f "'
yields an output like "foo bar:0.6704 qux:0.2000 quux:0.1250".
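Assuming an output line of the '-f "%s:%.04f "' shape shown above (the observed word followed by candidate:probability pairs), it can be parsed like this. This is an illustrative helper of mine, not part of the toolkit:

```python
def parse_decoded_line(line):
    """Parse one decoder output line of the assumed form
    'observed cand1:p1 cand2:p2 ...' into the observed word and a
    dict of latent-word candidates with their probabilities."""
    tokens = line.split()
    observed, pairs = tokens[0], tokens[1:]
    candidates = {}
    for pair in pairs:
        word, prob = pair.rsplit(":", 1)  # rsplit keeps colons in words
        candidates[word] = float(prob)
    return observed, candidates
```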

However, naive thresholding would still require computing probabilities over all words; therefore lwlm maintains a special data structure (record *) at each n-gram node that makes it possible to prune the candidates efficiently.
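The benefit of keeping candidates presorted at each node can be sketched as follows (an illustration of the -K / -e pruning behavior under my own naming, not the toolkit's actual record structure):

```python
def top_candidates(sorted_cands, K=5, eps=0.0):
    """Prune latent-word candidates in the spirit of -K and -e.

    sorted_cands : (word, prob) pairs sorted by descending prob, as a
                   precomputed per-node structure would keep them.
    Because the list is sorted, we can stop at the first candidate
    below eps instead of scoring the whole lexicon.
    """
    out = []
    for word, prob in sorted_cands:
        if len(out) >= K or prob < eps:
            break
        out.append((word, prob))
    return out
```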

- "Using the Latent Words Language Model for Semi-Supervised Semantic Role Labeling". Koen Deschacht and Marie-Francine Moens. EMNLP 2009. [PDF]
- "The Latent Words Language Model". Koen Deschacht and Marie-Francine Moens. In Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning (Benelearn 09), 2009. [PDF]
- "Slice sampling". Radford M. Neal. Annals of Statistics, 31(3), pp.705-767, 2003.
- "A Bayesian Interpretation of Interpolated Kneser-Ney". Yee Whye Teh. Technical Report TRA2/06, School of Computing, NUS, 2006.

daichi<at>cslab.kecl.ntt.co.jp Last modified: Thu Mar 25 15:44:58 2010