YamCha: Yet Another Multipurpose CHunk Annotator

$Id: index.html,v 1.37 2005/12/24 14:18:58 taku Exp $;

Introduction

YamCha is a generic, customizable, open source text chunker designed for a wide range of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995.

YamCha is exactly the same system that performed best in the CoNLL2000 Shared Task (Chunking) and in the BaseNP Chunking task.

Table of contents

Features

News

Download

Installation

Usage

Training and Test file formats

Both the training file and the test file need to be in a particular format for YamCha to work properly. Generally speaking, training and test files must consist of multiple tokens, and each token consists of multiple columns (a fixed number of them). The definition of a token depends on the task; in most typical cases, tokens simply correspond to words. Each token must be written on one line, with the columns separated by white space (spaces or tab characters). A sequence of tokens forms a sentence. To mark the boundary between sentences, put an empty line (or 'EOS') between them.

You can give as many columns as you like; however, the number of columns must be the same for all tokens. Furthermore, the columns carry some kind of "semantics". For example, the first column is the 'word', the second column is the 'POS tag', the third column is the 'sub-category of POS', and so on.

The last column represents the true answer tag, which the SVMs are trained to predict.

Here's an example of such a file (data for the CoNLL shared task):

He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O

He        PRP  B-NP
reckons   VBZ  B-VP
..

There are 3 columns for each token: the word, its POS tag, and the chunk (answer) tag.
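As an illustration of the format described above, here is a minimal Python sketch that groups token lines into sentences. It is not part of YamCha; the function name read_sentences is hypothetical, and I assume that 'EOS' appears alone on its own line when it is used as a sentence boundary.

```python
def read_sentences(lines):
    """Group whitespace-separated token lines into sentences.

    A blank line, or a line containing only 'EOS' (an assumption about
    how the marker is written), ends the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        stripped = line.strip()
        if stripped == "" or stripped == "EOS":
            if current:
                sentences.append(current)
                current = []
            continue
        # Columns may be separated by any run of spaces or tabs.
        current.append(stripped.split())
    if current:
        sentences.append(current)
    return sentences


sample = """\
He        PRP  B-NP
reckons   VBZ  B-VP

the       DT   B-NP
current   JJ   I-NP
EOS
"""
sentences = read_sentences(sample.splitlines())
print(len(sentences))   # 2
print(sentences[0][1])  # ['reckons', 'VBZ', 'B-VP']
```

Each sentence is a list of tokens, and each token is a list of its columns, so the answer tag of a token is simply its last element.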



The following data is invalid, since the second and third lines have only 2 columns. (They have no POS column.) The number of columns must be fixed.

He        PRP  B-NP
reckons   B-VP
the       B-NP
current   JJ   I-NP
account   NN   I-NP
..
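A file can be checked for this kind of mistake before training. The following Python sketch is only an illustration and is not shipped with YamCha; the function name find_column_errors is hypothetical, and it assumes that the first token line has the intended number of columns.

```python
def find_column_errors(lines):
    """Return (line_number, found, expected) for every token line whose
    column count differs from that of the first token line."""
    expected = None
    errors = []
    for number, line in enumerate(lines, start=1):
        columns = line.split()
        # Skip sentence boundaries (empty line or a lone 'EOS').
        if not columns or columns == ["EOS"]:
            continue
        if expected is None:
            expected = len(columns)
        elif len(columns) != expected:
            errors.append((number, len(columns), expected))
    return errors


invalid = [
    "He        PRP  B-NP",
    "reckons   B-VP",
    "the       B-NP",
    "current   JJ   I-NP",
    "account   NN   I-NP",
]
errors = find_column_errors(invalid)
print(errors)  # [(2, 2, 3), (3, 2, 3)]
```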

Here is an example for English POS tagging.
There are 12 feature columns in total, followed by the answer tag (the POS) in the last column: 1: word, 2: contains number (Y/N), 3: capitalized (Y/N), 4: contains symbol (Y/N),
5..8: prefixes of length 1 to 4, 9..12: suffixes of length 1 to 4.
If there is no entry for a column, a dummy field ("__nil__") is assigned.

Rockwell N Y N R Ro Roc Rock l ll ell well NNP
International N Y N I In Int Inte l al nal onal NNP
Corp. N Y N C Co Cor Corp . p. rp. orp. NNP
's N N N ' 's __nil__ __nil__ s 's __nil__ __nil__ POS
Tulsa N Y N T Tu Tul Tuls a sa lsa ulsa NNP
unit N N N u un uni unit t it nit unit NN
said N N N s sa sai said d id aid said VBD
..
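Rows like these can be produced by a small script. The Python sketch below shows one possible way to build the 12 feature columns for a word. It is not the script that produced the data above: the function names are hypothetical, and since the exact definition of "symbol" is not given here, the ASSUMED_SYMBOLS set is only a guess chosen for illustration.

```python
NIL = "__nil__"

# Assumption: which characters count as a "symbol" is not specified,
# so this set is purely illustrative.
ASSUMED_SYMBOLS = set("-/$%&#")


def yes_no(flag):
    return "Y" if flag else "N"


def word_features(word, symbols=ASSUMED_SYMBOLS):
    """Return the 12 feature columns for one word (answer tag not included)."""
    # Prefixes and suffixes of length 1 to 4; __nil__ if the word is too short.
    prefixes = [word[:k] if len(word) >= k else NIL for k in range(1, 5)]
    suffixes = [word[-k:] if len(word) >= k else NIL for k in range(1, 5)]
    return [
        word,
        yes_no(any(ch.isdigit() for ch in word)),   # contains number
        yes_no(word[:1].isupper()),                 # capitalized
        yes_no(any(ch in symbols for ch in word)),  # contains symbol (assumed set)
    ] + prefixes + suffixes


# The answer tag is appended as the last column of a training line.
print(" ".join(word_features("Rockwell") + ["NNP"]))
# Rockwell N Y N R Ro Roc Rock l ll ell well NNP
```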

Training and Testing

The first step in using YamCha is to create training and test files. Here, I take the Base NP Chunking task as a case study.

Assume a data set like this. The first column represents a word. The second column represents the POS tag associated with the word. The third column is the true answer tag associated with the word (I, O or B). The chunks are represented using the IOB2 model. Sentences are assumed to be separated by one blank line.
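To make the IOB2 representation concrete, here is a minimal Python sketch that recovers the chunks from a sequence of I/O/B tags: B starts a chunk, I continues it, and O is outside any chunk. The function name iob2_chunks is hypothetical, and treating a stray I (one without a preceding B) as the start of a chunk is my own assumption, not something YamCha defines.

```python
def iob2_chunks(tags):
    """Return (start, end) token index pairs, end exclusive, one per chunk."""
    chunks = []
    start = None
    for index, tag in enumerate(tags):
        if tag == "B":
            # A new chunk begins; close the previous one if it is still open.
            if start is not None:
                chunks.append((start, index))
            start = index
        elif tag == "O":
            if start is not None:
                chunks.append((start, index))
            start = None
        elif tag == "I" and start is None:
            # Assumption: a stray I opens a chunk instead of raising an error.
            start = index
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks


words = ["Rockwell", "International", "Corp.", "'s", "Tulsa", "unit", "said"]
tags = ["B", "I", "I", "B", "I", "I", "O"]
chunks = [" ".join(words[s:e]) for s, e in iob2_chunks(tags)]
print(chunks)  # ['Rockwell International Corp.', "'s Tulsa unit"]
```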

First of all, run yamcha-config with the --libexecdir option. It prints the location of the Makefile used for training. Copy that Makefile to your local working directory.

% yamcha-config --libexecdir
/usr/local/libexec/yamcha
% cp /usr/local/libexec/yamcha/Makefile .

There are two mandatory parameters for training: CORPUS (the training data file) and MODEL (the prefix of the model files to be generated).

Here is an example in which CORPUS is set to 'train.data' and MODEL is set to 'case_study'.

% make CORPUS=train.data MODEL=case_study train
/usr/bin/yamcha  -F'F:-2..2:0.. T:-2..-1' < train.data > case_study.data
perl -w /usr/local/libexec/yamcha/mkparam   case_study < case_study.data
perl -w /usr/local/libexec/yamcha/mksvmdata case_study
.. omit

After training, the following files are generated.

% ls case_study.*
case_study.log           : log of training
case_study.model         : model file (binary, architecture dependent)
case_study.txtmodel.gz   : model file (text, architecture independent)
case_study.se            : support examples
case_study.svmdata       : training data for SVMs

OK, let's parse the test data using the model file generated above (case_study.model). Simply use the command:

% yamcha -m case_study.model < test.data 
Rockwell        NNP     B       B
International   NNP     I       I
Corp.           NNP     I       I
's              POS     B       B
Tulsa           NNP     I       I
unit            NN      I       I
said            VBD     O       O
...

The last column is the estimated tag. If the 3rd column is the true answer tag, you can evaluate the accuracy by simply comparing the 3rd and 4th columns.
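The comparison can be automated. The following Python sketch computes the per-token accuracy from output like the one above. It is not a YamCha tool; the function name token_accuracy is hypothetical, and it assumes that the true answer tag is the second-to-last column and the estimated tag is the last one.

```python
def token_accuracy(lines, gold_column=-2, predicted_column=-1):
    """Fraction of tokens whose estimated tag equals the true answer tag."""
    correct = total = 0
    for line in lines:
        columns = line.split()
        # Skip sentence boundaries and anything too short to hold both tags.
        if len(columns) < 2 or columns == ["EOS"]:
            continue
        total += 1
        if columns[gold_column] == columns[predicted_column]:
            correct += 1
    return correct / total if total else 0.0


output = [
    "Rockwell        NNP     B       B",
    "International   NNP     I       I",
    "Corp.           NNP     I       I",
    "'s              POS     B       B",
    "Tulsa           NNP     I       I",
    "unit            NN      I       I",
    "said            VBD     O       O",
]
accuracy = token_accuracy(output)
print(accuracy)  # 1.0
```

Note that this is a per-token measure. It does not tell you how many whole chunks were identified correctly.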

Parameter Tuning

Enable Fast Chunking

Classification with SVMs is much more costly than with other algorithms, such as maximum entropy or decision lists. To make chunking fast, YamCha applies two algorithms, PKI and PKE. PKI and PKE are about 3-12 times and 10-300 times faster than the original SVMs, respectively. By default, PKI is used. To enable PKE, recompile the model files with the -e option:

% yamcha-mkmodel -e foo.txtmodel.gz foo.model
% yamcha -m foo.model < ...

If -e is not given, PKI is employed.

PKI and PKE have the following properties:

Here is an example where NUM and SIGMA are set to 1 and 0.0001, respectively.

% yamcha-mkmodel -e -n 1 -s 0.0001 foo.txtmodel.gz foo.model

Please see our paper for details.

Other options

See here.

Bibliography

Acknowledgments

I would like to thank all the people who were involved in the development of this software: the members of the Computational Linguistics Laboratory at NAIST, and also the following individuals:

Links


$Id: index.html,v 1.23 2003/01/06 13:11:21 taku-ku Exp $;

taku@chasen.org