YamCha: Yet Another Multipurpose CHunk Annotator

$Id: index.html,v 1.37 2005/12/24 14:18:58 taku Exp $;

Introduction

YamCha is a generic, customizable, open source text chunker aimed at a wide range of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995.

YamCha is exactly the same system that performed best in the CoNLL-2000 Shared Task (chunking and base NP chunking).

Features
News
Download
Installation
Usage
Bibliography
Acknowledgments
Links

Features

Moderately high-performance chunker based on Support Vector Machines
Task-independent: can be trained and tested on any data that can be cast as a "generic" text chunking task
Uses PKE/PKI, which make classification (chunking) faster than the original SVMs. For details, see "Enable Fast Chunking" under Usage.
Configurable feature sets (window size), parsing direction (forward/backward), and multi-class strategy (pair wise/one vs rest)
Practical chunking time (1 or 2 seconds per sentence, though it depends heavily on the task)
Can perform partial chunking
C/C++ library

News

2005-09-05: yamcha 0.33 Released
- Fix bugs
- Support bag-of-words feature (experimental)
- Support 64 bit machine (experimental)
2004-11-29: yamcha 0.32
- Fix FATAL bugs in feature_index.cpp
  Multiple feature templates (e.g., F:-2..-1:4..15 F:1..2:4..15 F:0..0:4..) didn't work in the previous releases.
  Mr. Sameer Pradhan gave me a detailed report on this bug.
- Support NULL '\0' string in the parameters. Mr. Kazuma Takaoka gave me a patch to fix the bug.
2004-10-03: yamcha 0.31 Released
- Fix bugs in mkmodel.
  mkmodel did not work well with Perl 5.8.1 or higher.
- Perl/Python/Ruby modules are ready (experimental release)
2004-03-28: yamcha 0.30 Released
- Change file formats of binary models.
  The compatibility of binary models is broken. You need to recompile model files as follows:
```
% yamcha-mkmodel foo.txtmodel.gz foo.model
```
  where foo.txtmodel.gz is a text model file generated with the option, MODEL=foo. Text model files will be found in your working directories.
- Support PKE (Polynomial Kernel Extended) which makes chunking (classification) speed significantly faster than the original yamcha.
  To enable PKE, create model files with -e option.
```
% yamcha-mkmodel -e foo.txtmodel.gz foo.model
```
  Note that PKE is an approximation of the original SVMs.
  Please see "Enable Fast Chunking" under Usage for details.
- C API is ready.
2003-08-15: yamcha 0.27 Released
- Update darts library
2003-07-04: yamcha 0.26 Released
- Fix the padding problem (bus error) arising in SPARC processor
2003-04-06: yamcha 0.25 Released
- Fix FATAL error on compiling ONE-VS-REST model file. With this bug, YamCha would output an internal reserved tag (__OTHER__) instead of correct answer tag.
  If you have *.txtmodel.gz files learned with ONE-VS-REST mode, please try the following command to re-create a correct model file
```
% /usr/local/libexec/yamcha foo.txtmodel.gz foo.model
```
2003-03-29: yamcha 0.24 Released
- Fix inefficiencies of mkmodel and mkdarts
2003-01-31: yamcha 0.23 Released
- Fix memory leak bugs
- Use Visual Studio .NET instead of MinGW to build Windows binary
2003-01-06: yamcha 0.22 Released
- Update param.cpp, yet another command line parser
2003-01-01: yamcha 0.21 Released
- Update darts library
2002-11-11: yamcha 0.2 Released
- Modernize much old-fashioned code
- Can select strategies for multi-class problem: pair-wise or one vs rest
- Save memory when a large model file is generated
- Fix bug about column_size parameter
- Use mmap(3) to make the time for initialization faster
- Supports gcc 3.2, Borland C++, and Visual Studio .NET
- Change the default sentence boundary marker from "EOS" to empty.
- Delete Perl/Ruby modules (I will rewrite them using SWIG)
- Delete -f option, use -V option instead.
2001-7-09: yamcha 0.1
- Initial Release!

Download

YamCha is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License.
YamCha is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
Please let me know if you use YamCha for research purposes or find any research publication in which YamCha is applied.
Source
- yamcha-0.33.tar.gz: HTTP
Binary package for MS-Windows
- HTTP
  Windows Version does not contain the training programs.
  Model files generated under 'i386 Linux' can be used in MS-Windows version.

Installation

Requirements
- perl 5.00x or higher
- GNU make
- sort, uniq, rm, cat (they are fundamental UNIX tools)
- TinySVM
- C++ compiler (gcc 2.95 or higher)
How to make
```
% ./configure 
% make
% make check
% su
# make install
```
You can change the default install path with the --prefix option of the configure script.
Try the --help option to find out about other options.

Usage

Training and Test file formats

Both the training file and the test file need to be in a particular format for YamCha to work properly. Generally speaking, training and test files must consist of multiple tokens, and each token consists of multiple (but a fixed number of) columns. The definition of a token depends on the task, but in most typical cases tokens simply correspond to words. Each token must be written on one line, with the columns separated by white space (spaces or tab characters). A sequence of tokens forms a sentence. To mark the boundary between sentences, put an empty line between them (or another marker such as 'EOS', set with the -e option).

You can give as many columns as you like, but the number of columns must be the same for all tokens. Furthermore, the columns carry fixed "semantics": for example, the 1st column is the word, the 2nd column is the POS tag, the 3rd column is the POS sub-category, and so on.

The last column holds the true answer tag, which the SVMs are trained to predict.

Here's an example of such a file (data for the CoNLL shared task):

He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O

He        PRP  B-NP
reckons   VBZ  B-VP
..

There are 3 columns for each token:

The word itself (e.g. reckons)
The part-of-speech tag associated with the word (e.g. VBZ)
The chunk (answer) tag, represented in IOB2 format
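As a rough illustration of this file format, the following Python sketch groups token lines into sentences. It is not part of YamCha; the function name `read_sentences` and its `boundary` parameter are assumptions made for this example.

```python
def read_sentences(lines, boundary=""):
    """Group whitespace-separated token lines into sentences.

    `boundary` is the sentence boundary marker: "" means an empty line,
    and e.g. "EOS" means a line containing only that string.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if line == boundary or line == "":
            # A boundary (or blank) line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split())  # one token = a list of columns
    if current:
        sentences.append(current)
    return sentences


sample = """He PRP B-NP
reckons VBZ B-VP

He PRP B-NP
""".splitlines()
sentences = read_sentences(sample)
print(len(sentences), sentences[0][1])  # 2 ['reckons', 'VBZ', 'B-VP']
```

Each sentence is a list of tokens and each token is a list of columns, so the answer tag of a token is simply its last element.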

The following data is invalid because the second and third tokens have only 2 columns (they have no POS column). The number of columns must be the same for every token.

He        PRP  B-NP
reckons   B-VP
the       B-NP
current   JJ   I-NP
account   NN   I-NP
..
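A check like this can be automated before training. The sketch below is a hypothetical helper (not a YamCha tool) that reports every token whose column count differs from that of the first token in the sentence.

```python
def check_column_count(sentence):
    """Return (token_index, column_count) for tokens that differ from the first token."""
    if not sentence:
        return []
    expected = len(sentence[0])
    return [(i, len(tok)) for i, tok in enumerate(sentence) if len(tok) != expected]


bad = [line.split() for line in [
    "He PRP B-NP",
    "reckons B-VP",
    "the B-NP",
    "current JJ I-NP",
]]
problems = check_column_count(bad)
print(problems)  # [(1, 2), (2, 2)]
```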

Here is an example of English POS tagging.
There are 12 feature columns in total, followed by the answer tag: 1: word, 2: contains a number (Y/N), 3: capitalized (Y/N), 4: contains a symbol (Y/N),
5..8: prefixes of length 1 to 4, 9..12: suffixes of length 1 to 4.
If there is no entry for a column, the dummy field "__nil__" is assigned.

Rockwell N Y N R Ro Roc Rock l ll ell well NNP
International N Y N I In Int Inte l al nal onal NNP
Corp. N Y N C Co Cor Corp . p. rp. orp. NNP
's N N N ' 's __nil__ __nil__ s 's __nil__ __nil__ POS
Tulsa N Y N T Tu Tul Tuls a sa lsa ulsa NNP
unit N N N u un uni unit t it nit unit NN
said N N N s sa sai said d id aid said VBD
..
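Feature columns of this kind are produced by a preprocessing script of your own, not by YamCha. Here is a minimal Python sketch of such a script. The set of characters counted as a "symbol" is not defined in this manual, so `SYMBOLS` below is purely an assumption, as are the function names; "capitalized" is taken to mean that the first character is upper case.

```python
NIL = "__nil__"
# Hypothetical symbol set: the manual does not say which characters count.
SYMBOLS = set("-/$%&")


def yn(flag):
    return "Y" if flag else "N"


def affixes(word, max_len=4):
    """Prefixes and suffixes of length 1..max_len, padded with __nil__."""
    lengths = range(1, max_len + 1)
    prefixes = [word[:n] if n <= len(word) else NIL for n in lengths]
    suffixes = [word[-n:] if n <= len(word) else NIL for n in lengths]
    return prefixes, suffixes


def pos_columns(word):
    """Build the 12 feature columns (without the answer tag) for one word."""
    prefixes, suffixes = affixes(word)
    flags = [
        yn(any(c.isdigit() for c in word)),    # contains a number
        yn(word[:1].isupper()),                # capitalized
        yn(any(c in SYMBOLS for c in word)),   # contains a symbol (assumed set)
    ]
    return [word] + flags + prefixes + suffixes


print(" ".join(pos_columns("'s")))
# 's N N N ' 's __nil__ __nil__ s 's __nil__ __nil__
```

For training data, the answer tag (e.g. POS) is appended as the final column of each line.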

Training and Testing

The first step in using YamCha is to create training and test files. Here, I take the base NP chunking task as a case study.

Assume a data set like this. The first column is a word, the second column is the POS tag associated with the word, and the third column is the true answer tag associated with the word (I, O, or B). The chunks are represented using the IOB2 model. Sentences are assumed to be separated by one blank line.
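To make the IOB2 representation concrete, here is a small Python sketch that converts a tag sequence into chunk spans. It is an illustration only, not code from YamCha. It accepts both typed tags such as "B-NP" and the bare "B"/"I"/"O" tags used in this case study (bare tags get an empty label), and it silently drops an "I" tag that does not continue an open chunk.

```python
def iob2_chunks(tags):
    """Return (start, end, label) spans, end exclusive, from IOB2 tags."""
    chunks, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and start is not None and kind == label
        if not continues:
            if start is not None:
                chunks.append((start, i, label))
                start, label = None, None
            if prefix == "B":
                start, label = i, kind
    return chunks


print(iob2_chunks(["B", "I", "I", "B", "I", "I", "O"]))
# [(0, 3, ''), (3, 6, '')]
print(iob2_chunks(["B-NP", "B-VP", "B-NP", "I-NP", "O"]))
# [(0, 1, 'NP'), (1, 2, 'VP'), (2, 4, 'NP')]
```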

First of all, run yamcha-config with the --libexecdir option. It prints the location of the Makefile used for training. Copy that Makefile to your local working directory.

% yamcha-config --libexecdir
/usr/local/libexec/yamcha
% cp /usr/local/libexec/yamcha/Makefile .

There are two mandatory parameters for training.

CORPUS: The location of the file written in the training/test format.
MODEL: Prefix of the model file name(s)

Here is an example in which CORPUS is set as 'train.data' and MODEL is set as 'case_study'.

% make CORPUS=train.data MODEL=case_study train
/usr/bin/yamcha  -F'F:-2..2:0.. T:-2..-1' < train.data > case_study.data
perl -w /usr/local/libexec/yamcha/mkparam   case_study < case_study.data
perl -w /usr/local/libexec/yamcha/mksvmdata case_study
.. omit

After training, the following files are generated.

% ls case_study.*
case_study.log           : log of training
case_study.model         : model file (binary, architecture dependent)
case_study.txtmodel.gz   : model file (text, architecture independent)
case_study.se            : support examples
case_study.svmdata       : training data for SVMs

Now let's parse the test data using the model file generated above (case_study.model). Simply use the command:

% yamcha -m case_study.model < test.data 
Rockwell        NNP     B       B
International   NNP     I       I
Corp.   NNP     I       I
's      POS     B       B
Tulsa   NNP     I       I
unit    NN      I       I
said    VBD     O       O
...

The last column is the estimated tag. If the 3rd column is the true answer tag, you can evaluate the accuracy by simply comparing the 3rd and 4th columns.
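The column comparison can be scripted. The following Python sketch computes token-level tag accuracy from yamcha output lines, assuming (as in the example above) that the last two columns are the true tag and the estimated tag. The function name is hypothetical, and this is a per-token measure rather than a chunk-level F-measure.

```python
def tag_accuracy(output_lines):
    """Fraction of tokens whose last two columns (true tag, estimated tag) agree."""
    correct = total = 0
    for line in output_lines:
        cols = line.split()
        if len(cols) < 2:
            continue  # blank line or boundary marker: not a token
        total += 1
        correct += cols[-2] == cols[-1]
    return correct / total if total else 0.0


lines = [
    "Rockwell        NNP     B       B",
    "International   NNP     I       I",
    "",
    "said    VBD     O       B",
]
print(tag_accuracy(lines))  # 2 of 3 tokens agree
```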

Parameter Tuning

Parsing Direction

DIRECTION is used to change the parsing direction. The default setting is forward parsing mode (left to right). If "-B" is specified, backward parsing mode (right to left) is used. Please see my paper for more details about the parsing direction.

% make CORPUS=train.data MODEL=case_study DIRECTION="-B" train

Re-definition of features (changing window-size)

FEATURE is used to change the feature sets (window-size) for chunking.
The default setting is "F:-2..2:0.. T:-2..-1".

"F:-2..2:0.. T:-2..-1" implies that contexts in the blue box are used as feature sets to identify the tag in the red box.

[Figure: features]

More specifically, the contexts in the blue box can be divided into two parts -- green box (static feature F:) and light-blue box (dynamic feature T:).
F: and T: should be written in the following format:

F:[beginning pos. of token]..[end pos. of token]:[beginning pos. of column]..[end pos. of column]
T:[beginning pos. of tag]..[end pos. of tag]

Static Features F:
In this figure, the tokens at positions -2, -1, 0, 1, and 2 are used as features (green box).
This means that [beginning position of token] is -2 and [end position of token] is +2.
In addition, the figure shows that the 0th and 1st columns of these tokens are taken as features.
This means that [beginning position of column] is 0 and [end position of column] is 1.
You can omit [end position of column]. If it is omitted, the last column is used as [end position of column].
Note that the answer tag column is never regarded as [end position of column].
Combining tokens and columns, the final expression of the static feature becomes "F:-2..2:0..1".
In this case, you can also use "F:-2..2:0..", which means the same as "F:-2..2:0..1".

Dynamic Features T:
Dynamic features are decided dynamically during the tagging of chunk labels.
In this figure, the tags at positions -2 and -1 are used as features (light-blue box).
This means that [beginning position of tag] is -2 and [end position of tag] is -1.
Note that [end position of tag] must be -1 or smaller, since the tags to the right (0, +1, +2, +3, ...)
have not been identified yet and cannot be used as features.

You can use the F: and T: expressions repeatedly. All duplicate entries are deleted.
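The way static and dynamic features combine for one token can be sketched in Python as follows. This is an illustration of the idea under stated assumptions, not YamCha's actual implementation: the function name, the feature string format ("F:offset:column:value" and "T:offset:tag"), and the choice to skip positions outside the sentence are all hypothetical. Forward parsing is assumed.

```python
def context_features(sentence, tags_so_far, i, static, dynamic):
    """Collect the features used to decide the tag of token i.

    sentence    : list of tokens, each a list of feature columns (no answer tag)
    tags_so_far : tags already assigned to tokens 0..i-1
    static      : (token_offset, column) pairs, e.g. from "F:-2..2:0..1"
    dynamic     : tag offsets, all negative, e.g. [-2, -1] from "T:-2..-1"
    """
    feats = set()  # a set drops duplicate entries
    for off, col in static:
        j = i + off
        if 0 <= j < len(sentence):
            feats.add(f"F:{off:+d}:{col}:{sentence[j][col]}")
    for off in dynamic:
        j = i + off
        # Only tags to the left have been decided, so off must be negative.
        if off < 0 and 0 <= j < len(tags_so_far):
            feats.add(f"T:{off:+d}:{tags_so_far[j]}")
    return sorted(feats)


sentence = [["He", "PRP"], ["reckons", "VBZ"], ["the", "DT"]]
static = [(off, col) for off in range(-2, 3) for col in (0, 1)]  # F:-2..2:0..1
feats = context_features(sentence, ["B-NP", "B-VP"], 2, static, [-2, -1])
print(feats)
```

For the last token of this three-token sentence, the offsets +1 and +2 fall outside the sentence, so only the features at offsets -2, -1, and 0 plus the two previous tags are produced.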

Here are more complicated examples.

F:-3..3:0.. T:-3..-1
F:-2..2:1..1 F:0..0:0..1 T:-1..-1
F:-3..-2:0.. F:0..0:0.. F:2..3:0.. T:-3..-2
F:-3..-2:1..1 F:-1..0:0..0 F:2..3:1..1 T:-3..-1

Here is an example of setting "F:-3..3:0.. T:-3..-1" to the FEATURE parameter.

% make CORPUS=train.data MODEL=case_study FEATURE="F:-3..3:0.. T:-3..-1" train

The expression "-2..2" can also be written as "-2,-1,0,1,2". In addition, if the beginning position and the end position are the same, you can omit the end position. Here are some alternative expressions:

"F:-2..2:0..0" -> "F:-2,-1,0,1,2:0"
"F:0..0:0..1" -> "F:0:0,1"

Note that the expression of "-2,0,2" is different from "-2..2".
".." represents a range between beginning and end position.

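The FEATURE syntax described in this section can be expanded mechanically. The Python sketch below is a hypothetical parser written for illustration; it is not YamCha's own parser. It assumes that comma-separated items and ".." ranges may be mixed freely, and it takes the number of feature columns (excluding the answer tag) as a parameter in order to resolve an omitted end position of column.

```python
def expand(spec, last=None):
    """Expand "a..b", "a..", "a,b,c" or "a" into a list of integers."""
    values = []
    for part in spec.split(","):
        if ".." in part:
            lo, hi = part.split("..")
            end = int(hi) if hi else last  # open end falls back to `last`
            if end is None:
                raise ValueError(f"open-ended range needs a last position: {part!r}")
            values.extend(range(int(lo), end + 1))
        else:
            values.append(int(part))
    return values


def parse_feature_template(template, num_feature_columns):
    """Return (static, dynamic) from a FEATURE string.

    static  : sorted (token_offset, column) pairs from all F: entries
    dynamic : sorted tag offsets from all T: entries
    """
    static, dynamic = set(), set()  # sets delete duplicate entries
    for entry in template.split():
        kind, _, rest = entry.partition(":")
        if kind == "F":
            rows, _, cols = rest.partition(":")
            for r in expand(rows):
                for c in expand(cols, last=num_feature_columns - 1):
                    static.add((r, c))
        elif kind == "T":
            dynamic.update(expand(rest))
        else:
            raise ValueError(f"unknown entry: {entry!r}")
    return sorted(static), sorted(dynamic)


static, dynamic = parse_feature_template("F:-2..2:0.. T:-2..-1", 2)
print(len(static), dynamic)  # 10 [-2, -1]
```
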
Call-back function to rewrite features in detail (requires C++ knowledge)

You can define a call-back function that rewrites or adds task-specific features. For more details, see example/example.cpp.

Multi-class methods

MULTI_CLASS is used to change the strategy for the multi-class problem. The default setting is the pair wise method. If "2" is specified, 'one vs rest' is used.

% make CORPUS=train.data MULTI_CLASS=2 MODEL=case_study train

Training conditions of SVMs

SVM_PARAM is used to change the training conditions of the SVMs. The default setting is "-t1 -d2 -c1", which selects a polynomial kernel of degree 2 with the soft-margin cost parameter set to 1. Note that YamCha only supports polynomial kernels.

Here is an example of using the 3rd degree of polynomial kernel:

% make CORPUS=train.data MODEL=case_study SVM_PARAM="-t1 -d3 -c1" train
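To give a feel for what the degree controls, here is a minimal Python sketch of a polynomial kernel over binary features. The form (x . y + 1) ** degree, with both constants fixed to 1, is an assumption made for this illustration; the function name is hypothetical and the code is not taken from YamCha or TinySVM.

```python
def poly_kernel(x, y, degree=2):
    """Polynomial kernel (x . y + 1) ** degree for binary feature sets.

    x and y are sets of active feature names, so the dot product is just
    the number of features they share.
    """
    return (len(x & y) + 1) ** degree


x = {"F:0:0:the", "F:0:1:DT", "T:-1:B-VP"}
y = {"F:0:1:DT", "T:-1:B-VP", "F:0:0:a"}
print(poly_kernel(x, y, degree=2))  # 9
print(poly_kernel(x, y, degree=3))  # 27
```

With degree 2 the kernel implicitly takes pairs of features into account, and with degree 3 it takes triples into account as well, at a higher computational cost.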

Please use the -m SIZE option to increase the memory available for training if possible. This option drastically reduces the computational cost and time.
Here is an example of assigning 512 MB of memory to the SVMs:

% make CORPUS=train.data MODEL=case_study SVM_PARAM="-t1 -d2 -c1 -m 512" train

Output format

The -V option sets verbose mode, where yamcha outputs tag and scores of all candidates.
The meaning of score varies with multi-class methods.

one vs rest: distance from the separating hyperplane
pair wise: summation of distances of this class

# without -V
% yamcha  -m case_study.model < test.data
Rockwell        NNP     B       B
International   NNP     I       I
Corp.   NNP     I       I
's      POS     B       B
Tulsa   NNP     I       I
unit    NN      I       I
said    VBD     O       O
..

# with -V
% yamcha -V -m case_study.model < test.data
Rockwell        NNP     B       B       B/0.630616      I/-0.974367     O/-0.721942
International   NNP     I       I       B/-0.789851     I/0.561522      O/-0.833703
Corp.   NNP     I       I       B/-0.934675     I/0.486497      O/-0.584086
's      POS     B       B       B/0.418284      I/-0.760627     O/-0.794485
Tulsa   NNP     I       I       B/-0.987653     I/1.06272       O/-1.16405
unit    NN      I       I       B/-0.783824     I/0.845213      O/-1.04919
said    VBD     O       O       B/-1.29512      I/-1.02006      O/0.956885
...
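If you want to post-process the verbose output, the TAG/score fields can be split off from the end of each line. The Python sketch below is a hypothetical helper based only on the output shown above; it assumes that every trailing field of the form TAG/number is a candidate score.

```python
def parse_verbose_line(line):
    """Split a `yamcha -V` output line into (columns, scores).

    columns : the ordinary columns, ending with the estimated tag
    scores  : dict mapping each candidate tag to its score
    """
    fields = line.split()
    scores = {}
    while fields:
        tag, sep, value = fields[-1].rpartition("/")
        if not sep:
            break  # no "/": this is an ordinary column
        try:
            score = float(value)
        except ValueError:
            break
        scores[tag] = score
        fields.pop()
    return fields, scores


line = "Rockwell        NNP     B       B       B/0.630616      I/-0.974367     O/-0.721942"
columns, scores = parse_verbose_line(line)
print(columns)                     # ['Rockwell', 'NNP', 'B', 'B']
print(max(scores, key=scores.get)) # B
```

In the sample output above, the estimated tag is the candidate with the highest score on every line.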

Sentence boundary marker

The -e option sets the sentence boundary marker. Default setting is empty ("").
Here is an example of changing the sentence boundary marker to "EOS"

% yamcha -e EOS -m case_study.model < test.data

Partial Chunking

If some 'prior' knowledge tells you the candidate answer tags in advance, you may want yamcha to select the answer only from these candidates. Here is a concrete example. If the 1st token must have the B tag and the tag of the 2nd token must be either B or I, give yamcha the following test data:

Rockwell        NNP     B
International   NNP     B      I

Generally speaking, in partial chunking mode, the candidates are listed in place of the last column.
In partial chunking mode, yamcha must be run with the -C option.

% yamcha -C -m case_study.model < test.data

Note that the interpretation of test data varies according to the -C option.

With -C option: the last (or more) columns are interpreted as candidates.
Without -C option: the last (or more) columns are ignored.
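Test data for partial chunking can be generated from your own constraints. The Python sketch below is a hypothetical helper that writes each token's feature columns followed by its candidate tags, matching the layout of the example above.

```python
def partial_chunking_lines(tokens, candidates):
    """Format tokens for `yamcha -C`: feature columns followed by candidate tags."""
    if len(tokens) != len(candidates):
        raise ValueError("need one candidate list per token")
    return ["\t".join(list(cols) + list(cands))
            for cols, cands in zip(tokens, candidates)]


tokens = [["Rockwell", "NNP"], ["International", "NNP"]]
candidates = [["B"], ["B", "I"]]
partial = partial_chunking_lines(tokens, candidates)
print("\n".join(partial))
```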

Enable Fast Chunking

Classification with SVMs is much more expensive than with other algorithms, such as maximum entropy or decision lists. To make chunking fast, YamCha applies two algorithms, PKI and PKE. PKI and PKE are about 3-12 and 10-300 times faster than the original SVMs, respectively. By default, PKI is used. To enable PKE, recompile the model files with the -e option:

% yamcha-mkmodel -e foo.txtmodel.gz foo.model
% yamcha -m foo.model < ...

If -e is not given, PKI is employed.

PKI and PKE have the following properties:

PKI is not an approximation of SVMs. It produces the same results as the original SVMs.
PKI uses less disk space than PKE.
PKE is much faster than PKI.
As PKE is an approximation of SVMs, it may produce different results. The degree of approximation can be controlled by the following two parameters.
- -n NUM (minimum support): Use features which occur no less than NUM times in support vectors. Default value is 2. A smaller value gives a better approximation.
- -s SIGMA (weight threshold): Use only features whose weights fall outside the range from -SIGMA to SIGMA. Default value is 0.005. A smaller value gives a better approximation.

Here is an example where NUM and SIGMA are set to be 1 and 0.0001 respectively.

% yamcha-mkmodel -e -n 1 -s 0.0001 foo.txtmodel.gz foo.model

Please see our paper for details.

Other options

See here.

Bibliography

YamCha itself:

Taku Kudo, Yuji Matsumoto (2003)
Fast Methods for Kernel-Based Text Analysis, ACL 2003 [PDF]
Taku Kudo, Yuji Matsumoto (2001)
Chunking with Support Vector Machines, NAACL 2001 [PDF]
Taku Kudo, Yuji Matsumoto (2000)
Use of Support Vector Learning for Chunk Identification, CoNLL-2000 [PS]

Publications where YamCha is applied:

Hiroyasu Yamada, Taku Kudo, Yuji Matsumoto (2002)
Japanese Named Entity Extraction using Support Vector Machine, Transactions of IPSJ, Vol. 43, No. 1, pages 44-53, 2002. (in Japanese)
Tatsumi Yoshida and Kiyonori Ohtake and Kazuhide Yamamoto (2002)
Comparative Experiments of Chinese Analyzers between Support Vector Machines and Minimum Connective Costs Method, IPSJ SIG NL-150 (in Japanese) [PDF]
Koichi Takeuchi and Nigel Collier (2002)
Use of support vector machines in extended named entity, CoNLL-2002
Kadri Hacioglu and Wayne Ward (2003)
Target Word Detection and Semantic Role Chunking using Support Vector Machines, HLT-NAACL 2003 Short Papers
Masayuki Asahara and Yuji Matsumoto (2003)
Filler and Disfluency Identification Based on Morphological Analysis and Chunking, ISCA and IEEE Workshop on Spontaneous Speech Processing and Recognition 2003 [PDF]
Masayuki Asahara and Yuji Matsumoto (2003)
Japanese Named Entity Extraction with Redundant Morphological Analysis, HLT-NAACL 2003 [PDF]
Goh Chooi Ling and Masayuki Asahara and Yuji Matsumoto (2003)
Chinese Unknown Word Identification Using Position Tagging and Chunking, ACL 2003 Interactive Posters/Demo [PDF]
Pradhan, S., Hacioglu, K., Ward, W., Martin, J., Jurafsky, D., (2003)
Semantic Role Parsing: Adding Semantic Structure to Unstructured Text, ICDM 2003 [PDF]
Pradhan, S., Sun, H., Ward, W., Martin, J., Jurafsky, D., (2004)
Parsing Arguments of Nominalizations in English and Chinese, HLT-NAACL 2004 [PDF]
Pradhan, S., Ward, W., Hacioglu, K., Martin, J., Jurafsky, D., (2004)
Shallow Semantic Parsing using Support Vector Machines, HLT-NAACL 2004 [PDF]

Acknowledgments

I would like to thank all the people who were involved in the development of this software: the members of the Computational Linguistics Laboratory at NAIST, and in particular the following individuals:

Kiyonori OHTAKE, who gave me a number of patches to fix bugs.
Kaoru Yamamoto, who reviewed this manual.
Yuji MATSUMOTO, who is my supervisor.
