GENIA Tagger Software Package


GENIA Tagger

Part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text

What’s New

20 Oct. 2006

A demo page is available.

6 Oct. 2006

Version 3.0: The tagger now performs named entity recognition.

Overview

The GENIA tagger analyzes English sentences and outputs the base forms, part-of-speech tags, chunk tags, and named entity tags. The tagger is specifically tuned for biomedical text such as MEDLINE abstracts. If you need to extract information from biomedical documents, this tagger might be a useful preprocessing tool. You can try the tagger on a demo page.

How to use the tagger

You need gcc to build the tagger.

1. Download the latest version of the tagger

February 9, 2016: geniatagger-3.0.2.tar.gz (source package for Unix)

2. Expand the archive

tar xvzf geniatagger-3.0.2.tar.gz

3. Make

cd geniatagger-3.0.2/
make

4. Tag sentences

Prepare a text file containing one sentence per line, then

./geniatagger < RAWTEXT > TAGGEDTEXT
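
If you would rather drive the tagger from a script, the same stdin/stdout redirection can be sketched in Python. This is a minimal sketch and not part of the GENIA distribution: the function name `tag_sentences` and the parameters `tagger_cmd` and `workdir` are hypothetical, and the sketch assumes the tagger reads all of its input, writes the result to standard output, and exits at end of input, as the shell command above implies.

```python
import subprocess


def tag_sentences(sentences, tagger_cmd=("./geniatagger",), workdir=None):
    """Feed one sentence per line to the tagger and return its raw output.

    tagger_cmd and workdir are illustrative parameters of this sketch,
    not options of the tagger itself.
    """
    # The tagger expects exactly one sentence per line, so collapse any
    # embedded line breaks and drop empty entries before joining.
    lines = [" ".join(s.split()) for s in sentences if s.strip()]
    text = "\n".join(lines) + "\n"
    result = subprocess.run(
        list(tagger_cmd),
        input=text,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        cwd=workdir,  # assumption: run from the directory where you built it
        check=True,   # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling `tag_sentences(["First sentence.", "Second sentence."], workdir="path/to/build")` would then return the tagged text as one string, where `path/to/build` stands for wherever you ran `make`.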

The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.

word1   base1   POStag1 chunktag1 NEtag1
word2   base2   POStag2 chunktag2 NEtag2
  :        :         :       :        :

Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).
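
The five-column format above is straightforward to load into records. The following is a minimal sketch: the names `Token` and `parse_tagged` are hypothetical, and treating a blank line as a sentence boundary is an assumption about the output rather than something stated on this page.

```python
from collections import namedtuple

# One record per output line: word, base form, POS tag, chunk tag, NE tag.
Token = namedtuple("Token", ["word", "base", "pos", "chunk", "ne"])


def parse_tagged(text):
    """Split tagger output into sentences, each a list of Token records."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Assumption: a blank line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError("expected 5 tab-separated fields: %r" % line)
        current.append(Token(*fields))
    if current:
        sentences.append(current)
    return sentences
```

Each token can then be read by field name, for example `token.pos` or `token.ne`.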

Example

echo "Inhibition of NF-kappaB activation reversed the anti-apoptotic effect of isochamaejasmin." | ./geniatagger

Inhibition      Inhibition      NN      B-NP     O
of              of              IN      B-PP     O
NF-kappaB       NF-kappaB       NN      B-NP     B-protein
activation      activation      NN      I-NP     O
reversed        reverse         VBD     B-VP     O
the             the             DT      B-NP     O
anti-apoptotic  anti-apoptotic  JJ      I-NP     O
effect          effect          NN      I-NP     O
of              of              IN      B-PP     O
isochamaejasmin isochamaejasmin NN      B-NP     O
.               .               .       O        O

You can easily extract four noun phrases (“Inhibition”, “NF-kappaB activation”, “the anti-apoptotic effect”, and “isochamaejasmin”) from this output by looking at the chunk tags. You can also find a protein name with the named entity tags.
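
The extraction described above can be sketched in a few lines of Python. This is an illustrative sketch, not code shipped with the tagger: the helper name `extract_spans` and the constant `EXAMPLE_ROWS` are hypothetical, and the rows are split on whitespace only because the example output is shown column-aligned here (real output is tab-separated).

```python
def extract_spans(rows, column, label=None):
    """Collect IOB2 spans from rows of (word, base, POS, chunk, NE).

    column is 3 for chunk tags or 4 for named entity tags. label keeps
    only one span type (for example "NP"); None keeps every type.
    """
    spans, words, kind = [], [], None
    for row in rows:
        tag = row[column]
        if tag.startswith("I-") and words and tag[2:] == kind:
            words.append(row[0])       # I- continues the open span
            continue
        if words:                      # any other tag closes the open span
            spans.append((kind, " ".join(words)))
            words, kind = [], None
        if tag.startswith("B-"):       # B- opens a new span
            words, kind = [row[0]], tag[2:]
        # O tags and stray I- tags with no matching open span are skipped.
    if words:
        spans.append((kind, " ".join(words)))
    return [text for k, text in spans if label is None or k == label]


EXAMPLE_ROWS = [line.split() for line in """\
Inhibition Inhibition NN B-NP O
of of IN B-PP O
NF-kappaB NF-kappaB NN B-NP B-protein
activation activation NN I-NP O
reversed reverse VBD B-VP O
the the DT B-NP O
anti-apoptotic anti-apoptotic JJ I-NP O
effect effect NN I-NP O
of of IN B-PP O
isochamaejasmin isochamaejasmin NN B-NP O
. . . O O
""".splitlines()]

print(extract_spans(EXAMPLE_ROWS, 3, "NP"))
# ['Inhibition', 'NF-kappaB activation', 'the anti-apoptotic effect', 'isochamaejasmin']
print(extract_spans(EXAMPLE_ROWS, 4, "protein"))
# ['NF-kappaB']
```

The same function handles both columns because chunk tags and named entity tags share the B-/I-/O prefix convention in this output.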

Part-of-Speech Tagging Performance

General-purpose part-of-speech taggers do not usually perform well on biomedical text because lexical characteristics of biomedical documents are considerably different from those of newspaper articles, which are often used as the training data for a general-purpose tagger. The GENIA tagger is trained not only on the Wall Street Journal corpus but also on the GENIA corpus and the PennBioIE corpus [1], so the tagger works well on various types of biomedical documents. The table below shows the tagging accuracies of a tagger trained with different sets of documents. For details of the performance, see [2] (the latest version uses a different tagging algorithm [3] and gives slightly better performance than reported in the paper).

Tagger                                 Wall Street Journal   GENIA corpus
A tagger trained on the WSJ corpus     97.05%                85.19%
A tagger trained on the GENIA corpus   78.57%                98.49%
GENIA tagger                           96.94%                98.26%

Chunking Performance

(to be evaluated)

Named Entity Recognition Performance

The named entity tagger is trained on the NLPBA data set. The features and parameters were tuned using the training data. The final performance on the evaluation set is as follows.

Entity Type   Recall   Precision   F-score
Protein       81.41    65.82       72.79
DNA           66.76    65.64       66.20
RNA           68.64    60.45       64.29
Cell Line     59.60    56.12       57.81
Cell Type     70.54    78.51       74.31
Overall       75.78    67.45       71.37
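
As a sanity check on the table, the F-score column can be recomputed from recall and precision. The sketch below assumes the column is the balanced F-score, that is, the harmonic mean of the two. The page does not state the formula, but the reported figures agree with it to two decimal places. The names `f_score` and `REPORTED` are illustrative.

```python
def f_score(recall, precision):
    """Balanced F-score: the harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F-score), copied from the table above.
REPORTED = {
    "Protein":   (81.41, 65.82, 72.79),
    "DNA":       (66.76, 65.64, 66.20),
    "RNA":       (68.64, 60.45, 64.29),
    "Cell Line": (59.60, 56.12, 57.81),
    "Cell Type": (70.54, 78.51, 74.31),
    "Overall":   (75.78, 67.45, 71.37),
}

for name, (recall, precision, reported) in REPORTED.items():
    computed = f_score(recall, precision)
    print("%-10s computed %.2f, reported %.2f" % (name, computed, reported))
```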

References

[1] S. Kulick, A. Bies, M. Liberman, M. Mandel, R. McDonald, M. Palmer, A. Schein and L. Ungar. Integrated Annotation for Biomedical Information Extraction, HLT/NAACL 2004 Workshop: Biolink 2004, pp. 61-68.
[2] Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun’ichi Tsujii. Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.
[3] Yoshimasa Tsuruoka and Jun’ichi Tsujii. Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data, Proceedings of HLT/EMNLP 2005, pp. 467-474.
