GENIA corpus: http://www.nactem.ac.uk/genia/genia-corpus
GENIA corpus
The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.
The corpus contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH (Medical Subject Headings) terms "human", "blood cells", and "transcription factors". The corpus has been annotated with various levels of linguistic and semantic information.
Note on PubMed: PubMed is a free search engine for biomedical papers and their abstracts. Its database source is MEDLINE. Its core subject is medicine, but it also covers related fields such as nursing and other health disciplines, and it offers fairly comprehensive coverage of related biomedical areas such as biochemistry and cell biology. The search engine is provided by the U.S. National Library of Medicine as part of the Entrez information retrieval system. PubMed does not include the full text of journal articles, but it may link to full-text providers (paid or free).
The primary categories of annotation in the GENIA corpus and the corresponding subcorpora are:
Part-of-speech annotation: http://www.nactem.ac.uk/genia/genia-corpus/pos-annotation
Overview
Part-of-speech (POS) tagging is an initial step of natural language processing that is often performed right after, or together with, tokenization. After tokenization, every token is assigned a POS label. The GENIA POS annotation generally follows the Penn Treebank POS tagging scheme, with the following modifications:
- The NNP and NNPS (proper name) tags are used only for the names of journals, authors, and research institutes, and for the initials of patients. In particular, (discoverers') names in technical terms (e.g. Epstein-Barr virus, Southern blotting) are not tagged with NNP tags.
- SYM tags were eliminated as far as possible.
See the annotation guidelines for details. The abstracts are first tagged by the JunK tagger and then corrected by human annotators.
Examples
Corpus format
The corpus is available in two formats, both included in the package available for download below.
- PTB-like format: The file contains one token/POS pair per line, and a "==========" line (ten equal signs) is placed between sentences.
- "Merged" gpml format: The POS information is merged into GENIA corpus ver 3.02 using a <w> tag that surrounds each token, with the POS given as the value of the "c" attribute.
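As a rough illustration of the PTB-like format described above, the following Python sketch groups token/POS lines into sentences. It assumes that the token and its tag are separated by the last "/" on the line (so that a token containing a slash stays intact) and that blank lines can be ignored; neither detail is confirmed by the text above. The function name read_ptb_like and the sample lines are made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple

# The sentence separator described above: a line of ten equal signs.
SENTENCE_SEPARATOR = "=" * 10


def read_ptb_like(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield one sentence at a time as a list of (token, pos) pairs."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line == SENTENCE_SEPARATOR:
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumption: the tag follows the LAST slash on the line.
        token, sep, pos = line.rpartition("/")
        if not sep:
            token, pos = line, ""  # no slash found: keep the text, leave the tag empty
        sentence.append((token, pos))
    if sentence:
        yield sentence  # final sentence without a trailing separator


# Hypothetical sample lines, not taken from the corpus.
sample = ["IL-2/NN", "gene/NN", "==========", "and/or/CC", "=========="]
for sent in read_ptb_like(sample):
    print(sent)
```

To read a real file, the same function can be given an open file object, since it only iterates over lines.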
In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS. This occurs when a token is split by <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token receives the POS tag assigned by the POS annotators, and the other fragments are assigned "*", e.g. <w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
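The "*" convention can be undone when whole tokens are needed. The sketch below, in Python with the standard xml.etree.ElementTree module, concatenates each run of "*" fragments with the next fragment that carries a real POS tag. It assumes the input is a well-formed fragment of one sentence; the wrapping <s> element and the function name rejoin_split_tokens are introduced only for this example and are not part of the corpus documentation.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def rejoin_split_tokens(fragment: str) -> List[Tuple[str, Optional[str]]]:
    """Rebuild (token, pos) pairs from merged-format markup.

    Fragments tagged "*" are prepended to the next <w> element that
    carries a real POS tag, which is the last fragment of the split token.
    """
    # Wrap the fragment in a hypothetical root so it parses as one document.
    root = ET.fromstring(f"<s>{fragment}</s>")
    tokens: List[Tuple[str, Optional[str]]] = []
    pending = ""
    # iter("w") visits <w> elements in document order, at any nesting depth.
    for w in root.iter("w"):
        text = w.text or ""
        if w.get("c") == "*":
            pending += text
        else:
            tokens.append((pending + text, w.get("c")))
            pending = ""
    if pending:
        tokens.append((pending, "*"))  # dangling fragment: keep it visible
    return tokens


example = """<w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>"""
print(rejoin_split_tokens(example))  # [('anti-IgM', 'JJ')]
```

Because only the <w> elements are visited, the sketch does not depend on the name of the element that marks terms.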
Documentation
Annotation guidelines
- Tateisi, Yuka and Jun'ichi Tsujii. GENIA Annotation Guidelines for Tokenization and POS tagging. Technical Report (TR-NLP-UT-2006-4). Tsujii Laboratory, University of Tokyo, 2006.
Publications
- Tateisi, Yuka and Jun'ichi Tsujii. Part-of-Speech Annotation of Biology Research Abstracts. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004). IV. Lisbon, Portugal, pp. 1267-1270, May 2004.
Download
- GENIA corpus version 3.02 POS annotation: GENIAcorpus3.02p.tgz (4.6M)
Acknowledgments
Yuka Tateisi: GENIA part-of-speech corpus annotation coordinator
See also GENIA Project acknowledgments page