Getting Started with AllenNLP (2021-03-19)

KAila_Lucky

Category: nlp. Tags: natural language processing, allennlp

Article link: https://blog.csdn.net/qq_39894692/article/details/115015596

For your own research project, you only need to implement a DatasetReader and a Model, and then run your experiments with config files. To start a project with AllenNLP, we need to understand the following three steps:

Define Your DatasetReader
Define Your Model
Setup Your Config Files

Source: <https://towardsdatascience.com/allennlp-startup-guide-24ffd773cd5b>

The DatasetReader takes a raw dataset as input and applies preprocessing such as lowercasing and tokenization. It outputs a list of Instance objects, each of which holds one preprocessed example as attributes. In this post, each Instance holds the document and its label.
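
To make this data flow concrete, here is a minimal plain-Python sketch of the idea: raw (text, label) pairs go in, preprocessed instances come out. ToyInstance and toy_read are hypothetical stand-ins invented for illustration, not AllenNLP classes; the real Instance holds Field objects rather than plain lists and integers.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ToyInstance:
    """Hypothetical stand-in for an AllenNLP Instance: one preprocessed example."""
    tokens: List[str]
    label: int


def toy_read(raw: Iterable[Tuple[str, int]]) -> List[ToyInstance]:
    """Lowercase and whitespace-tokenize each document, keeping its label."""
    instances = []
    for text, label in raw:
        tokens = text.lower().split()  # preprocessing: lowercasing + tokenization
        instances.append(ToyInstance(tokens=tokens, label=label))
    return instances


# Two made-up reviews with polarity labels (1 = positive, 0 = negative).
raw_data = [("A great movie", 1), ("Boring and slow", 0)]
instances = toy_read(raw_data)
```

Each element of instances carries the document tokens and the label as attributes, which mirrors the role the Instance plays in this post.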

First, we subclass DatasetReader to make our own reader. Then we need to implement three methods: __init__, _read, and text_to_instance. Let's look at how to implement our own DatasetReader. I'll skip the implementation of the _read method because it has little to do with how AllenNLP itself is used. If you're interested in it, you can refer to this link.

The implementation of __init__ is as follows. We can control the arguments of this method via config files.

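from allennlp.data import DatasetReader

@DatasetReader.register('imdb')  # register under the name 'imdb' for config files
class ImdbDatasetReader(DatasetReader):
    def __init__(self, token_indexers, tokenizer):
        super().__init__()  # initialize the base DatasetReader
        self._tokenizer = tokenizer
        self._token_indexers = token_indexers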

In this post, I set token_indexers and tokenizer as the arguments because I assume we will change the indexing or tokenization method between experiments. The token_indexers perform indexing and the tokenizer performs tokenization. The class has the decorator @DatasetReader.register('imdb'), which lets us select and configure it from config files.
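
As a rough illustration of why a register decorator makes a class controllable from a config, the following plain-Python sketch builds a tiny name-to-class registry. ToyRegistrable, ToyImdbReader, and from_config are hypothetical names made up for this example, and this is a simplified imitation of the pattern, not AllenNLP's actual implementation. The string values for tokenizer and token_indexers are placeholders; in a real AllenNLP config these would be nested objects.

```python
from typing import Any, Callable, Dict


class ToyRegistrable:
    """Hypothetical, simplified imitation of a name-to-class registry."""
    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(subclass: type) -> type:
            cls._registry[name] = subclass  # remember the class under its name
            return subclass
        return decorator

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ToyRegistrable":
        params = dict(config)  # copy so the caller's config is not modified
        subclass = cls._registry[params.pop("type")]  # look up the class by name
        return subclass(**params)  # remaining keys become constructor arguments


@ToyRegistrable.register("imdb")
class ToyImdbReader(ToyRegistrable):
    def __init__(self, tokenizer: str, token_indexers: str) -> None:
        self.tokenizer = tokenizer
        self.token_indexers = token_indexers


# Placeholder values; only the "type" key selects which class is built.
config = {"type": "imdb", "tokenizer": "whitespace", "token_indexers": "single_id"}
reader = ToyRegistrable.from_config(config)
```

Swapping the tokenizer for an experiment then only requires editing the config value, not the Python code, which is the motivation given above for exposing these two arguments.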

The implementation of text_to_instance is as follows. This method is the core of the DatasetReader: it takes one raw example as input, applies some preprocessing, and outputs it as an Instance. For IMDB, it takes the review string and the polarity label as input.

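from typing import Dict
from allennlp.data import Instance
from allennlp.data.fields import Field, LabelField, TextField

# Method of ImdbDatasetReader
def text_to_instance(self, string: str, label: int) -> Instance:
    fields: Dict[str, Field] = {}
    tokens = self._tokenizer.tokenize(string)
    fields["tokens"] = TextField(tokens, self._token_indexers)
    # The label is already an integer, so skip vocabulary indexing.
    fields["label"] = LabelField(label, skip_indexing=True)
    return Instance(fields)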
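
Since the _read method is skipped in this post, the following plain-Python sketch only hints at its usual shape: loop over the raw records and hand each one to text_to_instance. ToyReader is a hypothetical stand-in, whitespace splitting replaces the configurable tokenizer, and the tab-separated "label, then review" line format is an assumption made for this example, not the format used by the original author.

```python
import io
from typing import Dict, Iterator, List, TextIO, Union

ToyFields = Dict[str, Union[List[str], int]]


class ToyReader:
    """Hypothetical stand-in showing how reading delegates to text_to_instance."""

    def text_to_instance(self, string: str, label: int) -> ToyFields:
        # Whitespace tokenization stands in for a configurable tokenizer.
        return {"tokens": string.split(), "label": label}

    def read(self, handle: TextIO) -> Iterator[ToyFields]:
        # Assumed line format (not from the original post): label, tab, review.
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            label, review = line.split("\t", 1)
            yield self.text_to_instance(review, int(label))


# An in-memory file with two made-up reviews keeps the example self-contained.
fake_file = io.StringIO("1\tloved every minute\n0\tnot worth watching\n")
examples = list(ToyReader().read(fake_file))
```

Keeping the per-example logic in text_to_instance means the same preprocessing can be reused for a single input at prediction time, while the reading loop only deals with the file format.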
