Getting Started with AllenNLP (2021-03-19)

For your own research project, you only need to implement a DatasetReader and a Model, and then run your experiments through config files. To start a project with AllenNLP, we need to understand three things:

  1. Define Your DatasetReader
  2. Define Your Model
  3. Setup Your Config Files

 

Source: <https://towardsdatascience.com/allennlp-startup-guide-24ffd773cd5b>

 

The DatasetReader takes a raw dataset as input and applies preprocessing such as lowercasing and tokenization. It outputs a list of Instance objects, each of which holds one preprocessed example as its attributes. In this post, each Instance has the document and its label as attributes.

 

First, we subclass DatasetReader to make our own. Then we need to implement three methods: __init__, _read, and text_to_instance. Let's look at how to implement our own DatasetReader. I'll skip the implementation of the _read method because it has little to do with how AllenNLP is used. If you're interested in it, though, you can refer to the implementation linked from the original article.

 

The implementation of __init__ is as follows. We can control the arguments of this method via config files.

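from allennlp.data.dataset_readers import DatasetReader

# Register the class under the name 'imdb' so config files can refer to it.
@DatasetReader.register('imdb')
class ImdbDatasetReader(DatasetReader):
    def __init__(self, token_indexers, tokenizer):
        super().__init__()  # initialize the base DatasetReader
        self._tokenizer = tokenizer
        self._token_indexers = token_indexers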

In this post, I take token_indexers and tokenizer as arguments because I assume we will change the indexing or tokenization method between experiments. The token_indexers perform indexing and the tokenizer performs tokenization. The class has the decorator @DatasetReader.register('imdb'), which registers it under the name 'imdb' so that we can select it from config files.
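To make the role of the decorator more concrete, here is a minimal sketch of a name-to-class registry in plain Python. It is a simplified stand-in and not AllenNLP's actual implementation: ToyRegistrable, ToyImdbReader, and from_config are hypothetical names invented for this illustration, and the tokenizer and token_indexers values are plain strings rather than real AllenNLP objects.

```python
from typing import Any, Callable, Dict, Type


class ToyRegistrable:
    """Simplified stand-in for a registrable base class (not AllenNLP's code)."""

    _registry: Dict[str, Type["ToyRegistrable"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(subclass: type) -> type:
            cls._registry[name] = subclass  # remember the class under its name
            return subclass
        return decorator

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ToyRegistrable":
        params = dict(config)  # copy so the caller's dict is left untouched
        subclass = cls._registry[params.pop("type")]
        return subclass(**params)  # remaining keys become constructor arguments


@ToyRegistrable.register("imdb")
class ToyImdbReader(ToyRegistrable):
    def __init__(self, tokenizer: str, token_indexers: str) -> None:
        self.tokenizer = tokenizer
        self.token_indexers = token_indexers


# A config picks the class by its registered name and supplies its arguments.
config = {"type": "imdb", "tokenizer": "whitespace", "token_indexers": "single_id"}
reader = ToyRegistrable.from_config(config)
print(type(reader).__name__, reader.tokenizer)
```

The point of the pattern is that switching experiments means editing the config rather than the code. In AllenNLP itself, the values for tokenizer and token_indexers would be nested config objects rather than plain strings.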

The implementation of text_to_instance is as follows. This method is the core of the DatasetReader: it takes one raw example as input, applies preprocessing, and outputs it as an Instance. For IMDB, it takes the review string and the polarity label as input.

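from typing import Dict
from allennlp.data import Instance
from allennlp.data.fields import Field, LabelField, TextField

# Method of ImdbDatasetReader
def text_to_instance(self, string: str, label: int) -> Instance:
    fields: Dict[str, Field] = {}
    tokens = self._tokenizer.tokenize(string)
    fields["tokens"] = TextField(tokens, self._token_indexers)
    # The label is already an integer, so no vocabulary lookup is needed.
    fields["label"] = LabelField(label, skip_indexing=True)
    return Instance(fields)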
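The _read method is not shown in this post, but its usual job is to walk over the raw data and hand each example to text_to_instance. The sketch below illustrates that relationship in plain Python so it runs without AllenNLP. Everything in it is an assumption made for illustration: the one-example-per-line "label<TAB>review" format, the read_instances name, and the dict returned in place of a real Instance. It is not the original author's implementation.

```python
import io
from typing import Dict, Iterator, TextIO


def text_to_instance(string: str, label: int) -> Dict[str, object]:
    """Stand-in for the method above: returns a plain dict, not an Instance."""
    return {"tokens": string.split(), "label": label}


def read_instances(lines: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one instance per non-empty 'label<TAB>review' line (assumed format)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, review = line.split("\t", 1)
        yield text_to_instance(review, int(label))


# In-memory sample data standing in for a dataset file.
sample = io.StringIO("1\tgreat movie\n0\tboring plot\n")
instances = list(read_instances(sample))
print(instances[0])
```

Keeping the per-example logic in text_to_instance and the file handling in the reading function means the same preprocessing can be reused on a single input string at prediction time.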
