[Engineering AI Experiments with AllenNLP 01] A Language Model as an Example

Article link: https://blog.csdn.net/oksupersonic/article/details/103975646

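This article explains AllenNLP's architecture and core concepts, including the workflow of Instance, Vocabulary, DataReader, DataIterator, and Model, to help readers understand how to build a language model.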

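Abstract

We build a language model and use it to explain the confusing points and unwritten rules of AllenNLP.
Readers are assumed to know basic NLP concepts; the article moves from those concepts to AllenNLP's abstractions. Estimated reading time: 15 minutes.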

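Core concepts

Instance and Vocabulary are both dictionaries of dictionaries; keep this in mind as you read:
In an Instance, the first level maps a field name to a field, and the second level maps each indexer name inside that field to a tensor of vocabulary ids.
Instance: Dict[str, Dict[str, Tensor]]
In a Vocabulary, the first level maps a namespace name to a dictionary, and the second level maps a string to an integer.
Vocabulary: Dict[str, Dict[str, int]]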
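
The two nested shapes above can be pictured with plain Python dictionaries. This is only an illustration under assumptions: the namespace names, indexer names, and integer ids are invented, and lists of integers stand in for tensors. The real AllenNLP classes wrap this structure in objects rather than exposing raw dictionaries.

```python
# Plain-Python picture of the two nested dictionaries (not AllenNLP objects).
# All names and ids below are hypothetical.

# Vocabulary: namespace name -> (token string -> integer id)
vocabulary = {
    "tokens": {"@@PADDING@@": 0, "@@UNKNOWN@@": 1, "cat": 2, "likes": 3,
               "catching": 4, "mouse": 5},
    "chars": {"@@PADDING@@": 0, "@@UNKNOWN@@": 1, "c": 2, "a": 3, "t": 4},
}

# Indexed instance: field name -> (indexer name -> ids; tensors in practice)
instance = {
    "tokens": {"tokens": [2, 3, 4]},   # cat likes catching
    "target": {"tokens": [3, 4, 5]},   # likes catching mouse
}


def lookup(namespace: str, token: str) -> int:
    """Map a token to its id, falling back to the unknown-token id."""
    table = vocabulary[namespace]
    return table.get(token, table["@@UNKNOWN@@"])


print(lookup("tokens", "cat"))   # 2
print(lookup("tokens", "dog"))   # 1 (unknown)
```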

DataReader

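from typing import Iterable

from overrides import overrides
from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import TextField


class DataReader(DatasetReader):
    def __init__(self, toker, tokindexer, targetindexer, lazy=False):
        super().__init__(lazy=lazy)
        self.toker = toker
        self.tokindexer = tokindexer
        self.targetindexer = targetindexer

    @overrides
    def text_to_instance(self, tokens, target) -> Instance:
        field = {"tokens": TextField(tokens, self.tokindexer)}
        field["target"] = TextField(target, self.targetindexer)
        return Instance(field)

    @overrides
    def _read(self, path: str, debug=True) -> Iterable[Instance]:
        # files whose path contains "valid" are treated as validation data
        eva = "valid" in path
        lines = 0
        with open(path, 'r') as f:
            for line in f:
                line = line.strip()
                # blank lines are not used for training
                if len(line) == 0:
                    continue
                # the space keeps <EOS> separate from the last word
                line += " <EOS>"
                tokens = self.toker.tokenize(text=line)
                # an instance needs at least 2 tokens, otherwise source or target is empty
                if len(tokens) < 2:
                    continue
                lines += 1
                # validation data stops early
                if eva and lines == 10:
                    break
                # debug mode reads only part of the file
                if debug and lines == 1000:
                    break
                source = tokens[:-1]
                target = tokens[1:]
                yield self.text_to_instance(source, target)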

The DataReader reads a dataset and converts it into a collection of Instances.

Instance

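An Instance is a dictionary. It is the basic unit of data in AllenNLP and the smallest unit that stays together as it passes through the model.
In a generation task an instance can come from one sentence or from a whole passage; the most convenient choice is to take one line of the raw text.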

cat likes catching mouse

During training we do not use the raw text directly. It is split into tokens, tokens make up a field, and fields make up an Instance.

Field

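An instance is made up of different fields:

tokens field: cat, likes, catching
target field: likes, catching, mouse

Clearly, words such as cat, likes, and mouse are the tokens.
In practice, a tokenizer splits the raw text into tokens, the tokens are assembled into fields, and the fields are put into a dictionary; that dictionary is an Instance.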
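
The path from raw text to tokens to fields to an instance can be sketched without AllenNLP. The sketch assumes whitespace splitting in place of a real tokenizer and plain lists in place of TextField objects; `make_instance` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List


def make_instance(raw_text: str) -> Dict[str, List[str]]:
    """Split one raw line into tokens and build the two fields of an LM instance."""
    tokens = raw_text.split()          # stand-in for a real tokenizer
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form source and target")
    return {
        "tokens": tokens[:-1],   # input: every token except the last
        "target": tokens[1:],    # label: the same sequence shifted left by one
    }


instance = make_instance("cat likes catching mouse")
print(instance["tokens"])   # ['cat', 'likes', 'catching']
print(instance["target"])   # ['likes', 'catching', 'mouse']
```

Each position in the target field holds the word that follows the word at the same position in the tokens field, which is exactly what a language model is trained to predict.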

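Back to the DataReader. Besides __init__, it must implement at least two methods: _read and text_to_instance.
_read reads the dataset and produces an Iterable[Instance]. Returning a list or writing a generator function both work, but a generator is recommended so that lazy training (holding only one instance at a time) is possible.
text_to_instance does the actual construction of an instance. Do not be misled by the name: the function is generally used to convert any kind of data into an instance, not only text.
Note that you implement _read but normally call read; the source shows that read calls _read and adds control logic around it.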
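
The split between read and _read can be sketched roughly as follows. `TinyReader` is a hypothetical stand-in written for this illustration; the real read in AllenNLP does more than this (for example, its lazy result can be iterated more than once), so treat the sketch as the general idea only.

```python
from typing import Iterable, Iterator, List, Union


class TinyReader:
    """Hypothetical stand-in for a dataset reader; not the AllenNLP class."""

    def __init__(self, lazy: bool = False) -> None:
        self.lazy = lazy

    def _read(self, lines: Iterable[str]) -> Iterator[List[str]]:
        # Subclass hook: a generator that yields one instance at a time.
        for line in lines:
            tokens = line.split()
            if tokens:
                yield tokens

    def read(self, lines: Iterable[str]) -> Union[Iterator[List[str]], List[List[str]]]:
        # Public entry point: adds the control logic around _read.
        instances = self._read(lines)
        if self.lazy:
            return instances          # nothing is read until iteration starts
        return list(instances)        # eager: materialise everything now


data = ["cat likes catching mouse", "", "dog likes bones"]
eager = TinyReader(lazy=False).read(data)
lazy = TinyReader(lazy=True).read(data)
print(eager)        # a list with two instances
print(next(lazy))   # the first instance, produced on demand
```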

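Token_indexer & Vocabulary & Namespace

Vocabulary is the top-level abstraction for mapping strings to integers. A single model may use several dictionaries, for example one at word level and one at character level, so a Vocabulary is really a dictionary of dictionaries:
{"vocab1": vocab1, "vocab2": vocab2}
The space occupied by one dictionary is called a namespace. Namespaces distinguish dictionaries and are used to construct a Token_indexer; two indexers built with the same namespace look tokens up in the same dictionary, so they behave as the same indexer.
A Vocabulary can be built from files or from a list of instances.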

V/
	-voca1.txt
	-voca2.txt

voca1.txt:
ax
bx
css
dee

Here voca1.txt and voca2.txt are text files with one word per line.
Vocabulary.from_files(V) builds two dictionaries named voca1 and voca2.

A=datareader.read(xxx)  # A is the list of instances built by the datareader

Vocabulary.from_instances(A) builds the dictionaries according to the indexers that were passed in when A was constructed.
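
What building a vocabulary from instances amounts to can be sketched as token counting per namespace. This is a simplified illustration, not the library's implementation: the instance representation (a list of namespace/token pairs), the ordering rule, and `vocab_from_instances` itself are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

# An instance here is a list of (namespace, tokens) pairs, one per field.
Instance = List[Tuple[str, List[str]]]


def vocab_from_instances(instances: List[Instance]) -> Dict[str, Dict[str, int]]:
    """Count tokens per namespace, then assign ids in order of frequency."""
    counts: Dict[str, Counter] = {}
    for instance in instances:
        for namespace, tokens in instance:
            counts.setdefault(namespace, Counter()).update(tokens)
    vocab: Dict[str, Dict[str, int]] = {}
    for namespace, counter in counts.items():
        # ids 0 and 1 are reserved for padding and unknown tokens
        table = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
        # most frequent first; ties broken alphabetically for determinism
        for token, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
            table[token] = len(table)
        vocab[namespace] = table
    return vocab


A = [
    [("tokens", ["cat", "likes", "catching"]),
     ("tokens", ["likes", "catching", "mouse"])],
]
vocab = vocab_from_instances(A)
print(vocab["tokens"])
```

Both fields in the example use the namespace "tokens", so they share one dictionary, which matches the point that the namespace, not the field, decides which dictionary a token lands in.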

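Note the following:

ti={"www":SingleIdTokenIndexer(namespace1),"zzz":SingleIdTokenIndexer(namespace2)}

The indexers are passed in as a dictionary, but the key is not the namespace; the constructor argument is. The key merely names the indexer, which is needed because a field can carry several indexers.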
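
The difference between the dictionary key and the namespace can be made concrete with a small sketch. `TinyIndexer` is a hypothetical class that only mimics the idea of a single-id indexer, and the vocabulary contents are invented.

```python
from typing import Dict, List


class TinyIndexer:
    """Hypothetical single-id indexer: looks tokens up in one namespace."""

    def __init__(self, namespace: str) -> None:
        self.namespace = namespace

    def index(self, tokens: List[str], vocab: Dict[str, Dict[str, int]]) -> List[int]:
        table = vocab[self.namespace]
        return [table.get(tok, 1) for tok in tokens]   # 1 = unknown id


vocab = {
    "namespace1": {"cat": 2, "likes": 3},
    "namespace2": {"cat": 7, "likes": 8},
}
# Keys "www" and "zzz" only name the indexers; the namespace is the argument.
ti = {"www": TinyIndexer("namespace1"), "zzz": TinyIndexer("namespace2")}

tokens = ["cat", "likes"]
indexed = {name: indexer.index(tokens, vocab) for name, indexer in ti.items()}
print(indexed)   # {'www': [2, 3], 'zzz': [7, 8]}
```

The keys of the result are the indexer names, which is why an indexed field is itself a dictionary with one entry per indexer.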

DataIterator

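The DataIterator takes the collection of instances produced by the DataReader and turns it into a collection of batches.

A loosely coupled DataIterator

Every module in AllenNLP is loosely coupled, which means each one can be considered on its own.

For the DataReader this property is not yet obvious, because its input is a DataSet in some format (txt, csv, and so on) and you at least have to think about that format.

The input of a DataIterator, however, never changes: it is always an Iterable[Instance]. You only need to decide how to cut this collection into small batches, without caring what is inside each instance.

This leaves room to make the DataIterator fancy, with features such as padding, lazy loading, and shuffling.
AllenNLP implements these elegantly; reading the source is recommended.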
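
Because the input is always a collection of instances, batching can be written without looking inside them. The following is a minimal sketch of that idea with optional shuffling; `iterate_batches` and its parameters are hypothetical and much simpler than AllenNLP's iterators.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def iterate_batches(instances: Iterable[Sequence[str]],
                    batch_size: int,
                    shuffle: bool = True,
                    seed: int = 0) -> Iterator[List[Sequence[str]]]:
    """Cut any collection of instances into batches, ignoring their contents."""
    pool = list(instances)
    if shuffle:
        random.Random(seed).shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


instances = [["a"], ["b", "c"], ["d"], ["e", "f", "g"], ["h"]]
batches = list(iterate_batches(instances, batch_size=2, shuffle=False))
print([len(b) for b in batches])   # [2, 2, 1]
```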

Without any of these features, the simplest version looks like this:

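from typing import Iterable, Iterator

from allennlp.data.dataset import Batch
from allennlp.data.instance import Instance
from allennlp.data.iterators.data_iterator import DataIterator, TensorDict


@DataIterator.register('whole_set_iterator')
class WholeSetIterator(DataIterator):
    def __call__(self,
                 instances: Iterable[Instance],
                 num_epochs: int = None,
                 shuffle: bool = True) -> Iterator[TensorDict]:
        # one pass over the data; num_epochs is ignored in this minimal version
        batches = self._create_batches(instances, shuffle)
        for batch in batches:
            # convert tokens to ids with the vocabulary, then to tensors
            batch.index_instances(self.vocab)
            yield batch.as_tensor_dict()

    def _create_batches(self, instances: Iterable[Instance], shuffle: bool) -> Iterable[Batch]:
        # the whole dataset becomes a single batch
        yield Batch(instances)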

Batch alignment

    trainer=Trainer(model=KILM,
                    optimizer=optim.Adam(KILM.parameters(),lr=0.01),
                    iterator=it,
                    train_dataset=A,
                    cuda_device=0 if use_gpu else -1,
                    num_epochs=500,
                    shuffle=False
                    )

This is where training in AllenNLP (AN) lines up: __call__ -> forward.
The iterator's __call__ produces a batch and hands it to model.forward.
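
The hand-off from the iterator to forward can be sketched as a loop. Everything here is a stand-in written for illustration: `tiny_iterator`, `TinyModel`, and the toy loss are hypothetical, and the real Trainer also handles backpropagation, the optimizer, epochs, and devices.

```python
from typing import Dict, Iterator, List


def tiny_iterator(instances: List[Dict[str, List[int]]]) -> Iterator[Dict[str, List[List[int]]]]:
    """Stand-in for the iterator's __call__: yield one batch as a dict of fields."""
    yield {
        "tokens": [inst["tokens"] for inst in instances],
        "target": [inst["target"] for inst in instances],
    }


class TinyModel:
    """Stand-in model whose forward takes the field names as keyword arguments."""

    def forward(self, tokens: List[List[int]], target: List[List[int]]) -> Dict[str, float]:
        # Toy "loss": fraction of positions where the input id differs from the target id.
        pairs = [(t, y) for row_t, row_y in zip(tokens, target) for t, y in zip(row_t, row_y)]
        return {"loss": sum(t != y for t, y in pairs) / len(pairs)}


def train_one_epoch(model: TinyModel, instances: List[Dict[str, List[int]]]) -> List[float]:
    losses = []
    for batch in tiny_iterator(instances):   # __call__ produces a batch
        output = model.forward(**batch)      # batch keys become keyword arguments
        losses.append(output["loss"])        # a real trainer would backpropagate this
    return losses


A = [{"tokens": [2, 3, 4], "target": [3, 4, 5]},
     {"tokens": [6, 6, 6], "target": [6, 6, 7]}]
print(train_one_epoch(TinyModel(), A))
```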

Model

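The structure of an NLP model is fairly fixed:

an embedding layer, which produces word representations
a seqEncoder (for example LSTM, GRU, or Transformer), which produces the sequence representation
the downstream task

In the toy model below there are two downstream tasks: training and generation.
During training, a decoder (I use a Linear layer) follows the encoder and maps each position's representation to a word class, which makes this a classification task; cross entropy then gives the loss.
During generation, a single-layer recurrent network (a GRU by default in the code below) is applied step by step after the encoder, producing one word per step until the requested length is reached. Because the model is so simple, the generated text is of little practical value; I will keep improving it.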
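
The generation procedure described above is greedy decoding: at every step, take the most probable next word and feed it back in. A minimal sketch follows, with a hand-written next-word table standing in for the encoder, decoder, and softmax; the table and `greedy_generate` are invented for illustration.

```python
from typing import Callable, Dict, List


def greedy_generate(step: Callable[[str], Dict[str, float]],
                    start: str,
                    length: int) -> List[str]:
    """Repeatedly pick the most probable next token until `length` tokens exist."""
    generated: List[str] = []
    prev = start
    for _ in range(length):
        probs = step(prev)                   # distribution over next tokens
        prev = max(probs, key=probs.get)     # greedy choice (argmax)
        generated.append(prev)
    return generated


# Hypothetical next-token table standing in for encoder + decoder + softmax.
TABLE = {
    "cat": {"likes": 0.9, "mouse": 0.1},
    "likes": {"catching": 0.7, "cat": 0.3},
    "catching": {"mouse": 0.8, "cat": 0.2},
    "mouse": {"<EOS>": 1.0},
    "<EOS>": {"<EOS>": 1.0},
}

print(greedy_generate(TABLE.__getitem__, "cat", 3))   # ['likes', 'catching', 'mouse']
```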


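from typing import Dict, List, Union

import torch
import torch.nn as nn
import torch.nn.functional as F
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.modules import Seq2SeqEncoder, TextFieldEmbedder
from allennlp.nn.util import get_text_field_mask


@Model.register("KILM")
class KILM(Model):
    def __init__(self,
                 vocab: Vocabulary = None,
                 emb: TextFieldEmbedder = None,
                 rnn: Seq2SeqEncoder = None,
                 decoder=None,
                 loss_function=None,
                 generator=None):
        super(KILM, self).__init__(vocab)
        self.emb = emb
        self.rnn = rnn
        self.decoder = decoder
        self.loss_function = loss_function
        # the generator consumes encoder outputs, so its size follows the encoder
        hidden_dim = rnn.get_output_dim()
        self.generator = generator or nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self,
                tokens: Dict[str, torch.Tensor],         # [batch, seq]
                target: Dict[str, torch.Tensor] = None,  # pass target when training
                length: int = 0                          # pass length when generating
                ) -> Union[List[List[str]], Dict[str, torch.Tensor]]:
        # [batch, seq]
        mask = get_text_field_mask(tokens)
        # [batch, seq, emb]
        emb_value = self.emb(tokens)
        # [batch, seq, hidden]
        out = self.rnn(emb_value, mask)
        # generation
        if length > 0:
            batch_word = []
            # [batch, 1, hidden]
            prev = out[:, -2:-1]
            # [1, batch, hidden]: the hidden state needs batch in dim 1,
            # and the permuted tensor must be made contiguous
            hidden = prev.permute(1, 0, 2).contiguous()
            for i in range(length):
                # [batch, 1, vocab]: project to vocabulary scores before the softmax
                word = F.softmax(self.decoder(prev), dim=2)
                # [batch, 1]
                word_inx = word.max(dim=2)[1]
                batch_word.append([self.vocab.get_token_from_index(index.item()) for index in word_inx])

                # TODO Attention Decoder
                prev, hidden = self.generator(prev, hidden)
            # transpose [length][batch] into one token list per batch element
            return [list(sequence) for sequence in zip(*batch_word)]
        # training
        else:
            logit = self.decoder(out)
            # the key is the indexer name used for the target field
            target = target['tokens']
            loss = self.loss_function(logit, target, mask)

            return {"logit": logit, "loss": loss}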

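forward differs from a plain PyTorch forward in three respects:

the parameters
the mask
the return value

Parameters

The parameters are keyword arguments named after the fields of the instance.
Instance is a dictionary of dictionaries; batch.as_tensor_dict() unpacks the batched instances into a dictionary, which is then passed to forward as keyword arguments.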
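
One practical consequence is that field names and forward's parameter names have to agree, since the batch dictionary is unpacked with `**`. The sketch below illustrates this with a placeholder function that has the same parameter names as the toy model; the body and the `unexpected_keys` helper are hypothetical.

```python
import inspect
from typing import Any, Dict, List


def forward(tokens, target=None, length: int = 0) -> Dict[str, Any]:
    """Same parameter names as the toy model's forward; the body is a placeholder."""
    return {"mode": "generate" if length > 0 else "train",
            "has_target": target is not None}


def unexpected_keys(batch: Dict[str, Any]) -> List[str]:
    """Return batch keys that forward would reject as unknown keyword arguments."""
    accepted = inspect.signature(forward).parameters
    return [key for key in batch if key not in accepted]


good = {"tokens": [[2, 3, 4]], "target": [[3, 4, 5]]}
bad = {"tokens": [[2, 3, 4]], "targets": [[3, 4, 5]]}   # field misnamed

print(forward(**good))        # {'mode': 'train', 'has_target': True}
print(unexpected_keys(bad))   # ['targets']
```

Calling `forward(**bad)` raises a TypeError, which is the kind of failure a mismatch between a field name in text_to_instance and a parameter name in forward produces.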

Mask

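We want all sequences in a batch to have the same length so that they can be passed in as a single two-dimensional Tensor ([batch, seq_length]). An important job of the iterator is therefore to sort instances by sequence length, group instances of similar length into one batch, and pad the shorter ones with zeros.
When using the batch, first obtain the mask, in which 0 marks a padding position; passing the mask into the computation as a weight makes the padding positions count for nothing.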
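
Padding and masking can be shown on plain lists. The sketch assumes a padding id of 0 and uses made-up per-token loss values; `pad_batch` and `masked_mean` are hypothetical helpers that mirror the idea, not AllenNLP functions.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad sequences to the longest one; mask is 1 for real tokens, 0 for padding."""
    longest = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (longest - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (longest - len(s)) for s in seqs]
    return padded, mask


def masked_mean(values: List[List[float]], mask: List[List[int]]) -> float:
    """Average per-position values, counting only positions where mask is 1."""
    total = sum(v * m for row_v, row_m in zip(values, mask) for v, m in zip(row_v, row_m))
    count = sum(sum(row) for row in mask)
    return total / count


padded, mask = pad_batch([[2, 3, 4], [5, 6]])
print(padded)   # [[2, 3, 4], [5, 6, 0]]
print(mask)     # [[1, 1, 1], [1, 1, 0]]

per_token_loss = [[1.0, 2.0, 3.0], [4.0, 5.0, 100.0]]   # last value sits on padding
print(masked_mean(per_token_loss, mask))                # 3.0
```

The large value at the padded position does not affect the result, because the mask weights it by zero and excludes it from the count.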

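Return value

The return value is a dictionary that contains at least a "loss" key. Because the loss is computed inside forward, the Trainer can run the whole training step in one call.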

Trainer

The Trainer wraps the training loop.

    trainer=Trainer(model=KILM,
                    optimizer=optim.Adam(KILM.parameters(),lr=0.01),
                    iterator=it,
                    train_dataset=A,
                    cuda_device=0 if use_gpu else -1,
                    num_epochs=500,
                    shuffle=False
                    )

    trainer.train()

Predictor

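A Predictor uses the trained model for the downstream task and sits outside the Trainer loop. Using AllenNLP's built-in predictors is unnecessary and not recommended, because they are hard to customize.
As long as you follow the conventions of forward, writing your own predictor is easy.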

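from typing import Iterable

import torch
from tqdm import tqdm
from allennlp.data import Instance
from allennlp.data.iterators import DataIterator
from allennlp.models import Model
from allennlp.nn import util
from allennlp.predictors import Predictor


class LMPredictor(Predictor):
    def __init__(self,
                 model: Model,
                 Source: Iterable[Instance],
                 it: DataIterator,
                 device=-1):
        self.model = model
        self.it = it
        self.source = Source
        self.device = device

    def predict(self,
                length,
                outputpath: str):
        # get the batches first; num_epochs must be set by hand, otherwise iteration never stops
        batches = self.it(self.source, num_epochs=1)
        batches = tqdm(batches, total=self.it.get_num_batches(self.source))
        # enter both context managers: gradients off and the output file open
        with torch.no_grad(), open(outputpath, 'w') as f:
            for batch in batches:
                # each batch has to be moved to the device once it is created
                batch = util.move_to_device(batch, self.device)
                # one entry per batch element; each entry is a list of [generating_length] str
                words = self.model.forward(**batch, length=length)
                f.writelines(''.join(sequence) + '\n' for sequence in words)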