Converting BERT-style models and keeping [unusedxx] tokens unsplit

This post shows two different ways to convert a PyTorch-format model to ONNX, and explains how to keep [unusedX] tokens from being split when tokenizing with the transformers module. The corresponding handling in the tensorflow stack is covered as well.

Preface

The two mainstream model frameworks in industry are TensorFlow and PyTorch, and models frequently need to be converted to an intermediate format such as ONNX. Separately, tokenization should keep [unusedX] tokens intact, but by default they get split into word pieces. This post records how to handle both.

Converting a torch model to ONNX

There are many ways to convert a torch model to ONNX; here are two of them.

Method 1

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# model_path is the saved torch checkpoint, onnx_path the output file
def lower_level(model_path, onnx_path="bert_std.onnx"):
    # load model and tokenizer, registering [unused0]..[unused99] as special tokens
    added_token = ["[unused%s]" % i for i in range(100)]
    print("added_token:", added_token[:10])
    tokenizer = AutoTokenizer.from_pretrained(model_path, additional_special_tokens=added_token)
    dummy_model_input = tokenizer("hello bert", return_tensors="pt")
    unused_input = tokenizer("hello bert[unused17]", return_tensors="pt")

    print("dummy_model_input:", dummy_model_input)
    print("unused_input:", unused_input)
    model = AutoModelForMaskedLM.from_pretrained(model_path)

    # export; pass the tensors explicitly in the order of the model's forward
    # signature: a BERT tokenizer also returns token_type_ids, so
    # tuple(dummy_model_input.values()) would bind that tensor to attention_mask
    torch.onnx.export(
        model,
        (dummy_model_input["input_ids"], dummy_model_input["attention_mask"]),
        f=onnx_path,
        input_names=['input_ids', 'attention_mask'],
        output_names=['logits'],
        dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'},
                      'attention_mask': {0: 'batch_size', 1: 'sequence'},
                      'logits': {0: 'batch_size', 1: 'sequence'}},
        do_constant_folding=True,
        opset_version=13,
    )
    print("over")

Method 2

def middle_level(model_path, onnx_path="bert_std.onnx"):
    from pathlib import Path
    import transformers
    from transformers.onnx import FeaturesManager
    from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

    # load model and tokenizer
    feature = "sequence-classification"
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # load config
    model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
    onnx_config = model_onnx_config(model.config)

    # export
    onnx_inputs, onnx_outputs = transformers.onnx.export(
        preprocessor=tokenizer,
        model=model,
        config=onnx_config,
        opset=13,
        output=Path(onnx_path)
    )
    print("onnx_inputs:", onnx_inputs)
    print("onnx_outputs:", onnx_outputs)
    print("over")

Keeping [unusedX] tokens unsplit

The transformers module

Add the additional_special_tokens argument to AutoTokenizer.from_pretrained, e.g.:

    added_token = ["[unused%s]" % i for i in range(100)]
    print("added_token:", added_token[:10])
    tokenizer = AutoTokenizer.from_pretrained(model_path, additional_special_tokens=added_token)

The full code is as follows:

from transformers import AutoTokenizer, AutoModelForMaskedLM

def keep_unused_tokens(model_path):
    # load model and tokenizer, registering [unused0]..[unused99] as special tokens
    added_token = ["[unused%s]" % i for i in range(100)]
    print("added_token:", added_token[:10])
    tokenizer = AutoTokenizer.from_pretrained(model_path, additional_special_tokens=added_token)
    dummy_model_input = tokenizer("hello bert", return_tensors="pt")
    unused_input = tokenizer("hello bert[unused17]", return_tensors="pt")

    print("dummy_model_input:", dummy_model_input)
    print("unused_input:", unused_input)  # [unused17] is encoded as a single id
    model = AutoModelForMaskedLM.from_pretrained(model_path)
    print("over")

The tensorflow module

With the tensorflow-models BertTokenizer layer, the equivalent is passing preserve_unused_token=True through tokenizer_kwargs:

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_models as tfm
from transformers import TFAutoModel

preprocessor = hub.load(bert_preprocess_path)
# preserve_unused_token=True keeps the [unusedX] vocab entries unsplit
tokenizer = tfm.nlp.layers.BertTokenizer(
    vocab_file=vocab_path, lower_case=True,
    tokenizer_kwargs=dict(preserve_unused_token=True, token_out_type=tf.int32))

bert_pack_inputs = hub.KerasLayer(
    preprocessor.bert_pack_inputs,
    arguments=dict(seq_length=seq_length))  # optional argument
encoder = TFAutoModel.from_pretrained(checkpoint_dir, from_pt=True)
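
For a standalone check outside the hub pipeline, the tensorflow_text BertTokenizer exposes the same preserve_unused_token switch directly. A minimal sketch, with the vocab path assumed:

import tensorflow as tf
import tensorflow_text as tf_text

vocab_path = "vocab.txt"  # assumed: vocab file of the BERT checkpoint

# preserve_unused_token=True keeps [unusedX] vocab entries as single tokens
tokenizer = tf_text.BertTokenizer(
    vocab_path, lower_case=True, token_out_type=tf.int32, preserve_unused_token=True
)
ids = tokenizer.tokenize(["hello bert [unused17]"])
print(ids)  # RaggedTensor of word-piece ids; [unused17] maps to one id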