Model Training
Training is performed by passing the parameters of spm_train to the SentencePieceTrainer.train() function.
import sentencepiece as spm
# train a sentencepiece model from `botchan.txt`, producing `m.model` and `m.vocab`
# `m.vocab` is just a reference; it is not used in segmentation.
spm.SentencePieceTrainer.train('--input=examples/hl_data/botchan.txt --model_prefix=m --vocab_size=500')
Parameters that can be passed during training:
--input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
--model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
--vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
--character_coverage: amount of characters covered by the model; good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
--model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using the word type.
Segmentation
# make a segmenter instance and load the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
text = """
If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
"""
# encode: text => id
print(sp.encode_as_pieces(text))
print(sp.encode_as_ids(text))
# decode: id => text
# (the hard-coded pieces/ids below come from the official README example and may not
#  match this small 500-piece model, so they are left commented out)
# print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est', 'ly']))
# print(sp.decode_ids([209, 31, 9, 375, 586, 34]))
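A safer check than hard-coded ids is to round-trip the encoded output of the text itself; a small sketch using the same processor instance:
# round-trip check: decoding the encoded output should reproduce the input text
# (up to the whitespace normalization that SentencePiece applies)
pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)
print(sp.decode_pieces(pieces))
print(sp.decode_ids(ids))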
Official reference link:
sentencepiece/README.md at master · google/sentencepiece · GitHub