SentencePiece Python in Practice

Model Training

Training is performed by passing spm_train parameters to the SentencePieceTrainer.train() function.

    import sentencepiece as spm

    # Train a SentencePiece model from `botchan.txt`; this writes `m.model` and `m.vocab`.
    # `m.vocab` is for reference only; it is not used during segmentation.
    spm.SentencePieceTrainer.train('--input=examples/hl_data/botchan.txt --model_prefix=m --vocab_size=500')

Parameters that can be passed during training:

  • --input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
  • --character_coverage: fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.
  • --model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentences must be pretokenized when using the word type.
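The flag string passed to train() can get long once several of these parameters are set. The sketch below assembles it from keyword options. `build_train_args` and `train_model` are hypothetical helper names introduced here for illustration; they are not part of the sentencepiece package. The only library call used is the same string-argument form of SentencePieceTrainer.train() shown earlier.

```python
def build_train_args(**options):
    """Join keyword options into a flag string such as '--input=a.txt --vocab_size=500'."""
    parts = []
    for name, value in options.items():
        # --input accepts a comma-separated list of files, so join lists and tuples.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        parts.append(f"--{name}={value}")
    return " ".join(parts)


def train_model(**options):
    """Hypothetical wrapper: build the flag string and hand it to SentencePiece."""
    import sentencepiece as spm  # imported lazily; only needed when training actually runs

    spm.SentencePieceTrainer.train(build_train_args(**options))


args = build_train_args(
    input=["examples/hl_data/botchan.txt"],
    model_prefix="m",
    vocab_size=500,
    character_coverage=1.0,
    model_type="unigram",
)
print(args)
```

Calling train_model(...) with the same keyword options would then run the training, assuming sentencepiece is installed and the corpus file exists at that path.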

 

Segmentation

    # Create a segmenter instance and load the model file (m.model).
    sp = spm.SentencePieceProcessor()
    sp.load('m.model')

    text = """
    If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
    """
    # encode: text => pieces / ids
    print(sp.encode_as_pieces(text))
    print(sp.encode_as_ids(text))

    # decode: pieces / ids => text
    # (example values only; the actual pieces and ids depend on the trained model)
    # print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est', 'ly']))
    # print(sp.decode_ids([209, 31, 9, 375, 586, 34]))
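To make the pieces => text direction concrete: SentencePiece escapes whitespace with the meta symbol "▁" (U+2581), so a piece sequence can be turned back into text by concatenating the pieces and replacing that symbol with a space. The sketch below is a pure-Python illustration of this idea, not the library's own implementation of decode_pieces. `detokenize` is a hypothetical name, and dropping the leading space is an assumption made to match the usual case where the first piece starts with "▁".

```python
WHITESPACE_MARK = "\u2581"  # the '▁' meta symbol SentencePiece uses for whitespace


def detokenize(pieces):
    """Concatenate pieces and turn each '▁' back into a space."""
    text = "".join(pieces).replace(WHITESPACE_MARK, " ")
    # The first piece normally starts with '▁', which leaves one leading space; drop it.
    return text[1:] if text.startswith(" ") else text


print(detokenize(['▁This', '▁is', '▁a', '▁t', 'est', 'ly']))  # This is a testly
```

Because whitespace is kept inside the pieces themselves, no language-specific rules are needed to restore the original spacing.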

Official reference link:

sentencepiece/README.md at master · google/sentencepiece · GitHub
