Practical Scripts for Converting Word-Segmentation Corpus Formats

Format reference

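An example of the segmented format:
[Image: segmented-format example]
Format description: each sentence or paragraph occupies one line. The words in a sentence are separated by spaces, and punctuation marks and other symbols are also set off by spaces.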

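An example of the BMES format:

[Image: BMES-format example]
Format description: each character of a sentence (including punctuation marks and other symbols) occupies its own line and is followed by its tag (B, M, E, or S), with a tab character between the two.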
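To make the tagging rule concrete, here is a minimal sketch of how the tags for one word can be derived, using the usual reading of the scheme: S for a single-character word, otherwise B for the first character, E for the last, and M for everything in between. The helper name `bmes_tags` and the sample sentence are made up for illustration and are not part of the original scripts.

```python
def bmes_tags(word):
    """Return one tag per character of a single word.

    S = single-character word, B = first character,
    M = middle character, E = last character.
    """
    if not word:
        return []
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


# Made-up sample sentence in the segmented format.
sample = "我 喜欢 自然语言 。"
for word in sample.split():
    for char, tag in zip(word, bmes_tags(word)):
        # One character per line, a tab, then the tag.
        print(char + "\t" + tag)
```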

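The raw-corpus format looks like this:

[Image: raw-corpus example]
Format description: each sentence or paragraph occupies one line. This is text that has not been preprocessed or annotated in any way, such as content crawled from the web.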

Converting from the BMES format to the segmented format:

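#!/usr/bin/python
# -*- coding: UTF-8 -*-

# Read BMES lines ("char<TAB>tag") from "train" and append the
# space-separated text to "train.txt".
with open("train", "r", encoding="UTF-8") as fo, \
        open("train.txt", "a", encoding="UTF-8") as fo2:
    for st in fo:
        # A line holding only "\n" marks a sentence boundary.
        if len(st) == 1:
            fo2.write("\n")
            continue
        i = st.split("\t")
        tag = i[1].strip()
        if tag == "B" or tag == "M":
            # Inside a word: write the character without a separator.
            fo2.write(i[0])
        else:
            # S or E closes a word, so follow it with a space.
            fo2.write(i[0] + " ")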
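The script above raises an `IndexError` on any non-empty line that has no tab. As a rough sketch of a more defensive variant, the function below applies the same rule to a list of lines in memory and simply skips malformed lines. The name `bmes_to_segmented` and the choice to skip bad lines rather than report them are assumptions for illustration, not something the original script does.

```python
def bmes_to_segmented(lines):
    """Rebuild space-separated sentences from 'char<TAB>tag' lines.

    A blank line ends the current sentence. Lines without a tab are
    skipped (an assumption; the original script would fail on them).
    """
    sentences, words, current = [], [], ""
    for line in lines:
        if not line.strip():
            # Sentence boundary: flush any pending word and sentence.
            if current:
                words.append(current)
                current = ""
            if words:
                sentences.append(" ".join(words))
                words = []
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # malformed line, no tag present
        char, tag = parts[0], parts[1].strip()
        current += char
        if tag in ("S", "E"):
            # S and E both close a word.
            words.append(current)
            current = ""
    # Flush whatever is left when the input does not end with a blank line.
    if current:
        words.append(current)
    if words:
        sentences.append(" ".join(words))
    return sentences
```

Unlike the script, this version does not leave a trailing space at the end of each sentence, because the words are joined only when the sentence is complete.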

Converting from the segmented format to the BMES format:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

# Read space-separated text from "test" and append BMES lines
# ("char<TAB>tag") to "test.txt".
with open("test", "r", encoding="UTF-8") as fo, \
        open("test.txt", "a", encoding="UTF-8") as fo2:
    for st in fo:
        for i in st.strip().split(" "):
            m = 1  # 1-based position of char within the word i
            for char in i:
                if i == "。":
                    # Sentence end: add a blank line after the tag.
                    fo2.write(char + "\t" + "S" + "\n\n")
                elif i == "【" or i == "】":
                    continue  # drop bracket tokens
                elif len(i) == 1:
                    fo2.write(char + "\t" + "S" + "\n")
                elif m == 1:
                    fo2.write(char + "\t" + "B" + "\n")
                elif m == len(i):
                    fo2.write(char + "\t" + "E" + "\n")
                else:
                    fo2.write(char + "\t" + "M" + "\n")
                m = m + 1
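The script above starts a new sentence only after "。" and hard-codes the two bracket tokens it drops. If a corpus also ends sentences with other marks, one possible variation is to make both sets configurable. The sketch below is only an illustration: the names `SENTENCE_ENDERS`, `SKIPPED_TOKENS`, and `segmented_to_bmes`, and the extra full-width "!" and "?" in the ender set, are assumptions that the original post does not make.

```python
SENTENCE_ENDERS = {"。", "!", "?"}  # assumption: extend or shrink as needed
SKIPPED_TOKENS = {"【", "】"}         # the bracket tokens the script drops


def segmented_to_bmes(sentence, enders=SENTENCE_ENDERS, skipped=SKIPPED_TOKENS):
    """Return the BMES text for one space-separated sentence."""
    out = []
    for word in sentence.split():
        if word in skipped:
            continue
        last = len(word) - 1
        for pos, char in enumerate(word):
            if last == 0:
                tag = "S"
            elif pos == 0:
                tag = "B"
            elif pos == last:
                tag = "E"
            else:
                tag = "M"
            out.append(char + "\t" + tag + "\n")
        if word in enders:
            out.append("\n")  # blank line marks a sentence boundary
    return "".join(out)
```

Passing `enders={"。"}` reproduces the sentence-boundary behaviour of the script above.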

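Building a dictionary from the corpus (file "words") and a dictionary with word frequencies (file "words_for_training"), with punctuation marks filtered out. The code is as follows: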

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import string

from zhon.hanzi import punctuation as punc  # third-party package "zhon"

epunc = string.punctuation

vocab = {}
with open("train", "r", encoding="UTF-8") as fo:
    for st in fo:
        for token in st.split(" "):
            i = token.rstrip()
            if not i:
                continue  # empty token from a double space or the line end
            if len(i) == 1 and (i in punc or i in epunc):
                continue  # skip single punctuation marks
            vocab[i] = vocab.get(i, 0) + 1

with open("words_for_training", "a", encoding="UTF-8") as fo2, \
        open("words", "a", encoding="UTF-8") as fo3:
    for i in vocab:
        fo2.write("NONE " + i + " " + str(vocab[i]) + "\n")
        fo3.write(i + "\n")
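The same counting can be written with `collections.Counter`. The sketch below also swaps the `zhon` check for a standard-library approximation so that it runs without extra packages: a one-character token is treated as punctuation if it is in `string.punctuation` or its Unicode category starts with "P". That approximation, the function names, and the frequency-sorted output are all assumptions for illustration; the script above writes entries in insertion order and relies on `zhon`.

```python
import string
import unicodedata
from collections import Counter


def is_single_punct(token):
    """True for a one-character token that looks like punctuation.

    Stand-in for the zhon-based check: Unicode categories starting
    with 'P' cover Chinese and ASCII punctuation, and
    string.punctuation adds ASCII symbols such as '+' or '$'.
    """
    if len(token) != 1:
        return False
    return (token in string.punctuation
            or unicodedata.category(token).startswith("P"))


def count_words(lines):
    """Count the words of space-separated lines, ignoring punctuation."""
    counts = Counter()
    for line in lines:
        # split() drops empty tokens and the trailing newline.
        counts.update(w for w in line.split() if not is_single_punct(w))
    return counts


def format_entries(counts):
    """Lines in the 'NONE word count' layout, most frequent first."""
    return ["NONE {} {}".format(word, n) for word, n in counts.most_common()]
```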

Converting from the segmented format to the BMES format (with punctuation marks filtered out):

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import string

from zhon.hanzi import punctuation as punc  # third-party package "zhon"

epunc = string.punctuation

with open("test", "r", encoding="UTF-8") as fo, \
        open("test2", "a", encoding="UTF-8") as fo2:
    for st in fo:
        for i in st.rstrip().split(" "):
            if len(i) == 1 and (i in punc or i in epunc):
                continue  # skip single punctuation marks
            m = 1  # 1-based position of char within the word i
            for char in i:
                if len(i) == 1:
                    tag = "S"
                elif m == 1:
                    tag = "B"
                elif m == len(i):
                    tag = "E"
                else:
                    tag = "M"
                # "\\<NONE>" writes a literal backslash before <NONE>,
                # which is what the original "\<NONE>" literal produces.
                fo2.write(char + "#" + "\\<NONE>" + "#" + tag + "_NONE\n")
                m = m + 1


Converting from the segmented format to a raw corpus (convenient for testing):

#!/usr/bin/python
# -*- coding: UTF-8 -*-

# Remove the spaces between words; each line keeps its trailing newline.
with open("test", "r", encoding="UTF-8") as fo, \
        open("test_raw", "a", encoding="UTF-8") as fo2:
    for st in fo:
        for i in st.split(" "):
            fo2.write(i)
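Removing every space works well for pure Chinese text, but it also glues together neighbouring ASCII tokens such as English words or numbers. As a hedged sketch of one way around that, the function below keeps a single space only between two tokens that meet at ASCII letters or digits. The name `segmented_to_raw` and this spacing rule are assumptions for illustration; the script above removes all spaces unconditionally. It requires Python 3.7 or later for `str.isascii`.

```python
def _is_ascii_alnum(ch):
    """True for an ASCII letter or digit."""
    return ch.isascii() and ch.isalnum()


def segmented_to_raw(line):
    """Join the tokens of one segmented line into raw text.

    A space is kept only where two ASCII alphanumeric tokens meet.
    The trailing newline is not preserved.
    """
    tokens = line.split()
    out = []
    for prev, tok in zip([""] + tokens, tokens):
        if prev and _is_ascii_alnum(prev[-1]) and _is_ascii_alnum(tok[0]):
            out.append(" ")
        out.append(tok)
    return "".join(out)
```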