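Format reference
Segmented format:
Description: each sentence or paragraph occupies one line. The words within a sentence are separated by spaces, and punctuation marks and other symbols are likewise set off by spaces.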
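To make the layout concrete, here is a minimal sketch that reads one segmented line back into a list of tokens. The sample sentence and the function name `parse_segmented_line` are illustrative assumptions, not taken from any particular corpus.

```python
def parse_segmented_line(line):
    """Split one segmented line into its tokens (words and punctuation)."""
    # Tokens are separated by spaces; split() also drops the trailing newline.
    return line.split()


# Hypothetical sample line: three words followed by a punctuation token.
sample = "我 喜欢 自然语言处理 。\n"
tokens = parse_segmented_line(sample)
print(tokens)
```

Because punctuation is set off by spaces as well, it comes back as a token of its own and can be filtered or tagged like any other word.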
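BMES format:
Description: each character of a sentence (including punctuation and other symbols) occupies its own line, followed by its tag (B, M, E, or S); the character and the tag are separated by a tab. B marks the first character of a multi-character word, M a middle character, E the last character, and S a single-character word.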
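The tagging rule can be sketched in a few lines. The helper names `bmes_tags` and `to_bmes_lines` and the sample words are assumptions made for illustration; the output only shows what the "character, tab, tag" layout might look like.

```python
def bmes_tags(word):
    """Return one BMES tag per character of a word."""
    if len(word) == 1:
        return ["S"]  # single-character word
    # First character B, last character E, everything in between M.
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def to_bmes_lines(words):
    """Render words as 'character<TAB>tag' lines, one character per line."""
    lines = []
    for word in words:
        for char, tag in zip(word, bmes_tags(word)):
            lines.append(char + "\t" + tag)
    return lines


# Hypothetical sample: an already segmented sentence.
sample_words = ["我", "喜欢", "自然语言处理", "。"]
bmes_lines = to_bmes_lines(sample_words)
print("\n".join(bmes_lines))
```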
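Raw corpus format:
Description: each sentence or paragraph occupies one line. The text has had no preprocessing or annotation applied, like content crawled from the web.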
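Converting from BMES format to segmented format:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Read BMES lines from "train" and append segmented text to "train.txt".
with open("train", "r", encoding="UTF-8") as fin, \
     open("train.txt", "a", encoding="UTF-8") as fout:
    for line in fin:
        fields = line.split()  # character and tag, separated by whitespace (tab)
        if len(fields) < 2:
            fout.write("\n")  # a blank line ends the current sentence
            continue
        char, tag = fields[0], fields[1]
        if tag in ("B", "M"):
            fout.write(char)  # the word continues with the next character
        else:
            fout.write(char + " ")  # S or E closes the word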
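For quick checks without touching files, the same conversion can be sketched as a function over an in-memory list of lines. The name `bmes_to_segmented` and the sample data are hypothetical; unlike the script, this sketch joins words with single spaces and leaves no trailing space.

```python
def bmes_to_segmented(lines):
    """Rebuild segmented sentences from 'character<TAB>tag' lines."""
    sentences, words, current = [], [], ""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # blank line: sentence boundary
            if current:
                words.append(current)
                current = ""
            if words:
                sentences.append(" ".join(words))
                words = []
            continue
        char, tag = fields[0], fields[1]
        current += char
        if tag in ("S", "E"):  # the word ends at this character
            words.append(current)
            current = ""
    # Flush whatever is left when the input does not end with a blank line.
    if current:
        words.append(current)
    if words:
        sentences.append(" ".join(words))
    return sentences


# Hypothetical sample: two sentences separated by a blank line.
sample = ["我\tS", "喜\tB", "欢\tE", "。\tS", "", "你\tS", "好\tS"]
print(bmes_to_segmented(sample))
```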
Converting from segmented format to BMES format:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Read segmented text from "test" and append BMES lines to "test.txt".
with open("test", "r", encoding="UTF-8") as fin, \
     open("test.txt", "a", encoding="UTF-8") as fout:
    for line in fin:
        for word in line.split():
            if word in ("【", "】"):
                continue  # drop bracket markers
            if word == "。":
                fout.write(word + "\tS\n\n")  # the blank line ends the sentence
                continue
            for pos, char in enumerate(word, start=1):
                if len(word) == 1:
                    tag = "S"
                elif pos == 1:
                    tag = "B"
                elif pos == len(word):
                    tag = "E"
                else:
                    tag = "M"
                fout.write(char + "\t" + tag + "\n")
Building a vocabulary from the corpus (file "words") and a vocabulary with word frequencies (file "words_for_training"), filtering out punctuation:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import string
from zhon.hanzi import punctuation as punc  # Chinese punctuation (third-party zhon package)

epunc = string.punctuation  # ASCII punctuation
vocab = {}
with open("train", "r", encoding="UTF-8") as fin:
    for line in fin:
        for word in line.split():
            if len(word) == 1 and (word in punc or word in epunc):
                continue  # skip single punctuation marks
            vocab[word] = vocab.get(word, 0) + 1

with open("words_for_training", "a", encoding="UTF-8") as freq_out, \
     open("words", "a", encoding="UTF-8") as words_out:
    for word, count in vocab.items():
        freq_out.write("NONE " + word + " " + str(count) + "\n")  # word with its frequency
        words_out.write(word + "\n")  # word only
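The counting step can also be sketched with `collections.Counter` from the standard library. To keep the sketch free of third-party packages, a small hand-written punctuation string (`CJK_PUNCT`, an assumption) stands in for `zhon.hanzi.punctuation`; the function name `count_words` and the sample lines are likewise illustrative.

```python
import string
from collections import Counter

# Assumption: a small hand-written set of Chinese punctuation marks,
# used here instead of zhon.hanzi.punctuation.
CJK_PUNCT = "。,、;:?!“”‘’()《》【】"


def count_words(lines):
    """Count word frequencies in segmented lines, skipping punctuation tokens."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            if len(word) == 1 and (word in CJK_PUNCT or word in string.punctuation):
                continue
            counts[word] += 1
    return counts


# Hypothetical sample corpus in segmented format.
sample = ["我 喜欢 苹果 。", "我 喜欢 香蕉 !"]
counts = count_words(sample)
for word, n in counts.most_common():
    print("NONE " + word + " " + str(n))
```

`most_common()` lists the most frequent words first, which may be convenient when inspecting the frequency file by hand.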
Converting from segmented format to BMES format (filtering out punctuation):
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import string
from zhon.hanzi import punctuation as punc  # Chinese punctuation (third-party zhon package)

epunc = string.punctuation  # ASCII punctuation
with open("test", "r", encoding="UTF-8") as fin, \
     open("test2", "a", encoding="UTF-8") as fout:
    for line in fin:
        for word in line.split():
            if len(word) == 1 and (word in punc or word in epunc):
                continue  # skip single punctuation marks
            for pos, char in enumerate(word, start=1):
                if len(word) == 1:
                    tag = "S"
                elif pos == 1:
                    tag = "B"
                elif pos == len(word):
                    tag = "E"
                else:
                    tag = "M"
                # Each output line has the form: character#<NONE>#tag_NONE
                fout.write(char + "#<NONE>#" + tag + "_NONE\n")
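Converting from segmented format to a raw corpus (convenient for testing):
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Read segmented text from "test" and append the unsegmented text to "test_raw".
with open("test", "r", encoding="UTF-8") as fin, \
     open("test_raw", "a", encoding="UTF-8") as fout:
    for line in fin:
        fout.write(line.replace(" ", ""))  # remove the spaces between words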
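As a small in-memory illustration of this step, the following sketch removes the word separators from one segmented line. The function name `segmented_to_raw` and the sample sentence are assumptions for demonstration only.

```python
def segmented_to_raw(line):
    """Remove the spaces that separate words in a segmented line."""
    return line.replace(" ", "")


# Hypothetical sample line in segmented format.
segmented = "我 喜欢 自然语言处理 。\n"
raw = segmented_to_raw(segmented)
print(raw, end="")
```

One caveat of this approach: spaces that belonged to the original text, for example between English words, are removed along with the separators.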