Practical Format-Conversion Scripts for Word Segmentation Corpora

Latest recommended article published on 2024-01-22 16:09:20

Pennyyu0214

Article tags: nlp, natural language processing, python

Article link: https://blog.csdn.net/pennyyu123/article/details/93231553

Format reference

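The segmented format:
Format description: each sentence or paragraph is on its own line; the words in a sentence are separated by spaces, and punctuation marks and other symbols are likewise set off by spaces.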

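The BMES format:

Format description: each character of a sentence (including punctuation marks and other symbols) is on its own line, followed by its tag (B, M, E, or S: beginning, middle, or end of a word, or a single-character word); the character and the tag are separated by a tab.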
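
To make the two descriptions above concrete, here is a minimal sketch that tags an already-segmented sentence with BMES labels in memory and prints it in BMES format. The sample sentence and the helper name `bmes_tags` are illustrative assumptions, not part of the original scripts.

```python
def bmes_tags(word):
    """Return the BMES tag for each character of one word (hypothetical helper)."""
    if len(word) <= 1:
        # A single-character word is tagged S; an empty token yields no tags.
        return ["S"] * len(word)
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


# Assumed sample line in segmented format: words and punctuation separated by spaces.
segmented = "我 喜欢 自然语言处理 。"

for word in segmented.split(" "):
    for char, tag in zip(word, bmes_tags(word)):
        # BMES format: one character per line, then a tab, then its tag.
        print(char + "\t" + tag)
```

Each printed line holds one character and its tag, which is the layout the conversion scripts in this post read and write.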

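The raw corpus format:

Format description: each sentence or paragraph is on its own line; this is a corpus that has not been preprocessed or annotated in any way, such as text crawled from the web.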

Converting from BMES format to segmented format:

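#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

# Convert a BMES-tagged corpus ("train") to segmented format ("train.txt").
with open("train", "r", encoding="UTF-8") as fo, \
        open("train.txt", "a", encoding="UTF-8") as fo2:
    for st in fo:
        # A line holding only a newline marks a sentence boundary.
        if st == "\n":
            fo2.write("\n")
            continue
        i = st.split("\t")
        tag = i[1].strip()
        if tag == "B" or tag == "M":
            # Inside a word: write the character with no separator.
            fo2.write(i[0])
        else:
            # "S" or "E" closes a word: follow the character with a space.
            fo2.write(i[0] + " ")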
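
For quick checks without touching files, the same idea can be sketched as a function over a list of lines. The name `bmes_to_segmented` and the sample input are assumptions for illustration; unlike the script above, this sketch also trims the trailing space at the end of each sentence.

```python
def bmes_to_segmented(lines):
    """Join BMES-tagged lines ("char<TAB>tag") into space-separated sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":
            # Blank line: sentence boundary.
            if current:
                sentences.append("".join(current).rstrip())
                current = []
            continue
        char, tag = line.split("\t")
        # B and M continue a word; S and E close it, so a space follows.
        current.append(char if tag in ("B", "M") else char + " ")
    if current:
        sentences.append("".join(current).rstrip())
    return sentences


# Assumed sample input: two short sentences separated by a blank line.
sample = ["我\tS", "喜\tB", "欢\tE", "。\tS", "", "你\tB", "好\tE"]
print(bmes_to_segmented(sample))
```

Returning a list of sentences instead of writing to a file makes the logic easy to test on a few lines before running it over a whole corpus.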

Converting from segmented format to BMES format:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

# Convert a segmented corpus ("test") to BMES format ("test.txt").
with open("test", "r", encoding="UTF-8") as fo, \
        open("test.txt", "a", encoding="UTF-8") as fo2:
    for st in fo:
        for i in st.strip().split(" "):
            # Skip bracket tokens.
            if i == "【" or i == "】":
                continue
            m = 1  # 1-based position of char within the word
            for char in i:
                if i == "。":
                    # A full stop ends the sentence: add a blank line after it.
                    fo2.write(char + "\t" + "S" + "\n\n")
                elif len(i) == 1:
                    fo2.write(char + "\t" + "S" + "\n")
                elif m == 1:
                    fo2.write(char + "\t" + "B" + "\n")
                elif m == len(i):
                    fo2.write(char + "\t" + "E" + "\n")
                else:
                    fo2.write(char + "\t" + "M" + "\n")
                m = m + 1
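
One way to sanity-check converted output is to verify that each sentence's tag sequence is well formed. The checker below is a minimal sketch; the function name `is_valid_bmes` is an assumption, and it only inspects the tags, not whether the segmentation itself is correct.

```python
def is_valid_bmes(tags):
    """Check that a sentence's tag sequence forms complete words."""
    inside = False  # True while a multi-character word is still open
    for tag in tags:
        if tag in ("S", "B"):
            if inside:
                return False  # the previous word was never closed with E
            inside = tag == "B"
        elif tag in ("M", "E"):
            if not inside:
                return False  # M or E without a preceding B
            inside = tag == "M"
        else:
            return False  # unknown tag
    return not inside  # a trailing open word is invalid


print(is_valid_bmes(["S", "B", "M", "E"]))  # True
print(is_valid_bmes(["B", "S"]))            # False
```

Running such a check per sentence can catch truncated files or lines that lost their tab separator during conversion.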

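Building a vocabulary from the corpus (file "words"), and a vocabulary with word frequencies (file "words_for_training"), with punctuation marks filtered out. The code is as follows: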

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import string

from zhon.hanzi import punctuation as punc  # Chinese punctuation marks

epunc = string.punctuation  # ASCII punctuation marks

vocab = {}  # word -> frequency

# Count every word in the segmented corpus ("train").
with open("train", "r", encoding="UTF-8") as fo:
    for st in fo:
        for token in st.split(" "):
            i = token.rstrip()
            # Skip empty tokens (such as a bare newline) and single punctuation marks.
            if i == "" or (len(i) == 1 and (i in punc or i in epunc)):
                continue
            vocab[i] = vocab.get(i, 0) + 1

# Write the frequency list and the plain word list.
with open("words_for_training", "a", encoding="UTF-8") as fo2, \
        open("words", "a", encoding="UTF-8") as fo3:
    for i in vocab:
        fo2.write("NONE " + i + " " + str(vocab[i]) + "\n")
        fo3.write(i + "\n")
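
As an alternative sketch, the counting step can be written with `collections.Counter` from the standard library. Two things here are assumptions made so that the example runs without third-party packages: the function name `count_words`, and the small hand-picked `CN_PUNC` string, which stands in for `zhon.hanzi.punctuation` and is far from complete.

```python
import string
from collections import Counter

# Assumption: a small hand-picked set of Chinese punctuation marks,
# used here in place of zhon.hanzi.punctuation.
CN_PUNC = "，。！？；：、“”‘’（）《》【】"


def count_words(lines):
    """Count word frequencies in segmented lines, skipping single punctuation marks."""
    counts = Counter()
    for line in lines:
        # split() with no argument also drops empty tokens and the newline.
        for word in line.split():
            if len(word) == 1 and (word in CN_PUNC or word in string.punctuation):
                continue
            counts[word] += 1
    return counts


# Assumed sample input in segmented format.
sample = ["我 喜欢 苹果 ，", "我 喜欢 梨 。"]
for word, freq in count_words(sample).most_common():
    # Same line layout as the "words_for_training" file.
    print("NONE " + word + " " + str(freq))
```

`most_common()` lists words from most to least frequent, which can be convenient when inspecting the vocabulary by hand.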

Converting from segmented format to BMES format (with punctuation marks filtered out):

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import string

from zhon.hanzi import punctuation as punc  # Chinese punctuation marks

epunc = string.punctuation  # ASCII punctuation marks

# Convert a segmented corpus ("test") to tagged lines ("test2"), skipping punctuation.
with open("test", "r", encoding="UTF-8") as fo, \
        open("test2", "a", encoding="UTF-8") as fo2:
    for st in fo:
        for i in st.rstrip().split(" "):
            # Skip single punctuation marks.
            if len(i) == 1 and (i in punc or i in epunc):
                continue
            m = 1  # 1-based position of char within the word
            for char in i:
                if len(i) == 1:
                    tag = "S"
                elif m == 1:
                    tag = "B"
                elif m == len(i):
                    tag = "E"
                else:
                    tag = "M"
                # Output line: character, "<NONE>" placeholder, tag with "_NONE" suffix.
                fo2.write(char + "#" + "<NONE>" + "#" + tag + "_NONE\n")
                m = m + 1

Converting from segmented format to a raw corpus (convenient for testing):

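#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

# Convert a segmented corpus ("test") to a raw corpus ("test_raw")
# by removing the spaces between words.
with open("test", "r", encoding="UTF-8") as fo, \
        open("test_raw", "a", encoding="UTF-8") as fo2:
    for line in fo:
        # The last token keeps its newline, so line breaks are preserved.
        for i in line.split(" "):
            fo2.write(i)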
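
The same step can be sketched on a single string, which is handy for spot checks. The helper name `segmented_to_raw` and the sample line are assumptions for illustration. Note that this approach, like the script above, also removes any spaces that belonged to the original text (for example between English words).

```python
def segmented_to_raw(line):
    """Drop the spaces that separate words; keep everything else, including the newline."""
    return line.replace(" ", "")


# Assumed sample line in segmented format.
segmented = "我 喜欢 自然语言处理 。\n"
raw = segmented_to_raw(segmented)
print(raw, end="")

# The result matches joining the space-separated tokens, as the script does.
assert raw == "".join(segmented.split(" "))
```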