Converting a jieba-Segmented Corpus into TF-IDF Vectors

d_benhua

Category: Natural Language Processing (NLP). Tags: natural language processing, python, text classification, jieba segmentation, TF-IDF vectors

Article link: https://blog.csdn.net/d_benhua/article/details/110914269

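Part 2: Segmenting a Categorized Corpus with jieba

Reference for this post: https://blog.csdn.net/SA14023053/article/details/52083399
Reference for jieba: https://github.com/fxsjy/jieba

This post continues from the previous one, "Preprocessing Chinese Text".

It preprocesses the files of a categorized corpus, segments them into words, and removes stop words.
The Chinese corpus consists of C7-History001.txt, C7-History002.txt, and C7-History004.txt, taken from the C7-History category of test_corpus in the Fudan University Chinese corpus.
The stop-word list is a Chinese stop-word list.

Data files can be downloaded from: https://github.com/JackDani/Preprocessing_Chinese_Text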

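text_mining
- text_corpus_small directory: path of the original corpus; contains the corpus files.
- text_corpus_pos directory: path of the preprocessed corpus.
- text_corpus_segment directory: path of the segmented corpus.
- text_corpus_dropstopword directory: path of the corpus after stop-word removal.
- text_corpus_dict directory: path of the generated dictionary file.
- text_corpus_bow directory: path of the generated bow vector file.
- text_corpus_tfidf directory: path where the generated tfidf vectors are stored.
- Test directory: Python scripts.
- - corpus_pos.py: script that preprocesses the corpus.
- - corpus_segment.py: script that segments the corpus.
- - corpus_dropstopword.py: script that removes stop words from the corpus.
- - corpus_tfidf.py: script that converts the segmented corpus into tfidf vectors.
- stopword directory: path of the stop-word list.
- README.txt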

1. Keep only Chinese characters

Remove every character that is not Chinese.

# _*_ coding: utf-8 _*_
# Preprocessing script for the categorized corpus

# Input: the categorized corpus stored in the text_corpus_small directory
# Output: the preprocessed corpus, written to the text_corpus_pos directory



# The code below keeps only Chinese characters
import os

# Path of the categorized corpus
small_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_small"+"\\"

# Path of the preprocessed corpus
pos_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_pos"+"\\"

# The paths above are where the files are actually stored; adjust them for your machine

def is_chinese(uchar):  # Return True if uchar is a Chinese character
    # Range used here: the CJK Unified Ideographs block, U+4E00 to U+9FFF
    # (a narrower alternative is U+4E00 to U+9FA5)
    return u'\u4e00' <= uchar <= u'\u9fff'

def format_str(ustr):  # Remove every non-Chinese character from ustr
    ex_str = ''
    for i in ustr:
        if is_chinese(i):
            ex_str = ex_str + i
    return ex_str

file_list = os.listdir(small_path) # Names of all files under small_path
for file_path in file_list:  # Loop over every file
    file_name = small_path + file_path # Full path of the file
    # The corpus is GBK-encoded (GB2312 < GBK < GB18030), not UTF-8.
    # errors = 'ignore' skips bytes that cannot be decoded instead of raising.
    file_read = open(file_name,'r',encoding = 'gbk',errors = 'ignore')
    # Source of the fix: https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c
    # Python Unicode reference: https://docs.python.org/3/howto/unicode.html#the-unicode-type

    # The following variants raise an error:
    # file_read = open(file_name,'r',encoding = 'gbk')
    # file_read = open(file_name,'r')
    # Error: UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 23677: illegal multibyte sequence


    raw_corpus = file_read.read() # Read the whole unprocessed file as one string
    # format_str filters character by character, so the whole text can be passed at once
    pos_corpus = format_str(raw_corpus)

    # Directory of the preprocessed corpus
    pos_dir = pos_path
    if not os.path.exists(pos_dir):  # Create the directory if it does not exist
        os.makedirs(pos_dir)
    # Create or overwrite the output file; it has the same name as the input file.
    # No encoding is given, so the platform default encoding is used.
    file_write = open(pos_dir + file_path,'w')
    file_write.write(pos_corpus) # Write the preprocessed text
    file_read.close() # Close the input file
    file_write.close() # Close the output file
# end for


print("Preprocessing succeeded.")
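The character-by-character filter above can also be written with a regular expression. This is a minimal sketch rather than the script's own code: the name keep_chinese is hypothetical, and it assumes the same U+4E00 to U+9FFF range that is_chinese tests.

```python
import re

# Matches every character outside the CJK Unified Ideographs block
# (U+4E00 to U+9FFF), the same range tested by is_chinese() above.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")

def keep_chinese(text):
    """Return text with all non-Chinese characters removed (hypothetical helper)."""
    return NON_CHINESE.sub("", text)

# Letters, digits, and both ASCII and full-width punctuation are dropped.
print(keep_chinese("abc中文123，测试!"))  # prints 中文测试
```

On long files this avoids building the result one character at a time with repeated string concatenation.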

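Key points

1. The os module

The os module provides a rich set of functions for working with files and directories.

Reference links:
https://www.runoob.com/python/os-file-methods.html
http://kuanghy.github.io/python-os/
http://python.usyiyi.cn/python_278/library/os.html

2. os.listdir(path)

Returns a list of the names of the files and folders contained in the directory path.
The list does not include the special entries '.' and '..', even if they are present in the directory.
Supported on Unix and Windows.

Syntax

os.listdir(path)

Parameters

path – the directory to list.

Return value

A list of the names of the files and folders under the given path.

Reference link: https://www.runoob.com/python3/python3-os-listdir.html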

# e.g.
# -*- coding: UTF-8 -*-

import os

# Directory to list
path = "C:\\Users\\Wu\\Desktop\\Now Go it"
dirs = os.listdir(path)

# Print every file and folder name
for file in dirs:
    print(file)
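os.listdir returns bare names, not full paths, which is why the scripts in this post prepend the directory to each name. The sketch below shows the same idea with os.path.join and filters out subdirectories. The temporary directory and the file names are made up for the demonstration.

```python
import os
import tempfile

# Build a throwaway directory so the example does not depend on any real path.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "subdir"))
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write("demo")

    # listdir gives names in arbitrary order, so sort them for a stable result.
    names = sorted(os.listdir(root))
    # Join each name with the directory to get a usable path, and keep files only.
    files = [n for n in names if os.path.isfile(os.path.join(root, n))]

print(names)  # ['a.txt', 'b.txt', 'subdir']
print(files)  # ['a.txt', 'b.txt']
```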

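3. os.path

The os.path module is mainly used to work with path names and to query file attributes.

Commonly used function:

os.path.exists(path) # Returns True if path exists; returns False if it does not exist or is a broken symbolic link

Reference link: https://www.runoob.com/python3/python3-os-path.html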

# e.g.

import os

path = "C:\\Users\\Wu\\Desktop\\Now Go it\\my"  # A path that exists
print(os.path.exists(path))

path1 = "C:\\Users\\Wu\\Desktop\\Now Go it\\m1"  # A path that does not exist
print(os.path.exists(path1))
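os.path.exists does not say whether a path is a file or a directory. When that matters, os.path.isfile and os.path.isdir can be used alongside it. A small sketch, using a temporary directory and made-up names:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    file_path = os.path.join(root, "test.txt")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("demo")
    missing = os.path.join(root, "m1")   # never created

    checks = {
        "dir exists": os.path.exists(root),
        "dir isdir": os.path.isdir(root),
        "file exists": os.path.exists(file_path),
        "file isfile": os.path.isfile(file_path),
        "file isdir": os.path.isdir(file_path),
        "missing exists": os.path.exists(missing),
    }

for label, result in checks.items():
    print(label, result)
```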

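4. os.makedirs(path, mode = 0o777)

Creates directories (folders) recursively.
It works like mkdir(), but also creates every intermediate-level directory needed to contain the leaf directory.
If the leaf directory cannot be created or already exists, an OSError is raised; on Windows, Error 183 means the directory already exists.
If only the last level of path needs to be created, it behaves the same as mkdir().

Syntax

os.makedirs(path, mode = 0o777)

Parameters

path – the directory to create recursively; may be a relative or an absolute path.
mode – the permission mode.

Return value

This function has no return value.

Reference link: https://www.runoob.com/python/os-makedirs.html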

# e.g. example

#_*_ coding: UTF-8 _*_

import os

# Directories (folders on Windows) to create
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\my"
for i in range(0,4):
    os.makedirs(path + "\\" + str(i))  # Raises OSError if the directory already exists
print("Paths created.")
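Because os.makedirs raises when the directory already exists, the scripts in this post check os.path.exists first. In Python 3.2 and later, the exist_ok=True argument gives the same effect in one call. A minimal sketch using a temporary directory (the sub-path "my/0" is made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "my", "0")

    os.makedirs(target)                  # creates "my" and then "my/0"
    try:
        os.makedirs(target)              # second call: the directory already exists
        raised = False
    except FileExistsError:              # FileExistsError is a subclass of OSError
        raised = True

    os.makedirs(target, exist_ok=True)   # no error even though it exists
    created = os.path.isdir(target)

print(raised, created)  # True True
```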

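5. open()

Opens a file and returns a file object. If the file cannot be opened, an OSError is raised.
Note: a file opened with open() must be closed with close() when you are done with it.

Syntax

open(file, mode = 'r')
Full signature: open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Parameters

file: required; the file path (relative or absolute).
mode: optional; the mode in which the file is opened.
buffering: sets the buffering policy.
encoding: the text encoding; usually utf-8.
errors: how encoding and decoding errors are handled (for example 'strict' or 'ignore').
newline: controls how line endings are handled.
closefd: when file is a file descriptor rather than a path, whether the descriptor is closed together with the file object.
opener: a custom opener; its return value must be an open file descriptor.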
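The errors parameter is what lets the preprocessing script read a GBK file that contains a few bad bytes. The sketch below reproduces the situation on a made-up file: two valid GBK words with one stray byte between them. The byte 0xFF is my choice for the demonstration; the real corpus failed on a different byte.

```python
import os
import tempfile

# Two valid GBK-encoded words with one stray byte (0xFF) between them.
# 0xFF cannot start a GBK character, so strict decoding fails on it.
data = "历史".encode("gbk") + b"\xff" + "研究".encode("gbk")

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "sample.txt")
    with open(path, "wb") as f:
        f.write(data)

    try:
        with open(path, "r", encoding="gbk") as f:   # errors defaults to strict handling
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    with open(path, "r", encoding="gbk", errors="ignore") as f:
        text = f.read()                               # the undecodable byte is dropped

print(strict_failed, text)
```

Note that errors='ignore' silently discards data, so it is best reserved for cases like this one, where the later steps throw away non-Chinese characters anyway.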

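6. read()

Reads up to the given amount of data from the file (characters in text mode, bytes in binary mode). If size is omitted or negative, the whole file is read.

Syntax

fileobject.read([size])

Parameters

size – optional; the amount to read from the file. The default is -1, which reads the whole file.

Return value

The data read from the file.

7. write()

Writes the given string to the file.
Until the file is closed or the buffer is flushed, the string is held in the buffer, and the written content is not yet visible in the file.
If the file was opened in a mode containing b, the str argument must first be converted to bytes with encode; otherwise the call fails with TypeError: a bytes-like object is required, not 'str'.

Syntax

fileobject.write(str)

Parameters

str – the string to write to the file.

Return value

The number of characters written.

8. close()

Closes an open file.
After opening a file and finishing with it, always close it.

Syntax

fileobject.close()

Parameters

None.

Return value

None.

9. readlines()

Reads all lines (until the end of file, EOF) and returns them as a list.

Syntax

fileobject.readlines()

Parameters

None required.

Return value

A list containing all the lines.

References:
https://www.runoob.com/python3/python3-file-methods.html
https://blog.csdn.net/weixin_39643135/article/details/91348983
https://blog.csdn.net/weixin_40449300/article/details/79143971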

# e.g. example
# A .doc file is a binary format, so reading it as UTF-8 text usually fails with UnicodeDecodeError
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\1.doc"
f = open(path,'r',encoding = 'utf-8')
f_read = f.read()
print(f_read)
f.close()

# e.g. example
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\1.doc"
f = open(path,'rb') # Open the .doc file in binary mode; this prints raw bytes, not readable text
f_read = f.read()
print(f_read)
f.close()

!pip install python-docx # Install the python-docx package (notebook shell command)
# %pip install python-docx # Install the package into the current kernel

# e.g. example
import docx
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\my\\test.docx"
f = docx.Document(path)
for item in f.paragraphs:
    print(item.text)
# This approach prints the text of the .docx file successfully

# e.g. example
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\my\\test.txt"
f = open(path,'r',encoding = 'utf-8') # Open the .txt file in text mode
f_read = f.read()
print(f_read)
f.close()
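The open, write, readlines, and close calls described above are often combined with a with statement, which closes the file automatically. The following sketch is an alternative style, not what the scripts in this post do; the file name and contents are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "test.txt")

    # "with" closes the file on exit, even if an exception is raised inside.
    with open(path, "w", encoding="utf-8") as f:
        written = f.write("历史\n研究\n")     # returns the number of characters written

    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()                 # each line keeps its trailing "\n"

    closed = f.closed                         # True: the with block already closed it

# Strip the line breaks, as the later scripts do with replace('\n', '').
stripped = [line.rstrip("\n") for line in lines]
print(written, lines, stripped, closed)
```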

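2. Segment with jieba

# _*_ coding: utf-8 _*_
# Segmentation script for the categorized corpus
# Input: the preprocessed files stored in the text_corpus_pos directory
# Output: the segmented files, written to the text_corpus_segment directory


import os
import jieba


# Path of the preprocessed corpus
corpus_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_pos"+"\\"

# Path of the segmented corpus
seg_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_segment"+"\\"

file_list = os.listdir(corpus_path) # Names of all files under corpus_path
for file_path in file_list: # Loop over every file
    #print("Processing " + file_path)
    file_name = corpus_path + file_path # Full path of the file
    file_read = open(file_name,'rb') # Open in binary mode; jieba.cut also accepts bytes and decodes them itself

    raw_corpus = file_read.read() # Read the text to be segmented
    seg_corpus = jieba.cut(raw_corpus) # Segment with jieba; the result is a generator of words

    # Directory of the segmented corpus
    seg_dir = seg_path
    if not os.path.exists(seg_dir):  # Create the directory if it does not exist
        os.makedirs(seg_dir)
    file_write = open(seg_dir + file_path,'w')  # Create the output file; it has the same name as the input file
    file_write.write("\n".join(seg_corpus)) # Write one word per line

    file_read.close() # Close the input file
    file_write.close() # Close the output file

print("Chinese corpus segmented successfully.")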

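Key points

1. join()

Joins the elements of a sequence with the given separator string to produce a new string.

Syntax

str.join(sequence)

Parameters

sequence – the sequence of elements to join.

Return value

The new joined string.

Reference links:
https://www.runoob.com/python3/python3-string-join.html
https://www.runoob.com/python3/python3-string.html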

# e.g. example

s1 = "-"
s2 = ""
seq = ("r", "u", "n", "o", "o", "b") # A sequence of strings
print (s1.join( seq ))
print (s2.join( seq ))
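The segmentation script uses "\n".join to store one word per line, and the later scripts turn those lines back into a word list. The round trip can be sketched as follows; the three words are sample data, and splitlines is used here in place of the readlines plus replace combination used in the scripts.

```python
tokens = ["历史学", "中国", "古老"]

# One token per line, the same layout the segmentation script writes to disk.
serialized = "\n".join(tokens)

# splitlines() reverses the join and drops the line breaks.
restored = serialized.splitlines()

print(repr(serialized))
print(restored == tokens)  # True
```

The round trip is only lossless when no token contains a line break, which holds here because preprocessing kept nothing but Chinese characters.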

3. Remove stop words

# _*_ coding: utf-8 _*_
# Remove stop words

import os,pprint

# Path of the segmented corpus
seg_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_segment"+"\\"

# Path of the corpus after stop-word removal
dropstopword_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_dropstopword"+"\\"

# Path of the stop-word file
stopword_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\stopword\\中文停用词库.txt"

# Load the local stop-word list
fi=open(stopword_path,'r', encoding='UTF-8')
txt = fi.readlines()  # Read the whole file as a list of lines
fi.close()  # Close the stop-word file once it has been read
stopwords=[]
for w in txt:
    w = w.replace('\n','')  # Strip the line break
    stopwords.append(w)   # Build the stop-word list


# Remove the stop words from a list of words
def drop_stopwords(contents, stopwords):
    contents_clean = []
    for word in contents:
        if word in stopwords:
            continue
        contents_clean.append(word)
    return contents_clean


# Process the files
file_list = os.listdir(seg_path) # Names of all files under seg_path
for file_path in file_list: # Loop over every file
    file_name = seg_path + file_path # Full path of the file
    file_read = open(file_name,'r') # Open the file

    # Build the list of segmented words (same approach as for the stop-word list)
    txt_corpus = file_read.readlines() # Read the segmented file line by line
    raw_corpus = []
    for s in txt_corpus:
        s = s.replace('\n','')
        raw_corpus.append(s)
    # pprint.pprint(raw_corpus)
    drop_corpus = drop_stopwords(raw_corpus, stopwords) # Remove the stop words

    # Directory of the corpus after stop-word removal
    drop_dir = dropstopword_path
    if not os.path.exists(drop_dir): # Create the directory if it does not exist
        os.makedirs(drop_dir)

    file_write = open(drop_dir + file_path,'w')  # Create or overwrite the output file
    #file_write.write(str(drop_corpus)) # Alternative: write the list's string representation
    file_write.write("\n".join(drop_corpus)) # Write one word per line
    file_read.close() # Close the input file
    file_write.close() # Close the output file

print("Stop words removed successfully.")
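drop_stopwords tests each word against a list, which scans the list on every lookup. For a large stop-word file, a set makes the membership test much cheaper. The sketch below is a possible variant, not the script's code: drop_stopwords_fast is a hypothetical name and the sample data is made up.

```python
def drop_stopwords_fast(tokens, stopwords):
    """Same filtering as drop_stopwords, using a set for the membership test."""
    stopword_set = set(stopwords)          # built once; lookups are O(1) on average
    return [t for t in tokens if t not in stopword_set]

# Hypothetical sample data, not taken from the real stop-word file.
sample_tokens = ["历史", "的", "研究", "是", "古老", "的"]
sample_stopwords = ["的", "是", "了"]

print(drop_stopwords_fast(sample_tokens, sample_stopwords))  # ['历史', '研究', '古老']
```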

Key points

See the "File chinese text preprocessing" notes in the Jupyter notebook.

4. Convert to TF-IDF vectors

Convert the segmented text files into vectors.

# Convert to vectors
# _*_ coding: UTF-8 _*_

import os,pprint

# Path of the corpus after stop-word removal
dropstopword_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_dropstopword"+"\\"

# Path where the bow vectors are stored
bow_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_bow"+"\\"

# Path where the tfidf vectors are stored
tfidf_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_tfidf"+"\\"

# Path where the dictionary (word: ID) is stored
dict_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_dict"+"\\"

from gensim import corpora
from gensim import models

# Read the files into the form [[...],[...],[...],...] to obtain the corpus
original_corpus = []
file_list = os.listdir(dropstopword_path) # Names of all files under the dropstopword directory
for file_path in file_list: # Loop over every file
    file_name = dropstopword_path + file_path  # Full path of the file
    file_read = open(file_name, 'r')  # Open the file

    # Turn each file into a list of words
    str_corpus = file_read.readlines()
    text_corpus = []
    for s in str_corpus:
        s = s.replace('\n','')
        text_corpus.append(s)
    original_corpus.append(text_corpus)

    file_read.close() # Close the input file
# end for
#pprint.pprint(original_corpus)


dictionary = corpora.Dictionary(original_corpus)
bow_vec = [dictionary.doc2bow(text) for text in original_corpus]
#pprint.pprint(dictionary.token2id)


# Store the dictionary (word: ID) in a file
dict_dir = dict_path # Directory where the dictionary is stored
if not os.path.exists(dict_dir):  # Create the directory if it does not exist
    os.makedirs(dict_dir)
file_write_dict = open(dict_dir + "dict", 'w') # Create or overwrite the dictionary file
file_write_dict.write(str(dictionary.token2id))  # Write the word-to-ID mapping
file_write_dict.close() # Close the dictionary file

# Store bow_vec in a file
bowvec_dir = bow_path # Directory where the bow vectors are stored
if not os.path.exists(bowvec_dir):  # Create the directory if it does not exist
    os.makedirs(bowvec_dir)
file_write_bow = open(bowvec_dir + "bow_vec", 'w') # Create or overwrite the bow vector file
file_write_bow.write(str(bow_vec))  # Write the bow vectors
file_write_bow.close() # Close the bow vector file


tfidf = models.TfidfModel(bow_vec)  # Train the tfidf model on the bow vectors

# Convert each file into a tfidf vector
file_list = os.listdir(dropstopword_path) # Names of all files under the dropstopword directory
for file_path in file_list: # Loop over every file
    file_name = dropstopword_path + file_path  # Full path of the file
    file_read = open(file_name, 'r')  # Open the file

    # Turn the file into a list of words
    str_corpus = file_read.readlines()
    text_corpus = []
    for s in str_corpus:
        s = s.replace('\n','')
        text_corpus.append(s)
    #pprint.pprint(text_corpus)
    train_list = dictionary.doc2bow(text_corpus)  # bow vector of this document
    tfidf_vec = tfidf[train_list]  # tfidf vector of this document
    #pprint.pprint(tfidf_vec)


    # Directory where the tfidf vectors are stored
    tfidf_dir = tfidf_path
    if not os.path.exists(tfidf_dir):  # Create the directory if it does not exist
        os.makedirs(tfidf_dir)

    file_write_tfidf = open(tfidf_dir + file_path, 'w')  # Create or overwrite the tfidf vector file
    file_write_tfidf.write(str(tfidf_vec))  # Write the tfidf vector

    file_read.close() # Close the input file
    file_write_tfidf.close() # Close the tfidf vector file
# end for

print("\ntfidf vectors generated successfully!")

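5. Summary: converting a test corpus into tfidf vectors in the dictionary space of the training corpus

1. Obtain the segmented corpus. Segment the training corpus train_corpus and remove stop words to get the segmented corpus participle_corpus.
2. Obtain the dictionary. Build the dictionary with gensim.corpora.Dictionary(participle_corpus) from the segmented corpus. This defines the vector space: the number of tokens in the dictionary is the dimension of the space. (Use dictionary.token2id to inspect the one-to-one mapping between words and IDs.)
3. Obtain the bow vectors. Use dictionary.doc2bow on each document of the segmented corpus to convert it into a bag-of-words vector, bow_vector. (Each document passed to doc2bow is a one-level list of words.)
4. Train the tfidf model. Train the model with gensim.models.TfidfModel(bow_vector) on the bow vectors. (Here bow_vector is a two-level list: one bow vector per document.)
5. Obtain the tfidf vectors. Using the same dictionary, convert the test corpus test_corpus into bow vectors (note: these are the bow vectors of the test corpus), then apply the trained model, tfidf[bow_vector], to convert each bow vector into a tfidf vector, tfidf_vector. (Here bow_vector is the bow vector of a single document.)
6. Process the vectors further. Run other algorithms on the tfidf vectors, such as document similarity, or pass them to algorithms from the sklearn library.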
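The steps above can be sketched in plain Python, without gensim, to show what each stage produces. This is an illustration under stated assumptions, not gensim's implementation: the function names are hypothetical, the toy corpus is made up, IDs are assigned in first-seen order (gensim may number them differently), and the weighting (count times log2(N/df), then scaling to unit length) reflects my understanding of TfidfModel's default settings.

```python
import math
from collections import Counter

def build_dictionary(docs):
    """Map each distinct token to an integer id (assigned in first-seen order)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Sparse bag-of-words: sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

def fit_idf(bow_corpus):
    """idf = log2(N / df), where df counts the training documents containing the term."""
    n_docs = len(bow_corpus)
    df = Counter(token_id for bow in bow_corpus for token_id, _ in bow)
    return {token_id: math.log2(n_docs / freq) for token_id, freq in df.items()}

def tfidf_vector(bow, idf):
    """Weight each count by idf, drop zero weights, then scale to unit L2 length."""
    weighted = [(i, tf * idf[i]) for i, tf in bow if idf.get(i, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(i, w / norm) for i, w in weighted] if norm else []

# Steps 1-4: toy training corpus (already segmented), dictionary, bow vectors, idf.
train_corpus = [["历史", "中国", "古老"], ["历史", "危机"], ["中国", "危机", "危机"]]
token2id = build_dictionary(train_corpus)
idf = fit_idf([doc2bow(doc, token2id) for doc in train_corpus])

# Step 5: a test document is mapped into the training dictionary, then weighted.
test_doc = ["古老", "中国", "风衣"]          # "风衣" is not in the training dictionary
test_vec = tfidf_vector(doc2bow(test_doc, token2id), idf)
print(test_vec)
```

The word that is missing from the training dictionary simply disappears from the test vector, which is the practical meaning of "in the dictionary space of the training corpus".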

Further notes

# Example: use the gensim.downloader module as an API to load a dataset

import gensim.downloader as api
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

dataset = api.load("text8")  # downloads the dataset on first use
dct = Dictionary(dataset)  # fit dictionary
corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format

model = TfidfModel(corpus)  # fit model
vector = model[corpus[0]]  # apply model to the first corpus document
print(vector)

# Load a saved model when it is needed.
# This assumes the model was saved earlier with tfidf.save("my_model.tfidf"),
# and that models, pprint, and dictionary come from the script in section 4.
tfidf = models.TfidfModel.load("my_model.tfidf")

words = "历史学 中国 古老 二十世纪 危机 王者 风衣".lower().split()
pprint.pprint(tfidf[dictionary.doc2bow(words)])
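The tfidf vectors printed above are lists of (id, weight) pairs. One common follow-up, mentioned in the summary as document similarity, is cosine similarity between two such vectors. The sketch below is a plain-Python illustration with made-up vectors; cosine_similarity is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) lists."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                      # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# Made-up vectors in the same sparse format as the tfidf output.
doc1 = [(0, 0.6), (2, 0.8)]
doc2 = [(0, 0.6), (2, 0.8)]
doc3 = [(1, 1.0)]

print(cosine_similarity(doc1, doc2))   # identical vectors: close to 1
print(cosine_similarity(doc1, doc3))   # no shared ids: 0.0
```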
