[Deep-learning text sentiment classification in Python (2)]: Data preparation and Jieba word segmentation


UCAS菌皓


Article link: https://blog.csdn.net/qq_41831350/article/details/87346298


Libraries used: xlrd, jieba

What this step does

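This step performs word segmentation with jieba. First, though, the text column of the xlsx file that holds the raw data is extracted into a txt file, which is easier to process.
The code follows.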

Converting xlsx to txt

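# encoding=utf-8

##################################################################
###  Training word2vec requires a corpus.                      ###
###  Dump the text cells of the Excel sheet into a txt file.   ###
##################################################################

# Run once; after the conversion is done this code is not executed again.
# Note: xlrd 2.0 and later read only .xls files, so opening .xlsx needs xlrd < 2.0.

import xlrd

fname = "classfied_data.xlsx"
excelbook = xlrd.open_workbook(r'E:\python\Deep_Text_Classfication\data\classfied_data.xlsx')

def getSheet(sh_index):
    try:
        return excelbook.sheet_by_index(sh_index)
    except IndexError:
        # Report the missing sheet, then re-raise so the script stops here
        print('no sheet {} in {}'.format(sh_index, fname))
        raise

# Load the first sheet of the workbook
sh1 = getSheet(0)

# Read the cell at row 5, column 1 (indices are zero-based)
cell_value = sh1.cell_value(5, 1)

# Copy cells (1, 1) through (rows-1, 1) into the text file; row 0 is skipped
rows = sh1.nrows
# Open the output file; "a+" appends, so rerunning the script duplicates the text
with open(r"E:\python\Deep_Text_Classfication\script\f.txt", "a+", encoding="utf-8") as f:
    for i in range(1, rows):
        cv = sh1.cell_value(i, 1)
        f.write(str(cv))    # str() guards against numeric cells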
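The loop above writes every cell back to back, so the boundary between two texts is lost in f.txt. If one text per line is preferred, a minimal sketch could look like the following. The helper name `write_cells`, the sample data and the temporary output path are assumptions made for illustration; the list stands in for the values of column 1 of the sheet.

```python
import os
import tempfile


def write_cells(cells, out_path):
    """Write each non-empty cell on its own line; return the number of lines written."""
    written = 0
    # "w" truncates the file, so rerunning does not duplicate the corpus
    with open(out_path, "w", encoding="utf-8") as f:
        for cell in cells:
            text = str(cell).strip()   # cells may hold numbers, so coerce to str
            if not text:               # skip empty cells
                continue
            f.write(text + "\n")
            written += 1
    return written


# Hypothetical stand-in for the cell values of column 1, rows 1 onward
sample_cells = ["这部电影很好看", "", "服务太差了", 5.0]
demo_path = os.path.join(tempfile.mkdtemp(), "f.txt")
line_count = write_cells(sample_cells, demo_path)
```

With a real workbook, the list could come from `sh1.col_values(1, start_rowx=1)` instead of the hard-coded sample.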

Jieba word segmentation

Related link: jieba word segmentation

# encoding=utf-8
import os

import jieba

# Punctuation marks to strip from the segmented text
PUNCTUATION = '，。？！“”：…（）—《》、‘’'

def cut_txt(old_file):
    # Load the user dictionaries; check these paths before rerunning
    jieba.load_userdict(r'E:\python\Deep_Text_Classfication\data\word_list\mydict.txt')
    jieba.load_userdict(r'E:\python\Deep_Text_Classfication\data\word_list\negative.txt')
    jieba.load_userdict(r'E:\python\Deep_Text_Classfication\data\word_list\positive.txt')

    cut_file = old_file + '_cut.txt'    # name of the file that stores the segmented text

    try:
        with open(old_file, 'r', encoding='utf-8') as fi:
            text = fi.read()    # read the whole text
    except (OSError, UnicodeError) as e:
        print(type(e).__name__, ":", e)    # show the error type and details
        raise

    new_text = jieba.cut(text, cut_all=False)    # accurate mode
    # Join the tokens with spaces, then delete the punctuation marks
    str_out = ' '.join(new_text).translate(str.maketrans('', '', PUNCTUATION))
    with open(cut_file, 'w', encoding='utf-8') as fo:
        fo.write(str_out)
    return cut_file

old_file = r'E:\python\Deep_Text_Classfication\script\f.txt'    # the file must first be saved as UTF-8
cut_file = old_file + '_cut.txt'
if not os.path.exists(cut_file):    # only segment when the output does not exist yet
    print("here i am")
    cut_txt(old_file)
else:
    print('Segmentation is already done; no need to run it again')

The result is a segmented txt file, saved in the script directory.
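One side effect of deleting punctuation after the tokens have been joined is that every removed mark leaves two spaces next to each other. A possible alternative, sketched here under assumptions, is to filter the tokens before joining them. The helper `clean_tokens` and the sample token list are hypothetical, and unlike the script above this version drops only tokens that consist entirely of punctuation.

```python
# The same punctuation marks that the script above removes
PUNCTUATION = set('，。？！“”：…（）—《》、‘’')


def clean_tokens(tokens):
    """Drop whitespace-only tokens and tokens made up entirely of punctuation."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if tok and not all(ch in PUNCTUATION for ch in tok):
            kept.append(tok)
    return kept


# Hypothetical tokens, shaped like what jieba.cut yields for a short sentence
sample_tokens = ['这', '电影', '，', '真', '好看', '！', '……']
cleaned = ' '.join(clean_tokens(sample_tokens))
```

In the real pipeline the argument would be the generator returned by `jieba.cut(text, cut_all=False)` rather than a hard-coded list.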