做情感分析还是需要结合情景和业务,之前直接用词典库效果太差,准备自建金融词典构建
语料库,呃呃呃,所有的词汇来源dict_myself
1.计算TF-IDF,然后排序,得到的词可能会有和情感词典中重复的
#coding=UTF-8
"""
author:susuxuer
function:构建金融领域词汇
参考文献:https://www.cnblogs.com/en-heng/p/5848553.html
"""
import jieba.posseg as pseg
import numpy as np
import pandas as pd
import jieba
import time
import csv
import sys
import glob
import os
from collections import Counter
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from collections import defaultdict
from gensim import corpora,models
#调用停用词
def loadPoorEnt(path2 = 'G:/project/sentimation_analysis/data/stopwords.csv'):
    """Load the stopword list, one word per line.

    Args:
        path2: Path to a UTF-8 text file with one stopword per line.

    Returns:
        list[str]: The stripped lines of the file. Blank lines are kept
        as empty strings, matching the original behaviour.
    """
    # Use a context manager so the file handle is always closed
    # (the original opened the file and never closed it).
    with open(path2, encoding='UTF-8') as csvfile:
        return [line.strip() for line in csvfile]
stop_words=loadPoorEnt()
#读取所有文件路径
def get_all_content():
    """Collect the paths of all earnings-call transcript CSV files.

    Returns:
        list[str]: Paths matching the data-directory glob pattern;
        empty if the directory does not exist.
    """
    # NOTE(review): data directory is hard-coded — consider making it a parameter.
    return glob.glob(r'D:/GFZQ/GFZQ/xuesu2018/xuesu/*.csv')
#获取文本信息
def get_wenben(path):
    """Read a CSV file and return its rows.

    The original returned a ``csv.reader`` bound to a file handle that was
    never closed; reading eagerly lets the handle be released while callers
    can still iterate the result the same way.

    Args:
        path: Path to a UTF-8 CSV file.

    Returns:
        list[list[str]]: Every row of the file as a list of strings.
    """
    with open(path, 'r', encoding='UTF-8') as csvfile:
        return list(csv.reader(csvfile))
# 进行句子的切分,选取v、a、d
def cut(data):
    """Segment text and keep only adverbs (d), adjectives (a) and verbs (v).

    Words found in the module-level ``stop_words`` list are discarded.

    Args:
        data: Raw text to segment with jieba POS tagging.

    Returns:
        list[list[str]]: A single-element list wrapping the kept words,
        matching the shape downstream code expects.
    """
    kept_flags = ('d', 'a', 'v')
    # Set lookup is O(1) per token vs the original O(n) list scan;
    # also avoids shadowing the builtin name `list`.
    stop_set = set(stop_words)
    words = [item.word
             for item in pseg.cut(data)
             if item.word not in stop_set and item.flag in kept_flags]
    return [words]
#每篇业绩说明会选取部分词汇