Keywords: tushare data retrieval, support vector machine, natural language processing, finance, financial markets, sentiment analysis, word frequency statistics
1. Import the required libraries
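The steps below rely on three libraries: tushare for pulling the news flashes, jieba for Chinese word segmentation, and scikit-learn's CountVectorizer for the bag-of-words counts. A minimal sketch of the imports (they are also repeated in the individual steps below); install anything missing with pip install tushare jieba scikit-learn.
# Data source: tushare pro API (needs a personal token)
import tushare as ts
# Chinese word segmentation
import jieba
# Bag-of-words feature extraction
from sklearn.feature_extraction.text import CountVectorizer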
2. Use tushare to fetch news flash text
# Import tushare
import tushare as ts
# Initialize the pro API with your personal token
pro = ts.pro_api('your token here')
# Pull Jinse news flashes for the given date range
news = pro.jinse(**{
    "start_date": "2022-06-01",
    "end_date": "2022-06-02",
    "limit": "",
    "offset": ""
}, fields=[
    "title",
    "content",
    "type",
    "url",
    "datetime"
])
news.head()
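As an alternative to passing the token directly to pro_api on every call, tushare also supports registering it once with ts.set_token; a small sketch, with the token string as a placeholder:
import tushare as ts
# Register the token once; tushare caches it locally
ts.set_token('your token here')
pro = ts.pro_api()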
3. Text cleaning and word segmentation
# Take one news item's content as an example (column 1 is 'content')
message = news.iloc[7][1]
# Optionally strip whitespace and other useless characters first
# message = ''.join(message.split())
# Segment the text and remove stopwords
# (encoding='UTF-8' is recommended when reading Chinese files)
import jieba
stopwords = [line.strip() for line in open(r"stopwords.txt", encoding='UTF-8').readlines()]
# jieba.cut yields tokens; keep those not in the stopword list and join with spaces
words = ' '.join(w for w in jieba.cut(message) if w not in stopwords)
words
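Since the keywords above mention word frequency statistics, a quick sketch of counting token frequencies on the cleaned text with collections.Counter (an addition for illustration, not part of the original pipeline):
from collections import Counter
# Count how often each token occurs in the cleaned, space-separated text
freq = Counter(words.split())
print(freq.most_common(10))  # the ten most frequent tokens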
4. CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Write the space-separated tokens to a file; each line of the file
# is treated as one document by the vectorizer
with open('message.txt', 'w', encoding='UTF-8') as f:
    f.write(words)
vect = CountVectorizer()
with open('message.txt', encoding='UTF-8') as f:
    vect.fit(f)
with open('message.txt', encoding='UTF-8') as f:
    vectors = vect.transform(f)
print(vectors.toarray())
vect.vocabulary_
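To read the counts back as a term-to-frequency mapping, the learned vocabulary can be paired with the single row of the count matrix; a small sketch (get_feature_names_out requires scikit-learn >= 1.0, older versions use get_feature_names):
# Pair each vocabulary term with its count in the one-document matrix
counts = dict(zip(vect.get_feature_names_out(), vectors.toarray()[0]))
print(sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10])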
Study notes