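Predicting Financial Market Movements from Daily News (Version 1)

The data comes from Reddit, an overseas website similar to Tieba-style forum sites in China.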

###  RedditNews.csv: two columns. The first column is the "date" and the second column is the "news headlines". All news items are ranked from top to bottom by how hot they are, so there are 25 lines for each date.

###    DJIA_table.csv: downloaded directly from Yahoo Finance; check out the web page for more info.

###    Combined_News_DJIA.csv: To make things easier for my students, I provide this combined dataset with 27 columns. The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".
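
To make the relationship between the per-headline file and the combined file concrete, here is a minimal sketch of how ranked headlines could be reshaped into "Top1", "Top2", ... columns. It uses a tiny made-up frame with 3 headlines per date instead of 25. The column names `Date` and `News` and the helper `pivot_headlines` are assumptions for illustration, not the dataset author's actual procedure, and the "Label" column is not reconstructed here.

```python
import pandas as pd

def pivot_headlines(news: pd.DataFrame) -> pd.DataFrame:
    """Reshape one-headline-per-row data into one row per date.

    Assumes `news` has columns "Date" and "News" (hypothetical names) and that
    rows are already ordered from hottest to least hot within each date.
    """
    ranked = news.copy()
    # Position within each date: 1 for the first row of that date, 2 for the next, ...
    ranked["rank"] = ranked.groupby("Date").cumcount() + 1
    wide = ranked.pivot(index="Date", columns="rank", values="News")
    wide.columns = [f"Top{r}" for r in wide.columns]
    return wide.reset_index()

# Invented toy data, for illustration only.
toy = pd.DataFrame({
    "Date": ["2016-07-01"] * 3 + ["2016-06-30"] * 3,
    "News": ["a1", "a2", "a3", "b1", "b2", "b3"],
})
wide = pivot_headlines(toy)
print(wide)
```

With the real file, the same idea would yield 25 headline columns per date, which matches the 27-column layout described above once "Date" and "Label" are counted.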


###### Load packages #####
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from datetime import date
import os

##### Load data ######
os.chdir(r'D:/.../..../利用每日新闻预测金融市场变化')
data = pd.read_csv('Combined_News_DJIA.csv')

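##### Combine the headlines #####
# Join the 25 headline columns of each row into one space-separated string.
# (Calling str() on the whole array would keep brackets and quotes in the text.)
data["combined_news"] = data.filter(regex=("Top.*")).apply(lambda x: ' '.join(str(v) for v in x.values), axis=1)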

######## Split into train/test sets (string comparison works because dates are in YYYY-MM-DD form)
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']


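############ Extract features #############
feature_extraction = TfidfVectorizer()
# Fit the vectorizer on the training text and transform it into TF-IDF features
X_train = feature_extraction.fit_transform(train["combined_news"].values)
# The test text is only transformed, using the vocabulary learned from the training set
X_test = feature_extraction.transform(test["combined_news"].values)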


y_train = train["Label"].values  # labels as a NumPy array
y_test = test["Label"].values

             
####### Train the model #############
clf = SVC(probability = True , kernel = 'rbf')
clf.fit(X_train,y_train)

# Predict class probabilities
predictions = clf.predict_proba(X_test)

# Evaluate with ROC-AUC, using the predicted probability of class 1
print('ROC-AUC yields ' + str(roc_auc_score(y_test, predictions[:, 1])))
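
The same steps can be tried without the CSV file. The sketch below runs the TF-IDF, SVC, and ROC-AUC steps on a handful of invented headlines and labels, so the resulting score says nothing about the real dataset. One deliberate difference from the script above: it scores with `decision_function` instead of `predict_proba`, since `roc_auc_score` only needs a ranking of the samples, and this avoids the extra internal cross-validation that `probability=True` triggers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Invented headlines and labels, for illustration only.
train_text = [
    "markets rally on strong jobs report",
    "stocks surge as earnings beat forecasts",
    "investors cheer rate cut and markets rally",
    "stocks plunge amid recession fears",
    "markets tumble as oil prices crash",
    "investors flee as bank crisis deepens",
]
y_train = [1, 1, 1, 0, 0, 0]
test_text = [
    "stocks rally after earnings surge",
    "markets crash amid crisis fears",
    "investors cheer strong report",
    "oil prices plunge as recession deepens",
]
y_test = [1, 0, 1, 0]

# Learn the vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
# Words unseen during fitting (e.g. "after") are simply ignored here.
X_test = vectorizer.transform(test_text)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# Signed distance to the decision boundary; larger means "more like class 1".
scores = clf.decision_function(X_test)
auc = roc_auc_score(y_test, scores)
print("ROC-AUC on toy data:", auc)
```

Fitting the vectorizer on the training text alone is what keeps `X_train` and `X_test` in the same feature space and keeps test-set vocabulary from leaking into training.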

