The data comes from an overseas website, similar to the Tieba forums in China.
### RedditNews.csv: two columns. The first column is the date and the second is the news headline. Headlines are ranked from top to bottom by how hot they are, so there are 25 lines for each date.
### DJIA_table.csv: Downloaded directly from Yahoo Finance: check out the web page for more info.
### Combined_News_DJIA.csv: To make things easier for my students, I provide this combined dataset with 27 columns. The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".
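
How the long RedditNews.csv layout relates to the wide combined layout can be sketched as follows. This is a minimal, dependency-free illustration on made-up rows, not the procedure actually used to build Combined_News_DJIA.csv. The function name `to_wide` and the sample headlines are hypothetical, and the "Label" column is left out.

```python
from collections import defaultdict


def to_wide(rows, top_n=25):
    """Group (date, headline) pairs by date into Top1..TopN columns.

    Headlines keep the order in which they appear for each date,
    which stands in for the "hotness" ranking described above.
    """
    by_date = defaultdict(list)
    for date, headline in rows:
        by_date[date].append(headline)

    wide = []
    for date in sorted(by_date):  # ISO dates sort correctly as strings
        record = {"Date": date}
        for rank, headline in enumerate(by_date[date][:top_n], start=1):
            record[f"Top{rank}"] = headline
        wide.append(record)
    return wide


# Hypothetical sample rows, already ordered by rank within each date.
sample_rows = [
    ("2016-07-01", "headline A"),
    ("2016-07-01", "headline B"),
    ("2016-06-30", "headline C"),
]
wide_rows = to_wide(sample_rows, top_n=2)
print(wide_rows)
```

With the real file, `top_n` would stay at 25, giving one record per date with the keys "Date" and "Top1" through "Top25".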
###### Load packages #####
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from datetime import date
import os
##### Load data ######
os.chdir(r'D:/.../..../利用每日新闻预测金融市场变化')
data = pd.read_csv('Combined_News_DJIA.csv')
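##### Merge the headlines #####
# Join the Top1..Top25 headlines of each row into one space-separated string.
# (''.join(str(x.values)) would keep the array repr, brackets and quotes included.)
data["combined_news"] = data.filter(regex = ("Top.*")).apply(lambda x: ' '.join(str(v) for v in x.values),axis = 1)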
######## Split into training/test sets
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
############ Extract features #############
feature_extraction = TfidfVectorizer()
X_train = feature_extraction.fit_transform(train["combined_news"].values)
# fit learns the vocabulary and IDF weights from the training text; transform turns the text into a TF-IDF matrix
X_test = feature_extraction.transform(test["combined_news"].values)
y_train = train["Label"].values  # labels as a NumPy array
y_test = test["Label"].values
####### Train the model #############
clf = SVC(probability = True , kernel = 'rbf')
clf.fit(X_train,y_train)
# Predict class probabilities for the test set
predictions = clf.predict_proba(X_test)
# Evaluate with ROC-AUC, using the predicted probability of class 1
print('ROC-AUC yields ' + str(roc_auc_score(y_test,predictions[:,1])))
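
For readers unfamiliar with the metric printed above, ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that pairwise definition in plain Python on made-up labels and scores. `pairwise_auc` is a hypothetical helper for illustration only, not part of scikit-learn, and it is quadratic in the number of samples, so `roc_auc_score` remains the right tool for real data.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts 1 when the positive scores higher, 0.5 on a tie, 0 otherwise.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, standing in for y_test and predictions[:, 1].
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(toy_labels, toy_scores)
print("pairwise AUC:", auc)
```

A value of 0.5 corresponds to ranking no better than chance, which is a useful baseline when judging the score of the SVC above.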