Kaggle Competitions: Getting-Started Notes

1. Bike Sharing Demand

kaggle: https://www.kaggle.com/c/bike-sharing-demand

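Goal: predict the number of bike rentals from features such as date, time, weather, and temperature

Preprocessing: 1. From the date (a full timestamp with year, month, day, hour, minute, and second), extract the year, month, day of the week, and hour

           2. season and weather are categorical codes, so encode them as dummy (one-hot) variables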
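The two preprocessing steps can be sketched on a few made-up rows. This is only an illustration: the `toy` frame, its values, and the helper name `build_time_and_category_features` are assumptions, not part of the original script; only the column names `datetime`, `season`, and `weather` come from the competition data.

```python
import pandas as pd

# Toy rows shaped like the competition's train.csv (values are made up).
toy = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00", "2012-12-19 08:00:00"],
    "season": [1, 3, 4],
    "weather": [1, 2, 3],
})


def build_time_and_category_features(df):
    """Return year/month/weekday/hour columns plus one-hot season and weather."""
    ts = pd.to_datetime(df["datetime"])
    time_parts = pd.DataFrame({
        "year": ts.dt.year,
        "month": ts.dt.month,
        "weekday": ts.dt.dayofweek,  # Monday=0 ... Sunday=6
        "hour": ts.dt.hour,
    }, index=df.index)
    # Prefixes keep the dummy column names unique and readable.
    season = pd.get_dummies(df["season"], prefix="season")
    weather = pd.get_dummies(df["weather"], prefix="weather")
    return pd.concat([time_parts, season, weather], axis=1)


toy_features = build_time_and_category_features(toy)
print(toy_features.columns.tolist())
```

Note that dummy encoding only creates columns for the categories present in the frame it is given, so the training and test frames should be checked for matching columns before fitting.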

Model selection:

Regression problem: 1. RandomForestRegressor

                  2. GradientBoostingRegressor

 

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Candidate feature columns (kept for reference; the features are assembled by hand below)
selected_features = ['datetime', 'season', 'holiday',
                'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed']

#X_train = train[selected_features]
Y_train = train["count"]  # regression target: rental count
result = test["datetime"]

# Feature engineering
month = pd.DatetimeIndex(train.datetime).month
day = pd.DatetimeIndex(train.datetime).dayofweek
hour = pd.DatetimeIndex(train.datetime).hour
# Prefixes keep the dummy column names unique and string-typed
season = pd.get_dummies(train.season, prefix='season')
weather = pd.get_dummies(train.weather, prefix='weather')

X_train = pd.concat([season, weather], axis=1)
X_test = pd.concat([pd.get_dummies(test.season, prefix='season'),
                    pd.get_dummies(test.weather, prefix='weather')], axis=1)
X_train['month'] = month
X_test['month'] = pd.DatetimeIndex(test.datetime).month
X_train['day'] = day
X_test['day'] = pd.DatetimeIndex(test.datetime).dayofweek
X_train['hour'] = hour
X_test['hour'] = pd.DatetimeIndex(test.datetime).hour
X_train['holiday'] = train['holiday']
X_test['holiday'] = test['holiday']
X_train['workingday'] = train['workingday']
X_test['workingday'] = test['workingday']
X_train['temp'] = train['temp']
X_test['temp'] = test['temp']
X_train['humidity'] = train['humidity']
X_test['humidity'] = test['humidity']
X_train['windspeed'] = train['windspeed']
X_test['windspeed'] = test['windspeed']


from sklearn.ensemble import GradientBoostingRegressor
clf = GradientBoostingRegressor(n_estimators=200, max_depth=3)
# Fit on log1p(count) so that the expm1 below maps predictions back to counts
clf.fit(X_train, np.log1p(Y_train))
result = clf.predict(X_test)
result = np.expm1(result)

df = pd.DataFrame({'datetime': test['datetime'], 'count': result})
df.to_csv('results1.csv', index=False, columns=['datetime', 'count'])

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_train, Y_train)

# Truncate predictions to whole rentals
y_predict = rfr.predict(X_test).astype(int)

df = pd.DataFrame({'datetime': test.datetime, 'count': y_predict})
df.to_csv('result2.csv', index=False, columns=['datetime', 'count'])
#predictions_file = open("RandomForestRegssor.csv", "wb")
#open_file_object = csv.writer(predictions_file)
#open_file_object.writerow(["datetime", "count"])
#open_file_object.writerows(zip(res_time, y_predict))
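Before submitting, it can help to score a model on a held-out slice of the training data. The competition is evaluated with RMSLE (root mean squared logarithmic error), which is also why the script's `np.expm1` call only makes sense when the model is fit on a `np.log1p` target. The sketch below is an illustration under assumptions: the data are synthetic stand-ins for the real features, and the `rmsle` helper and the 80/20 split are choices made here, not part of the original script.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    diff = np.log1p(np.asarray(y_pred)) - np.log1p(np.asarray(y_true))
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic stand-in for two engineered features (hour, temp) and the rental count.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500)
temp = rng.uniform(0, 35, size=500)
counts = np.round(20 + 10 * np.sin(hour / 24 * 2 * np.pi) ** 2 * temp
                  + rng.uniform(0, 5, size=500))
X = np.column_stack([hour, temp])

X_fit, X_val, y_fit, y_val = train_test_split(X, counts, test_size=0.2, random_state=0)

# Train in log space, then map predictions back to counts before scoring.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X_fit, np.log1p(y_fit))
val_pred = np.expm1(model.predict(X_val))

val_score = rmsle(y_val, val_pred)
print(f"validation RMSLE: {val_score:.3f}")
```

With the real data, `X` and `counts` would be replaced by `X_train` and `Y_train` from the script; a time-ordered split may be a better match for how the test set is constructed than a random one.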

 

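2. Daily News for Stock Market Prediction

Using historical data, namely each day's 25 most-clicked news headlines together with whether the stock market rose or fell that day, predict future market movement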

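Approaches:

     1. Merge the 25 headlines into one document, preprocess each word (strip special characters, drop words containing digits, remove stop words, lowercase, and stem), then extract features with TF-IDF and train an SVM

     2. Extract features with word2vec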
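The first approach can be sketched with scikit-learn on a few made-up headlines. This is a minimal illustration, not the linked repository's code: the headlines, labels, and the helper `merge_and_clean` are assumptions, stemming is left out to avoid an extra dependency, and stop words are removed through `TfidfVectorizer(stop_words="english")`.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def merge_and_clean(headlines):
    """Join one day's headlines, lowercase, and drop tokens that contain digits."""
    kept = []
    for token in " ".join(headlines).lower().split():
        token = re.sub(r"[^a-z0-9]", "", token)  # strip special characters
        if token.isalpha():                      # drops empty tokens and tokens with digits
            kept.append(token)
    return " ".join(kept)


# Made-up data: one list of headlines per day, label 1 = market up, 0 = market down.
days = [
    ["Markets rally as banks report record profits", "Oil prices rise 3% on supply fears"],
    ["Stocks plunge amid recession worries", "Factory output falls for 2nd month"],
    ["Tech shares rally after strong earnings", "Exports rise as demand recovers"],
    ["Banks plunge as credit crisis deepens", "Retail sales fall short of forecasts"],
]
labels = [1, 0, 1, 0]

docs = [merge_and_clean(day) for day in days]

# TF-IDF features feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(docs, labels)

prediction = model.predict([merge_and_clean(["Shares rally as profits rise"])])
print(prediction)
```

Four documents are far too few to learn anything meaningful; the point is only the shape of the pipeline. On the real dataset the same pipeline would be fit on the earlier dates and evaluated on the later ones.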

Implementation:

https://github.com/yjfiejd/News_predict

3.

Reposted from: https://www.cnblogs.com/zhaopAC/p/9197608.html
