kaggle homesite

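  • This project builds a predictive model for the Kaggle Homesite competition, using the sklearn interface of xgboost to model the data and predict whether a quote turns into a purchase.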
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
   QuoteNumber  Original_Quote_Date  QuoteConversion_Flag  Field6  Field7  Field8  Field9  Field10  Field11  Field12  ...  GeographicField59A  GeographicField59B  GeographicField60A  GeographicField60B  GeographicField61A  GeographicField61B  GeographicField62A  GeographicField62B  GeographicField63  GeographicField64
0            1           2013-08-16                     0       B      23  0.9403  0.0006      965   1.0200        N  ...                   9                   9                  -1                   8                  -1                  18                  -1                  10                  N                 CA
1            2           2014-04-22                     0       F       7  1.0006  0.0040      548   1.2433        N  ...                  10                  10                  -1                  11                  -1                  17                  -1                  20                  N                 NJ
2            4           2014-08-25                     0       F       7  1.0006  0.0040      548   1.2433        N  ...                  15                  18                  -1                  21                  -1                  11                  -1                   8                  N                 NJ
3            6           2013-04-15                     0       J      10  0.9769  0.0004    1,165   1.2665        N  ...                   6                   5                  -1                  10                  -1                   9                  -1                  21                  N                 TX
4            8           2014-01-25                     0       E      23  0.9472  0.0006    1,487   1.3045        N  ...                  18                  22                  -1                  10                  -1                  11                  -1                  12                  N                 IL

5 rows × 299 columns

train=train.drop('QuoteNumber',axis=1)
test = test.drop('QuoteNumber', axis=1)
Converting the date format
train['Date']=pd.to_datetime(train['Original_Quote_Date'])
train= train.drop('Original_Quote_Date',axis=1)
test['Date']=pd.to_datetime(test['Original_Quote_Date'])
test= test.drop('Original_Quote_Date',axis=1)
train['year']=train['Date'].dt.year
train['month']=train['Date'].dt.month
train['weekday']=train['Date'].dt.weekday
train.head()
   QuoteConversion_Flag  Field6  Field7  Field8  Field9  Field10  Field11  Field12  CoverageField1A  CoverageField1B  ...  GeographicField61A  GeographicField61B  GeographicField62A  GeographicField62B  GeographicField63  GeographicField64        Date  year  month  weekday
0                     0       B      23  0.9403  0.0006      965   1.0200        N               17               23  ...                  -1                  18                  -1                  10                  N                 CA  2013-08-16  2013      8        4
1                     0       F       7  1.0006  0.0040      548   1.2433        N                6                8  ...                  -1                  17                  -1                  20                  N                 NJ  2014-04-22  2014      4        1
2                     0       F       7  1.0006  0.0040      548   1.2433        N                7               12  ...                  -1                  11                  -1                   8                  N                 NJ  2014-08-25  2014      8        0
3                     0       J      10  0.9769  0.0004    1,165   1.2665        N                3                2  ...                  -1                   9                  -1                  21                  N                 TX  2013-04-15  2013      4        0
4                     0       E      23  0.9472  0.0006    1,487   1.3045        N                8               13  ...                  -1                  11                  -1                  12                  N                 IL  2014-01-25  2014      1        5

5 rows × 301 columns

test['year']=test['Date'].dt.year
test['month']=test['Date'].dt.month
test['weekday']=test['Date'].dt.weekday
train = train.drop('Date', axis=1)  
test = test.drop('Date', axis=1)
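The same three date columns are derived twice above, once for train and once for test. A minimal sketch of wrapping those steps in one helper so that both frames get identical treatment; the name add_date_parts and the two-row demo frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def add_date_parts(df, source_col="Original_Quote_Date"):
    """Return a copy of df with year/month/weekday columns replacing source_col."""
    out = df.copy()
    dates = pd.to_datetime(out[source_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["weekday"] = dates.dt.weekday  # Monday=0 ... Sunday=6
    return out.drop(columns=[source_col])

# Hypothetical demo frame using two dates that appear in train.head() above.
demo = pd.DataFrame({"Original_Quote_Date": ["2013-08-16", "2014-01-25"]})
demo = add_date_parts(demo)
print(demo)
```

With such a helper, train = add_date_parts(train) and test = add_date_parts(test) would replace the repeated assignments.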
Inspecting the data types
train.dtypes
QuoteConversion_Flag      int64
Field6                   object
Field7                    int64
Field8                  float64
Field9                  float64
Field10                  object
Field11                 float64
Field12                  object
CoverageField1A           int64
CoverageField1B           int64
CoverageField2A           int64
CoverageField2B           int64
CoverageField3A           int64
CoverageField3B           int64
CoverageField4A           int64
CoverageField4B           int64
CoverageField5A           int64
CoverageField5B           int64
CoverageField6A           int64
CoverageField6B           int64
CoverageField8           object
CoverageField9           object
CoverageField11A          int64
CoverageField11B          int64
SalesField1A              int64
SalesField1B              int64
SalesField2A              int64
SalesField2B              int64
SalesField3               int64
SalesField4               int64
                         ...   
GeographicField50B        int64
GeographicField51A        int64
GeographicField51B        int64
GeographicField52A        int64
GeographicField52B        int64
GeographicField53A        int64
GeographicField53B        int64
GeographicField54A        int64
GeographicField54B        int64
GeographicField55A        int64
GeographicField55B        int64
GeographicField56A        int64
GeographicField56B        int64
GeographicField57A        int64
GeographicField57B        int64
GeographicField58A        int64
GeographicField58B        int64
GeographicField59A        int64
GeographicField59B        int64
GeographicField60A        int64
GeographicField60B        int64
GeographicField61A        int64
GeographicField61B        int64
GeographicField62A        int64
GeographicField62B        int64
GeographicField63        object
GeographicField64        object
year                      int64
month                     int64
weekday                   int64
Length: 300, dtype: object
Viewing detailed DataFrame information
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260753 entries, 0 to 260752
Columns: 300 entries, QuoteConversion_Flag to weekday
dtypes: float64(6), int64(267), object(27)
memory usage: 596.8+ MB
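info() reports 27 object columns, and those are the ones that must be encoded before xgboost can use them. A small sketch of listing them with select_dtypes; the toy frame and its values are illustrative stand-ins for train.

```python
import pandas as pd

# Toy stand-in for train; Field6 is forced to object dtype to mirror the dtypes listing above.
toy = pd.DataFrame({
    "Field6": pd.Series(["B", "F"], dtype="object"),
    "Field7": [23, 7],
    "Field8": [0.9403, 1.0006],
})

# Names of the columns that hold strings and therefore need encoding.
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(object_cols)
```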
Filling missing values
train = train.fillna(-999)
test = test.fillna(-999)
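-999 only works as a sentinel if it never occurs as a genuine value, and the same number is later handed to xgboost through the missing entry of the parameter grid. A hedged sketch of checking this on a toy frame before filling; the column names a and b are invented for the example.

```python
import numpy as np
import pandas as pd

SENTINEL = -999  # same value the post passes to fillna and to the `missing` parameter

# Toy frame standing in for train.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 0.5]})

# Count missing values per column before they are overwritten.
missing_per_column = toy.isna().sum()

# Check that the sentinel is not already present as a real value.
collides = bool((toy == SENTINEL).any().any())

filled = toy.fillna(SENTINEL)
print(missing_per_column.to_dict(), collides)
```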
Converting categorical columns
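from sklearn import preprocessing
# Skip the first column (QuoteConversion_Flag, the target); everything else is a feature.
features = list(train.columns[1:])
for i in features:
    if train[i].dtype=='object':
        # Fit on train and test together so both share one label mapping.
        le=preprocessing.LabelEncoder()
        le.fit(list(train[i].values)+list(test[i].values))
        train[i] = le.transform(list(train[i].values))
        test[i] = le.transform(list(test[i].values))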
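The loop above maps every category to an integer. For comparison, here is a minimal sketch of the one-hot alternative on a made-up animal column (the values are toy data, not from the Homesite set). Note that the integer codes here are assigned in sorted category order, so they need not match codes written out by hand.

```python
import pandas as pd

animals = pd.Series(["dog", "cat", "dog", "mouse", "cat"], name="animal")

# Integer codes: pandas assigns them in sorted category order (cat=0, dog=1, mouse=2).
codes = animals.astype("category").cat.codes

# One-hot: one indicator column per category, with no implied ordering.
one_hot = pd.get_dummies(animals, prefix="animal")

print(codes.tolist())
print(one_hot.astype(int))
```

Whether one-hot actually helps a tree-based model such as xgboost is an empirical question; comparing the cross-validated AUC of both encodings would settle it for this dataset.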
        
Setting the model parameters
# Brute-force scan over all parameters; some rules of thumb:
# max_depth is usually 6, 7 or 8
# learning_rate is around 0.05, but small changes can make a big difference
# tuning min_child_weight, subsample and colsample_bytree
# helps a lot in fighting overfitting
# n_estimators is the number of boosting rounds
# finally, ensembling xgboost runs with multiple seeds may reduce variance

xgb_model = xgb.XGBClassifier()

parameters = {'nthread':[4], # with hyperthreading enabled, xgboost may become slower
              'objective':['binary:logistic'],
              'learning_rate': [0.05,0.1], #so called `eta` value
              'max_depth': [6],
              'min_child_weight': [11],
              'silent': [1],
              'subsample': [0.8],
              'colsample_bytree': [0.7],
              'n_estimators': [5], #number of trees, change it to 1000 for better results
              'missing':[-999],
              'seed': [1337]}
sfolder = StratifiedKFold(n_splits=5,random_state=42,shuffle=True)
clf= GridSearchCV(xgb_model,parameters,n_jobs=4,cv=sfolder.split(train[features], train["QuoteConversion_Flag"]),scoring='roc_auc',
                   verbose=2, refit=True,return_train_score=True)
clf.fit(train[features], train["QuoteConversion_Flag"])
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:  2.4min finished

GridSearchCV(cv=<generator object _BaseKFold.split at 0x0000000018459888>,
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=4,
       param_grid={'nthread': [4], 'objective': ['binary:logistic'], 'learning_rate': [0.05, 0.1], 'max_depth': [6], 'min_child_weight': [11], 'silent': [1], 'subsample': [0.8], 'colsample_bytree': [0.7], 'n_estimators': [5], 'missing': [-999], 'seed': [1337]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='roc_auc', verbose=2)
clf.grid_scores_
c:\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)

[mean: 0.94416, std: 0.00118, params: {'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 6, 'min_child_weight': 11, 'missing': -999, 'n_estimators': 5, 'nthread': 4, 'objective': 'binary:logistic', 'seed': 1337, 'silent': 1, 'subsample': 0.8},
 mean: 0.94589, std: 0.00120, params: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 6, 'min_child_weight': 11, 'missing': -999, 'n_estimators': 5, 'nthread': 4, 'objective': 'binary:logistic', 'seed': 1337, 'silent': 1, 'subsample': 0.8}]
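The deprecation warning above points to cv_results_ as the replacement for grid_scores_. A minimal sketch of pulling the same mean, std and params out of it; to stay self-contained it uses synthetic data and a DecisionTreeClassifier as stand-ins for the Homesite data and XGBClassifier, so the scores it prints are unrelated to the ones above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any binary classification set works here).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [1, 3]}, cv=cv, scoring="roc_auc")
search.fit(X, y)

# One row per candidate: the mean/std/params that grid_scores_ used to print.
summary = pd.DataFrame(search.cv_results_)[
    ["mean_test_score", "std_test_score", "params"]
]
print(summary)
```

The best candidate is also available directly as search.best_params_ and search.best_score_, without sorting the table by hand.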
pd.DataFrame(clf.cv_results_['params'])
   colsample_bytree  learning_rate  max_depth  min_child_weight  missing  n_estimators  nthread        objective  seed  silent  subsample
0               0.7           0.05          6                11     -999             5        4  binary:logistic  1337       1        0.8
1               0.7           0.10          6                11     -999             5        4  binary:logistic  1337       1        0.8
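# grid_scores_ is no longer available from scikit-learn 0.20; best_params_ and best_score_ give the same information.
best_parameters, score = clf.best_params_, clf.best_score_
print('Raw AUC score:', score)
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))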
Raw AUC score: 0.9458947562485674
colsample_bytree: 0.7
learning_rate: 0.1
max_depth: 6
min_child_weight: 11
missing: -999
n_estimators: 5
nthread: 4
objective: 'binary:logistic'
seed: 1337
silent: 1
subsample: 0.8


test_probs = clf.predict_proba(test[features])[:,1]

sample = pd.read_csv('sample_submission.csv')
sample.QuoteConversion_Flag = test_probs
sample.to_csv("xgboost_best_parameter_submission.csv", index=False)
clf.best_estimator_.predict_proba(test[features])
array([[0.6988076 , 0.3011924 ],
       [0.6787684 , 0.3212316 ],
       [0.6797658 , 0.32023418],
       ...,
       [0.5018287 , 0.4981713 ],
       [0.6988076 , 0.3011924 ],
       [0.62464744, 0.37535256]], dtype=float32)
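# Compare the xgboost submission with the predictions stored in keras_nn_test.csv.
# Both files hold model outputs for the test set, so this measures agreement between the two models, not accuracy against true labels.
keras_result=pd.read_csv('keras_nn_test.csv')
result1=[1 if i>0.5 else 0 for i in keras_result['QuoteConversion_Flag']]
xgb_result=pd.read_csv('xgboost_best_parameter_submission.csv')
result2=[1 if i>0.5 else 0 for i in xgb_result['QuoteConversion_Flag']]
from sklearn import metrics
metrics.accuracy_score(result1,result2)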
0.8566004740099864
metrics.confusion_matrix(result1,result2)
array([[148836,  24862],
       [    66,     72]], dtype=int64)
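Because accuracy_score and confusion_matrix were given result1 (Keras) and result2 (xgboost), the numbers above describe how the two models' thresholded predictions relate to each other. The arithmetic can be checked by hand; the sketch below only re-uses the four counts printed above and relies on scikit-learn's convention that rows follow the first argument and columns the second.

```python
# Counts copied from the confusion matrix above.
# Rows: result1 (Keras) = 0, 1; columns: result2 (xgboost) = 0, 1.
matrix = [[148836, 24862],
          [66, 72]]

total = sum(sum(row) for row in matrix)
agreement = (matrix[0][0] + matrix[1][1]) / total  # both models give the same label

keras_positives = sum(matrix[1])             # second row: Keras says 1
xgb_positives = matrix[0][1] + matrix[1][1]  # second column: xgboost says 1

print(total, round(agreement, 4), keras_positives, xgb_positives)
# prints: 173836 0.8566 138 24934
```

So the 0.8566 is the share of test quotes on which the two models agree, and nearly all of the disagreement comes from quotes that xgboost labels 1 and Keras labels 0.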
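Conclusions
  • The date column was preprocessed into year, month and weekday features.
  • The categorical columns were label-encoded. I think this deserves a second look: my feeling is that the categories should be one-hot encoded rather than passed through LabelEncoder (I am not sure about this).
  • Label encoding is useful in some cases, but the situations where it applies are limited. For example, [dog,cat,dog,mouse,cat] becomes [1,2,1,3,2], which produces an odd artifact: the average of dog and mouse is cat. The integers imply an ordering that the categories do not have, which is why I have not seen label encoding used widely so far.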
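  • The fitted model was applied to the test set. The raw AUC of 0.94 comes from cross-validation on the training data, while the 85% figure is the agreement between the xgboost and Keras predictions after thresholding at 0.5, not accuracy against true labels, which the test set does not provide. The two models disagree mainly on 24,862 quotes that xgboost labels 1 and Keras labels 0; if the Keras labels are taken as the reference, xgboost predicts far too many positives (a high false-positive count), and it also misses 66 of the 138 quotes that Keras labels 1.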
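  • Many of the model's tunable parameters were left untouched. Readers interested in parameter tuning can look at the example in the Meituan text classification project.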
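The bullets above set an AUC next to a figure computed from labels thresholded at 0.5. The two measure different things: AUC depends only on how the scores rank the examples, whereas any label-based metric also depends on where the cut-off sits. A minimal sketch with made-up labels and scores (not Homesite data), in which every score stays below 0.5, much like the predict_proba output shown earlier:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 0, 1, 1]
scores = [0.30, 0.32, 0.35, 0.40, 0.45]  # made-up probabilities, all below 0.5

# Ranking quality: every positive scores higher than every negative.
auc = roc_auc_score(y_true, scores)

# Labels at the default 0.5 cut-off: nothing is predicted positive.
labels_at_050 = [1 if s > 0.5 else 0 for s in scores]
acc_at_050 = accuracy_score(y_true, labels_at_050)

# Labels at a cut-off chosen inside the score range.
labels_at_038 = [1 if s > 0.38 else 0 for s in scores]
acc_at_038 = accuracy_score(y_true, labels_at_038)

print(auc, acc_at_050, acc_at_038)
```

With only 5 boosting rounds the predicted probabilities have had little chance to move away from their starting point, so a fixed 0.5 threshold is probably a poor way to turn them into labels; this is an assumption about the run above, not something the notebook verifies.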

posted on 2018-10-12 16:06 多一点
