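## A beginner's attempt at a DataCastle competition problem. I used logistic regression and the results are not good: the submission currently ranks only around 100th. I'm posting it on the blog for now because my project schedule is tight and I'm afraid I won't have time to work on it later.
Follow-up ideas:
(1) Feature engineering (feature selection, feature combination, etc.)
(2) Stack models: consider RF, GBDT, etc.; a stacked neural network is also an option
(3) Study the features themselves more closely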
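As a rough illustration of follow-up ideas (1) and (2), here is a minimal sketch that compares a logistic baseline with RF, GBDT, an RF-based feature filter, and a simple stacked model. It is not the competition pipeline: the data is a synthetic stand-in from `make_classification`, ROC AUC is used as an assumed evaluation metric, and all hyperparameters are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real loan data (assumption:
# the real features would come from the merged user tables).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_clusters_per_class=1, weights=[0.85, 0.15],
                           random_state=0)


def rf():
    return RandomForestClassifier(n_estimators=50, random_state=0)


def gbdt():
    return GradientBoostingClassifier(n_estimators=50, random_state=0)


candidates = {
    # Baseline: scaled logistic regression, as in the original attempt.
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    "rf": rf(),
    "gbdt": gbdt(),
    # Idea (1): keep only features whose RF importance is above the mean,
    # then fit the logistic model on the reduced feature set.
    "rf_select+logistic": make_pipeline(
        StandardScaler(),
        SelectFromModel(rf(), threshold="mean"),
        LogisticRegression(max_iter=1000),
    ),
    # Idea (2): stack RF and GBDT, with logistic regression on top.
    "stacking": StackingClassifier(
        estimators=[("rf", rf()), ("gbdt", gbdt())],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    ),
}

scores = {}
for name, model in candidates.items():
    # 3-fold cross-validated ROC AUC for each candidate model.
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
    scores[name] = auc.mean()
    print(f"{name:>20s}: mean AUC = {scores[name]:.3f}")
```

On the real data, the same loop could be pointed at the merged feature table, which would make it easier to tell whether a change in features or a change in model is responsible for a score difference.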
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 10 09:54:12 2017
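### DataCastle 'User Loan Risk Prediction' competition ###
# Initial idea: use logistic regression. Feature selection has a large effect on the result, so read more about feature selection when time allows.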
"""
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
class DataCastle(object):
    def __init__(self):
        self.name = "<<- User loan forecast match ->>"
        self.result = "result.csv"

    # Read the train and test user info tables and return them as one DataFrame
    def readUserInfo(self):
        user_info_train = readData("train/user_info_train.txt")
        user_info_test = readData("test/user_info_test.txt")
        col_names = ['userid', 'sex', 'occupation', 'education', 'marriage', 'household']
        user_info_train.columns = col_names
        user_info_test.columns = col_names
        user_info = pd.concat([user_info_train, user_info_test])
        # Index the combined table by user ID
        user_info.index = user_info['userid']
        user_info.drop('userid', axis=1, inplace=True)
        return user_info

    # Read the user bank statement tables, sum the statement data, and return it
    def readBankDetail(self):
        bank_detail_train = readData("train/bank_detail_train.txt")
        bank_detail_test = readData("test/bank_detail_test.txt")
        col_names = ['userid', 'time_bank',<