Tianchi Big Data Competition: Post-Competition Summary of the Tianchi Precision Medicine Challenge (AI-Assisted Prediction of Diabetes Genetic Risk)
Official Tianchi Big Data Competition website (link)
6. Prediction Algorithms
1. LightGBM
LightGBM uses a histogram-based algorithm: continuous feature (attribute) values are bucketed into discrete bins, which speeds up training and reduces memory usage. The basic idea of the histogram algorithm is to discretize the continuous floating-point feature values into k integers and build a histogram of width k. While scanning the data, the discretized value of each sample serves as an index into the histogram, where the required statistics are accumulated; after a single pass over the data, the histogram holds all the needed statistics, and the best split point is then found by scanning the histogram's k discrete bins instead of the raw samples. The histogram approach has several advantages: it reduces the cost of computing split gains, enables further speedup via histogram subtraction (a node's histogram can be derived by subtracting its sibling's histogram from its parent's), lowers memory usage, and cuts the communication cost of parallel learning.
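The two-step idea above (discretize into k bins, then scan only k bins for the split) can be sketched for a single feature under squared loss. This is an illustrative toy, not LightGBM's actual implementation; the quantile-based binning and the variance-gain formula are simplifying assumptions.

```python
import numpy as np

def best_histogram_split(x, grad, k=16):
    """Toy histogram-based split search for one continuous feature.

    Buckets `x` into k bins, accumulates per-bin gradient statistics in
    one pass, then scans the k bin boundaries (not the n samples) for
    the split maximizing a variance-style gain.  Illustrative only.
    """
    # 1. Discretize the continuous feature into k integer bins
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    bins = np.searchsorted(edges, x)          # bin index per sample

    # 2. A single pass over the data builds the histogram of statistics
    grad_sum = np.zeros(k)
    count = np.zeros(k)
    for b, g in zip(bins, grad):
        grad_sum[b] += g
        count[b] += 1

    # 3. Scan only the k bins for the best split point
    total_g, total_n = grad_sum.sum(), count.sum()
    best_gain, best_bin = -np.inf, None
    g_left = n_left = 0.0
    for b in range(k - 1):                    # split after bin b
        g_left += grad_sum[b]
        n_left += count[b]
        n_right = total_n - n_left
        if n_left == 0 or n_right == 0:
            continue
        g_right = total_g - g_left
        gain = (g_left ** 2 / n_left + g_right ** 2 / n_right
                - total_g ** 2 / total_n)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Note that step 3 costs O(k) per feature rather than O(n log n) for a sort-based scan, which is where the speedup comes from; histogram subtraction additionally lets a sibling node skip step 2 entirely.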
Gradient Boosting Decision Tree (GBDT) is a widely used algorithm with many implementations, such as scikit-learn and LightGBM. These implementations essentially differ in the loss function they use and in how that loss is minimized. The core idea of gradient boosting is to learn each new decision tree by fitting it to the negative gradient of the loss; for squared loss, the negative gradient is simply the residual between the target and the current prediction.
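That core idea, each tree fitting the negative gradient of the current loss, can be sketched in a few lines for squared loss using scikit-learn's `DecisionTreeRegressor` as the base learner. This is a minimal sketch of the principle, not how LightGBM builds its trees; the learning rate and tree depth are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=50, lr=0.1, max_depth=2):
    """Minimal gradient boosting for squared loss.

    Starts from the mean prediction, then repeatedly fits a small tree
    to the negative gradient of 0.5*(y - f)^2, which is just the
    residual y - f, and adds it with shrinkage `lr`.
    """
    base = y.mean()
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred              # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)            # each tree learns the residual
        pred += lr * tree.predict(X)
        trees.append(tree)
    return base, trees

def gbdt_predict(X, base, trees, lr=0.1):
    """Sum the shrunken contributions of all fitted trees."""
    pred = np.full(len(X), base)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred
```

Swapping the residual for the gradient of another differentiable loss (absolute error, Huber, etc.) is exactly the generality that distinguishes one GBDT implementation from another.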
The relevant code is shown below; it still needs to be adapted to your actual application.
import time
import datetime
import numpy as np
import pandas as pd
import lightgbm as lgb
from dateutil.parser import parse
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import mean_squared_error
from scipy import stats

data_path = 'data/'
train = pd.read_csv(data_path + 'd_train_20180102.csv', encoding='gb2312')
test = pd.read_csv(data_path + 'd_test_A_20180102.csv', encoding='gb2312')

def make_feat(train, test):
    train_id = train.id.values.copy()
    test_id = test.id.values.copy()
    data = pd.concat([train, test])
    # Encode gender as 0/1; '??' marks missing values in the raw file
    data['性别'] = data['性别'].map({'男': 1, '女': 0, '??': 0})
    # Turn the check-up date into days elapsed since a fixed reference date
    data['体检日期'] = (pd.to_datetime(data['体检日期']) - parse('2017-9-10')).dt.days
    train_feat = data[data.id.isin(train_id)]
    test_feat = data[data.id.isin(test_id)]
    # Drop the id column and the five hepatitis-B columns (mostly missing)
    drop_cols = ['id', '乙肝表面抗原', '乙肝表面抗体', '乙肝e抗原', '乙肝e抗体', '乙肝核心抗体']
    train_feat = train_feat.drop(drop_cols, axis=1)
    test_feat = test_feat.drop(drop_cols, axis=1)
    # Fill remaining missing values with the column medians
    train_feat.fillna(train_feat.median(axis=0), inplace=True)
    test_feat.fillna(test_feat.median(axis=0), inplace=True)
    return train_feat, test_feat

train_feat, test_feat = make_feat(train, test)

# Remove extreme outliers from the training set
train_feat = train_feat.drop(train_feat[train_feat['*r-谷氨酰基转换酶'] > 600].index)
train_feat = train_feat.drop(train_feat[train_feat['白细胞计数'] > 20.06].index)
train_feat = train_feat.drop(train_feat[train_feat['*丙氨酸氨基转换酶'] == 498.89].index)
train_feat = train_feat.drop(train_feat[train_feat['单核细胞%'] > 20].index)
train_feat = train_feat.drop(train_feat[train_feat['*碱性磷酸酶'] > 340].index)