Data Mining: Santander Customer Transaction Prediction

Introduction

Given the features provided, predict whether a customer will make a transaction.

Data Preparation

A description of the data fields and the train/test set downloads can be found at:
https://www.kaggle.com/c/santander-customer-transaction-prediction/data

Walkthrough

Setup

Load the libraries and the data we will use:

# EDA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# model
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import shap
import statsmodels.api as sm
#
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',None)
sns.set_style('whitegrid')

# load data
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')

Look at the data size and get a quick feel for the data:

print(f'Shape: Train, {train.shape}; Test, {test.shape}')
display(train.describe())

Shape: Train, (200000, 202); Test, (200000, 201)

(Output of train.describe(): count/mean/std/min/quartiles/max for target and var_0 through var_199. Every column has 200000 non-null values, and the mean of target is about 0.1005; the feature columns vary widely in scale.)

All the data appear to be numeric; next check the dtypes and missing values:

train.info()
print(f'Number of columns with nan: {train.isna().any().sum()}')

RangeIndex: 200000 entries, 0 to 199999
Columns: 202 entries, ID_code to var_199
dtypes: float64(200), int64(1), object(1)
memory usage: 308.2+ MB

Number of columns with nan: 0

All feature columns are numeric. The single object column is the unique ID_code, which will be dropped before modeling, and the dataset has no missing values.

Check the distribution of the target in the training set:

sns.countplot(x=train['target'], palette='Set3')
print(f"target=1 proportion: {train['target'].mean()*100:.3f}%")

[Figure: count of each target class]

target=1 proportion: 10.049%
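Only about 10% of the rows are positive, so when building validation folds it helps to preserve this ratio in every fold. A minimal sketch on synthetic data (the training loop later in this post simply shuffles and slices fixed 40000-row folds) showing that StratifiedKFold, already imported above, keeps the positive rate stable:

```python
# Minimal sketch on synthetic data: stratified folds preserve the ~10%
# positive rate in every validation fold of an imbalanced dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.1).astype(int)   # roughly 10% positives
X = rng.normal(size=(2000, 2))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
# each fold's positive rate stays within a whisker of the overall rate
assert all(abs(r - y.mean()) < 0.02 for r in fold_rates)
```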

The feature names offer no hint about their meaning, so we start from summary statistics.
First, look at how each feature is distributed under each target value, taking the first 25 features:

features=train.columns[2:27]
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
plt.subplots(5,5,figsize=(18,22))
for i,val in enumerate(features):
    plt.subplot(5,5,i+1)
    sns.distplot(t0[val],hist=False,label="target = 0")
    sns.distplot(t1[val],hist=False,label="target = 1")
    plt.xlabel(val,fontsize=9)
    plt.tick_params(axis='x', which='major', labelsize=3, pad=-6)
    plt.tick_params(axis='y', which='major', labelsize=3)
plt.show()

[Figure: 5x5 grid of feature distributions for target=0 vs target=1]
For some features the two target classes differ clearly over certain value ranges, e.g. var_2, var_6 and var_12, so adding the frequency (count) of each feature value may help prediction.

Next, compare the distributions of several statistics for target=0 versus target=1.
Per-sample standard deviation:

plt.figure(figsize=(16,6))
sns.distplot(t0[features].std(axis=1),hist=True,bins=120,label='target = 0',kde=True)
sns.distplot(t1[features].std(axis=1),hist=True,bins=120,label='target = 1',kde=True)
plt.legend()

[Figure: per-sample standard deviation by target]
Per-feature standard deviation:

plt.figure(figsize=(16,6))
sns.distplot(t0[features].std(axis=0),hist=True,bins=120,label='target = 0',kde=True)
sns.distplot(t1[features].std(axis=0),hist=True,bins=120,label='target = 1',kde=True)
plt.legend()

[Figure: per-feature standard deviation by target]
Per-sample maximum:

plt.figure(figsize=(16,6))
sns.distplot(t0[features].max(axis=1),hist=True,bins=120,label='target = 0',kde=True)
sns.distplot(t1[features].max(axis=1),hist=True,bins=120,label='target = 1',kde=True)
plt.legend()

[Figure: per-sample maximum by target]
Per-feature maximum:

plt.figure(figsize=(16,6))
sns.distplot(t0[features].max(axis=0),hist=True,bins=120,label='target = 0',kde=True)
sns.distplot(t1[features].max(axis=0),hist=True,bins=120,label='target = 1',kde=True)
plt.legend()

[Figure: per-feature maximum by target]

Now compare the data distributions of the training and test sets.
Per-sample mean:

features=train.columns[2:202]
plt.figure(figsize=(16,6))
sns.distplot(train[features].mean(axis=1),hist=True,color='green',bins=120,label='train',kde=True)
sns.distplot(test[features].mean(axis=1),hist=True,color='red',bins=120,label='test',kde=True)
plt.legend()

[Figure: per-sample mean, train vs test]
Per-feature mean:

features=train.columns[2:202]
plt.figure(figsize=(16,6))
sns.distplot(train[features].mean(axis=0),hist=True,color='magenta',bins=120,label='train',kde=True)
sns.distplot(test[features].mean(axis=0),hist=True,color='darkblue',bins=120,label='test',kde=True)
plt.legend()

[Figure: per-feature mean, train vs test]
Per-feature standard deviation:

features=train.columns[2:202]
plt.figure(figsize=(16,6))
sns.distplot(train[features].std(axis=0),hist=True,color='blue',bins=120,label='train',kde=True)
sns.distplot(test[features].std(axis=0),hist=True,color='g',bins=120,label='test',kde=True)
plt.legend()

[Figure: per-feature standard deviation, train vs test]
Per-sample standard deviation:

features=train.columns[2:202]
plt.figure(figsize=(16,6))
sns.distplot(train[features].std(axis=1),hist=True,color='magenta',bins=120,label='train',kde=True)
sns.distplot(test[features].std(axis=1),hist=True,color='darkblue',bins=120,label='test',kde=True)
plt.legend()

[Figure: per-sample standard deviation, train vs test]
Check the correlations between features. The largest absolute pairwise correlation is only 0.009844, so there is no meaningful correlation between features:

correlations=train[features].corr().abs().unstack().sort_values(kind='quicksort').reset_index()
correlations=correlations.loc[correlations['level_0']!=correlations['level_1']]
correlations.tail(10)
       level_0  level_1         0
39790  var_183  var_189  0.009359
39791  var_189  var_183  0.009359
39792  var_174  var_81   0.009490
39793  var_81   var_174  0.009490
39794  var_81   var_165  0.009714
39795  var_165  var_81   0.009714
39796  var_53   var_148  0.009788
39797  var_148  var_53   0.009788
39798  var_26   var_139  0.009844
39799  var_139  var_26   0.009844
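As a sanity check on the unstack pattern used above (not part of the original post): it flattens the square correlation matrix into one row per ordered feature pair, which is why every pair appears twice in the table and why the diagonal has to be filtered out.

```python
# Toy version of the correlation-flattening trick on a 3-column frame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])

c = df.corr().abs().unstack().sort_values(kind='quicksort').reset_index()
c = c.loc[c['level_0'] != c['level_1']]   # drop the diagonal (corr == 1)

assert len(c) == 6          # 3*3 matrix minus the 3 diagonal entries
assert c[0].max() < 1.0     # off-diagonal |corr| of random data is < 1
```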

Modeling

The stacked model consists of: 1. LightGBM; 2. Logistic regression.

  1. LightGBM does not account for how often each feature value occurs while it trains, so based on the analysis above we add frequency (count) features.
# add frequency (count) features
def encode_FE(df,col,test):
    cv=df[col].value_counts()
    nm=col+'_FE'
    df[nm]=df[col].map(cv)
    test[nm]=test[col].map(cv)
    test[nm].fillna(0,inplace=True)
    if cv.max()<255:
        df[nm]=df[nm].astype('uint8')
        test[nm]=test[nm].astype('uint8')
    else:
        df[nm]=df[nm].astype('uint16')
        test[nm]=test[nm].astype('uint16')
    return
    
test['target']=-1
comb=pd.concat([train,test],axis=0,sort=True)
for i in range(200): encode_FE(comb,'var_'+str(i),test)
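A tiny worked example of the frequency-encoding idea behind encode_FE (simplified here: counts taken from one frame only, and no uint8/uint16 switch): each value is accompanied by how many times it occurs, and values never seen in the counting frame fall back to 0.

```python
# Simplified frequency encoding on toy frames (not the exact helper above).
import pandas as pd

tr = pd.DataFrame({'var_0': [1.0, 1.0, 2.0]})
te = pd.DataFrame({'var_0': [1.0, 9.9]})

cv = tr['var_0'].value_counts()                 # value -> occurrence count
tr['var_0_FE'] = tr['var_0'].map(cv)
te['var_0_FE'] = te['var_0'].map(cv).fillna(0).astype('uint8')

assert list(tr['var_0_FE']) == [2, 2, 1]
assert list(te['var_0_FE']) == [2, 0]           # 9.9 was never seen
```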

Parameter and data preparation before cross-validation:

# recover the training rows from the combined frame
train = comb[:len(train)]
# LightGBM parameters: binary classification objective, evaluated by AUC as the competition requires
params={
    'learning_rate':0.04,
    'num_leaves':3,
    'metric':'auc',
    'boost_from_average':False,
    'feature_fraction':1.0,
    'max_depth':-1,
    'objective':'binary',
    'verbosity':10
}
# shuffle the rows
train2=train.sample(frac=1.0,random_state=1)
num_vars=200
# all_oof: out-of-fold predictions from cross-validation
# all_oofB: the same, but without the frequency features
all_oof = np.zeros((len(train2),num_vars+1))
all_oof[:,0] = np.ones(len(train2))
all_oofB = np.zeros((len(train2),num_vars+1))
all_oofB[:,0] = np.ones(len(train2))

# all_preds: test-set predictions; all_predsB: test-set predictions without the frequency features
all_preds = np.zeros((len(test),num_vars+1))
all_preds[:,0] = np.ones(len(test))
all_predsB = np.zeros((len(test),num_vars+1))
all_predsB[:,0] = np.ones(len(test))

# model training loop:
evals_result={}
for j in range(num_vars):
    features=['var_'+str(j),'var_'+str(j)+'_FE']
    oof=np.zeros(len(train2))
    pred=np.zeros(len(test))
    # density plots
    plt.figure(figsize=(16,8))
    plt.subplot(1,2,2)
    sns.distplot(train2[train2['target']==0]['var_'+str(j)],label='t=0')
    sns.distplot(train2[train2['target']==1]['var_'+str(j)],label='t=1')
    plt.legend()
    plt.yticks([])
    plt.xlabel('var_'+str(j))
    
    # mn, mx: range of the feature values
    # mnFE, mxFE: range of the frequency-feature values
    mn,mx=plt.xlim()
    mnFE=train2['var_'+str(j)+'_FE'].min()
    mxFE=train2['var_'+str(j)+'_FE'].max()
    
    # df: a grid of stepB*step points used to compute the predicted probability of being positive for each feature value at each frequency
    step=50
    stepB=train2['var_'+str(j)+'_FE'].nunique()
    w=(mx-mn)/step
    x=w*(np.arange(0,step)+0.5)+mn   # x: the value at the center of each bin
    x2 = np.array([])
    for i in range(stepB):
        x2 = np.concatenate([x,x2])
    df = pd.DataFrame({'var_'+str(j):x2})
    df['var_'+str(j)+'_FE'] = mnFE + (mxFE-mnFE)/(stepB-1) * (df.index//step)
    df['pred'] = 0
    
    # 5-fold cross-validation, with the frequency feature
    for k in range(5):
        valid=train2.iloc[(40000*k):(40000*(k+1))]
        train=train2[~train2.index.isin(valid.index)]
        val=lgb.Dataset(valid[features],valid['target'])
        trn=lgb.Dataset(train[features],train['target'])
        model=lgb.train(params,train_set=trn,num_boost_round=750,valid_sets=[trn,val],verbose_eval=False,evals_result=evals_result)
        
        oof[(40000*k):(40000*(k+1))]=model.predict(valid[features],num_iteration=model.best_iteration)
        pred+=model.predict(test[features],num_iteration=model.best_iteration)/5.0
        df['pred'] += model.predict(df[features],num_iteration=model.best_iteration)/5.0
    
    val_auc=roc_auc_score(train2['target'],oof)
    print('VAR_'+str(j)+' with magic val_auc =',round(val_auc,5))
    
    all_oof[:,j+1] = oof
    all_preds[:,j+1] = pred
    x = df['pred'].values
    x = np.reshape(x,(stepB,step))
    x = np.flip(x,axis=0)
    
    plt.subplot(1,2,1)
    sns.heatmap(x,cmap='RdBu_r',center=0.0)
    plt.title('VAR_'+str(j)+' Predictions with Magic',fontsize=16)    
    plt.xticks(np.linspace(0,49,5),np.round(np.linspace(mn,mx,5),1))
    plt.xlabel('Var_'+str(j))
    s = min(mxFE-mnFE+1,20)
    plt.yticks(np.linspace(mnFE,mxFE,s)-0.5,np.linspace(mxFE,mnFE,s).astype('int'))
    plt.ylabel('Count')
    plt.show()
    
    # model without the frequency feature (FE)
    features = ['var_'+str(j)]
    oof = np.zeros(len(train2))
    preds = np.zeros(len(test))
    
    # PLOT DENSITIES
    plt.figure(figsize=(16,5))
    plt.subplot(1,2,2)
    sns.distplot(train2[train2['target']==0]['var_'+str(j)], label = 't=0')
    sns.distplot(train2[train2['target']==1]['var_'+str(j)], label = 't=1')
    plt.legend()
    plt.yticks([])
    plt.xlabel('Var_'+str(j))
    
    # MAKE A GRID OF POINTS FOR LGBM TO PREDICT
    mn,mx = plt.xlim()
    mnFE = train2['var_'+str(j)+'_FE'].min()
    mxFE = train2['var_'+str(j)+'_FE'].max()
    step = 50
    stepB = train2['var_'+str(j)+'_FE'].nunique()
    w = (mx-mn)/step
    x = w * (np.arange(0,step)+0.5) + mn
    x2 = np.array([])
    for i in range(stepB):
        x2 = np.concatenate([x,x2])
    df = pd.DataFrame({'var_'+str(j):x2})
    df['var_'+str(j)+'_FE'] = mnFE + (mxFE-mnFE)/(stepB-1) * (df.index//step)
    df['pred'] = 0
    
    # 5-fold cross-validation, without the frequency feature
    for k in range(5):
        valid = train2.iloc[k*40000:(k+1)*40000]
        train = train2[ ~train2.index.isin(valid.index) ]
        trn_data  = lgb.Dataset(train[features], label=train['target'])
        val_data = lgb.Dataset(valid[features], label=valid['target'])

        model = lgb.train(params, trn_data, 750, valid_sets = [trn_data, val_data], verbose_eval=False, evals_result=evals_result)      

        oof[k*40000:(k+1)*40000] = model.predict(valid[features], num_iteration=model.best_iteration)
        preds += model.predict(test[features], num_iteration=model.best_iteration)/5.0
        df['pred'] += model.predict(df[features], num_iteration=model.best_iteration)/5.0
            
    val_auc = roc_auc_score(train2['target'],oof)
    print('VAR_'+str(j)+' without magic val_auc =',round(val_auc,5))
    all_oofB[:,j+1] = oof
    all_predsB[:,j+1] = preds
    x = df['pred'].values
    x = np.reshape(x,(stepB,step))
    x = np.flip(x,axis=0)
    
    # PLOT LGBM PREDICTIONS WITHOUT USING FE
    plt.subplot(1,2,1)
    sns.heatmap(x, cmap='RdBu_r', center=0.0) 
    plt.title('VAR_'+str(j)+' Predictions without Magic',fontsize=16)
    plt.xticks(np.linspace(0,49,5),np.round(np.linspace(mn,mx,5),1))
    plt.xlabel('Var_'+str(j))
    plt.yticks([])
    plt.ylabel('')
    plt.show()

Taking VAR_2 as an example: adding the frequency feature gives a slight improvement in the validation metric.
With the frequency feature:
[Figure: VAR_2 prediction heatmap and densities, with the frequency feature]

VAR_2 with magic FE val_auc = 0.55043

Without the frequency feature:
[Figure: VAR_2 prediction heatmap and densities, without the frequency feature]

VAR_2 without magic val_auc = 0.55043

On top of the per-variable out-of-fold predictions above, add a logistic regression stacker:

# ENSEMBLE MODEL WITHOUT FE
logrB = sm.Logit(train2['target'], all_oofB[:,:num_vars+1])
logrB = logrB.fit(disp=0)
ensemble_predsB = logrB.predict(all_oofB[:,:num_vars+1])
ensemble_aucB = roc_auc_score(train2['target'],ensemble_predsB)
print('Combined Model without magic FE Val_AUC=',round(ensemble_aucB,5))
print()

# ENSEMBLE MODEL WITH FE
logr = sm.Logit(train2['target'], all_oof[:,:num_vars+1])
logr = logr.fit(disp=0)
# validation AUC must come from the out-of-fold predictions, not the test predictions
ensemble_preds = logr.predict(all_oof[:,:num_vars+1])
ensemble_auc = roc_auc_score(train2['target'],ensemble_preds)
print('Combined Model with magic FE Val_AUC=',round(ensemble_auc,5))
print()

Combined Model without magic FE Val_AUC= 0.88619
Combined Model with magic FE Val_AUC= 0.91285

First submission: adding only some EDA statistics as features (max/min, variance, skewness, kurtosis, median, etc.) scored only around the top 42%:
[Figure: first submission leaderboard score]
Second submission: using the stacked model above (LightGBM + logistic regression) and keeping the frequency features scored around the top 1.4% (rank 123):
[Figure: second submission leaderboard score]
Compared with blending, stacking made it easier to reach a high score.
