A few reflections:
1. Working with pandas feels like writing SQL: at heart it is all create/read/update/delete, but once joins, grouping, and operations on different data types are involved there are many tricks, and those tricks can only be sharpened through a continuous cycle of learning and use.
2. When the features include a datetime column, you can group on it to build new time-series features: (1) is it a weekend? (2) which day of the month is it? (3) trend features; (4) others.
3. Code worth forking here: (1) the outlier detection and handling for numeric features; (2) the exponentially weighted moving average that captures the time trend; (3) the time-series summary statistics.
4. Different machine learning algorithms call for different feature construction. KNN needs no outlier handling (it is insensitive to outliers), while linear regression, SVM, and similar algorithms do; decision trees are insensitive to feature scale and need no normalization, while KNN needs it; xgboost needs no treatment of missing values (NaN), filling them by prediction during training, while simpler algorithms require dropping or imputing them. (xgboost comes pre-installed in Baidu's PaddlePaddle environment, but newer algorithms such as lightgbm and catboost have to be reinstalled every time the environment is reset. A question for any experts who see this post: is the free compute card Paddle provides really a Tesla V100, as claimed (see Figure 1)? It doesn't feel that fast to me.)
5. Hardware genuinely affects one's confidence in machine learning and deep learning research: validating a single model can take hours (even in the high-spec Paddle environment, and Tianchi's environment always has a queue; "MegEngine" (天元) announced yesterday that it was going open source, though whether it will offer reliable free compute remains to be seen), so the only soft workaround is to learn tricks for data storage and manipulation.
6. The three sharp blades of machine learning competitions: feature construction; modeling and tuning (log transforms; search algorithms: greedy, grid search, Bayesian optimization); and model ensembling (stacking, voting, random forest + bagging + AdaBoost). Ensembling really eats memory.
Restaurant Visitor Forecasting
Restaurant visit data
import pandas as pd
air_visit = pd.read_csv('air_visit_data.csv')
air_visit.head()
           air_store_id  visit_date  visitors
0  air_ba937bf13d40fb24  2016-01-13        25
1  air_ba937bf13d40fb24  2016-01-14        32
2  air_ba937bf13d40fb24  2016-01-15        29
3  air_ba937bf13d40fb24  2016-01-16        22
4  air_ba937bf13d40fb24  2016-01-18         6
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit.head()
                    air_store_id  visit_date  visitors
visit_date
2016-01-13  air_ba937bf13d40fb24  2016-01-13        25
2016-01-14  air_ba937bf13d40fb24  2016-01-14        32
2016-01-15  air_ba937bf13d40fb24  2016-01-15        29
2016-01-16  air_ba937bf13d40fb24  2016-01-16        22
2016-01-18  air_ba937bf13d40fb24  2016-01-18         6
Aggregating by day
(1) Resample the datetime index to daily frequency: resample('1d').sum()
air_visit = air_visit.groupby('air_store_id').apply(lambda g: g['visitors'].resample('1d').sum()).reset_index()
air_visit.head()
           air_store_id  visit_date  visitors
0  air_00a91d42b08b08d9  2016-07-01        35
1  air_00a91d42b08b08d9  2016-07-02         9
2  air_00a91d42b08b08d9  2016-07-03         0
3  air_00a91d42b08b08d9  2016-07-04        20
4  air_00a91d42b08b08d9  2016-07-05        25
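A minimal sketch of the resample trick on hypothetical data: upsampling a sparse datetime index to daily frequency fills in the missing calendar days, and sum() turns each empty day into 0.

```python
import pandas as pd

# Toy series with a gap on 2016-01-03 (hypothetical visitor counts)
s = pd.Series(
    [5, 7, 3],
    index=pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-04']),
)
daily = s.resample('1d').sum()  # missing day appears with sum 0
print(daily.tolist())  # [5, 7, 0, 3]
```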
air_visit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296279 entries, 0 to 296278
Data columns (total 3 columns):
air_store_id 296279 non-null object
visit_date 296279 non-null datetime64[ns]
visitors 296279 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 6.8+ MB
Fill missing values with 0
(2) Normalize the date column: dt.strftime('%Y-%m-%d')
air_visit['visit_date'] = air_visit['visit_date'].dt.strftime('%Y-%m-%d')
air_visit['was_nil'] = air_visit['visitors'].isnull()
air_visit['visitors'].fillna(0, inplace=True)
air_visit.head()
           air_store_id  visit_date  visitors was_nil
0  air_00a91d42b08b08d9  2016-07-01        35   False
1  air_00a91d42b08b08d9  2016-07-02         9   False
2  air_00a91d42b08b08d9  2016-07-03         0   False
3  air_00a91d42b08b08d9  2016-07-04        20   False
4  air_00a91d42b08b08d9  2016-07-05        25   False
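A quick sketch of dt.strftime on toy datetimes: it renders datetime64 values back into fixed-format strings, which is what makes the later string-keyed merges line up.

```python
import pandas as pd

# Hypothetical dates
ts = pd.Series(pd.to_datetime(['2016-01-05', '2017-12-31']))
formatted = ts.dt.strftime('%Y-%m-%d')
print(formatted.tolist())  # ['2016-01-05', '2017-12-31']
```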
Calendar data
date_info = pd.read_csv('date_info.csv')
date_info.head()
  calendar_date day_of_week  holiday_flg
0    2016-01-01      Friday            1
1    2016-01-02    Saturday            1
2    2016-01-03      Sunday            1
3    2016-01-04      Monday            0
4    2016-01-05     Tuesday            0
(3) shift() moves the data forward or backward, which lets us check whether the previous day and the next day are holidays.
date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)
date_info.head()
   visit_date day_of_week  is_holiday  prev_day_is_holiday  next_day_is_holiday
0  2016-01-01      Friday           1                  0.0                  1.0
1  2016-01-02    Saturday           1                  1.0                  1.0
2  2016-01-03      Sunday           1                  1.0                  0.0
3  2016-01-04      Monday           0                  1.0                  0.0
4  2016-01-05     Tuesday           0                  0.0                  0.0
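A toy sketch of the shift trick: shift(1) looks at the previous row and shift(-1) at the next, so applied to a holiday flag it yields "was yesterday / is tomorrow a holiday".

```python
import pandas as pd

# Hypothetical holiday flags for five consecutive days
flags = pd.Series([1, 1, 0, 0, 1])
prev = flags.shift(1).fillna(0)   # previous day's flag
nxt = flags.shift(-1).fillna(0)   # next day's flag
print(prev.tolist())  # [0.0, 1.0, 1.0, 0.0, 0.0]
print(nxt.tolist())   # [1.0, 0.0, 0.0, 1.0, 0.0]
```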
Store area data
air_store_info = pd.read_csv('air_store_info.csv')
air_store_info.head()
           air_store_id  air_genre_name                 air_area_name   latitude   longitude
0  air_0f0cdeee6c9bf3d7  Italian/French  Hyōgo-ken Kōbe-shi Kumoidōri  34.695124  135.197852
1  air_7cc17a324ae5c7dc  Italian/French  Hyōgo-ken Kōbe-shi Kumoidōri  34.695124  135.197852
2  air_fee8dcf4d619598e  Italian/French  Hyōgo-ken Kōbe-shi Kumoidōri  34.695124  135.197852
3  air_a17f0778617c76e2  Italian/French  Hyōgo-ken Kōbe-shi Kumoidōri  34.695124  135.197852
4  air_83db5aff8f50478e  Italian/French  Tōkyō-to Minato-ku Shibakōen  35.658068  139.751599
Test set
(4) Slice a string column: str.slice(start, stop)
import numpy as np

submission = pd.read_csv('sample_sub.csv')
submission['air_store_id'] = submission['id'].str.slice(0, 20)
submission['visit_date'] = submission['id'].str.slice(21)
submission['is_test'] = True
submission['visitors'] = np.nan
submission['test_number'] = range(len(submission))
submission.head()
                                id  visitors          air_store_id  visit_date  is_test  test_number
0  air_00a91d42b08b08d9_2017-04-23       NaN  air_00a91d42b08b08d9  2017-04-23     True            0
1  air_00a91d42b08b08d9_2017-04-24       NaN  air_00a91d42b08b08d9  2017-04-24     True            1
2  air_00a91d42b08b08d9_2017-04-25       NaN  air_00a91d42b08b08d9  2017-04-25     True            2
3  air_00a91d42b08b08d9_2017-04-26       NaN  air_00a91d42b08b08d9  2017-04-26     True            3
4  air_00a91d42b08b08d9_2017-04-27       NaN  air_00a91d42b08b08d9  2017-04-27     True            4
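A small check of the slicing boundaries: the store id occupies the first 20 characters of the submission id, and the date starts at position 21 (skipping the underscore at position 20).

```python
import pandas as pd

# One real-looking submission id (from the sample above)
ids = pd.Series(['air_00a91d42b08b08d9_2017-04-23'])
store_id = ids.str.slice(0, 20).iloc[0]
date = ids.str.slice(21).iloc[0]
print(store_id)  # air_00a91d42b08b08d9
print(date)      # 2017-04-23
```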
Combining all the data
# sort=True keeps the alphabetically sorted columns shown below and silences pandas' FutureWarning about sorting on concat
data = pd.concat((air_visit, submission.drop('id', axis='columns')), sort=True)
data.head()
           air_store_id is_test  test_number  visit_date  visitors was_nil
0  air_00a91d42b08b08d9     NaN          NaN  2016-07-01      35.0   False
1  air_00a91d42b08b08d9     NaN          NaN  2016-07-02       9.0   False
2  air_00a91d42b08b08d9     NaN          NaN  2016-07-03       0.0   False
3  air_00a91d42b08b08d9     NaN          NaN  2016-07-04      20.0   False
4  air_00a91d42b08b08d9     NaN          NaN  2016-07-05      25.0   False
data.shape
(328298, 6)
data.isnull().sum()
air_store_id 0
is_test 296279
test_number 296279
visit_date 0
visitors 32019
was_nil 32019
dtype: int64
data['is_test'].fillna(False, inplace=True)
data = pd.merge(left=data, right=date_info, on='visit_date', how='left')
data = pd.merge(left=data, right=air_store_info, on='air_store_id', how='left')
data['visitors'] = data['visitors'].astype(float)
data.head()
           air_store_id  is_test  test_number  visit_date  visitors was_nil day_of_week  \
0  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0   False      Friday
1  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0   False    Saturday
2  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0   False      Sunday
3  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0   False      Monday
4  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0   False     Tuesday

   is_holiday  prev_day_is_holiday  next_day_is_holiday  air_genre_name  \
0           0                  0.0                  0.0  Italian/French
1           0                  0.0                  0.0  Italian/French
2           0                  0.0                  0.0  Italian/French
3           0                  0.0                  0.0  Italian/French
4           0                  0.0                  0.0  Italian/French

                     air_area_name   latitude   longitude
0  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
1  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
2  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
3  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
4  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
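A toy sketch of the left joins used above, with hypothetical data: how='left' keeps every row of the left frame and fills unmatched keys with NaN, which is why the calendar and store columns appear on every visit row.

```python
import pandas as pd

left = pd.DataFrame({'visit_date': ['2016-01-01', '2016-01-02'], 'visitors': [3, 5]})
right = pd.DataFrame({'visit_date': ['2016-01-01'], 'is_holiday': [1]})
merged = pd.merge(left=left, right=right, on='visit_date', how='left')
# 2016-01-02 has no match in `right`, so is_holiday becomes NaN there
print(merged['is_holiday'].tolist())  # [1.0, nan]
```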
import missingno as msno
msno.bar(data)
<matplotlib.axes._subplots.AxesSubplot at 0x121fe9390>
Weather data
import glob
import os

weather_dfs = []
for path in glob.glob('./Weather/*.csv'):
    weather_df = pd.read_csv(path)
    # os.path.basename + suffix slicing replaces the original path.split('\\') and
    # rstrip('.csv'), which on macOS left the whole path in station_id
    # (as visible in the output below)
    weather_df['station_id'] = os.path.basename(path)[:-len('.csv')]
    weather_dfs.append(weather_df)

weather = pd.concat(weather_dfs, axis='rows')
weather.rename(columns={'calendar_date': 'visit_date'}, inplace=True)
weather.head()
   visit_date  avg_temperature  high_temperature  low_temperature  precipitation  hours_sunlight  \
0  2016-01-01             20.5              22.4             17.5            0.0             0.6
1  2016-01-02             23.5              26.2             21.2            5.0             3.6
2  2016-01-03             21.7              23.7             20.2           11.0             0.0
3  2016-01-04             21.6              23.8             20.4           11.0             0.1
4  2016-01-05             22.1              24.6             20.5           35.5             0.0

   solar_radiation  deepest_snowfall  total_snowfall  avg_wind_speed  avg_vapor_pressure  \
0              NaN               NaN             NaN             6.3                 NaN
1              NaN               NaN             NaN             4.7                 NaN
2              NaN               NaN             NaN             2.8                 NaN
3              NaN               NaN             NaN             3.3                 NaN
4              NaN               NaN             NaN             2.4                 NaN

   avg_local_pressure  avg_humidity  avg_sea_pressure  cloud_cover                             station_id
0                 NaN           NaN               NaN          NaN  ./Weather/okinawa__ohara-kana__oohara
1                 NaN           NaN               NaN          NaN  ./Weather/okinawa__ohara-kana__oohara
2                 NaN           NaN               NaN          NaN  ./Weather/okinawa__ohara-kana__oohara
3                 NaN           NaN               NaN          NaN  ./Weather/okinawa__ohara-kana__oohara
4                 NaN           NaN               NaN          NaN  ./Weather/okinawa__ohara-kana__oohara
Use the local station data to compute global daily averages
(5) Group by one column and aggregate selected others: groupby()[['a', 'b']].mean()
means = weather.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
means.rename(columns={'avg_temperature': 'global_avg_temperature', 'precipitation': 'global_precipitation'}, inplace=True)
means.head()
   visit_date  global_avg_temperature  global_precipitation
0  2016-01-01                2.868353              0.564662
1  2016-01-02                5.279225              2.341998
2  2016-01-03                6.589978              1.750616
3  2016-01-04                5.857883              1.644946
4  2016-01-05                4.556850              3.193625
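A toy sketch of the groupby aggregation with hypothetical station readings: selecting a column list after groupby restricts the mean to just those numeric columns.

```python
import pandas as pd

readings = pd.DataFrame({
    'visit_date': ['2016-01-01', '2016-01-01', '2016-01-02'],
    'avg_temperature': [10.0, 20.0, 30.0],
    'precipitation': [0.0, 2.0, 4.0],
})
daily_means = readings.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
print(daily_means['avg_temperature'].tolist())  # [15.0, 30.0]
print(daily_means['precipitation'].tolist())    # [1.0, 4.0]
```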
means.visit_date.nunique()
517
weather.visit_date.nunique()
517
weather = pd.merge(left=weather, right=means, on='visit_date', how='left')
weather['avg_temperature'].fillna(weather['global_avg_temperature'], inplace=True)
weather['precipitation'].fillna(weather['global_precipitation'], inplace=True)
weather[['visit_date', 'avg_temperature', 'precipitation']].head()
   visit_date  avg_temperature  precipitation
0  2016-01-01             20.5            0.0
1  2016-01-02             23.5            5.0
2  2016-01-03             21.7           11.0
3  2016-01-04             21.6           11.0
4  2016-01-05             22.1           35.5
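A sketch of column-wise imputation with hypothetical values: Series.fillna accepts another Series and fills each NaN with the value at the same index, which is how the per-station gaps are backfilled from the global daily means above.

```python
import numpy as np
import pandas as pd

temp = pd.Series([20.5, np.nan, 21.7])       # station readings with a gap
global_avg = pd.Series([2.9, 5.3, 6.6])      # global mean for each day
filled = temp.fillna(global_avg)             # NaN takes the aligned global value
print(filled.tolist())  # [20.5, 5.3, 21.7]
```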
Dataset info
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 328298 entries, 2016-07-01 to 2017-05-31
Data columns (total 15 columns):
air_store_id 328298 non-null object
is_test 328298 non-null bool
test_number 32019 non-null float64
visit_date 328298 non-null datetime64[ns]
visitors 296279 non-null float64
was_nil 296279 non-null object
day_of_week 328298 non-null object
is_holiday 328298 non-null int64
prev_day_is_holiday 328298 non-null float64
next_day_is_holiday 328298 non-null float64
air_genre_name 328298 non-null object
air_area_name 328298 non-null object
latitude 328298 non-null float64
longitude 328298 non-null float64
is_weekend 328298 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(6), int64(2), object(5)
memory usage: 37.9+ MB
data.reset_index(drop=True, inplace=True)
data.sort_values(['air_store_id', 'visit_date'], inplace=True)
data.head()
           air_store_id  is_test  test_number  visit_date  visitors was_nil day_of_week  \
0  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0   False      Friday
1  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0   False    Saturday
2  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0   False      Sunday
3  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0   False      Monday
4  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0   False     Tuesday

   is_holiday  prev_day_is_holiday  next_day_is_holiday  air_genre_name  \
0           0                  0.0                  0.0  Italian/French
1           0                  0.0                  0.0  Italian/French
2           0                  0.0                  0.0  Italian/French
3           0                  0.0                  0.0  Italian/French
4           0                  0.0                  0.0  Italian/French

                     air_area_name   latitude   longitude  is_weekend
0  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
1  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           1
2  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           1
3  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
4  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
(6) Outliers: some points in the data are anomalous. Starting from the assumption of a normal distribution, about 95% of values are treated as normal, which is where the 1.96 threshold comes from. Outliers are then capped: abnormally large values are set equal to the largest normal value.
def find_outliers(series):
    return (series - series.mean()) > 1.96 * series.std()

def cap_values(series):
    outliers = find_outliers(series)
    max_val = series[~outliers].max()
    series[outliers] = max_val
    return series

stores = data.groupby('air_store_id')
data['is_outlier'] = stores.apply(lambda g: find_outliers(g['visitors'])).values
data['visitors_capped'] = stores.apply(lambda g: cap_values(g['visitors'])).values
data['visitors_capped_log1p'] = np.log1p(data['visitors_capped'])
data.head()
           air_store_id  is_test  test_number  visit_date  visitors was_nil day_of_week  \
0  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0   False      Friday
1  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0   False    Saturday
2  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0   False      Sunday
3  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0   False      Monday
4  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0   False     Tuesday

   is_holiday  prev_day_is_holiday  next_day_is_holiday  air_genre_name  \
0           0                  0.0                  0.0  Italian/French
1           0                  0.0                  0.0  Italian/French
2           0                  0.0                  0.0  Italian/French
3           0                  0.0                  0.0  Italian/French
4           0                  0.0                  0.0  Italian/French

                     air_area_name   latitude   longitude  is_weekend  \
0  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
1  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           1
2  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           1
3  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
4  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0

   is_outlier  visitors_capped  visitors_capped_log1p
0       False             35.0               3.583519
1       False              9.0               2.302585
2       False              0.0               0.000000
3       False             20.0               3.044522
4       False             25.0               3.258097
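A toy check of the mean + 1.96·std rule above (hypothetical counts). Note that with only a handful of points a single extreme value inflates the standard deviation so much that nothing exceeds 1.96σ, so ten points are used here; the flagged value is replaced by the largest non-outlier.

```python
import pandas as pd

s = pd.Series([8, 12, 10, 9, 11, 10, 12, 9, 11, 100])
outliers = (s - s.mean()) > 1.96 * s.std()
# same capping as cap_values above, done without in-place assignment
capped = s.where(~outliers, s[~outliers].max())
print(outliers.tolist())  # [False]*9 + [True]
print(capped.tolist())    # [8, 12, 10, 9, 11, 10, 12, 9, 11, 12]
```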
data.isnull().sum()
air_store_id 0
is_test 0
test_number 296279
visit_date 0
visitors 32019
was_nil 32019
day_of_week 0
is_holiday 0
prev_day_is_holiday 0
next_day_is_holiday 0
air_genre_name 0
air_area_name 0
latitude 0
longitude 0
is_weekend 0
is_outlier 0
visitors_capped 32019
visitors_capped_log1p 32019
dtype: int64
Date features
(7) Add two features: "is it a weekend" and "day of the month"
data['is_weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
data['day_of_month'] = data['visit_date'].dt.day
data.head()
           air_store_id  is_test  test_number  visit_date  visitors was_nil day_of_week  \
0  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0   False      Friday
1  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0   False    Saturday
2  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0   False      Sunday
3  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0   False      Monday
4  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0   False     Tuesday

   is_holiday  prev_day_is_holiday  next_day_is_holiday  air_genre_name  \
0           0                  0.0                  0.0  Italian/French
1           0                  0.0                  0.0  Italian/French
2           0                  0.0                  0.0  Italian/French
3           0                  0.0                  0.0  Italian/French
4           0                  0.0                  0.0  Italian/French

                     air_area_name   latitude   longitude  is_weekend  \
0  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
1  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           1
2  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           1
3  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
4  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0

   is_outlier  visitors_capped  visitors_capped_log1p  day_of_month
0       False             35.0               3.583519             1
1       False              9.0               2.302585             2
2       False              0.0               0.000000             3
3       False             20.0               3.044522             4
4       False             25.0               3.258097             5
(8) The exponentially weighted moving average (EWMA) reflects how the time series trends. It requires an alpha value; here we optimize alpha to find the best fit.
from scipy import optimize

def calc_shifted_ewm(series, alpha, adjust=True):
    # shift() first, so each row's EWM only sees past values (no leakage)
    return series.shift().ewm(alpha=alpha, adjust=adjust).mean()

def find_best_signal(series, adjust=False, eps=10e-5):
    def f(alpha):
        shifted_ewm = calc_shifted_ewm(series=series, alpha=min(max(alpha, 0), 1), adjust=adjust)
        mse = np.mean(np.power(series - shifted_ewm, 2))
        return mse
    res = optimize.differential_evolution(func=f, bounds=[(0 + eps, 1 - eps)])
    return calc_shifted_ewm(series=series, alpha=res['x'][0], adjust=adjust)

roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
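A minimal sketch of the leakage-safe trend feature on toy data: shift() drops the current value before the exponentially weighted mean is computed, so each row only sees its past.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])
# with adjust=False: y_t = (1 - alpha) * y_{t-1} + alpha * x_t
trend = s.shift().ewm(alpha=0.5, adjust=False).mean()
print(trend.tolist())  # [nan, 10.0, 15.0]
```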
(9) Extract as much time-series information as possible
def extract_precedent_statistics(df, on, group_by):
    df.sort_values(group_by + ['visit_date'], inplace=True)
    groups = df.groupby(group_by, sort=False)
    stats = {
        'mean': [],
        'median': [],
        'std': [],
        'count': [],
        'max': [],
        'min': []
    }
    exp_alphas = [0.1, 0.25, 0.3, 0.5, 0.75]
    stats.update({'exp_{}_mean'.format(alpha): [] for alpha in exp_alphas})
    for _, group in groups:
        shift = group[on].shift()
        roll = shift.rolling(window=len(group), min_periods=1)
        stats['mean'].extend(roll.mean())
        stats['median'].extend(roll.median())
        stats['std'].extend(roll.std())
        stats['count'].extend(roll.count())
        stats['max'].extend(roll.max())
        stats['min'].extend(roll.min())
        for alpha in exp_alphas:
            exp = shift.ewm(alpha=alpha, adjust=False)
            stats['exp_{}_mean'.format(alpha)].extend(exp.mean())
    suffix = '_&_'.join(group_by)
    for stat_name, values in stats.items():
        df['{}_{}_by_{}'.format(on, stat_name, suffix)] = values
extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'day_of_week']
)
extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'is_weekend']
)
extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id']
)
extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'day_of_week']
)
extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'is_weekend']
)
extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id']
)
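The rolling(window=len(group), min_periods=1) call above is effectively an expanding window over past values only; a minimal equivalent on toy data:

```python
import pandas as pd

s = pd.Series([4.0, 2.0, 6.0])
past = s.shift()  # exclude the current row, as in extract_precedent_statistics
exp_mean = past.expanding(min_periods=1).mean()
exp_max = past.expanding(min_periods=1).max()
print(exp_mean.tolist())  # [nan, 4.0, 3.0]
print(exp_max.tolist())   # [nan, 4.0, 4.0]
```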
data.sort_values(['air_store_id', 'visit_date']).head()
                    air_store_id  is_test  test_number  visit_date  visitors was_nil day_of_week  \
visit_date
2016-07-01  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0   False      Friday
2016-07-02  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0   False    Saturday
2016-07-03  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0    True      Sunday
2016-07-04  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0   False      Monday
2016-07-05  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0   False     Tuesday

            is_holiday  prev_day_is_holiday  next_day_is_holiday  ...  \
2016-07-01           0                  0.0                  0.0  ...
2016-07-02           0                  0.0                  0.0  ...
2016-07-03           0                  0.0                  0.0  ...
2016-07-04           0                  0.0                  0.0  ...
2016-07-05           0                  0.0                  0.0  ...

            visitors_capped_log1p_median_by_air_store_id  visitors_capped_log1p_std_by_air_store_id  \
2016-07-01                                           NaN                                        NaN
2016-07-02                                      3.583519                                        NaN
2016-07-03                                      2.943052                                   0.905757
2016-07-04                                      2.302585                                   1.815870
2016-07-05                                      2.673554                                   1.578354

            visitors_capped_log1p_count_by_air_store_id  visitors_capped_log1p_max_by_air_store_id  \
2016-07-01                                          0.0                                        NaN
2016-07-02                                          1.0                                   3.583519
2016-07-03                                          2.0                                   3.583519
2016-07-04                                          3.0                                   3.583519
2016-07-05                                          4.0                                   3.583519

            visitors_capped_log1p_min_by_air_store_id  visitors_capped_log1p_exp_0.1_mean_by_air_store_id  \
2016-07-01                                        NaN                                                 NaN
2016-07-02                                   3.583519                                            3.583519
2016-07-03                                   2.302585                                            3.455426
2016-07-04                                   0.000000                                            3.109883
2016-07-05                                   0.000000                                            3.103347

            visitors_capped_log1p_exp_0.25_mean_by_air_store_id  visitors_capped_log1p_exp_0.3_mean_by_air_store_id  \
2016-07-01                                                  NaN                                                 NaN
2016-07-02                                             3.583519                                            3.583519
2016-07-03                                             3.263285                                            3.199239
2016-07-04                                             2.447464                                            2.239467
2016-07-05                                             2.596729                                            2.480984

            visitors_capped_log1p_exp_0.5_mean_by_air_store_id  visitors_capped_log1p_exp_0.75_mean_by_air_store_id
2016-07-01                                                 NaN                                                  NaN
2016-07-02                                            3.583519                                             3.583519
2016-07-03                                            2.943052                                             2.622819
2016-07-04                                            1.471526                                             0.655705
2016-07-05                                            2.258024                                             2.447318
5 rows × 89 columns
(10) One-hot encode selected columns: pd.get_dummies(data, columns=[...])
data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])
data.head()
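A toy sketch of get_dummies: each category of the listed columns becomes its own 0/1 indicator column, and the untouched columns are kept.

```python
import pandas as pd

toy = pd.DataFrame({'day_of_week': ['Monday', 'Sunday'], 'x': [1, 2]})
dummies = pd.get_dummies(toy, columns=['day_of_week'])
print(sorted(dummies.columns))  # ['day_of_week_Monday', 'day_of_week_Sunday', 'x']
```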
Splitting the dataset
data['visitors_log1p'] = np.log1p(data['visitors'])
train = data[(data['is_test'] == False) & (data['is_outlier'] == False) & (data['was_nil'] == False)]
test = data[data['is_test']].sort_values('test_number')

to_drop = ['air_store_id', 'is_test', 'test_number', 'visit_date', 'was_nil',
           'is_outlier', 'visitors_capped', 'visitors',
           'air_area_name', 'latitude', 'longitude', 'visitors_capped_log1p']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')

X_train = train.drop('visitors_log1p', axis='columns')
X_test = test.drop('visitors_log1p', axis='columns')
y_train = train['visitors_log1p']
X_train.head()
            is_holiday  prev_day_is_holiday  next_day_is_holiday  is_weekend  day_of_month  \
visit_date
2016-07-15           0                  0.0                  0.0           0            15
2016-07-16           0                  0.0                  0.0           1            16
2016-07-19           0                  1.0                  0.0           0            19
2016-07-20           0                  0.0                  0.0           0            20
2016-07-21           0                  0.0                  0.0           0            21

            optimized_ewm_by_air_store_id_&_day_of_week  optimized_ewm_by_air_store_id_&_is_weekend  \
2016-07-15                                    35.000700                                   31.642520
2016-07-16                                     9.061831                                    8.618812
2016-07-19                                    24.841272                                   27.988385
2016-07-20                                    29.198575                                   27.675525
2016-07-21                                    32.710972                                   26.767268

            optimized_ewm_log1p_by_air_store_id_&_day_of_week  optimized_ewm_log1p_by_air_store_id_&_is_weekend  \
2016-07-15                                           3.588106                                          3.425707
2016-07-16                                           2.302603                                          2.003579
2016-07-19                                           3.252832                                          2.428565
2016-07-20                                           3.412813                                          2.667124
2016-07-21                                           3.537397                                          2.761626

            visitors_capped_mean_by_air_store_id_&_day_of_week  ...  air_genre_name_Dining bar  \
2016-07-15                                                38.5  ...                          0
2016-07-16                                                10.0  ...                          0
2016-07-19                                                24.5  ...                          0
2016-07-20                                                32.5  ...                          0
2016-07-21                                                31.0  ...                          0

            air_genre_name_International cuisine  air_genre_name_Italian/French  air_genre_name_Izakaya  \
2016-07-15                                     0                              1                       0
2016-07-16                                     0                              1                       0
2016-07-19                                     0                              1                       0
2016-07-20                                     0                              1                       0
2016-07-21                                     0                              1                       0

            air_genre_name_Japanese food  air_genre_name_Karaoke/Party  air_genre_name_Okonomiyaki/Monja/Teppanyaki  \
2016-07-15                             0                             0                                            0
2016-07-16                             0                             0                                            0
2016-07-19                             0                             0                                            0
2016-07-20                             0                             0                                            0
2016-07-21                             0                             0                                            0

            air_genre_name_Other  air_genre_name_Western food  air_genre_name_Yakiniku/Korean food
2016-07-15                     0                            0                                    0
2016-07-16                     0                            0                                    0
2016-07-19                     0                            0                                    0
2016-07-20                     0                            0                                    0
2016-07-21                     0                            0                                    0
5 rows × 96 columns
y_train.head()
visit_date
2016-07-15 3.367296
2016-07-16 1.791759
2016-07-19 3.258097
2016-07-20 2.995732
2016-07-21 3.871201
Name: visitors_log1p, dtype: float64
(11) Use assert statements to check that nothing is wrong
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
assert len(X_test) == 32019
(12) Modeling with lightgbm
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection

np.random.seed(42)

model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=25,
    learning_rate=0.007,
    n_estimators=1000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)

n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)
val_scores = [0] * n_splits

sub = submission['id'].to_frame()
sub['visitors'] = 0

feature_importances = pd.DataFrame(index=X_train.columns)

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )
    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['visitors'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_
    print('Fold {} RMSLE: {:.5f}'.format(i + 1, val_scores[i]))

sub['visitors'] /= n_splits
sub['visitors'] = np.expm1(sub['visitors'])

val_mean = np.mean(val_scores)
val_std = np.std(val_scores)
print('Local RMSLE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
Fold 1 RMSLE: 0.48936
Fold 2 RMSLE: 0.49091
Fold 3 RMSLE: 0.48654
Fold 4 RMSLE: 0.48831
Fold 5 RMSLE: 0.48788
Fold 6 RMSLE: 0.48706
Local RMSLE: 0.48834 (±0.00146)
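Since the model is trained on log1p(visitors), expm1 is what maps the averaged fold predictions back to visitor counts; the two functions are exact inverses, as a quick toy check shows.

```python
import numpy as np

y = np.array([0.0, 9.0, 35.0])          # hypothetical visitor counts
roundtrip = np.expm1(np.log1p(y))       # log1p then expm1 recovers the input
print(np.allclose(roundtrip, y))  # True
```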
Writing the results
sub.to_csv('result.csv', index=False)

import pandas as pd
df = pd.read_csv('result.csv')
df.head()
                                id   visitors
0  air_00a91d42b08b08d9_2017-04-23   4.340348
1  air_00a91d42b08b08d9_2017-04-24  22.739363
2  air_00a91d42b08b08d9_2017-04-25  29.535532
3  air_00a91d42b08b08d9_2017-04-26  29.319551
4  air_00a91d42b08b08d9_2017-04-27  31.838669
The code is partly adapted from: https://edu.aliyun.com/course/1915?spm=a2c6h.12873581.0.0.6d6c56815vyMWI