Restaurant Visitor Forecasting: Multi-Table Joins + LightGBM

A few thoughts:
1. Using pandas feels a lot like writing SQL: at its core it is create/read/update/delete, but once joins, grouping, and mixed data types get involved there are many tricks, and they only sink in through repeated cycles of learning and use.
2. When the features include a datetime column, you can group on it and construct new time-series features, for example:
(1) Is it a weekend?
(2) Which day of the month is it?
(3) Trend features
(4) Others
3. Code worth forking from this post:
(1) Detecting and capping outliers in numeric features;
(2) The exponentially weighted moving average, which captures the time trend;
(3) Rolling statistics over the time series.
4. Different machine-learning algorithms call for different feature engineering. KNN is insensitive to outliers and needs no special handling for them, while linear regression, SVM, and the like do; decision trees are insensitive to feature scale and need no normalization, while KNN needs it; XGBoost handles missing values (NaN) natively during training, whereas simpler algorithms require you to drop or impute them first. (XGBoost comes pre-installed in Baidu's PaddlePaddle environment, but newer libraries such as LightGBM and CatBoost must be reinstalled every time the environment is reset. Also, a question for any readers who happen by: is the free GPU that Paddle provides really a Tesla V100, as claimed? See Figure 1 — it doesn't feel that fast to me.)
[Figure 1: the GPU specs advertised by the Paddle environment (Tesla V100)]
5. Hardware really does shape one's confidence in machine-learning and deep-learning research: validating a single model can take hours (even in the high-spec Paddle environment, and Tianchi's environment always has a queue; MegEngine ("天元") announced its open-sourcing yesterday, though whether it will offer reliable free compute remains to be seen). The only workaround is to learn data-storage and data-manipulation tricks and compensate in software.
6. The three blades of machine-learning competitions: feature construction; modeling and tuning (log transforms; hyperparameter search via greedy methods, grid search, or Bayesian optimization); and model fusion (stacking, voting, random forest + bagging + AdaBoost). Model fusion really eats memory. A minimal voting sketch follows this list.
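
For the fusion step, here is a minimal voting sketch using scikit-learn's VotingRegressor; the toy dataset and hyperparameters are invented for illustration and are not part of this post's pipeline.

from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.model_selection import cross_val_score

# Toy regression data standing in for a real feature matrix.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=42)

# Voting simply averages the predictions of diverse base models.
ensemble = VotingRegressor(estimators=[
    ('lgbm', LGBMRegressor(n_estimators=200, random_state=42)),
    ('rf', RandomForestRegressor(n_estimators=200, random_state=42)),
])

scores = cross_val_score(ensemble, X, y, cv=3, scoring='neg_root_mean_squared_error')
print('CV RMSE: {:.3f}'.format(-scores.mean()))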

Restaurant visitor forecasting

Restaurant visit data

import pandas as pd

air_visit = pd.read_csv('air_visit_data.csv')
air_visit.head()
   air_store_id          visit_date  visitors
0  air_ba937bf13d40fb24  2016-01-13        25
1  air_ba937bf13d40fb24  2016-01-14        32
2  air_ba937bf13d40fb24  2016-01-15        29
3  air_ba937bf13d40fb24  2016-01-16        22
4  air_ba937bf13d40fb24  2016-01-18         6
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit.head()
            air_store_id          visit_date  visitors
visit_date
2016-01-13  air_ba937bf13d40fb24  2016-01-13        25
2016-01-14  air_ba937bf13d40fb24  2016-01-14        32
2016-01-15  air_ba937bf13d40fb24  2016-01-15        29
2016-01-16  air_ba937bf13d40fb24  2016-01-16        22
2016-01-18  air_ba937bf13d40fb24  2016-01-18         6

Resampling by day

(1) Resample the series to daily frequency with resample('1d').sum().

air_visit = air_visit.groupby('air_store_id').apply(lambda g: g['visitors'].resample('1d').sum()).reset_index()
air_visit.head()
   air_store_id          visit_date  visitors
0  air_00a91d42b08b08d9  2016-07-01        35
1  air_00a91d42b08b08d9  2016-07-02         9
2  air_00a91d42b08b08d9  2016-07-03         0
3  air_00a91d42b08b08d9  2016-07-04        20
4  air_00a91d42b08b08d9  2016-07-05        25
air_visit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296279 entries, 0 to 296278
Data columns (total 3 columns):
air_store_id    296279 non-null object
visit_date      296279 non-null datetime64[ns]
visitors        296279 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 6.8+ MB

Missing values are filled with 0.

(2) Normalize the date column with dt.strftime('%Y-%m-%d').

air_visit['visit_date'] = air_visit['visit_date'].dt.strftime('%Y-%m-%d')
air_visit['was_nil'] = air_visit['visitors'].isnull()
air_visit['visitors'].fillna(0, inplace=True)

air_visit.head()
   air_store_id          visit_date  visitors  was_nil
0  air_00a91d42b08b08d9  2016-07-01        35    False
1  air_00a91d42b08b08d9  2016-07-02         9    False
2  air_00a91d42b08b08d9  2016-07-03         0    False
3  air_00a91d42b08b08d9  2016-07-04        20    False
4  air_00a91d42b08b08d9  2016-07-05        25    False

Calendar data

date_info = pd.read_csv('date_info.csv')
date_info.head()
   calendar_date  day_of_week  holiday_flg
0  2016-01-01     Friday                 1
1  2016-01-02     Saturday               1
2  2016-01-03     Sunday                 1
3  2016-01-04     Monday                 0
4  2016-01-05     Tuesday                0

(3) shift() moves the data up or down, which lets us record whether the previous day and the next day are holidays.

date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)
date_info.head()
   visit_date  day_of_week  is_holiday  prev_day_is_holiday  next_day_is_holiday
0  2016-01-01  Friday                1                  0.0                  1.0
1  2016-01-02  Saturday              1                  1.0                  1.0
2  2016-01-03  Sunday                1                  1.0                  0.0
3  2016-01-04  Monday                0                  1.0                  0.0
4  2016-01-05  Tuesday               0                  0.0                  0.0

Store information

air_store_info = pd.read_csv('air_store_info.csv')

air_store_info.head()
   air_store_id          air_genre_name  air_area_name                  latitude   longitude
0  air_0f0cdeee6c9bf3d7  Italian/French  Hyōgo-ken Kōbe-shi Kumoidōri  34.695124  135.197852
1  air_7cc17a324ae5c7dc  Italian/French  Hyōgo-ken Kōbe-shi Kumoidōri  34.695124  135.197852
2  air_fee8dcf4d619598e  Italian/French  Hyōgo-ken Kōbe-shi Kumoidōri  34.695124  135.197852
3  air_a17f0778617c76e2  Italian/French  Hyōgo-ken Kōbe-shi Kumoidōri  34.695124  135.197852
4  air_83db5aff8f50478e  Italian/French  Tōkyō-to Minato-ku Shibakōen  35.658068  139.751599

Test set

(4) Slice string features with str.slice(start, stop).

import numpy as np

submission = pd.read_csv('sample_sub.csv')
submission['air_store_id'] = submission['id'].str.slice(0, 20)
submission['visit_date'] = submission['id'].str.slice(21)
submission['is_test'] = True  # flag column
submission['visitors'] = np.nan
submission['test_number'] = range(len(submission))

submission.head()
   id                               visitors  air_store_id          visit_date  is_test  test_number
0  air_00a91d42b08b08d9_2017-04-23       NaN  air_00a91d42b08b08d9  2017-04-23     True            0
1  air_00a91d42b08b08d9_2017-04-24       NaN  air_00a91d42b08b08d9  2017-04-24     True            1
2  air_00a91d42b08b08d9_2017-04-25       NaN  air_00a91d42b08b08d9  2017-04-25     True            2
3  air_00a91d42b08b08d9_2017-04-26       NaN  air_00a91d42b08b08d9  2017-04-26     True            3
4  air_00a91d42b08b08d9_2017-04-27       NaN  air_00a91d42b08b08d9  2017-04-27     True            4

Merging all the data

data = pd.concat((air_visit, submission.drop('id', axis='columns')), sort=True)  # sort=True keeps the sorted column order and silences pandas' FutureWarning
data.head()
   air_store_id          is_test  test_number  visit_date  visitors  was_nil
0  air_00a91d42b08b08d9      NaN          NaN  2016-07-01      35.0    False
1  air_00a91d42b08b08d9      NaN          NaN  2016-07-02       9.0    False
2  air_00a91d42b08b08d9      NaN          NaN  2016-07-03       0.0    False
3  air_00a91d42b08b08d9      NaN          NaN  2016-07-04      20.0    False
4  air_00a91d42b08b08d9      NaN          NaN  2016-07-05      25.0    False
data.shape
(328298, 6)
data.isnull().sum()
air_store_id         0
is_test         296279
test_number     296279
visit_date           0
visitors         32019
was_nil          32019
dtype: int64
data['is_test'].fillna(False, inplace=True)
data = pd.merge(left=data, right=date_info, on='visit_date', how='left')
data = pd.merge(left=data, right=air_store_info, on='air_store_id', how='left')
data['visitors'] = data['visitors'].astype(float)

data.head()
   air_store_id          is_test  test_number  visit_date  visitors  was_nil  day_of_week  is_holiday
0  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0    False  Friday                0
1  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0    False  Saturday              0
2  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0    False  Sunday                0
3  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0    False  Monday                0
4  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0    False  Tuesday               0

   prev_day_is_holiday  next_day_is_holiday  air_genre_name  air_area_name                    latitude   longitude
0                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
1                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
2                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
3                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
4                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
import missingno as msno
msno.bar(data)
<matplotlib.axes._subplots.AxesSubplot at 0x121fe9390>

[Figure: missingno bar chart showing per-column completeness of data]

Loading the weather data

import glob
import os

weather_dfs = []

for path in glob.glob('./Weather/*.csv'):
    weather_df = pd.read_csv(path)
    # Derive the station id from the file name; the original used
    # path.split('\\')[-1].rstrip('.csv'), which only splits Windows-style
    # paths (and rstrip strips a character set, not a suffix).
    weather_df['station_id'] = os.path.splitext(os.path.basename(path))[0]
    weather_dfs.append(weather_df)

weather = pd.concat(weather_dfs, axis='rows')
weather.rename(columns={'calendar_date': 'visit_date'}, inplace=True)

weather.head()
   visit_date  avg_temperature  high_temperature  low_temperature  precipitation  hours_sunlight
0  2016-01-01             20.5              22.4             17.5            0.0             0.6
1  2016-01-02             23.5              26.2             21.2            5.0             3.6
2  2016-01-03             21.7              23.7             20.2           11.0             0.0
3  2016-01-04             21.6              23.8             20.4           11.0             0.1
4  2016-01-05             22.1              24.6             20.5           35.5             0.0

   solar_radiation  deepest_snowfall  total_snowfall  avg_wind_speed  avg_vapor_pressure
0              NaN               NaN             NaN             6.3                 NaN
1              NaN               NaN             NaN             4.7                 NaN
2              NaN               NaN             NaN             2.8                 NaN
3              NaN               NaN             NaN             3.3                 NaN
4              NaN               NaN             NaN             2.4                 NaN

   avg_local_pressure  avg_humidity  avg_sea_pressure  cloud_cover  station_id
0                 NaN           NaN               NaN          NaN  okinawa__ohara-kana__oohara
1                 NaN           NaN               NaN          NaN  okinawa__ohara-kana__oohara
2                 NaN           NaN               NaN          NaN  okinawa__ohara-kana__oohara
3                 NaN           NaN               NaN          NaN  okinawa__ohara-kana__oohara
4                 NaN           NaN               NaN          NaN  okinawa__ohara-kana__oohara

Averaging the per-station data into global daily figures

(5) Group by one column and aggregate others: groupby(col)[['a', 'b']].mean().

means = weather.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
means.rename(columns={'avg_temperature': 'global_avg_temperature', 'precipitation': 'global_precipitation'}, inplace=True)
means.head()
   visit_date  global_avg_temperature  global_precipitation
0  2016-01-01                2.868353              0.564662
1  2016-01-02                5.279225              2.341998
2  2016-01-03                6.589978              1.750616
3  2016-01-04                5.857883              1.644946
4  2016-01-05                4.556850              3.193625
means.visit_date.nunique()
517
weather.visit_date.nunique()
517
weather = pd.merge(left=weather, right=means, on='visit_date', how='left')
weather['avg_temperature'].fillna(weather['global_avg_temperature'], inplace=True)
weather['precipitation'].fillna(weather['global_precipitation'], inplace=True)

weather[['visit_date', 'avg_temperature', 'precipitation']].head()
   visit_date  avg_temperature  precipitation
0  2016-01-01             20.5            0.0
1  2016-01-02             23.5            5.0
2  2016-01-03             21.7           11.0
3  2016-01-04             21.6           11.0
4  2016-01-05             22.1           35.5
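
Note that this post never actually joins the weather back onto data. If you wanted the global daily averages as features, a minimal sketch (my addition, not part of the original pipeline) could look like this; depending on the notebook state, the date dtypes may need aligning first.

# Hypothetical: join the global daily weather averages onto the main table.
means_dt = means.copy()
means_dt['visit_date'] = pd.to_datetime(means_dt['visit_date'])
data['visit_date'] = pd.to_datetime(data['visit_date'])
data = pd.merge(left=data, right=means_dt, on='visit_date', how='left')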

Data overview (note: this info() output was captured after a later cell had already added is_weekend, which is why it appears here)

data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 328298 entries, 2016-07-01 to 2017-05-31
Data columns (total 15 columns):
air_store_id           328298 non-null object
is_test                328298 non-null bool
test_number            32019 non-null float64
visit_date             328298 non-null datetime64[ns]
visitors               296279 non-null float64
was_nil                296279 non-null object
day_of_week            328298 non-null object
is_holiday             328298 non-null int64
prev_day_is_holiday    328298 non-null float64
next_day_is_holiday    328298 non-null float64
air_genre_name         328298 non-null object
air_area_name          328298 non-null object
latitude               328298 non-null float64
longitude              328298 non-null float64
is_weekend             328298 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(6), int64(2), object(5)
memory usage: 37.9+ MB
data.reset_index(drop=True, inplace=True)
data.sort_values(['air_store_id', 'visit_date'], inplace=True)

data.head()
   air_store_id          is_test  test_number  visit_date  visitors  was_nil  day_of_week  is_holiday
0  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0    False  Friday                0
1  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0    False  Saturday              0
2  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0    False  Sunday                0
3  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0    False  Monday                0
4  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0    False  Tuesday               0

   prev_day_is_holiday  next_day_is_holiday  air_genre_name  air_area_name                    latitude   longitude  is_weekend
0                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
1                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           1
2                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           1
3                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0
4                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595           0

(6) Outliers: the data contains some. Starting from a normality assumption, roughly 95% of values are treated as normal, hence the 1.96 cutoff (the two-sided 95% z-value; applied one-sided as below, it flags roughly the top 2.5% of each store's values). Flagged values are capped at the largest non-outlier value.

def find_outliers(series):
    # Flag values more than 1.96 standard deviations above the store's mean.
    return (series - series.mean()) > 1.96 * series.std()


def cap_values(series):
    # Cap flagged outliers at the largest non-outlier value.
    outliers = find_outliers(series)
    max_val = series[~outliers].max()
    series[outliers] = max_val
    return series


stores = data.groupby('air_store_id')
data['is_outlier'] = stores.apply(lambda g: find_outliers(g['visitors'])).values
data['visitors_capped'] = stores.apply(lambda g: cap_values(g['visitors'])).values
data['visitors_capped_log1p'] = np.log1p(data['visitors_capped'])

data.head()
   air_store_id          is_test  test_number  visit_date  visitors  was_nil  day_of_week  is_holiday
0  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0    False  Friday                0
1  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0    False  Saturday              0
2  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0    False  Sunday                0
3  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0    False  Monday                0
4  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0    False  Tuesday               0

   prev_day_is_holiday  next_day_is_holiday  air_genre_name  air_area_name                    latitude   longitude
0                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
1                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
2                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
3                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
4                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595

   is_weekend  is_outlier  visitors_capped  visitors_capped_log1p
0           0       False             35.0               3.583519
1           1       False              9.0               2.302585
2           1       False              0.0               0.000000
3           0       False             20.0               3.044522
4           0       False             25.0               3.258097
data.isnull().sum()
air_store_id                  0
is_test                       0
test_number              296279
visit_date                    0
visitors                  32019
was_nil                   32019
day_of_week                   0
is_holiday                    0
prev_day_is_holiday           0
next_day_is_holiday           0
air_genre_name                0
air_area_name                 0
latitude                      0
longitude                     0
is_weekend                    0
is_outlier                    0
visitors_capped           32019
visitors_capped_log1p     32019
dtype: int64

Date features

(7) Add two features: "is it a weekend?" and "day of the month".

data['is_weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
data['day_of_month'] = data['visit_date'].dt.day
data.head()
   air_store_id          is_test  test_number  visit_date  visitors  was_nil  day_of_week  is_holiday
0  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0    False  Friday                0
1  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0    False  Saturday              0
2  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0    False  Sunday                0
3  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0    False  Monday                0
4  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0    False  Tuesday               0

   prev_day_is_holiday  next_day_is_holiday  air_genre_name  air_area_name                    latitude   longitude
0                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
1                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
2                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
3                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595
4                  0.0                  0.0  Italian/French  Tōkyō-to Chiyoda-ku Kudanminami  35.694003  139.753595

   is_weekend  is_outlier  visitors_capped  visitors_capped_log1p  day_of_month
0           0       False             35.0               3.583519             1
1           1       False              9.0               2.302585             2
2           1       False              0.0               0.000000             3
3           0       False             20.0               3.044522             4
4           0       False             25.0               3.258097             5

(8) The Exponentially Weighted Moving Average (EWMA) captures the trend of the series. It requires an alpha; here we optimize to find the best one.

from scipy import optimize


def calc_shifted_ewm(series, alpha, adjust=True):
    # Shift by one so the average at each date uses only strictly earlier values.
    return series.shift().ewm(alpha=alpha, adjust=adjust).mean()


def find_best_signal(series, adjust=False, eps=10e-5):

    def f(alpha):
        # Mean squared error between the series and its shifted EWM;
        # differential evolution searches alpha within (0, 1).
        shifted_ewm = calc_shifted_ewm(series=series, alpha=min(max(alpha, 0), 1), adjust=adjust)
        corr = np.mean(np.power(series - shifted_ewm, 2))
        return corr

    res = optimize.differential_evolution(func=f, bounds=[(0 + eps, 1 - eps)])

    return calc_shifted_ewm(series=series, alpha=res['x'][0], adjust=adjust)


roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
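
For intuition: the shift() ensures that the EWM at each date is computed only from strictly earlier observations, so the feature cannot leak the current day's target. A toy example (values invented):

s = pd.Series([10, 12, 9, 30, 11])
print(s.shift().ewm(alpha=0.5, adjust=False).mean())
# 0     NaN
# 1    10.0
# 2    11.0
# 3    10.0
# 4    20.0
# dtype: float64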

(9) Extract as much time-series information as possible.

def extract_precedent_statistics(df, on, group_by):
    
    df.sort_values(group_by + ['visit_date'], inplace=True)
    
    groups = df.groupby(group_by, sort=False)
    
    stats = {
        'mean': [],
        'median': [],
        'std': [],
        'count': [],
        'max': [],
        'min': []
    }
    
    exp_alphas = [0.1, 0.25, 0.3, 0.5, 0.75]
    stats.update({'exp_{}_mean'.format(alpha): [] for alpha in exp_alphas})
    
    for _, group in groups:
        
        # Shift by one so each row's statistics use only strictly earlier
        # visits; rolling over a window the size of the whole group makes
        # this an expanding window over the past.
        shift = group[on].shift()
        roll = shift.rolling(window=len(group), min_periods=1)
        
        stats['mean'].extend(roll.mean())
        stats['median'].extend(roll.median())
        stats['std'].extend(roll.std())
        stats['count'].extend(roll.count())
        stats['max'].extend(roll.max())
        stats['min'].extend(roll.min())
        
        for alpha in exp_alphas:
            exp = shift.ewm(alpha=alpha, adjust=False)
            stats['exp_{}_mean'.format(alpha)].extend(exp.mean())
    
    suffix = '_&_'.join(group_by)
    
    for stat_name, values in stats.items():
        df['{}_{}_by_{}'.format(on, stat_name, suffix)] = values


extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id']
)
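
Incidentally, shift().rolling(window=len(group), min_periods=1) is an expanding window over strictly past values in disguise; shift().expanding(min_periods=1) would be equivalent. A toy check (values invented):

g = pd.Series([3, 1, 4, 1, 5])
print(g.shift().expanding(min_periods=1).mean())
# 0         NaN
# 1    3.000000
# 2    2.000000
# 3    2.666667
# 4    2.250000
# dtype: float64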

data.sort_values(['air_store_id', 'visit_date']).head()
            air_store_id          is_test  test_number  visit_date  visitors  was_nil  day_of_week  is_holiday  prev_day_is_holiday  next_day_is_holiday  ...
visit_date
2016-07-01  air_00a91d42b08b08d9    False          NaN  2016-07-01      35.0    False  Friday                0                  0.0                  0.0  ...
2016-07-02  air_00a91d42b08b08d9    False          NaN  2016-07-02       9.0    False  Saturday              0                  0.0                  0.0  ...
2016-07-03  air_00a91d42b08b08d9    False          NaN  2016-07-03       0.0     True  Sunday                0                  0.0                  0.0  ...
2016-07-04  air_00a91d42b08b08d9    False          NaN  2016-07-04      20.0    False  Monday                0                  0.0                  0.0  ...
2016-07-05  air_00a91d42b08b08d9    False          NaN  2016-07-05      25.0    False  Tuesday               0                  0.0                  0.0  ...

Last ten columns (all visitors_capped_log1p_*_by_air_store_id):

              median       std  count       max       min  exp_0.1_mean  exp_0.25_mean  exp_0.3_mean  exp_0.5_mean  exp_0.75_mean
visit_date
2016-07-01       NaN       NaN    0.0       NaN       NaN           NaN            NaN           NaN           NaN            NaN
2016-07-02  3.583519       NaN    1.0  3.583519  3.583519      3.583519       3.583519      3.583519      3.583519       3.583519
2016-07-03  2.943052  0.905757    2.0  3.583519  2.302585      3.455426       3.263285      3.199239      2.943052       2.622819
2016-07-04  2.302585  1.815870    3.0  3.583519  0.000000      3.109883       2.447464      2.239467      1.471526       0.655705
2016-07-05  2.673554  1.578354    4.0  3.583519  0.000000      3.103347       2.596729      2.480984      2.258024       2.447318

5 rows × 89 columns

(10) One-hot encode selected columns: data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name']).

data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])
data.head()

Train/test split

data['visitors_log1p'] = np.log1p(data['visitors'])
train = data[(data['is_test'] == False) & (data['is_outlier'] == False) & (data['was_nil'] == False)]
test = data[data['is_test']].sort_values('test_number')

to_drop = ['air_store_id', 'is_test', 'test_number', 'visit_date', 'was_nil',
           'is_outlier', 'visitors_capped', 'visitors',
           'air_area_name', 'latitude', 'longitude', 'visitors_capped_log1p']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')

X_train = train.drop('visitors_log1p', axis='columns')
X_test = test.drop('visitors_log1p', axis='columns')
y_train = train['visitors_log1p']

X_train.head()
            is_holiday  prev_day_is_holiday  next_day_is_holiday  is_weekend  day_of_month
visit_date
2016-07-15           0                  0.0                  0.0           0            15
2016-07-16           0                  0.0                  0.0           1            16
2016-07-19           0                  1.0                  0.0           0            19
2016-07-20           0                  0.0                  0.0           0            20
2016-07-21           0                  0.0                  0.0           0            21

Optimized-EWM columns (names abbreviated; store = air_store_id, dow = day_of_week, wkd = is_weekend):

            optimized_ewm_by_store_&_dow  optimized_ewm_by_store_&_wkd  optimized_ewm_log1p_by_store_&_dow  optimized_ewm_log1p_by_store_&_wkd  visitors_capped_mean_by_store_&_dow  ...
2016-07-15                     35.000700                     31.642520                            3.588106                            3.425707                                 38.5  ...
2016-07-16                      9.061831                      8.618812                            2.302603                            2.003579                                 10.0  ...
2016-07-19                     24.841272                     27.988385                            3.252832                            2.428565                                 24.5  ...
2016-07-20                     29.198575                     27.675525                            3.412813                            2.667124                                 32.5  ...
2016-07-21                     32.710972                     26.767268                            3.537397                            2.761626                                 31.0  ...

The last ten columns are the air_genre_name one-hot dummies; on all five rows only air_genre_name_Italian/French is 1, and the other nine (Dining bar, International cuisine, Izakaya, Japanese food, Karaoke/Party, Okonomiyaki/Monja/Teppanyaki, Other, Western food, Yakiniku/Korean food) are 0.

5 rows × 96 columns

y_train.head()
visit_date
2016-07-15    3.367296
2016-07-16    1.791759
2016-07-19    3.258097
2016-07-20    2.995732
2016-07-21    3.871201
Name: visitors_log1p, dtype: float64

(11) Use assert statements to check that nothing is still wrong.

assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
assert len(X_test) == 32019

(12) Modeling with LightGBM

import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection


np.random.seed(42)

model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=25,
    learning_rate=0.007,
    n_estimators=1000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)

n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)

val_scores = [0] * n_splits

sub = submission['id'].to_frame()
sub['visitors'] = 0

feature_importances = pd.DataFrame(index=X_train.columns)

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    
    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )
    
    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['visitors'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_
    
    print('Fold {} RMSLE: {:.5f}'.format(i+1, val_scores[i]))
    
sub['visitors'] /= n_splits
sub['visitors'] = np.expm1(sub['visitors'])

val_mean = np.mean(val_scores)
val_std = np.std(val_scores)

print('Local RMSLE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
Fold 1 RMSLE: 0.48936
Fold 2 RMSLE: 0.49091
Fold 3 RMSLE: 0.48654
Fold 4 RMSLE: 0.48831
Fold 5 RMSLE: 0.48788
Fold 6 RMSLE: 0.48706
Local RMSLE: 0.48834 (±0.00146)
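
The per-fold feature importances collected above are never displayed; one way to inspect them (a suggestion, not in the original code):

feature_importances['avg'] = feature_importances.mean(axis='columns')
print(feature_importances['avg'].sort_values(ascending=False).head(10))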

Writing out the results

sub.to_csv('result.csv', index=False)
import pandas as pd
df = pd.read_csv('result.csv')
df.head()
   id                               visitors
0  air_00a91d42b08b08d9_2017-04-23   4.340348
1  air_00a91d42b08b08d9_2017-04-24  22.739363
2  air_00a91d42b08b08d9_2017-04-25  29.535532
3  air_00a91d42b08b08d9_2017-04-26  29.319551
4  air_00a91d42b08b08d9_2017-04-27  31.838669

The code is partly based on:
https://edu.aliyun.com/course/1915?spm=a2c6h.12873581.0.0.6d6c56815vyMWI
