Kaggle Predict Future Sales interpretations

NYUfirstMID

Article link: https://blog.csdn.net/qq_39100624/article/details/97078942

1. Data explanation
1.1 item_categories

item_categories has two attributes: item_category_name and item_category_id.
1.2 items

items has three attributes: item_name, item_id and item_category_id.
1.3 shops

shops has two attributes: shop_name and shop_id.
1.4 train
   date        date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013  0               59       22154    999.0       1.0
1  03.01.2013  0               25       2552     899.0       1.0
2  05.01.2013  0               25       2552     899.0       -1.0
train has six attributes: date, date_block_num, shop_id, item_id, item_price and item_cnt_day. Note that item_cnt_day can be negative, as in row 2.
1.5 test
ID  shop_id  item_id
0   5        5037

test has two attributes, shop_id and item_id, and is indexed by ID.
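The sample rows suggest that date is written as day.month.year and that date_block_num is a running month index starting at 0 for January 2013. The sketch below checks that reading on the rows shown in sections 1.4 and 2.2. The formula is an inference from these few rows, not something confirmed elsewhere in this post.

```python
import pandas as pd

# Rows copied by hand from the tables in sections 1.4 and 2.2.
train_sample = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "05.01.2013", "15.05.2013"],
    "date_block_num": [0, 0, 0, 4],
    "shop_id": [59, 25, 25, 32],
    "item_id": [22154, 2552, 2552, 2973],
    "item_price": [999.0, 899.0, 899.0, -1.0],
    "item_cnt_day": [1.0, 1.0, -1.0, 1.0],
})

# An explicit format avoids any month/day confusion when parsing.
parsed = pd.to_datetime(train_sample["date"], format="%d.%m.%Y")

# Assumed relationship: number of months elapsed since January 2013.
month_index = (parsed.dt.year - 2013) * 12 + (parsed.dt.month - 1)

# True if the assumed formula reproduces date_block_num on every sample row.
matches = bool((month_index == train_sample["date_block_num"]).all())
```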
2.1 Data visualization

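import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of daily item counts, x-axis limited to [-100, 3000]
plt.figure(figsize=(10,4))
plt.xlim(-100, 3000)
sns.boxplot(x=train.item_cnt_day)
# Box plot of item prices, x-axis from the minimum to 110% of the maximum
plt.figure(figsize=(10,4))
plt.xlim(train.item_price.min(), train.item_price.max()*1.1)
sns.boxplot(x=train.item_price)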

The first plot shows the distribution of item_cnt_day: most data points are below 1000.
The second plot shows the distribution of item_price: most data points are below 100000.
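The same thresholds can be checked numerically instead of by eye. This is a small sketch on made-up numbers (toy_cnt is not real competition data, and count_above is a helper introduced only for illustration). It counts how many values lie above a cut-off and applies the same kind of filter used in section 2.2.

```python
import pandas as pd

def count_above(series: pd.Series, threshold: float) -> int:
    """Return how many values are strictly greater than threshold."""
    return int((series > threshold).sum())

# Toy stand-in for train.item_cnt_day; the values are invented.
toy_cnt = pd.Series([1.0, 1.0, -1.0, 2.0, 5.0, 1500.0])

# One value lies above the cut-off of 1000.
n_extreme = count_above(toy_cnt, 1000)

# Same style of filter as in section 2.2: keep rows below 1001.
filtered = toy_cnt[toy_cnt < 1001]
```

On the real data, calling count_above(train.item_cnt_day, 1000) before filtering shows how many rows the cut-off would drop.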

2.2 Data cleaning

# Drop the outliers seen in the box plots
train = train[train.item_price<100000]
train = train[train.item_cnt_day<1001]
# Look for rows with a negative price
print(train[train.item_price<0])

        date        date_block_num  shop_id  item_id  item_price  item_cnt_day
484683  15.05.2013  4               32       2973     -1.0        1.0

# Median positive price of the same item in the same shop and month
median = train[(train.shop_id==32)&(train.item_id==2973)&(train.date_block_num==4)&(train.item_price>0)].item_price.median()
train.loc[train.item_price<0, 'item_price'] = median

One row has a negative price, so we replace it with the median price of the same item (item_id 2973) in the same shop (shop_id 32) during the same month (date_block_num 4).
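The fix above hard-codes the shop, item and month of the single bad row. A more general version, sketched here on invented prices, replaces every negative price with the median positive price of its own (date_block_num, shop_id, item_id) group. fix_negative_prices is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def fix_negative_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Replace negative item_price values with the median positive price
    of the same (date_block_num, shop_id, item_id) group."""
    out = df.copy()
    keys = ["date_block_num", "shop_id", "item_id"]
    # Mask non-positive prices so they do not influence the median.
    positive = out["item_price"].where(out["item_price"] > 0)
    group_median = positive.groupby([out[k] for k in keys]).transform("median")
    bad = out["item_price"] < 0
    # If a group has no positive price at all, the result is NaN.
    out.loc[bad, "item_price"] = group_median[bad]
    return out

# Toy rows: same keys as the bad row above, but the prices are made up.
toy = pd.DataFrame({
    "date_block_num": [4, 4, 4, 4],
    "shop_id": [32, 32, 32, 32],
    "item_id": [2973, 2973, 2973, 1],
    "item_price": [100.0, -1.0, 300.0, 50.0],
})
fixed = fix_negative_prices(toy)
```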
2.3 Splitting the shop names and using LabelEncoder to turn the city names into numbers

from sklearn.preprocessing import LabelEncoder

# The first word of shop_name is taken as the city
shops['city'] = shops['shop_name'].str.split(' ').map(lambda x: x[0])
shops['city_code'] = LabelEncoder().fit_transform(shops['city'])
shops = shops[['shop_id','city_code']]

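Before the process:
   shop_name                       shop_id
0  !Якутск Орджоникидзе, 56 фран   0
1  !Якутск ТЦ “Центральный” фран   1
2  Адыгея ТЦ “Мега”                2
This is the result (both Yakutsk shops get code 29 while Адыгея gets 0, which suggests the leading "!" was removed from the city name before encoding, a step not shown in the snippet):

   shop_id  city_code
0  0        29
1  1        29
2  2        0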
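To see why the "!" matters, here is a small sketch on the three shop names shown above. It uses pandas category codes instead of LabelEncoder to stay dependency-free; both number the sorted unique values, so the ordering argument is the same. Stripping the "!" is an assumed cleaning step, and with only three shops the codes are 0 and 1 rather than 29.

```python
import pandas as pd

shops = pd.DataFrame({
    "shop_name": ['!Якутск Орджоникидзе, 56 фран',
                  '!Якутск ТЦ “Центральный” фран',
                  'Адыгея ТЦ “Мега”'],
    "shop_id": [0, 1, 2],
})

# Without cleaning, "!Якутск" sorts before every Cyrillic city name.
raw_city = shops["shop_name"].str.split(" ").str[0]
raw_codes = raw_city.astype("category").cat.codes.tolist()

# Assumed cleaning step: drop the leading "!" before encoding.
shops["city"] = raw_city.str.lstrip("!")
shops["city_code"] = shops["city"].astype("category").cat.codes

clean_codes = shops["city_code"].tolist()
```

After cleaning, Якутск sorts after Адыгея, which matches the ordering in the result table.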

2.4 Of the 5100 distinct item_id values in test, 363 never appear in train; test has 214200 rows.

len(set(test.item_id) - set(train.item_id)), test.item_id.nunique(), len(test)

(363, 5100, 214200)
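The same three numbers can be reproduced on a toy example, which also shows how to pull out the test rows whose item was never sold in train. The data below is invented, and unseen_rows is a name introduced here only for illustration.

```python
import pandas as pd

# Toy stand-ins for train.item_id and test; values are made up.
train_items = pd.Series([10, 11, 12, 12])
test_df = pd.DataFrame({
    "shop_id": [1, 1, 2, 2],
    "item_id": [11, 13, 13, 14],
})

test_item_ids = set(test_df["item_id"])
# Items that appear in test but never in train.
unseen = test_item_ids - set(train_items)

# Test rows that refer to such items.
unseen_rows = test_df[~test_df["item_id"].isin(train_items)]

# (unseen items, distinct test items, test rows)
summary = (len(unseen), len(test_item_ids), len(test_df))
```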

2.5 Adding the monthly total count, grouped by 'date_block_num', 'shop_id' and 'item_id'

group = train.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day': ['sum']})
group.columns = ['item_cnt_month']
group.reset_index(inplace=True)

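After merging group into matrix we have 4 attributes. item_cnt_month is the sum of item_cnt_day for each (date_block_num, shop_id, item_id) group, filled with 0 where there is no sales record and clipped to [0, 20].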

ts = time.time()
# matrix and cols are used here without being defined in this post
matrix = pd.merge(matrix, group, on=cols, how='left')
matrix['item_cnt_month'] = (matrix['item_cnt_month']
                                .fillna(0)
                                .clip(0,20) # NB clip target here
                                .astype(np.float16))
print(matrix.head(3))
time.time() - ts

   date_block_num  shop_id  item_id  item_cnt_month
0  0               2        19       0.0
1  0               2        27       1.0
2  0               2        28       0.0
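Since matrix and cols are not defined in this post, the following sketch shows one way the whole step could fit together on the three train rows from section 1.4. It assumes that cols holds the three key columns and that matrix contains every shop and item combination seen in a given month. Both are assumptions made for illustration.

```python
from itertools import product

import numpy as np
import pandas as pd

# The three train rows from section 1.4 (price column omitted).
train = pd.DataFrame({
    "date_block_num": [0, 0, 0],
    "shop_id": [59, 25, 25],
    "item_id": [22154, 2552, 2552],
    "item_cnt_day": [1.0, 1.0, -1.0],
})

cols = ["date_block_num", "shop_id", "item_id"]  # assumed key columns

# Assumed construction: all shop x item pairs active in each month.
rows = []
for month in sorted(train["date_block_num"].unique()):
    sales = train[train["date_block_num"] == month]
    rows.extend(product([month],
                        sales["shop_id"].unique(),
                        sales["item_id"].unique()))
matrix = pd.DataFrame(rows, columns=cols)

# Monthly total per (month, shop, item).
group = (train.groupby(cols, as_index=False)["item_cnt_day"].sum()
              .rename(columns={"item_cnt_day": "item_cnt_month"}))

matrix = pd.merge(matrix, group, on=cols, how="left")
matrix["item_cnt_month"] = (matrix["item_cnt_month"]
                            .fillna(0)     # pairs with no sales record
                            .clip(0, 20)   # same clipping as above
                            .astype(np.float16))
```

Shop 25 ends up with 0 for item 2552 because its +1 and -1 rows cancel, while shop 59 gets 0 for the same item because it has no record at all.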

2.6 Merging the matrix with shops, items and cats to get 8 attributes

ts = time.time()
matrix = pd.merge(matrix, shops, on=['shop_id'], how='left')
matrix = pd.merge(matrix, items, on=['item_id'], how='left')
matrix = pd.merge(matrix, cats, on=['item_category_id'], how='left')
matrix['city_code'] = matrix['city_code'].astype(np.int8)
matrix['item_category_id'] = matrix['item_category_id'].astype(np.int8)
matrix['type_code'] = matrix['type_code'].astype(np.int8)
matrix['subtype_code'] = matrix['subtype_code'].astype(np.int8)
time.time() - ts
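Casting to np.int8 saves memory, but it only makes sense when every value fits in the int8 range and the left merges found a match for every row. The guard below is a sketch that checks both conditions before casting. safe_downcast is a hypothetical helper and the sample values are invented.

```python
import numpy as np
import pandas as pd

def safe_downcast(series: pd.Series, dtype) -> pd.Series:
    """Cast to a smaller integer dtype only if no information is lost."""
    if series.isna().any():
        # A left merge leaves NaN where the key had no match.
        raise ValueError(f"{series.name}: contains missing values")
    info = np.iinfo(dtype)
    if series.min() < info.min or series.max() > info.max:
        raise ValueError(
            f"{series.name}: values do not fit in {np.dtype(dtype).name}")
    return series.astype(dtype)

# Toy column with small codes; values are made up.
codes = pd.Series([0, 29, 83], name="city_code")
small = safe_downcast(codes, np.int8)
```

In the notebook this could be used as matrix['city_code'] = safe_downcast(matrix['city_code'], np.int8), and likewise for the other three columns.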

2.8 Using these 8 attributes to train XGBoost
data=['date_block_num', 'shop_id', 'item_id', 'item_cnt_month', 'city_code', 'item_category_id', 'type_code', 'subtype_code']

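from xgboost import XGBRegressor

ts = time.time()

model = XGBRegressor(
    max_depth=8,
    n_estimators=10,
    min_child_weight=300,
    colsample_bytree=0.8,
    subsample=0.8,
    eta=0.3,
    seed=42)

# X_train, Y_train, X_valid and Y_valid are not defined in this post.
# Recent xgboost releases expect eval_metric and early_stopping_rounds
# in the XGBRegressor constructor rather than in fit().
model.fit(
    X_train,
    Y_train,
    eval_metric="rmse",
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)],
    verbose=True,
    early_stopping_rounds = 5)

time.time() - ts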
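The post does not show how X_train, Y_train, X_valid and Y_valid are built. Because the data is a time series, one common choice is to hold out the latest month for validation. The sketch below assumes that approach and that item_cnt_month is the target. split_by_month and rmse are hypothetical helpers, and the toy matrix is invented.

```python
import numpy as np
import pandas as pd

def split_by_month(matrix: pd.DataFrame, valid_month: int,
                   target: str = "item_cnt_month"):
    """Train on months before valid_month, validate on valid_month."""
    features = matrix.drop(columns=[target])
    is_train = matrix["date_block_num"] < valid_month
    is_valid = matrix["date_block_num"] == valid_month
    return (features[is_train], matrix.loc[is_train, target],
            features[is_valid], matrix.loc[is_valid, target])

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, the metric passed to XGBoost above."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy matrix with three months; values are made up.
toy_matrix = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2],
    "shop_id": [2, 2, 2, 2, 2],
    "item_id": [19, 27, 19, 27, 19],
    "item_cnt_month": [0.0, 1.0, 2.0, 0.0, 3.0],
})

X_train, Y_train, X_valid, Y_valid = split_by_month(toy_matrix, valid_month=2)
baseline_error = rmse([0.0, 2.0], [0.0, 0.0])
```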

First submission:
[screenshot not preserved]
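For completeness, here is a sketch of how a submission file could be assembled. It assumes the expected columns are ID (the index shown for test in section 1.5) and item_cnt_month, and it clips predictions to [0, 20] to match the clipping applied to the target in section 2.5. make_submission is a hypothetical helper and the predictions are invented.

```python
import io

import numpy as np
import pandas as pd

def make_submission(test_ids, predictions) -> pd.DataFrame:
    """Pair each test ID with a prediction clipped to [0, 20]."""
    preds = np.clip(np.asarray(predictions, dtype=np.float64), 0, 20)
    return pd.DataFrame({"ID": list(test_ids), "item_cnt_month": preds})

# Made-up predictions, including values outside the clipping range.
submission = make_submission([0, 1, 2], [0.4, -0.2, 25.0])

# Write to an in-memory buffer; a real run would pass a file path.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```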
