New-User Prediction Challenge: Annotated Baseline

Background: the data consist of roughly 620k training rows and 200k test rows, with 13 fields in total. uuid is the unique sample identifier; eid is the visit-behavior ID; udmap holds behavior attributes, where key1 through key9 represent different attributes such as project name and project id; common_ts is the time the app-access record occurred (a millisecond timestamp); the remaining fields x1 through x8 are anonymized user attributes. The target field is the prediction goal: whether the user is a new user.
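The two non-trivial raw fields can be previewed in isolation. A minimal sketch (not part of the original notebook; the sample values are taken from the data shown below) of parsing one udmap string and converting one millisecond timestamp:

```python
import ast
import pandas as pd

# udmap is a dict-like string (or the literal 'unknown')
udmap_raw = '{"key3":"67804","key2":"650"}'
attrs = ast.literal_eval(udmap_raw)  # safely parse the dict literal

# common_ts is a millisecond Unix timestamp
ts = pd.to_datetime(1689673468244, unit='ms')
```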

# Make Jupyter display the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

1. Import the required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

2. Function to read the CSV data files

def ReadData(path):
    train_data = pd.read_csv(path + 'train.csv')
    test_data = pd.read_csv(path + 'test.csv')
    return train_data,test_data
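The string concatenation in ReadData only works if `path` ends with a slash. A small variant (my sketch, not the author's code) using pathlib avoids that pitfall:

```python
from pathlib import Path
import pandas as pd

def read_data(path):
    # Path's / operator inserts the separator regardless of a trailing slash
    base = Path(path)
    train = pd.read_csv(base / 'train.csv')
    test = pd.read_csv(base / 'test.csv')
    return train, test
```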
2.1 A first look at the data
train_data, test_data = ReadData('用户新增预测挑战赛公开数据/')
train_data.head()
train_data.info()
train_data.describe()
[train_data.head() output: columns uuid, eid, udmap, common_ts, x1–x8, target. udmap is either a dict-like string such as {"key3":"67804","key2":"650"} or the literal 'unknown'; common_ts values are 13-digit millisecond timestamps such as 1689673468244.]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 620356 entries, 0 to 620355
Data columns (total 13 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   uuid       620356 non-null  int64 
 1   eid        620356 non-null  int64 
 2   udmap      620356 non-null  object
 3   common_ts  620356 non-null  int64 
 4   x1         620356 non-null  int64 
 5   x2         620356 non-null  int64 
 6   x3         620356 non-null  int64 
 7   x4         620356 non-null  int64 
 8   x5         620356 non-null  int64 
 9   x6         620356 non-null  int64 
 10  x7         620356 non-null  int64 
 11  x8         620356 non-null  int64 
 12  target     620356 non-null  int64 
dtypes: int64(12), object(1)
memory usage: 61.5+ MB
train_data.describe(), transposed for readability:

            count        mean          std           min           25%           50%           75%           max
uuid       620356   310177.500   179081.496         0.000    155088.750    310177.500    465266.250    620355.000
eid        620356       22.148       12.139         0.000        11.000        26.000        34.000        42.000
common_ts  620356  1.689317e+12  2.746865e+08  1.688382e+12  1.689088e+12  1.689377e+12  1.689563e+12  1.689696e+12
x1         620356        2.676        1.719         0.000         1.000         4.000         4.000         4.000
x2         620356        1.106        1.174         0.000         0.000         1.000         2.000         3.000
x3         620356       40.974        1.373         0.000        41.000        41.000        41.000        74.000
x4         620356       82.860       44.109         0.000        51.000        86.000       107.000       151.000
x5         620356      224.909      114.305         0.000       133.000       241.000       313.000       413.000
x6         620356        2.902        1.445         0.000         1.000         4.000         4.000         4.000
x7         620356        5.864        2.576         0.000         6.000         7.000         7.000         9.000
x8         620356        0.855        0.352         0.000         1.000         1.000         1.000         1.000
target     620356        0.141        0.348         0.000         0.000         0.000         0.000         1.000

3. Unpack the udmap column into key1–key9 columns ("one-hot" in the baseline's naming, though the raw key values are stored rather than 0/1 indicators)

# Feature engineering
## Note: the udmap values are in fact str data, not dict objects, so they must be parsed first with eval
# type(train_data['udmap'][1])  == type({})
# 3.1 the udmap_onethot() function
def udmap_onethot(d):
    v = np.zeros(9) # create a zero array of length 9
    if d == 'unknown':
        return v
    d = eval(d)
    for i in range(1,10):
        if 'key' + str(i) in d:
            v[i-1] = d['key' + str(i)]
    return v # return the resulting encoded array
# test: inspect the encoded array
udmap_onethot(train_data['udmap'][1])
array([    0.,   484., 67804.,     0.,     0.,     0.,     0.,     0.,
           0.])
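eval executes arbitrary Python, which is risky on strings you did not produce yourself. A safer variant of the function above (an alternative sketch, not what the notebook uses) swaps in ast.literal_eval, which only parses literals:

```python
import ast
import numpy as np

def udmap_unpack(d):
    """Like udmap_onethot, but parses the dict literal without executing code."""
    v = np.zeros(9)
    if d == 'unknown':
        return v
    d = ast.literal_eval(d)  # raises on anything that is not a plain literal
    for i in range(1, 10):
        if 'key' + str(i) in d:
            v[i - 1] = float(d['key' + str(i)])
    return v
```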
# # np.vstack() stacks the results into one array; np.hstack() is similar. Effect:
# arr1 = np.array([[1,2,3],[4,5,6]])
# arr2 = np.array([[7,8,9],[10,11,12]])
# np.hstack((arr1,arr2))
# np.vstack((arr1,arr2))
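The commented comparison above can be run directly; hstack places the arrays side by side while vstack stacks their rows:

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])
h = np.hstack((arr1, arr2))  # side by side -> shape (2, 6)
v = np.vstack((arr1, arr2))  # rows stacked -> shape (4, 3)
```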

train_udmap_df = pd.DataFrame(np.vstack(train_data['udmap'].apply(udmap_onethot)))
test_udmap_df = pd.DataFrame(np.vstack(test_data['udmap'].apply(udmap_onethot)))

# 3.2 name the new features; a DataFrame's columns attribute accepts a list
train_udmap_df.columns = ['key' + str(i) for i in range(1,10)]
test_udmap_df.columns = ['key' + str(i) for i in range(1,10)]
train_data = pd.concat([train_data,train_udmap_df],axis=1)
test_data = pd.concat([test_data,test_udmap_df],axis=1)
# train_data.head()

4. Flag whether udmap is missing

# Compare the udmap column against 'unknown' to get booleans, convert to int with astype, and assign to a new field
train_data['udmap_isknown'] = (train_data['udmap'] == 'unknown').astype(int)
test_data['udmap_isknown'] = (test_data['udmap'] == 'unknown').astype(int)
train_data.head()

# This completes the initial unpacking of the udmap behavior-attribute features
[train_data.head() after the concat: 5 rows × 23 columns — the original uuid…target fields plus the new key1–key9 and udmap_isknown columns. Rows whose udmap is 'unknown' have key1–key9 all 0.0 and udmap_isknown == 1; e.g. udmap {"key3":"67804","key2":"650"} yields key2 = 650.0, key3 = 67804.0, udmap_isknown == 0.]
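The flag construction above is just a boolean comparison cast to int; a toy illustration (my example values, not the real data):

```python
import pandas as pd

s = pd.Series(['unknown', '{"key1":"3"}', 'unknown'])
flag = (s == 'unknown').astype(int)  # True/False -> 1/0
```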

5. Extract the eid frequency feature (eid is the visit-behavior ID)

## 5.1 value_counts() returns the number of occurrences of each eid
# train_data['eid'].value_counts()
# useful for testing
# train_data['eid'].map(train_data['eid'].value_counts())
# train_data.tail()
## 5.2 use map to replace each sample's eid with its frequency count
train_data['eid_freq'] = train_data['eid'].map(train_data['eid'].value_counts())
test_data['eid_freq'] = test_data['eid'].map(train_data['eid'].value_counts())
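A toy illustration of the frequency feature: map() replaces each eid with how often it occurs in the training data. Note that a test eid never seen in training maps to NaN, so a fillna(0) (an extra precaution, not in the original baseline) may be needed:

```python
import pandas as pd

train_eid = pd.Series([26, 26, 8, 11])
test_eid = pd.Series([26, 99])            # 99 never appears in training
counts = train_eid.value_counts()         # 26 -> 2, 8 -> 1, 11 -> 1
train_freq = train_eid.map(counts)
test_freq = test_eid.map(counts).fillna(0)
```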

6. Extract the eid label (target-mean) feature

# # 6.1 groupby() returns the mean target of each group after grouping by eid
# train_data.groupby('eid')['target'].mean()
# 6.2 map the per-group mean back onto each row; same for the test set
# train_data
train_data['eid_mean'] = train_data['eid'].map(train_data.groupby('eid')['target'].mean())
test_data['eid_mean'] = test_data['eid'].map(train_data.groupby('eid')['target'].mean())
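A toy illustration of the target-mean feature. One caveat (mine, not raised in the original): computing the mean on the full training set and mapping it back onto the same rows lets each row see its own label, which can leak; out-of-fold encoding is the usual remedy.

```python
import pandas as pd

df = pd.DataFrame({'eid': [1, 1, 2, 2], 'target': [1, 0, 0, 0]})
means = df.groupby('eid')['target'].mean()  # eid 1 -> 0.5, eid 2 -> 0.0
df['eid_mean'] = df['eid'].map(means)
```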

7. Extract timestamp features

## 7.1 use pd.to_datetime() to convert the timestamp column to datetime
# ## quick pd.to_datetime() refresher
# datestrs = ['2023-08-23 23:12:00', '2022-09-01 12:00:00']
# pd.to_datetime(datestrs)
train_data['common_ts'] = pd.to_datetime(train_data['common_ts'],unit='ms')
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'],unit='ms')

## 7.2 use the dt.hour accessor to extract the hour from the datetime column into a new column
train_data['common_ts_hour'] = train_data['common_ts'].dt.hour
test_data['common_ts_hour'] = test_data['common_ts'].dt.hour
## 7.3 likewise extract the day; inspection shows all samples come from 2023-07, so day granularity suffices
train_data['common_ts_day'] = train_data['common_ts'].dt.day
test_data['common_ts_day'] = test_data['common_ts'].dt.day
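A minimal check of the conversion above on a single sample value from the data: a millisecond timestamp converted with unit='ms', then hour and day pulled out via the dt accessor:

```python
import pandas as pd

ts = pd.Series([1689673468244])       # sample common_ts value
dt = pd.to_datetime(ts, unit='ms')    # -> 2023-07-18 09:44:28.244
hour = dt.dt.hour[0]
day = dt.dt.day[0]
```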

8. Train a decision tree model

clf = DecisionTreeClassifier()
clf.fit(
    # 8.1 drop fields that are unneeded or invalid as features
    train_data.drop(['udmap','common_ts','uuid','target'],axis=1),
    # 8.2 train with the target field as the label
    train_data['target']
)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
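The baseline fits on the full training set with no validation. A hedged sketch (on synthetic data, not the competition files; this step is not part of the original notebook) of how the same model could be sanity-checked, e.g. with cross-validated F1, before submitting:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)  # synthetic binary target

# 5-fold cross-validated F1 for the same model class as the baseline
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=5, scoring='f1')
```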

9. Predict on the test set

# 9.1 create a DataFrame to store the results, with 'uuid' and 'target'
result_df = pd.DataFrame({
    'uuid':test_data['uuid'],
    'target':clf.predict(test_data.drop(['udmap','common_ts','uuid'],axis=1))
})

10. Save the results

# index=None: do not write the DataFrame index to the file
result_df.to_csv('submit.csv',index=None)