Datawhale AI Summer Camp, Machine Learning Track: Learning Diary (Task 1)

1. The Competition Task

        The competition provides a large volume of application data from the iFLYTEK Open Platform as training samples; participants need to build a model on these samples to predict whether users are newly added users.

        The data consist of roughly 620,000 training records and 200,000 test records, with 13 fields in total. uuid is the unique identifier of a sample; eid is the visit-behavior ID; udmap holds the behavior attributes, where key1 through key9 stand for different behavior attributes such as project name and project id; common_ts is the time at which the application visit occurred (a millisecond timestamp); the remaining fields x1 through x8 are anonymized user-related attributes. The target field is the prediction target, i.e., whether the user is a new user.

        The evaluation metric is f1_score.

2. Walkthrough of the Modified Code

        The summer camp already provides a Baseline that can be run directly. For Task 1, a correlation analysis was performed on each attribute of the dataset, and the provided Baseline was adjusted based on the results to improve its performance.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

        Import the packages that will be needed. pandas is used to load the csv tables, numpy to handle arrays, and sklearn is the machine-learning package; the three imported modules are used for the train/test split, decision-tree construction, and f1_score evaluation, respectively.

def udmap_onethot(d):
    v = np.zeros(9)  # create an all-zero array of length 9
    if d == 'unknown':
        return v  # 'unknown' means none of key1 to key9 is present; keep the array all zeros

    d = eval(d)  # parse the string into a dict
    for i in range(1, 10):  # iterate over key1 to key9
        if 'key' + str(i) in d:
            v[i - 1] = d['key' + str(i)]  # if the key exists, copy its value into the corresponding slot

    return v

        This function handles the udmap attribute. Its content is a string such as {"key3":"67804","key2":"650"}, which stores the key1-to-key9 data for that record; not every one of key1 to key9 is necessarily present, and when none of them is present the udmap attribute is simply 'unknown'. The function converts the input string into an array that can represent a feature vector.

        What the code does is explained in the comments. eval() is a built-in function that parses and executes a string as code; since the udmap attribute is stored as a string, it first has to be parsed into a JSON-like dict before the individual keys can be extracted.
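        Since the example string is valid JSON, a safer alternative to eval() is json.loads(). A minimal sketch, not part of the original Baseline (the helper name udmap_onehot_safe is my own), assuming the udmap strings all follow the format shown above:

import json
import numpy as np

def udmap_onehot_safe(d):
    # same logic as udmap_onethot above, but parsed with json.loads instead of eval
    v = np.zeros(9)
    if d == 'unknown':
        return v
    d = json.loads(d)  # e.g. '{"key3":"67804","key2":"650"}' -> {'key3': '67804', 'key2': '650'}
    for i in range(1, 10):
        if 'key' + str(i) in d:
            v[i - 1] = float(d['key' + str(i)])
    return v

print(udmap_onehot_safe('{"key3":"67804","key2":"650"}'))
# -> [0. 650. 67804. 0. 0. 0. 0. 0. 0.]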

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

        This loads the training set and the test set as train_data and test_data, respectively.

train_data['common_ts'] = pd.to_datetime(train_data['common_ts'], unit='ms')
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'], unit='ms')

train_udmap_df = pd.DataFrame(np.vstack(train_data['udmap'].apply(udmap_onethot)))
test_udmap_df = pd.DataFrame(np.vstack(test_data['udmap'].apply(udmap_onethot)))

train_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]
test_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]

train_data = pd.concat([train_data, train_udmap_df], axis=1)
test_data = pd.concat([test_data, test_udmap_df], axis=1)

train_data['eid_freq'] = train_data['eid'].map(train_data['eid'].value_counts())
test_data['eid_freq'] = test_data['eid'].map(train_data['eid'].value_counts())

train_data['eid_mean'] = train_data['eid'].map(train_data.groupby('eid')['target'].mean())
test_data['eid_mean'] = test_data['eid'].map(train_data.groupby('eid')['target'].mean())

train_data['udmap_isunknown'] = (train_data['udmap'] == 'unknown').astype(int)
test_data['udmap_isunknown'] = (test_data['udmap'] == 'unknown').astype(int)

train_data['common_ts_hour'] = train_data['common_ts'].dt.hour
test_data['common_ts_hour'] = test_data['common_ts'].dt.hour

        This block of code processes the attributes of the training and test sets; the following parts walk through, group by group, how each attribute is handled.

train_data['common_ts'] = pd.to_datetime(train_data['common_ts'], unit='ms')
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'], unit='ms')

        In this block, to_datetime() converts the timestamps in the given column (expressed in milliseconds) into datetime format; the common_ts column now holds the corresponding date and time instead of a raw timestamp. The unit='ms' argument specifies that the timestamps are in milliseconds.
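        As a quick illustration (with an arbitrary millisecond value, not taken from the dataset):

import pandas as pd

ts = pd.Series([1689084859000])          # a made-up millisecond timestamp
print(pd.to_datetime(ts, unit='ms'))
# 0   2023-07-11 14:14:19
# dtype: datetime64[ns]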

train_udmap_df = pd.DataFrame(np.vstack(train_data['udmap'].apply(udmap_onethot)))
test_udmap_df = pd.DataFrame(np.vstack(test_data['udmap'].apply(udmap_onethot)))

train_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]
test_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]

train_data = pd.concat([train_data, train_udmap_df], axis=1)
test_data = pd.concat([test_data, test_udmap_df], axis=1)

        This block uses the udmap_onethot function defined earlier to process the udmap column. apply() applies udmap_onethot to every entry of the 'udmap' column in train_data and test_data; np.vstack() stacks the resulting arrays vertically into one NumPy array, which pd.DataFrame() then turns into a DataFrame. As a result, train_udmap_df and test_udmap_df hold the expanded 'udmap' data.

        The column names (i.e., the attribute names) of train_udmap_df and test_udmap_df are then renamed to 'key1' through 'key9', which makes them easy to merge into the training set later.

        Finally, pd.concat() appends the two DataFrames train_udmap_df and test_udmap_df, i.e., the newly created attributes, to the training set and the test set respectively. Note that at this point the original udmap column is still present in both sets.
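        A toy illustration of the same expand-and-concat pattern, reusing the udmap_onethot function defined earlier (the values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'udmap': ['{"key2":"650"}', 'unknown']})
keys = pd.DataFrame(np.vstack(df['udmap'].apply(udmap_onethot)),
                    columns=['key' + str(i) for i in range(1, 10)])
df = pd.concat([df, keys], axis=1)
print(df[['udmap', 'key2']])
#             udmap   key2
# 0  {"key2":"650"}  650.0
# 1         unknown    0.0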

train_data['eid_freq'] = train_data['eid'].map(train_data['eid'].value_counts())
test_data['eid_freq'] = test_data['eid'].map(train_data['eid'].value_counts())

        This adds the eid_freq attribute, which records how often each record's eid value occurs. For example, if a row has eid 26 and the training set contains 10 rows whose eid is 26, that row's eid_freq is 10. value_counts() computes the frequency of each value in the 'eid' column, and .map() then maps each 'eid' value to its frequency.
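        A minimal sketch of this frequency-encoding pattern on made-up values:

import pandas as pd

s = pd.Series([26, 26, 7, 26])
print(s.map(s.value_counts()))   # -> 3, 3, 1, 3

        Note that the test set is mapped with the training set's counts, so an eid value that never appears in the training set would end up as NaN here.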

train_data['eid_mean'] = train_data['eid'].map(train_data.groupby('eid')['target'].mean())
test_data['eid_mean'] = test_data['eid'].map(train_data.groupby('eid')['target'].mean())

        This adds the eid_mean attribute, which records, for each value of eid, the mean of the target column. groupby('eid')['target'].mean() groups the data by the 'eid' column and computes the mean of 'target' within each group; map() then maps each 'eid' value to its group mean, producing the new 'eid_mean' column.
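        The same idea on a toy frame (made-up values):

import pandas as pd

df = pd.DataFrame({'eid': [26, 26, 7, 26], 'target': [1, 0, 1, 1]})
print(df['eid'].map(df.groupby('eid')['target'].mean()))
# eid 26 -> 0.666667, eid 7 -> 1.0

        One caveat worth keeping in mind: because this mean is computed over all of train_data before the local split, the validation f1_score reported later may be slightly optimistic.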

train_data['udmap_isunknown'] = (train_data['udmap'] == 'unknown').astype(int)
test_data['udmap_isunknown'] = (test_data['udmap'] == 'unknown').astype(int)

        This adds the udmap_isunknown attribute, which specifically flags records with no behavior attributes, i.e., records whose udmap value is 'unknown'.

train_data['common_ts_hour'] = train_data['common_ts'].dt.hour
test_data['common_ts_hour'] = test_data['common_ts'].dt.hour

        This adds the common_ts_hour attribute, which records the hour of day of each record's timestamp.

        The code above handles some unstructured attributes, extracts additional information from the existing ones, and adds several new attributes.

clf = DecisionTreeClassifier(criterion='entropy')
# split into training and validation sets at the given ratio
train_df, test_df = train_test_split(train_data, test_size=0.2, random_state=None)
# ,'udmap_isunknown','key3','eid_freq','key5'
clf.fit(
    train_df.drop(['udmap', 'common_ts', 'uuid', 'target','key8','key3','key2','key4','key5'], axis=1),
    train_df['target']
)  # train on the training split
test_y = clf.predict(
    test_df.drop(['udmap', 'common_ts', 'uuid', 'target','key8','key3','key2','key4','key5'], axis=1))  # ,'udmap_isunknown','key3','eid_freq','key5'
# f1_score of the predictions on the validation split
score = f1_score(y_true=test_df['target'], y_pred=test_y, average='macro')

        This block builds and trains the decision tree, and differs somewhat from the Baseline code. DecisionTreeClassifier from sklearn constructs the tree; criterion='entropy' makes the algorithm pick, at each node, the split that most reduces entropy (i.e., maximizes information gain). In practice this raised the f1_score on this dataset.

        train_test_split() is the sklearn function for splitting data. Here train_data is split 80/20 into a local training split and a local validation split. Because the target of train_data is known, the true labels of the validation split can be compared against the predictions to obtain an f1_score, which makes it possible to pick the better feature setup before predicting on the real test set, test_data.

        clf.fit() trains the decision tree. drop() removes some columns: 'udmap', 'common_ts', 'uuid' and 'target' are not usable as features, while 'key8', 'key3', 'key2', 'key4' and 'key5' were removed based on the correlation analysis; the details are given in the next section.

        The 'target' column is used as the label the model is trained against.

        clf.predict() produces the predictions, and f1_score() computes the f1_score on the local validation split.
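        Because random_state=None gives a different split every run, the local score fluctuates. A small sketch, not part of the Baseline, of a more repeatable local evaluation using 5-fold cross-validation:

from sklearn.model_selection import cross_val_score

drop_cols = ['udmap', 'common_ts', 'uuid', 'target',
             'key8', 'key3', 'key2', 'key4', 'key5']
X = train_data.drop(drop_cols, axis=1)
y = train_data['target']
# 5-fold cross-validated macro f1 for the same feature set
scores = cross_val_score(DecisionTreeClassifier(criterion='entropy'),
                         X, y, cv=5, scoring='f1_macro')
print(scores.mean())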

pd.DataFrame({
    'uuid': test_data['uuid'],
    'target': clf.predict(test_data.drop(['udmap', 'common_ts', 'uuid','key8','key3','key2','key4','key5'], axis=1))
}).to_csv('submit.csv', index=None)

        Finally, the model predicts the real test set test_data, and the result is written to a csv file.

3. Some Exploration: Correlation Analysis

        Before training the decision tree, a correlation analysis was performed; the results are summarized below:

        [Correlation matrix over the 23 features eid, x1-x8, target, key1-key9, eid_freq, eid_mean, udmap_isunknown and common_ts_hour; the full table is too wide to reproduce here. The values relevant to the discussion below include: x7-x8 ≈ 0.62, key2-key3 ≈ 0.60, key4-key5 ≈ 0.88, key3-eid_freq ≈ 0.56, key3-udmap_isunknown ≈ -0.71, target-eid_mean ≈ 0.30, target-x7 ≈ -0.20, target-x8 ≈ -0.13, while key8, x1 and common_ts_hour correlate with target at only about ±0.01.]

        From the matrix one can see that key2, key3, key4 and key5 are strongly correlated with many of the other input attributes, so keeping all of them risks redundant data and some can be dropped; key2, x1, common_ts_hour, key and key8 are only weakly correlated with target and can also be dropped.
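        The code used for this step is not shown in the post; a minimal sketch of how such a matrix can be produced with pandas (the output filename is my own choice) would be:

# Pearson correlation over the numeric engineered columns of train_data
corr = train_data.drop(['udmap', 'common_ts', 'uuid'], axis=1).corr()
print(corr['target'].sort_values())      # correlations with the prediction target
corr.to_csv('correlation_matrix.csv')    # dump the full matrix for inspection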

        However, dropping all of them at once made the result worse, so several different combinations of attributes were tried for removal; each combination was run multiple times and the best run kept (a sketch of such a search loop is given after the table and its discussion below).

Dropped attributes                    Best f1_score
key8                                  0.775
key8, key3                            0.781
key8, key3, key2                      0.788
key8, key3, key2, x1                  0.769
key8, key3, key2, key4                0.788
key8, key3, key2, key5                0.791
key8, key3, key2, key5, key4          0.796

        With the best removal scheme, the submission scored 0.64431 on the leaderboard, better than the original Baseline's score of roughly 0.62.
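        A sketch of the search loop described above (the candidate list is my own illustration; the post only reports the combinations in the table):

from itertools import combinations

base_drop = ['udmap', 'common_ts', 'uuid', 'target']
candidates = ['key8', 'key3', 'key2', 'key4', 'key5', 'x1']

best_score, best_drop = 0.0, None
for r in range(1, len(candidates) + 1):
    for extra in combinations(candidates, r):
        # fixed split so the combinations are compared on the same data
        tr, va = train_test_split(train_data, test_size=0.2, random_state=0)
        clf = DecisionTreeClassifier(criterion='entropy')
        clf.fit(tr.drop(base_drop + list(extra), axis=1), tr['target'])
        pred = clf.predict(va.drop(base_drop + list(extra), axis=1))
        score = f1_score(va['target'], pred, average='macro')
        if score > best_score:
            best_score, best_drop = score, extra

print(best_score, best_drop)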

4. A Failed Attempt: Neural-Network Prediction

import pandas as pd
import numpy as np
# import lightgbm as lgb
# from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import tensorflow as tf


def udmap_onethot(d):
    v = np.zeros(9)
    if d == 'unknown':
        return v

    d = eval(d)
    for i in range(1, 10):
        if 'key' + str(i) in d:
            v[i - 1] = d['key' + str(i)]

    return v


threshold = 0.5
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

train_data['common_ts'] = pd.to_datetime(train_data['common_ts'], unit='ms')
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'], unit='ms')

train_udmap_df = pd.DataFrame(np.vstack(train_data['udmap'].apply(udmap_onethot)))
test_udmap_df = pd.DataFrame(np.vstack(test_data['udmap'].apply(udmap_onethot)))

train_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]
test_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]

train_data = pd.concat([train_data, train_udmap_df], axis=1)
test_data = pd.concat([test_data, test_udmap_df], axis=1)

train_data['eid_freq'] = train_data['eid'].map(train_data['eid'].value_counts())
test_data['eid_freq'] = test_data['eid'].map(train_data['eid'].value_counts())

train_data['eid_mean'] = train_data['eid'].map(train_data.groupby('eid')['target'].mean())
test_data['eid_mean'] = test_data['eid'].map(train_data.groupby('eid')['target'].mean())

train_data['udmap_isunknown'] = (train_data['udmap'] == 'unknown').astype(int)
test_data['udmap_isunknown'] = (test_data['udmap'] == 'unknown').astype(int)

train_data['common_ts_hour'] = train_data['common_ts'].dt.hour
test_data['common_ts_hour'] = test_data['common_ts'].dt.hour

# pd.DataFrame(train_data.drop(['udmap', 'common_ts', 'uuid'], axis=1)).to_csv('submitd.csv', index=None)

# split into training and validation sets at the given ratio
train_df, test_df = train_test_split(train_data, test_size=0.2, random_state=None)

# build the neural-network model (train_data has 26 columns in total; 17 features remain after the drops below)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='sigmoid', input_shape=(17,)),
    tf.keras.layers.Dense(16, activation='sigmoid'),
    tf.keras.layers.Dense(8, activation='sigmoid'),
    tf.keras.layers.Dense(4, activation='sigmoid'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# compile the model
model.compile(optimizer='adam', loss='mse')
model.fit(
    train_df.drop(['udmap', 'common_ts', 'uuid', 'target','key8','key3','key2','key4','key5'], axis=1),
    train_df['target'], epochs=10
)  # train on the training split
test_y = model.predict(
    test_df.drop(['udmap', 'common_ts', 'uuid', 'target','key8','key3','key2','key4','key5'], axis=1))
# predictions on the validation split
# binarize the validation predictions at the chosen threshold
for i in range(0, len(test_y)):
    if test_y[i] > threshold:
        test_y[i] = 1
    else:
        test_y[i] = 0
# count how many validation predictions match the true labels (simple accuracy)
ok = 0
for i in range(0, len(test_y)):
    # print(test_y[i],test_df['target'][i])
    if float(test_y[i]) - float(test_df['target'].iloc[i]) == 0:
        ok = ok + 1
print(ok, ok / len(test_y))
# predict on the real test set and binarize with the same threshold
out = model.predict(
    test_data.drop(['udmap', 'common_ts', 'uuid','key8','key3','key2','key4','key5'], axis=1))
for i in range(0, len(out)):
    if out[i] > threshold:
        out[i][0] = 1
    else:
        out[i][0] = 0
yy = []
# macro f1_score on the validation split
score = f1_score(y_true=test_df['target'], y_pred=test_y, average='macro')
print(score)
# flatten the (n, 1) prediction array into a plain list for the submission file
for i in range(0, len(out)):
    yy.append(out[i][0])
pd.DataFrame({
    'uuid': test_data['uuid'],
    'target': yy
}).to_csv('net-submit.csv', index=None)

        Applying a neural network to this problem performed noticeably worse: the score was only 0.462.

5. Summary

        The current score is 0.64431. So far, some attributes weakly correlated with target have been removed, one or more attributes have been dropped from each group of strongly inter-correlated attributes, and one new attribute has been added for the decision tree. The next step is to combine this with the later lessons and mine more information out of the timestamp field, run anomaly detection on the training data to remove outliers as a preprocessing step, and possibly apply some dimensionality-reduction techniques.

6. Addendum (2023-08-17)

        A new attribute is extracted from each record's timestamp: the day of the week it falls on (Monday is 0, Tuesday is 1, and so on). The code is as follows:

train_data['week'] = train_data['common_ts'].dt.weekday
test_data['week'] = test_data['common_ts'].dt.weekday  # day-of-week feature

        After adding this attribute, the score improved substantially, to 0.7449.

 

 
