[Competitive Algorithm Learning] Digital China Innovation Contest, Smart Ocean Construction - Task 3: Feature Engineering

Smart Ocean Construction - Task 3: Feature Engineering

This part is the feature engineering module of the Smart Ocean competition. Feature engineering extracts as much useful information as possible from the raw data for algorithms and models to use. Put simply, it takes the original inputs X and creates new inputs X' that lead to better training and prediction results.

"Data and features determine the upper bound of machine learning; models and algorithms merely approximate that bound." — a common saying in the machine-learning community

Similarly, Andrew Ng has said: "Coming up with features is difficult, time-consuming, and requires expert knowledge. Applied machine learning is basically feature engineering."

Competition problem: Smart Ocean Construction

Goals of feature engineering:

  • Feature engineering is a broad topic and is widely regarded as a key ingredient of successful applied machine learning. Making the most of the data for predictive modelling is exactly the problem feature engineering tries to solve. "Actually, the success of all machine learning algorithms depends on how you present the data." "Feature engineering is a topic that hardly seems worth discussing in any paper or book, yet it plays a decisive role in whether machine learning succeeds. Much of the success of machine learning comes from engineering features that a learner can understand." — Scott Locklin, in "Neglected machine learning ideas"

  • The features in the data directly influence the model and the results you obtain: the better the features you select and prepare, the better the results. This is true but can also mislead, because the final result depends on many related factors, such as the data you can obtain, the features you prepare, and the model you choose.

  • Score boost! 😃 It is no exaggeration to say that in standard data-mining competitions, feature engineering is the distance between you and the topline.

Project repository: https://github.com/datawhalechina/team-learning-data-mining/tree/master/wisdomOcean

Competition page: https://tianchi.aliyun.com/competition/entrance/231768/introduction?spm=5176.12281957.1004.8.4ac63eafE1rwsY

Learning goals

  1. Learn the basic concepts of feature engineering

  2. Learn how topline solutions construct their feature engineering, and build meaningful features yourself

  3. Complete the corresponding check-in tasks

Contents

  1. Overview of feature engineering

  2. Competition-specific feature engineering

    • Domain features, built from prior knowledge of the problem
  3. Binning features

    • Binning features for v, x and y
    • Binning x and y and constructing grid regions
  4. DataFrame features

    • Count features
    • Shift (offset) features
    • Statistical features
  5. Embedding features

    • Word vectors with Word2vec
    • Topic distributions extracted with NMF
  6. Summary

Overview of feature engineering

Feature engineering can be roughly divided into three parts: feature construction, feature extraction, and feature selection.

  • Feature construction

"Mathematically speaking, feature engineering transforms the original data space into a new feature space, or in other words expresses the data in a different way, so that in the new feature space the model can learn the regularities in the data more easily. Feature extraction is therefore the process of transforming the raw data. Most models and algorithms require the input to be real-valued vectors of the same dimension, so feature engineering first has to turn the raw data into real-valued vectors."
Its main contents include:

+ Exploratory data analysis
+ Numerical features
+ Categorical features
+ Time features
+ Text features
  • Feature extraction and feature selection

Feature extraction and feature selection sound similar, but feature extraction derives a set of features with clear physical or statistical meaning through transformations, whereas feature selection picks such features directly out of an existing feature set.

Unlike feature extraction, which constructs new features from the raw data, feature selection chooses a subset of an existing feature set. It is very important for applied machine learning. Feature selection, also called attribute selection or variable selection, is the process of selecting a subset of relevant features for model building. It has the following three goals (a small scikit-learn sketch follows this list).

+ Simplify the model so that it is easier for researchers and users to understand. Interpretability not only gives us more confidence in the stability of the model's behaviour, it also provides guidance and decision support for business operations.

+ Improve performance. Feature selection also saves storage and computation.

+ Improve generalization and reduce the risk of overfitting. More features greatly enlarge the model's search space, and for most models the number of training samples needed grows markedly with the number of features; adding features can fit the training data better, but may also increase variance.
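As a tiny, self-contained illustration of filter-style feature selection (a minimal sketch on synthetic data using scikit-learn's SelectKBest; it is not part of the competition code):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 200 samples with 10 features, of which only 3 actually carry signal
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Keep the 3 features with the highest ANOVA F-score against the label
selector = SelectKBest(f_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected columns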

————————————————————————————————————————————————————————————————————

Note: this notebook focuses on the feature-engineering techniques used in topline solutions; their actual effect has to be evaluated by training a model and scoring its predictions.

————————————————————————————————————————————————————————————————————

Importing the required libraries and data

Supplement:
Installing geopandas (imported below) can run into problems; the following blog post may help:

https://qianni1997.github.io/2019/07/26/geopandas-install/

import gc
import multiprocessing as mp
import os
import pickle
import time
import warnings
from collections import Counter
from copy import deepcopy
from datetime import datetime
from functools import partial
from glob import glob

import geopandas as gpd
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from gensim.models import FastText, Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pyproj import Proj
from scipy import sparse
from scipy.sparse import csr_matrix
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

os.environ['PYTHONHASHSEED'] = '0'
warnings.filterwarnings('ignore')
# Collect rows in a list instead of appending to a DataFrame repeatedly, which is much faster
def get_data(file_path,max_lines = 2000):
    paths = os.listdir(file_path)
    tmp = []
    for t in tqdm(range(len(paths))):
        if len(tmp) > max_lines:break
            
        p = paths[t]
        with open('{}/{}'.format(file_path, p), encoding='utf-8') as f:
            next(f)
            for line in f.readlines():
                tmp.append(line.strip().split(','))
                if len(tmp) > max_lines:break
                    
    tmp_df = pd.DataFrame(tmp)
    tmp_df.columns = ['渔船ID', 'x', 'y', '速度', '方向', 'time', 'type']
    return tmp_df

TRAIN_PATH = "../input/hy_round1_train_20200102/"
# number of rows to sample
max_lines = 2000
df = get_data(TRAIN_PATH,max_lines=max_lines)
# Basic preprocessing
label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}
label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}
name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label'}

df.rename(columns = name_dict, inplace = True)
df['label'] = df['label'].map(label_dict1)
cols = ['x','y','v']
for col in cols:
    df[col] = df[col].astype('float')
df['dir'] = df['dir'].astype('int')
df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')
df['date'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour
df['month'] = df['time'].dt.month
df['weekday'] = df['time'].dt.weekday
df.head()
   id             x             y     v  dir                 time  label        date  hour  month  weekday
0   0  6.152038e+06  5.124873e+06  2.59  102  1900-11-10 11:58:19      0  1900-11-10    11     11        5
1   0  6.151230e+06  5.125218e+06  2.70  113  1900-11-10 11:48:19      0  1900-11-10    11     11        5
2   0  6.150421e+06  5.125563e+06  2.70  116  1900-11-10 11:38:19      0  1900-11-10    11     11        5
3   0  6.149612e+06  5.125907e+06  3.29   95  1900-11-10 11:28:19      0  1900-11-10    11     11        5
4   0  6.148803e+06  5.126252e+06  3.18  108  1900-11-10 11:18:19      0  1900-11-10    11     11        5

Column descriptions:

- id: vessel ID, integer
- x: recorded x coordinate, float
- y: recorded y coordinate, float
- v: recorded speed, float
- dir: recorded heading, integer
- time: timestamp, text
- label: the target to predict, integer

Competition-specific feature engineering

Distance from each point's (x, y) coordinates to a fixed reference point (6165599, 5202660)

df['x_dis_diff'] = (df['x'] - 6165599).abs()
df['y_dis_diff'] = (df['y'] - 5202660).abs()
df['base_dis_diff'] = ((df['x_dis_diff']**2)+(df['y_dis_diff']**2))**0.5    
del df['x_dis_diff'],df['y_dis_diff'] 
df['base_dis_diff'].head()
0    78959.780945
1    78763.845006
2    78577.185266
3    78399.867568
4    78231.955018
Name: base_dis_diff, dtype: float64

Split the hour of day into daytime and nighttime: hours strictly between 5 and 20 (i.e. 6-19) are daytime (1), the rest are nighttime (0)

df['day_nig'] = 0
df.loc[(df['hour'] > 5) & (df['hour'] < 20),'day_nig'] = 1
df['day_nig'].head()
0    1
1    1
2    1
3    1
4    1
Name: day_nig, dtype: int64

Map months to quarters

# quarter of the year
df['quarter'] = 0
df.loc[(df['month'].isin([1, 2, 3])), 'quarter'] = 1
df.loc[(df['month'].isin([4, 5, 6, ])), 'quarter'] = 2
df.loc[(df['month'].isin([7, 8, 9])), 'quarter'] = 3
df.loc[(df['month'].isin([10, 11, 12])), 'quarter'] = 4

Dynamic features: speed levels, speed change, heading change, x-y similarity, and so on

temp = df.copy()
temp.rename(columns={'id':'ship','dir':'d'},inplace=True)

# Map speed to a discrete level
def v_cut(v):
    if v < 0.1:
        return 0
    elif v < 0.5:
        return 1
    elif v < 1:
        return 2
    elif v < 2.5:
        return 3
    elif v < 5:
        return 4
    elif v < 10:
        return 5
    elif v < 20:
        return 6
    else:
        return 7
# Count the number of records in each speed level for every ship
def get_v_fea(df):

    df['v_cut'] = df['v'].apply(lambda x: v_cut(x))
    tmp = df.groupby(['ship', 'v_cut'], as_index=False)['v_cut'].agg({'v_cut_count': 'count'})
    # Pivot to one row per ship, one column per speed level
    tmp = tmp.pivot(index='ship', columns='v_cut', values='v_cut_count')

    new_col_nm = ['v_cut_' + str(col) for col in tmp.columns.tolist()]
    tmp.columns = new_col_nm
    tmp = tmp.reset_index()  # move ship from the index back into a column

    return tmp

c1 = get_v_fea(temp)
# Quantize the heading into 16 sectors of 22.5° each
def add_direction(df):
    df['d16'] = df['d'].apply(lambda x: int((x / 22.5) + 0.5) % 16 if not np.isnan(x) else np.nan)
    return df
def get_d_cut_count_fea(df):
    df = add_direction(df)
    tmp = df.groupby(['ship', 'd16'], as_index=False)['d16'].agg({'d16_count': 'count'})
    tmp = tmp.pivot(index='ship', columns='d16', values='d16_count')
    new_col_nm = ['d16_' + str(col) for col in tmp.columns.tolist()]
    tmp.columns = new_col_nm
    tmp = tmp.reset_index()
    return tmp

c2 = get_d_cut_count_fea(temp)
def get_v0_fea(df):
    # Count records with zero speed, plus statistics of the non-zero speeds
    df_zero_count = df.query("v==0")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(
        {'num_zero_v': 'count'})
    df_not_zero_agg = df.query("v!=0")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(
        {'v_max_drop_0': 'max',
         'v_min_drop_0': 'min',
         'v_mean_drop_0': 'mean',
         'v_std_drop_0': 'std',
         'v_median_drop_0': 'median',
         'v_skew_drop_0': 'skew'})
    tmp = df_zero_count.merge(df_not_zero_agg, on='ship', how='left')

    return tmp

c3 = get_v0_fea(temp)
def get_percentiles_fea(df_raw):
    key = ['x', 'y', 'v', 'd']
    temp = df_raw[['ship']].drop_duplicates('ship')
    for i in range(len(key)):
        # Add the median and various other percentiles of x, y, v and d
        tmp_dscb = df_raw.groupby('ship')[key[i]].describe(
            percentiles=[0.05] + [ii / 1000 for ii in range(125, 1000, 125)] + [0.95])
        raw_col_nm = tmp_dscb.columns.tolist()
        new_col_nm = [key[i] + '_' + col for col in raw_col_nm]
        tmp_dscb.columns = new_col_nm
        tmp_dscb = tmp_dscb.reset_index()
        # Drop the redundant describe() statistics (count/mean/std/min/max)
        tmp_dscb = tmp_dscb.drop([f'{key[i]}_count', f'{key[i]}_mean', f'{key[i]}_std',
                                  f'{key[i]}_min', f'{key[i]}_max'], axis=1)

        temp = temp.merge(tmp_dscb, on='ship', how='left')
    return temp

c4 = get_percentiles_fea(temp)
def get_d_change_rate_fea(df):
    import math
    import time
    temp = df.copy()
    # Sort by ship and time
    temp.sort_values(['ship', 'time'], ascending=True, inplace=True)
    # Use shift to line up each record with its neighbour; note the difference between .shift(-1) and .shift(1)
    temp['timenext'] = temp.groupby('ship')['time'].shift(-1)
    temp['ynext'] = temp.groupby('ship')['y'].shift(-1)
    temp['xnext'] = temp.groupby('ship')['x'].shift(-1)
    # shift(-1) leaves a NaN at the end of each ship's track (there is no next record),
    # so fill those gaps forward
    temp['ynext'] = temp['ynext'].fillna(method='ffill')
    temp['xnext'] = temp['xnext'].fillna(method='ffill')
    # Slope of the segment towards the next point: ynext/xnext are the coordinates of the next record, so subtract y and x to get the displacement
    temp['angle_next'] = (temp['ynext'] - temp['y']) / (temp['xnext'] - temp['x'])
    temp['angle_next'] = np.arctan(temp['angle_next']) / math.pi * 180
    temp['angle_next_next'] = temp['angle_next'].shift(-1)
    temp['timediff'] = np.abs(temp['timenext'] - temp['time'])
    temp['timediff'] = temp['timediff'].fillna(method='ffill')
    temp['hc_xy'] = abs(temp['angle_next_next'] - temp['angle_next'])
    # Where the heading change exceeds 180°, replace it with 360° minus the value (take the smaller turning angle)
    temp.loc[temp['hc_xy'] > 180, 'hc_xy'] = (360 - temp.loc[temp['hc_xy'] > 180, 'hc_xy'])
    temp['hc_xy_s'] = temp.apply(lambda x: x['hc_xy'] / x['timediff'].total_seconds(), axis=1)

    temp['d_next'] = temp.groupby('ship')['d'].shift(-1)
    temp['hc_d'] = abs(temp['d_next'] - temp['d'])
    temp.loc[temp['hc_d'] > 180, 'hc_d'] = 360 - temp.loc[temp['hc_d'] > 180, 'hc_d']
    temp['hc_d_s'] = temp.apply(lambda x: x['hc_d'] / x['timediff'].total_seconds(), axis=1)

    temp1 = temp[['ship', 'hc_xy_s', 'hc_d_s']]
    xy_d_rate = temp1.groupby('ship')['hc_xy_s'].agg({'hc_xy_s_max': 'max',
                                                      })
    xy_d_rate = xy_d_rate.reset_index()
    d_d_rate = temp1.groupby('ship')['hc_d_s'].agg({'hc_d_s_max': 'max',
                                                    })
    d_d_rate = d_d_rate.reset_index()

    tmp = xy_d_rate.merge(d_d_rate, on='ship', how='left')
    return tmp

c5 = get_d_change_rate_fea(temp)
f1 = temp.merge(c1,on='ship',how='left')
f1 = f1.merge(c2,on='ship',how='left')
f1 = f1.merge(c3,on='ship',how='left')
f1 = f1.merge(c4,on='ship',how='left')
f1 = f1.merge(c5,on='ship',how='left')

Binning features

Binning features for v, x and y

pre_cols = df.columns

df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop') # equal-frequency binning of speed into 200 quantile bins
df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique())))) # encode each bin as an integer
for f in ['x', 'y']:
    df[f + '_bin1'] = pd.qcut(df[f], 1000, duplicates='drop') # 1000 quantile bins for x / y
    df[f + '_bin1'] = df[f + '_bin1'].map(dict(zip(df[f + '_bin1'].unique(), range(df[f + '_bin1'].nunique())))) # integer-encode the bins
    df[f + '_bin2'] = df[f] // 10000 # coarse grid cell: floor-divide the coordinate by 10000
    df[f + '_bin1_count'] = df[f + '_bin1'].map(df[f + '_bin1'].value_counts()) # number of records in each bin1
    df[f + '_bin2_count'] = df[f + '_bin2'].map(df[f + '_bin2'].value_counts()) # number of records in each bin2
    df[f + '_bin1_id_nunique'] = df.groupby(f + '_bin1')['id'].transform('nunique') # number of distinct ships in each bin1
    df[f + '_bin2_id_nunique'] = df.groupby(f + '_bin2')['id'].transform('nunique') # number of distinct ships in each bin2
for i in [1, 2]:
    # Cross x_bin and y_bin into a combined category and map each category's record count back onto the rows
    df['x_y_bin{}'.format(i)] = df['x_bin{}'.format(i)].astype('str') + '_' + df['y_bin{}'.format(i)].astype('str')
    df['x_y_bin{}'.format(i)] = df['x_y_bin{}'.format(i)].map(
        dict(zip(df['x_y_bin{}'.format(i)].unique(), range(df['x_y_bin{}'.format(i)].nunique())))
    )
    df['x_bin{}_y_bin{}_count'.format(i, i)] = df['x_y_bin{}'.format(i)].map(df['x_y_bin{}'.format(i)].value_counts())
for stat in ['max', 'min']:
    # Offset of y from the max/min y within the same x_bin1, and of x from the max/min x within the same y_bin1
    df['x_y_{}'.format(stat)] = df['y'] - df.groupby('x_bin1')['y'].transform(stat)
    df['y_x_{}'.format(stat)] = df['x'] - df.groupby('y_bin1')['x'].transform(stat)

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
(output omitted: the first 5 rows of the new binning features, 5 rows × 21 columns, from v_bin and the x/y bin counts through x_y_max, y_x_max, x_y_min and y_x_min)

Bin x and y and construct grid regions

def traj_to_bin(traj=None, x_min=12031967.16239096, x_max=14226964.881853,
                y_min=1623579.449434373, y_max=4689471.1780792,
                row_bins=4380, col_bins=3136):

    # Establish bins on x direction and y direction
    x_bins = np.linspace(x_min, x_max, endpoint=True, num=col_bins + 1)
    y_bins = np.linspace(y_min, y_max, endpoint=True, num=row_bins + 1)

    # Determine which bin each x coordinate belongs to
    traj.sort_values(by='x', inplace=True)
    x_res = np.zeros((len(traj), ))
    j = 0
    for i in range(1, col_bins + 1):
        low, high = x_bins[i-1], x_bins[i]
        while( j < len(traj)):
            # low - 0.001 for numerical stability.
            if (traj["x"].iloc[j] <= high) & (traj["x"].iloc[j] > low - 0.001):
                x_res[j] = i
                j += 1
            else:
                break
    traj["x_grid"] = x_res
    traj["x_grid"] = traj["x_grid"].astype(int)
    traj["x_grid"] = traj["x_grid"].apply(str)

    # Determine which bin each y coordinate belongs to
    traj.sort_values(by='y', inplace=True)
    y_res = np.zeros((len(traj), ))
    j = 0
    for i in range(1, row_bins + 1):
        low, high = y_bins[i-1], y_bins[i]
        while( j < len(traj)):
            # low - 0.001 for numerical stability.
            if (traj["y"].iloc[j] <= high) & (traj["y"].iloc[j] > low - 0.001):
                y_res[j] = i
                j += 1
            else:
                break
    traj["y_grid"] = y_res
    traj["y_grid"] = traj["y_grid"].astype(int)
    traj["y_grid"] = traj["y_grid"].apply(str)

    # Determine which bin each coordinate belongs to.
    traj["no_bin"] = [i + "_" + j for i, j in zip(
        traj["x_grid"].values.tolist(), traj["y_grid"].values.tolist())]
    traj.sort_values(by='time', inplace=True)
    return traj

bin_size = 800
col_bins = int((14226964.881853 - 12031967.16239096) / bin_size)
row_bins = int((4689471.1780792 - 1623579.449434373) / bin_size)
pre_cols = df.columns
# New features: x_grid, y_grid, no_bin (note: this call uses the function's default row_bins/col_bins, not the bin_size-derived values computed above)
df = traj_to_bin(df)

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols]
(output: x_grid, y_grid and no_bin for all 2001 sampled rows; with this small sample every point falls into grid cell 0_0)

DataFrame features

Count features

def find_save_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
    """Find and save the visit frequency of each bin."""
    visit_count_df = traj_data_df.groupby(["no_bin"]).count().reset_index()
    visit_count_df = visit_count_df[["no_bin", "x"]]
    visit_count_df.rename({"x":"visit_count"}, axis=1, inplace=True)
    return visit_count_df

def find_save_unique_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
    """Find and save the unique boat visit count of each bin."""
    unique_boat_count_df = traj_data_df.groupby(["no_bin"])["id"].nunique().reset_index()
    unique_boat_count_df.rename({"id":"visit_boat_count"}, axis=1, inplace=True)

    unique_boat_count_df_save = pd.merge(bin_to_coord_df, unique_boat_count_df,
                                         on="no_bin", how="left")
    return unique_boat_count_df

traj_df = df[["id","x", "y",'time',"no_bin"]]
bin_to_coord_df = traj_df.groupby(["no_bin"]).median().reset_index()
pre_cols = df.columns

# DataFrame tmp for finding POIs
visit_count_df = find_save_visit_count_table(
    traj_df, bin_to_coord_df)
unique_boat_count_df = find_save_unique_visit_count_table(
    traj_df, bin_to_coord_df)

# New features: 'visit_count', 'visit_boat_count'
df = df.merge(visit_count_df,on='no_bin',how='left')
df = df.merge(unique_boat_count_df,on='no_bin',how='left')

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
   visit_count  visit_boat_count
0         2001                 6
1         2001                 6
2         2001                 6
3         2001                 6
4         2001                 6

Shift (offset) features

pre_cols = df.columns

g = df.groupby('id')
for f in ['x', 'y']:
    # Shift x and y within each ship by +1 / -1 steps and take differences
    df[f + '_prev_diff'] = df[f] - g[f].shift(1)
    df[f + '_next_diff'] = df[f] - g[f].shift(-1)
    df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)
    # Euclidean distances: to the previous point, to the next point, and between the previous and next points
df['dist_move_prev'] = np.sqrt(np.square(df['x_prev_diff']) + np.square(df['y_prev_diff']))
df['dist_move_next'] = np.sqrt(np.square(df['x_next_diff']) + np.square(df['y_next_diff']))
df['dist_move_prev_next'] = np.sqrt(np.square(df['x_prev_next_diff']) + np.square(df['y_prev_next_diff']))
df['dist_move_prev_bin'] = pd.qcut(df['dist_move_prev'], 50, duplicates='drop') # equal-frequency binning of the distance to the previous point into 50 bins
df['dist_move_prev_bin'] = df['dist_move_prev_bin'].map(
    dict(zip(df['dist_move_prev_bin'].unique(), range(df['dist_move_prev_bin'].nunique())))
) # integer-encode the distance bins

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
(output omitted: the first 5 rows of the 10 new shift-based columns: x_prev_diff, x_next_diff, x_prev_next_diff, y_prev_diff, y_next_diff, y_prev_next_diff, dist_move_prev, dist_move_next, dist_move_prev_next, dist_move_prev_bin)

Statistical features

Basic usage of aggregation statistics

Supplement:

Group-wise aggregation with agg is extremely important; a few code patterns are shown here. For details, see:
http://joyfulpandas.datawhale.club/Content/ch4.html

  • Pay attention to where {} (dict) and [] (list) are used.

The standard grouping pattern is:

df.groupby(grouping keys)[columns to aggregate].operation

Group first, obtaining for example

gb = df.groupby(['School', 'Grade'])

  • [a] Apply several functions at once

gb.agg(['method name (e.g. a built-in aggregation)'])

e.g. gb.agg(['sum'])

  • [b] Apply specific aggregations to specific columns

gb.agg({'column': 'method'})

e.g. gb.agg({'Height': ['mean', 'max'], 'Weight': 'count'})

  • [c] Use a custom function

gb.agg(a named or anonymous function)

e.g. gb.agg(lambda x: x.mean() - x.min())

  • [d] Rename the aggregation results

gb.agg([
('new name', method (built-in or custom))
])

e.g. gb.agg([('range', lambda x: x.max() - x.min()), ('my_sum', 'sum')])

Also note that when applying a single renamed aggregation to one or more columns, the (name, method) pair still needs the square brackets; otherwise pandas cannot tell whether the string is a new name or a mistyped built-in function name.

  • The code below mainly uses two forms, illustrated in the sketch right after:

df.groupby('id').agg({'column': 'method'}) and df.groupby('id')['column'].agg(dict)
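A minimal, self-contained sketch of patterns [a]-[d] on a toy DataFrame (the School/Grade/Height/Weight columns are illustrative and not part of the competition data):

import pandas as pd

toy = pd.DataFrame({
    'School': ['A', 'A', 'B', 'B'],
    'Grade':  [1, 1, 1, 2],
    'Height': [160.0, 172.0, 158.0, 181.0],
    'Weight': [50.0, 63.0, 48.0, 70.0],
})
gb = toy.groupby(['School', 'Grade'])

print(gb.agg(['sum']))                                          # [a] one function applied to every column
print(gb.agg({'Height': ['mean', 'max'], 'Weight': 'count'}))   # [b] specific aggregations per column
print(gb.agg(lambda x: x.mean() - x.min()))                     # [c] a custom (anonymous) function
print(gb['Height'].agg([('range', lambda x: x.max() - x.min()), # [d] renamed aggregation results
                        ('my_sum', 'sum')]))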

pre_cols = df.columns

def start(x):
    try:
        return x[0]
    except:
        return None

def end(x):
    try:
        return x[-1]
    except:
        return None


def mode(x):
    try:
        return pd.Series(x).value_counts().index[0]
    except:
        return None

for f in ['dist_move_prev_bin', 'v_bin']:
    # Concatenate each ship's sequence of bin labels into one comma-separated string
    df[f + '_sen'] = df['id'].map(df.groupby('id')[f].agg(lambda x: ','.join(x.astype(str))))
    
# A batch of basic statistics: each column gets its own list of aggregation functions
g = df.groupby('id').agg({
    'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],
    'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],
    'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],
    'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],
    'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],
    'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],
    'x_bin1_y_bin1_count': ['mean', 'max', 'min'],
    'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],
    'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],
    'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],
}).reset_index()
g.columns = ['_'.join(col).strip() for col in g.columns] # flatten the MultiIndex column names
g.rename(columns={'id_': 'id'}, inplace=True) # rename 'id_' back to 'id'
cols = [f for f in g.keys() if f != 'id'] # collect the new feature column names
df = df.merge(g,on='id',how='left')

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
(output omitted: the first 5 rows of the new per-ship aggregated features, 5 rows × 54 columns)

Statistics after splitting the data

def group_feature(df, key, target, aggs,flag):   
    """通过字典的形式来构建方法和重命名"""
    agg_dict = {}
    for ag in aggs:
        agg_dict['{}_{}_{}'.format(target,ag,flag)] = ag
#     print(agg_dict)
    t = df.groupby(key)[target].agg(agg_dict).reset_index()
    return t

def extract_feature(df, train, flag):
    '''
    Statistical features.
    Note how group_feature is used and what it produces.
    '''
    if (flag == 'on_night') or (flag == 'on_day'): 
        t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')
        # return train
    
    
    if flag == "0":
        t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')  
    elif flag == "1":
        t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')
        t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left') 
        # .nunique().to_dict() turns the per-ship unique-value counts into a dict
        # Combining to_dict() with map() is a handy way to build mapped statistics, e.g. conversion-rate features in CTR problems
        # Exercise: how would you build a train+test conversion-rate feature from the train labels (0/1), given that some ids appear in both train and test?
        hour_nunique = df.groupby('ship')['speed'].nunique().to_dict()
        train['speed_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)   
        hour_nunique = df.groupby('ship')['direction'].nunique().to_dict()
        train['direction_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)  

    t = group_feature(df, 'ship','x',['max','min','mean','median','std','skew'],flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship','y',['max','min','mean','median','std','skew'],flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship','base_dis_diff',['max','min','mean','std','skew'],flag)
    train = pd.merge(train, t, on='ship', how='left')

       
    train['x_max_x_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
    train['y_max_y_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
    train['y_max_x_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
    train['x_max_y_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
    train['slope_{}'.format(flag)] = train['y_max_y_min_{}'.format(flag)] / np.where(train['x_max_x_min_{}'.format(flag)]==0, 0.001, train['x_max_x_min_{}'.format(flag)])
    train['area_{}'.format(flag)] = train['x_max_x_min_{}'.format(flag)] * train['y_max_y_min_{}'.format(flag)] 
    
    mode_hour = df.groupby('ship')['hour'].agg(lambda x:x.value_counts().index[0]).to_dict()
    train['mode_hour_{}'.format(flag)] = train['ship'].map(mode_hour)
    train['slope_median_{}'.format(flag)] = train['y_median_{}'.format(flag)] / np.where(train['x_median_{}'.format(flag)]==0, 0.001, train['x_median_{}'.format(flag)])

    return train
data  = df.copy()
data.rename(columns={
    'id':'ship',
    'v':'speed',
    'dir':'direction'
},inplace=True)
# keep one record per ship as the label table
data_label = data.drop_duplicates(['ship'],keep = 'first')

data_1 = data[data['speed']==0]
data_2 = data[data['speed']!=0]
data_label = extract_feature(data_1, data_label,"0")
data_label = extract_feature(data_2, data_label,"1")

data_1 = data[data['day_nig'] == 0]
data_2 = data[data['day_nig'] == 1]
data_label = extract_feature(data_1, data_label,"on_night")
data_label = extract_feature(data_2, data_label,"on_day")
data_label.rename(columns={'ship':'id','speed':'v','direction':'dir'},inplace=True)
new_cols = [i for i in data_label.columns if i not in df.columns]
df = df.merge(data_label[new_cols+['id']],on='id',how='left')

df[new_cols].head()
(output omitted: the first 5 rows of the new split-based statistics, 5 rows × 127 columns)

Concrete use of statistical features

temp = df.copy()
temp.rename(columns={'id':'ship','dir':'d'},inplace=True)

def coefficient_of_variation(x):
    x = x.values
    if np.mean(x) == 0:
        return 0
    return np.std(x) / np.mean(x)

def max_2(x):
    x = list(x.values)
    x.sort(reverse=True)
    return x[1]

def max_3(x):
    x = list(x.values)
    x.sort(reverse=True)
    return x[2]

def diff_abs_mean(x):  # mean absolute value of the first differences
    return np.mean(np.abs(np.diff(x)))

f1 = pd.DataFrame()
for col in ['x', 'y', 'v', 'd']:
    features = temp.groupby('ship', as_index=False)[col].agg({
        '{}_min'.format(col): 'min',
        '{}_max'.format(col): 'max',
        '{}_mean'.format(col): 'mean',
        '{}_median'.format(col): 'median',
        '{}_std'.format(col): 'std',
        '{}_skew'.format(col): 'skew',
        '{}_sum'.format(col): 'sum',
        '{}_diff_abs_mean'.format(col): diff_abs_mean,
        '{}_mode'.format(col): lambda x: x.value_counts().index[0],
        '{}_coefficient_of_variation'.format(col): coefficient_of_variation,
        '{}_max2'.format(col): max_2,
        '{}_max3'.format(col): max_3
    })
    if f1.shape[0] == 0:
        f1 = features
    else:
        f1 = f1.merge(features, on='ship', how='left')

f1['x_max_x_min'] = f1['x_max'] - f1['x_min']
f1['y_max_y_min'] = f1['y_max'] - f1['y_min']
f1['y_max_x_min'] = f1['y_max'] - f1['x_min']
f1['x_max_y_min'] = f1['x_max'] - f1['y_min']
f1['slope'] = f1['y_max_y_min'] / np.where(f1['x_max_x_min'] == 0, 0.001, f1['x_max_x_min'])
f1['area'] = f1['x_max_x_min'] * f1['y_max_y_min']
f1['dis_max_min'] = (f1['x_max_x_min'] ** 2 + f1['y_max_y_min'] ** 2) ** 0.5
f1['dis_mean'] = (f1['x_mean'] ** 2 + f1['y_mean'] ** 2) ** 0.5
f1['area_d_dis_max_min'] = f1['area'] / f1['dis_max_min']

# 'Acceleration'-style features: per-second displacement in y and x towards the next point
temp.sort_values(['ship', 'time'], ascending=True, inplace=True)
temp['ynext'] = temp.groupby('ship')['y'].shift(-1)
temp['xnext'] = temp.groupby('ship')['x'].shift(-1)
temp['ynext'] = temp['ynext'].fillna(method='ffill')
temp['xnext'] = temp['xnext'].fillna(method='ffill')
temp['timenext'] = temp.groupby('ship')['time'].shift(-1)
temp['timediff'] = np.abs(temp['timenext'] - temp['time'])
temp['a_y'] = temp.apply(lambda x: (x['ynext'] - x['y']) / x['timediff'].total_seconds(), axis=1)
temp['a_x'] = temp.apply(lambda x: (x['xnext'] - x['x']) / x['timediff'].total_seconds(), axis=1)
for col in ['a_y', 'a_x']:
    f2 = temp.groupby('ship', as_index=False)[col].agg({
        '{}_max'.format(col): 'max',
        '{}_mean'.format(col): 'mean',
        '{}_min'.format(col): 'min',
        '{}_median'.format(col): 'median',
        '{}_std'.format(col): 'std'})
    f1 = f1.merge(f2, on='ship', how='left')

# Curvature proxy: (d_pre + d_next) / d_pre_next over consecutive points
temp['y_pre'] = temp.groupby('ship')['y'].shift(1)
temp['x_pre'] = temp.groupby('ship')['x'].shift(1)
temp['y_pre'] = temp['y_pre'].fillna(method='bfill')
temp['x_pre'] = temp['x_pre'].fillna(method='bfill')
temp['d_pre'] = ((temp['x'] - temp['x_pre']) ** 2 + (temp['y'] - temp['y_pre']) ** 2) ** 0.5
temp['d_next'] = ((temp['xnext'] - temp['x']) ** 2 + (temp['ynext'] - temp['y']) ** 2) ** 0.5
temp['d_pre_next'] = ((temp['xnext'] - temp['x_pre']) ** 2 + (temp['ynext'] - temp['y_pre']) ** 2) ** 0.5
temp['curvature'] = (temp['d_pre'] + temp['d_next']) / temp['d_pre_next']

f2 = temp.groupby('ship', as_index=False)['curvature'].agg({
    'curvature_max': 'max',
    'curvature_mean': 'mean',
    'curvature_min': 'min',
    'curvature_median': 'median',
    'curvature_std': 'std'})
f1 = f1.merge(f2, on='ship', how='left')

Embedding features

  • Question!

Why, in data-mining competitions, do we use word2vec or NMF (there are many methods, but these two are the most common) to construct "word embedding" features?

Answer: to boost the score!

Scoring higher is the visible effect, but behind it lies a view of the data as a whole. The statistical and domain features above also look at the data globally, yet they easily miss the relationships between records. Take everyone's age as an example: we can build statistics such as the mean or the extremes, or domain features such as standardized weight = weight / age, but all of these come from human understanding. If instead we treat each of these values as a word, and treat all the values of one record (or one group of records) as a document, we can pick up additional regularities, i.e. additional features.

  • Overview

Word embedding means representing a word as a vector so it can be fed into a network. Word embeddings are sometimes also called distributed semantic models or vector space models. As the names and the construction suggest, the technique groups words of the same kind together: after projection, apple, mango and banana end up close to each other in the vector space, while book and house end up relatively far from those fruit words.

  • Use cases

So far, word embeddings have been used for feature generation, document clustering, text classification and other natural-language tasks, for example:

Finding similar words: embeddings can be used to find the words closest to a given word.

Building groups of related words: clustering words so that related words end up together;

Features for text classification: words cannot be fed directly into a machine-learning model, so we first project them into a vector space and then train models on those vectors;

Document clustering.

The tasks above are text tasks, but embedding models have since been extended to many other areas. Typical examples:

On Weibo, represent each user as a word, build an embedding per user, and compute similarities between users to find the most closely related people;

In recommendation, embed each product from the users' purchase histories, compute similarities between products, and recommend accordingly;

In this Tianchi ocean problem, embed the different ships that appear at the same coordinates to obtain a vector per ship, and thereby find ships that often work in the same areas;

In short, embeddings are a huge help for finding relationships between objects, and they appear in almost every data competition nowadays. Let's look at the most widely used model, Word2Vec.

  • What does Word2Vec actually do?

Word2vec represents words in a vector space: words with similar meanings appear close together, while dissimilar words lie far apart. This is also referred to as capturing semantic relationships.

Neural networks do not understand text, only numbers; word embeddings provide a way to turn text into numeric vectors.

Word2vec reconstructs the linguistic context of words. What is linguistic context? In everyday life, when we communicate by speaking or writing, other people try to work out the intent of the sentence. For example, for "What is the temperature of India?", the context is that the user wants to know the temperature of India.

In short, the main target of a sentence is its context. The words and sentences surrounding a spoken or written word help determine the meaning of that context; Word2vec learns the vector representation of a word from its context.

  • References

[NLP] Understanding the essence of Word2vec word vectors (in Chinese): https://zhuanlan.zhihu.com/p/26306795
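To make the "finding similar words" use case concrete, here is a minimal toy sketch (it assumes gensim 3.x, matching the size=/iter= keyword arguments used in the code below; the grid-cell tokens are made up):

from gensim.models import Word2Vec

# Toy "sentences": each inner list is one ship's sequence of visited grid-cell ids
sentences = [
    ['12_7', '12_8', '13_8', '13_9'],
    ['12_7', '12_8', '12_8', '13_8'],
    ['40_2', '41_2', '41_3', '40_2'],
]
model = Word2Vec(sentences, size=8, window=3, min_count=1, iter=50, sg=0, seed=1)

# Grid cells that co-occur in the same trajectories end up close together in vector space
print(model.wv.most_similar('12_8', topn=2))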

Building word vectors with Word2vec

def traj_cbow_embedding(traj_data_corpus=None, embedding_size=70,
                        iters=40, min_count=3, window_size=25,
                        seed=9012, num_runs=5, word_feat="no_bin"):
    """CBOW embedding for trajectory data."""
    boat_id = traj_data_corpus['id'].unique()
    sentences, embedding_df_list, embedding_model_list = [], [], []
    for i in boat_id:
        traj = traj_data_corpus[traj_data_corpus['id']==i]
        sentences.append(traj[word_feat].values.tolist())

    print("\n@Start CBOW word embedding at {}".format(datetime.now()))
    print("-------------------------------------------")
    for i in tqdm(range(num_runs)):
        model = Word2Vec(sentences, size=embedding_size,
                                  min_count=min_count,
                                  workers=mp.cpu_count(),
                                  window=window_size,
                                  seed=seed, iter=iters, sg=0)

        # Sentence vector: average the word vectors of each ship's bin sequence
        embedding_vec = []
        for ind, seq in enumerate(sentences):
            seq_vec, word_count = 0, 0
            for word in seq:
                if word not in model:
                    continue
                else:
                    seq_vec += model[word]
                    word_count += 1
            if word_count == 0:
                embedding_vec.append(embedding_size * [0])
            else:
                embedding_vec.append(seq_vec / word_count)
        embedding_vec = np.array(embedding_vec)
        embedding_cbow_df = pd.DataFrame(embedding_vec, 
            columns=["embedding_cbow_{}_{}".format(word_feat, i) for i in range(embedding_size)])
        embedding_cbow_df["id"] = boat_id
        embedding_df_list.append(embedding_cbow_df)
        embedding_model_list.append(model)
    print("-------------------------------------------")
    print("@End CBOW word embedding at {}".format(datetime.now()))
    return embedding_df_list, embedding_model_list
embedding_size=70
iters=70
min_count=3
window_size=25
num_runs=1

df_list, model_list = traj_cbow_embedding(df,
                                          embedding_size=embedding_size,
                                          iters=iters, min_count=min_count,
                                          window_size=window_size,
                                          seed=9012,
                                          num_runs=num_runs,
                                          word_feat="no_bin")

train_embedding_df_list = [d.reset_index(drop=True) for d in df_list]
fea = train_embedding_df_list[0]
fea = pd.DataFrame(fea)
@Start CBOW word embedding at 2021-04-06 17:41:14.143589
-------------------------------------------
100%|██████████| 1/1 [00:00<00:00,  4.39it/s]
-------------------------------------------
@End CBOW word embedding at 2021-04-06 17:41:14.373201
pre_cols = df.columns
df = df.merge(fea,on='id',how='left')


new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
(output omitted: embedding_cbow_no_bin_0 … embedding_cbow_no_bin_69 for the first 5 rows, 5 rows × 70 columns)

boat_id = df['id'].unique()
total_embedding = pd.DataFrame(boat_id, columns=["id"])
traj_data = df[['v','dir','id']].rename(columns = {'v':'speed','dir':'direction'})

# Step 1: Construct the words
traj_data_corpus = []
traj_data["speed_str"]     = traj_data["speed"].apply(lambda x: str(int(x*100)))
traj_data["direction_str"] = traj_data["direction"].apply(str)
traj_data["speed_dir_str"] = traj_data["speed_str"] + "_" + traj_data["direction_str"]
traj_data_corpus = traj_data[["id", "speed_str",
                                  "direction_str", "speed_dir_str"]]
print("\n@Round 2 speed embedding:")
df_list, model_list = traj_cbow_embedding(traj_data_corpus,
                                          embedding_size=10,
                                          iters=40, min_count=3,
                                          window_size=25, seed=9102,
                                          num_runs=1, word_feat="speed_str")
speed_embedding = df_list[0].reset_index(drop=True)
total_embedding = pd.merge(total_embedding, speed_embedding,
                           on="id", how="left")


print("\n@Round 2 direction embedding:")
df_list, model_list = traj_cbow_embedding(traj_data_corpus,
                                          embedding_size=12,
                                          iters=70, min_count=3,
                                          window_size=25, seed=9102,
                                          num_runs=1, word_feat="speed_dir_str")
speed_dir_embedding = df_list[0].reset_index(drop=True)
total_embedding = pd.merge(total_embedding, speed_dir_embedding,
                           on="id", how="left")
@Round 2 speed embedding:

@Start CBOW word embedding at 2021-04-06 17:41:15.054905
-------------------------------------------
@End CBOW word embedding at 2021-04-06 17:41:15.241547

@Round 2 direction embedding:

@Start CBOW word embedding at 2021-04-06 17:41:15.249564
-------------------------------------------
@End CBOW word embedding at 2021-04-06 17:41:15.470688
pre_cols = df.columns
df = df.merge(total_embedding,on='id',how='left')

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
(output omitted: the speed-string and speed_dir-string embedding columns for the first 5 rows, 5 rows × 22 columns)

Extracting topic distributions with NMF
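Before the full helper class below, here is the core of the pipeline it wraps, as a minimal sketch on a made-up corpus (the token values are illustrative): each ship's discretized values are joined into one "document", TF-IDF turns the documents into a sparse matrix, and NMF factorizes that matrix so that every document gets a low-dimensional topic distribution usable as features.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: one "document" per ship, whose tokens are discretized speed values
docs = [
    '259 270 270 329 318 259',
    '00 00 00 12 15 00 00',
    '1040 1100 1080 1040 990 1100',
]
tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)

# Each row of `topics` is one ship's topic distribution, i.e. its NMF feature vector
topics = NMF(n_components=2, random_state=0).fit_transform(tfidf)
print(topics.shape)  # (3, 2)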

class nmf_list(object):
    def __init__(self,data,by_name,to_list,nmf_n,top_n):
        self.data = data
        self.by_name = by_name
        self.to_list = to_list
        self.nmf_n = nmf_n
        self.top_n = top_n

    def run(self,tf_n):
        df_all = self.data.groupby(self.by_name)[self.to_list].apply(lambda x :'|'.join(x)).reset_index()
        self.data =df_all.copy()

        print('build word_fre')
        # Build per-word frequencies: total count and document frequency
        def word_fre(x):
            word_dict = []
            x = x.split('|')
            docs = []
            for doc in x:
                doc = doc.split()
                docs.append(doc)
                word_dict.extend(doc)
            word_dict = Counter(word_dict)
            new_word_dict = {}
            for key,value in word_dict.items():
                new_word_dict[key] = [value,0]
            del word_dict  
            del x
            for doc in docs:
                doc = Counter(doc)
                for word in doc.keys():
                    new_word_dict[word][1] += 1
            return new_word_dict 
        self.data['word_fre'] = self.data[self.to_list].apply(word_fre)

        print('build top_' + str(self.top_n))
        # Keep the top_n highest-frequency words
        def top_100(word_dict):
            return sorted(word_dict.items(),key = lambda x:(x[1][1],x[1][0]),reverse = True)[:self.top_n]
        self.data['top_'+str(self.top_n)] = self.data['word_fre'].apply(top_100)
        def top_100_word(word_list):
            words = []
            for i in word_list:
                i = list(i)
                words.append(i[0])
            return words 
        self.data['top_'+str(self.top_n)+'_word'] = self.data['top_' + str(self.top_n)].apply(top_100_word)
        # print('top_'+str(self.top_n)+'_word的shape')
        print(self.data.shape)

        word_list = []
        for i in self.data['top_'+str(self.top_n)+'_word'].values:
            word_list.extend(i)
        word_list = Counter(word_list)
        word_list = sorted(word_list.items(),key = lambda x:x[1],reverse = True)
        user_fre = []
        for i in word_list:
            i = list(i)
            user_fre.append(i[1]/self.data[self.by_name].nunique())
        stop_words = []
        for i,j in zip(word_list,user_fre):
            if j>0.5:
                i = list(i)
                stop_words.append(i[0])

        print('start title_feature')
        # Treat the merged tag list as one document for text processing
        self.data['title_feature'] = self.data[self.to_list].apply(lambda x: x.split('|'))
        self.data['title_feature'] = self.data['title_feature'].apply(lambda line: [w for w in line if w not in stop_words])
        self.data['title_feature'] = self.data['title_feature'].apply(lambda x: ' '.join(x))

        print('start NMF')
        # Vectorize the documents with TF-IDF
        tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n,tf_n))
        tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)
        # Use NMF to extract each document's topic distribution
        text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)


        # Assemble the output feature table
        name = [str(tf_n) + self.to_list + '_' +str(x) for x in range(1,self.nmf_n+1)]
        tag_list = pd.DataFrame(text_nmf)
        print(tag_list.shape)
        tag_list.columns = name
        tag_list[self.by_name] = self.data[self.by_name]
        column_name = [self.by_name] + name
        tag_list = tag_list[column_name]
        return tag_list
data = df.copy()
data.rename(columns={'v':'speed','id':'ship'},inplace=True)
for j in range(1,4):
    print('********* {} *******'.format(j))
    for i in ['speed','x','y']:
        data[i + '_str'] = data[i].astype(str)
        nmf = nmf_list(data,'ship',i + '_str',8,2)
        nmf_a = nmf.run(j)
        nmf_a.rename(columns={'ship':'id'},inplace=True)
        data_label = data_label.merge(nmf_a,on = 'id',how = 'left')
********* 1 *******
build word_fre
build top_2
(6, 5)
start title_feature
start NMF
(6, 8)
(the same six log lines repeat for each of speed, x and y, and again for n-gram sizes 2 and 3)
new_cols = [i for i in data_label.columns if i not in df.columns]
df = df.merge(data_label[new_cols+['id']],on='id',how='left')

df[new_cols].head()
(output omitted: the NMF topic features for the first 5 rows, 5 rows × 72 columns)

Summary and reflections

  • Competition-specific feature engineering: how do we build features that actually help for this problem?

      Hint: use EDA and the literature around the problem to find and build features with real domain meaning.

  • Binning features: almost every topline solution constructs binning features. Why is binning so important and effective, and in what situations does it work well? (Why does this problem need binning features?)

      Hint: the principles behind binning.

  • DataFrame features: pandas' built-in DataFrame methods can generate a large number of statistical features. Suggestion: maintain your own toolbox of statistical feature functions for tabular data.

      Hint: Datawhale's Joyful Pandas.

  • Embedding features: a score-boosting trick. Why does turning a sequence into an NLP-style sentence or document and vectorizing it work so well? How should the parameters be tuned to build good word vectors for a given dataset?

      Hint: study Word2vec.

Appendix

Sources

1 Team: Pursuing the Past Youth
Link:
https://github.com/juzstu/TianChi_HaiYang

2 Team: liu123的航空母舰队
Link:
https://github.com/MichaelYin1994/tianchi-trajectory-data-mining

3 Team: 天才海神号
Link:
https://github.com/fengdu78/tianchi_haiyang?spm=5176.12282029.0.0.5b97301792pLch

4 Team: 大白
Link:
https://github.com/Ai-Light/2020-zhihuihaiyang

5 Team: 抗毒救灾
Link:
https://github.com/wudejian789/2020DCIC_A_Rank7_B_Rank12

6 Team: 蜗牛坐车里团队
Link:
https://tianchi.aliyun.com/notebook-ai/detail?postId=114808

7 Team: 用欧气驱散疫情
Link:
https://github.com/tudoulei/2020-Digital-China-Innovation-Competition

Data

The data used is hy_round1_train_20200102 (the preliminary-round training data)

How to run

The detailed, cleaned-up code for each team is in ipynb/*.ipynb;
the numbering matches the list above

Results

Output files are written to result/*.csv

Walk-throughs

  • [Tianchi Smart Ocean] Topline source code — feature engineering notes (大白):
    https://blog.csdn.net/qq_44574333/article/details/115188086

  • [Tianchi Smart Ocean] Topline source code — feature engineering notes (Pursuing the Past Youth):
    https://blog.csdn.net/qq_44574333/article/details/112547081

  • [Tianchi Smart Ocean] Topline source code — feature engineering notes (天才海神号):
    https://blog.csdn.net/qq_44574333/article/details/115185634

  • [Tianchi Smart Ocean] Topline source code — feature engineering notes (liu123的航空母舰队):
    https://blog.csdn.net/qq_44574333/article/details/115091764

Recommended learning resources

Hands-on: topline code from well-known competitions on platforms such as Kaggle and Tianchi

Books:

+ 《阿里云天池大赛赛题解析》 (Analyses of Alibaba Cloud Tianchi Competition Problems)

   [the author's own study notes: https://blog.csdn.net/qq_44574333/article/details/109611764]

+ 《美团机器学习实战》 (Machine Learning in Practice at Meituan)

Tutorials:

+ Joyful Pandas — strongly recommended! Fundamental and efficient
http://joyfulpandas.datawhale.club/