[Tianchi Smart Ocean Construction] Topline Source Code: Feature Engineering Study (liu123的航空母舰队)

Latest recommended article published on 2022-02-23 16:07:31

阿芒Aris


Article link: https://blog.csdn.net/qq_44574333/article/details/115091764


[Tianchi Smart Ocean Construction] Topline Source Code: Feature Engineering Study

Team name: liu123的航空母舰队
Link: https://github.com/MichaelYin1994/tianchi-trajectory-data-mining

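Preface

These are study notes on the open-sourced topline code. They cover only the feature engineering part: for each step, the inputs, outputs, purpose, and underlying principle, plus some personal interpretation.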

I Data


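Description of the raw data:

渔船ID (boat ID): unique identifier of the fishing boat; the result file is keyed by this ID
x: x coordinate of the boat in the planar coordinate system
y: y coordinate of the boat in the planar coordinate system
速度 (speed): the boat's speed at the current moment, in knots
方向 (direction): the boat's heading at the current moment, in degrees
time: time at which the record was reported, given as month, day, hour:minute
type: the boat's label, one of three operation types (围网 purse seine, 刺网 gillnet, 拖网 trawl)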

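II Feature Engineering

Feature engineering has two parts. The first is basic statistical features: for each trajectory's x and y coordinates, speed, direction, and some of their cross terms, quantiles, direction histograms, geographic location information, and similar basic statistics are extracted. The second is word-embedding features.
The grid id containing each trajectory coordinate is treated as a word, and each trajectory as a sentence. Each word is then embedded with word embedding [4] [5],
and a sentence vector is the average of the vectors of the words it contains, which can be fed directly into a statistical model as features.
(quoted from the original author)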

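2.1 The PNPOLY algorithm

Determining whether a point lies inside a polygon (the PNPOLY algorithm)

"In GIS (geographic information systems), deciding whether a coordinate lies inside a polygon is a problem that comes up often. It sounds complicated at first, but with the PNPoly algorithm proposed by W. Randolph Franklin it takes only a few lines of code.
First find the minimum and maximum X and Y coordinates among the polygon's vertices. For example, take the polygon bounded by the points (9,1), (4,3), (2,7), (8,2), (3,6). Then Xmin is 2, Xmax is 9, Ymin is 1, and Ymax is 7. No point of the polygon has an X coordinate below 2 or above 9, or a Y coordinate below 1 or above 7, so many points that are not in the polygon can be ruled out quickly.
Author: 与蟒唯舞
Link: https://www.jianshu.com/p/3187832cb6cc"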

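import numpy as np


def pnpoly(poly_vert_list=None, test_point=None):
    """Return the index of the first polygon that contains test_point.

    Each element of poly_vert_list is a polygon given as a list of
    [x, y] vertices. Returns np.nan if no polygon contains the point.
    """
    for i, polygon in enumerate(poly_vert_list):
        vert_count = len(polygon)
        is_inside = False
        # jj trails ii by one vertex, so (jj, ii) walks every edge,
        # including the closing edge from the last vertex to the first.
        ii, jj = 0, vert_count - 1

        while ii < vert_count:
            # Only edges that straddle the horizontal line through
            # test_point can be crossed by a ray cast from it.
            if (polygon[ii][1] > test_point[1]) != (polygon[jj][1] > test_point[1]):
                # x coordinate where the edge crosses that horizontal line
                x_cross = ((polygon[jj][0] - polygon[ii][0])
                           * (test_point[1] - polygon[ii][1])
                           / (polygon[jj][1] - polygon[ii][1]) + polygon[ii][0])
                if test_point[0] < x_cross:
                    # Each crossing to the right of the point flips the state.
                    is_inside = not is_inside
            jj = ii
            ii += 1

        if is_inside:
            return i
    return np.nan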


def find_fishing_ground(traj=None, poly_vert_list=None):
    test_point_list = traj[["lon", "lat"]].values.tolist()

    fishing_ground_ind = []
    for test_point in test_point_list:
        fishing_ground_ind.append(pnpoly(poly_vert_list, test_point))
    traj["fishing_ground"] = fishing_ground_ind
    return traj

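Applying the PNPOLY routine above to each ["lon", "lat"] point yields the fishing_ground feature: the index of the polygon that contains the point, or NaN if no polygon does.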
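As a minimal sketch of the idea (not the original author's code), the function below combines the bounding-box rejection described in the quoted passage with the ray-crossing test that pnpoly performs. The name point_in_polygon and the square test polygon are hypothetical, chosen only for illustration.

```python
def point_in_polygon(polygon, point):
    """Return True if point (x, y) lies inside polygon, a list of (x, y) vertices."""
    px, py = point
    xs = [v[0] for v in polygon]
    ys = [v[1] for v in polygon]

    # Cheap rejection: a point outside the bounding box cannot be inside.
    if px < min(xs) or px > max(xs) or py < min(ys) or py > max(ys):
        return False

    inside = False
    j = len(polygon) - 1  # index of the previous vertex; starts at the last one
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does the edge (j -> i) straddle the horizontal line y = py?
        if (yi > py) != (yj > py):
            # x coordinate where the edge crosses that line
            x_cross = (xj - xi) * (py - yi) / (yj - yi) + xi
            if px < x_cross:
                inside = not inside  # one more crossing to the right of the point
        j = i
    return inside


# Hypothetical test polygon: an axis-aligned square with side 4.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
inside_result = point_in_polygon(square, (2, 2))
outside_result = point_in_polygon(square, (5, 2))
far_result = point_in_polygon(square, (2, 9))
```

An odd number of crossings to the right of the point means the point is inside; the bounding-box check only saves work and does not change the result for points it lets through.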

2.2 Binning the x and y coordinates

def traj_to_bin(traj=None, x_min=12031967.16239096, x_max=14226964.881853,
                y_min=1623579.449434373, y_max=4689471.1780792,
                row_bins=4380, col_bins=3136):
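    # With the default bounds, both bin counts correspond to cells about 700 units wide:
    # col_bins ~= (14226964.881853 - 12031967.16239096) / 700  -> 3136
    # row_bins ~= (4689471.1780792 - 1623579.449434373) / 700  -> 4380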
    # Establish bins on x direction and y direction
    x_bins = np.linspace(x_min, x_max, endpoint=True, num=col_bins + 1)
    y_bins = np.linspace(y_min, y_max, endpoint=True, num=row_bins + 1)

    # Determine each x coordinate belong to which bin
    traj.sort_values(by='x', inplace=True)
    x_res = np.zeros((len(traj), ))
    j = 0
    for i in range(1, col_bins + 1):
        low, high = x_bins[i-1], x_bins[i]
        while( j < len(traj)):
            # low - 0.001 for numeric stable.
            if (traj["x"].iloc[j] <= high) & (traj["x"].iloc[j] > low - 0.001):
                x_res[j] = i
                j += 1
            else:
                break
    traj["x_grid"] = x_res
    traj["x_grid"] = traj["x_grid"].astype(int)
    traj["x_grid"] = traj["x_grid"].apply(str)

    # Determine each y coordinate belong to which bin
    traj.sort_values(by='y', inplace=True)
    y_res = np.zeros((len(traj), ))
    j = 0
    for i in range(1, row_bins + 1):
        low, high = y_bins[i-1], y_bins[i]
        while( j < len(traj)):
            # low - 0.001 for numeric stable.
            if (traj["y"].iloc[j] <= high) & (traj["y"].iloc[j] > low - 0.001):
                y_res[j] = i
                j += 1
            else:
                break
    traj["y_grid"] = y_res
    traj["y_grid"] = traj["y_grid"].astype(int)
    traj["y_grid"] = traj["y_grid"].apply(str)

    # Determine which bin each coordinate belongs to.
    traj["no_bin"] = [i + "_" + j for i, j in zip(
        traj["x_grid"].values.tolist(), traj["y_grid"].values.tolist())]
    traj.sort_values(by='time', inplace=True)
    return traj

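The function above takes traj, a DataFrame holding the rows that share one boat ID,
and adds three binning features: x_grid, y_grid, and no_bin.

It works as follows.
Example: x_bins = [0, 2, 4, 6, 8, 10], with 10 data rows split into 5 bins
First, x and y are each cut into equal-width intervals between their minimum and maximum values.
Then the rows are sorted, first by x and later by y.
Finally, each row is assigned to a bin: a row with x = 5 falls into bin 3, so x_grid = 3.
Note: no_bin = x_grid + '_' + y_grid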
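The same assignment can be written without the explicit sort-and-scan loops. The sketch below is not the original author's code: it uses np.searchsorted on the toy edges [0, 2, 4, 6, 8, 10] from the example, and the helper name assign_bins and the sample values are made up for illustration. It also treats out-of-range values differently from traj_to_bin, clipping them into the first or last bin.

```python
import numpy as np


def assign_bins(values, v_min, v_max, n_bins):
    """Return 1-based bin ids so that bin i covers (edges[i-1], edges[i]]."""
    edges = np.linspace(v_min, v_max, num=n_bins + 1, endpoint=True)
    # side="left" returns the first index i with edges[i] >= value,
    # i.e. the bin whose right edge is the first one not below the value.
    ids = np.searchsorted(edges, values, side="left")
    # A value equal to v_min lands on index 0; fold it into bin 1.
    return np.clip(ids, 1, n_bins)


# Hypothetical sample coordinates on a 0..10 range split into 5 bins.
x = np.array([5.0, 0.0, 2.0, 9.5, 10.0])
y = np.array([1.0, 3.0, 6.5, 8.0, 0.5])

x_grid = assign_bins(x, 0, 10, 5)
y_grid = assign_bins(y, 0, 10, 5)
no_bin = [f"{i}_{j}" for i, j in zip(x_grid, y_grid)]
```

Here x = 5 lands in bin 3, matching the worked example, and the combined no_bin string plays the role of a grid-cell id.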

2.3 x-y regions

2.3.1 Extracting the x-y regions

traj_data_df = [traj[["x", "y", "no_bin", "lon",
                      "lat", "boat_id", "time_array"]] for traj in res]
traj_data_df = pd.concat(traj_data_df, axis=0, ignore_index=True)
bin_to_coord_df = traj_data_df.groupby(
    ["no_bin"]).median().reset_index().drop(["boat_id"], axis=1)

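The no_bin constructed above identifies the region (grid cell) that an (x, y) coordinate belongs to; for example, 0_0 would denote the "origin" cell.
The code keeps the following base columns and takes their per-cell median to build bin_to_coord_df:
"x", "y", "no_bin", "lon", "lat", "boat_id", "time_array"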

2.3.2 Statistics per x-y region

"Total visit count"

def find_save_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
    """Find and save the visit frequency of each bin."""
    visit_count_df = traj_data_df.groupby(["no_bin"]).count().reset_index()
    visit_count_df = visit_count_df[["no_bin", "x"]]
    visit_count_df.rename({"x":"visit_count"}, axis=1, inplace=True)

    visit_count_df_save = pd.merge(bin_to_coord_df, visit_count_df, on="no_bin", how="left")
    return visit_count_df

visit_count_df = find_save_visit_count_table(
    traj_data_df, bin_to_coord_df)

Grouping by "no_bin" and counting rows gives the number of visits to each region. Only the visit_count column is kept as the region-level statistic.

"Number of distinct boats visiting each region"

def find_save_unique_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
    """Find and save the unique boat visit count of each bin."""
    unique_boat_count_df = traj_data_df.groupby(["no_bin"])["boat_id"].nunique().reset_index()
    unique_boat_count_df.rename({"boat_id":"visit_boat_count"}, axis=1, inplace=True)

    unique_boat_count_df_save = pd.merge(bin_to_coord_df, unique_boat_count_df,
                                         on="no_bin", how="left")
    return unique_boat_count_df

unique_boat_count_df = find_save_unique_visit_count_table(
    traj_data_df, bin_to_coord_df)

Grouping by "no_bin" and counting the unique boat_id values gives the number of distinct boats that visited each region, stored as visit_boat_count.
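To make the difference between the two counts concrete, here is a small sketch on made-up data (the rows below are hypothetical and not taken from the competition set). It uses groupby(...).size() for the row count, which counts every row regardless of missing values, whereas the count() call in the original code counts non-null entries per column.

```python
import pandas as pd

# Hypothetical trajectory records: three boats, three grid cells.
traj_data_df = pd.DataFrame({
    "boat_id": [1, 1, 1, 2, 2, 3],
    "no_bin": ["3_1", "3_1", "5_4", "3_1", "1_2", "3_1"],
    "x": [5.0, 5.2, 9.5, 5.1, 0.3, 5.4],
})

# Total visits: number of records that fall in each cell.
visit_count = traj_data_df.groupby("no_bin").size().rename("visit_count")

# Distinct boats: how many different boat_id values appear in each cell.
visit_boat_count = (traj_data_df.groupby("no_bin")["boat_id"]
                    .nunique().rename("visit_boat_count"))

cell_stats = pd.concat([visit_count, visit_boat_count], axis=1).reset_index()
```

Cell 3_1 has four records but only three distinct boats, because boat 1 reported from it twice.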

Mean time spent by boats in each x-y region

def find_save_mean_stay_time_table(traj_data_df=None, bin_to_coord_df=None):
    """Find and save the mean stay time of each bin."""
    mean_stay_time_df = traj_data_df.groupby(
        ["no_bin", "boat_id"])["time"].sum().reset_index()
    mean_stay_time_df.rename({"time":"total_stay_time"}, axis=1, inplace=True)
    mean_stay_time_df = mean_stay_time_df.groupby(
        ["no_bin"])["total_stay_time"].mean().reset_index()
    mean_stay_time_df.rename(
        {"total_stay_time":"mean_stay_time"}, axis=1, inplace=True)

    mean_stay_time_df_save = pd.merge(bin_to_coord_df, mean_stay_time_df,
                                      on="no_bin", how="left")
    return mean_stay_time_df

mean_stay_time_df = find_save_mean_stay_time_table(
    traj_data_df, bin_to_coord_df)

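First, group by "no_bin" and "boat_id" and sum time, which gives each boat's total time in each region.
Then group by "no_bin" alone and average those totals, which gives the mean time a visiting boat spends in each x-y region. Note that this relies on a time column, which is not among the columns selected in 2.3.1, so traj_data_df must also carry it.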
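The two-step aggregation can be checked on a toy table. This is a sketch under an assumption: the time column is taken to hold the duration attributed to each record (the values below are invented), since summing only makes sense for durations.

```python
import pandas as pd

# Hypothetical records; "time" is assumed to be a per-record duration.
records = pd.DataFrame({
    "boat_id": [1, 1, 2, 2, 2],
    "no_bin": ["3_1", "3_1", "3_1", "3_1", "5_4"],
    "time": [10.0, 20.0, 5.0, 5.0, 60.0],
})

# Step 1: total time each boat spends in each cell.
total_stay = (records.groupby(["no_bin", "boat_id"])["time"]
              .sum().reset_index(name="total_stay_time"))

# Step 2: average those per-boat totals within each cell.
mean_stay = (total_stay.groupby("no_bin")["total_stay_time"]
             .mean().reset_index(name="mean_stay_time"))
```

In cell 3_1, boat 1 totals 30 and boat 2 totals 10, so the mean stay time is 20. Averaging the raw rows instead would give 10, which is why the per-boat sum comes first.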

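2.4 CBOW Embedding: "word vector" features

[Figure: simplified diagram of the CBOW model structure]
The CBOW model uses the A consecutive words before and after a center word to compute the probability of that center word, i.e. it predicts the target word from its context. For the details, see the paper "Efficient Estimation of Word Representations in Vector Space".

A fun blog post explaining CBOW:
https://ask.hellobi.com/blog/wangdawei/36569

Embedding features are a very common NLP technique and also perform well in data-mining competitions. It is worth building a word-vector function like the one below for your own experiments.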

import multiprocessing as mp
from datetime import datetime

import numpy as np
import pandas as pd
from gensim.models import word2vec
from tqdm import tqdm


# timefn is not defined in this excerpt; in the original repository it is
# presumably a timing decorator.
@timefn
def traj_cbow_embedding(traj_data_corpus=None, embedding_size=70,
                        iters=40, min_count=3, window_size=25,
                        seed=9012, num_runs=5, word_feat="no_bin"):
    """CBOW embedding for trajectory data."""
    boat_id = traj_data_corpus['boat_id'].unique()
    sentences, embedding_df_list, embedding_model_list = [], [], []
    # One "sentence" per boat: its sequence of word_feat tokens.
    for i in boat_id:
        traj = traj_data_corpus[traj_data_corpus['boat_id'] == i]
        sentences.append(traj[word_feat].values.tolist())

    print("\n@Start CBOW word embedding at {}".format(datetime.now()))
    print("-------------------------------------------")
    for _ in tqdm(range(num_runs)):
        # size/iter are the gensim < 4.0 argument names; gensim >= 4.0
        # renamed them to vector_size/epochs. sg=0 selects CBOW.
        model = word2vec.Word2Vec(sentences, size=embedding_size,
                                  min_count=min_count,
                                  workers=mp.cpu_count(),
                                  window=window_size,
                                  seed=seed, iter=iters, sg=0)

        # Sentence vector: mean of the vectors of the in-vocabulary words.
        embedding_vec = []
        for seq in sentences:
            seq_vec, word_count = 0, 0
            for word in seq:
                # Words rarer than min_count have no vector and are skipped.
                if word in model.wv:
                    seq_vec += model.wv[word]
                    word_count += 1
            if word_count == 0:
                embedding_vec.append(embedding_size * [0])
            else:
                embedding_vec.append(seq_vec / word_count)
        embedding_vec = np.array(embedding_vec)
        embedding_cbow_df = pd.DataFrame(embedding_vec,
            columns=["embedding_cbow_{}_{}".format(word_feat, k) for k in range(embedding_size)])
        embedding_cbow_df["boat_id"] = boat_id
        embedding_df_list.append(embedding_cbow_df)
        embedding_model_list.append(model)
    print("-------------------------------------------")
    print("@End CBOW word embedding at {}".format(datetime.now()))
    return embedding_df_list, embedding_model_list

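The CBOW embedding is built as follows. For each boat_id, take one feature column, by default "no_bin" (the x-y region),
and treat the sequence of "no_bin" values belonging to the same boat_id as one sentence.
For example:
id0: [(0,0),(1,1),(2,2)]
id1: [(0,0),(-1,1),(-2,2)]
Each such list is one "sentence", and each x-y region value in it, such as (0,0), is a "word" of that sentence.

With the sentences built, Word2Vec can be trained on them to obtain a vector for each "word". Averaging the word vectors of a boat's sentence, as the function above does, gives that boat's "word vector" feature.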
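The averaging step can be illustrated in isolation. In the sketch below the word vectors are hand-written 2-dimensional stand-ins rather than vectors trained by Word2Vec, and the helper name sentence_vector is hypothetical. It shows how words missing from the vocabulary (for example those below min_count) are skipped, and how a sentence with no known words falls back to a zero vector.

```python
import numpy as np


def sentence_vector(sentence, word_vectors, dim):
    """Average the vectors of the words that have an embedding."""
    vecs = [word_vectors[w] for w in sentence if w in word_vectors]
    if not vecs:
        # No known word in the sentence: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


# Hand-written stand-ins for trained word vectors (assumed, not learned).
word_vectors = {
    "0_0": np.array([1.0, 0.0]),
    "1_1": np.array([0.0, 1.0]),
    "2_2": np.array([1.0, 1.0]),
}

# One sentence of grid-cell tokens per boat.
sentences = {
    "id0": ["0_0", "1_1", "2_2"],
    "id1": ["0_0", "-1_1", "-2_2"],  # last two tokens are out of vocabulary
    "id2": ["9_9"],                  # nothing in vocabulary
}

features = {bid: sentence_vector(seq, word_vectors, 2)
            for bid, seq in sentences.items()}
```

Each boat ends up with one fixed-length vector regardless of how long its trajectory is, which is what makes the result usable as a tabular feature.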

print("\n@Round 2 speed embedding:")
df_list, model_list = traj_cbow_embedding(traj_data_corpus,
                                          embedding_size=10,
                                          iters=40, min_count=3,
                                          window_size=25, seed=9102,
                                          num_runs=1, word_feat="speed_str")
speed_embedding = df_list[0].reset_index(drop=True)
total_embedding = pd.merge(total_embedding, speed_embedding,
                           on="boat_id", how="left")

The code author built separate word vectors from three features: "no_bin", "speed_str", and "speed_dir_str".

"speed_str" and "speed_dir_str" are constructed as follows, by simple string concatenation:

traj_data["speed_str"]     = traj_data["speed"].apply(lambda x: str(int(x*100)))
traj_data["direction_str"] = traj_data["direction"].apply(str)
traj_data["speed_dir_str"] = traj_data["speed_str"] + "_" + traj_data["direction_str"]
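A quick sketch on invented values shows what the resulting tokens look like. The speeds here are chosen to be exactly representable in floating point. In general int() truncates, so rounding error can shift a token by one (in Python, int(0.29 * 100) evaluates to 28).

```python
import pandas as pd

# Hypothetical speed (knots) and direction (degrees) readings.
traj_data = pd.DataFrame({
    "speed": [0.5, 2.25, 10.0],
    "direction": [0, 102, 359],
})

# Speed is scaled by 100 and truncated, so 2.25 knots becomes the token "225".
traj_data["speed_str"] = traj_data["speed"].apply(lambda x: str(int(x * 100)))
traj_data["direction_str"] = traj_data["direction"].apply(str)
# Joint token: speed and direction concatenated with an underscore.
traj_data["speed_dir_str"] = traj_data["speed_str"] + "_" + traj_data["direction_str"]

tokens = traj_data["speed_dir_str"].tolist()
```

Discretizing the continuous readings this way turns each record into a vocabulary item, so the same CBOW routine used for no_bin applies unchanged.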