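I have been busy job hunting lately and never wrote up a project I did earlier, so here is a brief summary.
The project asks us to analyze position data from the BeiDou devices on fishing vessels and determine whether each vessel is trawling, purse seining, or drift gillnetting. In other words, it is a "trajectory (sequence data) + multi-class classification" task, and the evaluation metric is the F1 score.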
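For a three-class task, an F1 score has to be averaged over the classes in some way. The sketch below assumes macro averaging (the unweighted mean of the per-class F1 scores), which the text above does not state explicitly. The helper `macro_f1` and the English class names are hypothetical and only serve the illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six vessels (not real competition data).
y_true = ["trawl", "trawl", "purse_seine", "purse_seine", "gillnet", "gillnet"]
y_pred = ["trawl", "purse_seine", "purse_seine", "purse_seine", "gillnet", "trawl"]

score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

With scikit-learn, `f1_score(y_true, y_pred, average="macro")` computes the same quantity.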
The key to this project is feature engineering on the raw data:
baseline
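(I can no longer find my own baseline code, so I borrow a publicly shared approach here. The method is essentially the same, except that my baseline included more summary statistics.)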
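For illustration only, "more summary statistics" could look something like the sketch below. This is not the lost baseline: the helper `summarize`, the particular statistics chosen, and the English column names are all assumptions.

```python
import pandas as pd


def summarize(series, prefix):
    """Return a dict of summary statistics for one numeric column."""
    stats = {
        "min": series.min(),
        "max": series.max(),
        "mean": series.mean(),
        "std": series.std(),            # sample standard deviation (ddof=1)
        "median": series.median(),
        "q25": series.quantile(0.25),
        "q75": series.quantile(0.75),
    }
    return {f"{prefix}_{name}": float(value) for name, value in stats.items()}


# Toy trajectory with hypothetical column names (not real competition data).
track = pd.DataFrame({
    "speed": [0.0, 1.0, 2.0, 3.0, 4.0],
    "x": [10.0, 11.0, 13.0, 16.0, 20.0],
})

# One flat feature dict per vessel, ready to become a row of the training table.
features = {**summarize(track["speed"], "speed"), **summarize(track["x"], "x")}
print(features)
```

Each vessel's trajectory file would yield one such dictionary, and stacking them gives the feature table for the classifier.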
import os, sys, glob
import numpy as np
import pandas as pd
import time
import datetime
from joblib import Parallel, delayed
from sklearn.metrics import f1_score, log_loss, classification_report
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
%pylab inline
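def read_feat(path, test_mode=False):
    df = pd.read_csv(path)
    # Reverse the row order before taking differences.
    df = df.iloc[::-1]

    # Note: despite the flag's name, the test_mode=True branch is the one
    # that reads the 'type' label column.
    if test_mode:
        df_feat = [df['渔船ID'].iloc[0], df['type'].iloc[0]]
        df = df.drop(['type'], axis=1)
    else:
        df_feat = [df['渔船ID'].iloc[0]]

    # Timestamps use the "%m%d %H:%M:%S" format, so the parsed year defaults to 1900.
    df['time'] = df['time'].apply(lambda x: datetime.datetime.strptime(x, "%m%d %H:%M:%S"))
    # Row-to-row differences; drop the first row, which has no predecessor.
    df_diff = df.diff(1).iloc[1:]
    df_diff['time_seconds'] = df_diff['time'].dt.total_seconds()
    # Straight-line distance between consecutive points.
    df_diff['dis'] = np.sqrt(df_diff['x']**2 + df_diff['y']**2)

    # Time features: number of distinct days, earliest hour, latest hour, most frequent hour.
    df_feat.append(df['time'].dt.day.nunique())
    df_feat.append(df['time'].dt.hour.min())
    df_feat.append(df['time'].dt.hour.max())
    df_feat.append(df['time'].dt.hour.value_counts().index[0])
    # Speed features ('速度' is the speed column).
    df_feat.append(df['速度'].min())
    df_feat.append(df[&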