first_notebook

YC_lemon

Last modified: 2023-08-06 17:11:14

Tags: python

First published: 2023-08-06 15:10:57

Article link: https://blog.csdn.net/YC_lemon/article/details/132131200

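How to use Jupyter Notebook

Basic usage

After installing Anaconda, search for Jupyter and launch it.

Set a cell to Markdown (the default is Code) to edit text. By default Jupyter runs in the cmd environment; when launched through Anaconda it runs in the base environment.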

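Headings
A # followed by a space makes a heading; each deeper heading level adds one more #.
Numbered lists
A number followed by a period and a space makes a numbered item. The marker turning blue shows that it worked; unordered lists behave the same way.

2.1 Indent two spaces, then use *, - or + followed by a space to create the next level.

Unordered lists

A + followed by a space is enough.
When you finish writing, press Ctrl + Enter to run the cell and display the rendered result.

Line breaks
Type two spaces, then press Enter.
To delete a cell, click the blank area to the left of the cell and press D twice.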

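Pitfalls when installing the jupyter_contrib_nbextensions extension

For a first-time installation, run the following commands (preferably in Anaconda Prompt):
conda install -c conda-forge jupyter_nbextensions_configurator
conda install -c conda-forge jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextensions_configurator enable --user
If the configurator was installed before, uninstall it first:
conda remove jupyter_nbextensions_configurator
If a permission problem appears with the message EnvironmentNotWritableError:
Right-click Anaconda Prompt and run it as administrator.
If conda cannot connect and reports CondaHTTPError: HTTP 000 CONNECTION FAILED for url:
Follow the hint in the error message and simply retry; running the command again is often enough.
Alternatively, see https://blog.csdn.net/u012961177/article/details/105808889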

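Solution approach

Predicting future mid-price movement from past and current data is a time-series prediction problem; the baseline below treats each target as a three-class classification task. Tree-based machine learning methods such as CatBoost, LightGBM and XGBoost are used. Tree models handle numerical data well and are relatively interpretable.

Data exploration
Feature engineering
Model training
Model validation -> refine the feature engineering
Output of results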

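Baseline code walkthrough

About the sklearn library

sklearn (scikit-learn) is a machine learning library for Python. It covers sample datasets, data preprocessing, model validation, feature selection, classification, regression, clustering, dimensionality reduction and nearly every other stage of a machine learning workflow.

1. model_selection: model selection, including dataset splitting, parameter tuning and validation

KFold: splits all samples into k groups, called folds, of equal size where possible. By default the folds are consecutive blocks; pass shuffle=True to assign samples randomly. Each split uses k-1 folds as training data and the remaining fold as test data, giving k splits in total. kf=KFold(n_splits=4)
StratifiedKFold: a variant of KFold that groups by stratum (similar to stratified sampling), so that the class proportions in each fold match those of the whole dataset as closely as possible. It is usually the better choice for classification when classes are imbalanced. skf=StratifiedKFold(n_splits=4)
GroupKFold: guarantees that samples from the same group never appear in both the training set and the test set.

Reference: https://zhuanlan.zhihu.com/p/52515873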
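To make the KFold description concrete, here is a minimal pure-Python sketch of how an unshuffled k-fold split assigns indices. The helper name `kfold_indices` is hypothetical and written only for illustration; it is not part of sklearn. It follows sklearn's documented rule that the first `n_samples % n_splits` folds receive one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) index lists for an unshuffled k-fold split."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds are one sample larger than the rest.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


splits = list(kfold_indices(10, 4))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {i}: test={test_idx} train={train_idx}")
```

With sklearn itself, `KFold(n_splits=4).split(X)` yields index arrays in the same order; passing `shuffle=True` together with a `random_state` randomizes which samples land in each fold.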
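2. metrics: evaluation metrics

accuracy_score: accuracy, the fraction of all instances whose labels are predicted correctly.
f1_score: a metric combining precision and recall. F1 = 2 * (precision * recall) / (precision + recall), where precision = TP / (TP + FP) and recall = TP / (TP + FN).
roc_auc_score: the ROC (Receiver Operating Characteristic) curve plots the true positive rate (TPR) against the false positive rate (FPR); the area under the curve is the AUC. The vertical axis is TPR = TP / P, where P is the number of actual positive samples and TP is how many of those P samples the classifier predicts as positive. The horizontal axis is FPR = FP / N, where N is the number of actual negative samples and FP is how many of those N samples the classifier predicts as positive.
log_loss: the logarithmic loss between predicted probabilities and true labels. log_loss = -(1/N) * Σ(y * log(y_pred) + (1 - y) * log(1 - y_pred)), where N is the number of instances, y is the true label (0 or 1) and y_pred is the predicted probability of the positive class.
mean_squared_log_error: mean squared logarithmic error, the mean of the squared differences between the log of the predictions and the log of the true values. MSLE = (1/N) * Σ(log(1 + y_pred) - log(1 + y_true))^2, where N is the number of instances, y_true is the true target value and y_pred is the predicted target value.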
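The formulas above can be checked by hand on a few values. The sketch below implements them directly with the standard library; the function names and the toy labels are made up for illustration, and the sklearn functions of the same purpose handle more cases (multi-class averaging, sample weights) than this does.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1) of a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def msle(y_true, y_pred):
    """Mean of (log(1 + y_pred) - log(1 + y_true))^2."""
    return sum((math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (assumed values): 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred), f1_binary(y_true, y_pred))
print(binary_log_loss([1, 0], [0.5, 0.5]))  # equals log(2) for a coin-flip predictor
```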

# _ = !pip install --upgrade catboost xgboost lightgbm  # install or upgrade the third-party libraries in the local environment
import numpy as np
import pandas as pd # numpy and pandas ship with the environment, but the versions may be old; upgrade with: pip install --upgrade numpy pandas
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss, mean_squared_log_error
import tqdm, sys, os, gc, argparse, warnings  # tqdm is a progress-bar library
# argparse lets you pass arguments to a program from the command line when running it
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')  # suppress warning messages

Kinds of warning filters
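The first argument of `warnings.filterwarnings` is an action; the standard library accepts "default", "error", "ignore", "always", "module" and "once". The sketch below, with a made-up `noisy` function, records which warnings get through under three different filter setups. It is only an illustration of the filter mechanics, not part of the baseline.

```python
import warnings


def noisy():
    """Emit one DeprecationWarning and one UserWarning (illustrative only)."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    warnings.warn("something looks off", UserWarning)
    return "done"


# "always": every warning is reported each time it is raised.
with warnings.catch_warnings(record=True) as caught_all:
    warnings.simplefilter("always")
    noisy()

# Ignore one category only; filters added later take precedence over earlier ones.
with warnings.catch_warnings(record=True) as caught_filtered:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    noisy()

# "ignore" for everything, which is what warnings.filterwarnings('ignore') does.
with warnings.catch_warnings(record=True) as caught_none:
    warnings.simplefilter("ignore")
    noisy()

print(len(caught_all), len(caught_filtered), len(caught_none))
```

Silencing everything keeps notebook output tidy, but filtering by category is the safer habit because it still surfaces warnings you did not anticipate.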

Data exploration and visualization

# Read the data
path = 'AI量化模型预测挑战赛公开数据/'
# os.listdir() returns a list of the names of the files and folders inside the given directory
train_files = os.listdir(path+'train')
# Create an empty DataFrame, train_df, to hold the training data (test_df is created the same way below)
train_df = pd.DataFrame()
for filename in tqdm.tqdm(train_files):
    tmp = pd.read_csv(path+'train/'+filename)
    # Add the file name as a new column of the data just read
    tmp['file'] = filename
    # pd.concat() appends tmp to train_df row-wise, building up the full training set
    train_df = pd.concat([train_df, tmp], axis=0, ignore_index=True)

test_files = os.listdir(path+'test')
test_df = pd.DataFrame()
for filename in tqdm.tqdm(test_files):
    tmp = pd.read_csv(path+'test/'+filename)
    tmp['file'] = filename
    test_df = pd.concat([test_df, tmp], axis=0, ignore_index=True)

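Visualizing bid and ask prices

Pick the data of any one stock and plot it to observe how the bid and ask prices relate. A brief introduction to the two prices:

The bid price is the highest price a buyer is willing to pay for a stock or asset.
The ask price is the lowest price a seller is willing to accept for a stock or asset.
The difference between the two is called the spread; the smaller the spread, the more liquid the instrument.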
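A small worked example of the spread, together with the mid-price (the mean of bid and ask) that the post plots further on. The quotes below are invented numbers chosen only to contrast a tight and a wide spread.

```python
def spread(bid, ask):
    """Difference between the best ask and the best bid."""
    return ask - bid


def mid_price(bid, ask):
    """Mean of the best bid and the best ask."""
    return (bid + ask) / 2


# Hypothetical quotes: (best bid, best ask)
quotes = {"liquid": (10.00, 10.01), "illiquid": (10.00, 10.25)}
for name, (bid, ask) in quotes.items():
    print(name, round(spread(bid, ask), 4), round(mid_price(bid, ask), 4))
```

Both instruments have almost the same mid-price, yet trading the second one immediately costs far more because of the wider spread.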

cols = ['n_bid1','n_bid2','n_ask1','n_ask2']
# Select the rows of train_df whose file column equals 'snapshot_sym7_date22_pm.csv'
# and keep the last 500 rows as tmp_df
tmp_df = train_df[train_df['file']=='snapshot_sym7_date22_pm.csv'].reset_index(drop=True)[-500:]
# Reset the index of tmp_df, then call reset_index again so the new index becomes an 'index' column
tmp_df = tmp_df.reset_index(drop=True).reset_index()
# Create the figure once, outside the loop, so that all four subplots share it
plt.figure(figsize=(20,5))
for num, col in enumerate(cols):
    plt.subplot(4,1,num+1)
    plt.plot(tmp_df['index'],tmp_df[col])
    plt.title(col)
plt.tight_layout()
plt.show()
plt.figure(figsize=(20,5))
# Plot every column against the index on a single figure, using label to name each line
for num, col in enumerate(cols):
    plt.plot(tmp_df['index'],tmp_df[col],label=col)
plt.legend(fontsize=12)

Visualizing the mid-price, i.e. the mean of the bid and ask prices

plt.figure(figsize=(20,5))

for num, col in enumerate(cols):
    
    plt.plot(tmp_df['index'],tmp_df[col],label=col)
# lw is short for linewidth
plt.plot(tmp_df['index'],tmp_df['n_midprice'],label="n_midprice",lw=10)
plt.legend(fontsize=12)

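Volatility is an important statistic describing how a stock's price changes, so to compute price changes we first need a valuation of the stock at fixed intervals.
Here the weighted average price (WAP) computed from the provided data is plotted; changes in WAP reflect how much the stock fluctuates.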

train_df['wap1'] = (train_df['n_bid1']*train_df['n_bsize1'] + train_df['n_ask1']*train_df['n_asize1'])/(train_df['n_bsize1'] + train_df['n_asize1'])
test_df['wap1'] = (test_df['n_bid1']*test_df['n_bsize1'] + test_df['n_ask1']*test_df['n_asize1'])/(test_df['n_bsize1'] + test_df['n_asize1'])

tmp_df = train_df[train_df['file']=='snapshot_sym7_date22_pm.csv'].reset_index(drop=True)[-500:]
tmp_df = tmp_df.reset_index(drop=True).reset_index()
plt.figure(figsize=(20,5))
plt.plot(tmp_df['index'], tmp_df['wap1'])
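To see what the `wap1` column computes, here is the same arithmetic on a single made-up quote. The function follows the post's formula exactly (each price weighted by the size on its own side); some order-book write-ups instead weight each price by the size on the opposite side, so treat this as a sketch of this baseline's definition rather than the only one.

```python
def wap(bid, bid_size, ask, ask_size):
    """Weighted average price as computed for the post's `wap1` column."""
    total = bid_size + ask_size
    if total == 0:
        raise ValueError("bid_size + ask_size must be non-zero")
    return (bid * bid_size + ask * ask_size) / total


# Hypothetical level-1 quotes: bid 10.0, ask 10.5
balanced = wap(10.0, 100, 10.5, 100)   # equal sizes: WAP equals the mid-price, 10.25
bid_heavy = wap(10.0, 300, 10.5, 100)  # more size on the bid pulls WAP toward 10.0
print(balanced, bid_heavy)
```

When the sizes are equal the result coincides with the mid-price; an imbalance in sizes moves it toward the heavier side.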

Feature engineering

Build basic time features: hour and minute

# Time-related features
# Split the time column into hour and minute, using ':' as the separator
train_df['hour'] = train_df['time'].apply(lambda x:int(x.split(':')[0]))
test_df['hour'] = test_df['time'].apply(lambda x:int(x.split(':')[0]))

train_df['minute'] = train_df['time'].apply(lambda x:int(x.split(':')[1]))
test_df['minute'] = test_df['time'].apply(lambda x:int(x.split(':')[1]))

# Features fed to the model
# Store every column name except 'uuid', 'time' and 'file' in the cols list
cols = [f for f in test_df.columns if f not in ['uuid','time','file']]
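The time-feature step can be tried in isolation on plain strings. The sample values and the 'HH:MM:SS' layout are assumptions for illustration (the code above only relies on ':' separating the fields), and `split_time` is a hypothetical helper name.

```python
def split_time(value):
    """Return (hour, minute) from a ':'-separated time string."""
    parts = value.split(":")
    return int(parts[0]), int(parts[1])


# Made-up timestamps in an assumed HH:MM:SS layout
samples = ["09:40:03", "13:05:57"]
features = [split_time(s) for s in samples]
print(features)  # [(9, 40), (13, 5)]

# The same exclusion logic used to build the model's feature list
columns = ["uuid", "time", "file", "n_bid1", "n_ask1", "hour", "minute"]
model_cols = [c for c in columns if c not in ("uuid", "time", "file")]
print(model_cols)
```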

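Model training and validation

CatBoost is chosen here. It is commonly used as a baseline model in machine learning competitions and gives fairly stable scores without much parameter tuning. The data is split and validated with five-fold cross-validation, and the predictions of the five models are averaged for the final submission.

Introduction to CatBoost

CatBoost (categorical boosting) is a gradient boosting library that handles categorical features well; the name combines Categorical and Boosting. It addresses gradient bias and prediction shift, which reduces overfitting and improves the accuracy and generalization of the algorithm.
Detailed code walkthrough: https://zhuanlan.zhihu.com/p/540956200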

# train_x: training features; train_y: training labels; test_x: test features
def cv_model(clf, train_x, train_y, test_x, clf_name, seed = 2023):
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    # Out-of-fold (OOF) predictions from cross-validation
    oof = np.zeros([train_x.shape[0], 3])
    # Predictions for the test set
    test_predict = np.zeros([test_x.shape[0], 3])
    cv_scores = []
    # kf.split() splits the training data into a training part and a validation part for each fold
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # iloc selects rows by integer position
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y.iloc[train_index], train_x.iloc[valid_index], train_y.iloc[valid_index]
       
        if clf_name == "cat":
            params = {'learning_rate': 0.2, 'depth': 6, 'bootstrap_type':'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 100, 'random_seed': 11, 'allow_writing_files': False,
                      'loss_function': 'MultiClass'}
            
            model = clf(iterations=100, **params)
            # Train on the training part; the validation part is used for early stopping and for picking the best model
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      metric_period=20,
                      use_best_model=True, 
                      cat_features=[],
                      verbose=1)
            # Predict on the validation and test sets with the trained model to get probability matrices
            val_pred  = model.predict_proba(val_x)
            test_pred = model.predict_proba(test_x)
        # Store the validation predictions in the matching out-of-fold positions
        oof[valid_index] = val_pred
        # Accumulate each fold's test predictions divided by the number of folds, giving the average prediction
        test_predict += test_pred / kf.n_splits
        # Compute the macro F1 score on the validation part and append it to cv_scores
        F1_score = f1_score(val_y, np.argmax(val_pred, axis=1), average='macro')
        cv_scores.append(F1_score)
        print(cv_scores)
    # Return the out-of-fold predictions (oof) and the test predictions (test_predict)
    return oof, test_predict
    
for label in ['label_5','label_10','label_20','label_40','label_60']:
    print(f'=================== {label} ===================')
    cat_oof, cat_test = cv_model(CatBoostClassifier, train_df[cols], train_df[label], test_df[cols], 'cat')
    # Convert class probabilities to predicted class labels (this overwrites the true labels in train_df)
    train_df[label] = np.argmax(cat_oof, axis=1)
    test_df[label] = np.argmax(cat_test, axis=1)
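The averaging step in `cv_model` is easy to lose among the training details, so here it is on its own with plain lists: each fold contributes `probabilities / n_folds`, and the class is taken from the averaged row. The probability values are invented, only two folds are used for brevity, and the helper names are hypothetical.

```python
def average_fold_probs(fold_probs):
    """Average per-fold probability matrices of shape (n_samples, n_classes)."""
    n_folds = len(fold_probs)
    n_samples = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    avg = [[0.0] * n_classes for _ in range(n_samples)]
    for probs in fold_probs:
        for i, row in enumerate(probs):
            for j, p in enumerate(row):
                # Same accumulation as `test_predict += test_pred / kf.n_splits`
                avg[i][j] += p / n_folds
    return avg


def argmax_rows(matrix):
    """Index of the largest value in each row, like np.argmax(..., axis=1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Made-up test-set probabilities from two folds, two samples, three classes
fold_probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]],
    [[0.5, 0.4, 0.1], [0.1, 0.5, 0.4]],
]
avg = average_fold_probs(fold_probs)
labels = argmax_rows(avg)
print(avg, labels)
```

For the second sample the folds disagree (class 2 versus class 1); averaging the probabilities before taking the argmax settles it on class 2, which is the point of blending the fold models rather than picking one.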

Output of results

test_df.head(5) # check that the output matches the required format

import pandas as pd
import os


# Path of the output folder
output_dir = './submit'

# Create the folder if it does not exist
os.makedirs(output_dir, exist_ok=True)

# First group the dataframe by the 'file' column
grouped = test_df.groupby('file')

# Process each group
for file_name, group in grouped:
    # Select the required columns
    selected_cols = group[['uuid', 'label_5', 'label_10', 'label_20', 'label_40', 'label_60']]
    
    # Save them as a CSV file, using file_name as the file name
    selected_cols.to_csv(os.path.join(output_dir, f'{file_name}'), index=False)

# Zip the submit folder to get the submission archive (IPython shell escape; needs the zip command)
_ = !zip -r submit.zip submit/
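The `!zip` line is an IPython shell escape and relies on a `zip` executable being available, which may not be the case on Windows. As a hedged alternative, the standard library's `shutil.make_archive` can build the same archive from plain Python. The helper name `zip_folder`, the temporary directory and the `example.csv` file below are all made up for the demonstration.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(folder, archive_base):
    """Zip `folder` (keeping the folder name inside the archive) into archive_base + '.zip'."""
    folder = os.path.abspath(folder)
    return shutil.make_archive(
        archive_base,
        "zip",
        root_dir=os.path.dirname(folder),
        base_dir=os.path.basename(folder),
    )


# Demonstration in a throwaway directory so nothing in the working folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    submit_dir = os.path.join(tmp, "submit")
    os.makedirs(submit_dir)
    with open(os.path.join(submit_dir, "example.csv"), "w") as fh:
        fh.write("uuid,label_5\n0,1\n")
    archive = zip_folder(submit_dir, os.path.join(tmp, "submit"))
    archive_name = os.path.basename(archive)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(archive_name, names)
```

In the notebook itself the equivalent call would be `zip_folder('./submit', 'submit')`, which writes `submit.zip` next to the `submit` folder.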
