Predicting Interest Rate with Classification Models - Part 3

This is the final article of the series "Predicting Interest Rate with Classification Models". Here are the links to the First and the Second articles of the series, where I explain the challenge I faced when I started at M2X Investments, in case you haven't read them yet. As I mentioned before, I will try my best to make this article understandable on its own. I will skip the explanation of the assumptions regarding the data for article-length reasons; nevertheless, you can check them in the previous posts of the series. Let's do it!

Fast Recap

In the previous articles, I applied a couple of classification models to the problem of predicting upward movements of the Fed Funds Effective Rate. In short, it is a binary classification problem where 1 represents an upward movement and 0 a neutral or negative movement. The models applied were Logistic Regression, Naive Bayes, and Random Forest. Random Forest was the one that yielded the best results so far, without hyperparameter optimization, with an F1-score of 0.76.

If you are curious to know more about the data, please refer to Part 1 or Part 2 of the series. I omitted the explanation about them in this article for practical purposes only.

A brief introduction to Catboost and Support Vector Machines

Catboost

Catboost is an open-source library for gradient boosting on decision trees. Ok, so what is Gradient Boosting?

Gradient boosting is a machine learning algorithm that can be used for classification and regression problems. It usually gives great results when tackling heterogeneous data and small-dataset problems. But what, in essence, is this algorithm? Let's start by defining Boosting.

Boosting is an ensemble technique that tries to turn weak learners into strong learners by training them sequentially, with the objective of making each one better than its predecessors. The sequential part means that each learner (usually a tree) is built by taking the previous tree's errors into account (in the case of the AdaBoost algorithm, the trees are called stumps).

As an example, imagine that we train a tree and give every observation an equal weight. Next, we evaluate the tree and get its errors. Then, for the next tree, we increase the weights of the observations that were incorrectly classified by the first one and lower the weights of the ones correctly classified. This is basically saying that the next tree should give more importance to those mistakenly classified observations and classify them correctly. The process goes on until we stop and take the final vote of our trees.
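
As a rough sketch of this weight-update idea (illustrative only, not code from the original article), scikit-learn lets us pass sample weights when fitting each tree, so a simplified AdaBoost-style loop could look like the following; the update factors are arbitrary simplifications rather than the real AdaBoost formula:

# a simplified AdaBoost-style weight update (illustrative sketch only)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    # start with equal weights for every observation
    weights = np.full(len(y), 1 / len(y))
    learners = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # a "stump", as in AdaBoost
        stump.fit(X, y, sample_weight=weights)
        wrong = stump.predict(X) != y
        # increase the weight of misclassified observations, decrease the others
        weights[wrong] *= 1.5
        weights[~wrong] *= 0.5
        weights /= weights.sum()
        learners.append(stump)
    return learners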

Let's go back to Gradient Boosting now. With the concept of Boosting in mind, we can think of Gradient Boosting as an algorithm that follows the same process described above. The difference is that now we will define a loss function to be optimized (minimized). This means that, after calculating the loss, the next tree we create will have to reduce that loss (follow the gradient by reducing the residual loss).
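
As a toy illustration (not part of the original article), here is a minimal gradient-boosting loop for a squared-error loss, where the negative gradient is simply the residual each new tree is fitted to; the learning rate, tree depth, and number of trees are arbitrary choices for the sketch:

# a minimal gradient-boosting sketch with squared-error loss (illustrative only)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=50, learning_rate=0.1):
    # y is assumed to be a 1-D numeric numpy array
    # start from a constant prediction (the mean of y)
    baseline = np.mean(y)
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        # for squared error, the negative gradient is just the residual
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        # move the prediction a small step in the direction that reduces the loss
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees

def gradient_boost_predict(X, baseline, trees, learning_rate=0.1):
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction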

What about Catboost?

Catboost is a gradient boosting decision tree library. On its page, the developers say that it performs well with default parameters, supports categorical features out of the box, ships with built-in model analysis tools, and trains fast on both CPU and GPU.
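
For instance (a minimal sketch with made-up data and column names, not the dataset used in this article), categorical columns can be handed to CatBoost directly via the cat_features argument, without manual one-hot encoding:

# minimal sketch: CatBoost handling a categorical column natively (hypothetical data)
import pandas as pd
from catboost import CatBoostClassifier

toy = pd.DataFrame({
    'sector': ['energy', 'metals', 'agri', 'energy'],   # categorical feature
    'spread': [0.8, 1.2, 0.5, 0.9],                     # numeric feature
    'label':  [1, 0, 0, 1],
})

model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(toy[['sector', 'spread']], toy['label'], cat_features=['sector'])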

Support Vector Machines - SVC

SVMs are supervised learning algorithms used for classification, regression, and outlier detection. We will use a Support Vector Classifier (SVC) to find a hyperplane in n-dimensional space that accurately classifies the data. This hyperplane will have the maximum distance to the data points of the different classes; this distance is called the maximum margin. Let's take a two-dimensional space as an example: the hyperplane will be a line dividing the space into two parts, with maximum distance between the classification labels.

Image by LAMFO

The data points closest to the line separating the space are called Support Vectors and will dictate the hyperplane's margin. So we start with our data in a low dimension and, if we can't classify it in that dimension, we move to a higher dimension to find a Support Vector Classifier that will best divide our data into two groups, and so on. To transform the plane that the data lies on and find our Support Vector Classifier, we use a function called a Kernel.
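
As a toy illustration (made-up 2D points, not the article's data), fitting a linear-kernel SVC exposes the support vectors, the points that sit closest to the separating line and define the margin:

# minimal sketch: support vectors of a linear-kernel SVC on made-up 2D points
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 1.5],    # class 0
              [6.0, 6.5], [7.0, 8.0], [8.0, 7.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# the observations that dictate the margin
print(clf.support_vectors_)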

The Kernel Function can have different shapes; for example, it can be a polynomial kernel or a radial kernel. It is important to notice that, for the sake of computational cost, kernel functions calculate the data relationships as if the data were in a higher dimension; in reality, however, the data is never actually transformed into that dimension. This trick is called the Kernel Trick.
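
As a small sketch (toy data generated with scikit-learn, not the article's dataset), the kernel is just an argument to SVC, so comparing a polynomial kernel with a radial (RBF) kernel is a one-line change:

# minimal sketch: polynomial vs radial kernel on toy data (illustrative only)
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

for kernel in ('poly', 'rbf'):
    clf = SVC(kernel=kernel, gamma='scale')
    clf.fit(X_train, y_train)
    print(kernel, 'accuracy:', round(clf.score(X_test, y_test), 3))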

The code

Catboost

Starting with Catboost. In the previous articles, we talked about the data and the assumptions we made to binarize it and deal with NaNs, so I will skip the explanation of this part and focus on the model's results and their application.

import numpy as np
import pandas as pd
import quandl as qdl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from catboost import CatBoostClassifier
from sklearn import metrics

# get data from Quandl
data = pd.DataFrame()
meta_data = ['RICIA','RICIM','RICIE']
for code in meta_data:
    df = qdl.get('RICI/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

meta_data = ['EMHYY','AAAEY','USEY']
for code in meta_data:
    df = qdl.get('ML/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

# dealing with possible empty values (not much attention to this part, but it is very important)
data.fillna(data.mean(), inplace=True)
print(data.head())
print("\nData shape:\n", data.shape)

# histograms
data.hist()
plt.show()

# scaling values to make them vary between 0 and 1
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)

# pulling dependent variable from Quandl (par yield curve)
par_yield = qdl.get('FED/RIFSPFF_N_D', start_date="2005-01-03", end_date="2020-07-01")
par_yield.columns = ['FED/RIFSPFF_N_D']

# create an empty df with same index as variables and fill it with our independent var values
par_data = pd.DataFrame(index=data_scaled.index, columns=['FED/RIFSPFF_N_D'])
par_data.update(par_yield['FED/RIFSPFF_N_D'])

# get the variation and binarize it
par_data = par_data.pct_change()
par_data.fillna(0, inplace=True)
par_data = par_data.apply(lambda x: [0 if y <= 0 else 1 for y in x])
print("Number of 0 and 1s:\n", par_data.value_counts())

# plot number of 0 and 1s
sns.countplot(x='FED/RIFSPFF_N_D', data=par_data, palette='Blues')
plt.title('0s and 1s')
plt.savefig('0s and 1s')

# Over-sampling with ADASYN method
sampler = ADASYN(random_state=13)
X_os, y_os = sampler.fit_sample(data_scaled, par_data.values.ravel())
columns = data_scaled.columns
data_scaled = pd.DataFrame(data=X_os, columns=columns)
par_data = pd.DataFrame(data=y_os, columns=['FED/RIFSPFF_N_D'])

print("\nProportion of 0s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==0])/len(data_scaled))
print("\nProportion of 1s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==1])/len(data_scaled))

After adjusting the proportions of 0s and 1s in our label set, we split the data into training and test sets and create the model.

# split data into test and train set
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# Catboost model
clf=CatBoostClassifier(iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None, model_size_reg=None, rsm=None, loss_function=None, border_count=None, feature_border_type=None, per_float_feature_quantization=None, input_borders=None, output_borders=None, fold_permutation_block=None, od_pval=None,od_wait=None, od_type=None,nan_mode=None, counter_calc_method=None, leaf_estimation_iterations=None, leaf_estimation_method=None, thread_count=None, random_seed=None, use_best_model=None, verbose=None, logging_level=None, metric_period=None, ctr_leaf_count_limit=None, store_all_simple_ctr=None, max_ctr_complexity=None, has_time=None, allow_const_label=None, classes_count=None, class_weights=None, one_hot_max_size=None, random_strength=None, name=None, ignored_features=None, train_dir=None, custom_loss=None, custom_metric=None, eval_metric=None, bagging_temperature=None, save_snapshot=None, snapshot_file=None, snapshot_interval=None, fold_len_multiplier=None, used_ram_limit=None, gpu_ram_part=None, allow_writing_files=None, final_ctr_computation_mode=None, approx_on_full_history=None, boosting_type=None, simple_ctr=None, combinations_ctr=None, per_feature_ctr=None, task_type=None, device_config=None, devices=None, bootstrap_type=None, subsample=None, sampling_unit=None, dev_score_calc_obj_block_size=None, max_depth=None, n_estimators=None, num_boost_round=None, num_trees=None, colsample_bylevel=None, random_state=None, reg_lambda=None, objective=None, eta=None, max_bin=None, scale_pos_weight=None, gpu_cat_features_storage=None, data_partition=None, metadata=None, early_stopping_rounds=None, cat_features=None, grow_policy=None, min_data_in_leaf=None, min_child_samples=None, max_leaves=None, num_leaves=None, score_function=None, leaf_estimation_backtracking=None, ctr_history_unit=None, monotone_constraints=None, feature_weights=None, penalties_coefficient=None, first_feature_use_penalties=None, model_shrink_rate=None, model_shrink_mode=None, langevin=None, diffusion_temperature=None, boost_from_average=None, text_features=None, tokenizers=None, dictionaries=None, feature_calcers=None, text_processing=None)

As you can see, I made sure to spell out every parameter that the model accepts. Yes, a lot! But don't worry, it is not yet time to optimize them; it is time to get a glimpse of the model's performance. So we are not going to change any of them (all of them will be None).
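
For reference (my reading of the CatBoost API, not something stated in the original article), leaving every argument as None should be equivalent to simply instantiating the classifier with no arguments at all:

# equivalent shortcut (not used below): CatBoost falls back to its built-in
# defaults for every parameter left as None
clf = CatBoostClassifier()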

clf.fit(X_train, y)
y_pred = clf.predict(X_test)
print('\nAccuracy of Catboost Classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')
Classification report | Image by Author

Catboost ROC curve | Image by Author
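
The snippet above stops at the confusion matrix, so the Catboost ROC curve shown above was presumably produced the same way as in the SVC section further down; a sketch along those lines:

# roc curve (assumed to mirror the SVC section below)
catboost_roc_auc = metrics.roc_auc_score(y_test, clf.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='CatBoost (area = %0.2f)' % catboost_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - CatBoost Classifier')
plt.legend(loc="lower right")
plt.savefig('CatBoost_ROC')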

The results show an F1-score of 0.72, close to the Random Forest model results. It seems that this model will enter our “Potential Good Models” list for further investigation and hyperparameter optimization! Let’s see what the SVC model tells us!

Support Vector Classifier

The first part of the code, up to the oversampling, is pretty much the same as posted above, so we will dive straight into the model code.

# in addition to the imports above, the SVC class is needed
from sklearn.svm import SVC

# split data into test and train set
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# Support Vector Classifier model
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=True, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=13)

# fit model
clf.fit(X_train, y)

# predict
y_pred = clf.predict(X_test)
print('\nAccuracy of SVC classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
SVC accuracy | Image by Author

The accuracy of the SVC is 0.65. Let’s see what the classification report shows us.

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')

# roc curve
logit_roc_auc = metrics.roc_auc_score(y_test, clf.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='SVC (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Support Vector Classifier')
plt.legend(loc="lower right")
plt.savefig('SVC_ROC')
SVC Confusion Matrix | Image by Author

Classification report | Image by Author

SVC ROC curve | Image by Author

Ok, so it turns out that our F1-score with the SVC model is 0.65. As we saw earlier, the CatBoost model performed better, as did the Random Forest, so we will stick with those two in our "Potential Good Models" list for the hyperparameter optimization step. With the results of the five models at hand, we ended up with two promising models. The next step would be to optimize those two models to see which performs best, but that will be another series on how to optimize and compare models.

This article was written in conjunction with Guilherme Bezerra Pujades Magalhães.

Originally published at: https://towardsdatascience.com/predicting-interest-rate-with-classification-models-part-3-3eef38dd7b32
