Predicting Interest Rate with Classification Models - Part 3

This is the final article of the series "Predicting Interest Rate with Classification Models". Here are the links to the First and the Second articles of the series, where I explain the challenge I faced when I started at M2X Investments, in case you haven't read them yet. As I mentioned before, I will try my best to make this article understandable on its own. I will skip the explanation of the assumptions regarding the data for article-length reasons; nevertheless, you can check them in the previous posts of the series. Let's do it!

Fast Recap

In the previous articles, I applied a couple of classification models to the problem of predicting upward movements of the Fed Funds Effective Rate. In short, it is a binary classification problem where 1 represents an upward movement and 0 a neutral or negative movement. The models applied were Logistic Regression, Naive Bayes, and Random Forest. Random Forest was the one that yielded the best results so far, without hyperparameter optimization, with an F1-score of 0.76.

If you are curious to know more about the data, please refer to Part 1 or Part 2 of the series. I omitted the explanation about them in this article for practical purposes only.

A brief introduction to Catboost and Support Vector Machines

Catboost

Catboost is an open-source library for gradient boosting on decision trees. Ok, so what is Gradient Boosting?

Gradient boosting is a machine learning algorithm that can be used for classification and regression problems. It usually gives great results when tackling heterogeneous data and small-dataset problems. But what, in essence, is this algorithm? Let's start by defining Boosting.

Boosting is an ensemble technique that tries to turn weak learners into strong learners by training them sequentially, with the objective of making each one better than its predecessors. The sequential part means that each learner (usually a tree) is built by taking the previous tree's errors into account (in the case of the AdaBoost algorithm, the trees are called stumps).

As an example, imagine that we train a tree and give every observation an equal weight. Next, we evaluate the tree and get its errors. Then, for the next tree, we increase the weights of the observations that were incorrectly classified by the first one and lower the weights of the ones correctly classified. This is basically saying that the next tree should give more importance to those mistakenly classified observations and classify them correctly. The process goes on until we stop and take the final vote of our trees.
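
As a rough sketch of this weight-update idea (illustrative only, not code from the original article), scikit-learn lets us pass sample weights when fitting each tree, so a simplified AdaBoost-style loop could look like the following; the update factors are arbitrary simplifications rather than the real AdaBoost formula:

# a simplified AdaBoost-style weight update (illustrative sketch only)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    # start with equal weights for every observation
    weights = np.full(len(y), 1 / len(y))
    learners = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # a "stump", as in AdaBoost
        stump.fit(X, y, sample_weight=weights)
        wrong = stump.predict(X) != y
        # increase the weight of misclassified observations, decrease the others
        weights[wrong] *= 1.5
        weights[~wrong] *= 0.5
        weights /= weights.sum()
        learners.append(stump)
    return learners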

Let's go back to Gradient Boosting now. With the concept of Boosting in mind, we can think of Gradient Boosting as an algorithm that follows the same process described above. The difference is that now we will define a loss function to be optimized (minimized). This means that, after calculating the loss, the next tree we create will have to reduce that loss (follow the gradient by reducing the residual loss).
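
As a toy illustration (not part of the original article), here is a minimal gradient-boosting loop for a squared-error loss, where the negative gradient is simply the residual each new tree is fitted to; the learning rate, tree depth, and number of trees are arbitrary choices for the sketch:

# a minimal gradient-boosting sketch with squared-error loss (illustrative only)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=50, learning_rate=0.1):
    # y is assumed to be a 1-D numeric numpy array
    # start from a constant prediction (the mean of y)
    baseline = np.mean(y)
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        # for squared error, the negative gradient is just the residual
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        # move the prediction a small step in the direction that reduces the loss
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees

def gradient_boost_predict(X, baseline, trees, learning_rate=0.1):
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction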

What about Catboost?

Catboost is a gradient boosting decision tree library. On its page, the developers say that it performs well with default parameters, supports categorical features out of the box, ships with built-in model analysis tools, and trains fast on both CPU and GPU.
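
For instance (a minimal sketch with made-up data and column names, not the dataset used in this article), categorical columns can be handed to CatBoost directly via the cat_features argument, without manual one-hot encoding:

# minimal sketch: CatBoost handling a categorical column natively (hypothetical data)
import pandas as pd
from catboost import CatBoostClassifier

toy = pd.DataFrame({
    'sector': ['energy', 'metals', 'agri', 'energy'],   # categorical feature
    'spread': [0.8, 1.2, 0.5, 0.9],                     # numeric feature
    'label':  [1, 0, 0, 1],
})

model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(toy[['sector', 'spread']], toy['label'], cat_features=['sector'])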

Support Vector Machines - SVC

SVMs are supervised learning algorithms used for classification, regression, and outlier detection. We will use a Support Vector Classifier (SVC) to find a hyperplane in n-dimensional space that accurately classifies the data. This hyperplane will have the maximum distance to the data points of the different classes; this distance is called the maximum margin. Let's take a two-dimensional space as an example: the hyperplane will be a line dividing the space into two parts, with maximum distance between the classification labels.

Image by LAMFO

The data points closest to the line separating the space are called Support Vectors and will dictate the hyperplane's margin. So we start with our data in a low dimension and, if we can't classify it in that dimension, we move to a higher dimension to find a Support Vector Classifier that will best divide our data into two groups, and so on. To transform the plane that the data lies on and find our Support Vector Classifier, we use a function called a Kernel.
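
As a toy illustration (made-up 2D points, not the article's data), fitting a linear-kernel SVC exposes the support vectors, the points that sit closest to the separating line and define the margin:

# minimal sketch: support vectors of a linear-kernel SVC on made-up 2D points
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 1.5],    # class 0
              [6.0, 6.5], [7.0, 8.0], [8.0, 7.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# the observations that dictate the margin
print(clf.support_vectors_)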

The Kernel Function can have different shapes; for example, it can be a polynomial kernel or a radial kernel. It is important to notice that, for the sake of computational cost, kernel functions calculate the data relationships as if the data were in a higher dimension; in reality, however, the data is never actually transformed into that dimension. This trick is called the Kernel Trick.
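
As a small sketch (toy data generated with scikit-learn, not the article's dataset), the kernel is just an argument to SVC, so comparing a polynomial kernel with a radial (RBF) kernel is a one-line change:

# minimal sketch: polynomial vs radial kernel on toy data (illustrative only)
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

for kernel in ('poly', 'rbf'):
    clf = SVC(kernel=kernel, gamma='scale')
    clf.fit(X_train, y_train)
    print(kernel, 'accuracy:', round(clf.score(X_test, y_test), 3))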

The code

Catboost

Starting with Catboost. In the previous articles, we talked about the data and the assumptions we made to binarize it and deal with NaNs, so I will skip the explanation of this part and focus on the model's results and their application.

import numpy as np
import pandas as pd
import quandl as qdl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from catboost import CatBoostClassifier
from sklearn import metrics

# get data from Quandl
data = pd.DataFrame()
meta_data = ['RICIA','RICIM','RICIE']
for code in meta_data:
    df = qdl.get('RICI/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

meta_data = ['EMHYY','AAAEY','USEY']
for code in meta_data:
    df = qdl.get('ML/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

# dealing with possible empty values (not much attention to this part, but it is very important)
data.fillna(data.mean(), inplace=True)
print(data.head())
print("\nData shape:\n", data.shape)

# histograms
data.hist()
plt.show()

# scaling values to make them vary between 0 and 1
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)

# pulling dependent variable from Quandl (par yield curve)
par_yield = qdl.get('FED/RIFSPFF_N_D', start_date="2005-01-03", end_date="2020-07-01")
par_yield.columns = ['FED/RIFSPFF_N_D']

# create an empty df with same index as variables and fill it with our independent var values
par_data = pd.DataFrame(index=data_scaled.index, columns=['FED/RIFSPFF_N_D'])
par_data.update(par_yield['FED/RIFSPFF_N_D'])

# get the variation and binarize it
par_data = par_data.pct_change()
par_data.fillna(0, inplace=True)
par_data = par_data.apply(lambda x: [0 if y <= 0 else 1 for y in x])
print("Number of 0 and 1s:\n", par_data.value_counts())

# plot number of 0 and 1s
sns.countplot(x='FED/RIFSPFF_N_D', data=par_data, palette='Blues')
plt.title('0s and 1s')
plt.savefig('0s and 1s')

# Over-sampling with ADASYN method
sampler = ADASYN(random_state=13)
X_os, y_os = sampler.fit_sample(data_scaled, par_data.values.ravel())
columns = data_scaled.columns
data_scaled = pd.DataFrame(data=X_os, columns=columns)
par_data = pd.DataFrame(data=y_os, columns=['FED/RIFSPFF_N_D'])

print("\nProportion of 0s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==0])/len(data_scaled))
print("\nProportion of 1s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==1])/len(data_scaled))

After adjusting the proportions of 0s and 1s in our label set, we split the data into training and test sets and create the model.

# split data into test and train set
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# Catboost model
clf=CatBoostClassifier(iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None, model_size_reg=None, rsm=None, loss_function=None, border_count=None, feature_border_type=None, per_float_feature_quantization=None, input_borders=None, output_borders=None, fold_permutation_block=None, od_pval=None,od_wait=None, od_type=None,nan_mode=None, counter_calc_method=None, leaf_estimation_iterations=None, leaf_estimation_method=None, thread_count=None, random_seed=None, use_best_model=None, verbose=None, logging_level=None, metric_period=None, ctr_leaf_count_limit=None, store_all_simple_ctr=None, max_ctr_complexity=None, has_time=None, allow_const_label=None, classes_count=None, class_weights=None, one_hot_max_size=None, random_strength=None, name=None, ignored_features=None, train_dir=None, custom_loss=None, custom_metric=None, eval_metric=None, bagging_temperature=None, save_snapshot=None, snapshot_file=None, snapshot_interval=None, fold_len_multiplier=None, used_ram_limit=None, gpu_ram_part=None, allow_writing_files=None, final_ctr_computation_mode=None, approx_on_full_history=None, boosting_type=None, simple_ctr=None, combinations_ctr=None, per_feature_ctr=None, task_type=None, device_config=None, devices=None, bootstrap_type=None, subsample=None, sampling_unit=None, dev_score_calc_obj_block_size=None, max_depth=None, n_estimators=None, num_boost_round=None, num_trees=None, colsample_bylevel=None, random_state=None, reg_lambda=None, objective=None, eta=None, max_bin=None, scale_pos_weight=None, gpu_cat_features_storage=None, data_partition=None, metadata=None, early_stopping_rounds=None, cat_features=None, grow_policy=None, min_data_in_leaf=None, min_child_samples=None, max_leaves=None, num_leaves=None, score_function=None, leaf_estimation_backtracking=None, ctr_history_unit=None, monotone_constraints=None, feature_weights=None, penalties_coefficient=None, first_feature_use_penalties=None, model_shrink_rate=None, model_shrink_mode=None, langevin=None, diffusion_temperature=None, boost_from_average=None, text_features=None, tokenizers=None, dictionaries=None, feature_calcers=None, text_processing=None)

As you can see, I made sure to spell out every parameter that the model accepts. Yes, a lot! But don't worry, it is not yet time to optimize them; it is time to get a glimpse of the model's performance. So we are not going to change any of them (all of them will be None).
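
For reference (my reading of the CatBoost API, not something stated in the original article), leaving every argument as None should be equivalent to simply instantiating the classifier with no arguments at all:

# equivalent shortcut (not used below): CatBoost falls back to its built-in
# defaults for every parameter left as None
clf = CatBoostClassifier()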

clf.fit(X_train, y)
y_pred = clf.predict(X_test)
print('\nAccuracy of Catboost Classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')
Classification report | Image by Author

Catboost ROC curve | Image by Author
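
The snippet above stops at the confusion matrix, so the Catboost ROC curve shown above was presumably produced the same way as in the SVC section further down; a sketch along those lines:

# roc curve (assumed to mirror the SVC section below)
catboost_roc_auc = metrics.roc_auc_score(y_test, clf.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='CatBoost (area = %0.2f)' % catboost_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - CatBoost Classifier')
plt.legend(loc="lower right")
plt.savefig('CatBoost_ROC')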

The results show an F1-score of 0.72, close to the Random Forest model results. It seems that this model will enter our “Potential Good Models” list for further investigation and hyperparameter optimization! Let’s see what the SVC model tells us!

Support Vector Classifier

The first part of the code, up to the oversampling, is pretty much the same as posted above, so we will dive straight into the model code.

# in addition to the imports above, the SVC class is needed
from sklearn.svm import SVC

# split data into test and train set
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# Support Vector Classifier model
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=True, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=13)

# fit model
clf.fit(X_train, y)

# predict
y_pred = clf.predict(X_test)
print('\nAccuracy of SVC classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
SVC accuracy | Image by Author

The accuracy of the SVC is 0.65. Let’s see what the classification report shows us.

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')

# roc curve
logit_roc_auc = metrics.roc_auc_score(y_test, clf.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='SVC (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Support Vector Classifier')
plt.legend(loc="lower right")
plt.savefig('SVC_ROC')
SVC Confusion Matrix | Image by Author

Classification report | Image by Author

SVC ROC curve | Image by Author

Ok, so it turns out that our F1-score with the SVC model is 0.65. As we saw earlier, the CatBoost model performed better, as did the Random Forest, so we will stick with those two in our "Potential Good Models" list for the hyperparameter optimization step. With the results of the five models at hand, we ended up with two promising models. The next step would be to optimize those two models to see which performs best, but that will be another series on how to optimize and compare models.

This article was written in conjunction with Guilherme Bezerra Pujades Magalhães.

Originally published at: https://towardsdatascience.com/predicting-interest-rate-with-classification-models-part-3-3eef38dd7b32
