Predicting Interest Rate with Classification Models - Part 2

We are back! This post is a continuation of the series “Predicting Interest Rate with Classification Models”. I will do my best to make each article self-contained, so that you won't need the previous ones to get the most out of it.

Fast Recap

In the first article of the series, we applied a Logistic Regression model to predict upward movements of the Fed Funds Effective Rate. For that, we used Quandl to retrieve data from the Commodity Indices, Merrill Lynch, and US Federal Reserve databases.

[Image: Data | Image by Author]

The variables used in the series are: RICIA, the Euronext Rogers International Agriculture Commodity Index; RICIM, the Euronext Rogers International Metals Commodity Index; RICIE, the Euronext Rogers International Energy Commodity Index; EMHYY, the Emerging Markets High Yield Corporate Bond Index Yield; AAAEY, the US AAA-rated Bond Index Yield; and, finally, USEY, the US Corporate Bond Index Yield. All of them are daily values ranging from 2005-01-03 to 2020-07-01.

Now let’s move on to the intuition behind the models we will use!

A brief introduction to Naive Bayes and Random Forest

Naive Bayes

Naive Bayes is a probabilistic classification method based on Bayes' Theorem. The theorem gives us the probability of an event (A) occurring given that another event (B) has occurred.

[Image: Bayes Theorem | Image by Author]
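Since the figure is not reproduced here: in its standard form the theorem reads P(A | B) = P(B | A) · P(A) / P(B).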

Given a vector of features X = (x₁, x₂, x₃, …, xₙ), we can rewrite the equation above as

[Image: Bayes Theorem | Image by Author]
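With the feature vector X in place of B and the class y in place of A, this becomes P(y | X) = P(X | y) · P(y) / P(X).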

It is very important to keep in mind that the model relies on the assumption of conditional independence, which means that the xᵢ are conditionally independent given y.

Assuming conditional independence among the features,

[Image: Bayes Theorem | Image by Author]
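the posterior factorizes, becoming proportional to P(y) · P(x₁ | y) · P(x₂ | y) · … · P(xₙ | y).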

For our problem, we are interested in taking the category with maximum probability and labeling our prediction as 0 or 1.


[Image: Classification rule | Image by Author]
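In other words, the prediction is the class y (0 or 1) that maximizes P(y) · ∏ᵢ P(xᵢ | y).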

There are three types of Naive Bayes methods: Multinomial, Bernoulli, and Gaussian. We are going to use the Gaussian type, which is used when the predictors take continuous values.
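In the Gaussian case, each conditional likelihood is modeled as a normal distribution: P(xᵢ | y) = 1 / √(2πσᵧ²) · exp(−(xᵢ − μᵧ)² / (2σᵧ²)), where μᵧ and σᵧ² are the mean and variance of that feature within class y.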

Random Forest

Random Forest can be used for classification and regression tasks. It consists of a set of decision trees, each built on randomly chosen features, that make individual predictions. In the end, the most voted prediction is the outcome of the model.

[Image: Random Forest | Image by Author]

As measures of purity, it is possible to apply the Gini Index or Entropy. Gini measures the probability of incorrectly labeling a randomly chosen element from the data set. Its maximum impurity value is 0.5 and its maximum purity is 0.

[Image: Gini Index | Image by Author]
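Written out, for a node with class proportions pᵢ the Gini index is 1 − Σᵢ pᵢ²; with two equally frequent classes (p = 0.5 each) it reaches its maximum of 0.5.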

Entropy, like Gini, is a measure of the disorder in the data. In other words, it is essentially a measure of uncertainty. Its maximum impurity is 1 and its maximum purity is 0.

[Image: Entropy | Image by Author]
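In formula form, the entropy of a node is −Σᵢ pᵢ · log₂(pᵢ); for two balanced classes it reaches its maximum of 1.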

These measures are used to calculate what we call Information Gain, which tells us how much information is gained as we go down the tree. If a node is created that does not improve our information, it shouldn't be there. That’s why impurity is so important.
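To make the idea concrete, here is a minimal sketch (not from the original article) of how Gini, entropy, and the information gain of a candidate split could be computed:

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children, impurity=gini):
    # impurity of the parent node minus the weighted impurity of the child nodes
    n = len(parent)
    weighted = sum(len(c) / n * impurity(c) for c in children)
    return impurity(parent) - weighted

# a perfectly separating split of [0, 0, 1, 1] recovers the full 0.5 of Gini impurity
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 0.5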

Important note: which one to use depends on the problem at hand.

The code

Usually, the first step is to download the data and take a look at it. The purpose is to gain insights that will improve our knowledge of the features and help us deal with possible NaNs. In this article, we will import the libraries that we are going to use and transform possible NaN values into the average value of each variable before looking at the data.

As we are more interested in applying the classification method and studying it, turning NaNs into average values will suit us just fine. But be aware that this is an essential part of any machine learning problem.


import numpy as np
import pandas as pd
import quandl as qdl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# get data from Quandl
data = pd.DataFrame()
meta_data = ['RICIA', 'RICIM', 'RICIE']
for code in meta_data:
    df = qdl.get('RICI/' + code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

meta_data = ['EMHYY', 'AAAEY', 'USEY']
for code in meta_data:
    df = qdl.get('ML/' + code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

# dealing with possible empty values (not much attention to this part, but it is very important)
data.fillna(data.mean(), inplace=True)
print(data.head())
print("\nData shape:\n",data.shape)
[Image: Data | Image by Author]
#histograms
data.hist()
plt.show()
[Image: Histograms | Image by Author]

There are a couple of conclusions to be drawn from looking at the data. For the sake of simplicity, though, we will skip most of them and just note that the variables' values are on very different scales. So we will scale the data with a Min-Max scaler. Next, we are going to download our dependent variable and binarize it.
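As a reminder, Min-Max scaling maps each feature to x' = (x − x_min) / (x_max − x_min), so every variable ends up in the [0, 1] range.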

# scaling values to make them vary between 0 and 1
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)

# pulling the dependent variable from Quandl (Fed Funds effective rate)
par_yield = qdl.get('FED/RIFSPFF_N_D', start_date="2005-01-03", end_date="2020-07-01")
par_yield.columns = ['FED/RIFSPFF_N_D']

# create an empty df with the same index as the features and fill it with the dependent variable's values
par_data = pd.DataFrame(index=data_scaled.index, columns=['FED/RIFSPFF_N_D'])
par_data.update(par_yield['FED/RIFSPFF_N_D'])

# get the variation and binarize it
par_data=par_data.pct_change()
par_data.fillna(0, inplace=True)
par_data = par_data.apply(lambda x: [0 if y <= 0 else 1 for y in x])
print("Number of 0 and 1s:\n",par_data.value_counts())
[Image: 0s and 1s | Image by Author]

The binarization rule that we used was: if y ≤ 0 then 0, else 1. This rule gives the same label to neutral and downward movements, and that's why we got a data set with 3143 zeros and 909 ones, meaning that 77% of our data is composed of zeros. If we leave it as it is, we will probably end up with a biased estimator. It would probably have high accuracy, because classifying everything as zero would be right 77% of the time, but that does not mean it is good. So let’s oversample the data with a method called ADASYN.
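To put a number on that imbalance: 3143 / (3143 + 909) ≈ 0.776, so a trivial classifier that always predicts 0 would already score close to 77% accuracy. ADASYN addresses this by generating synthetic minority-class samples, adaptively focusing on the examples that are harder to learn, until the two classes are roughly balanced [2].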

# over-sampling with the ADASYN method
sampler = ADASYN(random_state=13)
X_os, y_os = sampler.fit_sample(data_scaled, par_data.values.ravel())  # in newer imblearn versions this method is called fit_resample
columns = data_scaled.columns
data_scaled = pd.DataFrame(data=X_os, columns=columns)
par_data = pd.DataFrame(data=y_os, columns=['FED/RIFSPFF_N_D'])

print("\nProportion of 0s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==0])/len(data_scaled))
print("\nProportion of 1s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==1])/len(data_scaled))
[Image: The oversampled proportion of 0s and 1s | Image by Author]

Ok! Now we are good to go! Let’s split our data and apply the methods.


# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just to make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

Naive Bayes

# Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y)
y_pred = gnb.predict(X_test)
print('\nAccuracy of naive bayes classifier on test set: {:.2f}'.format(gnb.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n',confusion_matrix)
print('\nClassification report:\n',metrics.classification_report(y_test, y_pred))

# plot confusion matrix (in newer scikit-learn versions, metrics.plot_confusion_matrix has been replaced by ConfusionMatrixDisplay.from_estimator)
disp = metrics.plot_confusion_matrix(gnb, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')

# roc curve
logit_roc_auc = metrics.roc_auc_score(y_test, gnb.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, gnb.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Naive Bayes (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Naive Bayes')
plt.legend(loc="lower right")
plt.savefig('NB_ROC')
[Image: Naive Bayes Classifier’s Accuracy | Image by Author]
[Image: Naive Bayes classification report | Image by Author]

If we compare the classification report of the Logistic Regression model applied in Part 1 and the Naive Bayes method, it seems that we were able to increase our F1-score from 0.60 to 0.61.


[Image: Naive Bayes Classifier’s Confusion Matrix | Image by Author]

For the Gaussian Naive Bayes Classifier, we got an accuracy of 66%, pretty much equal to the Logistic Regression model in Part 1. Looking at the Confusion Matrix, we can see that it predicted 817 values correctly; the Logistic Regression model predicted 810 correctly. Let’s look at the ROC curve of the Naive Bayes model.

[Image: Naive Bayes ROC curve | Image by Author]

Now, comparing the Logistic Regression ROC curve with the Naive Bayes ROC curve, we can see an increase of 0.01 in the area under the ROC curve, going from 0.65 to 0.66. It seems that we found a slightly better model for our prediction problem. Now we will apply the Random Forest model.

Random Forest

The Random Forest model has hyperparameters that can be optimized to improve it. However, to get a first feeling for the model, we will apply it with its default values. If it shows promising results, then it will be optimized. This approach will save us time!
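For later reference, a minimal hyperparameter search could look something like the sketch below; the parameter grid here is purely illustrative, not the one we will actually use:

from sklearn.model_selection import GridSearchCV

# hypothetical grid - the values are illustrative, not tuned for this problem
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=13), param_grid, cv=5, scoring='f1', n_jobs=-1)
grid.fit(X_train, y)
print(grid.best_params_, grid.best_score_)

For now, though, we stick with the defaults.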

# Random Forest model
# default hyperparameters (plus a fixed random_state); note that some of these arguments
# (e.g. min_impurity_split and max_features='auto') have been removed or renamed in newer scikit-learn releases
clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2,
                             min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
                             max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
                             bootstrap=True, oob_score=False, n_jobs=None, random_state=13, verbose=0,
                             warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
clf.fit(X_train, y)
y_pred = clf.predict(X_test)
print('\nAccuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n',confusion_matrix)
print('\nClassification report:\n',metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')

# roc curve
logit_roc_auc = metrics.roc_auc_score(y_test, clf.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Random Forest Classifier (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Random Forest Classifier')
plt.legend(loc="lower right")
plt.savefig('RF_ROC')
[Image: RF’s Accuracy | Image by Author]
[Image: RF’s classification report | Image by Author]

WOW! Now we have substantially improved our results! We achieved an accuracy of 76% while also increasing our Precision and Recall measures!

[Image: RF’s Confusion Matrix | Image by Author]

Looking at the confusion matrix, we can see that we labeled 943 values correctly. That is an increase of 15% compared to the Gaussian Naive Bayes classification model. Finally, let’s see what the ROC curve can tell us!

[Image: RF’s ROC curve | Image by Author]

That is a much more beautiful curve! The area under the curve is 0.10 larger than that of the Naive Bayes ROC curve. What a great improvement indeed! Now we can set this model aside and put it on our “Potential Good Models” list, to be optimized after we finish testing two other models, CatBoost and Support Vector Machines. See you in Part 3!

This article was written in conjunction with Guilherme Bezerra Pujades Magalhães.


References and great links

[1] T. Mitchell, Machine Learning Course (2009)


[2] Haibo He, Yang Bai, E. A. Garcia, and Shutao Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning (2008) IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 2008, pp. 1322–1328.


Translated from: https://towardsdatascience.com/predicting-interest-rate-with-classification-models-part-2-d25a8f798a99
