Data Mining with Pandas, Scikit-Learn, and SciPy (Prequel)

Latest recommended article published 2022-10-20 16:32:40

忘记可否


Category: Python. Tags: Python

Article link: https://blog.csdn.net/qq_43548867/article/details/99089405


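Main purpose

The previous two posts only covered data visualization and never reached the actual "mining" or any conclusions,
~~mainly because I'm not familiar with it myself~~. Before the holiday ends I'm jotting down some brief notes, to be fleshed out later.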

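Task overview

The first topic is regression analysis, which requires analyzing the features and checking how accurate the fitted model is.

Core idea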

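Guiding principle for building a model:
Build the model with the lowest error at the lowest complexity that achieves it.
In regression analysis, pay attention to feature selection (select-feature). Judge the goodness of fit and the correlation between features, pick suitable features, and then train the model and analyze the result.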
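The workflow above (select features, train, then check the fit on unseen data) can be sketched roughly as follows. This is only an illustration under assumptions: the synthetic data, the choice of `k=2`, and all variable names are made up here and are not from the original post.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Hypothetical data: 5 features, only the first two actually drive y.
X = rng.random((500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(500)

# Hold out a quarter of the samples to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Keep the k features with the highest F-value, then fit a linear model.
model = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
model.fit(X_train, y_train)

selected = model.named_steps["selectkbest"].get_support(indices=True)
test_r2 = model.score(X_test, y_test)  # R^2 on the held-out data
print("selected feature indices:", selected)
print("R^2 on held-out data:", round(test_r2, 3))
```

Putting the selector inside the pipeline means the feature scores are computed from the training split only, so the held-out score is not influenced by the test data.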

Feature selection

F_Regression

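sklearn ships an f_regression function that computes an F-value for each feature from its relationship with y.
The larger the F-value, the stronger the linear relationship between that feature and y. Note that the F-test only captures linear dependence.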

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import f_regression, mutual_info_regression


np.random.seed(0)
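X = np.random.rand(1000, 3)  # array of shape (1000, 3): 1000 samples, 3 features
# y depends linearly on x_1, nonlinearly on x_2, and not at all on x_3
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)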

f_test, _ = f_regression(X, y)
f_test /= np.max(f_test)

mi = mutual_info_regression(X, y)
mi /= np.max(mi)


plt.figure(figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(X[:, i], y, edgecolor='black', s=20)
    plt.xlabel("$x_{}$".format(i + 1), fontsize=14)
    if i == 0:
        plt.ylabel("$y$", fontsize=14)
    plt.title("F-test={:.2f}, MI={:.2f}".format(f_test[i], mi[i]),
              fontsize=16)
plt.show()

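Output:

(Figure not preserved: three scatter plots of y against x_1, x_2, and x_3, each titled with its normalized F-test and MI scores.)
The example also mixes in mutual_info_regression, which is not covered here.
Original code: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_f_test_vs_mi.html#sphx-glr-auto-examples-feature-selection-plot-f-test-vs-mi-py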
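As a small follow-up, the sketch below reuses the same synthetic data but keeps the second return value of `f_regression`, the per-feature p-values, which the example above discards. The variable names `f_scores`, `p_values`, and `ranking` are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression

np.random.seed(0)
X = np.random.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)

# f_regression returns two arrays: the F statistics and their p-values.
f_scores, p_values = f_regression(X, y)

# Rank the features from highest to lowest F-value.
ranking = np.argsort(f_scores)[::-1]
for idx in ranking:
    print(f"x_{idx + 1}: F={f_scores[idx]:.2f}, p={p_values[idx]:.3g}")
```

Because the F-test is linear, a feature with a strong but nonlinear effect (here the sine term on x_2) can receive a lower score than its real influence on y would suggest.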

Pearsonr-Value

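scipy's pearsonr returns two numbers: the Pearson correlation coefficient r and a p-value (p-values also show up when working with ARIMA).
r lies in [-1, 1] and measures how strongly two variables are linearly correlated; the closer it is to 0, the weaker the linear correlation. The p-value is the probability of seeing a correlation at least this strong if the two variables were actually uncorrelated, so a small p-value means the correlation is statistically significant, not that the variables are independent.
Both values are computed with the scipy library.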

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
np.random.seed(0)
size = 300
x = np.random.normal(0, 1, size)
# pearsonr(x, y) takes two 1-D arrays of equal length and returns (r, p-value)
print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))
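To make the two return values easier to read, here is a minimal sketch that unpacks them by name. It uses a separate random generator, so the numbers will differ from the snippet above; the `results` dictionary and the labels are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
size = 300
x = rng.normal(0, 1, size)

results = {}
for label, noise_std in [("lower noise", 1), ("higher noise", 10)]:
    y = x + rng.normal(0, noise_std, size)
    # Unpack the correlation coefficient and its p-value.
    r, p_value = pearsonr(x, y)
    results[label] = (r, p_value)
    print(f"{label}: r={r:.3f}, p-value={p_value:.3g}")
```

With more noise, r moves toward 0 and the p-value tends to grow, meaning there is less evidence of a linear relationship.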

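Other

There are plenty of other methods, for example:
RLR (removed in newer versions of sklearn, link…)
RFE
~~Differencing, variance, and so on. Haven't learned any of them yet~~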
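For RFE (recursive feature elimination), a minimal sketch with scikit-learn's `RFE` class might look like the following. The synthetic data and the choice of `LinearRegression` as the estimator are assumptions made for this illustration, not something the post specifies.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical data: y depends on the first two features; the third is irrelevant.
X = rng.random((500, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(500)

# Repeatedly fit the estimator and drop the weakest feature (one per step)
# until only n_features_to_select remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

print("kept:", selector.support_)      # boolean mask of the retained features
print("ranking:", selector.ranking_)   # 1 = kept; larger = eliminated earlier
```

Unlike the per-feature F-test, RFE judges features through a fitted model, so its result depends on which estimator is plugged in.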

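Summary

With this knowledge I can follow my own "imagination" to preprocess existing data (data cleaning, clustering (K-Means)), analyze and mine it, design some Expert Factors, and build a model (linear or logistic regression). For now, though, I can only fit regression coefficients with the libraries' built-in parameters… I can't read ARIMA's p, d, q and have to let them be set automatically. Still, I'm quite satisfied at this stage, ha =_=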
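The preprocessing steps mentioned above (data cleaning followed by K-Means clustering) can be sketched with pandas and scikit-learn as below. Everything in it is hypothetical: the two-group data, the column names `f1` and `f2`, and the choice of two clusters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated groups of points, with a few missing values.
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
df = pd.DataFrame(np.vstack([group_a, group_b]), columns=["f1", "f2"])
df.loc[[3, 60], "f1"] = np.nan

# Data cleaning: drop the rows that contain missing values.
clean = df.dropna().copy()

# Clustering: assign each remaining row to one of two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clean["cluster"] = kmeans.fit_predict(clean[["f1", "f2"]])
print(clean.groupby("cluster")[["f1", "f2"]].mean())
```

Dropping rows is the simplest cleaning choice; filling in missing values may be preferable when data is scarce.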

No more updates during the holiday; the remaining time has to go to my school coursework.

This really is a fascinating field.