【BI Study Assignment 12: Personalized Recommendation and Financial Data Analysis】

Latest recommended article published on 2024-04-12 14:48:21

水花

Category: BI_推荐系统 (BI recommender systems). Tags: python, machine learning

Article link: https://blog.csdn.net/weixin_43849871/article/details/113729843

1. Discussion Question

1.1 P2P Car Rental

Describe how to design similar-car recommendation and search ranking
Possible embedding strategies

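P2P car rental and Airbnb's home rental are highly similar in business logic: both are two-sided markets for short-term rentals. A customer rarely books the same car more than once, a car can be rented by only one customer in a given time slot, the owner may decline a rental request, and the data are severely sparse. We therefore follow the architecture Airbnb uses for home rental. It consists of two main modules: similar listing recommendation, which recommends similar cars, and search ranking, which orders search results. The design of each module is described below. Before that, we introduce embeddings, a strategy both modules rely on.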

Embeddings originated in NLP and have since spread to many other fields. Two embedding strategies are used here: listing embeddings and user-type & listing-type embeddings. They serve short-term real-time personalization and long-term personalization respectively, capturing a user's short-term and long-term behavior. Each strategy is described below.

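(1) listing embeddings
An uninterrupted sequence of clicks by one user on different listings is called a session. If the session ends with a car being booked, it is a booked session and the listing that was booked is the booked listing; otherwise it is an exploratory session. After discarding low-quality sessions, the remaining sessions form the dataset S. Treating each listing in a session as a word and the whole session as a sentence, we can train the skip-gram model with the negative sampling approach to obtain the listing embeddings. The skip-gram model can also be adapted to the business scenario in two ways. First, in a booked session the booked listing serves as global context: it takes part in training in every sliding window, since the goal of the model is to drive more bookings. Second, rental activity within a given period is usually concentrated in one region, so negative samples should also be drawn at random from the same region as the positive sample to keep the samples balanced. Listing embeddings bring many benefits. One shows up in cold start: when a new listing goes online, the mean of the embeddings of the n listings most similar to it can serve as its embedding.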
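The way a booked listing acts as global context can be sketched in a few lines of plain Python. This is only an illustration of how training pairs might be generated, not Airbnb's implementation; the function name `skipgram_pairs`, the listing ids, and the window size are all hypothetical.

```python
def skipgram_pairs(session, window=2, booked=None):
    """Return (center, context) training pairs for one click session.

    session: list of listing ids in click order.
    window:  number of neighbours taken on each side of the center listing.
    booked:  id of the booked listing, or None for an exploratory session.
    """
    pairs = []
    for i, center in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        # Ordinary skip-gram context: neighbours inside the sliding window.
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, session[j]))
        # Global context: the booked listing is paired with every center,
        # even when it falls outside the sliding window.
        if booked is not None and center != booked and booked not in session[lo:hi]:
            pairs.append((center, booked))
    return pairs


# Hypothetical booked session: the user clicks four cars and books the last one.
session = ["car_a", "car_b", "car_c", "car_d"]
booked_pairs = skipgram_pairs(session, window=1, booked="car_d")
exploratory_pairs = skipgram_pairs(session, window=1)
print(booked_pairs)
```

With `window=1`, `car_a` and `car_b` never see `car_d` in an ordinary window, so the booked session gains the two extra pairs `("car_a", "car_d")` and `("car_b", "car_d")`.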

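(2) user-type & listing-type embeddings
Listing embeddings capture only short-term user behavior. To exploit long-term behavior we use user-type & listing-type embeddings: users of the same type are mapped to one embedding, and listings of the same type are mapped to an embedding in the same space as the user-type embeddings, which addresses data sparsity. Here, the listings a user has booked over time, each paired with the type the user belonged to at the time of that booking, form one session, and these sessions together make up the dataset Sb. Training the skip-gram model with the negative sampling approach again yields the user-type & listing-type embeddings. As before, the skip-gram model can be adapted to the business scenario: adding listings whose owners rejected the user as negative samples gives better results, so that listings likely to reject a user can be ranked lower when recommending to that user.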

With the embedding techniques covered, we turn to similar listing recommendation and search ranking.

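First, similar listing recommendation recommends similar cars. When a user clicks a specific listing, the similar listings carousel beside it uses similar listing recommendation to suggest other cars similar to that listing. A click on a specific listing at a given moment is short-term user behavior, so the listing embeddings trained earlier can be used directly. The cosine similarity between two cars' listing embeddings measures how similar they are, and the cars with the highest similarity that the user can actually rent (in the region the user needs and not already rented by someone else) are recommended.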
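The lookup described above can be sketched as follows. The two-dimensional embeddings, the listing ids, and the `available` set are made-up values for illustration; real listing embeddings would come from the trained skip-gram model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_listings(query_id, embeddings, available, k=2):
    """Rank the available listings by cosine similarity to the query listing."""
    query_vec = embeddings[query_id]
    scored = [
        (cosine_similarity(query_vec, embeddings[lid]), lid)
        for lid in available
        if lid != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [lid for _, lid in scored[:k]]


# Hypothetical embeddings; `available` stands for cars in the right region
# that nobody else has rented for the requested dates.
embeddings = {
    "car_a": [1.0, 0.0],
    "car_b": [0.9, 0.1],
    "car_c": [0.0, 1.0],
    "car_d": [0.7, 0.7],
}
available = {"car_b", "car_c", "car_d"}
top = similar_listings("car_a", embeddings, available, k=2)
print(top)
```

Filtering by availability before scoring keeps the carousel from showing cars the user cannot rent.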

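Second, search ranking. When a user searches with their personal information and preferences, the system returns the best-matching listings, ordered by how well they match, with the best matches first. Search ranking can be cast as a pairwise regression problem with search labels: we build a dataset and train with the Lambda Rank algorithm. One search is represented as $D_s = (x_i, y_i),\ i = 1 \dots K$, where $x_i$ is the feature vector of a listing and $y_i \in \{0, 0.01, 0.25, 1, -0.4\}$. The labels mean the following: 0, the user saw the listing but ignored it; 0.01, the user clicked the listing; 0.25, the user contacted the listing's owner but did not book in the end; 1, the user booked the listing; -0.4, the listing's owner rejected the user. Every $D_s$ must contain a successful booking, i.e. a sample with $y_i = 1$. The qualifying $D_s$ are then combined into the dataset D.

The features in $x_i$ include listing features, user features, query features and cross-features, plus the listing embedding features and user-type & listing-type embedding features that proved very useful in Airbnb's architecture. Listing embedding features are built from the listing embeddings: we collect the user's actions over the past two weeks, such as clicking a listing or skipping listings that were ranked high, compute the cosine similarity between the listing embeddings of those listings and the listing embedding of the candidate, and use the values as features of that sample, with names such as EmbClickSim and EmbWishlistSim. The user-type & listing-type embedding feature is simply the cosine similarity between the user's user-type embedding and the candidate listing's listing-type embedding. Once all features are built (Airbnb's architecture reportedly uses about one hundred), the model is trained with Lambda Rank and can then be deployed.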
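As a small illustration of how one search might be turned into a labeled set $D_s$, the sketch below maps outcomes to the five label values and discards searches that contain no booking. The outcome names, the `price` feature, and the function `build_search_set` are hypothetical; only the label values come from the text.

```python
# Label values from the text; the outcome names are illustrative.
LABELS = {
    "ignored": 0.0,
    "clicked": 0.01,
    "contacted": 0.25,
    "booked": 1.0,
    "rejected": -0.4,
}


def build_search_set(results):
    """Turn one search into (features, label) pairs, or None if nothing was booked."""
    ds = [(features, LABELS[outcome]) for features, outcome in results]
    if not any(label == 1.0 for _, label in ds):
        return None  # a D_s without a successful booking is not kept
    return ds


# Two hypothetical searches; only the first one ends in a booking.
searches = [
    [({"price": 30}, "clicked"), ({"price": 25}, "booked"), ({"price": 40}, "ignored")],
    [({"price": 50}, "clicked"), ({"price": 45}, "rejected")],
]
dataset = [ds for ds in map(build_search_set, searches) if ds is not None]
print(len(dataset))
```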

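2. Programming Exercises

2.1 Credit Card Default Detection

Dataset: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

Model credit card usage data to predict whether a client will default next month => a classification problem
There are many machine learning algorithms, such as SVM, decision trees, random forests and KNN => which model should be used?
GridSearchCV can find the best parameters and best score for each classifier, and from those the classifier and parameters that suit the dataset best

The credit card default project uses credit card data from a bank in Taiwan covering April to September 2005. The dataset has 25 fields and 30,000 rows. The fields are: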

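Field	Meaning
ID	Client ID
LIMIT_BAL	Credit limit
SEX	Gender, male: 1, female: 2
EDUCATION	Education, graduate school: 1, university: 2, high school: 3, other: 4
MARRIAGE	Marital status, married: 1, single: 2, other: 3
AGE	Age
PAY_0	Repayment status in September 2005
PAY_2	Repayment status in August 2005
PAY_3	Repayment status in July 2005
PAY_4	Repayment status in June 2005
PAY_5	Repayment status in May 2005
PAY_6	Repayment status in April 2005
BILL_AMT1	Bill amount in September 2005
BILL_AMT2	Bill amount in August 2005
BILL_AMT3	Bill amount in July 2005
BILL_AMT4	Bill amount in June 2005
BILL_AMT5	Bill amount in May 2005
BILL_AMT6	Bill amount in April 2005
PAY_AMT1	Amount repaid in September 2005
PAY_AMT2	Amount repaid in August 2005
PAY_AMT3	Amount repaid in July 2005
PAY_AMT4	Amount repaid in June 2005
PAY_AMT5	Amount repaid in May 2005
PAY_AMT6	Amount repaid in April 2005
default.payment.next.month	Whether the client defaults next month, default: 1, no default: 0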

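Check whether the dataset follows the "complete and consistent" (完全合一) principle:
From an overview of the data, note that SEX and the default flag are binary variables, and EDUCATION and MARRIAGE are categorical variables, yet from the model's point of view all of these features are treated as continuous.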

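There are two approaches to this binary classification problem. One is to discretize the continuous variables, one-hot encode them, and fit a relatively simple logistic model (even linear regression could be used). The other is to feed the data directly to a random forest, which handles continuous features well.

With the approach settled, the next step is preprocessing the input features: data transformation, dimensionality reduction, and feature construction. These steps have no fixed order and often need to be repeated as required. Their boundaries are not strict either: feature construction, for example, may use the same transformation techniques as data preprocessing.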

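Here I will train a random forest. Random forest is an ensemble learning algorithm, and ensemble learning comes in two main styles. Boosting, as in AdaBoost, learns a strong classifier from many weak classifiers. Random forest uses bagging: it trains many classifiers on bootstrap samples and combines their predictions rather than picking a single best one (the base learner is a CART tree by default).

Because random forest is built on CART trees, it can do both classification and regression. For classification the output is the class predicted by the most sub-classifiers (the majority vote). For regression the output is the mean of the regression results of all CART trees.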
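The two aggregation rules just described can be written out directly. This is a plain-Python sketch of hard voting and averaging over made-up tree outputs, not scikit-learn's code; scikit-learn's RandomForestClassifier actually averages the trees' predicted class probabilities rather than counting hard votes. The function names are hypothetical.

```python
from collections import Counter


def forest_classify(tree_votes):
    """Majority vote: return the class predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Regression: return the mean of the individual tree predictions."""
    return sum(tree_outputs) / len(tree_outputs)


# Hypothetical predictions of five trees for one client (1 = default).
votes = [1, 0, 1, 1, 0]
predicted_class = forest_classify(votes)

# Hypothetical regression outputs of three trees.
predicted_value = forest_regress([0.2, 0.4, 0.6])
print(predicted_class, predicted_value)
```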

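The decision trees in sklearn are an optimized version of CART; the split criterion can be gini or entropy (the information-gain criterion that ID3 uses). CART handles continuous features by choosing split thresholds, which in effect discretizes them. This makes the tree fairly robust to noise and helps generalization, and CART can in principle also cope with missing data.

In general, a model that can handle continuous data can also handle discrete data, but not the other way round.

The main parameters of the random forest are: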

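n_estimators	Number of decision trees in the forest; the default is 100 since scikit-learn 0.22 (10 in earlier versions)
criterion	Split criterion; the default is gini (the Gini index used by CART); entropy (information gain, as in ID3) is also available
max_depth	Maximum depth of each tree; the default None means unlimited depth, or set an integer to cap it
n_jobs	Number of CPU cores used for fitting and prediction; the default is a single core, an integer sets the count, and -1 uses all cores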

Then fit trains the model, predict makes predictions, and accuracy_score computes the accuracy.

The code is as follows:

# -*- coding: utf-8 -*-
# Credit card default analysis
import pandas as pd
from sklearn.model_selection import learning_curve, train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot as plt
import seaborn as sns
# Load the data
data = pd.read_csv('C:/Users/baihua/Desktop/code/UCI_Credit_Card.csv')
# Explore the data
print(data.shape) # dataset size
print(data.describe()) # summary statistics
# Check how many clients default next month
next_month = data['default.payment.next.month'].value_counts()
print(next_month)
df = pd.DataFrame({'default.payment.next.month': next_month.index,'values': next_month.values})
plt.figure(figsize = (6,6))
plt.title('Credit card clients by default status\n (default: 1, no default: 0)')
sns.set_color_codes("pastel")
sns.barplot(x = 'default.payment.next.month', y="values", data=df)
locs, labels = plt.xticks()
plt.show()
# Feature selection: drop the ID column and separate out the target column
data.drop(['ID'], inplace=True, axis =1) # ID carries no predictive information
target = data['default.payment.next.month'].values
columns = data.columns.tolist()
columns.remove('default.payment.next.month')
features = data[columns].values
# 30% goes to the test set, the rest to the training set
train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.30, stratify = target, random_state = 1)

# Alternative classifiers that can be swapped in:
#clf=DecisionTreeClassifier(random_state = 1, criterion = 'gini')
#SVC(random_state = 1, kernel = 'rbf')
#KNeighborsClassifier(metric = 'minkowski')
clf=RandomForestClassifier(random_state = 1, criterion = 'entropy')
clf.fit(train_x,train_y)
predict_y=clf.predict(test_x)
print(predict_y)
# accuracy_score expects (y_true, y_pred)
print('Accuracy: ', accuracy_score(test_y,predict_y))

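With the entropy criterion (the one ID3 uses) the accuracy is 0.808, slightly higher than with gini, so discretizing continuous data is not always necessary. Many steps go into constructing features, yet using the model takes only three; data processing is clearly where the time goes.

Next, we bin and discretize the data and fit a logistic regression, to see whether its accuracy beats the random forest's.

Encoding and binning come in several forms: one-hot encoding for unordered categories, ordinal (integer) encoding for ordered categories, and, for continuous features, unsupervised or supervised binning followed by one-hot encoding.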

# One-hot encoding of unordered categories in Python
import pandas as pd

df= pd.DataFrame([['professional','A',1],['civil servant','C',2],['civil servant','A',1],['business','C',4],['civil servant','B',5]],columns=['job','class','value'])
df = pd.get_dummies(df,columns=['job','class'],drop_first=False)# columns lists the variables to encode; drop_first=False keeps all n dummy columns (drop_first=True would keep n-1)
print(df)

# Ordinal encoding of ordered categories in Python
import pandas as pd
df= pd.DataFrame(['normal','grade 3 hypertension','normal','grade 2 hypertension','normal','high-normal','grade 1 hypertension'],columns=['blood_pressure'])
dic_blood = {'normal':0,'high-normal':1,'grade 1 hypertension':2,'grade 2 hypertension':3,'grade 3 hypertension':4}
df['blood_pressure_enc'] = df['blood_pressure'].map(dic_blood)
print(df)

# Unsupervised binning of a continuous variable, followed by one-hot encoding
import pandas as pd
df = pd.DataFrame([[22,1],[13,1],[33,1],[52,0],[16,0],[42,1],[53,1],[39,1],[26,0],[66,0]],columns=['age','Y'])
#print(df)
df['age_bin_1'] = pd.qcut(df['age'],3) # new column holding equal-frequency bins
df['age_bin_2'] = pd.cut(df['age'],3)  # new column holding equal-width bins
print(df)
df = pd.get_dummies(df,columns=['age_bin_1','age_bin_2'],drop_first=False)# columns lists the binned variables to encode; drop_first=False keeps all n dummy columns
print(df)

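Binning methods:
1. Chi-square binning: adjacent intervals with the smallest chi-square value are merged until a stopping criterion is met. This is a bottom-up merging method.
2. Minimum-entropy binning: entropy measures how disordered the information is; low entropy means little disorder.
3. Equal-frequency and equal-width binning (simple, but mechanical and not very reliable).
Binning every feature does not necessarily give the best result; for some features discretization makes the model worse.
If each column is binned, how many intervals should it be split into? Every feature has its own distribution.

So two questions need answers:

Which columns should be discretized? This can only be settled by experiment.
How many intervals per column? Domain knowledge probably matters most here.

A standard answer already exists: compute the WOE/IV values for each category of each feature, replace the raw values with the WOE values, drop features with IV < 0.02, and feed the resulting WOE-encoded data into the model for classification.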
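For reference, one common definition is $WOE_i = \ln\frac{good_i / good_{total}}{bad_i / bad_{total}}$ and $IV = \sum_i \left(\frac{good_i}{good_{total}} - \frac{bad_i}{bad_{total}}\right) WOE_i$; sign conventions vary between sources. The sketch below computes both for a single binned feature under that assumption. The function `woe_iv` and the toy data are hypothetical, and the code assumes every bin contains at least one good and one bad sample (otherwise the logarithm is undefined).

```python
import math


def woe_iv(bins, labels):
    """Compute WOE per bin and the total IV of one binned feature.

    bins:   bin label of each sample.
    labels: 1 = bad (default), 0 = good, one per sample.
    """
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    counts = {}
    for b, y in zip(bins, labels):
        good, bad = counts.get(b, (0, 0))
        counts[b] = (good + (y == 0), bad + (y == 1))
    woe, iv = {}, 0.0
    for b, (good, bad) in counts.items():
        good_dist = good / total_good  # share of all good samples in this bin
        bad_dist = bad / total_bad     # share of all bad samples in this bin
        woe[b] = math.log(good_dist / bad_dist)
        iv += (good_dist - bad_dist) * woe[b]
    return woe, iv


# Toy data: two age bins with four clients each.
bins = ["young"] * 4 + ["old"] * 4
labels = [1, 1, 1, 0, 0, 0, 0, 1]
woe, iv = woe_iv(bins, labels)
keep_feature = iv >= 0.02  # the IV cut-off mentioned in the text
print(woe, iv, keep_feature)
```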

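The code is shared in the post "Python code for data splitting (binning, discretization) and WOE/IV calculation" for anyone who needs it.

When building a model, choosing the model and tuning its parameters are unavoidable. You could try models one by one and tune parameters one by one.

This section introduces GridSearchCV, an automatic parameter-tuning tool, and combines it with the Pipeline mechanism for a streamlined workflow.

2.1.1 Tuning Model Parameters with GridSearchCV

scikit-learn provides GridSearchCV, a module for automatic parameter search. We tell it which parameters to tune and their candidate values; it runs every combination and reports which parameters are best and what score they achieve.

Import the module as follows; its main parameters are listed below: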

from sklearn.model_selection import GridSearchCV

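estimator	The classifier to use, e.g. random forest, decision tree, SVM, KNN
param_grid	The parameters to optimize and their candidate values, given as a dict or a list of dicts
cv	Number of cross-validation folds; the default None means 5-fold since scikit-learn 0.22 (3-fold in earlier versions)
scoring	Evaluation metric; the default None uses the estimator's score function; a specific metric such as accuracy or f1 can also be set

After constructing GridSearchCV, call fit to train and predict to predict; prediction then uses the classifier with the best parameters.

As an example, take the iris dataset bundled with sklearn and classify it with a random forest. Suppose we want to know which values of n_estimators (1 to 10) and max_depth (1 to 9) classify best: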

# -*- coding: utf-8 -*-
# Classify the IRIS dataset with RandomForest
# Use GridSearchCV to find the best parameters
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA # PCA reduces the dimensionality of the features
import numpy as np
np.random.seed(1234)
rf = RandomForestClassifier()

parameters = {"n_estimators": range(1,11),"max_depth":range (1,10)} # parameters of the chosen classifier, stored as a dict or a list
iris = load_iris() # dict-like: {'data': array([[5.1, 3.5, 1.4, 0.2], ...]), 'target': array([0, 0, 0, ...]), 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10')}
pca = PCA(n_components=0.9) # keep enough components to explain 90% of the variance
pca.fit(iris.data) # the "training" step; fit returns the fitted object itself
iris.data=pca.transform(iris.data) # the data after dimensionality reduction
#pca.inverse_transform(iris.data) # maps the reduced data back to the original space

# Tune the parameters with GridSearchCV
clf = GridSearchCV(estimator=rf, param_grid=parameters)
# Classify the iris dataset
clf.fit(iris.data, iris.target)
print(" Best score: %.4f" %clf.best_score_)
print(" Best parameters:", clf.best_params_)

 Best score: 0.9733
 Best parameters: {'max_depth': 4, 'n_estimators': 6}

That is, with a random forest classifier, the best accuracy of about 97% is reached with n_estimators = 6 trees and a maximum tree depth of 4.

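2.1.2 Using the Pipeline Mechanism for a Streamlined Workflow

Classification usually starts with normalizing the data, then reduces its dimensionality, and only then applies a classifier.

scikit-learn offers the Pipeline mechanism: list the steps in order to create a pipeline, with each step written as a ('name', estimator) pair.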

# Pipeline example
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA()),
        ('randomforestclassifier', RandomForestClassifier())
])

# Full usage
# -*- coding: utf-8 -*-
# Classify the IRIS dataset with RandomForest
# Use GridSearchCV to find the best parameters and Pipeline to chain the steps
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
rf = RandomForestClassifier()
parameters = {"randomforestclassifier__n_estimators": range(1,11),"randomforestclassifier__max_depth":range(1,10)} # the prefix names the pipeline step the parameter belongs to
iris = load_iris() # dict-like: {'data': array([[5.1, 3.5, 1.4, 0.2], ...]), 'target': array([0, 0, 0, ...]), 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10')}
pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('randomforestclassifier', rf)
])
# Tune the parameters with GridSearchCV
clf = GridSearchCV(estimator=pipeline, param_grid=parameters) # the pipeline goes where the classifier used to go

# Classify the iris dataset
clf.fit(iris.data, iris.target)
print(" Best score: %.4f" %clf.best_score_)
print(" Best parameters:", clf.best_params_)

 Best score: 0.9733
 Best parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 9}

Two points to note: in the parameter dict, prefix each parameter with the name of its pipeline step, e.g. randomforestclassifier__ for the random forest and decisiontreeclassifier__ for a decision tree.
GridSearchCV: pass the pipeline where the classifier used to go.

Now we apply Pipeline and GridSearchCV to our model to see what they can do:

# -*- coding: utf-8 -*-
# Credit card default analysis
import pandas as pd
from sklearn.model_selection import learning_curve, train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot as plt
import seaborn as sns
# Load the data
data= pd.read_csv('C:/Users/baihua/Desktop/code/UCI_Credit_Card.csv')
# Explore the data
print(data.shape) # dataset size
print(data.describe()) # summary statistics
# Check how many clients default next month
next_month = data['default.payment.next.month'].value_counts()
print(next_month)
df = pd.DataFrame({'default.payment.next.month': next_month.index,'values': next_month.values})
plt.figure(figsize = (6,6))
plt.title('Credit card clients by default status\n (default: 1, no default: 0)')
sns.set_color_codes("pastel")
sns.barplot(x = 'default.payment.next.month', y="values", data=df)
locs, labels = plt.xticks()
#plt.savefig('C:/Users/baihua/Desktop/zhifangtu.png') # save the figure locally
plt.show()
# Feature selection: drop the ID column and separate out the target column
data.drop(['ID'], inplace=True, axis =1) # ID carries no predictive information
target = data['default.payment.next.month'].values # the y variable
columns = data.columns.tolist() # convert the column index to a list
columns.remove('default.payment.next.month') # remove y, keep the features
features = data[columns].values
# 30% goes to the test set, the rest to the training set
train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.30, stratify = target, random_state = 1)

# Build the classifiers
classifiers = [
    SVC(random_state = 1, kernel = 'rbf'),
    DecisionTreeClassifier(random_state = 1, criterion = 'gini'),
    RandomForestClassifier(random_state = 1, criterion = 'gini'),
    KNeighborsClassifier(metric = 'minkowski'),
    # AdaBoostClassifier( random_state=1)
]
# Classifier names (used as pipeline step names)
classifier_names = [
            'svc',
            'decisiontreeclassifier',
            'randomforestclassifier',
            'kneighborsclassifier',
             #'AdaBoostClassifier'
]
# Parameter grid for each classifier
classifier_param_grid = [
            {'svc__C':[1], 'svc__gamma':[0.01]},
            {'decisiontreeclassifier__max_depth':[6,9,11]},
            {'randomforestclassifier__n_estimators':[3,5,6]} ,
            {'kneighborsclassifier__n_neighbors':[4,6,8]},
             #{'AdaBoostClassifier__n_estimators':[10,15,25]}
]

# Tune one classifier with GridSearchCV
def GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, param_grid, score = 'accuracy'):
    response = {}
    gridsearch = GridSearchCV(estimator = pipeline, param_grid = param_grid, scoring = score)
    # Find the best parameters and the best accuracy score
    search = gridsearch.fit(train_x, train_y)
    print("GridSearch best parameters:", search.best_params_)
    print("GridSearch best score: %0.4f" %search.best_score_)
    predict_y = gridsearch.predict(test_x)
    print(" Accuracy %0.4f" %accuracy_score(test_y, predict_y))
    response['predict_y'] = predict_y
    response['accuracy_score'] = accuracy_score(test_y,predict_y)
    return response

# Loop over the models and their parameter grids
for model, model_name, model_param_grid in zip(classifiers, classifier_names, classifier_param_grid):
    pipeline = Pipeline([
            ('scaler', StandardScaler()),
            (model_name, model)
    ])
    result = GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, model_param_grid , score = 'accuracy')

Output:
(30000, 25)
                 ID       LIMIT_BAL           SEX     EDUCATION      MARRIAGE  \
count  30000.000000    30000.000000  30000.000000  30000.000000  30000.000000   
mean   15000.500000   167484.322667      1.603733      1.853133      1.551867   
std     8660.398374   129747.661567      0.489129      0.790349      0.521970   
min        1.000000    10000.000000      1.000000      0.000000      0.000000   
25%     7500.750000    50000.000000      1.000000      1.000000      1.000000   
50%    15000.500000   140000.000000      2.000000      2.000000      2.000000   
75%    22500.250000   240000.000000      2.000000      2.000000      2.000000   
max    30000.000000  1000000.000000      2.000000      6.000000      3.000000   

[8 rows x 25 columns]
0    23364
1     6636
Name: default.payment.next.month, dtype: int64

GridSearch best parameters: {'svc__C': 1, 'svc__gamma': 0.01}
GridSearch best score: 0.8174
 Accuracy 0.8172

GridSearch best parameters: {'decisiontreeclassifier__max_depth': 6}
GridSearch best score: 0.8186
 Accuracy 0.8113

GridSearch best parameters: {'randomforestclassifier__n_estimators': 6}
GridSearch best score: 0.7998
 Accuracy 0.7994

GridSearch best parameters: {'kneighborsclassifier__n_neighbors': 8}
GridSearch best score: 0.8040
 Accuracy 0.8036

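It is easy to add more classifiers and parameters, which makes this approach practical; the price is computation time.
zip() takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns an iterator over these tuples, which list() turns into a list.

>>> a = [1,2,3]
>>> b = [4,5,6]
>>> c = [4,5,6,7,8]
>>> zipped = list(zip(a,b))     # pack into a list of tuples
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> list(zip(a,c))              # the length matches the shortest input
[(1, 4), (2, 5), (3, 6)]
>>> list(zip(*zipped))          # the inverse of zip: *zipped unpacks the tuples again
[(1, 2, 3), (4, 5, 6)]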

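2.2 Credit Card Fraud Analysis

Dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud

Dataset information: credit card transactions over two days in September 2013, 284,807 transactions in total, of which 492 are fraudulent.

Each sample has 28 features V1, V2, ... V28, plus the transaction time Time and the transaction amount Amount.
For privacy reasons the 28 features are the result of a PCA transformation.

Task: predict the Class of each transaction, i.e. whether it is fraudulent.

Class=0 means normal (not fraud), Class=1 means fraud.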

2.2.1 Inspecting the Data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
data = pd.read_csv("creditcard.csv")
data.shape
# (284807, 31)

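In brief, V1-V28 are a series of indicators (what exactly they measure does not matter here), Amount is the transaction amount, and Class = 0 marks a normal transaction while Class = 1 marks an abnormal one.

Goal: detect whether a transaction is abnormal. This is a binary classification problem, so logistic regression is a natural model to try.

2.2.2 Observing the Data

We call Class=0 samples negative and Class=1 samples positive, and look at how many of each there are.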

count_classes = data['Class'].value_counts().sort_index()  # pd.value_counts is deprecated in recent pandas
plt.figure(figsize=(10,6))
count_classes.plot(kind='bar')
plt.title("Fraud class histogram")
plt.xlabel("Class",size=20)
plt.xticks(rotation=0)
plt.ylabel("Number",size=20)

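The classes are severely imbalanced. With such imbalance, the minority class offers too little information to learn its patterns from. The model also overfits to the imbalance itself: put simply, applied to evenly distributed data it will fit poorly.

How can this be solved? There are two approaches.
1) Undersampling
Here, undersampling takes only a subset of the negative (normal, Class=0) samples so that the two classes end up about the same size. It makes both classes equally few.

2) Oversampling
Oversampling, by contrast, generates more positive (fraud, Class=1) samples until there are as many of them as negative ones. It makes both classes equally many.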
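The post only implements undersampling, so here is a minimal sketch of the oversampling idea for comparison. It simply duplicates randomly chosen minority rows until the classes are the same size; methods such as SMOTE synthesize new points instead. The function `random_oversample` and the toy rows are hypothetical.

```python
import random


def random_oversample(rows, label_of, seed=0):
    """Duplicate minority-class rows at random until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw with replacement until this class reaches the majority size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Toy data: 8 normal transactions (label 0) and 2 frauds (label 1).
rows = [("tx%d" % i, 0) for i in range(8)] + [("fraud1", 1), ("fraud2", 1)]
balanced = random_oversample(rows, label_of=lambda r: r[1])
print(len(balanced))
```

Oversampling should be applied to the training split only; duplicating rows before the train/test split would leak copies of test rows into training.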

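2.2.3 Normalization

Looking further at the data, the Amount column varies over a far wider range than V1-V28. Before modeling, features should be on comparable scales, otherwise the model can be misled, so Amount is normalized or standardized first. sklearn makes this easy:

# The Time column is also dropped here because it contributes little to this problem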
from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)
data.head()
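What StandardScaler computes for the Amount column is the z-score $z = (x - \mu) / \sigma$, where $\mu$ is the column mean and $\sigma$ the population standard deviation. The plain-Python sketch below reproduces that formula on made-up amounts; `standardize` is a hypothetical helper, not part of sklearn.

```python
import math


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# Made-up transaction amounts.
amounts = [10.0, 20.0, 30.0, 40.0]
scaled = standardize(amounts)
print(scaled)
```

After scaling, the values have mean 0 and variance 1, which puts Amount on a scale comparable to the PCA features.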

2.2.4 Undersampling the Data

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']  # same as y = pd.DataFrame(data.loc[:, 'Class']) or y = pd.DataFrame(data.Class)
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
normal_indices = data[data.Class == 0].index

random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
# np.random.choice picks as many normal (Class=0) indices as there are fraud samples; replace=False samples without replacement
random_normal_indices = np.array(random_normal_indices)
# convert the selection to an array
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])
# merge the fraud indices with the randomly chosen normal indices
under_sample_data = data.loc[under_sample_indices,:]
# look the rows up by index label
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
# X_undersample and y_undersample are the undersampled data
print("Share of normal samples: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Share of fraud samples: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of samples: ", len(under_sample_data))
X_undersample.head(3)
y_undersample.head(3)

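2.2.5 Cross-Validation

Split the dataset into train and test sets, typically 80/20, then divide train into 3 equal parts.

a) 1+2 ------> 3: build the model on parts 1 and 2 and use part 3 as the validation set
b) 1+3 ------> 2: likewise, build on parts 1 and 3 and validate on part 2
c) 2+3 ------> 1
The benefit: with a single split, an easy validation set makes the model look better than it really is, while one containing outliers makes it look worse. Averaging over the splits balances the two and gives a more reliable estimate of the model's performance.
Final score: the mean of the evaluation results on parts 3, 2 and 1.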
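The rotation above can be sketched in plain Python without any library. The helper names `kfold_splits` and `cross_validate` are hypothetical, and the `evaluate` callback stands in for "fit on the training indices, score on the validation indices"; in practice sklearn's KFold does the index bookkeeping.

```python
def kfold_splits(n_samples, k=3):
    """Yield (train_indices, validation_indices) for each of the k rotations."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k - 1)]
    folds.append(indices[(k - 1) * fold_size:])  # the last fold takes the remainder
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


def cross_validate(n_samples, evaluate, k=3):
    """Average the score returned by evaluate(train, validation) over k folds."""
    scores = [evaluate(train, val) for train, val in kfold_splits(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(kfold_splits(6, k=3))
for train, val in splits:
    print("train on", train, "validate on", val)

# Dummy scorer for illustration: the score is the first validation index.
mean_score = cross_validate(6, lambda train, val: float(val[0]), k=3)
```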

The code is as follows:

from sklearn.model_selection import train_test_split
# sklearn has removed cross_validation and merged its contents into model_selection; replace sklearn.cross_validation with sklearn.model_selection

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
# Random split: random_state=0 fixes the random seed; test_size is the test share, here 0.3, i.e. 70% training and 30% test

print("Original training samples:", len(X_train))
print("Original test samples: ", len(X_test))
print("Original samples in total:", len(X_train)+len(X_test))

# Split the undersampled data as well
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample,y_undersample
                                                                                                   ,test_size = 0.3
                                                                                                   ,random_state = 0)
print("")
print("Undersampled training samples: ", len(X_train_undersample))
print("Undersampled test samples: ", len(X_test_undersample))
print("Undersampled samples in total:", len(X_train_undersample)+len(X_test_undersample))

# Recall = TP/(TP+FN); the model is evaluated by recall
# TP (true positives), FP (false positives), FN (false negatives), TN (true negatives)
from sklearn.linear_model import LogisticRegression # logistic regression model
from sklearn.model_selection import KFold, cross_val_score
# KFold sets the number of cross-validation folds; cross_val_score returns cross-validated scores
from sklearn.metrics import confusion_matrix,recall_score,classification_report
# confusion_matrix computes the confusion matrix

The linked article explains Recall very clearly.
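To make the formula Recall = TP/(TP+FN) concrete, the sketch below counts the four confusion-matrix cells by hand on made-up labels. The helper names are hypothetical; in the post itself sklearn's recall_score and confusion_matrix do this work.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN), treating label 1 as the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Share of actual positives that the model caught."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels: 1 fraud among 10 transactions, and a model that predicts "normal" every time.
y_true = [1] + [0] * 9
y_pred = [0] * 10
print(accuracy(y_true, y_pred), recall(y_true, y_pred))
```

The always-normal model reaches 90% accuracy yet catches no fraud at all, which is why recall rather than accuracy is the metric of interest on imbalanced data.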

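2.2.6 Regularization Penalty

Suppose there are two sets of weights, A and B, with the same recall, but the weights in A vary far more than those in B. Then A is more prone than B to **overfitting (good results on the training set but poor results on the test set)**. To favor models like B, a regularization penalty is introduced: the objective becomes the loss function plus the regularization penalty.
There are two kinds of regularization penalty: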

L1：
$Ω(\theta)=||W||_1=∑_i|w_i|$

L2:
$Ω(\theta)=\frac{1}{2}||W||_2^{2}$
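A small numeric example shows how the two penalties treat the weight sets A and B differently. The weights are made up: both give the same output on the input x = [1, 1, 1, 1], but A puts everything on one feature while B spreads the weight evenly.

```python
def l1_penalty(weights):
    """L1 penalty: sum of absolute weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """L2 penalty: half the sum of squared weights."""
    return 0.5 * sum(w * w for w in weights)


x = [1.0, 1.0, 1.0, 1.0]
w_a = [1.0, 0.0, 0.0, 0.0]      # concentrated weights
w_b = [0.25, 0.25, 0.25, 0.25]  # evenly spread weights

# Both weight sets produce the same linear output on x.
out_a = sum(w * xi for w, xi in zip(w_a, x))
out_b = sum(w * xi for w, xi in zip(w_b, x))

print(l1_penalty(w_a), l1_penalty(w_b))
print(l2_penalty(w_a), l2_penalty(w_b))
```

The L1 penalty cannot tell the two apart (both are 1.0), while the L2 penalty is four times smaller for B (0.125 versus 0.5), so L2 steers the model toward evenly spread weights like B.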

def printing_Kfold_scores(x_train_data, y_train_data):
    # `fold` is not defined in the original post; a 5-fold KFold is assumed here
    fold = KFold(n_splits=5, shuffle=False)
    c_param_range = [0.01, 0.1, 1, 10, 100]
    # candidate values for the regularization parameter C
    results_table = pd.DataFrame({'C_parameter': c_param_range, 'Mean recall score': np.nan})

    # each split gives 2 arrays: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:  # find the most suitable regularization parameter
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')
        recall_accs = []
        for iteration, indices in enumerate(fold.split(y_train_data), start=1):
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            # C is the inverse of the regularization strength (smaller C = stronger penalty); penalty selects l1 or l2; solver options: {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}
            lr.fit(x_train_data.iloc[indices[0], :].values, y_train_data.iloc[indices[0], :].values.ravel())
            # lr.fit: train the model on the training folds; y is flattened to one dimension
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
            # lr.predict: predict on the validation fold
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            # recall_score: compare the true labels with the predictions
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].idxmax(), 'C_parameter']

    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    return best_c
best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)

The per-iteration output is omitted here; interested readers can copy the code and run it to see the final result.

2.2.7 Confusion Matrix of the Model Trained on Undersampled Data

import itertools
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap,aspect='auto')
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
lr = LogisticRegression(C = best_c, penalty = 'l2')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

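This is the result of applying the model to the undersampled test set; the numbers in the plot are slightly misaligned because of my matplotlib version.
Still, it shows TP=138, TN=9, FP=9, while FN is hard to read but is about the same as TP.
The recall is 0.878.

Next, apply the model to the test set of the original data and plot the confusion matrix.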

lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

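The recall meets the requirement, but there is still a problem. There are 7,996 false positives (FP), that is, **7,996 normal samples were mistaken for abnormal ones ("friendly fire"), which lowers precision**.

2.2.8 Comparing Undersampling with Training Directly on the Original Data

best_c = printing_Kfold_scores(X_train,y_train)
# Train on the original data and find the best regularization parameter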
lr = LogisticRegression(C = best_c, penalty = 'l2')
lr.fit(X_train,y_train.values.ravel())
y_pred_undersample = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

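The result is far from ideal: the recall is very low. A model trained on imbalanced data without any treatment usually performs poorly.

2.2.9 Effect of the Logistic Regression Threshold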

lr = LogisticRegression(C = 0.01, penalty = 'l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)
# lr.predict_proba returns the predicted probability of each class

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
# the thresholds to try
plt.figure(figsize=(12,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    plt.subplot(3,3,j)
    j += 1
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names,title='Threshold > %s'%i)
# top right: normal samples flagged by mistake; bottom left: frauds that were missed

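By default a probability above 0.5 is classified as abnormal. This threshold can be set freely; a larger threshold is stricter.
The effect of the threshold is clear: as it rises, recall steadily decreases while precision gradually increases.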
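The trade-off can be reproduced on a handful of made-up probabilities without training anything. The helper `recall_precision_at` and the numbers below are purely illustrative.

```python
def recall_precision_at(probabilities, y_true, threshold):
    """Classify as positive when probability > threshold; return (recall, precision)."""
    y_pred = [1 if p > threshold else 0 for p in probabilities]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Made-up predicted fraud probabilities: four frauds followed by four normal transactions.
probs = [0.95, 0.80, 0.65, 0.40, 0.60, 0.45, 0.20, 0.10]
y_true = [1, 1, 1, 1, 0, 0, 0, 0]

results = {t: recall_precision_at(probs, y_true, t) for t in (0.3, 0.5, 0.7)}
for t, (r, p) in results.items():
    print("threshold %.1f: recall %.2f, precision %.2f" % (t, r, p))
```

On this toy data recall falls from 1.0 to 0.75 to 0.5 as the threshold rises, while precision climbs from about 0.67 to 0.75 to 1.0.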

References

1. Kaggle classic case study: the complete process of credit card fraud detection (Kaggle经典案例-信用卡欺诈检测的完整过程)
