2022年泰迪杯数据分析_B题:银行客户忠诚度分析赛题数据_任务一

另有题目文件,第四题第五题全部可实现文件均放在了我的资源里面

任务1.1(1)首先,使用pandas将数据导入,查看数据的详细信息,发现education、default、marital、job等字段存在明显缺失值,使用dropna()删除缺失值所在的行;其次查看数据是否存在重复值,整体不存在重复值,单独查看‘user_id’发现,存在56个重复值,drop_duplicates()删除重复值所在的行;最后 将处理好的数据保存到“result1_1.xlsx”中。

import pandas as pd
import numpy as np
import re
#导入数据
short_data1=pd.read_csv('short-customer-data.csv')
short_data1
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
1BA220000257servicesmarriedhigh schoolNaNnonotelephonemaymon149nonexistentno
2BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
3BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
4BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
.............................................
41171BA224117273retiredmarriedjunior collegenoyesnocellularnovfri334nonexistentyes
41172BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
41173BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
41174BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
41175BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

41176 rows × 14 columns

#查看数据详细信息
short_data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41176 entries, 0 to 41175
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      41176 non-null  object
 1   age          41176 non-null  int64 
 2   job          40846 non-null  object
 3   marital      41096 non-null  object
 4   education    39446 non-null  object
 5   default      32580 non-null  object
 6   housing      40186 non-null  object
 7   loan         40186 non-null  object
 8   contact      41176 non-null  object
 9   month        41176 non-null  object
 10  day_of_week  41176 non-null  object
 11  duration     41176 non-null  int64 
 12  poutcome     41176 non-null  object
 13  y            41176 non-null  object
dtypes: int64(2), object(12)
memory usage: 4.4+ MB
#查看是否存在重复值
short_data1.duplicated().sum()
0
#查看'user_id'是否存在重复值
short_data1['user_id'].duplicated().sum()
56
#删除重复值
short_data1=short_data1.drop_duplicates(subset=['user_id'])
short_data1
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
1BA220000257servicesmarriedhigh schoolNaNnonotelephonemaymon149nonexistentno
2BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
3BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
4BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
.............................................
41170BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
41172BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
41173BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
41174BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
41175BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

41120 rows × 14 columns

#查看重复值是否删除
short_data1['user_id'].duplicated().sum()
0
#查看缺失值
short_data1.isnull().sum()
user_id           0
age               0
job             328
marital          80
education      1726
default        8578
housing         988
loan            988
contact           0
month             0
day_of_week       0
duration          0
poutcome          0
y                 0
dtype: int64
#删除缺失值所在的行
short_data1=short_data1.dropna()
short_data1
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
2BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
3BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
4BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
6BA220000759admin.marriedjunior collegenononotelephonemaymon139nonexistentno
.............................................
41170BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
41172BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
41173BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
41174BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
41175BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

30444 rows × 14 columns

#查看缺失值是否删除
short_data1.isnull().sum()
user_id        0
age            0
job            0
marital        0
education      0
default        0
housing        0
loan           0
contact        0
month          0
day_of_week    0
duration       0
poutcome       0
y              0
dtype: int64
#将处理好的数据导出
short_data1.to_excel('result1_1.xlsx',index = False)
short_data1
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
2BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
3BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
4BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
6BA220000759admin.marriedjunior collegenononotelephonemaymon139nonexistentno
.............................................
41170BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
41172BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
41173BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
41174BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
41175BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

30444 rows × 14 columns

任务1.1.2

import pandas as pd
import numpy as np
import re
long_data1=pd.read_csv('long-customer-train.csv')
long_data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9300 entries, 0 to 9299
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       9300 non-null   int64  
 1   CreditScore      9300 non-null   int64  
 2   Gender           9300 non-null   int64  
 3   Age              9300 non-null   object 
 4   Tenure           9300 non-null   int64  
 5   Balance          9300 non-null   float64
 6   NumOfProducts    9300 non-null   int64  
 7   HasCrCard        9300 non-null   int64  
 8   IsActiveMember   9300 non-null   int64  
 9   EstimatedSalary  9300 non-null   float64
 10  Exited           9300 non-null   int64  
dtypes: float64(2), int64(8), object(1)
memory usage: 799.3+ KB
long_data1[long_data1['Age']=='-']
CustomerIdCreditScoreGenderAgeTenureBalanceNumOfProductsHasCrCardIsActiveMemberEstimatedSalaryExited
464155704855580-4161766.8710092378.540
566155734526630-7115930.8711019862.780
651155760007650-6138033.5511167972.450
696155770645920-2104702.65210107948.720
796155800685260-50.00211105618.140
....................................
8176157841615830-8102945.0110052861.890
8270157870356021-80.00211152843.530
8591157958817760-8106365.29111148527.560
8776158010625571-40.00201105433.530
8794158014176570-482500.28111115260.720

78 rows × 11 columns

长期数据中的客户年龄“Age”列存在数值为-1、0 和“-”的异常值,删除存在该情况的行数据。
这里我们先将异常值赋值为空值

long_data2=long_data1.replace(to_replace=['-1','0','-','1'],value=[np.nan,np.nan,np.nan,np.nan])
long_data2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9300 entries, 0 to 9299
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       9300 non-null   int64  
 1   CreditScore      9300 non-null   int64  
 2   Gender           9300 non-null   int64  
 3   Age              9180 non-null   object 
 4   Tenure           9300 non-null   int64  
 5   Balance          9300 non-null   float64
 6   NumOfProducts    9300 non-null   int64  
 7   HasCrCard        9300 non-null   int64  
 8   IsActiveMember   9300 non-null   int64  
 9   EstimatedSalary  9300 non-null   float64
 10  Exited           9300 non-null   int64  
dtypes: float64(2), int64(8), object(1)
memory usage: 799.3+ KB

再用删除空值的方法将其所在的行删除

long_data3=long_data2.dropna()
long_data3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9180 entries, 0 to 9299
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       9180 non-null   int64  
 1   CreditScore      9180 non-null   int64  
 2   Gender           9180 non-null   int64  
 3   Age              9180 non-null   object 
 4   Tenure           9180 non-null   int64  
 5   Balance          9180 non-null   float64
 6   NumOfProducts    9180 non-null   int64  
 7   HasCrCard        9180 non-null   int64  
 8   IsActiveMember   9180 non-null   int64  
 9   EstimatedSalary  9180 non-null   float64
 10  Exited           9180 non-null   int64  
dtypes: float64(2), int64(8), object(1)
memory usage: 860.6+ KB
# long_data3['age_new']=long_data3.Age.str.extract(r"(\d+)")
# #long_data3['new_Age']=long_data3.Age.str.replace('岁','')
# #long_data3[long_data3.age_new.str.contains(" ")]
# long_data3['age_new']=long_data3['age_new'].astype(int)
# long_data4=long_data3.drop(columns=['Age'])
# long_data4.rename(columns={'age_new':'Age'},inplace=True)
# long_data4.info()

“Age”列存在空格和“岁”等异常字符,删除这些异常
字符但须保留年龄数值,将处理后的数值存于“Age”列。
这里通过正则表达式匹配异常字符中的正确年龄,并将年龄保存在了Age列,数据类型用int

long_data3['Age']=long_data3.Age.str.extract(r"(\d+)")
long_data3['Age']=long_data3['Age'].astype(int)
C:\Users\31214\AppData\Local\Temp\ipykernel_10860\3539377572.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  long_data3['Age']=long_data3.Age.str.extract(r"(\d+)")
C:\Users\31214\AppData\Local\Temp\ipykernel_10860\3539377572.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  long_data3['Age']=long_data3['Age'].astype(int)
long_data3.to_excel('result1_2.xlsx',index = False)
long_data3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9180 entries, 0 to 9299
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       9180 non-null   int64  
 1   CreditScore      9180 non-null   int64  
 2   Gender           9180 non-null   int64  
 3   Age              9180 non-null   int32  
 4   Tenure           9180 non-null   int64  
 5   Balance          9180 non-null   float64
 6   NumOfProducts    9180 non-null   int64  
 7   HasCrCard        9180 non-null   int64  
 8   IsActiveMember   9180 non-null   int64  
 9   EstimatedSalary  9180 non-null   float64
 10  Exited           9180 non-null   int64  
dtypes: float64(2), int32(1), int64(8)
memory usage: 824.8 KB
long_data3
CustomerIdCreditScoreGenderAgeTenureBalanceNumOfProductsHasCrCardIsActiveMemberEstimatedSalaryExited
0155532517131520185891.5411146369.571
11555325661914180.0031179866.731
215553283603142891611.12100144675.301
31555330858916110.0011061108.561
41555338768713920.00300188150.601
....................................
9295158156287111378113899.9210080215.200
9296158156454810378152303.66211175082.200
9297158156565411399100116.67111199808.101
9298158156607581341154139.4511160728.890
9299158156906141403113348.5011177789.010

9180 rows × 11 columns

任务1.2

import pandas as pd
short_data=pd.read_excel('result1_1.xlsx')
short_data
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
1BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
2BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
3BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
4BA220000759admin.marriedjunior collegenononotelephonemaymon139nonexistentno
.............................................
30439BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
30440BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
30441BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
30442BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
30443BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

30444 rows × 14 columns

对短期数据中的字符型数据进行特征编码,如将信用违约情况
{‘否’,‘是’}编码为{0,1}

在这种情况下,我们必须将数据进行编码,即是说,将文字型数据转换为数值型。
这里我们采用preprocessing.LabelEncoder将汉字进行编码

短期客户数据中,需要进行特征编码的字段都包含{job,marital,education,default,housing,loan,contact,poutcome,y}

from sklearn.preprocessing import LabelEncoder
short_data6=short_data.copy()
short_data6
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
1BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
2BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
3BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
4BA220000759admin.marriedjunior collegenononotelephonemaymon139nonexistentno
.............................................
30439BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
30440BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
30441BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
30442BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
30443BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

30444 rows × 14 columns

t1=short_data6.loc[:,'job']    #要输入的是标签,不是特征矩阵,所以允许一维数据
t2=short_data6.loc[:,'marital']
t3=short_data6.loc[:,'education']
t4=short_data6.loc[:,'default']
t5=short_data6.loc[:,'housing']
t6=short_data6.loc[:,'loan']
t7=short_data6.loc[:,'contact']
t8=short_data6.loc[:,'poutcome']
t9=short_data6.loc[:,'y']
le1 = LabelEncoder()     #实例化
le1 = le1.fit(t1) # 导入数据
label1 = le1.transform(t1) # transform接口调取结果
short_data6.loc[:,"job"] = label1
short_data6['job'].unique()
array([ 3,  7,  0,  9,  1, 10,  5,  2,  4,  8,  6])
le2 = LabelEncoder()     #实例化
le2 = le2.fit(t2) # 导入数据
label2 = le2.transform(t2) # transform接口调取结果
short_data6.loc[:,"marital"] = label2
short_data6['marital'].unique()
array([1, 2, 0])
le3 = LabelEncoder()     #实例化
le3 = le3.fit(t3) # 导入数据
label3 = le3.transform(t3) # transform接口调取结果
short_data6.loc[:,"education"] = label3
short_data6['education'].unique()
array([3, 0, 2, 4, 1])
le4 = LabelEncoder()     #实例化
le4 = le4.fit(t4) # 导入数据
label4 = le4.transform(t4) # transform接口调取结果
short_data6.loc[:,"default"] = label4
short_data6['default'].unique()
array([0, 1])
le5 = LabelEncoder()     #实例化
le5 = le5.fit(t5) # 导入数据
label5 = le5.transform(t5) # transform接口调取结果
short_data6.loc[:,"housing"] = label5
short_data6['housing'].unique()
array([0, 1])
le6 = LabelEncoder()     #实例化
le6 = le6.fit(t6) # 导入数据
label6 = le6.transform(t6) # transform接口调取结果
short_data6.loc[:,"loan"] = label6
short_data6['loan'].unique()
array([0, 1])
le7 = LabelEncoder()     #实例化
le7 = le7.fit(t7) # 导入数据
label7 = le7.transform(t7) # transform接口调取结果
short_data6.loc[:,"contact"] = label7
short_data6['contact'].unique()
array([1, 0])
le8 = LabelEncoder()     #实例化
le8 = le8.fit(t8) # 导入数据
label8 = le8.transform(t8) # transform接口调取结果
short_data6.loc[:,"poutcome"] = label8
short_data6['poutcome'].unique()
array([1, 0, 2])
le9 = LabelEncoder()     #实例化
le9 = le9.fit(t9) # 导入数据
label9 = le9.transform(t9) # transform接口调取结果
short_data6.loc[:,"y"] = label9
short_data6['y'].unique()
array([0, 1])
short_data7=short_data6.copy()
short_data7
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA2200001563130001maymon26110
1BA2200077377100101maymon22610
2BA2200004400130001maymon15110
3BA2200005567100011maymon30710
4BA2200007590120001maymon13910
.............................................
30439BA22411712910230100novfri11220
30440BA2241173461120000novfri38310
30441BA2241174565140100novfri18910
30442BA2241175449120000novfri44211
30443BA2241176745120100novfri23900

30444 rows × 14 columns

#short_data7.loc[short_data7['month']=='mar']['month'].unique()
short_data6['month'].unique()
array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'mar', 'apr',
       'sep'], dtype=object)

对上次拜访客户的月份(month)和最近拜访客户的月份(day_of_week)将不适用LabelEncoder库进行编码,因此这里将手动赋值,例如:一月:1,星期一:1

short_data8=short_data7.replace(to_replace=['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'],value=['1','2','3','4','5','6','7','8','9','10','11','12'])
short_data8
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA22000015631300015mon26110
1BA22000773771001015mon22610
2BA22000044001300015mon15110
3BA22000055671000115mon30710
4BA22000075901200015mon13910
.............................................
30439BA2241171291023010011fri11220
30440BA224117346112000011fri38310
30441BA224117456514010011fri18910
30442BA224117544912000011fri44211
30443BA224117674512010011fri23900

30444 rows × 14 columns

short_data8['day_of_week'].unique()
array(['mon', 'tue', 'wed', 'thu', 'fri'], dtype=object)
short_data9=short_data8.replace(to_replace=['mon','tue','wed','thu','fri'],value=[1,2,3,4,5])
short_data9
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA22000015631300015126110
1BA22000773771001015122610
2BA22000044001300015115110
3BA22000055671000115130710
4BA22000075901200015113910
.............................................
30439BA2241171291023010011511220
30440BA224117346112000011538310
30441BA224117456514010011518910
30442BA224117544912000011544211
30443BA224117674512010011523900

30444 rows × 14 columns

short_data9.to_excel("result1_3.xlsx",encoding = 'openpyxl',index = False)
  • 6
    点赞
  • 23
    收藏
    觉得还不错? 一键收藏
  • 12
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 12
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值