Partial walkthrough of Problem B, 2022 Teddy Cup (泰迪杯)

data1=data.dropna().drop_duplicates(subset=['user_id'])
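# dropna() first removes every row that contains a missing value; drop_duplicates(subset=['user_id']) then deduplicates the remaining rows on the user_id column, keeping only the first occurrence of each user_id. This guarantees that every user has exactly one record in the final DataFrame.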
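
The order of the two calls matters. The following is a small sketch on made-up data (the `age` column and all values are assumptions used only for illustration) showing that deduplicating first can lose a user whose first row happens to be incomplete:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: user 2 has an incomplete first row and a complete second row
data = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "age": [30, 31, np.nan, 50, 45],
})

# Order used in the walkthrough: drop incomplete rows, then deduplicate
data1 = data.dropna().drop_duplicates(subset=["user_id"])

# Reverse order, for comparison: user 2 keeps only the incomplete row, which is then dropped
reversed_order = data.drop_duplicates(subset=["user_id"]).dropna()

print(data1["user_id"].tolist())
print(reversed_order["user_id"].tolist())
```

With dropna() first, user 2 survives through the complete second row; with the reverse order the user disappears entirely.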

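Task 1.1 (2):

Handling the outlier values -1, 0, and "-":

# Build a boolean condition that flags the outliers:
# an age of -1, 0, or '-' is treated as invalid
condition = (long_data['Age'] == -1) | (long_data['Age'] == 0) | (long_data['Age'] == '-')

# Negate the condition with ~ to keep only the rows that are not outliers,
# i.e. drop every row whose age is -1, 0, or '-'
long_data1 = long_data[~condition]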

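Handling spaces and the stray "岁" (years old) character:

Define a helper function for this. Its regular expression, the character class [\s岁], matches any whitespace character or the character "岁" and replaces each match with an empty string, which strips these unwanted characters from the age data.

import re

def clean_age(age_str):
    # str() guards against values that are already numeric
    return re.sub(r'[\s岁]', '', str(age_str))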
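
The two cleaning steps can also be combined into one pass. This is a minimal sketch on made-up values, not the author's pipeline: it strips the characters first and then treats every non-positive or non-numeric age as invalid, which is a slightly stricter rule than the explicit -1 / 0 / '-' check (an assumption of this sketch).

```python
import re

import pandas as pd

# Hypothetical sample standing in for the competition's long-term data
long_data = pd.DataFrame({"Age": ["35 岁", "42", "-", "0", "-1", " 28", 51]})


def clean_age(value):
    """Remove whitespace and the character 岁 from one age value."""
    return re.sub(r"[\s岁]", "", str(value))


cleaned = long_data["Age"].map(clean_age)
# Anything that is not a number (such as '-') becomes NaN
ages = pd.to_numeric(cleaned, errors="coerce")
# Keep strictly positive ages only, which drops -1, 0 and NaN in one step
valid = ages > 0
long_data1 = long_data.loc[valid].copy()
long_data1["Age"] = ages[valid].astype(int)

print(long_data1["Age"].tolist())
```

Converting with pd.to_numeric also leaves the column with an integer dtype, which the later age-bracket computation relies on.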

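Task 1.2:

Note the wording of the problem statement: "e.g. encode the credit default status {'否','是'} (no/yes) as {0,1}". This is only an example, so besides mapping {no, yes} to {0, 1} we also have to encode the other categorical columns in the same way.

Below is an example that converts the job column to numeric codes (admittedly a bit brute-force 😢); the remaining columns follow the same pattern.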

#job
column_value=res['job'].tolist()
unique_column_value=list(set(column_value))
unique_column_value
#%%
replace1={'admin.':1,
 'unemployed':2,
 'retired':3,
 'technician':4,
 'student':5,
 'self-employed':6,
 'blue-collar':7,
 'housemaid':8,
 'entrepreneur':9,
 'services':10,
 'management':11}
res['job']=res['job'].replace(replace1)

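Job type: convert the job types in the job column to numeric codes.
Marital status: convert the marital statuses in the marital column to numeric codes.
Education level: convert the education levels in the education column to numeric codes.
Credit default: convert yes and no in the default column to 1 and 0.
Housing loan: convert yes and no in the housing column to 1 and 0.
Personal loan: convert yes and no in the loan column to 1 and 0.
Contact type: convert the contact types in the contact column to numeric codes.
Day of week: convert the weekdays in the day_of_week column to numeric codes.
Month: convert the month names in the month column to numeric codes.
Previous outcome: convert the outcomes in the poutcome column to numeric codes.
Target variable: convert yes and no in the y column to 1 and 0.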
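
Writing one dictionary per column by hand gets tedious, so the per-column steps listed above could be looped. The sketch below is one possible way to do it on made-up data; the sample rows and the choice of numbering labels in sorted order are assumptions, so the resulting codes will differ from the hand-written job dictionary.

```python
import pandas as pd

# Hypothetical sample with a few of the columns listed above
res = pd.DataFrame({
    "job": ["admin.", "student", "admin.", "retired"],
    "marital": ["married", "single", "single", "divorced"],
    "default": ["no", "yes", "no", "no"],
    "y": ["yes", "no", "no", "yes"],
})

binary_cols = ["default", "y"]
category_cols = ["job", "marital"]

# Binary columns use the fixed mapping given in the problem statement
for col in binary_cols:
    res[col] = res[col].map({"no": 0, "yes": 1})

# Other columns get codes 1..k in sorted label order; keep the mappings for decoding later
mappings = {}
for col in category_cols:
    labels = sorted(res[col].unique())
    mappings[col] = {label: code for code, label in enumerate(labels, start=1)}
    res[col] = res[col].map(mappings[col])

print(mappings["job"])
print(res)
```

Keeping the mappings dictionary makes it possible to translate the codes back to labels when reading the correlation heatmap.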

Task 2:

Problem statement:

Task 2.1:

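import matplotlib.pyplot as plt
import seaborn as sns

correlation_matrix=res.corr()  # compute the correlation matrix
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
plt.figure(figsize=(20,10))
sns.heatmap(correlation_matrix,annot=True,cmap='coolwarm',fmt='.4f',square=True)  # draw the heatmap
plt.xticks(rotation=45)
plt.title('短期数据所有指标间的相关系数热力图')  # "Heatmap of the correlation coefficients between all short-term data indicators"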

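# Drawing the heatmap:

sns.heatmap(): draws the heatmap with the seaborn library.
- correlation_matrix: the correlation matrix computed beforehand.
- annot=True: print the value in each cell of the heatmap.
- cmap='coolwarm': use the coolwarm colormap. It is a diverging blue-to-red scheme, which suits data that has both positive and negative values.
- fmt='.4f': format the annotations as floats with four decimal places.
- square=True: make every cell square.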
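
Besides plotting it, the correlation matrix can be read programmatically. The sketch below uses made-up numbers (the columns and values are assumptions) to rank the features by the absolute strength of their correlation with the target y:

```python
import pandas as pd

# Hypothetical, already-encoded sample
res = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "duration": [100, 150, 300, 310, 500],
    "campaign": [5, 4, 3, 2, 1],
    "y": [0, 0, 1, 1, 1],
})

correlation_matrix = res.corr()  # Pearson correlation by default

# Correlation of each feature with y, strongest (in absolute value) first
target_corr = correlation_matrix["y"].drop("y").sort_values(key=abs, ascending=False)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top, where a plain descending sort would push them to the bottom.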

Task 2.3:

blue_collar_cnt=0
student_cnt=0
else_cnt=0
for index,row in res.iterrows():
    x=row['job']
    y=row['y']
    if x=='blue-collar' and y=='yes':
        blue_collar_cnt+=1
    elif x=='student' and y=='yes':
        student_cnt+=1
    elif y=='yes':
        else_cnt+=1    
#%%
blue_collar_cnt2=0
student_cnt2=0
else_cnt2=0
for index,row in res.iterrows():
    x=row['job']
    y=row['y']
    if x=='blue-collar' and y=='no':
        blue_collar_cnt2+=1
    elif x=='student' and y=='no':
        student_cnt2+=1
    elif y=='no':
        else_cnt2+=1 
#%%
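# Create a figure with two subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6)) 

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text
# Prepare the data
job_name=['蓝领','学生','其他']  # blue-collar, student, other
cnt=[blue_collar_cnt,student_cnt,else_cnt]
cnt2=[blue_collar_cnt2,student_cnt2,else_cnt2]

# Draw the pie charts
axs[0].pie(cnt,labels=job_name,colors=['b','r','m'],autopct='%1.2f%%')
axs[0].set_title('已购买')  # purchased
axs[1].pie(cnt2,labels=job_name,colors=['b','r','m'],autopct='%1.2f%%')
axs[1].set_title('未购买')  # not purchased
plt.show()  # display the figure

Pie charts produced by the code above (part of it was changed at the last minute, so please point out any mistakes).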
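
The two iterrows() loops compare job and y against their original string labels, so they assume these columns have not yet been numerically encoded. Under the same assumption, the six counters could be computed in one step with pd.crosstab. This is a sketch on made-up rows, not the code used for the figure:

```python
import pandas as pd

# Hypothetical sample with the original (unencoded) labels
res = pd.DataFrame({
    "job": ["blue-collar", "student", "admin.", "blue-collar", "retired", "student", "blue-collar"],
    "y": ["yes", "yes", "yes", "no", "no", "no", "yes"],
})

# Collapse every job other than blue-collar and student into "other"
group = res["job"].where(res["job"].isin(["blue-collar", "student"]), "other")
counts = pd.crosstab(group, res["y"])

order = ["blue-collar", "student", "other"]
cnt = counts["yes"].reindex(order, fill_value=0).tolist()   # purchased
cnt2 = counts["no"].reindex(order, fill_value=0).tolist()   # not purchased
print(cnt, cnt2)
```

The lists cnt and cnt2 have the same order as job_name, so they can be passed to the pie calls unchanged.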

Task 3:

Problem statement:

Task 3.1:

First prepare the data (each age bracket spans 5 years):

# Define the age brackets with a step of 5 (x-axis): 1-5, 6-10, ..., 96-100
age_step=list(range(1,102,5))
age_labels=[f'{age_step[i]}-{age_step[i+1]-1}'for i in range(len(age_step)-1)]
# Create two lists of length len(age_step) - 1, initialised to 0
exited_no = [0] * (len(age_step) -1)
exited_yes = [0] * (len(age_step) -1)
# Count retained and churned customers in each age bracket
for i in res.index:
    age = res.loc[i, 'Age']
    exited = res.loc[i, 'Exited']
    idx = (int(age) - 1) // 5  # brackets start at 1, so ages 1-5 map to 0, 6-10 to 1, ...
    if exited == 0:
        exited_no[idx] += 1
    elif exited == 1:
        exited_yes[idx] += 1
# Compute each bracket's share within its group (y-axis)
totalno=sum(exited_no)
totalyes=sum(exited_yes)
Percentage_no=[x / totalno for x in exited_no]
Percentage_yes=[x / totalyes for x in exited_yes]
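
The same bracketing can be expressed with pd.cut, which avoids computing the index by hand. A minimal sketch on made-up ages (the sample rows are assumptions); note in particular where the boundary ages 20 and 100 end up:

```python
import pandas as pd

# Hypothetical sample
res = pd.DataFrame({
    "Age": [18, 20, 21, 25, 40, 100],
    "Exited": [0, 1, 0, 0, 1, 1],
})

age_step = list(range(1, 102, 5))
age_labels = [f"{lo}-{hi - 1}" for lo, hi in zip(age_step[:-1], age_step[1:])]

# right=False makes each bin [lo, hi), i.e. 1-5, 6-10, ... for integer ages
bracket = pd.cut(res["Age"], bins=age_step, right=False, labels=age_labels).astype(str)


def bracket_share(mask):
    """Share of the selected customers that falls into each age bracket."""
    counts = bracket[mask].value_counts().reindex(age_labels, fill_value=0)
    return counts / counts.sum()


share_no = bracket_share(res["Exited"] == 0)
share_yes = bracket_share(res["Exited"] == 1)
print(share_no[share_no > 0])
print(share_yes[share_yes > 0])
```

Reindexing on age_labels keeps the empty brackets (with a share of 0), so both series line up with the x-axis labels.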

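Then draw the plots. Here the two charts are drawn on separate figures; you could instead draw them side by side on one figure, or put both lines in a single chart, which looks nicer:

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text

# First line chart
plt.figure(figsize=(10, 5), dpi=100)
plt.plot(age_labels, Percentage_no, marker='o', label='未流失',color='blue')  # label: retained
plt.title('未流失客户占比')  # "Share of retained customers"
plt.xlabel('年龄段')  # age bracket
plt.ylabel('占比')  # share
plt.legend()
plt.xticks(rotation=45)  # rotate the x-axis labels to avoid overlap
plt.tight_layout()
plt.show()

# Second line chart
plt.figure(figsize=(10, 5), dpi=100)
plt.plot(age_labels, Percentage_yes, marker='o', label='已流失', color='red')  # label: churned
plt.title('已流失客户占比')  # "Share of churned customers"
plt.xlabel('年龄段')  # age bracket
plt.ylabel('占比')  # share
plt.legend()
plt.xticks(rotation=45)  # rotate the x-axis labels to avoid overlap
plt.tight_layout()
plt.show()

Resulting plots: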

Task 3.2:

Code (this uses matplotlib's plt.scatter() to draw the scatter plot; seaborn's sns.scatterplot() is an alternative that offers higher-level plotting features):

plt.figure(figsize=(20,10))
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
Not_churned=res[res['Exited']==0]
lost=res[res['Exited']==1]
plt.scatter(Not_churned['Age'],Not_churned['CreditScore'],color='blue', marker='o', label='未流失')  # retained
plt.scatter(lost['Age'],lost['CreditScore'],color='orange', marker='o', label='已流失')  # churned
plt.legend()
plt.grid()
plt.title('两种流失情况下客户信用资格与年龄分布的散点图')  # "Credit score vs. age for the two churn outcomes"
plt.xlabel('年龄')  # age
plt.ylabel('信用资格')  # credit score

Resulting scatter plot:

Task 3.3 (1):

This task asks for a pivot table. The code is short: pivot_table builds the table directly, counting the rows grouped by Exited and Tenure.

res1=res.pivot_table(index='Exited',columns='Tenure',aggfunc='size', fill_value=0)
# Convert the counts to shares within each Exited group (each row sums to 1)
res1=res1.div(res1.sum(axis=1),axis=0)
res1
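
An alternative way to get the same row-wise shares is pd.crosstab with normalize='index', which counts and normalises in one call. A small sketch on made-up data (the sample rows are assumptions):

```python
import pandas as pd

# Hypothetical sample
res = pd.DataFrame({
    "Exited": [0, 0, 0, 1, 1, 0],
    "Tenure": [1, 2, 2, 1, 3, 3],
})

# Rows: Exited, columns: Tenure, values: share of each tenure within the Exited group
res1 = pd.crosstab(res["Exited"], res["Tenure"], normalize="index")
print(res1)
```

Combinations that never occur (here Exited = 1 with Tenure = 2) show up as 0, and the table can then be written out with res1.to_excel(...) under a file name of your choice.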

Pivot table output (here it was only displayed in Jupyter; saving it to an Excel file would be better):

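Task 3.3 (2):

Draw a stacked bar chart of the tenure shares for the two churn outcomes (this is just a quick, no-frills plot):

plt.figure(figsize=(20,10))
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

plt.bar(res1.columns, res1.iloc[0], label='未流失', color='purple')  # retained
plt.bar(res1.columns, res1.iloc[1], bottom=res1.iloc[0], label='已流失', color='orange')  # churned, stacked on top

plt.title('各账号户龄占比量的堆叠柱状图')  # "Stacked bar chart of the share of each account tenure"
plt.xlabel('户龄')  # tenure
plt.ylabel('占比量')  # share
plt.legend()

Resulting stacked bar chart: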

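Task 3.4 (1):

First, extract the CustomerId, Tenure and Balance columns from the (already cleaned) long-term data into a new DataFrame.

The Tenure column gives each customer's account tenure and is used to classify the customer status into three classes: long-standing customers (老客户), stable customers (稳定客户) and new customers (新客户).
The Balance column gives each customer's financial assets and is used to classify the asset stage into four classes: low, lower-middle, upper-middle and high assets (低资产, 中下资产, 中上资产, 高资产).

Each of the two classifications can be handled by a small function built from if/else statements.

Then use apply to run the two functions on their respective columns, add the resulting labels as new columns, and save the result to an Excel file for the visualisation in the next sub-task.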
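
The steps above can be sketched as follows. The cut-off values in both functions are placeholders chosen for illustration and must be replaced with the ones given in the problem statement; the sample rows are made up as well. The column names Status and AssetStage match the ones used by the plotting code in the next sub-task, and Exited is carried along because that code sums it.

```python
import pandas as pd


def classify_status(tenure):
    """Map account tenure to a customer status (hypothetical cut-offs)."""
    if tenure <= 3:
        return "新客户"      # new customer
    elif tenure <= 6:
        return "稳定客户"    # stable customer
    else:
        return "老客户"      # long-standing customer


def classify_asset(balance):
    """Map balance to an asset stage (hypothetical cut-offs)."""
    if balance <= 50000:
        return "低资产"      # low assets
    elif balance <= 90000:
        return "中下资产"    # lower-middle assets
    elif balance <= 120000:
        return "中上资产"    # upper-middle assets
    else:
        return "高资产"      # high assets


# Hypothetical sample standing in for the cleaned long-term data
res = pd.DataFrame({
    "CustomerId": [1, 2, 3, 4],
    "Tenure": [1, 5, 9, 3],
    "Balance": [0.0, 75000.0, 130000.0, 100000.0],
    "Exited": [1, 0, 1, 0],
})

res2 = res[["CustomerId", "Tenure", "Balance", "Exited"]].copy()
res2["Status"] = res2["Tenure"].apply(classify_status)
res2["AssetStage"] = res2["Balance"].apply(classify_asset)
print(res2)
# res2.to_excel(...) would then write the table out for the next sub-task
```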

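Task 3.4 (2):

Load the Excel file produced in the previous sub-task and go straight to plotting.

Note that the problem asks to "count the churned customers among new and long-standing customers in each asset stage", so the stable customers defined in the previous sub-task are left out here. They can be filtered out with res2=res2[res2['Status']!='稳定客户'].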

res2=pd.read_excel(r'churn_analysi.xlsx')
res2=res2[res2['Status']!='稳定客户']

exited_counts=res2.groupby(['Status','AssetStage'])['Exited'].sum().reset_index()
heatmap_data=exited_counts.pivot(index='AssetStage',columns='Status',values='Exited')

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(20,10))
sns.heatmap(heatmap_data,annot=True,fmt='.0f',cmap='RdYlGn_r',linewidths=.5,vmin=100,vmax=1300)
plt.title('新老客户在各资产阶段中流失热力图')
plt.xlabel('账号户龄划分情况')
plt.ylabel('客户金融资产划分情况')

Resulting heatmap:

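Task 4:

Problem statement:

Tasks 4.1, 4.2, 4.3:

Task 4.1:

First, load the data produced in Task 3.4.

Next, add a new column to the data and initialise it to the empty string everywhere. Its name is the one the problem requires for the result column (IsActiveStatus).

Then loop over every row of the DataFrame and read that row's IsActiveMember and Status values, i.e. the customer's activity flag and the tenure-based customer status.

Finally, use nested if/else statements to test the conditions, for example:

If IsActiveMember is 0:
- for a new customer (新客户), set IsActiveStatus to 0.
- for a stable customer (稳定客户), set it to 1.
- for a long-standing customer (老客户), set it to 2.
If IsActiveMember is 1:
- for a new customer, set it to 3.
- for a stable customer, set it to 4.
- for a long-standing customer, set it to 5.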
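
The six cases above follow the pattern 3 × IsActiveMember + rank, where new, stable and long-standing customers have rank 0, 1 and 2. As a sketch (the sample rows and the status_rank name are assumptions), the row loop with nested if/else can therefore be replaced by one vectorised expression:

```python
import pandas as pd

# Hypothetical sample covering all six combinations
res2 = pd.DataFrame({
    "IsActiveMember": [0, 0, 0, 1, 1, 1],
    "Status": ["新客户", "稳定客户", "老客户", "新客户", "稳定客户", "老客户"],
})

# Rank of each customer status: new = 0, stable = 1, long-standing = 2
status_rank = {"新客户": 0, "稳定客户": 1, "老客户": 2}

res2["IsActiveStatus"] = res2["IsActiveMember"] * 3 + res2["Status"].map(status_rank)
print(res2)
```

This gives the same codes as the nested conditions and avoids initialising the column with empty strings first.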