1. Problem statement
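The sample dataset summarizes the usage behavior of about 9,000 active credit card holders over the past 6 months and contains 18 behavioral variables.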
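CUST_ID: credit card holder ID
BALANCE: balance left in the account that is available for purchases
BALANCE_FREQUENCY: how frequently the balance is updated, a score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES: amount of purchases made from the account
ONEOFF_PURCHASES: maximum purchase amount made in a single transaction
INSTALLMENTS_PURCHASES: amount of purchases made in installments
CASH_ADVANCE: cash advance taken by the user
PURCHASES_FREQUENCY: how frequently purchases are made, a score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY: how frequently one-off purchases are made (1 = frequently purchased, 0 = not frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY: how frequently installment purchases are made (1 = frequently done, 0 = not frequently done)
CASH_ADVANCE_FREQUENCY: how frequently cash advances are taken
CASH_ADVANCE_TRX: number of transactions made with cash advance
PURCHASES_TRX: number of purchase transactions made
CREDIT_LIMIT: credit card limit of the user
PAYMENTS: amount of payments made by the user
MINIMUM_PAYMENTS: minimum amount of payments made by the user
PRC_FULL_PAYMENT: percentage of full payment paid by the user
TENURE: tenure of the credit card service for the user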
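Tasks: read the data, inspect its basic information, check for missing values and identify which variables have them, fill the missing values with the column mean, standardize the data, choose an appropriate number of clusters, run K-Means clustering, run hierarchical clustering with AgglomerativeClustering, run density-based clustering with DBSCAN, and compute the silhouette coefficient (silhouette_score) for each clustering.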
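For reference, the silhouette coefficient that the tasks ask for is usually defined as below. This is the standard textbook definition, not something stated in the original post. The symbols are introduced here for illustration: d(i, j) is the distance between samples i and j, and C_i is the cluster containing sample i.

```latex
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\; j \neq i} d(i, j), \qquad
b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

Each s(i) lies between -1 and 1, and silhouette_score reports the mean of s(i) over all samples. Values near 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping clusters.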
2. Data format
3. Code
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
def printf(n, strf):
    # Print a separator line of n dashes followed by a bold title
    print()
    print('-' * n)
    print(f"\033[1m{strf}\033[0m")
    print()
data = pd.read_csv("dataset/CC GENERAL.csv")
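printf(100, 'Basic information about the data')
data.info()  # info() prints directly and returns None, so it needs no print() wrapper
printf(100, 'Missing values per column')
print(data.isnull().sum())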
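# CUST_ID is an identifier, not a feature, so drop it before clustering
data_numeric = data.drop(columns=['CUST_ID'])
# Fill missing values with the column mean
imputer = SimpleImputer(strategy='mean')
data_filled = pd.DataFrame(imputer.fit_transform(data_numeric), columns=data_numeric.columns)
# Standardize every feature to zero mean and unit variance
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_filled)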
# Compare silhouette scores for k = 2..7 to choose the number of clusters
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, random_state=42)
    clusters = kmeans.fit_predict(data_scaled)
    silhouette_score_kmeans = silhouette_score(data_scaled, clusters)
    print(f"K-Means with {k} clusters - Silhouette Score: {silhouette_score_kmeans}")
agglomerative = AgglomerativeClustering(n_clusters=3)
clusters = agglomerative.fit_predict(data_scaled)
silhouette_score_agg = silhouette_score(data_scaled, clusters)
print(f"Agglomerative Clustering - Silhouette Score: {silhouette_score_agg}")
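dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(data_scaled)
mask = clusters != -1  # DBSCAN labels noise as -1; exclude it from the score
if len(set(clusters[mask])) >= 2:  # the silhouette score needs at least 2 clusters
    silhouette_score_dbscan = silhouette_score(data_scaled[mask], clusters[mask])
    print(f"DBSCAN Clustering - Silhouette Score: {silhouette_score_dbscan}")
else:
    print("DBSCAN found fewer than 2 clusters; silhouette score is undefined")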
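The DBSCAN step uses a fixed eps=0.5, which may not suit standardized data with many features. The sketch below shows one common heuristic for picking eps from the k-nearest-neighbour distances, together with a silhouette score computed over non-noise points only. It rests on several assumptions: the data is synthetic (make_blobs stands in for the credit card data), the helper silhouette_without_noise is a hypothetical name introduced here, and the 0.95 quantile is an arbitrary illustrative choice rather than a recommended value.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def silhouette_without_noise(X, labels):
    """Hypothetical helper: silhouette score over non-noise points only.

    Returns None when the score is undefined, i.e. when fewer than two
    clusters remain or every remaining point is its own cluster.
    """
    labels = np.asarray(labels)
    mask = labels != -1  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2 or n_clusters >= mask.sum():
        return None
    return silhouette_score(X[mask], labels[mask])


# Synthetic stand-in for the standardized credit card data (assumption).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

# k-distance heuristic: for each point, take the distance to its
# min_samples-th nearest neighbour (the point itself counts as the first).
min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Illustrative choice: use a high quantile of the k-distances as eps.
eps = float(np.quantile(k_dist, 0.95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
score = silhouette_without_noise(X, labels)
n_noise = int((labels == -1).sum())
print(f"eps={eps:.3f}, noise points={n_noise}, silhouette={score}")
```

On the real data, data_scaled would take the place of X. Plotting k_dist and looking for the bend in the curve is a common alternative to a fixed quantile.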