Logistic Regression Practice [1]
1. Data Processing
1. Import the data: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw
Note: this dataset is financial data (not raw data; it has already been preprocessed). The task is to predict whether a loan user will go overdue. The "status" column in the table is the label: 0 means not overdue, 1 means overdue.
Analysis: the original shape is (4754, 90)
- Drop samples with too many missing features: data.dropna(thresh=50) → shape (4476, 90)
- Split the data into data_num and data_obj, using data.info()
- Analyze data_num and data_obj separately with data.describe()
- Drop unneeded features:
  data_obj.drop(['trade_no', 'bank_card_no', 'source', 'id_name'], axis=1, inplace=True)
  data_num.drop(['custid', 'student_feature', 'Unnamed: 0'], axis=1, inplace=True)
- Fill missing values and convert types
- Save the data → shape (4476, 89)
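The dropna(thresh=...) call in the first step keeps a row only if it has at least that many non-missing values. A minimal sketch with a hypothetical toy frame:

```python
import pandas as pd
import numpy as np

# thresh=N keeps only rows with at least N non-NaN values
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7.0, np.nan, 9.0],
})
kept = df.dropna(thresh=2)        # row 1 has 0 non-NaN values -> dropped
print(kept.index.tolist())        # rows 0 and 2 survive, labels kept
```

Note that the surviving rows keep their original index labels (0 and 2 here), which is exactly the pitfall discussed below.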
Issues encountered
- Chinese text comes out garbled when reading the file (add encoding='gbk')
- After dropping samples, the remaining rows keep their original index; when using pd.concat(), make sure the indexes match
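The index pitfall above can be reproduced in a few lines. pd.concat(axis=1) aligns on index labels, so a frame carrying post-dropna labels and a frame with a fresh RangeIndex do not line up (toy data, hypothetical values):

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 2, 5])   # e.g. survivors of dropna
right = pd.DataFrame({'y': [10, 20, 30]})                # fresh index 0, 1, 2
bad = pd.concat([left, right], axis=1)                   # aligns on labels -> NaN holes
good = pd.concat([left, right.set_index(left.index)], axis=1)
print(len(bad), len(good))                               # 4 rows vs 3 rows
```

The "bad" result has four rows (the union of the two indexes) with NaN holes, which is why the processing code above always rebuilds new DataFrames with index=data_obj.index.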
Code
```python
import pandas as pd
from sklearn.impute import SimpleImputer   # Imputer moved here in sklearn >= 0.20
from sklearn.preprocessing import LabelBinarizer

# ================== 1. Read the data
# Chinese text in Excel/CSV files is commonly saved as GBK
data = pd.read_csv('data.csv', encoding='gbk')

# 1_1. Drop samples with too many missing values, and duplicates
data_del = data.dropna(thresh=50).copy()
data_del.drop_duplicates(inplace=True)  # remove duplicate rows

# 1_2. Split the data into data_num and data_obj
object_column = ['trade_no', 'bank_card_no', 'reg_preference_for_trad',
                 'source', 'id_name', 'latest_query_time', 'loans_latest_time']
data_obj = data_del[object_column].copy()
data_num = data_del.drop(object_column, axis=1)

# 1_3. Drop some features
data_obj.drop(['trade_no', 'bank_card_no', 'source', 'id_name'],
              axis=1, inplace=True)
data_num.drop(['custid', 'student_feature', 'Unnamed: 0'],
              axis=1, inplace=True)

# 1_4. Fill numeric columns with the mean
imputer = SimpleImputer(strategy='mean')
num = imputer.fit_transform(data_num)
data_num = pd.DataFrame(num, index=data_num.index, columns=data_num.columns)

# 1_5. Forward-fill the non-numeric columns
data_obj.fillna(method='ffill', inplace=True)

# ======================================================
# 1_6. One-hot encode reg_preference_for_trad
encoder = LabelBinarizer()
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
reg_preference_df = pd.DataFrame(reg_preference_1hot,
                                 index=data_obj.index,
                                 columns=encoder.classes_)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)

# 1_7. Split the two date columns, latest_query_time and loans_latest_time,
#      into month and weekday features
data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday
data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday
data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)

# ========== 1_8. Merge data_num and data_obj
data_processed = pd.concat([data_num, data_obj], axis=1)

# =========== 1_9. Save the processed data
data_processed.to_csv('data_processed.csv', index=False)
```
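Step 1_6 depends on how LabelBinarizer orders its classes: classes_ is sorted, and each output row is a one-hot vector in that order, which is why it can double as the new column names. A minimal sketch with made-up category names standing in for the reg_preference_for_trad values:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical categories; the real column holds Chinese city-tier strings
cities = ['tier1', 'tier2', 'tier1', 'tier3']
encoder = LabelBinarizer()
onehot = encoder.fit_transform(cities)
print(list(encoder.classes_))  # sorted: ['tier1', 'tier2', 'tier3']
print(onehot[0].tolist())      # first sample -> [1, 0, 0]
```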
2. Logistic Regression Practice
After standardizing the data, accuracy improved a lot.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler

# ------------------------------------------ 1. Read the data
# data_processed.csv was written by pandas with its default utf-8 encoding
data = pd.read_csv('data_processed.csv')

# ------------------------------------------- 1.1 Split into training and validation sets
train, test = train_test_split(data, test_size=0.1, random_state=666)

# ---------------------------------------- 1.2 Extract the labels
y_train = train.status
train = train.drop(['status'], axis=1)
y_test = test.status
test = test.drop(['status'], axis=1)

# 1.3 Standardize the data
# Fit the scaler on the training set only, then reuse it on the
# validation set so both are scaled with the same statistics
scaler = StandardScaler()
train = pd.DataFrame(scaler.fit_transform(train),
                     index=train.index, columns=train.columns)
test = pd.DataFrame(scaler.transform(test),
                    index=test.index, columns=test.columns)

# ---------------------------------------- 1.4 Train the model
# dual=True requires the liblinear solver with an l2 penalty
model = LogisticRegression(C=1, dual=True, solver='liblinear')
model.fit(train, y_train)

# -------------------------------------- 1.5 Predict
y_test_pre = model.predict(test)

# ---------------------------------------- 1.6 Score
score_vail = f1_score(y_test, y_test_pre, average='macro')
print('-------------------------------')
print('Validation score: {}'.format(score_vail))
```
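The scoring step uses f1_score with average='macro', which averages the per-class F1 values with equal weight, so the minority overdue class (status=1) is not drowned out by the majority class. A toy check with hypothetical labels:

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: class 1 is the minority, one of its two cases is missed
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0]

# class 0: precision 3/4, recall 1   -> F1 = 6/7
# class 1: precision 1,   recall 1/2 -> F1 = 2/3
macro = f1_score(y_true, y_pred, average='macro')
print(macro)  # (6/7 + 2/3) / 2 = 16/21 ~ 0.762
```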
Result: