Logistic Regression in Practice

1. Data Processing

Import the data: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw
Note: this dataset is financial data (not raw data; it has already been preprocessed). The task is to predict whether a loan user will become overdue. The "status" column in the table is the target label: 0 means not overdue, 1 means overdue.
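Before any cleaning, it is worth checking how imbalanced the `status` label is, since class balance affects how the macro-F1 score used later should be read. A minimal sketch (a tiny synthetic frame stands in for the real `data.csv` here):

```python
import pandas as pd

# Synthetic stand-in for data.csv; with the real file this would be:
#   data = pd.read_csv('data.csv', encoding='gbk')
data = pd.DataFrame({'status': [0, 0, 0, 1, 0, 1, 0, 0]})

counts = data['status'].value_counts()        # how many samples per class
overdue_ratio = counts[1] / len(data)         # fraction of overdue loans
print(counts.to_dict())                       # → {0: 6, 1: 2}
print('overdue ratio:', overdue_ratio)        # → 0.25
```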

  • Analysis: the raw data has shape (4754, 90)

    • Drop samples with too many missing features via data.dropna(thresh=50) → (4476, 90)
    • Split the data into data_num and data_obj, using data.info()
    • Analyze data_num and data_obj separately with data.describe()
      data_obj.drop(['trade_no','bank_card_no','source','id_name'], axis=1, inplace=True)
      data_num.drop(['custid', 'student_feature', 'Unnamed: 0'], axis=1, inplace=True)

    • Fill missing values and convert types
    • Save the data → (4476, 89)
  • Problems encountered

    1. Chinese text comes out garbled when reading (fixed by passing encoding='gbk')
    2. Rows kept after dropping samples retain their original index; when using pd.concat(), check that the indexes of the pieces match
  • Code

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer, Imputer   # note: Imputer was removed in scikit-learn 0.22; newer versions use sklearn.impute.SimpleImputer
    
    '''==================
    1. Read the data
    '''
    data = pd.read_csv('data.csv', encoding='gbk')     # CSV files saved from Excel with Chinese text are often GBK-encoded
    
    '''
    1_1. Drop samples with too many missing values, and duplicates
    '''
    data_del = data.dropna(thresh=50)
    data_del.drop_duplicates(inplace=True)   # remove duplicate rows
    
    '''
    1_2. Analysis: split the data into data_num and data_obj
    '''
    object_column = ['trade_no','bank_card_no','reg_preference_for_trad',  'source','id_name', 'latest_query_time', 'loans_latest_time']
    data_obj = data_del[object_column]
    data_num = data_del.drop(object_column,axis=1)
    '''
    1_3. Drop some features
    '''
    data_obj.drop(['trade_no','bank_card_no','source','id_name'], axis=1, inplace=True)
    data_num.drop(['custid', 'student_feature', 'Unnamed: 0'], axis=1, inplace=True)
    
    '''
    1_4. For numeric data, impute missing values with the column mean
    '''
    imputer = Imputer(strategy='mean')
    num = imputer.fit_transform(data_num)
    data_num = pd.DataFrame(num, index=data_num.index,columns=data_num.columns)
    
    '''
    1_5. For non-numeric data, forward-fill missing values
    '''
    data_obj.fillna(method='ffill',inplace=True)
    
    '''======================================================
    1_6. Convert reg_preference_for_trad to numeric features via one-hot encoding
    '''
    encoder = LabelBinarizer()
    reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
    
    data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
    reg_preference_df = pd.DataFrame(reg_preference_1hot, index=data_obj.index,columns=encoder.classes_)
    data_obj = pd.concat([data_obj, reg_preference_df], axis=1)
    
    '''
    1_7. Split the date columns latest_query_time and loans_latest_time into month and weekday features
    '''
    data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
    data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
    data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday
    data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
    data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
    data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday
    
    data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)
    
    '''==========
    1_8. Merge data_num and data_obj
    '''
    data_processed = pd.concat([data_num, data_obj], axis=1)
    
    '''===========
    1_9. Save the processed data
    '''
    data_processed.to_csv('data_processed.csv', index=False)
    
    
    
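Problem 2 above (pd.concat misbehaving when dropped rows leave gaps in the index) can be reproduced in isolation. A small sketch, independent of the loan dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10.0, np.nan, 30.0, 40.0]})
kept = df.dropna()                    # surviving index is [0, 2, 3], not [0, 1, 2]

# concat aligns on index labels, not positions, so pairing `kept` with a
# freshly built frame (default RangeIndex) silently introduces NaN rows.
other = pd.DataFrame({'c': [100, 200, 300]})                      # index [0, 1, 2]
bad = pd.concat([kept, other], axis=1)                            # shape (4, 3), with NaNs
good = pd.concat([kept.reset_index(drop=True), other], axis=1)    # shape (3, 3), aligned
print(bad.shape, good.shape)
```

This is why the post's preprocessing keeps `index=data_num.index` when rebuilding DataFrames after `fit_transform`: it preserves the surviving labels so the final merge lines up row for row.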

2. Logistic Regression in Practice

  • After standardizing the data, the accuracy improved a lot

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.preprocessing import StandardScaler
    
    '''------------------------------------------
    1 Read the data
    '''
    data = pd.read_csv('data_processed.csv')   # written above with to_csv's default UTF-8 encoding, so no encoding argument here
    
    '''-------------------------------------------
    1.1 Split into training and validation sets
    '''
    train, test = train_test_split(data, test_size=0.1, random_state=666)
    
    '''----------------------------------------
    1.2 Extract the labels
    '''
    y_train = train.status
    train.drop(['status'], axis=1, inplace=True)
    
    y_test = test.status
    test.drop(['status'], axis=1, inplace=True)
    
    '''
    1.3 Standardize the data
    '''
    scaler = StandardScaler()
    train = pd.DataFrame(scaler.fit_transform(train), index=train.index, columns=train.columns)
    test = pd.DataFrame(scaler.transform(test), index=test.index, columns=test.columns)   # transform only: the scaler must be fit on the training split alone
    
    '''----------------------------------------
    1.4 Train the model
    '''
    model = LogisticRegression(C=1, solver='liblinear', dual=True)   # dual=True is only valid with the liblinear solver
    model.fit(train, y_train)
    
    '''--------------------------------------
    1.5 Predict on the validation set
    '''
    y_test_pre = model.predict(test)
    
    '''----------------------------------------
    1.6 Score
    '''
    score_vail = f1_score(y_test, y_test_pre, average='macro')
    
    print('-------------------------------')
    print('Validation score: {}'.format(score_vail))
    
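The observation above, that standardization helps a lot, can be reproduced on synthetic data whose feature scales differ wildly. A sketch using `make_classification` rather than the actual loan data, so the exact numbers are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=666)
X = X * np.logspace(0, 6, 20)      # stretch feature scales across six orders of magnitude
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=666)

# Same model with and without standardization; the scaler is fit on the
# training split only, then applied to the held-out split.
raw = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
scaler = StandardScaler().fit(X_tr)
std = LogisticRegression(max_iter=200).fit(scaler.transform(X_tr), y_tr)

f1_raw = f1_score(y_te, raw.predict(X_te), average='macro')
f1_std = f1_score(y_te, std.predict(scaler.transform(X_te)), average='macro')
print('raw: %.3f  standardized: %.3f' % (f1_raw, f1_std))
```

With badly mixed scales, the unscaled fit tends to converge poorly (the solver will typically emit a ConvergenceWarning), which is one mechanism behind the score gap reported in the post.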

Result:
(screenshot of the printed validation score)
