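Logistic Regression: Classic Examples

The complete collection of machine learning algorithms is available at fenghaootong-github

House Price Prediction

Dataset description

The data has 81 features in total; some of the fields are: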

SalePrice - the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property

**Import the required modules**

import numpy as np
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
import math as mat

from scipy import stats
from scipy.stats import norm
from sklearn import preprocessing

import statsmodels.api as sm
from patsy import dmatrices

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

import sklearn.linear_model as LinReg
import sklearn.metrics as metrics

**Load the data**

#loading the data 
data_train = pd.read_csv('../DATA/SalePrice_train.csv')
data_test = pd.read_csv('../DATA/SalePrice_test.csv')

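The data has 81 features in total. For ease of explanation, only 7 features are selected:
OverallQual
GrLivArea
GarageCars
TotalBsmtSF
1stFlrSF
FullBath
YearBuilt
These were chosen because they are strongly correlated with the sale price of the house. Note that the code in the preprocessing step uses six of the seven: 1stFlrSF is not included in `vars`.

For details on how the features are selected, see Data Cleaning.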
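
The selection rule mentioned in the code comments (keep features whose correlation with SalePrice exceeds 0.5) can be sketched as follows. This is only an illustration on synthetic data: the values, the `Unrelated` column, and the names `corr_with_target`, `threshold`, and `selected` are assumptions, not taken from the original dataset or the linked post.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (all values are made up).
rng = np.random.default_rng(0)
n = 500
quality = rng.integers(1, 11, size=n)      # integer scores from 1 to 10
area = rng.normal(1500, 400, size=n)       # living area in square feet
unrelated = rng.normal(0, 1, size=n)       # a column with no link to price
price = 20000 * quality + 120 * area + rng.normal(0, 10000, size=n)

df = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Unrelated": unrelated,
    "SalePrice": price,
})

# Pearson correlation of every column with the target.
corr_with_target = df.corr()["SalePrice"].drop("SalePrice")

# Keep the columns whose absolute correlation exceeds the threshold.
threshold = 0.5
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

print(corr_with_target.sort_values(ascending=False))
print(selected)
```

A fixed threshold is a simple filter; it looks at each feature in isolation and ignores redundancy between features.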

**Data preprocessing**

data_train.shape
(1460, 81)  
vars = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath','YearBuilt']
Y = data_train[['SalePrice']] #dim (1460, 1)
ID_train = data_train[['Id']] #dim (1460, 1)
ID_test = data_test[['Id']]   #dim (1459, 1)
# keep only the features whose correlation with SalePrice is above 0.5
X_matrix = data_train[vars]
X_matrix.shape  #dim (1460,6)

X_test = data_test[vars]  
X_test.shape   #dim (1459,6)
(1459, 6)

**Check for missing data**

#check for missing data:
#missing data
total = X_matrix.isnull().sum().sort_values(ascending=False)
percent = (X_matrix.isnull().sum()/X_matrix.count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
#no missing data in this training set
| | Total | Percent |
|---|---|---|
| YearBuilt | 0 | 0.0 |
| FullBath | 0 | 0.0 |
| TotalBsmtSF | 0 | 0.0 |
| GarageCars | 0 | 0.0 |
| GrLivArea | 0 | 0.0 |
| OverallQual | 0 | 0.0 |
total = X_test.isnull().sum().sort_values(ascending=False)
percent = (X_test.isnull().sum()/X_test.count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
#missing data in this test set
| | Total | Percent |
|---|---|---|
| TotalBsmtSF | 1 | 0.000686 |
| GarageCars | 1 | 0.000686 |
| YearBuilt | 0 | 0.000000 |
| FullBath | 0 | 0.000000 |
| GrLivArea | 0 | 0.000000 |
| OverallQual | 0 | 0.000000 |
#help(mat.ceil)  # ceil rounds up to the nearest integer

**Replace missing values with the mean**

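# replace missing values with the column mean
X_test['TotalBsmtSF'] = X_test['TotalBsmtSF'].fillna(X_test['TotalBsmtSF'].mean())
# GarageCars is a count, so the mean is rounded up to keep the filled value an integer
X_test['GarageCars'] = X_test['GarageCars'].fillna(mat.ceil(X_test['GarageCars'].mean()))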

total = X_test.isnull().sum().sort_values(ascending=False)
percent = (X_test.isnull().sum()/X_test.count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
| | Total | Percent |
|---|---|---|
| YearBuilt | 0 | 0.0 |
| FullBath | 0 | 0.0 |
| TotalBsmtSF | 0 | 0.0 |
| GarageCars | 0 | 0.0 |
| GrLivArea | 0 | 0.0 |
| OverallQual | 0 | 0.0 |
X_test.shape
(1459, 6)
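
The mean-filling step above can be tried on a tiny frame where the result is easy to check by hand. The values below are made up for illustration, and `filled` is a hypothetical name; only the column names come from the original.

```python
import math
import numpy as np
import pandas as pd

# Tiny made-up frame with one missing value per column.
df = pd.DataFrame({
    "TotalBsmtSF": [800.0, 1000.0, np.nan, 1200.0],
    "GarageCars": [1.0, 2.0, 2.0, np.nan],
})

filled = df.copy()  # work on a copy so the original frame is untouched

# Continuous column: fill with the mean of the observed values (NaN is skipped by mean()).
filled["TotalBsmtSF"] = filled["TotalBsmtSF"].fillna(filled["TotalBsmtSF"].mean())

# Count-like column: round the mean up so the filled value stays a whole number.
filled["GarageCars"] = filled["GarageCars"].fillna(math.ceil(filled["GarageCars"].mean()))

print(filled)
```

Here the observed TotalBsmtSF values average to 1000, and the GarageCars mean of 5/3 is rounded up to 2.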
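  • Next come feature scaling and mean normalization with the preprocessing module. The module also provides the utility class StandardScaler, which computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to the test set. The code here uses MaxAbsScaler instead, which divides each feature by its maximum absolute value.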
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_matrix)
print(X_train_maxabs)
[[ 0.7         0.30308401  0.5         0.1400982   0.66666667  0.99651741]       
 [ 0.6         0.22367955  0.5         0.20654664  0.66666667  0.98308458]     
 [ 0.7         0.31655441  0.5         0.15057283  0.66666667  0.99552239] 
 ...,    
 [ 0.7         0.41474654  0.25        0.18854337  0.66666667  0.96567164]
 [ 0.5         0.191067    0.25        0.17643208  0.33333333  0.97014925]  
 [ 0.5         0.22261609  0.25        0.20556465  0.33333333  0.97761194]] 
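# note: fit_transform refits the scaler on the test set, so the test features are scaled
# by the test maxima; max_abs_scaler.transform(X_test) would reuse the training maxima instead
X_test_maxabs = max_abs_scaler.fit_transform(X_test)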
print(X_test_maxabs)
[[ 0.5         0.17585868  0.2         0.17311089  0.25        0.97562189]   
 [ 0.6         0.26084396  0.2         0.26084396  0.25        0.97412935]    
 [ 0.5         0.31972522  0.4         0.18213935  0.5         0.99353234] 
 ..., 
 [ 0.5         0.24023553  0.4         0.24023553  0.25        0.97512438]   
 [ 0.5         0.19038273  0.          0.17899902  0.25        0.99104478]
 [ 0.7         0.39254171  0.6         0.19548577  0.5         0.99154229]]
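
The idea of computing scaling statistics on the training set and reapplying them to the test set can be sketched with MaxAbsScaler on two small made-up arrays. The array values and variable names below are assumptions chosen so the result can be verified by hand.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up data: two training rows and one test row with two features each.
X_train = np.array([[10.0, 2.0],
                    [5.0, 4.0]])
X_test = np.array([[20.0, 1.0]])

scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns max |x| per column: [10, 4]
X_test_scaled = scaler.transform(X_test)        # reuses the training maxima, no refit

print(scaler.max_abs_)   # per-column maxima learned from the training data
print(X_train_scaled)
print(X_test_scaled)
```

Because the test row is divided by the training maxima, its first feature comes out as 2.0, which is outside [-1, 1]. That is expected: both sets now share one scale, which is what the fitted model's coefficients assume.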

**Model training**

lr=LinReg.LinearRegression().fit(X_train_maxabs,Y)

**Model prediction**

Y_pred_train = lr.predict(X_train_maxabs)
print("Lin Reg performance evaluation on Y_pred_train")
print("R-squared =", metrics.r2_score(Y, Y_pred_train))  
Lin Reg performance evaluation on Y_pred_train   
R-squared = 0.768647335422 
Y_pred_test = lr.predict(X_test_maxabs)  
print("Lin Reg performance evaluation on X_test")
#print("R-squared =", metrics.r2_score(Y, Y_pred_test))
print("Coefficients =", lr.coef_)
Lin Reg performance evaluation on X_test 
Coefficients = [[ 205199.68775757  305095.8264889    58585.26325362  178302.68126933
   -16511.92112734  676458.9666186 ]] 
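
The R-squared above is measured on the same rows the model was fitted on, so it says little about unseen data. One common option, sketched here on synthetic data rather than the house price files, is to hold out part of the labeled rows for validation. The data, the true coefficients, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with known coefficients (3, -2, 1) and small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, size=300)

# Hold out 25% of the labeled rows; the model never sees them during fitting.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
r2_train = r2_score(y_tr, model.predict(X_tr))
r2_val = r2_score(y_val, model.predict(X_val))

print("train R^2:", r2_train)
print("validation R^2:", r2_val)
print("coefficients:", model.coef_)
```

A validation score far below the training score would suggest overfitting; on this easy synthetic problem the two should be close.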

Logistic Regression

**Import modules**

# import modules
import pandas as pd
import numpy as np

**Data preprocessing**

# create the list of column names
column_names = ['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
# read the dataset with pandas.read_csv (here from a local copy of the file)
data = pd.read_csv('DATA/data.csv',names=column_names)
# replace '?' with the standard missing-value marker
data = data.replace(to_replace='?',value = np.nan)
# drop rows with missing values (a row is dropped if any column is missing)
data = data.dropna(how='any')
# check the number of rows and columns of data
data.shape
(683, 11)
data.head(10)
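| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |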
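
One detail of the cleaning step worth checking: when a column contains '?' entries, pandas typically reads it as text, and it stays text after the '?' rows are dropped. The sketch below, on a made-up four-row frame, shows an explicit conversion with `pd.to_numeric`. Whether this matters for the real file depends on how it was read, so treat it as an optional extra step, not as part of the original code.

```python
import numpy as np
import pandas as pd

# Made-up frame: the '?' entry forces the column to hold strings.
raw = pd.DataFrame({
    "Bare Nuclei": ["1", "10", "?", "4"],
    "Class": [2, 2, 4, 4],
})

# Same cleaning as in the original: mark '?' as missing, then drop those rows.
clean = raw.replace("?", np.nan).dropna(how="any")

# The remaining values are still strings; convert them to numbers explicitly.
clean["Bare Nuclei"] = pd.to_numeric(clean["Bare Nuclei"])

print(clean.dtypes)
print(clean)
```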

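Since the original data does not come with a separate test set for evaluating the model, the labeled data is split here: 25% is used as the test set and the rest as the training set.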

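# split the dataset with train_test_split from sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# randomly sample 25% of the data for testing; the remaining 75% forms the training set
X_train,X_test,y_train,y_test = train_test_split(data[column_names[1:10]],data[column_names[10]],test_size = 0.25,random_state = 33)
# check the number of training samples and their class distribution
y_train.value_counts()   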
2    344
4    168
Name: Class, dtype: int64
# check the number of test samples and their class distribution
y_test.value_counts()
2    100
4     71
Name: Class, dtype: int64
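
In the split above, class 4 makes up about 33% of the training set (168 of 512) but about 42% of the test set (71 of 171). If matching class proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up labels in roughly the overall 65/35 ratio; it is a possible variation, not what the original code does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 65 benign (2) and 35 malignant (4), with a dummy feature column.
y = np.array([2] * 65 + [4] * 35)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks for the same class proportions in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=33, stratify=y
)

share_train = np.mean(y_tr == 4)  # fraction of malignant labels in the training part
share_test = np.mean(y_te == 4)   # fraction of malignant labels in the test part
print(share_train, share_test)
```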

Build the model and make predictions

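# import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler
# import LogisticRegression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression
# import SGDClassifier (stochastic gradient descent classifier) from sklearn.linear_model; it is not used below
from sklearn.linear_model import SGDClassifier 
# standardize the features: fit on the training set, then apply the same transform to the test set
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
lr = LogisticRegression()
# train the logistic regression model parameters with fit
lr.fit(X_train,y_train)
# predict the labels of the test set
lr_y_predict = lr.predict(X_test)
# show the predicted labels
lr_y_predict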
array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 4, 2, 4, 4, 4, 4, 4, 2, 2, 4, 4,
       2, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 2, 4, 2, 2,
       4, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2,
       2, 4, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 4, 4,
       2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 4, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 
       2, 2, 4, 2, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 2, 4, 4, 2, 4, 4, 2, 2, 2,
       2, 4, 2, 4, 2, 4, 2, 2, 2, 2, 2, 4, 4, 2, 4, 4, 2, 4, 2, 2, 2, 2, 4,
       4, 4, 2, 4, 2, 2, 4, 2, 4, 4], dtype=int64) 
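
For a two-class problem, the labels returned by `predict` come from a linear score passed through the sigmoid function. The sketch below recomputes the predicted probabilities by hand from `coef_` and `intercept_` and compares them with `predict_proba`. It uses synthetic data with the same label values (2 and 4); the data and the names `z`, `p_manual`, and `labels_manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with labels 2 and 4, mimicking the class coding above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=200) > 0, 4, 2)

clf = LogisticRegression().fit(X, y)

# Linear score z = w . x + b, then the sigmoid maps it to a probability.
z = X @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns one column per class, ordered as in clf.classes_ ([2, 4]).
p_sklearn = clf.predict_proba(X)[:, 1]

# A positive score (probability above 0.5) maps to the second class.
labels_manual = np.where(z > 0, clf.classes_[1], clf.classes_[0])

print(np.allclose(p_manual, p_sklearn))
print((labels_manual == clf.predict(X)).mean())
```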

**Performance analysis of the linear classifier on the benign/malignant tumor prediction task**

# import classification_report from sklearn.metrics
from sklearn.metrics import classification_report
 
# use the logistic regression model's built-in score function to get its accuracy on the test set
print('Accuracy of LR Classifier:',lr.score(X_test,y_test))
# use classification_report to get the other three metrics (recall, precision, F1 score)
print(classification_report(y_test,lr_y_predict,target_names=['Benign','Malignant']))
Accuracy of LR Classifier: 0.988304093567
             precision    recall  f1-score   support

     Benign       0.99      0.99      0.99       100
  Malignant       0.99      0.99      0.99        71

avg / total       0.99      0.99      0.99       171
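
The precision, recall, and F1 columns in the report can be recomputed by hand from a confusion matrix. The sketch below does this for ten made-up labels, treating class 4 (malignant) as the positive class, and checks the results against scikit-learn's own metric functions. The label arrays are illustrative and are not the predictions from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up ground truth and predictions (2 = benign, 4 = malignant).
y_true = np.array([2, 2, 2, 2, 2, 2, 4, 4, 4, 4])
y_pred = np.array([2, 2, 2, 2, 4, 4, 4, 4, 4, 2])

# With labels=[2, 4], the flattened matrix is ordered TN, FP, FN, TP for class 4.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[2, 4]).ravel()

precision = tp / (tp + fp)   # of the predicted malignant cases, how many are correct
recall = tp / (tp + fn)      # of the true malignant cases, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tn, fp, fn, tp)
print(precision, recall, f1)
print(precision_score(y_true, y_pred, pos_label=4),
      recall_score(y_true, y_pred, pos_label=4),
      f1_score(y_true, y_pred, pos_label=4))
```

In a medical screening setting, recall on the malignant class is often the number to watch, since a false negative is a missed tumor.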

