Predicting Male Body Fat Percentage with Linear Regression


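This project covers:
1. Exploratory data analysis
2. Predicting body fat percentage with linear regression

Links to the dataset and code are at the end of the post.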

1. Data Description

[Figure: data description]

2. Prerequisites

[Figure: prerequisite knowledge]

3. Data Import

import pandas as pd
df = pd.read_csv('/home/mw/input/bodyfat8096/bodyfat.csv')
df.head()

[Figure: output of df.head()]

df.isnull().sum()

Density 0
BodyFat 0
Age 0
Weight 0
Height 0
Neck 0
Chest 0
Abdomen 0
Hip 0
Thigh 0
Knee 0
Ankle 0
Biceps 0
Forearm 0
Wrist 0
dtype: int64

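4. EDA

4.1 Unit conversion

Weight is recorded in pounds and height in inches, while the data description gives the remaining body measurements in cm, so we convert the units first.
1 lb = 0.45359237 kg
1 inch = 2.54 cm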

df['Weight'] = df['Weight']*0.45359237
df['Height'] = df['Height']*2.54
df.head()

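4.2 Age distribution

The oldest subject in the data is 81 and the youngest is 22. The age histogram shows that most subjects are under 60. According to the reference table in the figure below, the standard body fat percentage for men in this age range is 11~22%. Next we look at the actual body fat percentages.

print('Maximum age: {}; minimum age: {}.'.format(max(df['Age']), min(df['Age'])))

Maximum age: 81; minimum age: 22.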

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
fig, ax = plt.subplots(figsize=(6, 3), dpi=120)

plt.hist(x=df.Age,           # data to plot
         bins=15,            # number of bins in the histogram
         color='skyblue',    # fill color of the bars
         edgecolor='black'   # edge color of the bars
         )
# Add x-axis and y-axis labels
plt.xlabel('Age')
plt.ylabel('Frequency')
# Add a title
plt.title('Age distribution')

[Figure: age distribution]

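4.3 Body fat distribution

The histogram shows somewhat more subjects at the high end of body fat, but the data also contain an anomaly: the minimum body fat percentage is 0. We will find the row in question and remove it later.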

fig, ax = plt.subplots(figsize=(6, 3), dpi=120)

plt.hist(x=df.BodyFat,       # data to plot
         bins=15,            # number of bins in the histogram
         color='skyblue',    # fill color of the bars
         edgecolor='black'   # edge color of the bars
         )
# Add x-axis and y-axis labels
plt.xlabel('Body fat (%)')
plt.ylabel('Frequency')
# Add a title
plt.title('Body fat distribution')

[Figure: body fat distribution]
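The zero-valued record mentioned in this section can be located with a boolean filter. The snippet below is only a minimal sketch on a hypothetical four-row `sample` frame (the values are made up for illustration and are not taken from the dataset); the names `sample`, `anomalies` and `cleaned` are assumptions of this example.

```python
import pandas as pd

# Hypothetical miniature sample standing in for the real dataset
sample = pd.DataFrame({
    "BodyFat": [12.3, 0.0, 25.1, 47.5],
    "Age": [23, 40, 35, 51],
})

# A body fat percentage that is not strictly positive is implausible
anomalies = sample[sample["BodyFat"] <= 0]
print(anomalies)

# Keep only the plausible rows and renumber the index
cleaned = sample[sample["BodyFat"] > 0].reset_index(drop=True)
print(len(sample), "->", len(cleaned))
```

The same filter, for example `df = df[df['BodyFat'] > 0]`, could be applied to `df` before modeling.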

4.4 Pair plots of body fat against the other variables

Apart from density and height, the other variables show broadly similar trends against body fat.

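# sns.pairplot creates its own figure, so no separate plt.figure() call is needed
for col in df.columns:
    if col == 'BodyFat':
        continue  # skip plotting the target against itself
    # Scatter plot with a fitted regression line: BodyFat (y) against one variable (x)
    sns.pairplot(df, x_vars=[col], y_vars=['BodyFat'],
                 height=4, aspect=2, kind="reg")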

[Figures: regression plots of BodyFat against each of the other variables]

4.5 Correlation heatmap of all variables

[Figure: correlation heatmap of all variables]

5. Modeling

5.1 Correlation analysis

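# With this many features it is hard to pick by eye, so this helper returns
# every pair of columns whose absolute correlation coefficient exceeds `rate`
def corr_del(df, rate):
    df_corr = df.corr()
    corr_nums = df_corr.shape[0]
    corr_cols = df_corr.columns
    SimCol = []    # (column_i, column_j) pairs above the threshold
    corr_val = []  # the corresponding correlation coefficients
    for i in range(corr_nums):
        # Upper triangle only, so each pair is reported once
        for j in range(i + 1, corr_nums):
            if abs(df_corr.iloc[i, j]) > rate:
                SimCol.append((corr_cols[i], corr_cols[j]))
                corr_val.append(df_corr.iloc[i, j])
    return SimCol, corr_val

SimCol, corr_val = corr_del(df, 0.6)
dic_corr = dict(zip(SimCol, corr_val))
dic_corr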

{('Density', 'BodyFat'): -0.9880100628473627,
 ('Density', 'Chest'): -0.6733369791259974,
 ('Density', 'Abdomen'): -0.7947118404395511,
 ('Density', 'Hip'): -0.600660354089649,
 ('BodyFat', 'Weight'): 0.6050478447082646,
 ('BodyFat', 'Chest'): 0.6956065964708372,
 ('BodyFat', 'Abdomen'): 0.8097250131267841,
 ('BodyFat', 'Hip'): 0.6179803940078904,
 ('Weight', 'Neck'): 0.8284683139329283,
 ('Weight', 'Chest'): 0.8923815280868261,
 ('Weight', 'Abdomen'): 0.8859966850412094,
 ('Weight', 'Hip'): 0.9398562247117858,
 ('Weight', 'Thigh'): 0.8662642452804185,
 ('Weight', 'Knee'): 0.8505789330735126,
 ('Weight', 'Ankle'): 0.608315072705449,
 ('Weight', 'Biceps'): 0.7983816192935411,
 ('Weight', 'Forearm'): 0.6240866005036086,
 ('Weight', 'Wrist'): 0.725655446125877,
 ('Neck', 'Chest'): 0.782091547883973,
 ('Neck', 'Abdomen'): 0.7506565937224198,
 ('Neck', 'Hip'): 0.731289751695859,
 ('Neck', 'Thigh'): 0.6912527144821207,
 ('Neck', 'Knee'): 0.6677702222416465,
 ('Neck', 'Biceps'): 0.7283751397304583,
 ('Neck', 'Forearm'): 0.618470595979317,
 ('Neck', 'Wrist'): 0.741548026075172,
 ('Chest', 'Abdomen'): 0.9142550084986067,
 ('Chest', 'Hip'): 0.8261023400678932,
 ('Chest', 'Thigh'): 0.723367490537425,
 ('Chest', 'Knee'): 0.7136089462374152,
 ('Chest', 'Biceps'): 0.7252536409866253,
 ('Chest', 'Wrist'): 0.6542751000589216,
 ('Abdomen', 'Hip'): 0.8717828189481807,
 ('Abdomen', 'Thigh'): 0.7619183177868154,
 ('Abdomen', 'Knee'): 0.73232810062617,
 ('Abdomen', 'Biceps'): 0.6813947740808934,
 ('Abdomen', 'Wrist'): 0.613794264693361,
 ('Hip', 'Thigh'): 0.8944771771358432,
 ('Hip', 'Knee'): 0.8203181202814968,
 ('Hip', 'Biceps'): 0.7364316136795688,
 ('Hip', 'Wrist'): 0.6243634101702451,
 ('Thigh', 'Knee'): 0.7952254144765618,
 ('Thigh', 'Biceps'): 0.7591002761702224,
 ('Knee', 'Ankle'): 0.6061087576231976,
 ('Knee', 'Biceps'): 0.6750434364944304,
 ('Knee', 'Wrist'): 0.659266466451139,
 ('Biceps', 'Forearm'): 0.6746310817723552,
 ('Biceps', 'Wrist'): 0.628100342036711}

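5.2 Linear regression models

We start with the following models:
1. According to the table above, Density is very strongly correlated with BodyFat (about -0.99), so a linear relationship almost certainly exists. Existing formulas also compute body fat from density alone, so we take Density by itself and fit a linear model.
2. With Density excluded, we proceed in steps, first using the features whose correlation coefficient with BodyFat exceeds 0.6 as inputs.
3. If model 2 performs poorly, we derive new indicators from the existing features, such as BMI, and use those.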
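The selection rule in step 2 can be written as a small helper. This is a minimal sketch, not the author's code: the function name `select_by_target_corr`, its parameters and the toy data are assumptions made for illustration.

```python
import pandas as pd

def select_by_target_corr(data, target, threshold, exclude=()):
    """Return columns whose |Pearson correlation| with `target` exceeds `threshold`."""
    # Correlation of every column with the target, minus the target itself
    # and any columns the caller wants to leave out
    corr = data.corr()[target].drop([target, *exclude])
    return corr[corr.abs() > threshold].index.tolist()

# Hypothetical toy data: 'a' tracks the target closely, 'b' does not
toy = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a":      [1.1, 1.9, 3.2, 3.9, 5.1],
    "b":      [2.0, -1.0, 0.5, -0.5, 1.0],
})
selected = select_by_target_corr(toy, "target", 0.6)
print(selected)
```

Called as `select_by_target_corr(df, 'BodyFat', 0.6, exclude=['Density'])`, it should return Weight, Chest, Abdomen and Hip, judging by the correlation table above.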

5.2.1 Linear model of Density and BodyFat

from sklearn.linear_model import LinearRegression
X_data = pd.DataFrame(df['Density'])
y_data = df['BodyFat']
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
X_train, x_test, y_train, y_test = train_test_split(X_data, y_data, test_size = 0.2,random_state=2022)

lr_model = LinearRegression()
lr_model.fit(X_train,y_train)
y_predictions = lr_model.predict(x_test)
print(r2_score(y_test,y_predictions))
print(lr_model.coef_)
print(lr_model.intercept_)

0.9754079369106297
[-434.10203913]
477.38901572061883

fig, ax = plt.subplots(figsize=(4, 3), dpi=150)

plt.scatter(x_test, y_test, color='skyblue', label='Actual')
# Set the axis labels
plt.ylabel('Body fat (%)')
plt.xlabel('Body density')

# Draw the fitted line
plt.plot(x_test, y_predictions, color='black', label='Prediction')
# Add a legend
plt.legend(loc='best')

[Figure: fitted line and actual body fat values against body density on the test set]

# Formula 1: body fat percentage computed from body density
def dens2bf1(x):
    y = 457/x-414
    return y
# Formula 2: body fat percentage computed from body density
def dens2bf2(x):
    y = 495/x-450
    return y
p_dens2bf1 = pd.DataFrame(dens2bf1(x_test))
p_dens2bf2 = pd.DataFrame(dens2bf2(x_test))
df_pre = pd.DataFrame()
df_pre['p_dens2bf1'] = p_dens2bf1['Density']
df_pre['p_dens2bf2'] = p_dens2bf2['Density']
df_pre['y_predictions'] = y_predictions
df_pre['Density'] = x_test['Density']
# Sort by the model prediction so that the curves are drawn in order
df_pre.sort_values(by="y_predictions", inplace=True, ascending=True)
df_pre.reset_index(drop=True, inplace=True)
df_pre.head()
fig, ax = plt.subplots(figsize=(4, 3), dpi=150)

plt.ylabel('Body fat (%)')
plt.xlabel('Body density')

# Draw the curves
plt.plot(df_pre['Density'], df_pre['p_dens2bf1'], color='blue', label='Formula 1')

plt.plot(df_pre['Density'], df_pre['p_dens2bf2'], color='green', label='Formula 2')

plt.plot(df_pre['Density'], df_pre['y_predictions'], color='red', label='Our prediction')

plt.scatter(x_test, y_test, color='skyblue', label='Actual')

# Legend
plt.legend(loc='best')

[Figure: formula 1, formula 2 and our prediction plotted against the actual values]
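A rough way to see why a straight line in Density fits so well is to linearize one of the formulas. `dens2bf2` has the form commonly known as Siri's equation, 495/D - 450; over a narrow range of densities a curve of the form a/D - b is close to its tangent line. The sketch below is only a plausibility check: the expansion point `d0 = 1.055` and the density range 1.00 to 1.11 are assumptions, not values taken from the dataset.

```python
def siri(density):
    """Body fat percentage from body density using 495/D - 450."""
    return 495.0 / density - 450.0

# Assumed typical density used as the expansion point (not from the dataset)
d0 = 1.055

# First-order (tangent-line) approximation: f(D) ~ f(d0) + f'(d0) * (D - d0)
slope = -495.0 / d0 ** 2           # derivative of 495/D at d0
intercept = siri(d0) - slope * d0  # value of the tangent line at D = 0

print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")

# Largest gap between the curve and its tangent over the assumed range
grid = [1.00 + 0.001 * i for i in range(111)]  # 1.000 to 1.110
max_gap = max(abs(siri(d) - (slope * d + intercept)) for d in grid)
print(f"max gap on [1.00, 1.11]: {max_gap:.2f} percentage points")
```

Under these assumptions the tangent has a slope of about -445 and an intercept of about 488, the same order as the fitted coefficient (-434.10) and intercept (477.39) printed above.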

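5.2.2 Second model: features with correlation coefficient > 0.6

The R^2 on the test set is only about 0.68, which is not very good, so this model is discarded.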

X2_data = df[['Weight','Chest','Abdomen','Hip']]
y2_data = df['BodyFat']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2_data, y2_data, test_size = 0.2,random_state=2022)

lr2_model = LinearRegression()
lr2_model.fit(X2_train,y2_train)
y2_predictions = lr2_model.predict(X2_test)
print(r2_score(y2_test,y2_predictions))
print(lr2_model.coef_)
print(lr2_model.intercept_)

0.6835334134948379
[-0.34316868  0.09032488  0.9425218   0.02511257]
-51.98931074554292

import numpy as np
fig, ax = plt.subplots(figsize=(4, 3), dpi=150)

plt.ylabel('Body fat (%)')

# Draw the curves, using the position in the test set as the x-axis
x_ax = np.arange(len(y2_predictions))

plt.plot(x_ax, y2_predictions, color='red', label='Prediction')
plt.plot(x_ax, y2_test, color='skyblue', label='Actual')

# Legend
plt.legend(loc='best')

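[Figure: predicted and actual body fat on the test set for the second model]

5.2.3 Model with derived indicators

We look up some commonly used indicators and build new ones from the existing data:
Body mass index (BMI) = weight (kg) ÷ height (m)^2
ACratio: abdomen-to-chest ratio
HTratio: hip-to-thigh ratio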
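As a quick sanity check on the units in the BMI definition, here is a small worked example. The person (154 lb, 70 in) is hypothetical and the helper name `bmi` is an assumption of this sketch; the conversion factors are the ones from section 4.1.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100.0  # the formula needs metres, not centimetres
    return weight_kg / height_m ** 2

# Hypothetical person: 154 lb and 70 in, converted as in section 4.1
weight_kg = 154 * 0.45359237
height_cm = 70 * 2.54
print(round(weight_kg, 2), round(height_cm, 1), round(bmi(weight_kg, height_cm), 2))
```

Forgetting the division by 100 would shrink the result by a factor of 10,000, which is an easy mistake to spot when the values are checked by hand like this.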

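# BMI needs height in metres; Height was converted to cm in section 4.1
df['BMI'] = df['Weight'] / (df['Height'] / 100) ** 2
df['ACratio'] = df['Abdomen'] / df['Chest']   # abdomen-to-chest ratio
df['HTratio'] = df['Hip'] / df['Thigh']       # hip-to-thigh ratio
df.head()
plt.figure(figsize=(24, 16))
ax = sns.heatmap(df.corr(), square=True, annot=True, fmt='.2f')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
# Widen the y-limits by half a cell: a common workaround for the top and bottom
# heatmap rows being cut off in matplotlib 3.1.1 (likely unnecessary on other versions)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)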

[Figure: correlation heatmap including BMI, ACratio and HTratio]

SimCol,corr_val = corr_del(df,0.6)
dic_corr = dict(zip(SimCol, corr_val))
dic_corr

{('Density', 'BodyFat'): -0.9880100628473627,
 ('Density', 'Chest'): -0.6733369791259974,
 ('Density', 'Abdomen'): -0.7947118404395511,
 ('Density', 'Hip'): -0.600660354089649,
 ('Density', 'ACratio'): -0.6949473525887486,
 ('BodyFat', 'Weight'): 0.6050478447082646,
 ('BodyFat', 'Chest'): 0.6956065964708372,
 ('BodyFat', 'Abdomen'): 0.8097250131267841,
 ('BodyFat', 'Hip'): 0.6179803940078904,
 ('BodyFat', 'ACratio'): 0.6931515157069884,
 ('Weight', 'Neck'): 0.8284683139329283,
 ('Weight', 'Chest'): 0.8923815280868261,
 ('Weight', 'Abdomen'): 0.8859966850412094,
 ('Weight', 'Hip'): 0.9398562247117858,
 ('Weight', 'Thigh'): 0.8662642452804185,
 ('Weight', 'Knee'): 0.8505789330735126,
 ('Weight', 'Ankle'): 0.608315072705449,
 ('Weight', 'Biceps'): 0.7983816192935411,
 ('Weight', 'Forearm'): 0.6240866005036086,
 ('Weight', 'Wrist'): 0.725655446125877,
 ('Height', 'BMI'): -0.6412508451276566,
 ('Neck', 'Chest'): 0.782091547883973,
 ('Neck', 'Abdomen'): 0.7506565937224198,
 ('Neck', 'Hip'): 0.731289751695859,
 ('Neck', 'Thigh'): 0.6912527144821207,
 ('Neck', 'Knee'): 0.6677702222416465,
 ('Neck', 'Biceps'): 0.7283751397304583,
 ('Neck', 'Forearm'): 0.618470595979317,
 ('Neck', 'Wrist'): 0.741548026075172,
 ('Chest', 'Abdomen'): 0.9142550084986067,
 ('Chest', 'Hip'): 0.8261023400678932,
 ('Chest', 'Thigh'): 0.723367490537425,
 ('Chest', 'Knee'): 0.7136089462374152,
 ('Chest', 'Biceps'): 0.7252536409866253,
 ('Chest', 'Wrist'): 0.6542751000589216,
 ('Abdomen', 'Hip'): 0.8717828189481807,
 ('Abdomen', 'Thigh'): 0.7619183177868154,
 ('Abdomen', 'Knee'): 0.73232810062617,
 ('Abdomen', 'Biceps'): 0.6813947740808934,
 ('Abdomen', 'Wrist'): 0.613794264693361,
 ('Abdomen', 'ACratio'): 0.7516469696482908,
 ('Hip', 'Thigh'): 0.8944771771358432,
 ('Hip', 'Knee'): 0.8203181202814968,
 ('Hip', 'Biceps'): 0.7364316136795688,
 ('Hip', 'Wrist'): 0.6243634101702451,
 ('Thigh', 'Knee'): 0.7952254144765618,
 ('Thigh', 'Biceps'): 0.7591002761702224,
 ('Thigh', 'HTratio'): -0.6048325482252465,
 ('Knee', 'Ankle'): 0.6061087576231976,
 ('Knee', 'Biceps'): 0.6750434364944304,
 ('Knee', 'Wrist'): 0.659266466451139,
 ('Biceps', 'Forearm'): 0.6746310817723552,
 ('Biceps', 'Wrist'): 0.628100342036711}

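The result does not quite match what the practical formulas would suggest: BMI is not strongly correlated with body fat (the pair does not pass the 0.6 threshold). ACratio, the abdomen-to-chest ratio, does show some correlation (about 0.69), but that is lower than the correlation of Abdomen alone (about 0.81) before combining. The combinations did not produce any particularly useful features, possibly because the dataset is small.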

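The complete code is available at the link below; after forking it you can run it online or download it:
Click to fork and download the code

The dataset has also been uploaded:
Dataset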
