Reading and Analyzing Insurance Data with pandas

zjLOVEcyj

Category: Machine Learning. Tags: algorithms, python, machine learning, artificial intelligence, data analysis

Article link: https://blog.csdn.net/cyj5201314/article/details/105470611


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

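# Load the data
data = pd.read_csv('./data/insurance.csv')

# describe() gives a quick statistical summary of the numeric columns
print(data.describe())

# Sampling should be roughly uniform: count how many samples each age value has
data_count = data['age'].value_counts()
print(data_count)

# Bar chart of the ten most frequent ages (value_counts sorts by count, descending)
data_count[:10].plot(kind='bar')
# To save the chart locally, call savefig before show(); after show() the figure may be blank
# plt.savefig('./temp')
plt.show()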

Output:
Summary statistics:
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
Sample counts per age:
18 69
19 68
51 29
45 29
46 29
47 29
48 29
50 29
52 29
20 29
26 28
54 28
53 28
25 28
24 28
49 28
23 28
22 28
21 28
27 28
28 28
31 27
29 27
30 27
41 27
43 27
44 27
40 27
42 27
57 26
34 26
33 26
32 26
56 26
55 26
59 25
58 25
39 25
38 25
35 25
36 25
37 25
63 23
60 23
61 23
62 23
64 22
Name: age, dtype: int64
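Bar chart of the age distribution (the ten most frequent ages):
[Figure: bar chart of sample counts for the ten most frequent ages]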
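
Because value_counts() sorts by frequency, the ten bars are the ten most common ages rather than the ten youngest. The following is a minimal sketch, on a small made-up sample (not the insurance data), of listing the counts in age order and putting a rough number on how evenly the ages are sampled. The names ages, counts_by_age, and imbalance are illustrative only.

```python
import pandas as pd

# Hypothetical toy sample; on the real data this would be data['age']
ages = pd.Series([18, 18, 18, 19, 19, 20, 21, 21], name='age')

# value_counts() sorts by count (descending); sort_index() restores age order
counts_by_age = ages.value_counts().sort_index()
print(counts_by_age)

# Rough uniformity check: ratio of the most sampled age to the least sampled age
imbalance = counts_by_age.max() / counts_by_age.min()
print(imbalance)
```

On the insurance data the same ratio would be 69 / 22, roughly 3.1, driven mainly by ages 18 and 19; the remaining ages all fall between 22 and 29 samples.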

Use data.corr() to compute the pairwise correlation between fields. By default it uses the Pearson correlation coefficient.

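# Restrict to numeric columns; recent pandas versions raise an error if text columns reach corr()
print(data.select_dtypes(include='number').corr())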

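Result:
               age       bmi  children   charges
age       1.000000  0.109272  0.042469  0.299008
bmi       0.109272  1.000000  0.012759  0.198341
children  0.042469  0.012759  1.000000  0.067998
charges   0.299008  0.198341  0.067998  1.000000
The values in the table are Pearson correlation coefficients. The closer a value is to 1, the stronger the positive linear correlation between the two fields; the closer it is to -1, the stronger the negative linear correlation; values near 0 indicate little linear relationship. Because the coefficient is computed numerically, only numeric fields can be included, which is why sex, smoker, and region do not appear.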
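
To make the coefficient concrete, here is a small sketch that computes Pearson's r by hand (covariance divided by the product of the standard deviations) and compares it with what pandas returns. The toy numbers are made up for illustration and are not taken from insurance.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns, not taken from insurance.csv
df = pd.DataFrame({'age': [20, 30, 40, 50, 60],
                   'charges': [2000.0, 3500.0, 3000.0, 6000.0, 7500.0]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = df['age'] - df['age'].mean()
dy = df['charges'] - df['charges'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (method='pearson' is the default)
r_pandas = df['age'].corr(df['charges'])
print(r_manual, r_pandas)
```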
2. Extract the feature and target data and do initial preprocessing

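reg = LinearRegression()  # instantiate a linear regression model
# Select the feature columns
x = data[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]
# Select the target column
y = data['charges']

# Apply to_numeric to every column of x and to y. With errors='coerce', values that
# cannot be parsed as numbers (the text in sex, smoker, and region) become NaN
x = x.apply(pd.to_numeric, errors='coerce')
y = pd.to_numeric(y, errors='coerce')
# Fill missing values with 0 (this also zeroes out the three text columns)
x = x.fillna(0)
y = y.fillna(0)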
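
Note that pd.to_numeric with errors='coerce' does not map text categories to numbers: strings it cannot parse become NaN, so after fillna(0) the sex, smoker, and region columns carry no information. The sketch below shows this on a few made-up rows and, as one possible alternative that the original article does not use, one-hot encodes the text columns with pd.get_dummies. The frame toy and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like some of the insurance features
toy = pd.DataFrame({
    'age': [19, 33, 45],
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
})

# errors='coerce' cannot parse 'female' or 'yes', so those cells become NaN
coerced = toy.apply(pd.to_numeric, errors='coerce')
print(coerced['sex'].isna().all())

# Alternative (an assumption, not the article's method): one-hot encode the text columns
encoded = pd.get_dummies(toy, columns=['sex', 'smoker'], dtype=int)
print(encoded)
```

With this encoding each category becomes its own 0/1 column, so a linear model can assign it a weight instead of seeing a constant zero.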

Raise the dimensionality of the data with polynomial features, so that a nonlinear relationship between x and y can be modeled with a linear algorithm.

# Expand the features into all polynomial terms up to degree 3 (no constant column)
poly_features = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly_features.fit_transform(x)
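
As a quick illustration of what this expansion produces, the sketch below applies PolynomialFeatures to a single made-up sample with two features, then counts how many columns six input features become at degree 3. The variable names small, big, and n_columns are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features a=2, b=3; degree 2 yields [a, b, a^2, a*b, b^2]
small = PolynomialFeatures(degree=2, include_bias=False)
expanded = small.fit_transform(np.array([[2.0, 3.0]]))
print(expanded)  # [[2. 3. 4. 6. 9.]]

# Six input columns at degree 3, the settings used in the article, expand to 83 columns
big = PolynomialFeatures(degree=3, include_bias=False)
n_columns = big.fit_transform(np.zeros((1, 6))).shape[1]
print(n_columns)  # 83
```

The first columns of the expanded matrix are the original features in their original order, which is why X_poly[:, 0] is still the age column.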

Train the model

# Fit the model on the expanded features X_poly and the target y
reg.fit(X_poly, y)
# Print the learned weights and the intercept (bias term)
print(reg.coef_)
print(reg.intercept_)

Predict with the model, then draw a scatter plot of actual versus predicted values to inspect the fit

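# Predict with the trained model
y_predict = reg.predict(X_poly)

# Scatter plot for the first 10 samples: actual charges in blue, predictions in red.
# Column 0 of X_poly is age, so both series share the same x-axis
plt.plot(x['age'][:10], y[:10], 'b.')
plt.plot(X_poly[:10, 0], y_predict[:10], 'r.')
plt.show()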

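The resulting scatter plot:
[Figure: scatter plot of actual (blue) and predicted (red) charges against age for the first 10 samples]
As the figure shows, the predictions are not very good, because the polynomial expansion was the only preprocessing applied.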
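
The article judges the fit by eye on ten training points. One way to put a number on it is the R² score on held-out data. The sketch below assumes synthetic data with a noisy cubic relationship (not the insurance data) and compares a plain linear fit with a degree-3 polynomial fit; all names and the data-generating formula are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a noisy cubic relationship
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 0.5, size=400)

# Hold out a test set so the score is not measured on the training rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Plain linear model versus polynomial expansion followed by a linear model
linear = LinearRegression().fit(X_train, y_train)
cubic = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      LinearRegression()).fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_cubic = r2_score(y_test, cubic.predict(X_test))
print(r2_linear, r2_cubic)
```

An R² close to 1 means the model explains most of the variance in the target. Scoring on a held-out split matters here because a degree-3 expansion of many features can fit the training rows well without generalizing.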
