41. Reading Excel with Pandas and Plotting a Histogram

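Histogram:
A histogram is a graphical representation of the distribution of numerical data, and an estimate of the probability distribution of a continuous (quantitative) variable. It is a kind of bar chart.
To build a histogram, first divide the whole range of values into a series of intervals (bins), then count how many values fall into each interval.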
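
The two steps above can be sketched by hand. This is a minimal illustration on made-up values (not the dataset used in this post), assuming equal-width bins and the same interval convention NumPy uses, where every bin is half-open except the last:

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([1.0, 1.5, 2.0, 2.5, 2.7, 3.0, 3.9, 4.0])
n_bins = 3

# Step 1: split the range [min, max] into n_bins equal-width intervals.
edges = np.linspace(values.min(), values.max(), n_bins + 1)

# Step 2: count how many values fall into each interval.
# Every interval is half-open [left, right) except the last, which is closed.
counts = []
for i in range(n_bins):
    left, right = edges[i], edges[i + 1]
    if i == n_bins - 1:
        in_bin = (values >= left) & (values <= right)
    else:
        in_bin = (values >= left) & (values < right)
    counts.append(int(in_bin.sum()))

print(edges)   # the 4 edges that delimit the 3 intervals
print(counts)  # one count per interval
```

For these values the hand-written loop should agree with `np.histogram(values, bins=n_bins)`.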

1. Reading the data

The Boston house price dataset

import pandas as pd
import numpy as np
df = pd.read_excel("./datas/boston-house-prices/housing.xlsx")
df
     CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT  MEDV
0    0.00632  18.0  2.31   0     0.538  6.575  65.2  4.0900  1    296  15.3     396.90  4.98   24.0
1    0.02731  0.0   7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.90  9.14   21.6
2    0.02729  0.0   7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03   34.7
3    0.03237  0.0   2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94   33.4
4    0.06905  0.0   2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.90  5.33   36.2
...  ...      ...   ...    ...   ...    ...    ...   ...     ...  ...  ...      ...     ...    ...
501  0.06263  0.0   11.93  0     0.573  6.593  69.1  2.4786  1    273  21.0     391.99  9.67   22.4
502  0.04527  0.0   11.93  0     0.573  6.120  76.7  2.2875  1    273  21.0     396.90  9.08   20.6
503  0.06076  0.0   11.93  0     0.573  6.976  91.0  2.1675  1    273  21.0     396.90  5.64   23.9
504  0.10959  0.0   11.93  0     0.573  6.794  89.3  2.3889  1    273  21.0     393.45  6.48   22.0
505  0.04741  0.0   11.93  0     0.573  6.030  80.8  2.5050  1    273  21.0     396.90  7.88   11.9

506 rows × 14 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB
df["MEDV"]
0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: MEDV, Length: 506, dtype: float64

2. Plotting a histogram with matplotlib

matplotlib histogram documentation: https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.hist.html

import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(12, 5))
plt.hist(df["MEDV"], bins=100)
plt.show()

[Figure: histogram of MEDV with 100 bins (output_12_0.png)]
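
Besides drawing the bars, the matplotlib `hist` call also returns the counts and the bin edges it computed, which is handy for checking what the plot shows. The sketch below uses synthetic data as a stand-in for `df["MEDV"]` (the numbers are not the real house prices) and the non-interactive `Agg` backend so that it also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for df["MEDV"]; these are NOT the real house prices.
rng = np.random.default_rng(0)
sample = rng.normal(loc=22.5, scale=9.0, size=506)

fig, ax = plt.subplots(figsize=(12, 5))
# hist returns the per-bin counts, the bin edges, and the drawn bar patches.
counts, edges, patches = ax.hist(sample, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("count")

print(len(counts), len(edges))  # one more edge than counts
plt.close(fig)
```

Trying a few values for `bins` (for example 20, 50, 100) is a quick way to see how strongly the bin width changes the apparent shape of the distribution.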

3. Plotting a histogram with pyecharts

pyecharts histogram documentation: http://gallery.pyecharts.org/#/Bar/bar_histogram
numpy histogram documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html

from pyecharts import options as opts
from pyecharts.charts import Bar
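# We have to compute the intervals ourselves: how many there are, and how many values fall into each
hist, bin_edges = np.histogram(df["MEDV"], bins=100)
# These are the edges that separate the intervals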
bin_edges
array([ 5.  ,  5.45,  5.9 ,  6.35,  6.8 ,  7.25,  7.7 ,  8.15,  8.6 ,
        9.05,  9.5 ,  9.95, 10.4 , 10.85, 11.3 , 11.75, 12.2 , 12.65,
       13.1 , 13.55, 14.  , 14.45, 14.9 , 15.35, 15.8 , 16.25, 16.7 ,
       17.15, 17.6 , 18.05, 18.5 , 18.95, 19.4 , 19.85, 20.3 , 20.75,
       21.2 , 21.65, 22.1 , 22.55, 23.  , 23.45, 23.9 , 24.35, 24.8 ,
       25.25, 25.7 , 26.15, 26.6 , 27.05, 27.5 , 27.95, 28.4 , 28.85,
       29.3 , 29.75, 30.2 , 30.65, 31.1 , 31.55, 32.  , 32.45, 32.9 ,
       33.35, 33.8 , 34.25, 34.7 , 35.15, 35.6 , 36.05, 36.5 , 36.95,
       37.4 , 37.85, 38.3 , 38.75, 39.2 , 39.65, 40.1 , 40.55, 41.  ,
       41.45, 41.9 , 42.35, 42.8 , 43.25, 43.7 , 44.15, 44.6 , 45.05,
       45.5 , 45.95, 46.4 , 46.85, 47.3 , 47.75, 48.2 , 48.65, 49.1 ,
       49.55, 50.  ])
len(bin_edges)
101
# These are the counts for each interval
hist
array([ 2,  1,  1,  0,  5,  2,  1,  6,  3,  0,  3,  3,  5,  3,  4,  6,  3,
        5, 14,  9,  9,  6, 11,  8,  6,  8,  6, 10,  9,  9, 15, 13, 20, 16,
       19, 10, 14, 19, 13, 15, 21, 16,  9, 12, 14,  1,  0,  4,  5,  2,  6,
        5,  5,  4,  3,  6,  2,  3,  4,  3,  4,  3,  6,  2,  1,  1,  5,  3,
        1,  4,  1,  3,  1,  1,  1,  0,  0,  1,  0,  0,  1,  1,  1,  1,  1,
        1,  2,  0,  1,  1,  0,  1,  1,  0,  0,  0,  2,  1,  0, 16],
      dtype=int64)
len(hist)
100
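About bin_edges: why does it have 101 elements, one more than the hist counts?

Because n intervals are delimited by n + 1 edges. For example, if bins is [1, 2, 3, 4], the data is split into 3 intervals: [1, 2), [2, 3), [3, 4]. Every interval is half-open except the last one, which also includes its right edge.
When bins is given as an integer, as it is here, the first edge is the minimum of the array and the last edge is the maximum of the array.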
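
The [1, 2, 3, 4] example can be checked directly with `np.histogram`. The data values below are made up for illustration; note how the value 4 lands in the last interval because that interval is closed on the right:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 4])  # hypothetical values

# Explicit edges: 4 edges delimit 3 intervals [1, 2), [2, 3), [3, 4].
counts, edges = np.histogram(data, bins=[1, 2, 3, 4])
print(counts)                    # [1 1 3]
print(len(edges) - len(counts))  # 1

# Integer bins: the edges run from the data minimum to the data maximum.
counts3, edges3 = np.histogram(data, bins=3)
print(edges3[0] == data.min(), edges3[-1] == data.max())  # True True
```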

# Note that min equals the first element of bin_edges and max equals the last
df["MEDV"].describe()
count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64
# The difference between each edge and the previous one shows that the intervals are equal-width
np.diff(bin_edges)
array([0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45,
       0.45])
# The number of intervals is exactly the number of counts in hist
len(np.diff(bin_edges))
100
# pyecharts draws a histogram with a Bar chart
# bin_edges[:-1] uses the left edge of each interval as the x-axis value
bar = (
    Bar()
    .add_xaxis([str(x) for x in bin_edges[:-1]])
    .add_yaxis("Price distribution", [float(x) for x in hist], category_gap=0)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Boston house prices - price distribution - histogram", pos_left="center"),
        legend_opts=opts.LegendOpts(is_show=False)
    )
)
bar.render_notebook()
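
The chart above labels each bar with the left edge of its interval. If full interval labels or bin centers are preferred, they can be derived from the edges. This is only a sketch: `edges` here is rebuilt with `np.linspace` as a stand-in for the `bin_edges` array, and the label format is an arbitrary choice.

```python
import numpy as np

# Stand-in for bin_edges: 101 equally spaced edges from 5 to 50.
edges = np.linspace(5.0, 50.0, 101)

# One "left-right" label per interval, rounded to two decimals for display.
labels = [f"{left:.2f}-{right:.2f}" for left, right in zip(edges[:-1], edges[1:])]

# Alternative x-axis values: the midpoint of each interval.
centers = (edges[:-1] + edges[1:]) / 2

print(labels[0], labels[-1])
print(len(labels), len(centers))
```

Either list has one entry per bar, so it could be passed to `add_xaxis` in place of the left edges.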

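Exercise:
Take the sales or price data for your own product, extract it as a one-dimensional array, and plot a histogram to see how the data is distributed.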
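
One possible starting point for the exercise is sketched below. The DataFrame, its column names `price` and `sales`, and all of the values are invented for illustration; real data would come from `pd.read_excel` as in section 1.

```python
import numpy as np
import pandas as pd

# Hypothetical product data; column names and values are assumptions.
products = pd.DataFrame({
    "price": [9.9, 19.9, np.nan, 29.9, 19.9, 49.9, 9.9, 99.0],
    "sales": [120, 80, 45, 60, 95, 20, 150, 5],
})

# Drop missing values first: np.histogram cannot detect a finite range
# automatically when the data contains NaN.
prices = products["price"].dropna().to_numpy()

counts, edges = np.histogram(prices, bins=5)
print(prices.ndim)   # 1, a one-dimensional array
print(counts.sum())  # number of non-missing prices
```

The resulting `counts` and `edges` can then be plotted with either matplotlib or pyecharts, following sections 2 and 3.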
