Notes on Using hist Histograms in Python

UncoDong

Last modified 2023-01-21 08:38:12

Category: python. Tags: python pandas numpy

First published 2022-11-27 16:52:19

Article link: https://blog.csdn.net/weixin_42763696/article/details/128066051

If you call hist directly on data with a long-tailed distribution, you may find that almost all of the data ends up squeezed together.

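Looking closely at the output, you can see that hist builds the histogram by splitting the range from the data's minimum (0 here) to its maximum into equal-width bins. This is controlled by the function's second parameter, bins. As a result, the bulk of a long-tailed dataset lands in a single bin.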
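The effect is easy to reproduce. The sketch below uses a synthetic log-normal sample (an assumption for illustration, not the data from this post) and calls np.histogram, which plt.hist relies on for binning, to show how much of the data falls into the first of ten equal-width bins.

```python
import numpy as np

# Synthetic long-tailed sample; the distribution and its parameters are assumptions.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=3.0, size=10_000)

# An integer bin count splits [data.min(), data.max()] into equal-width bins.
counts, edges = np.histogram(data, bins=10)

first_bin_share = counts[0] / counts.sum()
print("bin edges:", edges)
print("counts:", counts)
print(f"share of data in the first bin: {first_bin_share:.1%}")
```

Because the few extreme values set the upper edge, nearly every point sits in the first bin and the remaining bins are almost empty.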

Solution 1

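Set bins manually. According to the parameter's definition, bins has a default value, rcParams["hist.bins"] (10 unless you have changed it), so passing a larger number cuts the data into more bins and makes the picture somewhat better.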

bins : int or sequence or str, default: :rc:`hist.bins`
        If *bins* is an integer, it defines the number of equal-width bins
        in the range.
    
        If *bins* is a sequence, it defines the bin edges, including the
        left edge of the first bin and the right edge of the last bin;
        in this case, bins may be unequally spaced.  All but the last
        (righthand-most) bin is half-open.  In other words, if *bins* is::
    
            [1, 2, 3, 4]
    
        then the first bin is ``[1, 2)`` (including 1, but excluding 2) and
        the second ``[2, 3)``.  The last bin, however, is ``[3, 4]``, which
        *includes* 4.
    
        If *bins* is a string, it is one of the binning strategies
        supported by `numpy.histogram_bin_edges`: 'auto', 'fd', 'doane',
        'scott', 'stone', 'rice', 'sturges', or 'sqrt'.
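The three forms of bins described in the docstring can be tried directly with NumPy. This is a small sketch on made-up values (they are assumptions chosen to sit on the bin edges), meant to check the half-open rule rather than to recommend particular settings.

```python
import numpy as np

values = [1, 2, 2.5, 3, 4]  # toy values chosen to land on the edges

# Sequence form: bins are [1, 2), [2, 3) and the closed last bin [3, 4].
seq_counts, seq_edges = np.histogram(values, bins=[1, 2, 3, 4])
print(seq_counts)  # [1 2 2]

# Integer form: three equal-width bins between min(values) and max(values).
int_counts, int_edges = np.histogram(values, bins=3)
print(int_edges)

# String form: let a named strategy pick the edges.
auto_edges = np.histogram_bin_edges(values, bins="sturges")
print(auto_edges)
```

The value 2 is counted in the second bin and 4 is counted in the last one, which matches the docstring's description of half-open bins with a closed final bin.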

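Sometimes, though, the gap between the minimum and the maximum is so large that each bin is still very wide even after splitting the range into thousands of bins. In my case, with 1,000 bins each bin was still about 200,000 wide, so splitting further did not seem worthwhile.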
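The arithmetic behind this is simple: with equal-width bins, the width is the data range divided by the number of bins. The sketch below uses a hypothetical range of 0 to 200,000,000, picked only because it reproduces the 200,000-wide bins mentioned above; the helper names are made up for illustration.

```python
import math

def equal_bin_width(lo, hi, n_bins):
    """Width of each bin when [lo, hi] is split into n_bins equal parts."""
    return (hi - lo) / n_bins

def bins_needed(lo, hi, target_width):
    """Number of equal-width bins required so that no bin is wider than target_width."""
    return math.ceil((hi - lo) / target_width)

lo, hi = 0, 200_000_000  # hypothetical range, not the author's actual data

width = equal_bin_width(lo, hi, 1000)
needed = bins_needed(lo, hi, 100)
print(width)   # 200000.0
print(needed)  # 2000000
```

Getting down to bins 100 wide would take two million bins over such a range, which is why simply raising the bin count stops being practical.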

Solution 2

Use Jenks natural breaks to split the data into segments.

pip install jenkspy

Usage

import jenkspy

# your_data: a list of numbers; n_classes: how many classes to split it into
jenks_result = jenkspy.jenks_breaks(your_data, n_classes=n_classes)

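Once you have the break points, count how many values fall into each interval (reference: "python统计每个区间的数值数量", a post on counting values per interval in Python).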

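# This example was copied from another post
import random
import pandas as pd
score = [random.randint(0, 10) for _ in range(100)]  # generate a random list of values
score = pd.Series(score)
# Count the values in 0-1, 1-2, and so on; include_lowest=True keeps scores equal to 0
se1 = pd.cut(score, [0, 1, 2, 5, 8, 10], include_lowest=True)
print(se1.value_counts())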
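One detail of pd.cut is worth checking when the first edge equals the smallest value, as it does with Jenks break points: by default the intervals are open on the left, so a value exactly on the first edge gets no interval. The sketch below uses a tiny hand-written series (an assumption for illustration) to compare the default with include_lowest=True.

```python
import pandas as pd

scores = pd.Series([0, 1, 2, 5, 8, 10])  # toy values, one per edge
edges = [0, 1, 2, 5, 8, 10]

# Default: intervals are (0, 1], (1, 2], ... so the value 0 is left unassigned (NaN).
default_cut = pd.cut(scores, edges)

# include_lowest=True closes the first interval on the left so that 0 is kept.
inclusive_cut = pd.cut(scores, edges, include_lowest=True)

n_dropped_default = int(default_cut.isna().sum())
n_dropped_inclusive = int(inclusive_cut.isna().sum())
per_interval = inclusive_cut.value_counts().sort_index()

print(n_dropped_default)    # 1
print(n_dropped_inclusive)  # 0
print(per_interval)
```

Sorting the counts by index keeps the intervals in ascending order, which is usually what you want before plotting them.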

Plot the per-interval counts

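# This is the part we actually need
import matplotlib.pyplot as plt

# your_data: the same numbers passed to jenks_breaks; include_lowest=True keeps the minimum value
se = pd.cut(pd.Series(your_data), jenks_result, include_lowest=True)
counts = se.value_counts().sort_index()  # order by interval, not by count
print(counts)

fig, ax = plt.subplots(figsize=(11, 7), dpi=400)
ax.bar(range(len(counts)), counts)
ax.set_xlabel('Jenks interval index')
ax.set_ylabel('Number of data points in the interval')
plt.show()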