Data Science Week 2 | Descriptive Statistics in Python

Setting up the environment

$ pip3 install jupyterlab  # install JupyterLab
$ jupyter lab  # launch JupyterLab in the browser

$ pip3 install pandas  # install pandas (pulls in numpy as a dependency)
$ pip3 install matplotlib  # install matplotlib

Loading the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('salaries_10000.csv')
df.head()
   emp_no  salary   from_date     to_date
0   10001   60117  1986-06-26  1987-06-26
1   10001   62102  1987-06-26  1988-06-25
2   10001   66074  1988-06-25  1989-06-25
3   10001   66596  1989-06-25  1990-06-25
4   10001   66961  1990-06-25  1991-06-25
df.describe()
             emp_no         salary
count  10000.000000   10000.000000
mean   10526.233800   64287.636300
std      304.535669   16940.369764
min    10001.000000   39265.000000
25%    10262.750000   50853.750000
50%    10526.000000   61585.000000
75%    10791.000000   75238.000000
max    11053.000000  136004.000000

Central tendency

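# Arithmetic mean
df['salary'].mean()
64287.6363
# Median
df['salary'].median()
61585.0
# Mode (returned as a Series, because several values can tie)
df['salary'].mode()
0    40000
dtype: int64
# Minimum and maximum
print("Minimum:", df['salary'].min())
print("Maximum:", df['salary'].max())
Minimum: 39265
Maximum: 136004
# Quartiles
print("First quartile (Q1):", df['salary'].quantile(q=0.25))
print("Second quartile (Q2, the median):", df['salary'].quantile(q=0.50))
print("Third quartile (Q3):", df['salary'].quantile(q=0.75))
First quartile (Q1): 50853.75
Second quartile (Q2, the median): 61585.0
Third quartile (Q3): 75238.0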
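
The same measures can be cross-checked without pandas. The sketch below uses Python's standard `statistics` module on a small hypothetical sample (the six values are invented for illustration and are not taken from salaries_10000.csv). It assumes that `method="inclusive"` is the right counterpart to pandas' default linear interpolation in `quantile()`, which holds for this sample.

```python
import statistics

# Hypothetical toy salaries, invented for illustration only
sample = [40000, 40000, 52000, 61000, 75000, 98000]

mean = statistics.mean(sample)      # sum of values / number of values
median = statistics.median(sample)  # average of the two middle values here
mode = statistics.mode(sample)      # most frequent value

# Quartiles with linear interpolation between neighbouring data points
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("quartiles:", q1, q2, q3)
```

With an even number of values the median falls between two observations, so it does not have to be a value that occurs in the data.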

Dispersion

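# Variance (sample variance; pandas divides by n - 1 by default)
df['salary'].var()
286976127.73639596
# Standard deviation
df['salary'].std()
16940.36976386277
# Interquartile range (IQR): Q3 - Q1
df['salary'].quantile(q=0.75) - df['salary'].quantile(q=0.25)
24384.25
# Coefficient of variation: standard deviation / mean
df['salary'].std() / df['salary'].mean()
0.2635089846017993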
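
One detail worth keeping in mind: pandas' `var()` and `std()` default to `ddof=1` (divide by n - 1), whereas NumPy's `np.var` and `np.std` default to `ddof=0` (divide by n). The sketch below shows the difference using the standard `statistics` module on a hypothetical eight-value sample; the variable names are illustrative.

```python
import statistics

# Hypothetical sample, invented for illustration only
values = [2, 4, 4, 4, 5, 5, 7, 9]

sample_var = statistics.variance(values)  # divides by n - 1 (pandas default)
pop_var = statistics.pvariance(values)    # divides by n (NumPy default)
sample_std = statistics.stdev(values)     # square root of the sample variance

# Coefficient of variation: a unit-free measure of relative spread
cv = sample_std / statistics.mean(values)

print("sample variance:", sample_var)
print("population variance:", pop_var)
print("coefficient of variation:", cv)
```

For the 10000 rows used here the two conventions give nearly identical results, but on small samples the gap is noticeable.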
# Histogram of salaries with 100 bins; returns (counts, bin edges, patches)
plt.hist(df['salary'], bins=100)
(array([390., 105., 150., 166., 231., 180., 177., 220., 231., 210., 222.,
        223., 233., 228., 228., 247., 210., 245., 234., 206., 192., 227.,
        235., 205., 216., 190., 193., 200., 189., 167., 171., 174., 163.,
        165., 160., 143., 153., 119., 136., 119., 123., 139., 106., 118.,
        106.,  89., 111.,  92.,  98.,  83.,  80.,  62.,  74.,  79.,  69.,
         74.,  58.,  66.,  52.,  55.,  52.,  37.,  36.,  31.,  30.,  28.,
         21.,  22.,  17.,  22.,   5.,  12.,  12.,  12.,   4.,   8.,   6.,
          6.,   4.,   3.,   7.,   6.,   7.,   2.,   5.,   2.,   2.,   2.,
          2.,   5.,   1.,   1.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,
          2.]),
 array([ 39265.  ,  40232.39,  41199.78,  42167.17,  43134.56,  44101.95,
         45069.34,  46036.73,  47004.12,  47971.51,  48938.9 ,  49906.29,
         50873.68,  51841.07,  52808.46,  53775.85,  54743.24,  55710.63,
         56678.02,  57645.41,  58612.8 ,  59580.19,  60547.58,  61514.97,
         62482.36,  63449.75,  64417.14,  65384.53,  66351.92,  67319.31,
         68286.7 ,  69254.09,  70221.48,  71188.87,  72156.26,  73123.65,
         74091.04,  75058.43,  76025.82,  76993.21,  77960.6 ,  78927.99,
         79895.38,  80862.77,  81830.16,  82797.55,  83764.94,  84732.33,
         85699.72,  86667.11,  87634.5 ,  88601.89,  89569.28,  90536.67,
         91504.06,  92471.45,  93438.84,  94406.23,  95373.62,  96341.01,
         97308.4 ,  98275.79,  99243.18, 100210.57, 101177.96, 102145.35,
        103112.74, 104080.13, 105047.52, 106014.91, 106982.3 , 107949.69,
        108917.08, 109884.47, 110851.86, 111819.25, 112786.64, 113754.03,
        114721.42, 115688.81, 116656.2 , 117623.59, 118590.98, 119558.37,
        120525.76, 121493.15, 122460.54, 123427.93, 124395.32, 125362.71,
        126330.1 , 127297.49, 128264.88, 129232.27, 130199.66, 131167.05,
        132134.44, 133101.83, 134069.22, 135036.61, 136004.  ]),
 <a list of 100 Patch objects>)
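
The bin edges in the output above can be reproduced by hand. This is a minimal sketch, assuming the default behaviour in which an integer `bins` splits the range from the minimum to the maximum into equal-width intervals; `lo`, `hi`, `width` and `edges` are illustrative names.

```python
# Minimum and maximum salary, taken from the df.describe() output
lo, hi = 39265, 136004
bins = 100

# Each bin covers the same width of the salary range
width = (hi - lo) / bins

# 100 bins need 101 edges: the left edge of each bin plus the final right edge
edges = [lo + i * width for i in range(bins + 1)]

print("bin width:", round(width, 2))
print("first edges:", [round(e, 2) for e in edges[:3]])
print("last edge:", round(edges[-1], 2))
```

The first bin spans 39265 to roughly 40232.39, so it contains the mode of 40000, which is consistent with it having the largest count (390).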

print("Skewness:", df['salary'].skew())
print("Excess kurtosis:", df['salary'].kurt())
Skewness: 0.6784771437175181
Excess kurtosis: -0.03592105965629688
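
To make these two numbers less of a black box, the sketch below computes skewness and excess kurtosis from plain central moments. It is an illustration rather than a reimplementation of pandas: `skew()` and `kurt()` apply a sample-size bias correction, so their values differ slightly from these ratios on small samples. The helper names and both samples are hypothetical.

```python
def central_moment(xs, k):
    """k-th central moment: mean of (x - mean) ** k."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth central moment scaled by the squared variance, minus 3
    so that a normal distribution scores 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

# Hypothetical samples, invented for illustration only
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]  # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]

print("right-skewed sample:", skewness(right_skewed))
print("symmetric sample:", skewness(symmetric))
print("excess kurtosis (symmetric):", excess_kurtosis(symmetric))
```

A positive skewness, like the 0.68 reported for the salaries, indicates a longer right tail, which matches the mean (64287.64) being larger than the median (61585). An excess kurtosis close to 0 means the tails are about as heavy as those of a normal distribution.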