Data Information & Statistical Computation

Latest recommended article published 2024-07-12 16:16:27

EricZHAOedu


Category: 深入浅出Pandas (Pandas in Depth). Tags: pandas, python, data analysis

Article link: https://blog.csdn.net/zhaoleiedu/article/details/127506113


4.2 Data Information

4.2.1 Viewing samples

df.head() # first rows, 5 by default
df.tail() # last rows, 5 by default
df.sample() # one random row; pass n to get more
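A minimal sketch of the three sampling calls above. The DataFrame contents are made up for illustration; only the column names follow the article.

```python
import pandas as pd

# Hypothetical sample data (values invented for this sketch)
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F", "G"],
    "team": ["E", "C", "A", "C", "D", "B", "A"],
    "Q1": [89, 36, 57, 93, 65, 61, 45],
    "Q2": [21, 37, 60, 96, 49, 95, 13],
})

first_three = df.head(3)               # first 3 rows
last_two = df.tail(2)                  # last 2 rows
picked = df.sample(3, random_state=0)  # 3 random rows; random_state makes it repeatable

print(first_three)
print(last_two)
print(picked)
```

Passing random_state is optional; without it, each call to sample() returns a different selection.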

4.2.2 Data shape: df.shape

4.2.3 Basic information: df.info()

4.2.4 Data types: df.dtypes

For a Series, use dtype (singular).

4.2.5 Row and column index contents: df.axes

[RangeIndex(start=0, stop=100, step=1),
 Index(['name', 'team', 'Q1', 'Q2', 'Q3', 'Q4'], dtype='object')]
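The attributes in 4.2.2 to 4.2.5 can be tried together on a small frame. This is a sketch with invented data, not the 100-row dataset the article uses.

```python
import pandas as pd

# Hypothetical 3-row frame (values invented for this sketch)
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "team": ["X", "Y", "X"],
    "Q1": [89, 36, 57],
    "Q2": [21.5, 37.0, 60.25],
})

shape = df.shape                 # (number of rows, number of columns)
dtypes = df.dtypes               # a Series holding one dtype per column
q1_dtype = df["Q1"].dtype        # a single column is a Series, so use .dtype
row_index, col_index = df.axes   # [row index, column index]

df.info()                        # prints non-null counts, dtypes, and memory usage
print(shape)
print(dtypes)
print(list(col_index))
```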

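4.2.6 Other information

df.index # row index object
df.columns # column index
df.values # underlying data as a NumPy array
df.ndim # number of dimensions
df.size # total number of cells (rows x columns)
df.empty # whether the DataFrame is empty
df.keys() # the index for a Series, the column labels for a DataFrame
# Series-only attributes
s.name
s.array # all values as an extension array (PandasArray, renamed NumpyExtensionArray in pandas 2.1)
s.dtype
s.hasnans # True if the Series contains missing values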
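A short sketch of several of these attributes on a frame that contains one missing value. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single NaN in Q1
df = pd.DataFrame({"Q1": [1.0, 2.0, np.nan], "Q2": [4, 5, 6]})
s = df["Q1"]

n_dims = df.ndim              # 2 for any DataFrame
n_cells = df.size             # 3 rows x 2 columns
is_empty = df.empty           # False: the frame has rows and columns
col_names = list(df.keys())   # same labels as df.columns
series_keys = list(s.keys())  # for a Series, keys() returns the index
has_missing = s.hasnans       # True because Q1 contains a NaN

print(n_dims, n_cells, is_empty, col_names, series_keys, has_missing)
```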

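4.3 Statistical Computation

4.3.1 Descriptive statistics: df.describe()

Returns a multi-row summary table with the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.
If the data contains no numbers, it reports statistics for string data instead: the count, the number of unique values, the most frequent value (top), and its frequency (freq).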

pd.Series(['a', 'b', 'c', 'c']).describe()
'''
count     4
unique    3
top       c
freq      2
dtype: object
'''
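For contrast with the string example, here is a sketch that runs describe() on a numeric Series and a string Series side by side. The numeric values are invented for illustration.

```python
import pandas as pd

numeric_summary = pd.Series([1, 2, 3, 4]).describe()
text_summary = pd.Series(["a", "b", "c", "c"]).describe()

# Numeric data yields count, mean, std, min, percentiles, and max
print(numeric_summary)
# String data yields count, unique, top, and freq
print(text_summary)
```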

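describe() also supports datetime data. In pandas 1.1 through 1.5, pass datetime_is_numeric=True to get numeric-style statistics; pandas 2.0 removed the argument and always summarizes datetimes this way.

(
    pd.Series(pd.date_range('2020-1-1', '2020-1-5')) # 5 consecutive days
    .describe(datetime_is_numeric=True) # pandas 1.1 to 1.5; omit the argument on pandas 2.0+
)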
'''
count                      5
mean     2020-01-03 00:00:00
min      2020-01-01 00:00:00
25%      2020-01-02 00:00:00
50%      2020-01-03 00:00:00
75%      2020-01-04 00:00:00
max      2020-01-05 00:00:00
dtype: object
'''
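Because the keyword exists only in some pandas releases, a version-tolerant call can be sketched as below. The try/except fallback is an assumption of this sketch, not something the article prescribes, and it presumes pandas 1.1 or newer.

```python
import pandas as pd

dates = pd.Series(pd.date_range("2020-01-01", "2020-01-05"))

try:
    # pandas 1.1 to 1.5 accept the keyword
    summary = dates.describe(datetime_is_numeric=True)
except TypeError:
    # pandas 2.0+ no longer has the keyword and treats datetimes as numeric by default
    summary = dates.describe()

print(summary)
```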

You can also choose the percentiles yourself, and include or exclude data types.

df.describe(percentiles=[.05, .95])
df.describe(include=[object, 'number']) # data types to describe
df.describe(exclude=[object]) # data types to leave out
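The sketch below applies these options to a small mixed-type frame. The data is invented, and the selections use 'number' and include='all' rather than object so that the result does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Hypothetical mixed-type frame
df = pd.DataFrame({
    "team": ["A", "B", "A", "C"],
    "Q1": [10, 20, 30, 40],
    "Q2": [1.5, 2.5, 3.5, 4.5],
})

tails = df.describe(percentiles=[.05, .95])    # adds 5% and 95% rows for numeric columns
only_numeric = df.describe(include=["number"])
non_numeric = df.describe(exclude=["number"])  # leaves only the string column
everything = df.describe(include="all")        # all columns; inapplicable cells are NaN

print(tails)
print(non_numeric)
print(everything)
```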

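4.3.3 Statistical functions

df.mean() # mean; pass numeric_only=True if df has non-numeric columns (required in pandas 2.0+)
df.corr() # correlation coefficients
df.cov() # covariance
df.count() # number of non-null values
df.max() # maximum
df.min() # minimum
df.abs() # absolute values (numeric data only)
df.median() # median
df.std() # standard deviation
df.var() # variance
df.sem() # standard error of the mean
df.mode() # mode
df.prod() # product of all values
df.mad() # mean absolute deviation (deprecated in pandas 1.5, removed in 2.0)
df.cumprod() # cumulative product
df.cumsum() # cumulative sum
df.nunique() # number of distinct values
df.idxmax() # index label of the maximum
df.idxmin() # index label of the minimum
df.cummax() # cumulative maximum
df.cummin() # cumulative minimum
df.skew() # sample skewness
df.kurt() # sample kurtosis
df.quantile() # quantile; default q=0.5, i.e. the median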
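A sketch of a few of these functions on invented data. It selects the numeric columns first so the calls behave the same across pandas versions, and it computes the mean absolute deviation by hand as one possible stand-in for df.mad() on releases where that method is gone.

```python
import pandas as pd

# Hypothetical frame with one string column and two numeric columns
df = pd.DataFrame({
    "team": ["A", "B", "A", "C"],
    "Q1": [10, 20, 30, 40],
    "Q2": [4, 3, 2, 1],
})

num = df.select_dtypes(include="number")  # keep only Q1 and Q2

means = num.mean()
corr = num.corr()                 # Q1 rises as Q2 falls, so the coefficient is -1
max_labels = num.idxmax()         # row label of each column's maximum
q50 = num.quantile(0.5)           # same as num.median()
running_total = num.cumsum()
distinct = df.nunique()           # works on string columns too

# Manual mean absolute deviation: average distance from the column mean
mad = (num - num.mean()).abs().mean()

print(means, corr, max_labels, q50, mad, distinct, sep="\n\n")
```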

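4.3.4 Non-statistical computation

df.all() # per column: whether all values are truthy
df.any() # per column: whether any value is truthy
df.round() # round to 0 decimal places
df.round({'Q1': 2, 'Q2': 1}) # round Q1 to 2 decimal places and Q2 to 1
df.round(-1) # round to the nearest ten
df.nunique() # number of distinct values in each column
df.isna() # True where a value is missing
df.notna() # True where a value is present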
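The rounding and missing-value calls can be sketched on a small frame as follows. The values are invented and chosen to avoid exact .5 ties, where floating-point representation can affect the rounded result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in Q2
df = pd.DataFrame({"Q1": [12.346, 67.891], "Q2": [3.14159, np.nan]})

per_column = df.round({"Q1": 2, "Q2": 1})  # Q1 to 2 decimals, Q2 to 1 decimal
to_tens = df.round(-1)                     # negative digits round left of the decimal point

missing_per_column = df.isna().sum()       # count of NaN values in each column
fully_present = df.notna().all()           # True only for columns without NaN
any_above_50 = (df > 50).any()             # NaN compares as False

print(per_column)
print(to_tens)
print(missing_per_column)
```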

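The methods below accept a scalar or a DataFrame, broadcast the operation over the data, and return a new DataFrame.

# other: a scalar or a DataFrame
df.add(other) # +
df.sub(other) # -
df.mul(other) # *
df.div(other) # /
df.mod(other) # modulo
df.pow(other) # power
df.dot(df2) # matrix multiplication; df's columns must match df2's index
# Series
s.value_counts() # count of each distinct value
s.value_counts(normalize=True) # relative frequency of each value
s.value_counts(normalize=True, sort=False) # do not sort by frequency
s.unique() # array of distinct values
s.is_unique # whether all values are distinct
s.nlargest() # the 5 largest values; pass n to change the count
s.nsmallest() # the 5 smallest values
s.pct_change() # fractional change from the previous row (0.1 means +10%)
s.pct_change(periods=2) # fractional change from the value 2 rows earlier
s.cov(s2) # covariance of two Series; df.cov() computes it for every pair of columns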
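A sketch of the broadcast arithmetic and a few of the Series methods listed above. The frames, the weights Series, and the sample values are all invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Q1": [1, 2, 3], "Q2": [10, 20, 30]})
other = pd.DataFrame({"Q1": [1, 1, 1], "Q2": [2, 2, 2]})

plus_ten = df.add(10)   # the scalar is broadcast to every cell
diff = df.sub(other)    # frames are aligned on row and column labels

# dot() needs df's column labels to match the index of the right operand
weights = pd.Series([0.5, 0.5], index=["Q1", "Q2"])
weighted = df.dot(weights)  # one weighted sum per row

s = pd.Series([100, 110, 121, 121])
counts = s.value_counts()              # how often each value occurs
freq = s.value_counts(normalize=True)  # same counts as fractions of the total
change = s.pct_change()                # first entry is NaN: there is no previous row
top2 = s.nlargest(2)

print(plus_ten, diff, weighted, counts, freq, change, top2, sep="\n\n")
```

The right operand of dot() can also be a DataFrame; a Series is used here only to keep the result one-dimensional.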