Python Data Analysis 07: pandas Descriptive Statistics, Correlation and Covariance, and the unique() Function

Latest recommended article published on 2022-09-16 11:17:22

灯bupa冷

Category column: Python for Data Analysis. Article tags: python, data analysis, big data

Article link: https://blog.csdn.net/Apple_xiaoli/article/details/104664377

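This article shows how to compute descriptive statistics with pandas, including correlation and covariance, and how to work with unique values using unique(). It illustrates computing correlations between stock prices and trading volumes, and looks at methods such as value_counts() and isin().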

CHAPTER 5 Getting Started with pandas

Table of Contents

CHAPTER 5 Getting Started with pandas
- 5.3 Summarizing and Computing Descriptive Statistics
  - 5.3.1 Correlation and Covariance
  - 5.3.2 Unique Values, Value Counts, and Membership

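5.3 Summarizing and Computing Descriptive Statistics

pandas objects come with many mathematical and statistical methods. Most of them are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. They also have built-in handling for missing values: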

>>> import pandas as pd
>>> import numpy as np

>>> df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
...                    [np.nan, np.nan], [0.75, -1.3]],
...                   index=['a', 'b', 'c', 'd'],
...                   columns=['one', 'two'])
>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

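Calling the DataFrame's sum method returns a Series containing the sum of each column:

>>> df.sum()  # by default, sums down the rows, giving one value per column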
one    9.25
two   -5.80
dtype: float64

Passing axis='columns' or axis=1 sums across the columns instead, giving one value per row:

>>> df.sum(axis='columns')
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
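
The axis argument is easy to mix up, so here is a small sketch that rebuilds the same df and compares both directions. The variable names col_sums and row_sums are my own, chosen only for illustration; the point is that the labels of the result tell you which axis was collapsed.

```python
import numpy as np
import pandas as pd

# Same example frame as in the text above.
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# axis=0 (or 'index') collapses the rows: one result per column.
col_sums = df.sum(axis=0)

# axis=1 (or 'columns') collapses the columns: one result per row.
row_sums = df.sum(axis=1)

print(list(col_sums.index))  # labels come from df.columns
print(list(row_sums.index))  # labels come from df.index
```

In other words, axis names the dimension that disappears, not the one that is kept.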

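Missing values (NA) are excluded from these computations unless the entire slice (a row or column here) is NA; in the output above, row c is all NA and its sum is 0. Passing skipna=False disables this exclusion, so a single NA in a row or column makes the result NA:

>>> df.mean(axis='columns', skipna=False)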
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
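
To make the effect of skipna more concrete, the sketch below puts the default and skipna=False side by side on the same df. It also uses the min_count argument of sum, which is not mentioned in the text above; I include it as one possible way to get NaN instead of 0 for an all-NA row. The names skipped, strict and guarded are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default (skipna=True): NA values are ignored, so row a averages only 1.4.
skipped = df.mean(axis='columns')

# skipna=False: any NA in the row makes that row's result NA.
strict = df.mean(axis='columns', skipna=False)

# sum() of an all-NA row is 0 by default; min_count=1 requires at least
# one non-NA value, otherwise the result is NaN.
guarded = df.sum(axis='columns', min_count=1)

print(pd.DataFrame({'skipped': skipped, 'strict': strict, 'guarded': guarded}))
```

Row a is the one to watch: it has one valid value, so the default mean keeps it while skipna=False turns it into NaN.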

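Options for reduction methods

Option	Description
axis	Axis to reduce over: 0 for the DataFrame's rows, 1 for its columns
skipna	Exclude missing values; True by default
level	Reduce grouped by level if the axis is hierarchically indexed (MultiIndex); removed in pandas 2.0, where groupby(level=...) is used instead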
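
The level option only matters for a hierarchical index. As far as I know, recent pandas versions no longer accept level on reduction methods, so the sketch below uses groupby(level=...) to get the same kind of per-level reduction. The Series s and its index names outer and inner are made up for this example and do not come from the original post.

```python
import pandas as pd

# Hypothetical two-level index, invented purely for illustration.
idx = pd.MultiIndex.from_tuples(
    [('x', 1), ('x', 2), ('y', 1), ('y', 2)],
    names=['outer', 'inner'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Reduce within each value of the chosen index level.
by_outer = s.groupby(level='outer').sum()
by_inner = s.groupby(level='inner').sum()

print(by_outer)
print(by_inner)
```

Each result is indexed by the level that was grouped on, with the other level summed away.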

Some methods, such as idxmin and idxmax, return indirect statistics, for example the index label at which the minimum or maximum value occurs:

>>> df.idxmax()
one    b
two    d
dtype: object
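
A quick way to see what "indirect statistic" means is to feed the returned label back into loc. The sketch below does that on the same df; min_labels, max_labels and largest_one are names I picked for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Index label of the smallest / largest value in each column (NA is skipped).
min_labels = df.idxmin()
max_labels = df.idxmax()

# The label can be used to look the actual value up again.
largest_one = df.loc[max_labels['one'], 'one']

print(min_labels)
print(largest_one)
```

So idxmax tells you where the maximum is, while max tells you what it is.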

Other methods are accumulations, such as the cumulative sum:

>>> df.cumsum()
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
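
Note how the NA in row c stays NA in the output while the running total simply carries on in row d. The sketch below checks this on the same df and also tries cummax, another accumulation that I assume behaves the same way with respect to NA; the names running and running_max are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Running total down each column; NA positions stay NA but do not reset
# or poison the total for later rows.
running = df.cumsum()

# Running maximum works along the same lines.
running_max = df.cummax()

# Because the last row has no NA, the final running total equals the
# plain column sum.
print(running.loc['d'])
print(df.sum())
```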

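describe is neither a reduction nor an accumulation: it produces several summary statistics in one shot. To me it feels much like the summary() function in R.

For numeric data:

>>> df.describe()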
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
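
To see where these numbers come from, the sketch below rebuilds the same table by hand from the individual methods (count, mean, std, min, quantile, max) and compares it with describe. This is only my reconstruction for illustration, not how pandas implements describe internally; manual and summary are names I chose.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

summary = df.describe()

# One reduction per row of the describe() table; all of them skip NA.
manual = pd.DataFrame({
    'count': df.count(),          # number of non-NA values
    'mean': df.mean(),
    'std': df.std(),              # sample standard deviation (ddof=1)
    'min': df.min(),
    '25%': df.quantile(0.25),
    '50%': df.quantile(0.50),     # the median
    '75%': df.quantile(0.75),
    'max': df.max(),
}).T

print(np.allclose(manual.to_numpy(dtype=float), summary.to_numpy(dtype=float)))
```

This also explains why count differs between the two columns: column one has three non-NA values and column two only has two.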