Data Analysis Essentials 2020 (Part 8): Querying Data and Computing Statistics with pandas (Short but Powerful)...

Article link: https://blog.csdn.net/weixin_39541750/article/details/111665847

Edited by: 远方Github (please credit the source when reposting)

Learning data analysis seriously and systematically

This post continues the Python data analysis series. The earlier posts are listed below for review:

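Data Analysis Kickoff: A Simple Application (2019/11/04)
Data Analysis Essentials 2020 (Part 1): NumPy Arrays
Data Analysis Essentials 2020 (Part 2): NumPy Summary, with Python attached at the end of the article
Data Analysis Essentials 2020 (Part 3): Array Shapes and Attributes (with a giveaway)
Data Analysis Essentials (Part 4): Array Conversion, Views, Copies, Indexing, and Broadcasting (here "broadcasting" is shown through an array application: processing an old phone ringtone)
Data Analysis Essentials 2020 (Part 5): Statistics and Linear Algebra (with NumPy and SciPy)
Data Analysis Essentials 2020 (Part 6): Creating Discrete Copies (using recent pork prices in Beijing as the example)
Data Analysis Essentials 2020 (Part 7): Getting Started with pandas and Its Basic Data Structures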

Without further ado, let's get straight to the substance.

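1. How do you query data with pandas?

As the earlier posts showed, a pandas DataFrame resembles a table in a relational database, so querying it works in much the same way: you select rows and columns by label, by position, or by condition.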
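To make the database analogy concrete, here is a minimal sketch that writes a SQL-style filter as a pandas expression. The column names `year` and `sunspot_number` are hypothetical; the five values are copied from the yearly sunspot series used later in this post.

```python
import pandas as pd

# Hypothetical stand-in table: yearly mean sunspot numbers for 2014-2018.
df = pd.DataFrame(
    {
        "year": [2014, 2015, 2016, 2017, 2018],
        "sunspot_number": [113.3, 69.8, 39.8, 21.7, 7.0],
    }
)

# Rough SQL equivalent:
#   SELECT year, sunspot_number FROM df
#   WHERE sunspot_number > 30 ORDER BY sunspot_number
result = (
    df.loc[df["sunspot_number"] > 30, ["year", "sunspot_number"]]
    .sort_values("sunspot_number")
)
print(result)
```

The boolean expression plays the role of the WHERE clause, the column list plays the role of SELECT, and `sort_values` stands in for ORDER BY.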

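Background on the data (you can pick any other dataset as your example): sunspots

The examples below use the well-known sunspot dataset. [Figure: sunspots (source: Baidu)]

Dark regions sometimes appear on the Sun's photosphere. They are places where the magnetic field is concentrated, and they are called sunspots.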

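Getting the sunspot data:

Sunspot data page: https://www.quandl.com/data/SIDC/SUNSPOTS_13-Total-Sunspot-Numbers-13-Month

Daily usage limit: without registering, you can download data from this site through Python at most 50 times per day.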

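Install the module needed to download the sunspot data (on Windows, press Win+R and run cmd to open a command prompt):

pip install Quandl
or
python -m pip install Quandl

The Quandl API is free to use, and the quandl Python module makes it convenient to load the sunspot data directly from Python. Note: once you exceed the call limit, you need to register.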

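(1) Download the data: the 13-month smoothed total sunspot number (monthly records, from 1749 to October 2019 in this download)

(The download is slow, so please be patient.)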

>>> import quandl
>>> # "SIDC/SUNSPOTS_13" is the monthly series (through 2019); "SIDC/SUNSPOTS_A" is the yearly series (through 2018)
>>> sunspots = quandl.get("SIDC/SUNSPOTS_13")
>>> sunspots
            13-Month Smoothed Total Sunspot Number  ...  Definitive/Provisional Indicator
Date                                                ...
1749-01-31                                     NaN  ...                               1.0
1749-02-28                                     NaN  ...                               1.0
1749-03-31                                     NaN  ...                               1.0
1749-04-30                                     NaN  ...                               1.0
1749-05-31                                     NaN  ...                               1.0
...                                            ...  ...                               ...
2019-06-30                                     NaN  ...                               0.0
2019-07-31                                     NaN  ...                               0.0
2019-08-31                                     NaN  ...                               0.0
2019-09-30                                     NaN  ...                               0.0
2019-10-31                                     NaN  ...                               0.0

[3250 rows x 4 columns]

I also downloaded the yearly series, SIDC/SUNSPOTS_A (1700 to 2018), mainly because the monthly data above shows NaN values.

>>> import quandl
>>> sunspots = quandl.get("SIDC/SUNSPOTS_A")
>>> sunspots
            Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
1703-12-31                              38.3                             NaN                     NaN                               1.0
1704-12-31                              60.0                             NaN                     NaN                               1.0
...                                      ...                             ...                     ...                               ...
2014-12-31                             113.3                             8.0                  5273.0                               1.0
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0

[319 rows x 4 columns]
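If the download is unavailable or the call limit has been used up, the queries in this post can still be tried on a small stand-in table. The sketch below rebuilds the last five rows of the yearly output by hand. The variable name `sample` is hypothetical, and the frame is only an offline substitute for what `quandl.get` returns, not a reproduction of it.

```python
import pandas as pd

# Column names and values copied from the SIDC/SUNSPOTS_A output (2014-2018).
columns = [
    "Yearly Mean Total Sunspot Number",
    "Yearly Mean Standard Deviation",
    "Number of Observations",
    "Definitive/Provisional Indicator",
]
rows = [
    [113.3, 8.0, 5273.0, 1.0],
    [69.8, 6.4, 8903.0, 1.0],
    [39.8, 3.9, 9940.0, 1.0],
    [21.7, 2.5, 11444.0, 1.0],
    [7.0, 1.1, 12611.0, 1.0],
]

# A DatetimeIndex named "Date", matching the index shown in the output.
index = pd.DatetimeIndex(
    ["2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"],
    name="Date",
)

sample = pd.DataFrame(rows, index=index, columns=columns)
print(sample.shape)
print(sample.tail(2))
```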

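(2) Show the first and last few rows of the data

Method:

Use head(n) and tail(n) to return the first n rows and the last n rows respectively, where n is the number of rows you want.

With n=4, this gives: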

import quandl

# "SIDC/SUNSPOTS_13" is the monthly series (through 2019); "SIDC/SUNSPOTS_A" is the yearly series (through 2018)
sunspots = quandl.get("SIDC/SUNSPOTS_A")
print(sunspots)
print("First four rows:", sunspots.head(4))
print("Last four rows:", sunspots.tail(4))

Output:

>>> print("First four rows:", sunspots.head(4))
First four rows:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
1703-12-31                              38.3                             NaN                     NaN                               1.0
>>> print("Last four rows:", sunspots.tail(4))
Last four rows:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
>>>

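(3) Query the most recent record (the last day of 2018)

(Note: this looks up a single row. The index labels are period-end dates: December 31 for this yearly series, and the last day of each month for the monthly series.)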

# Sunspot statistics for the most recent year in the data
last_data = sunspots.index[-1]
print("Latest data:", sunspots.loc[last_data])

Output:

>>> last_data = sunspots.index[-1]
>>> print("Latest data:", sunspots.loc[last_data])
Latest data: Yearly Mean Total Sunspot Number        7.0
Yearly Mean Standard Deviation          1.1
Number of Observations              12611.0
Definitive/Provisional Indicator        1.0
Name: 2018-12-31 00:00:00, dtype: float64
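The lookup above relies on the last index label being a period-end date. As a minimal sketch, the snippet below checks that on a hand-built stand-in (three rows copied from the yearly output; the name `sample` is hypothetical) and shows that the label lookup and the positional lookup return the same row.

```python
import pandas as pd

# Hypothetical stand-in for the yearly series: values copied from the output above.
sample = pd.DataFrame(
    {"Yearly Mean Total Sunspot Number": [39.8, 21.7, 7.0]},
    index=pd.DatetimeIndex(["2016-12-31", "2017-12-31", "2018-12-31"], name="Date"),
)

last_date = sample.index[-1]     # label of the final row, a pandas Timestamp
latest = sample.loc[last_date]   # label-based lookup of that row
same_row = sample.iloc[-1]       # position-based lookup of the same row

print(last_date, last_date.is_month_end, last_date.is_year_end)
print(latest.equals(same_row))
```

`Timestamp.is_month_end` and `Timestamp.is_year_end` are a quick way to confirm which kind of period-end date an index holds.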

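(4) Query data within a date range

(Remember: write the dates in year-month-day format. Label slicing on a date index includes an endpoint when it matches an index entry. In the example below neither endpoint is in the index, so only the rows between them are returned.)

Suppose I want the year-end rows between August 8, 2008 and January 1, 2018. The two dates are written as

20080808 and 20180101

The code is: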

print("Rows in the date range:", sunspots["20080808":"20180101"])

Output (every row is dated December 31):

>>> print("Rows in the date range:", sunspots["20080808":"20180101"])
Rows in the date range:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2008-12-31                               4.2                             2.5                  6644.0                               1.0
2009-12-31                               4.8                             2.5                  6465.0                               1.0
2010-12-31                              24.9                             3.4                  6328.0                               1.0
2011-12-31                              80.8                             6.7                  6077.0                               1.0
2012-12-31                              84.5                             6.7                  5753.0                               1.0
2013-12-31                              94.0                             6.9                  5347.0                               1.0
2014-12-31                             113.3                             8.0                  5273.0                               1.0
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
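How the endpoints of a date slice behave is easy to misread from the output above, so here is a small sketch on a hand-built stand-in (four rows copied from that output; the name `sample` is hypothetical). It uses `.loc`, which makes label-based slicing explicit, and compares endpoints that miss the index with endpoints that hit it.

```python
import pandas as pd

# Hypothetical stand-in: year-end rows for 2008-2011, values from the output above.
sample = pd.DataFrame(
    {"Yearly Mean Total Sunspot Number": [4.2, 4.8, 24.9, 80.8]},
    index=pd.DatetimeIndex(
        ["2008-12-31", "2009-12-31", "2010-12-31", "2011-12-31"], name="Date"
    ),
)

# Endpoints that are not index labels: only rows falling between them are returned.
between = sample.loc["20090101":"20110101"]

# Endpoints that are index labels: both boundary rows are included.
inclusive = sample.loc["20081231":"20101231"]

print(between.index.year.tolist())
print(inclusive.index.year.tolist())
```

Unlike ordinary Python slices, a label slice keeps its right endpoint, which matters as soon as a boundary date coincides with a row in the data.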

(5) Query by integer position

print("Rows by position:", sunspots.iloc[[2,1,0,-1,-2]])

Output:

>>> print("Rows by position:", sunspots.iloc[[2,1,0,-1,-2]])
Rows by position:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1702-12-31                              26.7                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1700-12-31                               8.3                             NaN                     NaN                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
>>>

The positions can also be listed in a different order, and the rows come back in the order given:

>>> print("Rows by position:", sunspots.iloc[[-2,-1,0,1,2]])
Rows by position:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
>>>

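(6) Query a single value

In other words, look up the value at a given row and column position, much like indexing an element of a matrix or array. Positions are zero-based, so iloc[2, 3] is the third row and fourth column.

print("Row 3, column 4:", sunspots.iloc[2, 3])
print("Row 2, column 1:", sunspots.iat[1, 0])

Output:

>>> print("Row 3, column 4:", sunspots.iloc[2, 3])
Row 3, column 4: 1.0
>>> print("Row 2, column 1:", sunspots.iat[1, 0])
Row 2, column 1: 18.3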
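The same cell can also be reached by label rather than by position. The sketch below, on a hand-built stand-in (three rows copied from the yearly output; the name `sample` is hypothetical), pairs the positional accessor `iat` with its label-based counterpart `at`.

```python
import pandas as pd

# Hypothetical stand-in: the first three rows of the yearly series, two columns only.
sample = pd.DataFrame(
    {
        "Yearly Mean Total Sunspot Number": [8.3, 18.3, 26.7],
        "Definitive/Provisional Indicator": [1.0, 1.0, 1.0],
    },
    index=pd.DatetimeIndex(["1700-12-31", "1701-12-31", "1702-12-31"], name="Date"),
)

# Second row, first column, addressed by zero-based position.
by_position = sample.iat[1, 0]

# The same cell, addressed by row label and column name.
by_label = sample.at[pd.Timestamp("1701-12-31"), "Yearly Mean Total Sunspot Number"]

print(by_position, by_label)
```

Label access keeps working if rows are reordered or columns are added, while positional access is convenient when the layout is fixed.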

(7) Boolean selection

This uses the mean function: mean()

The query below keeps every value that is greater than its column's mean. Cells that fail the test are shown as NaN.

print("Boolean selection",sunspots[sunspots > sunspots.mean()])

Output:

>>> print("Boolean selection",sunspots[sunspots > sunspots.mean()])
Boolean selection             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               NaN                             NaN                     NaN                               NaN
1701-12-31                               NaN                             NaN                     NaN                               NaN
1702-12-31                               NaN                             NaN                     NaN                               NaN
1703-12-31                               NaN                             NaN                     NaN                               NaN
1704-12-31                               NaN                             NaN                     NaN                               NaN
...                                      ...                             ...                     ...                               ...
2014-12-31                             113.3                             8.0                  5273.0                               NaN
2015-12-31                               NaN                             NaN                  8903.0                               NaN
2016-12-31                               NaN                             NaN                  9940.0                               NaN
2017-12-31                               NaN                             NaN                 11444.0                               NaN
2018-12-31                               NaN                             NaN                 12611.0                               NaN

[319 rows x 4 columns]
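Masking the whole frame keeps every row and fills failing cells with NaN. If the goal is instead to keep only the rows where one column is above its mean, a boolean Series built from that column does the job. A minimal sketch on a hand-built stand-in (values copied from the 2014-2018 rows; the names `sample`, `mask`, and `above_mean` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in: two columns of the yearly series for 2014-2018.
sample = pd.DataFrame(
    {
        "Yearly Mean Total Sunspot Number": [113.3, 69.8, 39.8, 21.7, 7.0],
        "Number of Observations": [5273.0, 8903.0, 9940.0, 11444.0, 12611.0],
    },
    index=pd.DatetimeIndex(
        ["2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"],
        name="Date",
    ),
)

col = "Yearly Mean Total Sunspot Number"

# Row filter: one True/False flag per row, whole rows kept where the flag is True.
mask = sample[col] > sample[col].mean()
above_mean = sample[mask]

# Frame-wide mask: same shape as the input, failing cells replaced by NaN.
masked = sample[sample > sample.mean()]

print(above_mean.index.year.tolist())
print(masked.isna().sum().tolist())
```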

2. Statistical calculations with a pandas DataFrame

For convenience, here are short descriptions of the statistical functions:

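idxmin  index label of the minimum value
idxmax  index label of the maximum value
describe  several summary statistics in one call
count  number of non-NA values
min  minimum
max  maximum
argmin  integer position of the minimum value
argmax  integer position of the maximum value
sum  sum
mean  arithmetic mean
median  median
mad  mean absolute deviation from the mean
var  sample variance
std  sample standard deviation
skew  sample skewness (third moment)
kurt  sample kurtosis (fourth moment)
cumsum  cumulative sum
cummin, cummax  cumulative minimum and cumulative maximum
cumprod  cumulative product
diff  first-order difference
pct_change  percentage change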

A few of these functions are demonstrated below; the others are used the same way. The code is:

import quandl

# Statistical calculations
# Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual
# PyPi url https://pypi.python.org/pypi/Quandl
sunspots = quandl.get("SIDC/SUNSPOTS_A")
print("Describe", sunspots.describe())
print("Non NaN observations", sunspots.count())
# Note: mad() was removed in pandas 2.0; there, use (sunspots - sunspots.mean()).abs().mean()
print("MAD", sunspots.mad())
print("Median", sunspots.median())
print("Min", sunspots.min())
print("Max", sunspots.max())
print("Mode", sunspots.mode())
print("Standard Deviation", sunspots.std())
print("Variance", sunspots.var())
print("Skewness", sunspots.skew())
print("Kurtosis", sunspots.kurt())

Output:

>>> print("Describe", sunspots.describe())
Describe        Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
count                        319.000000                      201.000000              201.000000                             319.0
mean                          78.970533                        7.947761             1572.751244                               1.0
std                           62.019871                        3.840522             2667.888556                               0.0
min                            0.000000                        1.100000              150.000000                               1.0
25%                           24.800000                        4.700000              365.000000                               1.0
50%                           65.800000                        7.600000              365.000000                               1.0
75%                          115.750000                       10.400000              366.000000                               1.0
max                          269.300000                       19.100000            12611.000000                               1.0
>>> print("Non NaN observations", sunspots.count())
Non NaN observations Yearly Mean Total Sunspot Number    319
Yearly Mean Standard Deviation      201
Number of Observations              201
Definitive/Provisional Indicator    319
dtype: int64
>>> print("MAD", sunspots.mad())
MAD Yearly Mean Total Sunspot Number      50.954279
Yearly Mean Standard Deviation         3.155848
Number of Observations              1990.750773
Definitive/Provisional Indicator       0.000000
dtype: float64
>>> print("Median", sunspots.median())
Median Yearly Mean Total Sunspot Number     65.8
Yearly Mean Standard Deviation        7.6
Number of Observations              365.0
Definitive/Provisional Indicator      1.0
dtype: float64
>>> print("Min", sunspots.min())
Min Yearly Mean Total Sunspot Number      0.0
Yearly Mean Standard Deviation        1.1
Number of Observations              150.0
Definitive/Provisional Indicator      1.0
dtype: float64
>>> print("Max", sunspots.max())
Max Yearly Mean Total Sunspot Number      269.3
Yearly Mean Standard Deviation         19.1
Number of Observations              12611.0
Definitive/Provisional Indicator        1.0
dtype: float64
>>> print("Mode", sunspots.mode())
Mode    Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
0                              18.3                             9.2                   365.0                               1.0
>>> print("Standard Deviation", sunspots.std())
Standard Deviation Yearly Mean Total Sunspot Number      62.019871
Yearly Mean Standard Deviation         3.840522
Number of Observations              2667.888556
Definitive/Provisional Indicator       0.000000
dtype: float64
>>> print("Variance", sunspots.var())
Variance Yearly Mean Total Sunspot Number    3.846464e+03
Yearly Mean Standard Deviation      1.474961e+01
Number of Observations              7.117629e+06
Definitive/Provisional Indicator    0.000000e+00
dtype: float64
>>> print("Skewness", sunspots.skew())
Skewness Yearly Mean Total Sunspot Number    0.810785
Yearly Mean Standard Deviation      0.546692
Number of Observations              1.972382
Definitive/Provisional Indicator    0.000000
dtype: float64
>>> print("Kurtosis", sunspots.kurt())
Kurtosis Yearly Mean Total Sunspot Number   -0.127610
Yearly Mean Standard Deviation     -0.252353
Number of Observations              2.728810
Definitive/Provisional Indicator    0.000000
dtype: float64
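Several of the listed functions (idxmax, cumsum, diff, pct_change) are not exercised above. The sketch below tries them on a five-value stand-in Series copied from the 2014-2018 rows (the name `s` is hypothetical). It also computes the mean absolute deviation by hand, which should work on pandas versions that no longer provide mad().

```python
import pandas as pd

# Hypothetical stand-in: yearly mean sunspot numbers for 2014-2018.
s = pd.Series(
    [113.3, 69.8, 39.8, 21.7, 7.0],
    index=pd.DatetimeIndex(
        ["2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"],
        name="Date",
    ),
    name="Yearly Mean Total Sunspot Number",
)

print("idxmax:", s.idxmax())                   # index label of the largest value
print("cumsum:", s.cumsum().round(1).tolist()) # running total
print("diff:", s.diff().round(1).tolist())     # change from the previous row; first entry is NaN
print("pct_change:", s.pct_change().round(3).tolist())  # relative change from the previous row

# Mean absolute deviation from the mean, written out explicitly.
mad = (s - s.mean()).abs().mean()
print("mad:", round(mad, 3))
```

Because diff and pct_change compare each row with the one before it, their first entry is always NaN.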

Coming up next: how do you aggregate data?