Using Pandas (Part 5)

Small-J

Category: Python Data Analysis. Tags: python, data analysis

Article link: https://blog.csdn.net/qq_37662827/article/details/106258253

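5-5 Indexes and Hierarchical Indexes

Viewing the index

df.index
- Returns the row index
- Note: a single index element cannot be assigned on its own; the index can only be replaced as a whole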

In [6]: import pandas as pd

In [7]: import numpy as np

In [8]: df = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('qwer'))

In [9]: df
Out[9]:
   q  w   e   r
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [10]: # View the index

In [11]: df.index
Out[11]: Index(['a', 'b', 'c'], dtype='object')

In [12]: # A single index element cannot be assigned or modified

In [13]: df.index[0] = 'e'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-57fd5743f906> in <module>
----> 1 df.index[0] = 'e'

d:\python3.6.5\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   4258
   4259     def __setitem__(self, key, value):
-> 4260         raise TypeError("Index does not support mutable operations")
   4261
   4262     def __getitem__(self, key):

TypeError: Index does not support mutable operations

In [14]: # The index can only be changed by assigning a whole new index

In [16]: df.index = list('nms')

In [17]: df
Out[17]:
   q  w   e   r
n  0  1   2   3
m  4  5   6   7
s  8  9  10  11
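
When only one label needs to change, rename is a common alternative to rebuilding the whole index. The snippet below is a minimal sketch and not part of the original session; the variable name renamed is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('qwer'))

# rename maps old labels to new ones and returns a new DataFrame;
# the original df keeps its index.
renamed = df.rename(index={'a': 'e'})

print(renamed.index.tolist())
print(df.index.tolist())
```

Because rename returns a copy by default, the result has to be assigned to a name if it should be kept.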

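Reindexing

df.reindex()
- Labels in the new index that have no matching row are filled with NaN
- Passing only some of the existing labels keeps just those rows, which works like a selection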

In [22]: df
Out[22]:
   q  w   e   r
n  0  1   2   3
m  4  5   6   7
s  8  9  10  11

In [23]: # Reindex df

In [24]: df.reindex(list('nma'))
Out[24]:
     q    w    e    r
n  0.0  1.0  2.0  3.0
m  4.0  5.0  6.0  7.0
a  NaN  NaN  NaN  NaN

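In [25]: # Labels in the new index with no matching row show up as NaN

In [26]: # If the new index lists only some of the existing labels, the result is just those rows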

In [27]: df.reindex(list('ns'))
Out[27]:
   q  w   e   r
n  0  1   2   3
s  8  9  10  11
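
The NaN fill also turns the integer columns into floats, as the first output above shows. One way around that, sketched here under the assumption that 0 is an acceptable placeholder for missing rows, is the fill_value argument of reindex:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('nms'), columns=list('qwer'))

# 'a' does not exist in df, so its row is filled with 0 instead of NaN.
# Since no NaN is introduced, the integer dtype is preserved.
filled = df.reindex(list('nma'), fill_value=0)

print(filled)
print(filled.dtypes)
```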

Setting the index

df.set_index()
- Turns a DataFrame column into the row index

In [29]: df
Out[29]:
   q  w   e   r
n  0  1   2   3
m  4  5   6   7
s  8  9  10  11

In [30]: # set_index turns a DataFrame column into the row index

In [31]: df.set_index('q')
Out[31]:
   w   e   r
q
0  1   2   3
4  5   6   7
8  9  10  11

In [32]: # set_index has a drop parameter

In [33]: # drop defaults to True; with drop=False the column is also kept as a regular column

In [34]: df.set_index('q', drop=False)
Out[34]:
   q  w   e   r
q
0  0  1   2   3
4  4  5   6   7
8  8  9  10  11

Returning the unique index values

df.set_index('q').index.unique()
- df.set_index('q').index : the resulting Index
- unique : drops repeated index values
In [48]: df
Out[48]:
   q  w   e   r
n  0  1   2   3
m  8  5   6   7
s  8  9  10  11

In [49]: # unique shows the distinct values, which helps check whether the column is a unique key

In [50]: df.set_index('q').index.unique()
Out[50]: Int64Index([0, 8], dtype='int64', name='q')

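Hierarchical indexes

Hierarchical indexing is an important pandas feature that lets you have multiple (two or more) index levels on one axis.

In [52]: # Repeated values in an outer level are displayed as blanks; pass a list to set_index to build a multi-level index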
In [53]: df.set_index(['q','w'])
Out[53]:
      e   r
q w
0 1   2   3
8 5   6   7
  9  10  11
    

In [55]: df1 = pd.DataFrame({'a': range(7),'b':range(7,0,-1),'c':['one','one','one','two','two',
    ...: 'two','two'],'d':list('hjklmno')})

In [56]: df1
Out[56]:
   a  b    c  d
0  0  7  one  h
1  1  6  one  j
2  2  5  one  k
3  3  4  two  l
4  4  3  two  m
5  5  2  two  n
6  6  1  two  o

In [57]: df2 = df1.set_index(['c','d'])

In [58]: df2
Out[58]:
       a  b
c   d
one h  0  7
    j  1  6
    k  2  5
two l  3  4
    m  4  3
    n  5  2
    o  6  1

Hierarchical indexes and slicing

loc
iloc
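
The original post only names loc and iloc here. As a rough sketch of how they behave on the df2 built above (the variable names outer, single, and first_two are made up for this example):

```python
import pandas as pd

df1 = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': list('hjklmno'),
})
df2 = df1.set_index(['c', 'd'])

# A single outer label selects every row under it; the outer level is dropped.
outer = df2.loc['one']

# A tuple selects one (outer, inner) combination and returns that row as a Series.
single = df2.loc[('two', 'm')]

# iloc ignores the labels and works purely by position.
first_two = df2.iloc[:2]

print(outer)
print(single)
print(first_two)
```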

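Swapping index levels

The swap is between the inner and outer index levels.

df.swaplevel(i=level1, j=level2)
- Swaps the inner and outer levels of an index created with set_index
- i and j are the levels to swap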

In [21]: # Create a two-dimensional DataFrame

In [22]: df = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('qwer'))

In [23]: # Set a multi-level index

In [24]: df
Out[24]:
   q  w   e   r
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
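
The session above stops before swaplevel is actually called. A minimal sketch of the call, assuming the same df and a (q, w) index built with set_index, could look like this (df_qw and swapped are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('qwer'))

# Build a two-level index with q as the outer level and w as the inner level.
df_qw = df.set_index(['q', 'w'])

# With no arguments, swaplevel swaps the two innermost levels,
# so w becomes the outer level and q the inner one.
swapped = df_qw.swaplevel()

print(swapped)
```

The data rows are not reordered by the swap; only the order of the levels changes.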

Hierarchical indexes can also be sorted

sort_index(ascending=True)
- ascending : True by default, which sorts in ascending order; set it to False for descending order

In [32]: df1
Out[32]:
      e   r
w q
1 0   2   3
5 4   6   7
9 8  10  11

In [33]: df1.sort_index()
Out[33]:
      e   r
w q
1 0   2   3
5 4   6   7
9 8  10  11

In [33]: # View the source
In [34]: df1.sort_index??
Signature:
df1.sort_index(
    axis=0,
    level=None,
    ascending=True,
    inplace=False,
    kind='quicksort',
    na_position='last',
    sort_remaining=True,
    by=None,
)
Docstring:
Sort object by labels (along an axis).

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    The axis along which to sort.  The value 0 identifies the rows,
    and 1 identifies the columns.
level : int or level name or list of ints or list of level names
    If not None, sort on values in specified index level(s).
ascending : bool, default True
    Sort ascending vs. descending.
inplace : bool, default False
    If True, perform operation in-place.
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
    Choice of sorting algorithm. See also ndarray.np.sort for more
    information.  `mergesort` is the only stable algorithm. For
    DataFrames, this option is only applied when sorting on a single
    column or label.
na_position : {'first', 'last'}, default 'last'
    Puts NaNs at the beginning if `first`; `last` puts NaNs at the end.
    Not implemented for MultiIndex.
sort_remaining : bool, default True
    If True and sorting by level and index is multilevel, sort by other
    levels too (in order) after sorting by specified level.

Returns
-------
sorted_obj : DataFrame or None

In [35]: df1.sort_index(ascending=False)
Out[35]:
      e   r
w q
9 8  10  11
5 4   6   7
1 0   2   3

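In [36]: # The data is already sorted in ascending order, so a plain sort_index() shows no visible change

In [37]: # That is why the example above sorts in descending order

In [38]: # sort_index()

In [39]: # has an ascending parameter

In [40]: # It defaults to True (ascending); set ascending=False for descending order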
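
Because the data above is already ordered, the effect of sort_index is hard to see. The sketch below uses a small, deliberately unordered frame (the column names k1, k2, and v are invented for this example) and also shows the level parameter from the docstring:

```python
import pandas as pd

df = pd.DataFrame({
    'k1': ['b', 'a', 'b', 'a'],
    'k2': [2, 1, 1, 2],
    'v': [10, 20, 30, 40],
}).set_index(['k1', 'k2'])

# Default: ascending, outer level first, then inner level.
asc = df.sort_index()

# Descending on all levels.
desc = df.sort_index(ascending=False)

# Sort by the inner level first; the remaining level breaks ties
# because sort_remaining defaults to True.
by_inner = df.sort_index(level='k2')

print(asc)
print(desc)
print(by_inner)
```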

Aggregation functions

Aggregations such as mean and sum can be applied.

In [53]: df1
Out[53]:
      e   r
w q
1 0   2   3
5 4   6   7
9 8  10  11

In [54]: df1.sum()
Out[54]:
e    18
r    21
dtype: int64

# level=1 selects the inner index level, so the aggregation is computed per inner-level value
In [55]: df1.sum(level=1)
Out[55]:
    e   r
q
0   2   3
4   6   7
8  10  11
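
A caveat for newer environments: to my knowledge the level argument of sum was deprecated in pandas 1.3 and removed in 2.0, so the call above may fail on a recent install. Grouping by the index level is the usual replacement. A sketch, assuming df1 was built with set_index(['w', 'q']) as its printed index suggests:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('qwer'))
df1 = df.set_index(['w', 'q'])

# Group rows by the values of the inner level 'q', then sum each group.
per_q = df1.groupby(level='q').sum()

print(per_q)
```

Referring to the level by name instead of by position keeps the call readable when the level order changes.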

Moving a multi-level index back into the data

reset_index()

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('qwer'))

In [4]: df
Out[4]:
   q  w   e   r
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [5]: # Set a multi-level index

In [6]: df1 = df.set_index(['q','w','r'])

In [7]: df1
Out[7]:
         e
q w r
0 1 3    2
4 5 7    6
8 9 11  10

In [8]: # reset_index turns the index levels back into regular columns

In [9]: df1 = df1.reset_index()

In [10]: df1
Out[10]:
   q  w   r   e
0  0  1   3   2
1  4  5   7   6
2  8  9  11  10
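
reset_index does not have to move every level at once. As a small sketch on the same data, its level parameter moves only the named level back into the columns (partial is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('qwer'))
df1 = df.set_index(['q', 'w', 'r'])

# Only level 'r' becomes a column again; 'q' and 'w' stay in the index.
partial = df1.reset_index(level='r')

print(partial)
```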

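5-6 Time Series

Time series preface

Time series are an important form of structured data in many fields, such as finance, ecology, and physics. Data observed at multiple points in time form a time series. A time series can have a fixed frequency or be irregular.

Creating a time-series index from datetime objects rather than pandas functions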

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from datetime import datetime

In [4]: dates = [datetime(2020,5,18),datetime(2020,5,19),datetime(2020,5,20)]

In [5]: Sr = pd.Series(np.random.randint(20,40, size=3), index=dates)

In [6]: Sr
Out[6]:
2020-05-18    34
2020-05-19    33
2020-05-20    33
dtype: int32

In [7]: Sr.index
Out[7]: DatetimeIndex(['2020-05-18', '2020-05-19', '2020-05-20'], dtype='datetime64[ns]', freq=None)

In [8]: # Take a subset of the data for arithmetic

In [9]: Sr[::2]
Out[9]:
2020-05-18    34
2020-05-20    33
dtype: int32

In [10]: Sr1 = Sr[::2]

In [11]: # Arithmetic aligns on the index automatically; a label missing from either operand yields NaN
    
In [12]: Sr + Sr1
Out[12]:
2020-05-18    68.0
2020-05-19     NaN
2020-05-20    66.0
dtype: float64

In [13]: # The dtype has nanosecond resolution

In [14]: Sr.index
Out[14]: DatetimeIndex(['2020-05-18', '2020-05-19', '2020-05-20'], dtype='datetime64[ns]', freq=None)

In [15]: Sr.index.dtype
Out[15]: dtype('<M8[ns]')
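
If the NaN produced by alignment is not wanted, the add method with fill_value is one option. This is a sketch with fixed values in place of the random ones above, and it assumes that treating a missing observation as 0 makes sense for the data:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2020, 5, 18), datetime(2020, 5, 19), datetime(2020, 5, 20)]
sr = pd.Series([34, 33, 33], index=dates)
sr1 = sr[::2]  # keeps 2020-05-18 and 2020-05-20

# The + operator would give NaN for 2020-05-19.
# add(..., fill_value=0) substitutes 0 for the missing operand instead.
total = sr.add(sr1, fill_value=0)

print(total)
```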

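Time series basics

Introduction to time series

The basic kind of time series in pandas is a Series indexed by timestamps, which outside pandas are usually represented as Python strings or datetime objects.

Notes

datetime objects can be used as an index; the result is a DatetimeIndex
<M8[ns] is a timestamp dtype with nanosecond resolution
Each element of the index is a Timestamp object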
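
These notes can be checked directly. A short sketch with fixed values (first is just an illustrative name):

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2020, 5, 18), datetime(2020, 5, 19)]
sr = pd.Series([34, 33], index=dates)

# A list of datetime objects used as an index becomes a DatetimeIndex.
print(type(sr.index))

# Each element pulled out of the index is a pandas Timestamp,
# which still compares equal to the original datetime.
first = sr.index[0]
print(type(first), first.year, first.month, first.day)
```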

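Generating a time-series index

pd.date_range(start=None,end=None,periods=None,freq=None,tz=None,normalize=False,name=None,closed=None)
- start : start time
- end : end time
- periods : number of timestamps to generate
- freq : date offset (frequency)
  - h : hour
  - min : minute
  - s : second
  - D : day
  - W : week
  - M : month
  - Y : year
- normalize : normalize the start/end to midnight (00:00:00) before generating the range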

In [40]: dt = pd.date_range(start='20200101', end='20200520',freq='1h')

In [41]: dt
Out[41]:
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:00:00',
               '2020-01-01 02:00:00', '2020-01-01 03:00:00',
               '2020-01-01 04:00:00', '2020-01-01 05:00:00',
               '2020-01-01 06:00:00', '2020-01-01 07:00:00',
               '2020-01-01 08:00:00', '2020-01-01 09:00:00',
               ...
               '2020-05-19 15:00:00', '2020-05-19 16:00:00',
               '2020-05-19 17:00:00', '2020-05-19 18:00:00',
               '2020-05-19 19:00:00', '2020-05-19 20:00:00',
               '2020-05-19 21:00:00', '2020-05-19 22:00:00',
               '2020-05-19 23:00:00', '2020-05-20 00:00:00'],
              dtype='datetime64[ns]', length=3361, freq='H')

In [42]: # Specifying minutes

In [43]: dt = pd.date_range(start='20200101', end='20200520',freq='1h30min')

In [44]: dt
Out[44]:
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:30:00',
               '2020-01-01 03:00:00', '2020-01-01 04:30:00',
               '2020-01-01 06:00:00', '2020-01-01 07:30:00',
               '2020-01-01 09:00:00', '2020-01-01 10:30:00',
               '2020-01-01 12:00:00', '2020-01-01 13:30:00',
               ...
               '2020-05-19 10:30:00', '2020-05-19 12:00:00',
               '2020-05-19 13:30:00', '2020-05-19 15:00:00',
               '2020-05-19 16:30:00', '2020-05-19 18:00:00',
               '2020-05-19 19:30:00', '2020-05-19 21:00:00',
               '2020-05-19 22:30:00', '2020-05-20 00:00:00'],
              dtype='datetime64[ns]', length=2241, freq='90T')

In [45]: # Specifying seconds

In [46]: dt = pd.date_range(start='20200101', end='20200520',freq='1h30min30s')

In [47]: dt
Out[47]:
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:30:30',
               '2020-01-01 03:01:00', '2020-01-01 04:31:30',
               '2020-01-01 06:02:00', '2020-01-01 07:32:30',
               '2020-01-01 09:03:00', '2020-01-01 10:33:30',
               '2020-01-01 12:04:00', '2020-01-01 13:34:30',
               ...
               '2020-05-19 09:29:00', '2020-05-19 10:59:30',
               '2020-05-19 12:30:00', '2020-05-19 14:00:30',
               '2020-05-19 15:31:00', '2020-05-19 17:01:30',
               '2020-05-19 18:32:00', '2020-05-19 20:02:30',
               '2020-05-19 21:33:00', '2020-05-19 23:03:30'],
              dtype='datetime64[ns]', length=2228, freq='5430S')

In [48]: # Specifying days

In [49]: dt = pd.date_range(start='20200101', end='20200520',freq='1D')

In [50]: dt
Out[50]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2020-05-11', '2020-05-12', '2020-05-13', '2020-05-14',
               '2020-05-15', '2020-05-16', '2020-05-17', '2020-05-18',
               '2020-05-19', '2020-05-20'],
              dtype='datetime64[ns]', length=141, freq='D')

In [51]: # Specifying weeks

In [52]: dt = pd.date_range(start='20200101', end='20200520',freq='1W')

In [53]: dt
Out[53]:
DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
               '2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23',
               '2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22',
               '2020-03-29', '2020-04-05', '2020-04-12', '2020-04-19',
               '2020-04-26', '2020-05-03', '2020-05-10', '2020-05-17'],
              dtype='datetime64[ns]', freq='W-SUN')

In [54]: # Specifying months

In [55]: dt = pd.date_range(start='20200101', end='20200520',freq='1M')

In [56]: dt
Out[56]: DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30'], dtype='datetime64[ns]', freq='M')

# periods=5 generates five timestamps
# When end is not given, periods determines the length of the range; when freq is not set, the default D (daily) is used
In [57]: dt = pd.date_range(start='20200101',periods=5)

In [58]: dt
Out[58]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05'],
              dtype='datetime64[ns]', freq='D')

In [21]: # periods fixes the number of timestamps

In [22]: # normalize sets the time part of each timestamp to midnight

In [23]: df = pd.date_range(start='2020-05-21', periods=5, normalize=True)

In [24]: df
Out[24]:
DatetimeIndex(['2020-05-21', '2020-05-22', '2020-05-23', '2020-05-24',
               '2020-05-25'],
              dtype='datetime64[ns]', freq='D')
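
The start value above has no time of day, so normalize makes no visible difference there. A sketch with a start that does carry a time (14:25 is an arbitrary choice for this example) shows the effect:

```python
import pandas as pd

# Without normalize, every timestamp keeps the 14:25 time of day.
raw = pd.date_range(start='2020-05-21 14:25', periods=3)

# With normalize=True, the start is snapped to midnight first.
norm = pd.date_range(start='2020-05-21 14:25', periods=3, normalize=True)

print(raw)
print(norm)
```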

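Indexing and selecting time-series data

Select values from a time series with []
The year, month, and day in the key can be separated by spaces
They can also be joined with -
loc, iloc, and similar accessors are supported as well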

In [25]: ts = pd.Series(np.random.randint(20,50,size=100),index=pd.date_range(start='20200521',periods=100))

In [26]: ts
Out[26]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
              ..
2020-08-24    34
2020-08-25    25
2020-08-26    44
2020-08-27    23
2020-08-28    41
Freq: D, Length: 100, dtype: int32

In [27]: # periods is the number of timestamps; since end is not given, freq defaults to D, i.e. one day

In [28]: # Index into the time series

In [29]: # Select the data for 2020

In [30]: ts['2020']
Out[30]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
              ..
2020-08-24    34
2020-08-25    25
2020-08-26    44
2020-08-27    23
2020-08-28    41
Freq: D, Length: 100, dtype: int32

In [31]: # Select the data for May 2020

In [33]: ts['2020 05']
Out[33]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
2020-05-26    29
2020-05-27    27
2020-05-28    32
2020-05-29    40
2020-05-30    38
2020-05-31    35
Freq: D, dtype: int32

In [34]: # Separate the year, month, and day with spaces

In [35]: # Select 2020-05-01 through 2020-05-10 (empty here, because the data starts on 2020-05-21)

In [36]: ts['2020 05 01' : '2020 05 10']
Out[36]: Series([], Freq: D, dtype: int32)

In [37]: ts
Out[37]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
              ..
2020-08-24    34
2020-08-25    25
2020-08-26    44
2020-08-27    23
2020-08-28    41
Freq: D, Length: 100, dtype: int32

In [38]: # Select all the data for May 2020

In [39]: ts['2020 05 21':'2020 05 31']
Out[39]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
2020-05-26    29
2020-05-27    27
2020-05-28    32
2020-05-29    40
2020-05-30    38
2020-05-31    35
Freq: D, dtype: int32
        
In [40]: ts.loc['2020-05']
Out[40]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
2020-05-26    29
2020-05-27    27
2020-05-28    32
2020-05-29    40
2020-05-30    38
2020-05-31    35
Freq: D, dtype: int32
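
The same selections can be reproduced with predictable values. This sketch replaces the random data with 0..99 so the results are easy to verify, and uses loc throughout; may, window, and empty are illustrative names:

```python
import pandas as pd

ts = pd.Series(range(100), index=pd.date_range(start='2020-05-21', periods=100))

# A partial date string selects everything inside that month.
may = ts.loc['2020-05']

# A slice of date strings includes both endpoints.
window = ts.loc['2020-06-01':'2020-06-07']

# A range that lies entirely before the data returns an empty Series, not an error.
empty = ts.loc['2020-05-01':'2020-05-10']

print(len(may), len(window), len(empty))
```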

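Time series can also contain duplicate index values

df.index.is_unique
- Checks whether the index contains duplicate values
- True means the index has no duplicates
- False means the index has duplicates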

In [51]: dates = [datetime(2020, 5, 21),datetime(2020, 5, 21),datetime(2020, 5, 22),datetime(2020, 5, 23)]

In [52]: dates
Out[52]:
[datetime.datetime(2020, 5, 21, 0, 0),
 datetime.datetime(2020, 5, 21, 0, 0),
 datetime.datetime(2020, 5, 22, 0, 0),
 datetime.datetime(2020, 5, 23, 0, 0)]

In [53]: st = pd.Series(np.random.randint(20,30,size=4),index=dates)

In [54]: st
Out[54]:
2020-05-21    23
2020-05-21    28
2020-05-22    29
2020-05-23    24
dtype: int32

In [55]: # Check for duplicate index values

In [56]: # False means there are duplicates

In [57]: # True means there are none

In [59]: st.index.is_unique
Out[59]: False
    
In [61]: # Selecting by a duplicated label does not raise an error; all matching rows are returned

In [62]: st.loc['2020-05-21']
Out[62]:
2020-05-21    23
2020-05-21    28
dtype: int32
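
is_unique only says whether duplicates exist. To see which rows are duplicates, or to drop them, Index.duplicated can help. A sketch with fixed values; keeping the first occurrence is an assumption here, not something the original post prescribes:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2020, 5, 21), datetime(2020, 5, 21),
         datetime(2020, 5, 22), datetime(2020, 5, 23)]
st = pd.Series([23, 28, 29, 24], index=dates)

# True marks every repeat of a label after its first occurrence.
dup_mask = st.index.duplicated(keep='first')

# Inverting the mask keeps only the first row for each label.
deduped = st[~dup_mask]

print(dup_mask)
print(deduped)
```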

Grouping on a duplicated index

In [70]: dates = [datetime(2020, 5, 21),datetime(2020, 5, 21),datetime(2020, 5, 22),datetime(2020, 5, 22)]

In [71]: st = pd.Series(np.random.randint(20,30,size=4),index=dates)

In [72]: st
Out[72]:
2020-05-21    29
2020-05-21    20
2020-05-22    29
2020-05-22    25
dtype: int32

In [73]: # Group by the duplicated index values, then sum each group

In [74]: st = st.groupby(level=0).sum()

In [75]: st
Out[75]:
2020-05-21    49
2020-05-22    54
dtype: int32

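Shifting dates

"Shifting" means moving data forward or backward in time. Both Series and DataFrame have a shift method for simple forward or backward shifts that leave the index unchanged.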

In [77]: import pandas as pd

In [78]: import numpy as np

In [79]: st = pd.Series(np.random.randint(20,30,size=100),index=pd.date_range(start='20200521',periods=100))

In [80]: st
Out[80]:
2020-05-21    27
2020-05-22    25
2020-05-23    21
2020-05-24    23
2020-05-25    23
              ..
2020-08-24    25
2020-08-25    21
2020-08-26    27
2020-08-27    21
2020-08-28    25
Freq: D, Length: 100, dtype: int32

In [81]: # Shift the values forward by two periods; the first two positions have no earlier data, so they are filled with NaN

In [82]: st.shift(2)
Out[82]:
2020-05-21     NaN
2020-05-22     NaN
2020-05-23    27.0
2020-05-24    25.0
2020-05-25    21.0
              ...
2020-08-24    21.0
2020-08-25    28.0
2020-08-26    25.0
2020-08-27    21.0
2020-08-28    27.0
Freq: D, Length: 100, dtype: float64

In [83]: # Values can also be shifted backward

In [84]: st.shift(-2)
Out[84]:
2020-05-21    21.0
2020-05-22    23.0
2020-05-23    23.0
2020-05-24    22.0
2020-05-25    22.0
              ...
2020-08-24    27.0
2020-08-25    21.0
2020-08-26    25.0
2020-08-27     NaN
2020-08-28     NaN
Freq: D, Length: 100, dtype: float64
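
For contrast, shift also accepts a freq argument, in which case the index moves and the values stay attached to their rows. A sketch with three fixed values (by_position and by_date are illustrative names):

```python
import pandas as pd

st = pd.Series([27, 25, 21], index=pd.date_range(start='2020-05-21', periods=3))

# Plain shift: values move down two positions, the index stays, NaN fills the gap.
by_position = st.shift(2)

# shift with freq: the index itself moves two days later, no data is lost.
by_date = st.shift(2, freq='D')

print(by_position)
print(by_date)
```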

Use cases

Computing the growth rate
- (current day - previous day) / previous day
- current day / previous day - 1
- Series.pct_change()

In [85]: st.pct_change()
Out[85]:
2020-05-21         NaN
2020-05-22   -0.074074
2020-05-23   -0.160000
2020-05-24    0.095238
2020-05-25    0.000000
                ...
2020-08-24   -0.107143
2020-08-25   -0.160000
2020-08-26    0.285714
2020-08-27   -0.222222
2020-08-28    0.190476
Freq: D, Length: 100, dtype: float64
            
# The same result can be computed with shift
In [86]: st/st.shift(1)-1
Out[86]:
2020-05-21         NaN
2020-05-22   -0.074074
2020-05-23   -0.160000
2020-05-24    0.095238
2020-05-25    0.000000
                ...
2020-08-24   -0.107143
2020-08-25   -0.160000
2020-08-26    0.285714
2020-08-27   -0.222222
2020-08-28    0.190476
Freq: D, Length: 100, dtype: float64
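
To make the equivalence of the three formulas concrete, here is a sketch on the first five values shown above (27, 25, 21, 23, 23), comparing them with NumPy's allclose:

```python
import numpy as np
import pandas as pd

st = pd.Series([27, 25, 21, 23, 23], index=pd.date_range(start='2020-05-21', periods=5))

prev = st.shift(1)             # value of the previous day
by_diff = (st - prev) / prev   # (current - previous) / previous
by_ratio = st / prev - 1       # current / previous - 1
by_method = st.pct_change()    # built-in

# equal_nan=True treats the leading NaN of each result as equal.
print(np.allclose(by_diff, by_method, equal_nan=True))
print(np.allclose(by_ratio, by_method, equal_nan=True))
print(by_method.round(6).tolist())
```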

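5-7 Resampling

Introduction to resampling

Resampling is the process of converting a time series from one frequency to another. Converting higher-frequency data to a lower frequency is downsampling; converting lower-frequency data to a higher frequency is upsampling.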

In [87]: import pandas as pd

In [88]: import numpy as np

In [89]: df = pd.DataFrame(np.random.randint(20,30,size=10),index=pd.date_range(start='20200521',periods=10))

In [90]: df
Out[90]:
             0
2020-05-21  25
2020-05-22  25
2020-05-23  23
2020-05-24  23
2020-05-25  20
2020-05-26  22
2020-05-27  23
2020-05-28  28
2020-05-29  23
2020-05-30  22

In [91]: # resample takes the target frequency

In [92]: df.resample('d').mean()
Out[92]:
             0
2020-05-21  25
2020-05-22  25
2020-05-23  23
2020-05-24  23
2020-05-25  20
2020-05-26  22
2020-05-27  23
2020-05-28  28
2020-05-29  23
2020-05-30  22

# Resample by week
In [93]: df.resample('w').mean()
Out[93]:
             0
2020-05-24  24
2020-05-31  23
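
The examples above only downsample. As a sketch of the other direction, the snippet below upsamples three daily values to 12-hour steps; the choice of asfreq and ffill for the new points is an assumption for illustration, and other fill strategies are possible:

```python
import pandas as pd

daily = pd.Series([25, 23, 20], index=pd.date_range(start='2020-05-21', periods=3))

# Downsampling: two-day bins, summed.
down = daily.resample('2D').sum()

# Upsampling: the new 12:00 rows have no data, so asfreq leaves them as NaN.
up = daily.resample('12h').asfreq()

# ffill carries the last known value forward into the new rows.
filled = daily.resample('12h').ffill()

print(down)
print(up)
print(filled)
```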

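Exercise

Given air-quality data for five cities (Beijing, Shanghai, Guangzhou, Shenzhen, and Shenyang), plot how Beijing's PM2.5 changes over time.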

# @Time : 2020/5/21 14:25 
# @Author : SmallJ 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

# Read the CSV file
df = pd.read_csv('PM2.5/BeijingPM20100101_20151231.csv')

# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Look at the first row
df.head(1)

# Build an hourly PeriodIndex from the year/month/day/hour columns
datetime = pd.PeriodIndex(year=df.year, month=df.month, day=df.day, hour=df.hour, freq="h")

# Add it as a new column
df['datetime'] = datetime

# Use the datetime column as the index, modifying df in place
df.set_index(df.datetime, inplace=True)

# The data is hourly; resample it into 7-day bins and average each bin
df = df.resample('7D').mean()

# Drop missing values
data = df['PM_US Post'].dropna()

# Prepare the plot data

x = data.index
y = data.values

# Configure a font that can render the Chinese title
font = {'family': 'SimHei'}
matplotlib.rc('font', **font)

# Set the figure size
plt.figure(figsize=(15, 8), dpi=80)

# Set the title ("PM2.5 in Beijing")
plt.title('北京的PM2.5天气情况')

# Draw the line chart
# x cannot be passed directly because its dtype is period[7D]
plt.plot(range(len(x)), y, color='blue')

# Set the x-axis ticks
# ticks : tick positions
# labels : tick labels
plt.xticks(ticks=range(0, len(x))[::10], labels=x[::10], rotation=45)

# Save the figure
plt.savefig('beijingpm.png')

# Show the figure
plt.show()
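
As a variation on the exercise, the timestamps can also be assembled with pd.to_datetime, which gives a DatetimeIndex. The sketch below runs on a tiny made-up table instead of the real CSV; only the column names year, month, day, hour, and 'PM_US Post' are taken from the script above, and every value is invented. Selecting the PM column before resampling keeps any non-numeric columns out of the mean.

```python
import numpy as np
import pandas as pd

# Stand-in for the CSV: two readings per day over three days (invented values).
raw = pd.DataFrame({
    'year': [2010] * 6,
    'month': [1] * 6,
    'day': [1, 1, 2, 2, 3, 3],
    'hour': [0, 12, 0, 12, 0, 12],
    'PM_US Post': [10.0, 20.0, np.nan, np.nan, 40.0, 60.0],
})

# to_datetime can assemble timestamps from year/month/day/hour columns.
raw.index = pd.to_datetime(raw[['year', 'month', 'day', 'hour']])

# Average per day, then drop the days that had no readings at all.
daily = raw['PM_US Post'].resample('D').mean().dropna()

print(daily)
```

With the real file, a wider bin such as '7D' could be used in place of 'D', matching the script above.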
