Python Data Analysis and Presentation: Week 3 Study Notes (Song Tian et al., Beijing Institute of Technology) - CSDN Blog

The introductory part of the course is almost over.

1. The Pandas library

import pandas as pd

Two data types: Series and DataFrame

The Series type: data plus an index

Custom index

b = pd.Series([9,8,7,6],index=['a','b','c','d'])

b
Out[3]: 
a    9
b    8
c    7
d    6
dtype: int64

Creating from a scalar value

s = pd.Series(25,index=['a','b','c'])  # an index must be supplied; it determines the length

s
Out[7]: 
a    25
b    25
c    25
dtype: int64

Creating from a dict

d = pd.Series({'a':9,'b':8,'c':7})

d
Out[9]: 
a    9
b    8
c    7
dtype: int64

Creating from an ndarray

import numpy as np

n = pd.Series(np.arange(5))

n
Out[12]: 
0    0
1    1
2    2
3    3
4    4
dtype: int32

Basic operations

b = pd.Series([9,8,7,6],['a','b','c','d'])

b
Out[14]: 
a    9
b    8
c    7
d    6
dtype: int64

b.index
Out[15]: Index(['a', 'b', 'c', 'd'], dtype='object')


b.values
Out[17]: array([9, 8, 7, 6], dtype=int64)

b.get('d',100)
Out[18]: 6

Both a Series object and its index can have a name, stored in the .name attribute.
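A minimal sketch of the .name attribute and of .get() with a default, reusing the Series from above. The names 'scores' and 'letter' are made up for illustration.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Both the Series and its index carry an optional .name attribute.
b.name = 'scores'         # hypothetical name for the Series
b.index.name = 'letter'   # hypothetical name for the index

print(b)                  # the printed Series now shows both names

# .get() returns the default when the label is absent, instead of raising.
present = b.get('d', 100)   # label exists, so the stored value is returned
missing = b.get('z', 100)   # label does not exist, so the default is returned
print(present, missing)
```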

The DataFrame type: multiple columns of data sharing the same index

Creating from a two-dimensional ndarray

import pandas as pd

import numpy as np

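A first attempt with a comma typed instead of a dot fails, because Python then looks up reshape as a standalone name:

d = pd.DataFrame(np.arange(10),reshape(2,5))
Traceback (most recent call last):

  File "<ipython-input-3-8f29c41caece>", line 1, in <module>
    d = pd.DataFrame(np.arange(10),reshape(2,5))

NameError: name 'reshape' is not defined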


d = pd.DataFrame(np.arange(10).reshape(2,5))

d
Out[5]: 
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9

Creating from a dict of one-dimensional objects (here, Series)

dt = {'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}

d = pd.DataFrame(dt)

d
Out[11]: 
   one  two
a  1.0    9
b  2.0    8
c  3.0    7
d  NaN    6

pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])
Out[13]: 
   two three
b    8   NaN
c    7   NaN
d    6   NaN

Creating from a dict of lists

d1 = {'one':[1,2,3,4],'two':[9,8,7,6]}

d = pd.DataFrame(d1,index=['a','b','c','d'])

d
Out[16]: 
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

Operations on the data types

How can Series and DataFrame objects be changed?

Adding or reordering: reindexing

.reindex()

import pandas as pd

d1 = {'城市':['北京','上海','广州','深圳','沈阳'],
'环比':[101.5,101.2,101.3,102.0,100.1],
'同比':[101.5,101.2,101.3,102.0,100.1],
'定基':[101.5,101.2,101.3,102.0,100.1]}

d = pd.DataFrame(d1,index=[1,2,3,4,5])

d
Out[4]: 
      同比  城市     定基     环比
1  101.5  北京  101.5  101.5
2  101.2  上海  101.2  101.2
3  101.3  广州  101.3  101.3
4  102.0  深圳  102.0  102.0
5  100.1  沈阳  100.1  100.1

d = d.reindex(index=[5,4,3,2,1])

d
Out[6]: 
      同比  城市     定基     环比
5  100.1  沈阳  100.1  100.1
4  102.0  深圳  102.0  102.0
3  101.3  广州  101.3  101.3
2  101.2  上海  101.2  101.2
1  101.5  北京  101.5  101.5

d = d.reindex(columns=['城市','同比','环比','定基'])

d
Out[8]: 
   城市     同比     环比     定基
5  沈阳  100.1  100.1  100.1
4  深圳  102.0  102.0  102.0
3  广州  101.3  101.3  101.3
2  上海  101.2  101.2  101.2
1  北京  101.5  101.5  101.5

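Other parameters:

fill_value: the value used to fill missing positions created by reindexing

method: the fill method; ffill carries the previous value forward, bfill fills backward from the next value

limit: the maximum number of consecutive positions to fill

copy: defaults to True, which creates a new object; with False, no copy is made when the new index equals the old one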
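A small sketch of how fill_value and method behave, on a made-up Series with a sorted index that has gaps. The variable names are illustrative only.

```python
import pandas as pd

# Labels 1, 3 and 5 are missing from the original index.
s = pd.Series([10, 20, 30], index=[0, 2, 4])
target = list(range(6))

# fill_value: new labels receive this constant instead of NaN.
filled = s.reindex(target, fill_value=0)

# method='ffill': each new label takes the value of the closest earlier label.
forward = s.reindex(target, method='ffill')

# method='bfill': each new label takes the value of the closest later label;
# label 5 has no later label, so it stays NaN.
backward = s.reindex(target, method='bfill')

print(filled.tolist())    # [10, 0, 20, 0, 30, 0]
print(forward.tolist())   # [10, 10, 20, 20, 30, 30]
print(backward)
```

Note that method only works when the existing index is sorted, which is the case here.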

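Common methods of the Index type:

.append(idx): concatenates another Index object, producing a new Index object

.difference(idx): computes the set difference, producing a new Index object

.intersection(idx): computes the intersection

.union(idx): computes the union

.delete(loc): removes the element at position loc

.insert(loc,e): inserts element e at position loc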
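The methods listed above can be tried on two small, made-up Index objects. In current pandas the set difference is spelled .difference(). Every call returns a new Index, because Index objects are immutable.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c', 'd'])
other = pd.Index(['c', 'd', 'e'])

appended = idx.append(other)        # plain concatenation, duplicates are kept
diff = idx.difference(other)        # labels in idx but not in other
common = idx.intersection(other)    # labels present in both
merged = idx.union(other)           # labels present in either
deleted = idx.delete(1)             # drop the element at position 1 ('b')
inserted = idx.insert(1, 'x')       # put 'x' at position 1

for name, result in [('append', appended), ('difference', diff),
                     ('intersection', common), ('union', merged),
                     ('delete', deleted), ('insert', inserted)]:
    print(name, list(result))

print(list(idx))   # the original Index is unchanged
```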

nc = d.columns.delete(2)

ni = d.index.insert(5,6)

nd = d.reindex(index=ni,columns=nc,method='ffill')
Traceback (most recent call last):

  File "<ipython-input-11-ba08f80a2d41>", line 1, in <module>
    nd = d.reindex(index=ni,columns=nc,method='ffill')

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2831, in reindex
    **kwargs)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2404, in reindex
    fill_value, copy).__finalize__(self)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2772, in _reindex_axes
    fill_value, limit, tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2794, in _reindex_columns
    tolerance=tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2833, in reindex
    tolerance=tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2538, in get_indexer
    indexer = self._get_fill_indexer(target, method, limit, tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2564, in _get_fill_indexer
    limit)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2585, in _get_fill_indexer_searchsorted
    side)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3394, in _searchsorted_monotonic
    raise ValueError('index must be monotonic increasing or decreasing')

ValueError: index must be monotonic increasing or decreasing


ni = d.index.insert(5,0)

nd = d.reindex(index=ni,columns=nc,method='ffill')
ValueError: index must be monotonic increasing or decreasing


nd = d.reindex(index=ni,columns=nc).ffill()

nd
Out[15]: 
   城市     同比     定基
5  沈阳  100.1  100.1
4  深圳  102.0  102.0
3  广州  101.3  101.3
2  上海  101.2  101.2
1  北京  101.5  101.5
0  北京  101.5  101.5

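Why the ValueError occurs: reindex(method='ffill') requires the existing labels on each axis being reindexed to be monotonic (sorted ascending or descending). The traceback passes through _reindex_columns, so here the check fails on the column labels, which are not in sorted order. That is why changing the inserted row label from 6 to 0 made no difference.

The workaround, as in the code above, is to reindex without method and then call .ffill() on the result.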
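The error and two possible ways around it can be reproduced on a tiny frame. This is a sketch on made-up data, not the price table above. The two workarounds are not equivalent: .ffill() copies from the row above by position, while reindex(method='ffill') on a sorted index picks the closest earlier label.

```python
import pandas as pd

# Row labels 3, 1, 2 are neither increasing nor decreasing.
d = pd.DataFrame({'x': [1, 2, 3]}, index=[3, 1, 2])

try:
    d.reindex(index=[1, 2, 3, 4], method='ffill')
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# Workaround A: reindex first (label 4 becomes NaN), then fill by position.
nd = d.reindex(index=[3, 1, 2, 4]).ffill()

# Workaround B: sort the index so it is monotonic, then fill by label.
ns = d.sort_index().reindex(index=[1, 2, 3, 4], method='ffill')

print(nd)   # row 4 copies the row above it (label 2)
print(ns)   # row 4 copies the closest earlier label (label 3)
```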

Deleting: drop

a = pd.Series([9,8,7,6],index=['a','b','c','d'])

a
Out[17]: 
a    9
b    8
c    7
d    6
dtype: int64

a.drop(['b','c'])
Out[18]: 
a    9
d    6
dtype: int64
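The notes only show .drop() on a Series. As a hedged extension, the sketch below applies the same method to a DataFrame, where axis=1 selects columns. The labels are made up, and .drop() returns a new object rather than changing the original.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['x', 'y', 'z'],
                  columns=['a', 'b', 'c', 'd'])

rows_dropped = df.drop('y')                # axis=0 is the default: drops a row
cols_dropped = df.drop(['b', 'c'], axis=1) # axis=1: drops columns

print(rows_dropped)
print(cols_dropped)
print(df.shape)   # the original frame still has 3 rows and 4 columns
```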

Arithmetic on pandas data types:

import pandas as pd

import numpy as np

Typing a comma instead of a dot before reshape raises a NameError again:

a = pd.DataFrame(np.arange(12),reshape(3,4))
Traceback (most recent call last):

  File "<ipython-input-21-a8c747b1897a>", line 1, in <module>
    a = pd.DataFrame(np.arange(12),reshape(3,4))

NameError: name 'reshape' is not defined


a = pd.DataFrame(np.arange(12).reshape(3,4))

a
Out[23]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

b = pd.DataFrame(np.arange(20).reshape(4,5))

b
Out[25]: 
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

a+b
Out[26]: 
      0     1     2     3   4
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

b.add(a,fill_value = 0)
Out[27]: 
      0     1     2     3     4
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

a.mul(b,fill_value = 0)
Out[28]: 
      0     1      2      3    4
0   0.0   1.0    4.0    9.0  0.0
1  20.0  30.0   42.0   56.0  0.0
2  80.0  99.0  120.0  143.0  0.0
3   0.0   0.0    0.0    0.0  0.0
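One detail that the examples above do not exercise: fill_value only substitutes for a value that is missing on one side. Where a position is missing from both operands, the result stays NaN. A small sketch with two made-up frames that overlap in a single cell:

```python
import pandas as pd
import numpy as np

a = pd.DataFrame([[1, 2], [3, 4]], index=[0, 1], columns=[0, 1])
b = pd.DataFrame([[10, 20], [30, 40]], index=[1, 2], columns=[1, 2])

# The result covers the union of both indexes and both column sets (3 x 3).
total = a.add(b, fill_value=0)
print(total)

# (1, 1) exists in both frames: 4 + 10.
# (0, 0) exists only in a: 1 + fill_value.
# (0, 2) and (2, 0) exist in neither frame, so they remain NaN.
```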

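Operations between objects of different dimensions are broadcast. By default a Series is aligned with the DataFrame's column labels and applied to every row; pass axis=0 to align it with the row index instead: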

b = pd.DataFrame(np.arange(20).reshape(4,5))

b
Out[31]: 
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

c = pd.Series(np.arange(4))

c
Out[33]: 
0    0
1    1
2    2
3    3
dtype: int32

c-10
Out[34]: 
0   -10
1    -9
2    -8
3    -7
dtype: int32

b-c
Out[35]: 
      0     1     2     3   4
0   0.0   0.0   0.0   0.0 NaN
1   5.0   5.0   5.0   5.0 NaN
2  10.0  10.0  10.0  10.0 NaN
3  15.0  15.0  15.0  15.0 NaN

b.sub(c,axis=0)
Out[36]: 
    0   1   2   3   4
0   0   1   2   3   4
1   4   5   6   7   8
2   8   9  10  11  12
3  12  13  14  15  16

Sorting:

The .sort_index() method sorts along the given axis by index labels, ascending by default.

.sort_index(axis=0,ascending=True)

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])

b
Out[4]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

b.sort_index()
Out[5]: 
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14

b.sort_index(ascending=False)
Out[6]: 
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9

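The .sort_values() method sorts along the given axis by value, ascending by default.

Series.sort_values(axis=0,ascending=True)

DataFrame.sort_values(by,axis=0,ascending=True)

by: a label or list of labels to sort by (column labels when axis=0)

By default, NaN values are placed at the end of the sorted result, whatever the sort direction.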
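The notes give no session for .sort_values(), so here is a short sketch on made-up data. It shows NaN landing last in both directions and the by argument on a DataFrame.

```python
import pandas as pd
import numpy as np

s = pd.Series([3.0, np.nan, 1.0, 2.0])

asc = s.sort_values()                   # 1, 2, 3, NaN
desc = s.sort_values(ascending=False)   # 3, 2, 1, NaN (NaN still last)
print(asc)
print(desc)

df = pd.DataFrame({'k': [2, 1, 2], 'v': [5, 9, 7]})

by_v = df.sort_values(by='v', ascending=False)   # sort rows by one column
by_both = df.sort_values(by=['k', 'v'])          # sort by k, ties broken by v
print(by_v)
print(by_both)
```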

Basic statistical analysis:

.describe()

a = pd.Series([9,8,7,6])

a
Out[8]: 
0    9
1    8
2    7
3    6
dtype: int64

a.describe()
Out[9]: 
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64

a.describe()['count']
Out[10]: 4.0

b.describe()
Out[11]: 
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

b.describe()[2]
Out[12]: 
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64

Cumulative statistics:

.cumsum(): running sum of the first 1, 2, ..., n values

.cumprod(): running product

.cummax(): running maximum

.cummin(): running minimum

b.cumsum()
Out[13]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
d  15  18  21  24  27
b  30  34  38  42  46
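Only .cumsum() is demonstrated above. The sketch below runs all four cumulative methods on one small, made-up Series so they can be compared side by side.

```python
import pandas as pd

s = pd.Series([1, 3, 2, 5])

running_sum = s.cumsum()     # 1, 1+3, 1+3+2, 1+3+2+5
running_prod = s.cumprod()   # 1, 1*3, 1*3*2, 1*3*2*5
running_max = s.cummax()     # largest value seen so far
running_min = s.cummin()     # smallest value seen so far

print(running_sum.tolist())    # [1, 4, 6, 11]
print(running_prod.tolist())   # [1, 3, 6, 30]
print(running_max.tolist())    # [1, 3, 3, 5]
print(running_min.tolist())    # [1, 1, 1, 1]
```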

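Rolling (window) calculations

.rolling(w).sum(): sum of each run of w adjacent elements

.rolling(w).mean(): arithmetic mean

.rolling(w).var(): variance

.rolling(w).std(): standard deviation

.rolling(w).min() / .max(): minimum and maximum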

b.rolling(2).sum()
Out[14]: 
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
d  15.0  17.0  19.0  21.0  23.0
b  25.0  27.0  29.0  31.0  33.0
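The first row of the output above is NaN because a window of 2 is not yet full there. A sketch on a made-up Series shows the same effect with w=3. It also uses the min_periods parameter, which the notes do not cover, to allow partial windows.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

window = s.rolling(3)
sums = window.sum()     # the first two positions have fewer than 3 values: NaN
means = window.mean()
maxes = window.max()

# min_periods=1 lets a window produce a result from as little as one value.
partial = s.rolling(3, min_periods=1).sum()

print(sums)
print(means)
print(maxes)
print(partial.tolist())   # [1.0, 3.0, 6.0, 9.0, 12.0]
```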