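3.1 Introduction to Pandas
3.1.1 Overview of Pandas
The name comes from panel + data + analysis.
- Development was started by Wes McKinney in 2008
- A Python library built for data analysis and data mining
- Built on NumPy, which it relies on for high-performance computation
- Uses Matplotlib for plotting, so simple charts are easy to draw
- Provides its own distinctive data structures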
3.1.2 Why use Pandas
- Convenient data processing
- Easy file reading
- Wraps Matplotlib's plotting and NumPy's computation
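As a rough illustration of the file-reading and computation points above, the sketch below parses a small CSV and derives a new column. The CSV content and the column names (date, open, close) are invented for this example and are not part of the original notes.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for illustration only.
csv_text = """date,open,close
2018-01-01,10.0,10.5
2018-01-02,10.5,10.2
2018-01-03,10.2,10.8
"""

# read_csv accepts a path or any file-like object; StringIO avoids touching disk.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Column arithmetic is vectorized and backed by NumPy arrays.
prices["change"] = prices["close"] - prices["open"]
print(prices)
```

In practice the first argument would usually be a file path; the in-memory buffer is used here only to keep the example self-contained.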
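3.1.3 DataFrame
- Structure: two-dimensional data with both a row index and a column index
  - Row index: identifies each row (one label per horizontal row); index, axis=0
  - Column index: identifies each column (one label per vertical column); columns, axis=1
- Common attributes
  - shape
  - index
  - columns
  - values
  - T
- Methods
  - DataFrame.head()
  - DataFrame.tail()
- Setting the DataFrame index
  - Modifying row and column index values
  - Resetting the index
  - Setting a new index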
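The axis=0 and axis=1 conventions from the outline can be checked on a tiny table. This is a minimal sketch: the labels r0, r1, a, b, c are made up for the example.

```python
import pandas as pd

# Small invented table: two rows, three columns.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=["r0", "r1"], columns=["a", "b", "c"])

# axis=0 runs down the rows, giving one result per column.
col_sums = df.sum(axis=0)
# axis=1 runs across the columns, giving one result per row.
row_sums = df.sum(axis=1)

print(col_sums.tolist())  # [5, 7, 9]
print(row_sums.tolist())  # [6, 15]
```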
import numpy as np
# Simulated changes for 10 stocks over 5 days, drawn from a standard normal distribution
stock_change = np.random.normal(loc=0, scale=1, size=(10, 5))
stock_change
array([[-1.23168733, 0.42368921, -0.06619239, -0.05146062, 0.04914023],
[-0.80122527, 1.52284836, -1.49273893, -0.54990148, -0.96014347],
[ 0.64891876, -0.34520261, 0.90939211, 0.48298636, 1.14078454],
[-1.29226065, -1.42416842, -1.22687598, -0.60255858, -0.06617384],
[ 0.93717564, 0.24636034, 0.98573956, 0.43211993, 0.63089973],
[-0.14848874, 2.5762479 , -1.3517323 , 0.61471629, 1.79526939],
[-0.16038667, 2.36821683, 0.1179758 , 0.23462105, -0.49421883],
[-0.15292821, 0.79710342, 1.27399354, -1.52690597, 0.31501682],
[ 0.96106066, -0.92449658, 0.45836083, -0.0337036 , 0.56269464],
[-1.57881991, 0.33391995, -1.52311807, -0.34654408, 0.25373526]])
1. Structure of a DataFrame
import pandas as pd
pd.DataFrame(stock_change)
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
| 3 | -1.292261 | -1.424168 | -1.226876 | -0.602559 | -0.066174 |
| 4 | 0.937176 | 0.246360 | 0.985740 | 0.432120 | 0.630900 |
| 5 | -0.148489 | 2.576248 | -1.351732 | 0.614716 | 1.795269 |
| 6 | -0.160387 | 2.368217 | 0.117976 | 0.234621 | -0.494219 |
| 7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
stocks = ['股票{}'.format(i) for i in range(10)]
stocks
['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9']
pd.DataFrame(stock_change, index=stocks)
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 股票0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 股票1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 股票2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
| 股票3 | -1.292261 | -1.424168 | -1.226876 | -0.602559 | -0.066174 |
| 股票4 | 0.937176 | 0.246360 | 0.985740 | 0.432120 | 0.630900 |
| 股票5 | -0.148489 | 2.576248 | -1.351732 | 0.614716 | 1.795269 |
| 股票6 | -0.160387 | 2.368217 | 0.117976 | 0.234621 | -0.494219 |
| 股票7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 股票8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 股票9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
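Generating a date index:
pd.date_range(start=None, end=None, periods=None, freq='B')
- start: start date
- end: end date
- periods: number of dates to generate
- freq: step between dates; the default is one calendar day ('D'), while 'B' means business days (weekends are skipped)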
date = pd.date_range(start='20180101', periods=5, freq='B')
date
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05'],
dtype='datetime64[ns]', freq='B')
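To see the weekend skipping that freq='B' implies, it helps to start on a Friday. The sketch below is only an illustration; the dates were chosen because 2018-01-05 falls on a Friday.

```python
import pandas as pd

# 2018-01-05 is a Friday, so the next business day is Monday 2018-01-08.
business_days = pd.date_range(start="2018-01-05", periods=3, freq="B")
print(list(business_days.strftime("%Y-%m-%d")))
# ['2018-01-05', '2018-01-08', '2018-01-09']

# With start and end both given, periods can be omitted;
# freq="D" counts every calendar day, weekends included.
calendar_days = pd.date_range(start="2018-01-05", end="2018-01-09", freq="D")
print(len(calendar_days))  # 5
```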
pd.DataFrame(stock_change, index=stocks, columns=date)
| | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 |
|---|---|---|---|---|---|
| 股票0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 股票1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 股票2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
| 股票3 | -1.292261 | -1.424168 | -1.226876 | -0.602559 | -0.066174 |
| 股票4 | 0.937176 | 0.246360 | 0.985740 | 0.432120 | 0.630900 |
| 股票5 | -0.148489 | 2.576248 | -1.351732 | 0.614716 | 1.795269 |
| 股票6 | -0.160387 | 2.368217 | 0.117976 | 0.234621 | -0.494219 |
| 股票7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 股票8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 股票9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
2. DataFrame attributes
data = pd.DataFrame(stock_change, index=stocks, columns=date)
data
| | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 |
|---|---|---|---|---|---|
| 股票0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 股票1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 股票2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
| 股票3 | -1.292261 | -1.424168 | -1.226876 | -0.602559 | -0.066174 |
| 股票4 | 0.937176 | 0.246360 | 0.985740 | 0.432120 | 0.630900 |
| 股票5 | -0.148489 | 2.576248 | -1.351732 | 0.614716 | 1.795269 |
| 股票6 | -0.160387 | 2.368217 | 0.117976 | 0.234621 | -0.494219 |
| 股票7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 股票8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 股票9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
data.shape
(10, 5)
data.index
Index(['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9'], dtype='object')
data.columns
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05'],
dtype='datetime64[ns]', freq='B')
data.values
array([[-1.23168733, 0.42368921, -0.06619239, -0.05146062, 0.04914023],
[-0.80122527, 1.52284836, -1.49273893, -0.54990148, -0.96014347],
[ 0.64891876, -0.34520261, 0.90939211, 0.48298636, 1.14078454],
[-1.29226065, -1.42416842, -1.22687598, -0.60255858, -0.06617384],
[ 0.93717564, 0.24636034, 0.98573956, 0.43211993, 0.63089973],
[-0.14848874, 2.5762479 , -1.3517323 , 0.61471629, 1.79526939],
[-0.16038667, 2.36821683, 0.1179758 , 0.23462105, -0.49421883],
[-0.15292821, 0.79710342, 1.27399354, -1.52690597, 0.31501682],
[ 0.96106066, -0.92449658, 0.45836083, -0.0337036 , 0.56269464],
[-1.57881991, 0.33391995, -1.52311807, -0.34654408, 0.25373526]])
data.T
| | 股票0 | 股票1 | 股票2 | 股票3 | 股票4 | 股票5 | 股票6 | 股票7 | 股票8 | 股票9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-01 | -1.231687 | -0.801225 | 0.648919 | -1.292261 | 0.937176 | -0.148489 | -0.160387 | -0.152928 | 0.961061 | -1.578820 |
| 2018-01-02 | 0.423689 | 1.522848 | -0.345203 | -1.424168 | 0.246360 | 2.576248 | 2.368217 | 0.797103 | -0.924497 | 0.333920 |
| 2018-01-03 | -0.066192 | -1.492739 | 0.909392 | -1.226876 | 0.985740 | -1.351732 | 0.117976 | 1.273994 | 0.458361 | -1.523118 |
| 2018-01-04 | -0.051461 | -0.549901 | 0.482986 | -0.602559 | 0.432120 | 0.614716 | 0.234621 | -1.526906 | -0.033704 | -0.346544 |
| 2018-01-05 | 0.049140 | -0.960143 | 1.140785 | -0.066174 | 0.630900 | 1.795269 | -0.494219 | 0.315017 | 0.562695 | 0.253735 |
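The relationships between these attributes can be checked directly. The sketch below rebuilds a frame of the same shape; the seeded generator and the name frame are assumptions made so the example is reproducible and self-contained, so the numbers differ from the output shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so reruns give the same numbers
frame = pd.DataFrame(rng.normal(size=(10, 5)),
                     index=["股票{}".format(i) for i in range(10)],
                     columns=pd.date_range("20180101", periods=5, freq="B"))

print(frame.shape)    # (10, 5)
print(frame.T.shape)  # (5, 10): rows and columns swap

# values exposes the underlying data as a NumPy array;
# to_numpy() returns the same data.
assert isinstance(frame.values, np.ndarray)
assert np.array_equal(frame.values, frame.to_numpy())

# Transposing swaps the index and columns objects as well.
assert frame.T.index.equals(frame.columns)
assert frame.T.columns.equals(frame.index)
```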
3. DataFrame methods
data.head()  # first 5 rows by default
| | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 |
|---|---|---|---|---|---|
| 股票0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 股票1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 股票2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
| 股票3 | -1.292261 | -1.424168 | -1.226876 | -0.602559 | -0.066174 |
| 股票4 | 0.937176 | 0.246360 | 0.985740 | 0.432120 | 0.630900 |
data.head(3)  # first 3 rows
| | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 |
|---|---|---|---|---|---|
| 股票0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 股票1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 股票2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
data.tail()  # last 5 rows by default
| | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 |
|---|---|---|---|---|---|
| 股票5 | -0.148489 | 2.576248 | -1.351732 | 0.614716 | 1.795269 |
| 股票6 | -0.160387 | 2.368217 | 0.117976 | 0.234621 | -0.494219 |
| 股票7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 股票8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 股票9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
data.tail(3)  # last 3 rows
| | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 |
|---|---|---|---|---|---|
| 股票7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 股票8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 股票9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
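Both methods take an optional row count n, and a negative n is also accepted. The sketch below uses a made-up single-column frame (column x) to show the default and the negative-count behaviour.

```python
import pandas as pd

# Invented ten-row frame, used only to count rows.
df = pd.DataFrame({"x": range(10)})

assert len(df.head()) == 5    # default n is 5
assert len(df.tail(3)) == 3

# Negative n means "everything except":
# head(-2) drops the last two rows, tail(-2) drops the first two.
assert df.head(-2)["x"].tolist() == [0, 1, 2, 3, 4, 5, 6, 7]
assert df.tail(-2)["x"].tolist() == [2, 3, 4, 5, 6, 7, 8, 9]
```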
4. Setting the DataFrame index
A. Modifying row and column index values
data
| | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 |
|---|---|---|---|---|---|
| 股票0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 股票1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 股票2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
| 股票3 | -1.292261 | -1.424168 | -1.226876 | -0.602559 | -0.066174 |
| 股票4 | 0.937176 | 0.246360 | 0.985740 | 0.432120 | 0.630900 |
| 股票5 | -0.148489 | 2.576248 | -1.351732 | 0.614716 | 1.795269 |
| 股票6 | -0.160387 | 2.368217 | 0.117976 | 0.234621 | -0.494219 |
| 股票7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 股票8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 股票9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
# A single index label cannot be modified in place
data.index[2] = '股票88'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-54-639e4c928d71> in <module>
1 # A single index label cannot be modified in place
----> 2 data.index[2] = '股票88'
D:\Anaconda\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
3908
3909 def __setitem__(self, key, value):
-> 3910 raise TypeError("Index does not support mutable operations")
3911
3912 def __getitem__(self, key):
TypeError: Index does not support mutable operations
stocks_ = ['股票-{}'.format(i) for i in range(10)]
stocks_
['股票-0',
'股票-1',
'股票-2',
'股票-3',
'股票-4',
'股票-5',
'股票-6',
'股票-7',
'股票-8',
'股票-9']
data.index = stocks_
data
| | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 |
|---|---|---|---|---|---|
| 股票-0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 股票-1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 股票-2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
| 股票-3 | -1.292261 | -1.424168 | -1.226876 | -0.602559 | -0.066174 |
| 股票-4 | 0.937176 | 0.246360 | 0.985740 | 0.432120 | 0.630900 |
| 股票-5 | -0.148489 | 2.576248 | -1.351732 | 0.614716 | 1.795269 |
| 股票-6 | -0.160387 | 2.368217 | 0.117976 | 0.234621 | -0.494219 |
| 股票-7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 股票-8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 股票-9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
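Since an individual index entry cannot be assigned to, one possible alternative to replacing the whole index is DataFrame.rename with a mapping. This is a sketch on a made-up three-row frame (column price), not a step from the original notes.

```python
import pandas as pd

# Invented frame; the labels follow the naming used in this section.
df = pd.DataFrame({"price": [1.0, 2.0, 3.0]},
                  index=["股票0", "股票1", "股票2"])

# rename maps old labels to new ones and returns a new DataFrame;
# the original is left untouched.
renamed = df.rename(index={"股票2": "股票88"})

print(list(renamed.index))  # ['股票0', '股票1', '股票88']
print(list(df.index))       # ['股票0', '股票1', '股票2']
```

Labels that do not appear in the mapping are kept as they are, so only the entries that need changing have to be listed.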
B. Resetting the index
- reset_index(drop=False)
  - Replaces the current index with a default integer index (0, 1, 2, ...)
  - drop: defaults to False, which keeps the old index as a new column; if True, the old index is discarded
data.reset_index()
| | index | 2018-01-01 00:00:00 | 2018-01-02 00:00:00 | 2018-01-03 00:00:00 | 2018-01-04 00:00:00 | 2018-01-05 00:00:00 |
|---|---|---|---|---|---|---|
| 0 | 股票-0 | -1.231687 | 0.423689 | -0.066192 | -0.051461 | 0.049140 |
| 1 | 股票-1 | -0.801225 | 1.522848 | -1.492739 | -0.549901 | -0.960143 |
| 2 | 股票-2 | 0.648919 | -0.345203 | 0.909392 | 0.482986 | 1.140785 |
| 3 | 股票-3 | -1.292261 | -1.424168 | -1.226876 | -0.602559 | -0.066174 |
| 4 | 股票-4 | 0.937176 | 0.246360 | 0.985740 | 0.432120 | 0.630900 |
| 5 | 股票-5 | -0.148489 | 2.576248 | -1.351732 | 0.614716 | 1.795269 |
| 6 | 股票-6 | -0.160387 | 2.368217 | 0.117976 | 0.234621 | -0.494219 |
| 7 | 股票-7 | -0.152928 | 0.797103 | 1.273994 | -1.526906 | 0.315017 |
| 8 | 股票-8 | 0.961061 | -0.924497 | 0.458361 | -0.033704 | 0.562695 |
| 9 | 股票-9 | -1.578820 | 0.333920 | -1.523118 | -0.346544 | 0.253735 |
data.reset_index().shape
(10, 6)
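The effect of the drop parameter is easiest to see side by side. The sketch below uses a made-up two-row frame with a single price column.

```python
import pandas as pd

# Invented frame with string labels as the index.
df = pd.DataFrame({"price": [1.0, 2.0]}, index=["股票-0", "股票-1"])

kept = df.reset_index()               # old index becomes a column named "index"
dropped = df.reset_index(drop=True)   # old index is discarded

print(list(kept.columns))     # ['index', 'price']
print(list(dropped.columns))  # ['price']
print(list(dropped.index))    # [0, 1]
```

With drop=False the frame gains one column, which is why the shape above grows from (10, 5) to (10, 6).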
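C. Setting a new index (using the values of a column as the new index)
- set_index(keys, drop=True)
  - keys: a column label or a list of column labels
  - drop: boolean, default True; the column used as the new index is removed from the columns
1. Create the data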
df = pd.DataFrame({'month': [1, 4, 7, 10],
'year': [2012, 2014, 2013, 2014],
'sale': [55, 40, 84, 31]})
df
| | month | year | sale |
|---|---|---|---|
| 0 | 1 | 2012 | 55 |
| 1 | 4 | 2014 | 40 |
| 2 | 7 | 2013 | 84 |
| 3 | 10 | 2014 | 31 |
2. Set the month column as the new index
df.set_index('month')
| month (index) | year | sale |
|---|---|---|
| 1 | 2012 | 55 |
| 4 | 2014 | 40 |
| 7 | 2013 | 84 |
| 10 | 2014 | 31 |
df.set_index('month', drop=False)
| month (index) | month | year | sale |
|---|---|---|---|
| 1 | 1 | 2012 | 55 |
| 4 | 4 | 2014 | 40 |
| 7 | 7 | 2013 | 84 |
| 10 | 10 | 2014 | 31 |
new_df = df.set_index(['year','month'])
new_df
| year (index) | month (index) | sale |
|---|---|---|
| 2012 | 1 | 55 |
| 2014 | 4 | 40 |
| 2013 | 7 | 84 |
| 2014 | 10 | 31 |
new_df.index
MultiIndex([(2012, 1),
(2014, 4),
(2013, 7),
(2014, 10)],
names=['year', 'month'])
With the index set this way, the DataFrame now has a MultiIndex.
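Once a frame has a MultiIndex, rows can be selected by the outer level alone or by a full key. The sketch below repeats the df from this section so it runs on its own; the selections are illustrative additions to the notes.

```python
import pandas as pd

df = pd.DataFrame({"month": [1, 4, 7, 10],
                   "year": [2012, 2014, 2013, 2014],
                   "sale": [55, 40, 84, 31]})
new_df = df.set_index(["year", "month"])

# Selecting on the outer level returns every row for that year,
# indexed by the remaining level (month).
sales_2014 = new_df.loc[2014]
print(sales_2014)

# A full (year, month) tuple addresses a single row.
print(new_df.loc[(2014, 4), "sale"])  # 40

# The level names are available on the index.
print(list(new_df.index.names))  # ['year', 'month']
```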