Data Analysis, Part 3: Introduction to Pandas and DataFrame

3.1 Introduction to Pandas

3.1.1 Pandas overview

panel + data + analysis

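  • Developed by Wes McKinney in 2008
  • A Python library built for data mining
  • Built on NumPy, which it relies on for high-performance computation
  • Builds on matplotlib, so simple plots are easy to produce
  • Provides its own distinctive data structures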

3.1.2 Why use Pandas

  • Convenient data-processing capabilities
  • Easy file reading
  • Wraps Matplotlib's plotting and NumPy's computation
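As a rough illustration of the file-reading and computation points above, here is a minimal sketch. The CSV text, the column names, and the variable names are all made up for the example; in practice `pd.read_csv` would usually be given a file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """stock,day1,day2
A,0.5,-0.2
B,-1.0,0.4
"""

# Read the CSV text into a DataFrame, using the 'stock' column as the row index.
prices = pd.read_csv(io.StringIO(csv_text), index_col="stock")

# Column-wise mean; pandas carries out the arithmetic on NumPy arrays.
col_means = prices.mean()
print(col_means)
```

One call reads and labels the data, and one call summarizes it, which is the convenience the list above refers to.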

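3.1.3 DataFrame

  1. Structure: two-dimensional data with both a row index and a column index
    • Row index: identifies each row (the horizontal index); attribute index; axis=0
    • Column index: identifies each column (the vertical index); attribute columns; axis=1
  2. Common attributes
    • shape
    • index
    • columns
    • values
    • T
  3. Methods
    • DataFrame.head()
    • DataFrame.tail()
  4. Setting a DataFrame's index
    1. Modifying row and column index values
    2. Resetting the index
    3. Setting a new index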
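The axis numbers in the outline are easiest to remember with a small reduction. The sketch below uses a hypothetical 2x3 frame (names and values are made up) to show what `axis=0` and `axis=1` mean for `sum`.

```python
import pandas as pd

# Hypothetical 2x3 frame used only to illustrate the two axes.
demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                    index=["row_a", "row_b"],
                    columns=["x", "y", "z"])

# axis=0 runs down the row index, so the result has one entry per column.
per_column = demo.sum(axis=0)

# axis=1 runs across the column index, so the result has one entry per row.
per_row = demo.sum(axis=1)

print(per_column.tolist())  # [5, 7, 9]
print(per_row.tolist())     # [6, 15]
```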
# Create price-change data for 10 stocks over 5 days, drawn from a normal distribution
import numpy as np
stock_change = np.random.normal(loc=0, scale=1, size=(10, 5))
stock_change
array([[-1.23168733,  0.42368921, -0.06619239, -0.05146062,  0.04914023],
       [-0.80122527,  1.52284836, -1.49273893, -0.54990148, -0.96014347],
       [ 0.64891876, -0.34520261,  0.90939211,  0.48298636,  1.14078454],
       [-1.29226065, -1.42416842, -1.22687598, -0.60255858, -0.06617384],
       [ 0.93717564,  0.24636034,  0.98573956,  0.43211993,  0.63089973],
       [-0.14848874,  2.5762479 , -1.3517323 ,  0.61471629,  1.79526939],
       [-0.16038667,  2.36821683,  0.1179758 ,  0.23462105, -0.49421883],
       [-0.15292821,  0.79710342,  1.27399354, -1.52690597,  0.31501682],
       [ 0.96106066, -0.92449658,  0.45836083, -0.0337036 ,  0.56269464],
       [-1.57881991,  0.33391995, -1.52311807, -0.34654408,  0.25373526]])
1. DataFrame structure
import pandas as pd
pd.DataFrame(stock_change)
          0         1         2         3         4
0 -1.231687  0.423689 -0.066192 -0.051461  0.049140
1 -0.801225  1.522848 -1.492739 -0.549901 -0.960143
2  0.648919 -0.345203  0.909392  0.482986  1.140785
3 -1.292261 -1.424168 -1.226876 -0.602559 -0.066174
4  0.937176  0.246360  0.985740  0.432120  0.630900
5 -0.148489  2.576248 -1.351732  0.614716  1.795269
6 -0.160387  2.368217  0.117976  0.234621 -0.494219
7 -0.152928  0.797103  1.273994 -1.526906  0.315017
8  0.961061 -0.924497  0.458361 -0.033704  0.562695
9 -1.578820  0.333920 -1.523118 -0.346544  0.253735
# Add a row index (股票 means "stock")
stocks = ['股票{}'.format(i) for i in range(10)]
stocks
['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9']
pd.DataFrame(stock_change, index=stocks)
              0         1         2         3         4
股票0 -1.231687  0.423689 -0.066192 -0.051461  0.049140
股票1 -0.801225  1.522848 -1.492739 -0.549901 -0.960143
股票2  0.648919 -0.345203  0.909392  0.482986  1.140785
股票3 -1.292261 -1.424168 -1.226876 -0.602559 -0.066174
股票4  0.937176  0.246360  0.985740  0.432120  0.630900
股票5 -0.148489  2.576248 -1.351732  0.614716  1.795269
股票6 -0.160387  2.368217  0.117976  0.234621 -0.494219
股票7 -0.152928  0.797103  1.273994 -1.526906  0.315017
股票8  0.961061 -0.924497  0.458361 -0.033704  0.562695
股票9 -1.578820  0.333920 -1.523118 -0.346544  0.253735
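Generating a date index:

pd.date_range(start=None, end=None, periods=None, freq='B')

  • start: start date
  • end: end date
  • periods: number of dates to generate
  • freq: step between dates; when omitted the step is one calendar day ('D'); 'B' means business day (weekends are skipped)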
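To see what `freq='B'` changes, the sketch below starts on a Friday (2018-01-05) and compares business-day stepping with the default daily stepping. The variable names are illustrative only.

```python
import pandas as pd

# Start on Friday 2018-01-05 and ask for three business days.
business_days = pd.date_range(start="2018-01-05", periods=3, freq="B")

# The same start with the default daily frequency includes the weekend.
calendar_days = pd.date_range(start="2018-01-05", periods=3)

print(business_days.strftime("%Y-%m-%d").tolist())
# ['2018-01-05', '2018-01-08', '2018-01-09']
print(calendar_days.strftime("%Y-%m-%d").tolist())
# ['2018-01-05', '2018-01-06', '2018-01-07']
```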
# Add a column index
date = pd.date_range(start='20180101', periods=5, freq='B')  # freq: frequency; 'B' stands for business day, i.e. a trading day
date
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='B')
pd.DataFrame(stock_change, index=stocks, columns=date)
       2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05
股票0    -1.231687    0.423689   -0.066192   -0.051461    0.049140
股票1    -0.801225    1.522848   -1.492739   -0.549901   -0.960143
股票2     0.648919   -0.345203    0.909392    0.482986    1.140785
股票3    -1.292261   -1.424168   -1.226876   -0.602559   -0.066174
股票4     0.937176    0.246360    0.985740    0.432120    0.630900
股票5    -0.148489    2.576248   -1.351732    0.614716    1.795269
股票6    -0.160387    2.368217    0.117976    0.234621   -0.494219
股票7    -0.152928    0.797103    1.273994   -1.526906    0.315017
股票8     0.961061   -0.924497    0.458361   -0.033704    0.562695
股票9    -1.578820    0.333920   -1.523118   -0.346544    0.253735
2. DataFrame attributes
data = pd.DataFrame(stock_change, index=stocks, columns=date)
data
       2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05
股票0    -1.231687    0.423689   -0.066192   -0.051461    0.049140
股票1    -0.801225    1.522848   -1.492739   -0.549901   -0.960143
股票2     0.648919   -0.345203    0.909392    0.482986    1.140785
股票3    -1.292261   -1.424168   -1.226876   -0.602559   -0.066174
股票4     0.937176    0.246360    0.985740    0.432120    0.630900
股票5    -0.148489    2.576248   -1.351732    0.614716    1.795269
股票6    -0.160387    2.368217    0.117976    0.234621   -0.494219
股票7    -0.152928    0.797103    1.273994   -1.526906    0.315017
股票8     0.961061   -0.924497    0.458361   -0.033704    0.562695
股票9    -1.578820    0.333920   -1.523118   -0.346544    0.253735
data.shape
(10, 5)
data.index
Index(['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9'], dtype='object')
data.columns
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='B')
data.values
array([[-1.23168733,  0.42368921, -0.06619239, -0.05146062,  0.04914023],
       [-0.80122527,  1.52284836, -1.49273893, -0.54990148, -0.96014347],
       [ 0.64891876, -0.34520261,  0.90939211,  0.48298636,  1.14078454],
       [-1.29226065, -1.42416842, -1.22687598, -0.60255858, -0.06617384],
       [ 0.93717564,  0.24636034,  0.98573956,  0.43211993,  0.63089973],
       [-0.14848874,  2.5762479 , -1.3517323 ,  0.61471629,  1.79526939],
       [-0.16038667,  2.36821683,  0.1179758 ,  0.23462105, -0.49421883],
       [-0.15292821,  0.79710342,  1.27399354, -1.52690597,  0.31501682],
       [ 0.96106066, -0.92449658,  0.45836083, -0.0337036 ,  0.56269464],
       [-1.57881991,  0.33391995, -1.52311807, -0.34654408,  0.25373526]])
data.T  # transpose
                股票0        股票1        股票2        股票3        股票4        股票5        股票6        股票7        股票8        股票9
2018-01-01  -1.231687  -0.801225   0.648919  -1.292261   0.937176  -0.148489  -0.160387  -0.152928   0.961061  -1.578820
2018-01-02   0.423689   1.522848  -0.345203  -1.424168   0.246360   2.576248   2.368217   0.797103  -0.924497   0.333920
2018-01-03  -0.066192  -1.492739   0.909392  -1.226876   0.985740  -1.351732   0.117976   1.273994   0.458361  -1.523118
2018-01-04  -0.051461  -0.549901   0.482986  -0.602559   0.432120   0.614716   0.234621  -1.526906  -0.033704  -0.346544
2018-01-05   0.049140  -0.960143   1.140785  -0.066174   0.630900   1.795269  -0.494219   0.315017   0.562695   0.253735
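The attributes above can be checked quickly on a frame small enough to read by eye. The sketch below uses a hypothetical 2x3 frame. It also calls `to_numpy()`, which the pandas documentation recommends over `.values`; the tutorial itself uses `.values`, so treat that as an aside.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame, small enough to check by eye.
small = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=["r0", "r1"],
                     columns=["a", "b", "c"])

print(small.shape)    # (2, 3): rows first, then columns
print(small.T.shape)  # (3, 2): transposing swaps the two axes

# After transposing, the old column labels become the row index and vice versa.
print(list(small.T.index), list(small.T.columns))

# to_numpy() yields an array with the same contents as .values.
arr = small.to_numpy()
print(np.array_equal(arr, small.values))  # True
```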
3. DataFrame methods
data.head()  # first 5 rows by default
       2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05
股票0    -1.231687    0.423689   -0.066192   -0.051461    0.049140
股票1    -0.801225    1.522848   -1.492739   -0.549901   -0.960143
股票2     0.648919   -0.345203    0.909392    0.482986    1.140785
股票3    -1.292261   -1.424168   -1.226876   -0.602559   -0.066174
股票4     0.937176    0.246360    0.985740    0.432120    0.630900
data.head(3)  # first 3 rows
       2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05
股票0    -1.231687    0.423689   -0.066192   -0.051461    0.049140
股票1    -0.801225    1.522848   -1.492739   -0.549901   -0.960143
股票2     0.648919   -0.345203    0.909392    0.482986    1.140785
data.tail()  # last 5 rows by default
       2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05
股票5    -0.148489    2.576248   -1.351732    0.614716    1.795269
股票6    -0.160387    2.368217    0.117976    0.234621   -0.494219
股票7    -0.152928    0.797103    1.273994   -1.526906    0.315017
股票8     0.961061   -0.924497    0.458361   -0.033704    0.562695
股票9    -1.578820    0.333920   -1.523118   -0.346544    0.253735
data.tail(3)  # last 3 rows
       2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05
股票7    -0.152928    0.797103    1.273994   -1.526906    0.315017
股票8     0.961061   -0.924497    0.458361   -0.033704    0.562695
股票9    -1.578820    0.333920   -1.523118   -0.346544    0.253735
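Both methods take the number of rows as their argument. As a small aside that the tutorial does not cover, a negative count is also accepted and means "everything except that many rows from the other end". The sketch below uses a hypothetical six-row frame.

```python
import pandas as pd

# Hypothetical six-row frame with values 0..5.
frame = pd.DataFrame({"value": range(6)})

first_two = frame.head(2)          # rows 0 and 1
last_two = frame.tail(2)           # rows 4 and 5
all_but_last_two = frame.head(-2)  # rows 0 through 3

print(first_two["value"].tolist())         # [0, 1]
print(last_two["value"].tolist())          # [4, 5]
print(all_but_last_two["value"].tolist())  # [0, 1, 2, 3]
```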
4. Setting a DataFrame's index
A. Modifying row and column index values
data
       2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05
股票0    -1.231687    0.423689   -0.066192   -0.051461    0.049140
股票1    -0.801225    1.522848   -1.492739   -0.549901   -0.960143
股票2     0.648919   -0.345203    0.909392    0.482986    1.140785
股票3    -1.292261   -1.424168   -1.226876   -0.602559   -0.066174
股票4     0.937176    0.246360    0.985740    0.432120    0.630900
股票5    -0.148489    2.576248   -1.351732    0.614716    1.795269
股票6    -0.160387    2.368217    0.117976    0.234621   -0.494219
股票7    -0.152928    0.797103    1.273994   -1.526906    0.315017
股票8     0.961061   -0.924497    0.458361   -0.033704    0.562695
股票9    -1.578820    0.333920   -1.523118   -0.346544    0.253735
# A single index label cannot be modified in place
data.index[2] = '股票88'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-54-639e4c928d71> in <module>
      1 # A single index label cannot be modified in place
----> 2 data.index[2] = '股票88'

D:\Anaconda\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   3908 
   3909     def __setitem__(self, key, value):
-> 3910         raise TypeError("Index does not support mutable operations")
   3911 
   3912     def __getitem__(self, key):

TypeError: Index does not support mutable operations
stocks_ = ['股票-{}'.format(i) for i in range(10)]
stocks_
['股票-0',
 '股票-1',
 '股票-2',
 '股票-3',
 '股票-4',
 '股票-5',
 '股票-6',
 '股票-7',
 '股票-8',
 '股票-9']
# The index can only be replaced as a whole
data.index = stocks_
data
        2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05
股票-0    -1.231687    0.423689   -0.066192   -0.051461    0.049140
股票-1    -0.801225    1.522848   -1.492739   -0.549901   -0.960143
股票-2     0.648919   -0.345203    0.909392    0.482986    1.140785
股票-3    -1.292261   -1.424168   -1.226876   -0.602559   -0.066174
股票-4     0.937176    0.246360    0.985740    0.432120    0.630900
股票-5    -0.148489    2.576248   -1.351732    0.614716    1.795269
股票-6    -0.160387    2.368217    0.117976    0.234621   -0.494219
股票-7    -0.152928    0.797103    1.273994   -1.526906    0.315017
股票-8     0.961061   -0.924497    0.458361   -0.033704    0.562695
股票-9    -1.578820    0.333920   -1.523118   -0.346544    0.253735
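Because an Index is immutable, changing one label means building a new index. Besides assigning a whole new list as shown above, one option (not used in the original walkthrough) is `DataFrame.rename` with a mapping from old label to new label. The sketch below uses a hypothetical three-row frame whose labels follow the 股票N ("stock N") naming.

```python
import pandas as pd

# Hypothetical three-row frame; labels follow the 股票N ("stock N") pattern.
frame = pd.DataFrame({"change": [0.1, -0.2, 0.3]},
                     index=["股票0", "股票1", "股票2"])

# rename() returns a new DataFrame in which only the mapped label differs.
renamed = frame.rename(index={"股票2": "股票88"})

print(list(renamed.index))  # ['股票0', '股票1', '股票88']
print(list(frame.index))    # ['股票0', '股票1', '股票2'] (original unchanged)
```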
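B. Resetting the index
  • reset_index(drop=False)
    • Replaces the index with a new positional (integer) index
    • drop: defaults to False, which keeps the old index as a column; if True, the old index is discarded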
data.reset_index()
    index  2018-01-01 00:00:00  2018-01-02 00:00:00  2018-01-03 00:00:00  2018-01-04 00:00:00  2018-01-05 00:00:00
0  股票-0            -1.231687             0.423689            -0.066192            -0.051461             0.049140
1  股票-1            -0.801225             1.522848            -1.492739            -0.549901            -0.960143
2  股票-2             0.648919            -0.345203             0.909392             0.482986             1.140785
3  股票-3            -1.292261            -1.424168            -1.226876            -0.602559            -0.066174
4  股票-4             0.937176             0.246360             0.985740             0.432120             0.630900
5  股票-5            -0.148489             2.576248            -1.351732             0.614716             1.795269
6  股票-6            -0.160387             2.368217             0.117976             0.234621            -0.494219
7  股票-7            -0.152928             0.797103             1.273994            -1.526906             0.315017
8  股票-8             0.961061            -0.924497             0.458361            -0.033704             0.562695
9  股票-9            -1.578820             0.333920            -1.523118            -0.346544             0.253735
data.reset_index().shape
(10, 6)
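The output above uses the default `drop=False`, which is why the shape grows from 5 to 6 columns. For comparison, here is a minimal sketch of both settings on a hypothetical two-row frame.

```python
import pandas as pd

# Hypothetical two-row frame with string labels as its index.
frame = pd.DataFrame({"change": [0.1, -0.2]}, index=["股票-0", "股票-1"])

# drop=False (the default): the old labels become a column named 'index'.
kept = frame.reset_index()

# drop=True: the old labels are discarded and no column is added.
dropped = frame.reset_index(drop=True)

print(list(kept.columns), kept.shape)        # ['index', 'change'] (2, 2)
print(list(dropped.columns), dropped.shape)  # ['change'] (2, 1)
print(list(dropped.index))                   # [0, 1]
```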
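C. Setting a new index (using the values of a column as the new index)
  • set_index(keys, drop=True)
    • keys: a column label or a list of column labels
    • drop: boolean, default True; the column used as the new index is removed from the columns
1. Create the data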
df = pd.DataFrame({'month': [1, 4, 7, 10],
                    'year': [2012, 2014, 2013, 2014],
                    'sale': [55, 40, 84, 31]})
df
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31
2. Set the month as the new index
df.set_index('month')
       year  sale
month
1      2012    55
4      2014    40
7      2013    84
10     2014    31
df.set_index('month', drop=False)
       month  year  sale
month
1          1  2012    55
4          4  2014    40
7          7  2013    84
10        10  2014    31
new_df = df.set_index(['year','month'])
new_df
            sale
year month
2012 1        55
2014 4        40
2013 7        84
2014 10       31
new_df.index
MultiIndex([(2012,  1),
            (2014,  4),
            (2013,  7),
            (2014, 10)],
           names=['year', 'month'])
After this step, the DataFrame has a MultiIndex.
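A MultiIndex is mainly useful for selecting by one or more levels. The sketch below rebuilds the same `df` and shows two `.loc` lookups. The `sort_index()` call and the names `sales_2014` and `april_2014` are additions for this illustration, not part of the original walkthrough.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# Sorting is an extra step added here so that lookups run on a sorted index.
new_df = df.set_index(['year', 'month']).sort_index()

# Selecting on the outer level returns that year's rows, indexed by month.
sales_2014 = new_df.loc[2014]

# A full (year, month) tuple plus a column label selects a single value.
april_2014 = new_df.loc[(2014, 4), 'sale']

print(sales_2014['sale'].tolist())  # [40, 31]
print(april_2014)                   # 40
```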