pandas索引.pandas.v0.22

# from http://pandas.pydata.org/pandas-docs/stable/indexing.html
# pandas 0.22.0 
import pandas as pd
import numpy as np
"""
尽量利用pandas提供的专用索引方式,而不是python通用的切片方式。
三种主要的索引方式:indexers
.loc label based 基于标签, 可以是标签名,可以是布尔值,可以是一元函数
.iloc integer psition based 基于整数位置(from 0 to length-1 of the axis),和python切片类似
[]

"""

"""
Object Type         Selection           Return Value Type
------------------------------------------------------------
Series              series[label]       scalar value
DataFrame           frame[colname]      Series corresponding to colname
Panel               panel[itemname]     DataFrame corresponding to the itemname
"""
'\n尽量利用pandas提供的专用索引方式,而不是python通用的切片方式。\n三种主要的索引方式:indexers\n.loc label based 基于标签, 可以是标签名,可以是布尔值,可以是一元函数\n.iloc integer psition based 基于整数位置(from 0 to length-1 of the axis),和python切片类似\n[]\n\n'






'\nObject Type \t\tIndexers\nSeries \t\t\t\ts.loc[indexer]\nDataFrame \t\t\tdf.loc[row_indexer,column_indexer]\nPanel \t\t\t\tp.loc[item_indexer,major_indexer,minor_indexer]\n'
# Here we construct a simple time series data set to use for illustrating the indexing functionality 
# 构造时间序列,举例说明索引的功能

dates = pd.date_range('1/1/2000', periods=8)
dates
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')
ABCD
2000-01-010.1366280.3240970.856313-0.145259
2000-01-020.113819-0.7186300.016217-1.571263
2000-01-03-0.603598-0.143643-1.143063-0.425266
2000-01-04-0.486200-0.136663-2.016020-0.815514
2000-01-050.8353181.036607-0.5029190.878680
2000-01-06-2.3762050.362577-0.484754-0.478711
2000-01-070.1933711.3304680.5441601.030900
2000-01-080.476533-0.476653-0.4343560.744500
panel = pd.Panel({'one' : df, 'two' : df - df.mean()})  # 多维表格
panel
panel['one']
panel['two']
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 8 (major_axis) x 4 (minor_axis)
Items axis: one to two
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-08 00:00:00
Minor_axis axis: A to D
ABCD
2000-01-010.1366280.3240970.856313-0.145259
2000-01-020.113819-0.7186300.016217-1.571263
2000-01-03-0.603598-0.143643-1.143063-0.425266
2000-01-04-0.486200-0.136663-2.016020-0.815514
2000-01-050.8353181.036607-0.5029190.878680
2000-01-06-2.3762050.362577-0.484754-0.478711
2000-01-070.1933711.3304680.5441601.030900
2000-01-080.476533-0.476653-0.4343560.744500
ABCD
2000-01-010.3504190.1268271.251866-0.047518
2000-01-020.327611-0.9159000.411770-1.473522
2000-01-03-0.389806-0.340913-0.747511-0.327524
2000-01-04-0.272408-0.333933-1.620467-0.717772
2000-01-051.0491100.839337-0.1073660.976422
2000-01-06-2.1624130.165307-0.089201-0.380969
2000-01-070.4071631.1331980.9397131.128641
2000-01-080.690325-0.673923-0.0388030.842241
# Thus, as per above, we have the most basic indexing using []:
# 最基本的索引方法,使用[]

s = df['A']
s
dates[5]  # 注意切片从0计数
s[dates[5]]
2000-01-01    0.136628
2000-01-02    0.113819
2000-01-03   -0.603598
2000-01-04   -0.486200
2000-01-05    0.835318
2000-01-06   -2.376205
2000-01-07    0.193371
2000-01-08    0.476533
Freq: D, Name: A, dtype: float64






Timestamp('2000-01-06 00:00:00', freq='D')






-2.376204948581219
# 在[]中传入列名的列表,如[ 'A', "B" ]
columns_l = ['A', 'B']
df[columns_l]  
df[[ 'A', "B" ]]  # 相当于上面,注意两重[]
df
AB
2000-01-010.3240970.136628
2000-01-02-0.7186300.113819
2000-01-03-0.143643-0.603598
2000-01-04-0.136663-0.486200
2000-01-051.0366070.835318
2000-01-060.362577-2.376205
2000-01-071.3304680.193371
2000-01-08-0.4766530.476533
AB
2000-01-010.3240970.136628
2000-01-02-0.7186300.113819
2000-01-03-0.143643-0.603598
2000-01-04-0.136663-0.486200
2000-01-051.0366070.835318
2000-01-060.362577-2.376205
2000-01-071.3304680.193371
2000-01-08-0.4766530.476533
ABCD
2000-01-010.3240970.1366280.856313-0.145259
2000-01-02-0.7186300.1138190.016217-1.571263
2000-01-03-0.143643-0.603598-1.143063-0.425266
2000-01-04-0.136663-0.486200-2.016020-0.815514
2000-01-051.0366070.835318-0.5029190.878680
2000-01-060.362577-2.376205-0.484754-0.478711
2000-01-071.3304680.1933710.5441601.030900
2000-01-08-0.4766530.476533-0.4343560.744500
df[['A', 'B']] = df[['B', 'A']]  # 交换两列的值
df
df.loc[:,['B', 'A']]

# 下式不能交换两列的值
# This will not modify df because the column alignment is before value assignment. ? 不理解?
# 列赋值在值赋值之前?
df.loc[:,['B', 'A']] = df[['A', 'B']]  
df

# df.loc[:,['B', 'A']] = df[:, ['A', 'B']]   # 错误?
# df

# 正确的方式:
df.loc[:,['B', 'A']] = df[['A', 'B']].values  # 取列的值
df
ABCD
2000-01-010.1366280.3240970.856313-0.145259
2000-01-020.113819-0.7186300.016217-1.571263
2000-01-03-0.603598-0.143643-1.143063-0.425266
2000-01-04-0.486200-0.136663-2.016020-0.815514
2000-01-050.8353181.036607-0.5029190.878680
2000-01-06-2.3762050.362577-0.484754-0.478711
2000-01-070.1933711.3304680.5441601.030900
2000-01-080.476533-0.476653-0.4343560.744500
BA
2000-01-010.3240970.136628
2000-01-02-0.7186300.113819
2000-01-03-0.143643-0.603598
2000-01-04-0.136663-0.486200
2000-01-051.0366070.835318
2000-01-060.362577-2.376205
2000-01-071.3304680.193371
2000-01-08-0.4766530.476533
ABCD
2000-01-010.1366280.3240970.856313-0.145259
2000-01-020.113819-0.7186300.016217-1.571263
2000-01-03-0.603598-0.143643-1.143063-0.425266
2000-01-04-0.486200-0.136663-2.016020-0.815514
2000-01-050.8353181.036607-0.5029190.878680
2000-01-06-2.3762050.362577-0.484754-0.478711
2000-01-070.1933711.3304680.5441601.030900
2000-01-080.476533-0.476653-0.4343560.744500
ABCD
2000-01-010.3240970.1366280.856313-0.145259
2000-01-02-0.7186300.1138190.016217-1.571263
2000-01-03-0.143643-0.603598-1.143063-0.425266
2000-01-04-0.136663-0.486200-2.016020-0.815514
2000-01-051.0366070.835318-0.5029190.878680
2000-01-060.362577-2.376205-0.484754-0.478711
2000-01-071.3304680.1933710.5441601.030900
2000-01-08-0.4766530.476533-0.4343560.744500
# Attribute Access¶
# 属性访问,属性存取
# You may access an index on a Series, column on a DataFrame, 
# and an item on a Panel directly as an attribute:

sa = pd.Series([1,2,3],index=list('abc'))
sa

dfa = df.copy()
dfa
a    1
b    2
c    3
dtype: int64
ABCD
2000-01-010.3240970.1366280.856313-0.145259
2000-01-02-0.7186300.1138190.016217-1.571263
2000-01-03-0.143643-0.603598-1.143063-0.425266
2000-01-04-0.136663-0.486200-2.016020-0.815514
2000-01-051.0366070.835318-0.5029190.878680
2000-01-060.362577-2.376205-0.484754-0.478711
2000-01-071.3304680.1933710.5441601.030900
2000-01-08-0.4766530.476533-0.4343560.744500
# 对item column等赋值
# 就像属性一样的存取,但需要注意:
# 1. 名称应符合python命名规则,由字母、数字和下划线组成
# 2. 不能和已有的方法名称重名,例如 min
# 3. 不能与pandas内部“关键字”重名,例如 index axis items labels
# 以上情况,可以使用 [ "" ] 引用

sa.a = 5
sa

dfa.index
dfa.A = list(range(len(dfa.index)))
dfa
a    5
b    2
c    3
dtype: int64






DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')
ABCD
2000-01-0100.1366280.856313-0.145259
2000-01-0210.1138190.016217-1.571263
2000-01-032-0.603598-1.143063-0.425266
2000-01-043-0.486200-2.016020-0.815514
2000-01-0540.835318-0.5029190.878680
2000-01-065-2.376205-0.484754-0.478711
2000-01-0760.1933710.5441601.030900
2000-01-0870.476533-0.4343560.744500
x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})  # 字典key值为列名
x
xy
013
124
235
dict(x=9, y=99)
x.iloc[1]
x.iloc[1] = dict(x=9, y=99)
x
{'x': 9, 'y': 99}






x    2
y    4
Name: 1, dtype: int64
xy
013
1999
235
df = pd.DataFrame({'one': [1., 2., 3.]})
df
df.two = [4, 5, 6]  # 错误,不能增加一列,利用属性的方式不能对没有的列赋值
df.two  # 但是增加了一项属性,而且可以取得这项属性
df
df['two'] = [4, 5, 6]  # 可以增加一列
df
one
01.0
12.0
23.0
d:\python\36-64\lib\site-packages\ipykernel_launcher.py:3: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  This is separate from the ipykernel package so we can avoid doing imports until





[4, 5, 6]
one
01.0
12.0
23.0
onetwo
01.04
12.05
23.06
# Slicing ranges
# 切片范围
# iloc 方法是最稳健和兼容的
# 下面介绍 方括号 [] 操作符 作为切片

# 对series操作
# 取值
s
s[:5]
s[::2]
s[::-1]
2000-01-01    0.324097
2000-01-02   -0.718630
2000-01-03   -0.143643
2000-01-04   -0.136663
2000-01-05    1.036607
2000-01-06    0.362577
2000-01-07    1.330468
2000-01-08   -0.476653
Freq: D, Name: A, dtype: float64






2000-01-01    0.324097
2000-01-02   -0.718630
2000-01-03   -0.143643
2000-01-04   -0.136663
2000-01-05    1.036607
Freq: D, Name: A, dtype: float64






2000-01-01    0.324097
2000-01-03   -0.143643
2000-01-05    1.036607
2000-01-07    1.330468
Freq: 2D, Name: A, dtype: float64






2000-01-08   -0.476653
2000-01-07    1.330468
2000-01-06    0.362577
2000-01-05    1.036607
2000-01-04   -0.136663
2000-01-03   -0.143643
2000-01-02   -0.718630
2000-01-01    0.324097
Freq: -1D, Name: A, dtype: float64
# 赋值
s2 = s.copy()
s2
s2[:5] = 0
s2
2000-01-01    0.324097
2000-01-02   -0.718630
2000-01-03   -0.143643
2000-01-04   -0.136663
2000-01-05    1.036607
2000-01-06    0.362577
2000-01-07    1.330468
2000-01-08   -0.476653
Freq: D, Name: A, dtype: float64






2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06    0.362577
2000-01-07    1.330468
2000-01-08   -0.476653
Freq: D, Name: A, dtype: float64
# 对DataFrame操作
# [] 操作 选择的是行
df
df[:3]
df[::-1]
onetwo
01.04
12.05
23.06
onetwo
01.04
12.05
23.06
onetwo
23.06
12.05
01.04
# Selection By Label
# 通过标签选择

# 这种方式可能依靠上下文关系,有时候会调用链式赋值,这应该避免。参考:
# Returning a view versus a copy 返回视图 对比 拷贝

# When setting values in a pandas object, 
# care must be taken to avoid what is called chained indexing. 
# Here is an example.

[list('abcd'),list('efgh'),list('ijkl'),list('mnop')]
[['one','two'],['first','second']]

dfmi = pd.DataFrame([list('abcd'),list('efgh'),list('ijkl'),list('mnop')],  # 定义四列数值
        columns=pd.MultiIndex.from_product(
         [['one','two'],['first','second']])) # 定义多重索引,第一个list是第一层,以下类推

dfmi
[['a', 'b', 'c', 'd'],
 ['e', 'f', 'g', 'h'],
 ['i', 'j', 'k', 'l'],
 ['m', 'n', 'o', 'p']]






[['one', 'two'], ['first', 'second']]
onetwo
firstsecondfirstsecond
0abcd
1efgh
2ijkl
3mnop
#  Compare these two access methods:
dfmi['one']['second']  # chained
dfmi.loc[:,('one','second')]
0    b
1    f
2    j
3    n
Name: second, dtype: object






0    b
1    f
2    j
3    n
Name: (one, second), dtype: object
# 两种方式分析

# 第一种方式:链式 使用两个 []
dfmi['one']  # 第一个[] 先生成了一个DataFrame
dfmi['one']['second']  # pandas 把两个[] 作为分开的事件,他们执行分开的两步调用 __getitem__
firstsecond
0ab
1ef
2ij
3mn
# 第二种方式:loc
# 通过嵌套的元组切片
df_s = (slice(None), ('one', 'second'))
dfmi.loc[df_s]  # 只调用了一次 __getitem__
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object
# Selection By Label
# 通过标签选择

# loc要注意索引的数据类型,必须与索引的数据类型一致才可以,
# 例如 datetimeIndex 中,使用loc[2:3] ,即整数型的slice 将会出现TypeError


dfl = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'), 
                   index=pd.date_range('20130101',periods=5))

dfl
# dfl.loc[2:3]  # 错误的loc
dfl.loc['20130102':'20130104']  # 使用可转换为datetime的字符串
dfl.loc['20130202':'20130204']  # 不报错,返回为空DataFrame
# dfl.loc['20130202']  # 报错,错误信息是index无此值
dfl.loc['20130104':'20130114']  # 只返回存在的数据
ABCD
2013-01-01-0.8873581.0683620.551961-0.378400
2013-01-021.239840-0.986149-0.8806552.112731
2013-01-03-0.7855261.583703-0.871005-0.659880
2013-01-04-1.2674622.500886-0.9805691.308624
2013-01-05-0.842107-0.9210861.020196-0.055930
ABCD
2013-01-021.239840-0.986149-0.8806552.112731
2013-01-03-0.7855261.583703-0.871005-0.659880
2013-01-04-1.2674622.500886-0.9805691.308624
ABCD
2013-01-04-1.2674622.500886-0.9805691.308624
2013-01-05-0.842107-0.9210861.020196-0.055930
# loc可以使用整数,但此时的整数不代表位置,而是label
# loc是基本的取值方法
# 给loc的输入,即在[]中的值,可以是:
# 1. 单独的label  e.g. 5 or 'a
# 2. labels的list   ['a', 'b', 'c']
# 3. slice对象  'a':'f'  !! 注意:与python的切片不同,pandas的切片包括开始和结尾,而python不包括结尾
# 4. 布尔值
# 5. 调用函数  [lambda df: df.A > 0, :]

# Series
s1 = pd.Series(np.random.randn(6),index=list('abcdef'))
s1
s1.loc['c':]
s1.loc['b']
s1.loc['c':] = 0
s1
a    0.796911
b   -1.341250
c    0.008152
d   -0.745881
e    0.674385
f    1.108411
dtype: float64






c    0.008152
d   -0.745881
e    0.674385
f    1.108411
dtype: float64






-1.3412499335785426






a    0.796911
b   -1.341250
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64
# DataFrame
df1 = pd.DataFrame(np.random.randn(6,4),  # 6X4 阵列
                   index=list('abcdef'),  # 索引
                   columns=list('ABCD'))  # 行号
df1
ABCD
a-1.912141-0.835589-0.188341-1.024797
b-0.977498-2.0502140.355172-0.291794
c-0.183401-0.376330-0.188848-2.116438
d-1.0083590.230593-0.099235-0.426229
e-0.0273382.1254590.066200-0.247813
f-1.1321031.9452351.8911791.549750
df1.loc[['a', 'b', 'd'], :]  # 先是行label选择,再是列label选择
df1.loc['d':, 'A':'C']

# 取得 a cross section 截面,用单个的label,返回Series
# 以下三式等同
df1.loc['a']
df1.loc['a',:]
df1.xs('a')
type(df1.loc['a'])
ABCD
a-1.912141-0.835589-0.188341-1.024797
b-0.977498-2.0502140.355172-0.291794
d-1.0083590.230593-0.099235-0.426229
ABC
d-1.0083590.230593-0.099235
e-0.0273382.1254590.066200
f-1.1321031.9452351.891179
A   -1.912141
B   -0.835589
C   -0.188341
D   -1.024797
Name: a, dtype: float64






A   -1.912141
B   -0.835589
C   -0.188341
D   -1.024797
Name: a, dtype: float64






A   -1.912141
B   -0.835589
C   -0.188341
D   -1.024797
Name: a, dtype: float64






pandas.core.series.Series
# 通过布尔值数组取值
df1.loc['a'] > 0
df1.loc[:, df1.loc['a'] > 0]
A    False
B    False
C    False
D    False
Name: a, dtype: bool
a
b
c
d
e
f
# 获取一个值
# this is also equivalent to ``df1.at['a','A']``
df1.loc['a', 'A']
df1.at['a','A']
-1.9121410098540752






-1.9121410098540752
# Slicing with labels¶
# 用labels切片
s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s
s.loc[3:5]  # 包含 5, 注意不是 3 4 5 ,而是从 标签3 到 标签5
0    a
3    b
2    c
5    d
4    e
dtype: object






3    b
2    c
5    d
dtype: object
# s.loc[3:6]  # 2个错误,ValueError: index must be monotonic increasing or decreasing  KeyError: 6
# 如果排序的话,可以超出范围
s.sort_index()  # 不改变原值
s
s.sort_index().loc[1:6]  # 可以超出范围
s.sort_index().loc[6:8]  # 即使一个都没有
# s.sort_index().loc[8]  # 但不能是单值,必须是切片
0    a
2    c
3    b
4    e
5    d
dtype: object






0    a
3    b
2    c
5    d
4    e
dtype: object






2    c
3    b
4    e
5    d
dtype: object






Series([], dtype: object)
# Selection By Position
# 通过位置选择,仅通过基于索引的整数,与python和numpy类似,从0开始,且不包括最后一个
# iloc的输入:
# 1. 整数
# 2. 整数的list
# 3. 整数的切片
# 4. 布尔值数组
# 5. 调用函数

# Series
s1 = pd.Series(np.random.randn(5), index=list(range(0,10,2)))
s1
s1.iloc[:3]
s1.iloc[3]
s1.iloc[:3] = 0
s1
0   -0.312716
2    1.425936
4    1.716575
6    2.099666
8    0.262365
dtype: float64






0   -0.312716
2    1.425936
4    1.716575
dtype: float64






2.099665679869975






0    0.000000
2    0.000000
4    0.000000
6    2.099666
8    0.262365
dtype: float64
# DataFrame
df1 = pd.DataFrame(np.random.randn(6,4),
                  index=list(range(0,12,2)),
                  columns=list(range(0,8,2)))
df1
0246
01.635993-0.5124501.786760-0.002533
20.1731880.275977-0.044987-1.077772
41.9850201.6040200.127853-1.003384
60.250428-0.1020901.566787-1.708521
8-2.111103-1.2321410.863753-0.545229
10-1.7629991.0098400.2740130.786940
df1.iloc[:3]
df1.iloc[1:5, 2:4]
df1.iloc[[1, 3, 5], [1, 3]]
df1.iloc[1:3, :]
df1.iloc[:, 1:3]
0246
01.635993-0.5124501.786760-0.002533
20.1731880.275977-0.044987-1.077772
41.9850201.6040200.127853-1.003384
46
2-0.044987-1.077772
40.127853-1.003384
61.566787-1.708521
80.863753-0.545229
26
20.275977-1.077772
6-0.102090-1.708521
101.0098400.786940
0246
20.1731880.275977-0.044987-1.077772
41.9850201.6040200.127853-1.003384
24
0-0.5124501.786760
20.275977-0.044987
41.6040200.127853
6-0.1020901.566787
8-1.2321410.863753
101.0098400.274013
# this is also equivalent to ``df1.iat[1,1]``
# 取单个的值
df1.iloc[1, 1]
df1.iat[1, 1]
0.2759774784621013






0.2759774784621013
# For getting a cross section using an integer position (equiv to df.xs(1))
# 取截面,得到Series
df1.iloc[1]
df1.iloc[:,1]
0    0.173188
2    0.275977
4   -0.044987
6   -1.077772
Name: 2, dtype: float64






0    -0.512450
2     0.275977
4     1.604020
6    -0.102090
8    -1.232141
10    1.009840
Name: 2, dtype: float64
# Out of range slice indexes are handled gracefully just as in Python/Numpy.
# 超过索引的切片处理,与python和numpy一样
# 注意:不能是单独索引,或列表中,有超过界限的值,只可以是slice,即带冒号的切片才不会提示错误

x = list('abcdef')
x[4:10]
x[8:10]
s = pd.Series(x)
s.iloc[4:10]
s.iloc[8:10]  # 超过界限bound返回空
['e', 'f']






[]






4    e
5    f
dtype: object






Series([], dtype: object)
dfl = pd.DataFrame(np.random.randn(5,2), columns=list('AB'))
dfl
dfl.iloc[:, 2:3]
dfl.iloc[:, 1:3]
dfl.iloc[4:6]
AB
0-0.0699411.124145
1-0.0257810.940736
2-0.1174170.503736
30.8822860.302845
4-0.1363740.276822
0
1
2
3
4
B
01.124145
10.940736
20.503736
30.302845
40.276822
AB
4-0.1363740.276822
# Selection By Callable
# 通过调用函数进行选择

df1 = pd.DataFrame(np.random.randn(6,4),
                  index=list("abcdef"),
                  columns=list("ABCD"))
df1
df1.A
ABCD
a-1.9475780.8742861.139484-3.004564
b0.5652550.0284400.6856880.973264
c-1.2759920.732339-0.3244901.116887
d0.4333250.002567-1.3101270.844756
e0.341412-0.6066460.0346230.772968
f1.518936-0.5903510.604839-1.461750
a   -1.947578
b    0.565255
c   -1.275992
d    0.433325
e    0.341412
f    1.518936
Name: A, dtype: float64
df1.loc[lambda df: df.A > 0, :]
df1.loc[:, lambda df: ['A', 'B']]
df1.iloc[:, lambda df: [0, 1]]
df1[lambda df: df.columns[0]]
ABCD
b0.5652550.0284400.6856880.973264
d0.4333250.002567-1.3101270.844756
e0.341412-0.6066460.0346230.772968
f1.518936-0.5903510.604839-1.461750
AB
a-1.9475780.874286
b0.5652550.028440
c-1.2759920.732339
d0.4333250.002567
e0.341412-0.606646
f1.518936-0.590351
AB
a-1.9475780.874286
b0.5652550.028440
c-1.2759920.732339
d0.4333250.002567
e0.341412-0.606646
f1.518936-0.590351
a   -1.947578
b    0.565255
c   -1.275992
d    0.433325
e    0.341412
f    1.518936
Name: A, dtype: float64
df1.A
df1.A.loc[lambda s: s > 0]
df1.A.loc[df1.A > 0]
a   -1.947578
b    0.565255
c   -1.275992
d    0.433325
e    0.341412
f    1.518936
Name: A, dtype: float64






b    0.565255
d    0.433325
e    0.341412
f    1.518936
Name: A, dtype: float64






b    0.565255
d    0.433325
e    0.341412
f    1.518936
Name: A, dtype: float64
# 使用这些方法或索引,可以使用链式的选择方法,而不用中间的临时变量。链式方法,不是链式[]
bb = pd.read_csv('data/baseball.csv', index_col='id')
bb.groupby(['year', 'team']).sum().loc[lambda df: df.r > 100]
# IX Indexer is Deprecated
# 不推荐使用ix
# in favor of the more strict .iloc and .loc indexers.
# 使用.iloc和.loc代替.ix

dfd = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6]},index=list('abc'))
dfd
AB
a14
b25
c36
dfd.ix[[0, 2], 'A']
d:\python\36-64\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.





a    1
c    3
Name: A, dtype: int64
dfd.index[[0, 2]]  # 取得索引的名称
dfd.loc[dfd.index[[0, 2]], 'A']
dfd.columns.get_loc('A')  # 取得列的索引值
dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Index(['a', 'c'], dtype='object')






a    1
c    3
Name: A, dtype: int64






0






a    1
c    3
Name: A, dtype: int64
# Indexing with list with missing labels is Deprecated
# 不推荐用有缺失标签的list进行索引
# using .loc or [] with a list with one or more missing labels, is deprecated, in favor of .reindex.
# 推荐使用.reindex

s = pd.Series([1, 2, 3])
s
0    1
1    2
2    3
dtype: int64
s.loc[[1, 2]]  # list中的keys都存在,则没有变化
1    2
2    3
dtype: int64
s.loc[[1, 2, 3]]  # 当有缺失时,赋值为NaN 
d:\python\36-64\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  """Entry point for launching an IPython kernel.





1    2.0
2    3.0
3    NaN
dtype: float64
# Reindexing

s.reindex([1, 2, 3])
1    2.0
2    3.0
3    NaN
dtype: float64
# 如果仅选择有效的keys
labels = [1, 2, 3]
s.index.intersection(labels)  # index和labes的交集
s.loc[s.index.intersection(labels)]
Int64Index([1, 2], dtype='int64')






1    2
2    3
dtype: int64
# reindex 索引中不能有重复的项
s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
s
labels = ['c', 'd']
# s.reindex(labels)  # 不能reindex
a    0
a    1
b    2
c    3
dtype: int32
# 可以先把交集切出来,再进行reindex
# 但是交集不能有重复的index
s.index.intersection(labels)
s.loc[s.index.intersection(labels)]
s.loc[s.index.intersection(labels)].reindex(labels)
Index(['c'], dtype='object')






c    3
dtype: int32






c    3.0
d    NaN
dtype: float64
# Selecting Random Samples
# 随机取样选择
# 默认行取样

s = pd.Series([0,1,2,3,4,5])
s
0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64
s.sample()  # 默认取1行
s.sample(n=3)  # 取3行
s.sample(frac=0.5)  # 小数,行数百分数
s.sample(frac=0.8)  # 小数
s
1    1
dtype: int64






3    3
4    4
5    5
dtype: int64






2    2
3    3
4    4
dtype: int64






2    2
1    1
3    3
0    0
5    5
dtype: int64






0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64
s.sample(n=6, replace=False)  # 默认
s.sample(n=6, replace=True)  # replace,不改变本身
s
5    5
4    4
1    1
0    0
2    2
3    3
dtype: int64






0    0
4    4
0    0
2    2
0    0
0    0
dtype: int64






0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64
# 默认每行都有同样的概率被抽样到,也可以指定每行的概率比重
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
s.sample(n=3, weights=example_weights)

example_weights2 = [0.5, 0, 0, 0, 0, 0]  # 权重将会自动的归一化
s.sample(n=1, weights=example_weights2)
2    2
3    3
5    5
dtype: int64






0    0
dtype: int64
# 可以指定DataFrame的某列作为权重
df2 = pd.DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
df2.sample(n = 3, weights = 'weight_column')
col1weight_column
180.4
270.1
090.5
# 对列进行抽样
df3 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4], 'col3':[3,4,5]})
df3
df3.sample(n=2, axis=1)  # 指定axis=1,对列抽样,抽取的是列的组合
df3.sample(n=2, axis=0)
col1col2col3
0123
1234
2345
col3col2
032
143
254
col1col2col3
2345
0123
# 可以指定random seed 或者 numpy 的 randomState 对象,作为sample 随机数生成器的种子
# 一旦seed确定,随机数不变
df3.sample(n=2, random_state=2)
df3.sample(n=2, random_state=2)
df3.sample(n=2, random_state=200)
df3.sample(n=2, random_state=200)
col1col2col3
2345
1234
col1col2col3
2345
1234
col1col2col3
0123
1234
col1col2col3
0123
1234
# Setting With Enlargement
# loc或[] 操作 可以通过赋值不存在的键扩大Series或Dataframe

# Series
se = pd.Series([1,2,3])
se
se[5] = 5  # append
se
0    1
1    2
2    3
dtype: int64






0    1
1    2
2    3
5    5
dtype: int64
# DataFrame

dfi = pd.DataFrame(np.arange(6).reshape(3,2),columns=['A','B'])
dfi
AB
001
123
245
dfi.loc[:,'C'] = dfi.loc[:,'A']  # enlarge 增加列
dfi
ABC
0010
1232
2454
dfi.loc[3] = 5  # append 增加行
dfi
ABC
0010
1232
2454
3555
# Fast scalar value getting and setting
# 快速取得或设置标量的值
# [] 可以进行很多操作,所以它为了知道你要进行那种操作,会有一点计算开销
# 最快的标量访问方式是使用 at 和 iat 方法,他们可以在所有的数据结构上使用
# at 类似于 loc 基于 label
# iat 类似于 iloc 基于 整数index
s = pd.Series([0,1,2,3,4,5])
s
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df
0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64
ABCD
2000-01-010.845046-0.1197581.354224-0.134199
2000-01-02-1.935579-1.469185-2.5814390.355347
2000-01-03-0.038740-1.5240561.3762571.572331
2000-01-04-0.8469710.189231-0.287885-0.561706
2000-01-05-0.127290-0.0439180.103347-1.055387
2000-01-060.4064371.9176240.8104630.367583
2000-01-070.438904-0.2301900.593607-0.438856
2000-01-08-1.9551221.5312600.889124-0.014259
# get value
s.iat[5]
df.at[dates[5], 'A']
df.iat[3, 0]
5






0.40643702489386246






-0.8469710793801154
# set value
df.at[dates[5], 'E'] = 7  # 没有列就增加一列,没有值的默认赋值为nan
df
ABCDE
2000-01-010.845046-0.1197581.354224-0.134199NaN
2000-01-02-1.935579-1.469185-2.5814390.355347NaN
2000-01-03-0.038740-1.5240561.3762571.572331NaN
2000-01-04-0.8469710.189231-0.287885-0.561706NaN
2000-01-05-0.127290-0.0439180.103347-1.055387NaN
2000-01-060.4064371.9176240.8104630.3675837.0
2000-01-070.438904-0.2301900.593607-0.438856NaN
2000-01-08-1.9551221.5312600.889124-0.014259NaN
df.iat[3, 0] = 7
df
ABCDE
2000-01-010.845046-0.1197581.354224-0.134199NaN
2000-01-02-1.935579-1.469185-2.5814390.355347NaN
2000-01-03-0.038740-1.5240561.3762571.572331NaN
2000-01-047.0000000.189231-0.287885-0.561706NaN
2000-01-05-0.127290-0.0439180.103347-1.055387NaN
2000-01-060.4064371.9176240.8104630.3675837.0
2000-01-070.438904-0.2301900.593607-0.438856NaN
2000-01-08-1.9551221.5312600.889124-0.014259NaN
df.at[dates[-1]+1, 0] = 7  # 行和列都扩展了
df
ABCDE0
2000-01-010.845046-0.1197581.354224-0.134199NaNNaN
2000-01-02-1.935579-1.469185-2.5814390.355347NaNNaN
2000-01-03-0.038740-1.5240561.3762571.572331NaNNaN
2000-01-047.0000000.189231-0.287885-0.561706NaNNaN
2000-01-05-0.127290-0.0439180.103347-1.055387NaNNaN
2000-01-060.4064371.9176240.8104630.3675837.0NaN
2000-01-070.438904-0.2301900.593607-0.438856NaNNaN
2000-01-08-1.9551221.5312600.889124-0.014259NaNNaN
2000-01-09NaNNaNNaNNaNNaN7.0
# Boolean indexing
# 使用布尔向量过滤数据
# 操作符包括:| for or, & for and, and ~ for not ,必须用括号进行分组
s = pd.Series(range(-3, 4))  # 不用把range先list,直接可以series
s
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64
s[s >= 0]  # 直接用series,不用取其值
s[~(s < 0)]
s[(s < -1) | (s > 0.5)]
3    0
4    1
5    2
6    3
dtype: int64






3    0
4    1
5    2
6    3
dtype: int64






0   -3
1   -2
4    1
5    2
6    3
dtype: int64
df[df['A'] > 0]
ABCDE0
2000-01-010.845046-0.1197581.354224-0.134199NaNNaN
2000-01-047.0000000.189231-0.287885-0.561706NaNNaN
2000-01-060.4064371.9176240.8104630.3675837.0NaN
2000-01-070.438904-0.2301900.593607-0.438856NaNNaN
# List comprehensions and map method of Series can also be used to produce more complex criteria:
# 列表生成式和map方法 也可以用来生成 更复杂的条件判断

df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                     'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                     'c' : np.random.randn(7)})
df2
abc
0onex1.706402
1oney0.491977
2twoy1.357986
3threex-1.023513
4twoy-0.653028
5onex0.041052
6sixx1.021882
# 判别式
criterion = df2['a'].map(lambda x: x.startswith('t'))  # 选择 two 和 three
criterion
0    False
1    False
2     True
3     True
4     True
5    False
6    False
Name: a, dtype: bool
df2[criterion]  # 根据 a 列的判别式 选择数据表的一部分,包含其他列
abc
2twoy1.357986
3threex-1.023513
4twoy-0.653028
# 等价的方式,但是速度慢一些 
df2[[x.startswith('t') for x in df2['a']]]  # 不适用map,而是用列表生成式
abc
2twoy1.357986
3threex-1.023513
4twoy-0.653028
# 复合判断
df2[criterion & (df2['b'] == 'x')]  # a 列 和 b 列 均符合某类要求
abc
3threex-1.023513
# 布尔向量选择 可以与索引选择一并使用
df2.loc[criterion & (df2['b'] == 'x'),'b':'c']  # 只选择 b 和 c 两列,不选择c列
bc
3x-1.023513
# Indexing with isin
# 用isin索引
# series
# 对列数据进行判断
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
s
4    0
3    1
2    2
1    3
0    4
dtype: int64
s.isin([2, 4, 6])  # 列数据中是否存在列表中的值,返回布尔值
s[s.isin([2, 4, 6])]  # 可以利用返回的布尔值进行选择
4    False
3    False
2     True
1    False
0     True
dtype: bool






2    2
0    4
dtype: int64
# 也可以对index obj 进行筛选
s[s.index.isin([2, 4, 6])]
s.reindex([2, 4, 6])  # reindex不同,列表中没有的值返回了nan,且原来的int64返回了float64数据类型
4    0
2    2
dtype: int64






2    2.0
4    0.0
6    NaN
dtype: float64
# 对应多重索引,可以单独选择索引级别
s_mi = pd.Series(np.arange(6),index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
s_mi
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int32
s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]  # 指定索引级别,在第二级索引中选择
0  c    2
1  a    3
dtype: int32






0  a    0
   c    2
1  a    3
   c    5
dtype: int32
# DataFrame
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],'ids2': ['a', 'n', 'c', 'n']})
df
idsids2vals
0aa1
1bn2
2fc3
3nn4
values = ['a', 'b', 1, 3]
df.isin(values)  # 匹配所有的值
idsids2vals
0TrueTrueTrue
1TrueFalseFalse
2FalseFalseTrue
3FalseFalseFalse
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
row_mask = df.isin(values)  # 对不用的列,分别匹配某些值
row_mask
# ?row_mask.all  # Return whether all elements are True over requested axis
row_mask = row_mask.all(1)
row_mask
df[row_mask]  # 选择全是True的行
idsids2vals
0TrueTrueTrue
1TrueFalseFalse
2FalseTrueTrue
3FalseFalseFalse
0     True
1    False
2    False
3    False
dtype: bool
idsids2vals
0aa1
# The where() Method and Masking
# To guarantee that selection output has the same shape as the original data
# 保证选集输出与原数据有同样的shape形态

# series
s
s[s > 0]  # 只返回满足的项
s.where(s > 0)  # 全部返回,不满足的项,赋值nan
4    0
3    1
2    2
1    3
0    4
dtype: int64






3    1
2    2
1    3
0    4
dtype: int64






4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64
# DataFrame 使用布尔值选择时,返回值保留原数据结构
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
df[df < 0]
df.where(df < 0)  # 等价于上式
df.where(df < 0, -df)  # where 可以传入另一个参数,用于替换条件为 False 的项,返回copy数据拷贝,不修改原值
df
ABCD
2000-01-01NaN-0.500453NaNNaN
2000-01-02-1.048619NaN-0.295425NaN
2000-01-03-0.214918-1.989228NaNNaN
2000-01-04-2.326385-0.225754-1.331598NaN
2000-01-05NaNNaNNaN-0.089382
2000-01-06-1.041194NaN-1.084442NaN
2000-01-07-1.040706-2.336161NaNNaN
2000-01-08NaN-0.075381-0.951126-0.347865
ABCD
2000-01-01NaN-0.500453NaNNaN
2000-01-02-1.048619NaN-0.295425NaN
2000-01-03-0.214918-1.989228NaNNaN
2000-01-04-2.326385-0.225754-1.331598NaN
2000-01-05NaNNaNNaN-0.089382
2000-01-06-1.041194NaN-1.084442NaN
2000-01-07-1.040706-2.336161NaNNaN
2000-01-08NaN-0.075381-0.951126-0.347865
ABCD
2000-01-01-0.383691-0.500453-0.101632-0.848213
2000-01-02-1.048619-0.856605-0.295425-1.060710
2000-01-03-0.214918-1.989228-0.278514-1.088771
2000-01-04-2.326385-0.225754-1.331598-1.457230
2000-01-05-0.656919-1.081810-1.148303-0.089382
2000-01-06-1.041194-0.533706-1.084442-1.824709
2000-01-07-1.040706-2.336161-0.565496-0.269414
2000-01-08-0.166739-0.075381-0.951126-0.347865
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
# You may wish to set values based on some boolean criteria. This can be done intuitively like so:
# 设置基于布尔值的项值

s2 = s.copy()
s2
s2[s2 < 3] = 0
s2
df2 = df.copy()
df2
df2[df2 < 0] = 0
df2
4    0
3    1
2    2
1    3
0    4
dtype: int64






4    0
3    0
2    0
1    3
0    4
dtype: int64
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
ABCD
2000-01-010.3836910.0000000.1016320.848213
2000-01-020.0000000.8566050.0000001.060710
2000-01-030.0000000.0000000.2785141.088771
2000-01-040.0000000.0000000.0000001.457230
2000-01-050.6569191.0818101.1483030.000000
2000-01-060.0000000.5337060.0000001.824709
2000-01-070.0000000.0000000.5654960.269414
2000-01-080.1667390.0000000.0000000.000000
# where 默认返回修改后的数据拷贝,原值不变;可以设置inplace参数,直接修改原值,而不是创建拷贝

df_orig = df.copy()
df_orig.where(df > 0, -df, inplace=True)
df_orig
ABCD
2000-01-010.3836910.5004530.1016320.848213
2000-01-021.0486190.8566050.2954251.060710
2000-01-030.2149181.9892280.2785141.088771
2000-01-042.3263850.2257541.3315981.457230
2000-01-050.6569191.0818101.1483030.089382
2000-01-061.0411940.5337061.0844421.824709
2000-01-071.0407062.3361610.5654960.269414
2000-01-080.1667390.0753810.9511260.347865
# 注意 :pandas 和 numpy 的where方法不一样
# 一般的,df1.where(m, df2)  相当于 np.where(m, df1, df2)

df.where(df < 0, -df) == np.where(df < 0, df, -df)
ABCD
2000-01-01TrueTrueTrueTrue
2000-01-02TrueTrueTrueTrue
2000-01-03TrueTrueTrueTrue
2000-01-04TrueTrueTrueTrue
2000-01-05TrueTrueTrueTrue
2000-01-06TrueTrueTrueTrue
2000-01-07TrueTrueTrueTrue
2000-01-08TrueTrueTrueTrue
# alignment 定位,对齐
# where 可以选择局部(部分区域)的布尔条件  
# This is analogous to partial setting via .loc (but on the contents rather than the axis labels)

df2 = df.copy()
df2
df2[1:4]   # 行选择
df2.where(df2[1:4] > 0, 3)  # 对不符合项的值进行赋值!
df2[ df2[1:4] > 0 ] = 3  # 只定位部分区域,对符合项的值进行赋值
df2
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
ABCD
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
ABCD
2000-01-013.03.0000003.0000003.000000
2000-01-023.00.8566053.0000001.060710
2000-01-033.03.0000000.2785141.088771
2000-01-043.03.0000003.0000001.457230
2000-01-053.03.0000003.0000003.000000
2000-01-063.03.0000003.0000003.000000
2000-01-073.03.0000003.0000003.000000
2000-01-083.03.0000003.0000003.000000
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486193.000000-0.2954253.000000
2000-01-03-0.214918-1.9892283.0000003.000000
2000-01-04-2.326385-0.225754-1.3315983.000000
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
# Where can also accept axis and level parameters to align the input when performing the where.
# where 可以接受 轴参数axis 和 级别参数level 
df2 = df.copy()
df2
df2.where(df2>0,df2['A'],axis='index')  # 小于等于0的值,赋值为A列的值
df2
df2.apply(lambda x, y: x.where(x>0,y), y=df['A'])  # 相当于上式,但此式较慢,同样不改变原值,而是生成一个拷贝copy
df2
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
ABCD
2000-01-010.3836910.3836910.1016320.848213
2000-01-02-1.0486190.856605-1.0486191.060710
2000-01-03-0.214918-0.2149180.2785141.088771
2000-01-04-2.326385-2.326385-2.3263851.457230
2000-01-050.6569191.0818101.1483030.656919
2000-01-06-1.0411940.533706-1.0411941.824709
2000-01-07-1.040706-1.0407060.5654960.269414
2000-01-080.1667390.1667390.1667390.166739
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
ABCD
2000-01-010.3836910.3836910.1016320.848213
2000-01-02-1.0486190.856605-1.0486191.060710
2000-01-03-0.214918-0.2149180.2785141.088771
2000-01-04-2.326385-2.326385-2.3263851.457230
2000-01-050.6569191.0818101.1483030.656919
2000-01-06-1.0411940.533706-1.0411941.824709
2000-01-07-1.040706-1.0407060.5654960.269414
2000-01-080.1667390.1667390.1667390.166739
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
# where 可以接受一个函数调用,此函数只能有一个参数,且返回有效的布尔条件
df3 = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6],'C': [7, 8, 9]})
df3
df3.where(lambda x: x > 4, lambda x: x + 10)  # x<=4 的值 赋值为  x+10
ABC
0147
1258
2369
ABC
011147
11258
21369
# mask 遮罩 mask is the inverse boolean operation of where. 反向的布尔值操作
s
s.mask(s >= 0)  # 选择<0 的值, 不符合项 置为nan
df
df.mask(df >= 0)
4    0
3    1
2    2
1    3
0    4
dtype: int64






4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64
ABCD
2000-01-010.383691-0.5004530.1016320.848213
2000-01-02-1.0486190.856605-0.2954251.060710
2000-01-03-0.214918-1.9892280.2785141.088771
2000-01-04-2.326385-0.225754-1.3315981.457230
2000-01-050.6569191.0818101.148303-0.089382
2000-01-06-1.0411940.533706-1.0844421.824709
2000-01-07-1.040706-2.3361610.5654960.269414
2000-01-080.166739-0.075381-0.951126-0.347865
ABCD
2000-01-01NaN-0.500453NaNNaN
2000-01-02-1.048619NaN-0.295425NaN
2000-01-03-0.214918-1.989228NaNNaN
2000-01-04-2.326385-0.225754-1.331598NaN
2000-01-05NaNNaNNaN-0.089382
2000-01-06-1.041194NaN-1.084442NaN
2000-01-07-1.040706-2.336161NaNNaN
2000-01-08NaN-0.075381-0.951126-0.347865
# The query() Method (Experimental) 对 DataFrame对象 使用表达式进行选择

# 例如 :选择b列中 在a列和c列两值中间的 行,a<b<c
n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
df
abc
00.2830940.0518070.126487
10.0200970.3730230.147193
20.0919210.8309560.143214
30.3403040.5272460.709769
40.6517220.3445240.151233
50.3966850.5243760.540237
60.5027510.6277080.708038
70.4723380.2697700.586165
80.9375220.2395600.861873
90.6618790.4655360.271580
# pure python
df[(df.a < df.b) & (df.b < df.c)]
abc
30.3403040.5272460.709769
50.3966850.5243760.540237
60.5027510.6277080.708038
# query,传入的是str表达式
df.query('(a < b) & (b < c)')  # 比纯py慢?!
abc
30.3403040.5272460.709769
50.3966850.5243760.540237
60.5027510.6277080.708038
# 对于命名的index索引是低效的,比利用列名称
# 而且如果索引的名称和列名同名,列名优先
df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))
# df
df.index.name = "a"
df
bc
a
004
112
230
303
443
500
604
714
830
914
df.query('a <= b and b <= c')
bc
a
004
112
# 可以不使用索引的名称,而是直接用index,这样同时可以避免与列名重名
df.query('index <= b <= c')
bc
a
004
112
# MultiIndex query() Syntax
# 对于多重索引

n = 10
colors = np.random.choice(['red', 'green'], size=n)
colors
foods = np.random.choice(['eggs', 'ham'], size=n)
foods
array(['green', 'green', 'green', 'green', 'green', 'green', 'green',
       'red', 'green', 'green'], dtype='<U5')






array(['ham', 'eggs', 'eggs', 'eggs', 'eggs', 'ham', 'eggs', 'eggs',
       'ham', 'ham'], dtype='<U4')
index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
index
df = pd.DataFrame(np.random.randn(n, 2), index=index)
df
MultiIndex(levels=[['green', 'red'], ['eggs', 'ham']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 1, 0, 0, 1, 1]],
           names=['color', 'food'])
01
colorfood
greenham-0.2957611.399719
eggs1.8071850.498136
eggs-0.1196402.279162
eggs-0.238709-0.650418
eggs2.2358271.066954
ham1.1567941.694717
eggs-0.037158-0.529213
redeggs0.0467990.763592
greenham0.488240-0.455112
ham-0.169486-0.646891
df.query('color == "red"')
01
colorfood
redeggs0.0467990.763592
# 如果多重索引没有命名 可以使用特殊的名字
df.index.names = [None, None]
df
df.query('ilevel_0 == "red"') # The convention is ilevel_0, which means “index level 0” for the 0th level of the index.
01
greenham-0.2957611.399719
eggs1.8071850.498136
eggs-0.1196402.279162
eggs-0.238709-0.650418
eggs2.2358271.066954
ham1.1567941.694717
eggs-0.037158-0.529213
redeggs0.0467990.763592
greenham0.488240-0.455112
ham-0.169486-0.646891
01
redeggs0.0467990.763592
# query() Use Cases 使用示例
# A use case for query() is when you have a collection of DataFrame objects 
# that have a subset of column names (or index levels/names) in common.   ## 有公共列名称子集的df
# You can pass the same query to both frames 
# without having to specify which frame you’re interested in querying.  ## 传入同样的query
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)
df
df2
abc
00.0960870.7993090.112070
10.0831010.5051510.830588
20.5014720.6927920.663570
30.8728210.4425410.904903
40.5136510.3756170.786898
50.1843680.2852900.055147
60.2849510.2435560.732871
70.4315250.6763850.458296
80.7430570.3260790.434655
90.6109210.7176390.580765
abc
00.8049320.5330580.713190
10.3058930.4600480.879321
20.3444380.9708700.685098
30.3480100.8394490.799309
40.8447130.2675620.771202
50.4035340.0887860.950782
60.1816160.5671180.989711
70.3637360.8520800.140771
80.8661270.2853650.025491
90.3297510.1217160.782729
100.0292530.4194090.051255
110.0574060.1065950.559687
expr = '0.0 <= a <= c <= 0.5'
mp = map(lambda frame: frame.query(expr), [df, df2])  # 同一个表达式,作用在有同样列名的多个df上
for i in mp:
    print(i)
          a         b         c
0  0.096087  0.799309  0.112070
7  0.431525  0.676385  0.458296
           a         b         c
10  0.029253  0.419409  0.051255
# query() Python versus pandas Syntax Comparison


# Full numpy-like syntax
df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))
df
abc
0845
1747
2932
3487
4458
5788
6176
7591
8338
9715
 df.query('(a < b) & (b < c)')
 df[(df.a < df.b) & (df.b < df.c)]   
abc
4458
abc
4458
# 几种其他的不同的写法
df.query('a < b & b < c')  # 去掉括号
df.query('a < b and b < c')  # 使用英文and
df.query('a < b < c')  # 连写,优雅的表达
abc
4458
abc
4458
abc
4458
# The in and not in operators

# get all rows where columns "a" and "b" have overlapping values 
# 得到a列和b列有重叠值的行,a列中的值在 in b列的值中
df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
                       'c': np.random.randint(5, size=12),
                       'd': np.random.randint(9, size=12)})
df
abcd
0aa02
1aa06
2ba33
3ba12
4cb12
5cb12
6db38
7db10
8ec27
9ec01
10fc01
11fc31
df.query('a in b')  # 第一次运行113ms 第二次明显加快 25ms
abcd
0aa02
1aa06
2ba33
3ba12
4cb12
5cb12
df[df.a.isin(df.b)]   # How you'd do it in pure Python  仍然比query快 35ms
abcd
0aa02
1aa06
2ba33
3ba12
4cb12
5cb12
df[~df.a.isin(df.b)]
abcd
6db38
7db10
8ec27
9ec01
10fc01
11fc31
df.query('a not in b')
abcd
6db38
7db10
8ec27
9ec01
10fc01
11fc31
# rows where cols a and b have overlapping values and col c's values are less than col d's
df.query('a in b and c < d')
df[df.a.isin(df.b) & (df.c < df.d)]
abcd
0aa02
1aa06
3ba12
4cb12
5cb12
abcd
0aa02
1aa06
3ba12
4cb12
5cb12
# 注意:?????
# Note that in and not in are evaluated in Python, since numexpr has no equivalent of this operation. 
# However, only the in/not in expression itself is evaluated in vanilla Python. 
# For example, in the expression
df.query('a in b + c + d')
# (b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.
# Special use of the == operator with list objects
# 特别的用法,== 用于list
# Comparing a list of values to a column using ==/!= works similarly to in/not in
# 用==/!=比较列表的值与列的值,类似于in/not in
df
abcd
0aa02
1aa06
2ba33
3ba12
4cb12
5cb12
6db38
7db10
8ec27
9ec01
10fc01
11fc31
df.query('b == ["a", "b", "c"]')  # b列中有列表中值的
abcd
0aa02
1aa06
2ba33
3ba12
4cb12
5cb12
6db38
7db10
8ec27
9ec01
10fc01
11fc31
df.query('c == [1, 2]')
df.query('[1, 2] in c')
df[df.c.isin([1, 2])]
abcd
3ba12
4cb12
5cb12
7db10
8ec27
abcd
3ba12
4cb12
5cb12
7db10
8ec27
abcd
3ba12
4cb12
5cb12
7db10
8ec27
# Boolean Operators 布尔操作符
# ~ 或 not
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
df
df['bools'] = np.random.rand(len(df)) > 0.5
df
abc
00.5286710.2658700.932892
10.5200690.0478950.478818
20.6485950.1807440.838445
30.1456140.9061900.762163
40.6126370.0272320.778020
50.6405650.5355380.018280
60.6336640.1246540.185709
70.6350880.3773000.914968
80.7026840.5044590.107014
90.6357570.2611440.665611
abcbools
00.5286710.2658700.932892False
10.5200690.0478950.478818True
20.6485950.1807440.838445True
30.1456140.9061900.762163False
40.6126370.0272320.778020True
50.6405650.5355380.018280True
60.6336640.1246540.185709True
70.6350880.3773000.914968True
80.7026840.5044590.107014False
90.6357570.2611440.665611False
df.query('~bools')
df.query('not bools')
abcbools
00.5286710.2658700.932892False
30.1456140.9061900.762163False
80.7026840.5044590.107014False
90.6357570.2611440.665611False
abcbools
00.5286710.2658700.932892False
30.1456140.9061900.762163False
80.7026840.5044590.107014False
90.6357570.2611440.665611False
# 任意组合布尔表达式
shorter = df.query('a < b < c and (not bools) or bools > 2')  # short query syntax
longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]  # equivalent in pure Python
shorter
longer
# Performance of query()
# DataFrame.query() using numexpr is slightly faster than Python for large frames
# Note : You will only see the performance benefits of using the numexpr engine with DataFrame.query() 
# if your frame has more than approximately 200,000 rows

# query() 效能在大数据时高于数字表达式
# Duplicate Data
# 识别和剔除重复数据,两个方法:duplicated and drop_duplicates,他们的参数都是列
# duplicated 返回布尔向量,标示重复的行
# drop_duplicates 删除重复列
# 默认保留第一个找到的值,但可以通过keep参数指定保留的值
# keep='first' (default): mark / drop duplicates except for the first occurrence. 只保留第一个
# keep='last': mark / drop duplicates except for the last occurrence. 只保留最后一个重复值
# keep=False: mark / drop all duplicates.  剔除所有重复值
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})
df2
abc
0onex2.206877
1oney-1.852209
2twox-0.706555
3twoy-1.007860
4twox-0.185356
5threex-0.687592
6fourx-2.052029
df2.duplicated('a')
df2.duplicated("a", keep="last")
df2.duplicated('a', keep=False)
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool






0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool






0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool
df2.drop_duplicates('a')
df2.drop_duplicates('a', keep="last")
df2.drop_duplicates('a', keep=False)
abc
0onex2.206877
2twox-0.706555
5threex-0.687592
6fourx-2.052029
abc
1oney-1.852209
4twox-0.185356
5threex-0.687592
6fourx-2.052029
abc
5threex-0.687592
6fourx-2.052029
# 传入一个列表作为参数
df2.duplicated(['a', 'b'])  # a b 两列看做一个整体,标示重复值
df2.drop_duplicates(['a', 'b'])
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool
abc
0onex2.206877
1oney-1.852209
2twox-0.706555
3twoy-1.007860
5threex-0.687592
6fourx-2.052029
# index索引去重  index.duplicated
df3 = pd.DataFrame({'a': np.arange(6),'b': np.random.randn(6)},index=['a', 'a', 'b', 'c', 'b', 'a'])
df3
ab
a01.446937
a1-1.740284
b2-0.604590
c30.096239
b40.823314
a51.990803
df3.index.duplicated()
df3[~df3.index.duplicated()]
df3[~df3.index.duplicated(keep='last')]
df3[~df3.index.duplicated(keep=False)]
array([False,  True, False, False,  True,  True])
ab
a01.446937
b2-0.604590
c30.096239
ab
c30.096239
b40.823314
a51.990803
ab
c30.096239
# Dictionary-like get() method get方法
# Each of Series, DataFrame, and Panel have a get method which can return a default value.
s = pd.Series([1,2,3], index=['a','b','c'])
s.get("a")  # 相当于s["a"]
s.get("x", default=-1)  # 可以对不存在的index赋值
1






-1
# The lookup() Method
# 按一定的顺序取得行/列的值
dflookup = pd.DataFrame(np.random.rand(20,4), columns = ['A','B','C','D'])
dflookup
ABCD
00.9426790.7273160.6583450.465770
10.4372780.2251580.4365220.164805
20.2709470.2802230.3098000.015967
30.5598360.6309620.6736780.712503
40.5353720.9898870.6615670.361962
50.7263220.6011920.5478580.477509
60.8294110.5836130.8716470.460966
70.3657220.5636600.1649540.134314
80.8173340.5230030.6844920.369386
90.7109060.6936330.6188770.888263
100.0269530.1852170.7462350.568846
110.1927650.9272000.4317360.084300
120.9884600.7843200.1801450.039405
130.2333490.1486780.8882100.004917
140.1051300.1467240.2613700.936558
150.7012240.7918600.4200830.391538
160.2901860.3079930.1394290.618879
170.1821320.1744200.8455010.647986
180.7320090.9199500.1973610.582814
190.5680960.5391250.2690160.537584
list(range(0,10,2))
dflookup.lookup(list(range(0,10,2)), ['B','C','A','B','D'])
[0, 2, 4, 6, 8]






array([0.72731646, 0.30979986, 0.53537223, 0.58361289, 0.36938598])
# index objets 索引对象

# 索引可以通过list或其他序列对象直接创建
index = pd.Index(['e', 'd', 'a', 'b'])
index

# 可以命名
index = pd.Index(['e', 'd', 'a', 'b'], name='something')
index.name
index
Index(['e', 'd', 'a', 'b'], dtype='object')






'something'






Index(['e', 'd', 'a', 'b'], dtype='object', name='something')
index = pd.Index(list(range(5)), name='rows')
index
columns = pd.Index(['A', 'B', 'C'], name='cols')
columns
Int64Index([0, 1, 2, 3, 4], dtype='int64', name='rows')






Index(['A', 'B', 'C'], dtype='object', name='cols')
df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)  # 使用索引给列命名,列名是一个索引对象
df
colsABC
rows
01.8681650.704614-2.049472
1-0.8788100.706876-0.741121
2-2.649523-0.9522110.806387
30.369280-0.052788-0.995775
40.218246-0.034493-0.198815
# Setting metadata 设置元数据
# 索引“多半”是“不可变”,但是可以设置和改变其元数据,比如索引的名称,或者,多重索引的级别和标签
# 可以使用 rename, set_names, set_levels, and set_labels 设置这些属性。
# 默认返回一个拷贝(不修改原值),也可以就地修改in place
ind = pd.Index([1, 2, 3])
ind
Int64Index([1, 2, 3], dtype='int64')
ind.name = "ind"
ind
ind.rename("apple")
ind
ind.name = "apple"
ind
ind.set_names(["bob"], inplace=True)
ind
Int64Index([1, 2, 3], dtype='int64', name='ind')






Int64Index([1, 2, 3], dtype='int64', name='apple')






Int64Index([1, 2, 3], dtype='int64', name='ind')






Int64Index([1, 2, 3], dtype='int64', name='apple')






Int64Index([1, 2, 3], dtype='int64', name='bob')
# set_names, set_levels, and set_labels also take an optional level` argument
# df.query('ilevel_0 == "red"') 在query中使用 ilevel_0 ……
index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
index
MultiIndex(levels=[[0, 1, 2], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
index.levels[1]
index.levels[0]
index.set_levels(["a", "b"], level=1)
Index(['one', 'two'], dtype='object', name='second')






Int64Index([0, 1, 2], dtype='int64', name='first')






MultiIndex(levels=[[0, 1, 2], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
# Set operations on Index objects 索引对象中的集合操作 
# Note: The resulting index from a set operation will be sorted in ascending order. 
# 注意: 索引的集合操作返回的是升序排序后的结果
a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])
a | b  # 并集
a & b  # 交集
# a - b  # 不支持
a.difference(b)  # 差集,非对称
a.symmetric_difference(b)  # 对称差集,相当于idx1.difference(idx2).union(idx2.difference(idx1))
a ^ b  # 对称差集
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')






Index(['c'], dtype='object')






Index(['a', 'b'], dtype='object')






Index(['a', 'b', 'd', 'e'], dtype='object')






Index(['a', 'b', 'd', 'e'], dtype='object')
# Missing values 缺失值
# 注意:索引能够容纳缺失值,但是应该避免这样的情况。因为可能出现不可预料的结果,例如有些操作默认排除缺失值。

idx1 = pd.Index([1, np.nan, 3, 4])
idx1
Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')
idx1.fillna(2)  # 缺失值赋值为2,nan的index赋值为2
Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')
idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'), pd.NaT, pd.Timestamp('2011-01-03')])  # pd的缺失值是NaT,与np不同
idx2
DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)
idx2.fillna(pd.Timestamp('2011-01-02'))
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
# Set / Reset Index 设置/重设索引


# DataFrame
data = pd.DataFrame({"a":["bar", "bar", "foo", "foo"],
                    "b":["one","two","one","two"],
                    "c":["z","y","x", "w"],
                    "d":range(1,5)})  # range 不用list
data
abcd
0baronez1
1bartwoy2
2fooonex3
3footwow4
indexed1 = data.set_index("c")  # 把df的某列设置为index索引
indexed1
abd
c
zbarone1
ybartwo2
xfooone3
wfootwo4
indexed2 = data.set_index(['a', 'b'])  # 把df的多列设置为多重索引
indexed2
cd
ab
baronez1
twoy2
fooonex3
twow4
frame = data.set_index('c', drop=False)  # drop参数,可以设置索引时不删除列,默认为删除列
frame
frame = frame.set_index(['a', 'b'], append=True)  # append参数,在原有索引的基础上,增加索引,变成复合索引
frame
abcd
c
zbaronez1
ybartwoy2
xfooonex3
wfootwow4
cd
cab
zbaronez1
ybartwoy2
xfooonex3
wfootwow4
data.set_index('c', drop=False)  # 把某列设为索引,但不删除
data
data.set_index(['a', 'b'], inplace=True)  # 就地修改,而不是返回拷贝
data
abcd
c
zbaronez1
ybartwoy2
xfooonex3
wfootwow4
abcd
0baronez1
1bartwoy2
2fooonex3
3footwow4
cd
ab
baronez1
twoy2
fooonex3
twow4
# Reset the index 重置索引
# Note: The reset_index method used to be called delevel which is now deprecated. 不推荐使用delevel

data
data.index.name  # 无返回值?
data.index.names  # 索引列的名称
data.reset_index()  # 重置索引,原索引变为普通列,原索引名称变为列名
cd
ab
baronez1
twoy2
fooonex3
twow4
FrozenList(['a', 'b'])
abcd
0baronez1
1bartwoy2
2fooonex3
3footwow4
frame
frame.reset_index(level=1)  # 对于多重索引,可以指定重置哪一级索引,而不是全部重置
cd
cab
zbaronez1
ybartwoy2
xfooonex3
wfootwow4
acd
cb
zonebarz1
ytwobary2
xonefoox3
wtwofoow4
# reset_index方法可以使用drop参数,若其为true,则仅是把索引剔除,而不转换为df数据列
frame
frame.reset_index(level=2, drop=True) 
cd
cab
zbaronez1
ybartwoy2
xfooonex3
wfootwow4
cd
ca
zbarz1
ybary2
xfoox3
wfoow4
# Adding an ad hoc index
# If you create an index yourself, you can just assign it to the index field
# 创建index对象,并赋值给df
index = pd.MultiIndex.from_product([range(2), ['one', 'two']], names=['first', 'second'])
index
data
data.index = index  # 注意:index的长度要与df向适应
data
MultiIndex(levels=[[0, 1], ['one', 'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['first', 'second'])
abcd
0baronez1
1bartwoy2
2fooonex3
3footwow4
abcd
firstsecond
0onebaronez1
twobartwoy2
1onefooonex3
twofootwow4
# Returning a view versus a copy 返回视图 对比 返回拷贝
# 避免链式操作,即多个连接的[],具体见前面的说明

# Why does assignment fail when using chained indexing? 为什么链式索引赋值失败
# python解释器,对以下赋值的区别

# 第一种:非链式赋值
dfmi.loc[:,('one','second')] = value
# becomes, python解释为:
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)  # 直接set
# Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.

# 第二种:链式赋值
dfmi['one']['second'] = value  # pd会抛出异常SettingWithCopy,链式操作具有不确定性!
# becomes, python解释为:
dfmi.__getitem__('one').__setitem__('second', value)  # 先get,再set,前面的get返回的是一个copy
# SettingWithCopy有时候会在没有明显的链式操作的情况下出现,例如:
def do_something(df):
   foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
   # ... many lines here ...
   foo['quux'] = value       # We don't know whether this will modify df or not!  # “隐式”的链式操作
   return foo
# Evaluation order matters 赋值命令事情
# 链式赋值操作引起的SettingWithCopyWarning,可以通过设置option mode取消或抑制。

dfb = pd.DataFrame({'a' : ['one', 'one', 'two','three', 'two', 'one', 'six'],'c' : np.arange(7)})
dfb
ac
0one0
1one1
2two2
3three3
4two4
5one5
6six6
# This will show the SettingWithCopyWarning
# but the frame values will be set
dfb['c'][dfb.a.str.startswith('o')] = 42
dfb
d:\python\36-64\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
ac
0one42
1one42
2two2
3three3
4two4
5one42
6six6
# This however is operating on a copy and will not work.
# 把option的提示级别由warn改为raise ????
pd.set_option('mode.chained_assignment','warn')  # This however is operating on a copy and will not work. # ??
# A chained assignment can also crop up in setting in a mixed dtype frame. 出现在
# Note : These setting rules apply to all of .loc/.iloc 

# This is the correct access method 正确的存取方法
dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})
dfc
dfc.loc[0,'A'] = 11
dfc
AB
0aaa1
1bbb2
2ccc3
AB
0111
1bbb2
2ccc3
dfc = dfc.copy()
dfc
dfc['A'][0] = 111  # This can work at times, but is not guaranteed, and so should be avoided 有时可以工作,但不保证,应避免使用
dfc
AB
0111
1bbb2
2ccc3
d:\python\36-64\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
AB
01111
1bbb2
2ccc3
pd.set_option('mode.chained_assignment', 'raise')  # 链式赋值将不会执行
# http://pandas.pydata.org/pandas-docs/stable/indexing.html 全文完
# 2018-02-19

转载于:https://www.cnblogs.com/shigubuhua/p/8459134.html

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值