Pandas Notes (Part 1)

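Contents

1. Creating a DataFrame

From a dict

From a list of dicts

From a NumPy array

From a native Python list

2. Selecting rows and columns

Return types

Single column

Single row

Multiple columns

Multiple rows

Whether a selection returns a copy

Single column

Single row

Multiple columns

Multiple rows

3. The values attribute

4. Boolean indexing

5. index

reindex

About index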

 

 

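1. Creating a DataFrame

From a dict

The dict keys become the column names, and the length of the values (not of the dict) determines the number of rows. Array-like values must all have the same length. A scalar value is broadcast to that length, which is why 'A', 'B' and 'F' below fill all four rows. A list or array of any other length, including length 1, raises a ValueError.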

In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})
   ...: 

In [10]: df2
Out[10]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
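The broadcasting rule can be checked on a smaller frame. The following is a minimal sketch, not part of the original notes; the variable names (`ok`, `mismatch_error`, `scalars_only`) are made up for illustration.

```python
import pandas as pd

# A scalar is broadcast to the length set by the array-like values.
ok = pd.DataFrame({"A": 1.0, "B": [10, 20, 30]})
print(ok.shape)

# Array-like values of different lengths are rejected, even a length-1 list.
try:
    pd.DataFrame({"A": [1.0], "B": [10, 20, 30]})
    mismatch_error = None
except ValueError as ex:
    mismatch_error = ex
print(repr(mismatch_error))

# With only scalars there is no length to broadcast to,
# so an explicit index has to supply it.
scalars_only = pd.DataFrame({"A": 1.0, "F": "foo"}, index=range(2))
print(scalars_only.shape)
```

In the `df2` example above, the Series and the NumPy array (both of length 4) play the role that `"B"` plays here: they fix the number of rows that the scalars are stretched to.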

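From a list of dicts

The length of the list is the number of records: each element (a dict here) is one record, and the dict keys are the column names. Values that share a key across records form one column, regardless of the key order inside each dict.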

from pandas import DataFrame
d = [{"f1": 1, "f2": 2, "f3": 3},
     {"f2": 12, "f1": 14, "f3": 16},
     {"f3": 25, "f2": 24, "f1": 26},
     {"f1": 35, "f2": 34, "f3": 36}]
df = DataFrame(d)
print(df)

Output:

   f1  f2  f3
0   1   2   3
1  14  12  16
2  26  24  25
3  35  34  36
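Because columns are matched by key, records do not need to carry the same keys. The sketch below (with a made-up `records` list) shows that a key missing from a record leaves a NaN in that row.

```python
import pandas as pd

# Two records with partly different keys
records = [{"f1": 1, "f2": 2},
           {"f2": 12, "f3": 16}]
df = pd.DataFrame(records)
print(df)

# The columns are the union of all keys; missing entries become NaN.
print(sorted(df.columns))
print(df.isna().sum().to_dict())
```

A column that received a NaN is typically stored as float, so `f1` and `f3` no longer print as integers here.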

From a NumPy array

import numpy as np
df = DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
print(df)

Output:

          A         B         C         D
0  0.039461  0.774348 -0.007067  0.738565
1 -0.142427 -0.287318  0.743472 -0.609328
2 -0.731498  0.095589 -0.664986  0.078787
3  0.583649  0.036846  0.050926  0.911483
4 -0.822104  0.260254  1.887518 -1.561972
5  0.391799 -0.392966  0.681349 -1.643266

From a native Python list

d = [[1, 2, 3, 4],
     [11, 12, 13, 14],
     [21, 22, 23, 24]]
# each inner list is one row; the column names are given explicitly
df = DataFrame(d, columns=list('ABCD'))
print(df)

Output:

    A   B   C   D
0   1   2   3   4
1  11  12  13  14
2  21  22  23  24

 

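2. Selecting rows and columns

  • Selecting multiple rows or columns (with a slice or with a bracketed list of labels or positions; a slice or list that happens to select a single row or column behaves the same way)
    • The result is always a DataFrame.
    • A slice cannot be wrapped in another pair of brackets; inside the brackets the items must be listed one by one. For example, write origin_df.iloc[:, [1, 2, 3]], not origin_df.iloc[:, [1:3]], which is a syntax error. The same holds for rows.
    • Slicing by column name or index label is closed on both ends: both endpoints are included.
    • Slicing by position is half-open: the right endpoint is excluded.
    • The values attribute returns a two-dimensional numpy.ndarray.
  • Selecting a single row or column: the result type depends on how the selection is written
    • With the label wrapped in an extra pair of brackets (for example origin_df[['A']]), the result is a DataFrame and values returns a two-dimensional array.
    • Otherwise the result is a Series (for example origin_df['A']) and values returns a one-dimensional numpy.ndarray.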
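The difference between label slices and positional slices is easy to get wrong, so here is a small sketch on the same kind of frame used in these notes. The names `by_label` and `by_position` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  columns=list("ABCD"), index=list("abcdef"))

# Label slices include both endpoints: rows b, c, d and columns B, C.
by_label = df.loc["b":"d", "B":"C"]

# Positional slices exclude the right endpoint: rows 1, 2 and column 1 only.
by_position = df.iloc[1:3, 1:2]

print(list(by_label.index), list(by_label.columns))
print(list(by_position.index), list(by_position.columns))
```

Both results are DataFrames, even though `by_position` has a single column, because a slice never reduces the dimension.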

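Whether a copy is returned

  • In the pandas version used for these notes, a selection written with an extra pair of brackets (a list of labels or positions; slices cannot be wrapped in brackets) returns a copy, so modifying it does not affect the source.
  • Otherwise (a single label or a slice) no copy is made, and in the examples below modifying the result does change the source. Enumerating several items always requires the brackets.
  • This behaviour depends on the pandas version and is not guaranteed: pandas may emit SettingWithCopyWarning for such writes, and with Copy-on-Write enabled a selection never modifies the source. To modify the source, assign through .loc or .iloc on the source itself; to get an independent object, call .copy().

drop does not modify the source by default (only when inplace=True is passed).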
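Since whether a selection is a view or a copy varies between pandas versions, the sketch below sticks to the patterns whose outcome does not depend on that: an explicit `.copy()`, a write through `.loc` on the source itself, and `drop` with its default arguments. The frame and variable names are made up for illustration.

```python
import pandas as pd

source = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]},
                      index=["u", "v"])

# An explicit copy is always independent of the source.
independent = source[["A"]].copy()
independent.iloc[0, 0] = "changed in copy"

# Writing through .loc on the source itself always updates the source.
source.loc["v", "B"] = "changed in source"

# drop returns a new object by default; the source keeps all its columns.
dropped = source.drop(columns=["B"])

print(source)
print(list(dropped.columns))
```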

Return types

Single column

import numpy as np
from pandas import DataFrame

origin_df = DataFrame(np.arange(24).reshape(6, 4),
                      columns=list('ABCD'), index=list('abcdef'))
print("column: {}\ncolumn in list: {}\nloc column: {}\nloc column in list: {}\n"
      "iloc column: {}\niloc column in list: {}".format(
    type(origin_df['B']), type(origin_df[['B']]),
    type(origin_df.loc[:, 'B']), type(origin_df.loc[:, ['B']]),
    type(origin_df.iloc[:, 1]), type(origin_df.iloc[:, [1]])
    ))
print("=" * 20)
try:
    # KeyError: the first axis of .loc is the rows, and no row is labeled 'B'
    print(origin_df.loc['B'])
except KeyError as ex:
    print(repr(ex))

Output:

column: <class 'pandas.core.series.Series'>
column in list: <class 'pandas.core.frame.DataFrame'>
loc column: <class 'pandas.core.series.Series'>
loc column in list: <class 'pandas.core.frame.DataFrame'>
iloc column: <class 'pandas.core.series.Series'>
iloc column in list: <class 'pandas.core.frame.DataFrame'>
====================
KeyError('B')

Single row

origin_df = DataFrame(np.arange(24).reshape(6, 4),
                      columns=list('ABCD'), index=list('abcdef'))
print("loc row: {}\nloc row in list: {}\niloc row: {}\niloc row in list: {}".format(
    type(origin_df.loc['a']), type(origin_df.loc[['d'], :]),
    type(origin_df.iloc[0, :]), type(origin_df.iloc[[1], :])
    ))
print(origin_df.loc['b'])  # short form
print(origin_df.iloc[0])  # short form
print("=" * 20)
try:
    # KeyError: [] with a single label looks up a column, and no column is named 'b'
    print(origin_df['b'])
except KeyError as ex:
    print(repr(ex))

Output:

loc row: <class 'pandas.core.series.Series'>
loc row in list: <class 'pandas.core.frame.DataFrame'>
iloc row: <class 'pandas.core.series.Series'>
iloc row in list: <class 'pandas.core.frame.DataFrame'>
A    4
B    5
C    6
D    7
Name: b, dtype: int64
A    0
B    1
C    2
D    3
Name: a, dtype: int64
====================
KeyError('b')

Multiple columns

origin_df = DataFrame(np.arange(24).reshape(6, 4),
                      columns=list('ABCD'), index=list('abcdef'))
# Note: origin_df['B':'C'] slices rows, not columns. It is empty here,
# but it is still a DataFrame.
print("[] with label slice: {}\ncolumns in list: {}\nloc columns: {}\nloc columns in list: {}\n"
      "iloc columns: {}\niloc columns in list: {}".format(
    type(origin_df['B':'C']), type(origin_df[['B', "D"]]),
    type(origin_df.loc[:, 'B':"D"]), type(origin_df.loc[:, ['B', "D"]]),
    type(origin_df.iloc[:, 1:3]), type(origin_df.iloc[:, [1, 3]])
    ))

print(origin_df.loc[:, 'B':"D"])
print("=" * 20)
# No error, but not a column selection either: a slice on the first axis of
# .loc selects rows, and no row label lies between 'B' and 'C'.
print(origin_df.loc['B':"C"])

Output:

[] with label slice: <class 'pandas.core.frame.DataFrame'>
columns in list: <class 'pandas.core.frame.DataFrame'>
loc columns: <class 'pandas.core.frame.DataFrame'>
loc columns in list: <class 'pandas.core.frame.DataFrame'>
iloc columns: <class 'pandas.core.frame.DataFrame'>
iloc columns in list: <class 'pandas.core.frame.DataFrame'>
    B   C   D
a   1   2   3
b   5   6   7
c   9  10  11
d  13  14  15
e  17  18  19
f  21  22  23
====================
Empty DataFrame
Columns: [A, B, C, D]
Index: []

Multiple rows

origin_df = DataFrame(np.arange(24).reshape(6, 4),
                      columns=list('ABCD'), index=list('abcdef'))
print("loc rows: {}\nloc rows in list: {}\niloc rows: {}\niloc rows in list: {}".format(
    type(origin_df.loc['a':'b', :]), type(origin_df.loc[['d', 'e'], :]),
    type(origin_df.iloc[0:1, :]), type(origin_df.iloc[[1, 3], :])
    ))
print(origin_df.loc['b':'d'])  # short form
print(origin_df.iloc[0:2])  # short form
print("=" * 20)

# Works: [] with a slice selects rows, unlike [] with a single label
print(origin_df['b':'d'])

Output:

loc rows: <class 'pandas.core.frame.DataFrame'>
loc rows in list: <class 'pandas.core.frame.DataFrame'>
iloc rows: <class 'pandas.core.frame.DataFrame'>
iloc rows in list: <class 'pandas.core.frame.DataFrame'>
    A   B   C   D
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
   A  B  C  D
a  0  1  2  3
b  4  5  6  7
====================
    A   B   C   D
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15

Whether a selection returns a copy

# Row i holds A<i>, B<i>, ..., F<i>
origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))
print(origin_df)

Output:

    A   B   C   D   E   F
u  A0  B0  C0  D0  E0  F0
v  A1  B1  C1  D1  E1  F1
w  A2  B2  C2  D2  E2  F2
x  A3  B3  C3  D3  E3  F3
y  A4  B4  C4  D4  E4  F4
z  A5  B5  C5  D5  E5  F5

Single column

origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))

col_1 = origin_df['A']
col_1.iloc[0] = "col"
col_2 = origin_df[['B']]
col_2.iloc[0, 0] = "col in list"

col_3 = origin_df.loc[:, "C"]
col_3.iloc[0] = "loc col"
col_4 = origin_df.loc[:, ["D"]]
col_4.iloc[0, 0] = "loc col in list"

col_5 = origin_df.iloc[:, 4]
col_5.iloc[0] = "iloc col"
col_6 = origin_df.iloc[:, [5]]
col_6.iloc[0, 0] = "iloc col in list"
print(origin_df)

Output (only the selections written without the extra brackets changed the source):

     A   B        C   D         E   F
u  col  B0  loc col  D0  iloc col  F0
v   A1  B1       C1  D1        E1  F1
w   A2  B2       C2  D2        E2  F2
x   A3  B3       C3  D3        E3  F3
y   A4  B4       C4  D4        E4  F4
z   A5  B5       C5  D5        E5  F5

Single row

origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))

row_3 = origin_df.loc['w', :]
row_3.iloc[0] = "loc row"
row_4 = origin_df.loc[['x'], :]
row_4.iloc[0, 0] = "loc row in list"

row_5 = origin_df.iloc[4, :]
row_5.iloc[0] = "iloc row"
row_6 = origin_df.iloc[[5], :]
row_6.iloc[0, 0] = "iloc row in list"
print(origin_df)

Output:

          A   B   C   D   E   F
u        A0  B0  C0  D0  E0  F0
v        A1  B1  C1  D1  E1  F1
w   loc row  B2  C2  D2  E2  F2
x        A3  B3  C3  D3  E3  F3
y  iloc row  B4  C4  D4  E4  F4
z        A5  B5  C5  D5  E5  F5

Multiple columns

origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))

# origin_df['A':'B'] slices rows, so it selects nothing here:
# Empty DataFrame, Columns: [A, B, C, D, E, F], Index: []
# col_1 = origin_df['A':'B']
col_2 = origin_df[['A', "B"]]
col_2.iloc[1, 0] = "cols in list"

col_3 = origin_df.loc[:, "C":"D"]
col_3.iloc[2, 0] = "loc cols"
print("label slice col_3\n {}".format(col_3))
col_4 = origin_df.loc[:, ["C", "D"]]
col_4.iloc[3, 0] = "loc cols in list"

col_5 = origin_df.iloc[:, 4:5]
col_5.iloc[4, 0] = "iloc cols"
print("=" * 20)
print("positional slice col_5\n {}".format(col_5))
col_6 = origin_df.iloc[:, [4, 5]]
col_6.iloc[5, 0] = "iloc cols in list"
print("=" * 20)
print(origin_df)

Output:

label slice col_3
            C   D
u        C0  D0
v        C1  D1
w  loc cols  D2
x        C3  D3
y        C4  D4
z        C5  D5
====================
positional slice col_5
             E
u         E0
v         E1
w         E2
x         E3
y  iloc cols
z         E5
====================
    A   B         C   D          E   F
u  A0  B0        C0  D0         E0  F0
v  A1  B1        C1  D1         E1  F1
w  A2  B2  loc cols  D2         E2  F2
x  A3  B3        C3  D3         E3  F3
y  A4  B4        C4  D4  iloc cols  F4
z  A5  B5        C5  D5         E5  F5

Multiple rows

origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))

row_3 = origin_df.loc['w':'x', :]
row_3.iloc[1, 2] = "loc rows"
print("label slice row_3\n {}".format(row_3))
row_4 = origin_df.loc[['w', 'x'], :]
row_4.iloc[1, 3] = "loc rows in list"

row_5 = origin_df.iloc[4:5, :]
row_5.iloc[0, 4] = "iloc rows"
print("=" * 20)
print("positional slice row_5\n {}".format(row_5))
row_6 = origin_df.iloc[[4, 5], :]
row_6.iloc[1, 5] = "iloc rows in list"
print("=" * 20)
print(origin_df)

Output:

label slice row_3
     A   B         C   D   E   F
w  A2  B2        C2  D2  E2  F2
x  A3  B3  loc rows  D3  E3  F3
====================
positional slice row_5
     A   B   C   D          E   F
y  A4  B4  C4  D4  iloc rows  F4
====================
    A   B         C   D          E   F
u  A0  B0        C0  D0         E0  F0
v  A1  B1        C1  D1         E1  F1
w  A2  B2        C2  D2         E2  F2
x  A3  B3  loc rows  D3         E3  F3
y  A4  B4        C4  D4  iloc rows  F4
z  A5  B5        C5  D5         E5  F5

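3. The values attribute

  • Selecting multiple rows or columns (with a slice or a bracketed list; a slice or list that selects a single row or column behaves the same way)
    • values returns a two-dimensional numpy.ndarray.
  • Selecting a single row or column: the result depends on how the selection is written
    • With the label wrapped in an extra pair of brackets (for example origin_df[['A']]), values returns a two-dimensional array.
    • Otherwise values returns a one-dimensional numpy.ndarray.
  • Whether a copy is returned
    • In the pandas version used for these notes, a selection written with an extra pair of brackets returns a copy (slices cannot be wrapped in brackets), so modifying it does not affect the source.
    • Otherwise no copy is made and modifications do affect the source; enumerating several items always requires the brackets. This is version-dependent and should not be relied on.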

col_1 = origin_df['A']
print("col_1\n{}".format(col_1.values))
col_2 = origin_df[['B']]
print("col_2\n{}".format(col_2.values))

row_3 = origin_df.loc['w', :]
print("row_3\n{}".format(row_3.values))
row_4 = origin_df.loc[['x'], :]
print("row_4\n{}".format(row_4.values))

col_2 = origin_df[['A', "B"]]
print("col_2\n{}".format(col_2.values))

# A slice that selects a single column still behaves like a multi-column selection
col_3 = origin_df.loc[:, "C":"C"]
print("col_3 (slice selecting one column)\n{}".format(col_3.values))

row_5 = origin_df.iloc[4:5, :]
print("row_5\n {}".format(row_5.values))
row_6 = origin_df.iloc[[4, 5], :]
print("row_6\n {}".format(row_6.values))

Output:

col_1
['A0' 'A1' 'A2' 'A3' 'A4' 'A5']
col_2
[['B0']
 ['B1']
 ['B2']
 ['B3']
 ['B4']
 ['B5']]
row_3
['A2' 'B2' 'C2' 'D2' 'E2' 'F2']
row_4
[['A3' 'B3' 'C3' 'D3' 'E3' 'F3']]
col_2
[['A0' 'B0']
 ['A1' 'B1']
 ['A2' 'B2']
 ['A3' 'B3']
 ['A4' 'B4']
 ['A5' 'B5']]
col_3 (slice selecting one column)
[['C0']
 ['C1']
 ['C2']
 ['C3']
 ['C4']
 ['C5']]
row_5
 [['A4' 'B4' 'C4' 'D4' 'E4' 'F4']]
row_6
 [['A4' 'B4' 'C4' 'D4' 'E4' 'F4']
 ['A5' 'B5' 'C5' 'D5' 'E5' 'F5']]
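The dimensionality rule can be confirmed directly from the array shapes. This sketch uses `to_numpy()`, the method pandas provides alongside the `values` attribute; the small frame and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

series_values = df["A"].to_numpy()            # Series -> 1-D
frame_values = df[["A"]].to_numpy()           # one-column DataFrame -> 2-D
slice_values = df.loc[:, "C":"C"].to_numpy()  # a slice stays a DataFrame -> 2-D

print(series_values.shape, frame_values.shape, slice_values.shape)
```

If a one-dimensional array is needed from a one-column DataFrame, `frame_values.ravel()` flattens it.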

4. Boolean indexing

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
                   A         B         C         D
2013-01-01  0.161101  0.364128  1.735634 -0.835255
2013-01-02  1.164170  0.384188  0.302318 -0.293224
2013-01-03  1.116850  1.469352  0.867080 -0.420124
2013-01-04  0.952359  1.056309 -2.857191  0.668887
2013-01-05 -0.097658 -0.794298  1.387195 -0.897870
2013-01-06 -0.270472 -1.841921  2.008927  1.140431

Sorting: sort_values returns a sorted copy (df itself is unchanged)

df.sort_values(by='B')
                   A         B         C         D
2013-01-06 -0.270472 -1.841921  2.008927  1.140431
2013-01-05 -0.097658 -0.794298  1.387195 -0.897870
2013-01-01  0.161101  0.364128  1.735634 -0.835255
2013-01-02  1.164170  0.384188  0.302318 -0.293224
2013-01-04  0.952359  1.056309 -2.857191  0.668887
2013-01-03  1.116850  1.469352  0.867080 -0.420124
print(df[df['A'] > 0])
                   A         B         C         D
2013-01-01  0.161101  0.364128  1.735634 -0.835255
2013-01-02  1.164170  0.384188  0.302318 -0.293224
2013-01-03  1.116850  1.469352  0.867080 -0.420124
2013-01-04  0.952359  1.056309 -2.857191  0.668887
print(df[df > 0])
                   A         B         C         D
2013-01-01  0.161101  0.364128  1.735634       NaN
2013-01-02  1.164170  0.384188  0.302318       NaN
2013-01-03  1.116850  1.469352  0.867080       NaN
2013-01-04  0.952359  1.056309       NaN  0.668887
2013-01-05       NaN       NaN  1.387195       NaN
2013-01-06       NaN       NaN  2.008927  1.140431
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print(df2)
print(df2[df2['E'].isin(['two', 'four'])])
                   A         B         C         D      E
2013-01-01  0.161101  0.364128  1.735634 -0.835255    one
2013-01-02  1.164170  0.384188  0.302318 -0.293224    one
2013-01-03  1.116850  1.469352  0.867080 -0.420124    two
2013-01-04  0.952359  1.056309 -2.857191  0.668887  three
2013-01-05 -0.097658 -0.794298  1.387195 -0.897870   four
2013-01-06 -0.270472 -1.841921  2.008927  1.140431  three
                   A         B         C         D     E
2013-01-03  1.116850  1.469352  0.867080 -0.420124   two
2013-01-05 -0.097658 -0.794298  1.387195 -0.897870  four
print(df2[~df2['E'].isin(['two', 'four'])])
                   A         B         C         D      E
2013-01-01  0.161101  0.364128  1.735634 -0.835255    one
2013-01-02  1.164170  0.384188  0.302318 -0.293224    one
2013-01-04  0.952359  1.056309 -2.857191  0.668887  three
2013-01-06 -0.270472 -1.841921  2.008927  1.140431  three
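Boolean masks can also be combined. The sketch below, on a small made-up frame, uses `&` and `|` with each comparison in parentheses, because these operators bind more tightly than `>` and `==`.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.5, -1.0, 2.0, -0.3],
                   "E": ["one", "two", "two", "four"]})

# Rows where both conditions hold
both = df[(df["A"] > 0) & (df["E"] == "two")]

# Rows where at least one condition holds
either = df[(df["A"] > 0) | df["E"].isin(["four"])]

print(both)
print(either)
```

Python's `and` and `or` do not work element-wise on a Series, which is why the bitwise operators are used here.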

5. index

reindex

reindex conforms the DataFrame to a new index. The rows of the result follow the order in which the new index is given, so a record may end up at a different position. If a label in the new index also exists in the original DataFrame, its row is the original record; otherwise the row is filled with the default value NaN.

df = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))
print(df)
          A         B         C         D
0  0.903525 -0.664247 -0.645762 -0.762519
1  0.981854 -1.070156 -1.164206 -0.908125
2  0.309620 -0.786684 -0.960699  1.606932
3 -1.488677  0.281483  0.856681  0.613150
4  0.772205  0.601886  0.344716 -1.800654
5  0.769349  0.875296  0.074671 -0.333205
6  0.721913 -0.148773 -0.825000 -0.903127
7 -0.886161  0.625793  0.102159  0.264182
8 -0.225532 -0.221453  1.164743  1.037622
9 -0.046355 -1.238612  0.042434 -0.473256
df1 = df.reindex(list(range(15, 5, -1)))
print(df1)
           A         B         C         D
15       NaN       NaN       NaN       NaN
14       NaN       NaN       NaN       NaN
13       NaN       NaN       NaN       NaN
12       NaN       NaN       NaN       NaN
11       NaN       NaN       NaN       NaN
10       NaN       NaN       NaN       NaN
9  -0.046355 -1.238612  0.042434 -0.473256
8  -0.225532 -0.221453  1.164743  1.037622
7  -0.886161  0.625793  0.102159  0.264182
6   0.721913 -0.148773 -0.825000 -0.903127
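The NaN default can be replaced through the `fill_value` parameter of `reindex`. A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=[0, 1, 2])

# Label 2 exists, label 3 does not.
default_fill = df.reindex([2, 3])
zero_fill = df.reindex([2, 3], fill_value=0)

print(default_fill)
print(zero_fill)
```

With the default, the missing row holds NaN and the column is shown as float; with `fill_value=0` the column can stay integer.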

About index

Assigning a new index directly only relabels the rows of df: every record stays at its original position.

df1 = df.copy()
df1.index = range(15, 5, -1)
print(df1)
           A         B         C         D
15  0.903525 -0.664247 -0.645762 -0.762519
14  0.981854 -1.070156 -1.164206 -0.908125
13  0.309620 -0.786684 -0.960699  1.606932
12 -1.488677  0.281483  0.856681  0.613150
11  0.772205  0.601886  0.344716 -1.800654
10  0.769349  0.875296  0.074671 -0.333205
9   0.721913 -0.148773 -0.825000 -0.903127
8  -0.886161  0.625793  0.102159  0.264182
7  -0.225532 -0.221453  1.164743  1.037622
6  -0.046355 -1.238612  0.042434 -0.473256
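The contrast between the two operations is easiest to see on a tiny frame where the same labels are used both ways. This is an illustrative sketch; `relabeled` and `reordered` are made-up names.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=[0, 1, 2])

# Assigning to .index: same row order, new labels attached top to bottom.
relabeled = df.copy()
relabeled.index = [2, 1, 0]

# reindex: rows are looked up by their existing labels, so the order changes.
reordered = df.reindex([2, 1, 0])

print(relabeled["A"].tolist())
print(reordered["A"].tolist())
```

Both results carry the index `[2, 1, 0]`, but label 2 points at the value 10 in `relabeled` and at 30 in `reordered`.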

 

 
