Selecting data in pandas

Latest recommended article published 2024-03-11 15:25:08

灵海之森

Tags: pandas

Article link: https://blog.csdn.net/qq_43814415/article/details/134401368

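I. Selecting data by index or column name

1. The loc method selects data by label and also accepts a boolean array.
2. The iloc method selects data by integer position.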

II. Basic usage

Define a df (assuming `import numpy as np` and `import pandas as pd`):

df = pd.DataFrame(np.random.randint(1,100,size=(6,5)), index=[1,2,3,4,5,6],columns=['A', 'B', 'C', 'D', 'E'])

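First select the column by name, which gives a Series, then select a specific value by its index label:

print(df['A'][5])  # e.g. 64; the data is random, so the value varies
df[['B', 'A']] = df[['A', 'B']]  # quickly swap two columns
Two columns can also be swapped by assigning the raw values: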

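df.loc[:, ['B', 'A']] = df[['A', 'B']]  # no swap: .loc aligns on column labels before assigning
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()  # works: raw values carry no labels to align
df.iloc[:, [1, 0]] = df[['A', 'B']]  # works in recent pandas versions: .iloc assigns by position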
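
The difference between the two .loc assignments above can be checked on a small deterministic frame. This is a minimal sketch; `df_demo`, `aligned` and `swapped` are illustrative names that do not appear in the original article.

```python
import pandas as pd

# Hypothetical demo frame with fixed values so the result is easy to read.
df_demo = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# .loc aligns the right-hand side on column labels, so nothing is swapped.
aligned = df_demo.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# A NumPy array has no labels, so the values are assigned by position.
swapped = df_demo.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].to_numpy()

print(aligned)
print(swapped)
```

In the first case column A still holds 1, 2, 3; in the second the two columns have exchanged their values.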

Selecting data:

print(df.loc[1:3, 'A':'C'])  # rows (index labels) first, then columns
print()
print(df.iloc[0:3, 0:3])
print()
print(df[df['A'] > 50])  # all rows where the value in column 'A' is greater than 50

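III. Attribute access

A column can be accessed as an attribute named after the column, provided the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The line below selects column 'A':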

print(df.A)

IV. Slicing

print(df[:3])  # the first three rows

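Slice syntax: start:stop:step
start is the first position; if omitted, slicing starts at the first row.
stop is the end position (exclusive); it is omitted in the example below, so the selection runs to the last row.
step is the stride; it is 3 in the example below, so every third row is taken, skipping step - 1 rows between selections.
A step of -1 walks the sequence from the end towards the beginning, one element at a time.
So s[::-1] starts at the last element of s and traverses back to the first, producing a reversed sequence.

print(df[::3])  # start at the first row, then take every third row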
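
To make start, stop and step concrete, here is a small sketch on a frame with string labels, so that positions and labels cannot be confused. The names `df_demo`, `every_third`, `middle` and `reversed_rows` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical frame: seven rows labelled 'a' to 'g'.
df_demo = pd.DataFrame({"v": [10, 20, 30, 40, 50, 60, 70]}, index=list("abcdefg"))

every_third = df_demo[::3]     # positions 0, 3, 6
middle = df_demo[1:6:2]        # positions 1, 3, 5 (stop position 6 is excluded)
reversed_rows = df_demo[::-1]  # last row first

print(every_third.index.tolist())
print(middle.index.tolist())
print(reversed_rows.index.tolist())
```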

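V. Selecting data by label with .loc

.loc accepts a single label, a list or array of labels, a slice of labels, a boolean array, or a callable (a function, a method, and so on).
When slicing, make sure that both bounds exist in the index and that the index is sorted and free of duplicates.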

print(df.loc[1:3, 'A':'C'])

Both the start label and the stop label are included. Here 1 and 3 are index labels, not positions.
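
The inclusive behaviour of label slices can be compared directly with positional slices. The sketch below uses a hypothetical `df_demo` whose index starts at 1, so labels and positions differ by one.

```python
import pandas as pd

# Hypothetical frame whose index labels are 1..4 (positions are 0..3).
df_demo = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]},
    index=[1, 2, 3, 4],
)

by_label = df_demo.loc[1:3, "A":"B"]  # labels 1, 2, 3 and columns A, B (both ends included)
by_pos = df_demo.iloc[1:3, 0:2]       # positions 1, 2 and columns 0, 1 (stop excluded)

print(by_label)
print(by_pos)
```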

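An example with a boolean array (df1 here is a DataFrame of random floats with index a to f and columns A to D):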

In [56]: df1.loc['a'] > 0
Out[56]: 
A     True
B    False
C    False
D    False
Name: a, dtype: bool

In [57]: df1.loc[:, df1.loc['a'] > 0]
Out[57]: 
          A
a  0.132003
b  1.130127
c  1.024180
d  0.974466
e  0.545952
f -1.281247

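VI. Selecting data by position with .iloc

In pandas, "chained assignment" means assigning through two consecutive indexing operations, for example df['A'][0] = 100.
With chained assignment, pandas may return a copy of the DataFrame (or of part of it) rather than a view, so the assignment may never reach the original object.
With .iloc slices, the start position is included but the stop position is excluded.
.iloc accepts an integer, a list or array of integers, an integer slice, a boolean array, or a callable. For example: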

In [68]: s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))

In [69]: s1
Out[69]: 
0    0.695775
2    0.341734
4    0.959726
6   -1.110336
8   -0.619976
dtype: float64

In [70]: s1.iloc[:3]
Out[70]: 
0    0.695775
2    0.341734
4    0.959726
dtype: float64

In [71]: s1.iloc[3]
Out[71]: -1.110336102891167
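
The chained-assignment caveat at the start of this section is usually avoided by assigning in a single indexing step. A minimal sketch, with `df_demo` as a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Chained form (two indexing steps), deliberately not executed here:
#     df_demo["A"][0] = 100
# It may write to a temporary copy instead of df_demo.

# Single-step assignment by label.
df_demo.loc[0, "A"] = 100
# Single-step assignment by position (row 1, column 0).
df_demo.iloc[1, 0] = 200

print(df_demo)
```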

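VII. Selecting data with a callable

This mainly uses lambda functions: the callable receives the DataFrame (or Series) being indexed and returns something valid for indexing.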

In [98]: df1 = pd.DataFrame(np.random.randn(6, 4),
   ....:                    index=list('abcdef'),
   ....:                    columns=list('ABCD'))
   ....: 

In [99]: df1
Out[99]: 
          A         B         C         D
a -0.023688  2.410179  1.450520  0.206053
b -0.251905 -2.213588  1.063327  1.266143
c  0.299368 -0.863838  0.408204 -1.048089
d -0.025747 -0.988387  0.094055  1.262731
e  1.289997  0.082423 -0.055758  0.536580
f -0.489682  0.369374 -0.034571 -2.484478

In [100]: df1.loc[lambda df: df['A'] > 0, :]
Out[100]: 
          A         B         C         D
c  0.299368 -0.863838  0.408204 -1.048089
e  1.289997  0.082423 -0.055758  0.536580

In [101]: df1.loc[:, lambda df: ['A', 'B']]
Out[101]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374

In [102]: df1.iloc[:, lambda df: [0, 1]]
Out[102]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374

In [103]: df1[lambda df: df.columns[0]]
Out[103]: 
a   -0.023688
b   -0.251905
c    0.299368
d   -0.025747
e    1.289997
f   -0.489682
Name: A, dtype: float64

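VIII. Combining positional and label-based selection

Either loc or iloc can be used; the idea is to call helper methods that convert positions to labels or labels to positions.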

In [107]: dfd = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6]},
   .....:                    index=list('abc'))
   .....: 

In [108]: dfd
Out[108]: 
   A  B
a  1  4
b  2  5
c  3  6

In [109]: dfd.loc[dfd.index[[0, 2]], 'A']
Out[109]: 
a    1
c    3
Name: A, dtype: int64

In [110]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[110]: 
a    1
c    3
Name: A, dtype: int64

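Reindexing: reindex selects by a list of labels and fills labels that are missing from the object with NaN.
The Index.intersection method finds the elements common to two indexes; selecting with its result returns only labels that exist, so no NaN is introduced.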
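
The contrast between the two approaches can be sketched as follows. `s_demo` and `labels` are hypothetical; the label 'x' is deliberately absent from the Series.

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3], index=["a", "b", "c"])
labels = ["c", "b", "x"]  # 'x' does not exist in s_demo

# reindex keeps every requested label; the missing one becomes NaN.
reindexed = s_demo.reindex(labels)

# intersection keeps only labels present in both, so .loc sees no missing label.
common = s_demo.index.intersection(labels)
selected = s_demo.loc[common]

print(reindexed)
print(selected)
```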

IX. Random sampling with sample

In [122]: s = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1 row (the default).
In [123]: s.sample()
Out[123]: 
4    4
dtype: int64

# One may specify either a number of rows:
In [124]: s.sample(n=3)
Out[124]: 
0    0
4    4
1    1
dtype: int64

# Or a fraction of the rows:
In [125]: s.sample(frac=0.5)
Out[125]: 
5    5
3    3
1    1
dtype: int64

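The replace parameter controls whether the same row may be sampled more than once; the default is False (no).
By default every row has the same sampling probability; this can be changed with the weights parameter: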

In [129]: s = pd.Series([0, 1, 2, 3, 4, 5])

In [130]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [131]: s.sample(n=3, weights=example_weights)
Out[131]: 
5    5
4    4
3    3
dtype: int64

# Weights will be re-normalized automatically
In [132]: example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [133]: s.sample(n=1, weights=example_weights2)
Out[133]: 
0    0
dtype: int64

The random_state parameter controls the randomness: passing the same seed gives the same sample.
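
A short sketch of replace and random_state together; the seeds 42 and 0 are arbitrary values chosen for this example, and `s_demo` is an illustrative name.

```python
import pandas as pd

s_demo = pd.Series([0, 1, 2, 3, 4, 5])

# The same seed yields the same sample.
first = s_demo.sample(n=3, random_state=42)
second = s_demo.sample(n=3, random_state=42)

# replace=True allows drawing more rows than the Series contains.
with_replacement = s_demo.sample(n=10, replace=True, random_state=0)

print(first.equals(second))
print(len(with_replacement))
```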

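X. Fast scalar access

The at and iat accessors select a single scalar by label and by position respectively. In the examples below, df is a DataFrame with a date index dates; neither is defined earlier in this article.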

In [151]: s.iat[5]
Out[151]: 5

In [152]: df.at[dates[5], 'A']
Out[152]: 0.1136484096888855

In [153]: df.iat[3, 0]
Out[153]: -0.7067711336300845
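
Because df and dates are not defined in this article, here is a self-contained sketch of the same accessors. `dates_demo` and `df_demo` are stand-in names with made-up values.

```python
import pandas as pd

dates_demo = pd.date_range("2000-01-01", periods=3)
df_demo = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=dates_demo
)

by_label = df_demo.at[dates_demo[1], "A"]  # row label, column label
by_pos = df_demo.iat[2, 1]                 # row position, column position

# at and iat also support assignment of a single value.
df_demo.at[dates_demo[0], "B"] = 40.0

print(by_label, by_pos)
print(df_demo)
```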

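XI. Boolean indexing

| is or, & is and, ~ is not. Wrap each condition in parentheses, because these operators bind more tightly than comparison operators.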

In [158]: s = pd.Series(range(-3, 4))

In [159]: s
Out[159]: 
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [160]: s[s > 0]
Out[160]: 
4    1
5    2
6    3
dtype: int64

In [161]: s[(s < -1) | (s > 0.5)]
Out[161]: 
0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [162]: s[~(s < 0)]
Out[162]: 
3    0
4    1
5    2
6    3
dtype: int64

List comprehensions and the map method can also be used to build the criterion:

In [164]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                     'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

# only want 'two' or 'three'
In [165]: criterion = df2['a'].map(lambda x: x.startswith('t'))

In [166]: df2[criterion]
Out[166]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# equivalent but slower
In [167]: df2[[x.startswith('t') for x in df2['a']]]
Out[167]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# Multiple criteria
In [168]: df2[criterion & (df2['b'] == 'x')]
Out[168]: 
       a  b         c
3  three  x  0.361719

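XII. The isin function

isin returns a boolean for each element, indicating whether it is contained in the given values. It also works on Index objects.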

In [175]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [176]: s
Out[176]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [177]: s.isin([2, 4, 6])
Out[177]: 
4    False
3    False
2     True
1    False
0     True
dtype: bool

In [178]: s[s.isin([2, 4, 6])]
Out[178]: 
2    2
0    4
dtype: int64

It can also be used with a MultiIndex:

In [181]: s_mi = pd.Series(np.arange(6),
   .....:                  index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
   .....: 

In [182]: s_mi
Out[182]: 
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64

In [183]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[183]: 
0  c    2
1  a    3
dtype: int64

In [184]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[184]: 
0  a    0
   c    2
1  a    3
   c    5
dtype: int64

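Applied to a DataFrame, isin returns a boolean DataFrame of the same shape. The values can be matched against every element, or only against particular columns by passing a dict, for example: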

In [185]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....:                    'ids2': ['a', 'n', 'c', 'n']})
   .....: 

In [186]: values = ['a', 'b', 1, 3]

In [187]: df.isin(values)
Out[187]: 
    vals    ids   ids2
0   True   True   True
1  False   True  False
2   True  False  False
3  False  False  False

In [188]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [189]: df.isin(values)
Out[189]: 
    vals    ids   ids2
0   True   True  False
1  False   True  False
2   True  False  False
3  False  False  False

In [192]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [193]: row_mask = df.isin(values).all(axis=1)  # axis=1: combine across the columns, giving one boolean per row

In [194]: df[row_mask]
Out[194]: 
   vals ids ids2
0     1   a    a

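XIII. The where function

Unlike selection with [], where returns an object with the same shape as the original (s below is the Series defined in the isin section above):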

In [195]: s[s > 0]
Out[195]: 
3    1
2    2
1    3
0    4
dtype: int64

In [196]: s.where(s > 0)
Out[196]: 
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

The .where() method takes a condition and an optional replacement value; both may be callables. For example:

In [215]: df3 = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6],
   .....:                     'C': [7, 8, 9]})
   .....: 

In [216]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[216]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9
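
The shape-preserving behaviour and the replacement value can be seen side by side in a small sketch; `s_demo` and the replacement value 0 are choices made for this illustration.

```python
import pandas as pd

s_demo = pd.Series([-2, -1, 0, 1, 2])

filtered = s_demo[s_demo > 0]           # only matching rows survive
kept_shape = s_demo.where(s_demo > 0)   # same length, non-matching rows become NaN
replaced = s_demo.where(s_demo > 0, 0)  # non-matching rows become 0 instead

print(filtered)
print(kept_shape)
print(replaced)
```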

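XIV. The mask function

mask is the boolean inverse of where: values are replaced where the condition is True. In the examples below, s is the Series from the isin section, and df is a DataFrame of random floats with a date index that is not defined in this article.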

In [217]: s.mask(s >= 0)
Out[217]: 
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [218]: df.mask(df >= 0)
Out[218]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
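
The inverse relationship between mask and where can be verified directly. A minimal sketch with a hypothetical `s_demo`:

```python
import pandas as pd

s_demo = pd.Series([-2, -1, 0, 1, 2])
cond = s_demo >= 0

masked = s_demo.mask(cond)           # hide values where cond is True
same_result = s_demo.where(~cond)    # keep values where cond is False

print(masked)
print(masked.equals(same_result))
```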

XV. The query() function

query selects rows using an expression string:

In [226]: n = 10

In [227]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [228]: df
Out[228]: 
          a         b         c
0  0.438921  0.118680  0.863670
1  0.138138  0.577363  0.686602
2  0.595307  0.564592  0.520630
3  0.913052  0.926075  0.616184
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
6  0.792342  0.216974  0.564056
7  0.397890  0.454131  0.915716
8  0.074315  0.437913  0.019794
9  0.559209  0.502065  0.026437

# pure python
In [229]: df[(df['a'] < df['b']) & (df['b'] < df['c'])]
Out[229]: 
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716

# query
In [230]: df.query('(a < b) & (b < c)')
Out[230]: 
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716

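It also works with a MultiIndex, where an index level can be referenced in the expression (ilevel_0 refers to level 0 when the level has no name):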

df.query('ilevel_0 == "red"')

Comparing a column with a list of values using ==/!= works similarly to in/not in.
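
Both points can be tried on small frames. This is a sketch under the assumption that the index levels are unnamed; `df_demo`, `df_mi` and the colour labels are invented for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})

by_in = df_demo.query("a in [1, 3]")
by_eq = df_demo.query("a == [1, 3]")  # a list on the right behaves like `in`

# MultiIndex with unnamed levels: level 0 is addressed as ilevel_0.
colors = pd.MultiIndex.from_product([["red", "green"], ["x", "y"]])
df_mi = pd.DataFrame({"v": [1, 2, 3, 4]}, index=colors)
red_rows = df_mi.query('ilevel_0 == "red"')

print(by_in)
print(red_rows)
```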

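XVI. Handling duplicates

duplicated marks which rows are duplicates.
drop_duplicates removes duplicate rows.
By default the first occurrence is treated as non-duplicate and is therefore kept. This is controlled by the keep parameter, which takes one of three values: 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False keeps none of them.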

In [294]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

In [295]: df2
Out[295]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [296]: df2.duplicated('a')
Out[296]: 
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [297]: df2.duplicated('a', keep='last')
Out[297]: 
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [298]: df2.duplicated('a', keep=False)
Out[298]: 
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [299]: df2.drop_duplicates('a')
Out[299]: 
       a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329

In [300]: df2.drop_duplicates('a', keep='last')
Out[300]: 
       a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [301]: df2.drop_duplicates('a', keep=False)
Out[301]: 
       a  b         c
5  three  x -1.964475
6   four  x  1.298329

A list such as ['a', 'b'] can also be passed, in which case those columns together form the subset used to identify duplicates.
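
A sketch of the list form, using a small made-up frame `df_demo` in which only rows 0 and 1 agree on both columns:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "a": ["one", "one", "two", "two"],
    "b": ["x", "x", "x", "y"],
    "c": [1, 2, 3, 4],
})

# A row is a duplicate only if both 'a' and 'b' match an earlier row.
dup_mask = df_demo.duplicated(["a", "b"])
deduped = df_demo.drop_duplicates(["a", "b"])

print(dup_mask.tolist())
print(deduped)
```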

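XVII. The get method

get returns the value for a key, much like df['col1'], but returns a default (None unless specified) instead of raising when the key is missing.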

In [310]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [311]: s.get('a')  # equivalent to s['a']
Out[311]: 1

In [312]: s.get('x', default=-1)
Out[312]: -1
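
The same method works on a DataFrame, where the key is a column name. A minimal sketch; `df_demo`, the column names and the fallback string are assumptions for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": [1, 2, 3]})

existing = df_demo.get("col1")                 # the column as a Series
missing = df_demo.get("col2")                  # None, no KeyError
fallback = df_demo.get("col2", default="n/a")  # custom default

print(existing.tolist(), missing, fallback)
```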

XVIII. Factorizing with factorize

In [313]: df = pd.DataFrame({'col': ["A", "A", "B", "B"],
   .....:                    'A': [80, 23, np.nan, 22],
   .....:                    'B': [80, 55, 76, 67]})
   .....: 

In [314]: df
Out[314]: 
  col     A   B
0   A  80.0  80
1   A  23.0  55
2   B   NaN  76
3   B  22.0  67

In [315]: idx, cols = pd.factorize(df['col'])

In [316]: df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Out[316]: array([80., 23., 76., 67.])

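Let us walk through this code in detail, starting with this line:

idx, cols = pd.factorize(df['col'])

This line uses the pandas factorize function to factorize the input sequence (here the df['col'] column). Factorizing maps each concrete value to an integer code and is typically used for categorical data. Breaking it down:

pd.factorize(...):

The function accepts a sequence (here df['col'], whose content is ['A', 'A', 'B', 'B']).
Its purpose is to map each unique value in the sequence to an integer.
It returns a tuple of two elements: the first is an array holding the integer code of each element of the original sequence; the second holds the unique values, in order of first appearance.

idx, cols:

idx (codes): the first element returned by factorize. For the input ['A', 'A', 'B', 'B'], idx is [0, 0, 1, 1]: the first and second elements map to the first unique value ('A'), and the third and fourth map to the second ('B').
cols (unique values): the second element returned by factorize, holding the unique values of the original sequence. Here cols is ['A', 'B'].

So this line creates two arrays: idx maps each value in the 'col' column to an integer code, and cols holds the unique labels. This mapping lets numeric operations be applied to what was originally categorical data.

Next, reorder the DataFrame and convert it to a NumPy array:

df.reindex(cols, axis=1).to_numpy()

df.reindex(cols, axis=1): reorders the columns of df according to cols. Because cols was obtained by factorizing the 'col' column, it contains 'A' and 'B', so the result keeps only columns 'A' and 'B', in that order.
.to_numpy(): converts the reindexed DataFrame to a NumPy array, so that NumPy's advanced indexing can be used in the next step.

Finally, select the elements with advanced indexing:

[np.arange(len(df)), idx]

np.arange(len(df)): generates an array from 0 to len(df) - 1, which is simply an array of row positions, for example [0, 1, 2, 3].
idx: the codes obtained by factorizing, giving the column position that each element of 'col' points to; for ['A', 'A', 'B', 'B'], idx is [0, 0, 1, 1].
This is NumPy advanced indexing. Pairing np.arange(len(df)) with idx picks one specific column for each row: the value in 'col' ('A' or 'B') decides whether the row takes its value from column 'A' or column 'B'.
The result is [80., 23., 76., 67.], the elements picked from column 'A' or 'B' as directed by 'col'.

In short, this code uses the value in 'col' to decide, row by row, whether to take the data from column 'A' or column 'B'.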
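
The walkthrough above can be repeated on a frame where the labels do not appear in alphabetical order, which shows why the reindex step matters. This is a sketch with invented values; `df_demo`, `codes`, `uniques` and `picked` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'B' appears before 'A' in the selector column.
df_demo = pd.DataFrame({
    "col": ["B", "A", "B", "A"],
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, 40.0],
})

# Codes follow order of first appearance: 'B' -> 0, 'A' -> 1.
codes, uniques = pd.factorize(df_demo["col"])

# Reorder the columns to match the codes (B first, then A), then index per row.
values = df_demo.reindex(uniques, axis=1).to_numpy()
picked = values[np.arange(len(df_demo)), codes]

print(codes.tolist(), list(uniques))
print(picked)
```

Without the reindex, code 0 would point at column 'A' even though it stands for 'B', and the wrong values would be picked.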

XIX. The Index object

An Index can be created directly:

In [317]: index = pd.Index(['e', 'd', 'a', 'b'])

In [318]: index
Out[318]: Index(['e', 'd', 'a', 'b'], dtype='object')

In [319]: 'd' in index
Out[319]: True

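The dtype parameter controls the type, and the name parameter stores a name on the index, as in the example below: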

In [331]: index = pd.Index(list(range(5)), name='rows')

In [332]: columns = pd.Index(['A', 'B', 'C'], name='cols')

In [333]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)

In [334]: df
Out[334]: 
cols         A         B         C
rows                              
0     1.295989 -1.051694  1.340429
1    -2.366110  0.428241  0.387275
2     0.433306  0.929548  0.278094
3     2.154730 -0.315628  0.264223
4     1.126818  1.132290 -0.353310

In [335]: df['A']
Out[335]: 
rows
0    1.295989
1   -2.366110
2    0.433306
3    2.154730
4    1.126818
Name: A, dtype: float64

Index objects support set operations such as difference, union and intersection:

In [346]: a = pd.Index(['c', 'b', 'a'])

In [347]: b = pd.Index(['c', 'e', 'd'])

In [348]: a.difference(b)
Out[348]: Index(['a', 'b'], dtype='object')
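
The other set operations follow the same pattern. A minimal sketch reusing the labels from the example above; the names `a_demo` and `b_demo` are made up to avoid clashing with the session variables.

```python
import pandas as pd

a_demo = pd.Index(["c", "b", "a"])
b_demo = pd.Index(["c", "e", "d"])

both = a_demo.union(b_demo)                   # labels found in either index
shared = a_demo.intersection(b_demo)          # labels found in both
only_one = a_demo.symmetric_difference(b_demo)  # labels found in exactly one

print(both)
print(shared)
print(only_one)
```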

The fillna method fills missing values with a specified value:

In [355]: idx1 = pd.Index([1, np.nan, 3, 4])

In [356]: idx1
Out[356]: Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [357]: idx1.fillna(2)
Out[357]: Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

In [358]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
   .....:                          pd.NaT,
   .....:                          pd.Timestamp('2011-01-03')])
   .....: 

In [359]: idx2
Out[359]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

In [360]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[360]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)

set_index() uses a column name, or a list of column names, as the index:

In [361]: data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
   .....:                      'b': ['one', 'two', 'one', 'two'],
   .....:                      'c': ['z', 'y', 'x', 'w'],
   .....:                      'd': [1., 2., 3, 4]})
   .....: 

In [362]: data
Out[362]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

In [363]: indexed1 = data.set_index('c')

In [364]: indexed1
Out[364]: 
     a    b    d
c               
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0

In [365]: indexed2 = data.set_index(['a', 'b'])

In [366]: indexed2
Out[366]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

reset_index() does the opposite: it moves the index back into the columns and installs a default integer index.

In [371]: data
Out[371]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

In [372]: data.reset_index()
Out[372]: 
   index    a    b  c    d
0      0  bar  one  z  1.0
1      1  bar  two  y  2.0
2      2  foo  one  x  3.0
3      3  foo  two  w  4.0
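
The round trip with set_index is easier to see on a frame that actually has a custom index. The sketch below rebuilds the same data under the hypothetical name `data_demo`; the drop=True variant is included as an extra illustration.

```python
import pandas as pd

data_demo = pd.DataFrame({
    "a": ["bar", "bar", "foo", "foo"],
    "b": ["one", "two", "one", "two"],
    "c": ["z", "y", "x", "w"],
    "d": [1.0, 2.0, 3.0, 4.0],
})

indexed = data_demo.set_index(["a", "b"])

restored = indexed.reset_index()          # 'a' and 'b' become columns again
dropped = indexed.reset_index(drop=True)  # the index levels are discarded instead

print(restored)
print(dropped)
```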