pandas DataFrame Data Processing

Author: 鱼跃龙门Smile

Last modified: 2024-01-15 15:27:10

Tags: pandas, python

First published: 2024-01-15 15:19:44

Original link: https://blog.csdn.net/zj88189748/article/details/135557122

1. Modifying a DataFrame's index and columns

1.1 Direct assignment

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3,3),index=['sh','cs','bj'],columns=['a','b','c'])
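print('Original df:')
print(df)

# Change the row index by assigning a new list directly
print('Index replaced by direct assignment:')
df.index=['shanghai','changsha','beijing']
print(df)

# Change the column names by assigning a new list directly
print('Column names replaced by direct assignment:')
df.columns=['d','e','f']
print(df)

Output:

Original df:
    a  b  c
sh  0  1  2
cs  3  4  5
bj  6  7  8

Index replaced by direct assignment:
          a  b  c
shanghai  0  1  2
changsha  3  4  5
beijing   6  7  8

Column names replaced by direct assignment: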
          d  e  f
shanghai  0  1  2
changsha  3  4  5
beijing   6  7  8

1.2 Bulk renaming with rename()

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3,3),index=['sh','cs','bj'],columns=['a','b','c'])
print('Original df:')
print(df)

# Rename labels in bulk with rename(); the function is applied to every label
print('After bulk renaming with rename():')
def add_suffix(x):
    return x+'_zs'
df = df.rename(index=add_suffix,columns=add_suffix)
print(df)

Output:

Original df:
    a  b  c
sh  0  1  2
cs  3  4  5
bj  6  7  8

After bulk renaming with rename():
       a_zs  b_zs  c_zs
sh_zs     0     1     2
cs_zs     3     4     5
bj_zs     6     7     8
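
rename() also accepts a dict, which is handy when only some labels should change. A minimal sketch, reusing the same 3x3 frame; the new names ('shanghai', 'alpha' and so on) are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['sh', 'cs', 'bj'], columns=['a', 'b', 'c'])

# A dict renames only the labels it lists; 'bj' and 'c' are left untouched.
renamed = df.rename(index={'sh': 'shanghai', 'cs': 'changsha'},
                    columns={'a': 'alpha', 'b': 'beta'})
print(renamed)
```

rename() returns a new DataFrame here, so df itself keeps its original labels.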

1.3 Setting the index: set_index()

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(4),
                    'B': [2, 8, 9, 1],
                    'C': list('abcd'),
                    'D': [1, 2, 34, 4]})
print('Original df:')
print(df)

print('Use C as the index and keep the C column:')
set_df = df.set_index('C',drop=False)
print(set_df)

print('Use C as the index and drop the C column:')
set_df = df.set_index('C',drop=True)
print(set_df)

Output:

Original df:
   A  B  C   D
0  0  2  a   1
1  1  8  b   2
2  2  9  c  34
3  3  1  d   4

Use C as the index and keep the C column:
   A  B  C   D
C             
a  0  2  a   1
b  1  8  b   2
c  2  9  c  34
d  3  1  d   4

Use C as the index and drop the C column:
   A  B   D
C          
a  0  2   1
b  1  8   2
c  2  9  34
d  3  1   4
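
The reverse operation is reset_index(), which turns the index back into an ordinary column. A small sketch on a cut-down version of the frame above:

```python
import pandas as pd

df = pd.DataFrame({'A': range(4), 'C': list('abcd')})
indexed = df.set_index('C')        # 'C' becomes the row index
restored = indexed.reset_index()   # 'C' moves back to an ordinary column
print(restored)
```

After the round trip the column order differs from the original, because reset_index() puts the former index first.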

2. Adding data

2.1 Adding a column

import pandas as pd

data = {
    'Date': ['2023-09-01', '2023-09-02', '2023-09-03'],
    'Steps': [8000, 9000, 7500]
}
df = pd.DataFrame(data)
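print('Original df:')
print(df)

# Add a column. Adding columns is the common case; rows are usually added by concatenation
print('df after adding the cost column:')
df['cost'] = [30,20,25]
print(df)

Output:

Original df:
         Date  Steps
0  2023-09-01   8000
1  2023-09-02   9000
2  2023-09-03   7500

df after adding the cost column: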
         Date  Steps  cost
0  2023-09-01   8000    30
1  2023-09-02   9000    20
2  2023-09-03   7500    25
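
The comment in the code says rows are usually added by concatenation. One way that might look, assuming the new record arrives as a one-row DataFrame (new_row is a made-up name):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2023-09-01', '2023-09-02'], 'Steps': [8000, 9000]})
new_row = pd.DataFrame({'Date': ['2023-09-03'], 'Steps': [7500]})  # hypothetical extra record

# ignore_index=True renumbers the rows 0..n-1 instead of repeating index 0.
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```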

3. Concatenation: concat()

import pandas as pd

yb_data = {
    'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'Forecast': ['Sunny', 'Partly Cloudy', 'Rainy', 'Cloudy', 'Sunny']
}
yb_df = pd.DataFrame(yb_data)
print('Original yb_df:')
print(yb_df)
qw_data = {
    'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'Temperature (°c)': [28, 29, 24, 22, 27]
}
qw_df = pd.DataFrame(qw_data)
print('Original qw_df:')
print(qw_df)

# Concatenate the two DataFrames along axis 0 (rows are stacked)
print('Concatenated along axis 0:')
concat_df = pd.concat([yb_df,qw_df],axis=0)
print(concat_df)

# Concatenate the two DataFrames along axis 1 (columns are placed side by side)
print('Concatenated along axis 1:')
concat_df = pd.concat([yb_df,qw_df],axis=1)
print(concat_df)

Output:

Original yb_df:
         Day       Forecast
0     Monday          Sunny
1    Tuesday  Partly Cloudy
2  Wednesday          Rainy
3   Thursday         Cloudy
4     Friday          Sunny

Original qw_df:
         Day  Temperature (°c)
0     Monday                28
1    Tuesday                29
2  Wednesday                24
3   Thursday                22
4     Friday                27

Concatenated along axis 0:
         Day       Forecast  Temperature (°c)
0     Monday          Sunny               NaN
1    Tuesday  Partly Cloudy               NaN
2  Wednesday          Rainy               NaN
3   Thursday         Cloudy               NaN
4     Friday          Sunny               NaN
0     Monday            NaN              28.0
1    Tuesday            NaN              29.0
2  Wednesday            NaN              24.0
3   Thursday            NaN              22.0
4     Friday            NaN              27.0

Concatenated along axis 1:
         Day       Forecast        Day  Temperature (°c)
0     Monday          Sunny     Monday                28
1    Tuesday  Partly Cloudy    Tuesday                29
2  Wednesday          Rainy  Wednesday                24
3   Thursday         Cloudy   Thursday                22
4     Friday          Sunny     Friday                27
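
The axis=1 result repeats the Day column because concat() lines rows up by index, not by the values in Day. One possible way around that is to make Day the index of both frames first. A sketch on shortened data, with the second frame deliberately in a different row order:

```python
import pandas as pd

yb_df = pd.DataFrame({'Day': ['Monday', 'Tuesday'], 'Forecast': ['Sunny', 'Rainy']})
qw_df = pd.DataFrame({'Day': ['Tuesday', 'Monday'], 'Temperature (°c)': [29, 28]})

# concat aligns on the index, so make Day the index first; row order then no longer matters.
combined = pd.concat([yb_df.set_index('Day'), qw_df.set_index('Day')], axis=1)
print(combined)
```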

4. Joining: merge()

"""
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
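Parameters:
left: the left DataFrame.
right: the right DataFrame.
how: the join type, one of 'inner', 'outer', 'left', 'right' or 'cross'. Defaults to 'inner' (inner join).
on: column name(s) to join on. Use this when the key column has the same name in both DataFrames.
left_on: column name(s) in the left DataFrame to join on.
right_on: column name(s) in the right DataFrame to join on.
left_index: whether to use the left DataFrame's index as its join key. Defaults to False.
right_index: whether to use the right DataFrame's index as its join key. Defaults to False.
sort: whether to sort the result by the join keys. Defaults to False.
suffixes: suffixes appended to column names that appear in both DataFrames. Defaults to ('_x', '_y').
copy: whether to copy the data. Defaults to True.
indicator: whether to add a special column (named _merge) that records whether each row came from the left table, the right table, or both. Defaults to False.
validate: checks that the join keys have the expected relationship. One of 'one_to_one', 'one_to_many', 'many_to_one' or 'many_to_many'. Defaults to None (no check).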
"""

4.1 Inner join: inner

import pandas as pd

user_data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Rose', 'Bob', 'Jack', 'David', 'Lucy'],
    'Email': ['rose@163.com', 'bob@163.com', 'jack@163.com', 'david@163.com', 'lucy@163.com']
}
user_df = pd.DataFrame(user_data)
print('Original user_df:')
print(user_df)

buy_data = {
    'CustomerID': [1, 2, 1, 3, 4, 3, 6],
    'OrderID': [101, 102, 103, 104, 105, 106, 107],
    'Product': ['A', 'B', 'C', 'D', 'E', 'A', 'B'],
    'Quantity': [2, 1, 3, 2, 4, 1, 2]
}

buy_df = pd.DataFrame(buy_data)
print('Original buy_df:')
print(buy_df)

# Inner join: keep only the CustomerIDs present in both tables
merge_df = pd.merge(user_df,buy_df,on='CustomerID',how='inner')
print('Inner join:')
print(merge_df)

Output:

Original user_df:
   CustomerID   Name          Email
0           1   Rose   rose@163.com
1           2    Bob    bob@163.com
2           3   Jack   jack@163.com
3           4  David  david@163.com
4           5   Lucy   lucy@163.com

Original buy_df:
   CustomerID  OrderID Product  Quantity
0           1      101       A         2
1           2      102       B         1
2           1      103       C         3
3           3      104       D         2
4           4      105       E         4
5           3      106       A         1
6           6      107       B         2

Inner join:
   CustomerID   Name          Email  OrderID Product  Quantity
0           1   Rose   rose@163.com      101       A         2
1           1   Rose   rose@163.com      103       C         3
2           2    Bob    bob@163.com      102       B         1
3           3   Jack   jack@163.com      104       D         2
4           3   Jack   jack@163.com      106       A         1
5           4  David  david@163.com      105       E         4

4.2 Left join: left

import pandas as pd

user_data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Rose', 'Bob', 'Jack', 'David', 'Lucy'],
    'Email': ['rose@163.com', 'bob@163.com', 'jack@163.com', 'david@163.com', 'lucy@163.com']
}
user_df = pd.DataFrame(user_data)
print('Original user_df:')
print(user_df)

buy_data = {
    'CustomerID': [1, 2, 1, 3, 4, 3, 6],
    'OrderID': [101, 102, 103, 104, 105, 106, 107],
    'Product': ['A', 'B', 'C', 'D', 'E', 'A', 'B'],
    'Quantity': [2, 1, 3, 2, 4, 1, 2]
}

buy_df = pd.DataFrame(buy_data)
print('Original buy_df:')
print(buy_df)

# Left join: keep every row of user_df; unmatched order columns become NaN
merge_df = pd.merge(user_df,buy_df,on='CustomerID',how='left')
print('Left join:')
print(merge_df)

Output:

Original user_df:
   CustomerID   Name          Email
0           1   Rose   rose@163.com
1           2    Bob    bob@163.com
2           3   Jack   jack@163.com
3           4  David  david@163.com
4           5   Lucy   lucy@163.com

Original buy_df:
   CustomerID  OrderID Product  Quantity
0           1      101       A         2
1           2      102       B         1
2           1      103       C         3
3           3      104       D         2
4           4      105       E         4
5           3      106       A         1
6           6      107       B         2

Left join:
   CustomerID   Name          Email  OrderID Product  Quantity
0           1   Rose   rose@163.com    101.0       A       2.0
1           1   Rose   rose@163.com    103.0       C       3.0
2           2    Bob    bob@163.com    102.0       B       1.0
3           3   Jack   jack@163.com    104.0       D       2.0
4           3   Jack   jack@163.com    106.0       A       1.0
5           4  David  david@163.com    105.0       E       4.0
6           5   Lucy   lucy@163.com      NaN     NaN       NaN

4.3 Right join: right

This works like the left join, except that every row of the right table is kept.

import pandas as pd

user_data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Rose', 'Bob', 'Jack', 'David', 'Lucy'],
    'Email': ['rose@163.com', 'bob@163.com', 'jack@163.com', 'david@163.com', 'lucy@163.com']
}
user_df = pd.DataFrame(user_data)
print('Original user_df:')
print(user_df)

buy_data = {
    'CustomerID': [1, 2, 1, 3, 4, 3, 6],
    'OrderID': [101, 102, 103, 104, 105, 106, 107],
    'Product': ['A', 'B', 'C', 'D', 'E', 'A', 'B'],
    'Quantity': [2, 1, 3, 2, 4, 1, 2]
}

buy_df = pd.DataFrame(buy_data)
print('Original buy_df:')
print(buy_df)

# Right join: keep every row of buy_df; unmatched user columns become NaN
merge_df = pd.merge(user_df,buy_df,on='CustomerID',how='right')
print('Right join:')
print(merge_df)

Output:

Original user_df:
   CustomerID   Name          Email
0           1   Rose   rose@163.com
1           2    Bob    bob@163.com
2           3   Jack   jack@163.com
3           4  David  david@163.com
4           5   Lucy   lucy@163.com

Original buy_df:
   CustomerID  OrderID Product  Quantity
0           1      101       A         2
1           2      102       B         1
2           1      103       C         3
3           3      104       D         2
4           4      105       E         4
5           3      106       A         1
6           6      107       B         2

Right join:
   CustomerID   Name          Email  OrderID Product  Quantity
0           1   Rose   rose@163.com      101       A         2
1           2    Bob    bob@163.com      102       B         1
2           1   Rose   rose@163.com      103       C         3
3           3   Jack   jack@163.com      104       D         2
4           4  David  david@163.com      105       E         4
5           3   Jack   jack@163.com      106       A         1
6           6    NaN            NaN      107       B         2

4.4 Outer join: outer

import pandas as pd

user_data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Rose', 'Bob', 'Jack', 'David', 'Lucy'],
    'Email': ['rose@163.com', 'bob@163.com', 'jack@163.com', 'david@163.com', 'lucy@163.com']
}
user_df = pd.DataFrame(user_data)
print('Original user_df:')
print(user_df)

buy_data = {
    'CustomerID': [1, 2, 1, 3, 4, 3, 6],
    'OrderID': [101, 102, 103, 104, 105, 106, 107],
    'Product': ['A', 'B', 'C', 'D', 'E', 'A', 'B'],
    'Quantity': [2, 1, 3, 2, 4, 1, 2]
}

buy_df = pd.DataFrame(buy_data)
print('Original buy_df:')
print(buy_df)

# Outer join: keep the rows of both tables; whatever has no match becomes NaN
merge_df = pd.merge(user_df,buy_df,on='CustomerID',how='outer')
print('Outer join:')
print(merge_df)

Output:

Original user_df:
   CustomerID   Name          Email
0           1   Rose   rose@163.com
1           2    Bob    bob@163.com
2           3   Jack   jack@163.com
3           4  David  david@163.com
4           5   Lucy   lucy@163.com

Original buy_df:
   CustomerID  OrderID Product  Quantity
0           1      101       A         2
1           2      102       B         1
2           1      103       C         3
3           3      104       D         2
4           4      105       E         4
5           3      106       A         1
6           6      107       B         2

Outer join:
   CustomerID   Name          Email  OrderID Product  Quantity
0           1   Rose   rose@163.com    101.0       A       2.0
1           1   Rose   rose@163.com    103.0       C       3.0
2           2    Bob    bob@163.com    102.0       B       1.0
3           3   Jack   jack@163.com    104.0       D       2.0
4           3   Jack   jack@163.com    106.0       A       1.0
5           4  David  david@163.com    105.0       E       4.0
6           5   Lucy   lucy@163.com      NaN     NaN       NaN
7           6    NaN            NaN    107.0       B       2.0
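
With an outer join it is often useful to know which side each row came from. The indicator parameter of merge() can show that. A sketch on a two-row slice of the tables above:

```python
import pandas as pd

user_df = pd.DataFrame({'CustomerID': [1, 5], 'Name': ['Rose', 'Lucy']})
buy_df = pd.DataFrame({'CustomerID': [1, 6], 'OrderID': [101, 107]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(user_df, buy_df, on='CustomerID', how='outer', indicator=True)
print(merged)

# Customers that have no order at all
no_orders = merged[merged['_merge'] == 'left_only']
print(no_orders)
```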

5. Index-based joining: join()

"""
def join(
        self,
        other: DataFrame | Series | Iterable[DataFrame | Series],
        on: IndexLabel | None = None,
        how: MergeHow = "left",
        lsuffix: str = "",
        rsuffix: str = "",
        sort: bool = False,
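        validate: str | None = None)
Parameters:
other: the table(s) to join.
on: column name(s) or index level name(s) in the calling DataFrame to match against the index of other; may be a list. If omitted, the join is index on index.
how: the join type, one of 'left', 'right', 'outer', 'inner' or 'cross'. Defaults to 'left'.
lsuffix: suffix added to the left table's column names when names overlap. Defaults to ''.
rsuffix: suffix added to the right table's column names when names overlap. Defaults to ''.
sort: if True, sort the result by the join key; if False, the row order depends on the how parameter. Defaults to False.
validate: checks the relationship between the join keys. Supports "one_to_one" or "1:1", "one_to_many" or "1:m", "many_to_one" or "m:1", "many_to_many" or "m:m".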
"""

5.1 Joining when column names overlap

import pandas as pd

students_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [18, 19, 18, 20, 19]
}
students_df = pd.DataFrame(students_data,index=list('abcde'))
print('Original students_df:')
print(students_df)
scores_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Math': [90, 85, 92, 78, 88],
    'Science': [88, 87, 91, 79, 90]
}
scores_df = pd.DataFrame(scores_data,index=list('cdefg'))
print('Original scores_df:')
print(scores_df)

# join() matches rows by index. When both tables share a column name, a suffix must be set for at least one side
# Here lsuffix='_left' is appended to the overlapping column name from the left table
df = students_df.join(scores_df,lsuffix='_left')
print('Index-based join:')
print(df)

Output:

Original students_df:
   StudentID     Name  Age
a          1    Alice   18
b          2      Bob   19
c          3  Charlie   18
d          4    David   20
e          5      Eve   19

Original scores_df:
   StudentID  Math  Science
c          1    90       88
d          2    85       87
e          3    92       91
f          4    78       79
g          5    88       90

Index-based join:
   StudentID_left     Name  Age  StudentID  Math  Science
a               1    Alice   18        NaN   NaN      NaN
b               2      Bob   19        NaN   NaN      NaN
c               3  Charlie   18        1.0  90.0     88.0
d               4    David   20        2.0  85.0     87.0
e               5      Eve   19        3.0  92.0     91.0
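
To see why the suffix is needed, the sketch below calls join() on two small frames (left and right are made-up names) that share a StudentID column, first without a suffix and then with both:

```python
import pandas as pd

left = pd.DataFrame({'StudentID': [1, 2], 'Name': ['Alice', 'Bob']}, index=['a', 'b'])
right = pd.DataFrame({'StudentID': [1, 2], 'Math': [90, 85]}, index=['a', 'b'])

try:
    left.join(right)          # both tables have a StudentID column
    raised = False
except ValueError as exc:
    raised = True
    print('join failed:', exc)

# Supplying suffixes resolves the name clash.
joined = left.join(right, lsuffix='_left', rsuffix='_right')
print(joined)
```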

5.2 Matching rows on a shared key column

import pandas as pd

students_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [18, 19, 18, 20, 19]
}
students_df = pd.DataFrame(students_data)
# Make StudentID the row index
students_df.set_index('StudentID',inplace=True)
print('Original students_df:')
print(students_df)

scores_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Math': [90, 85, 92, 78, 88],
    'Science': [88, 87, 91, 79, 90]
}
scores_df = pd.DataFrame(scores_data)
# Make StudentID the row index
scores_df.set_index('StudentID',inplace=True)
print('Original scores_df:')
print(scores_df)

# To match rows of the two tables on their shared StudentID column with join(),
# set that column as the index of both tables, then pass its name via the on parameter
df = students_df.join(scores_df,on='StudentID')
print('Join on the shared StudentID index:')
print(df)

Output:

Original students_df:
              Name  Age
StudentID              
1            Alice   18
2              Bob   19
3          Charlie   18
4            David   20
5              Eve   19

Original scores_df:
           Math  Science
StudentID               
1            90       88
2            85       87
3            92       91
4            78       79
5            88       90

Join on the shared StudentID index:
              Name  Age  Math  Science
StudentID                             
1            Alice   18    90       88
2              Bob   19    85       87
3          Charlie   18    92       91
4            David   20    78       79
5              Eve   19    88       90

6. Deleting data

6.1 Deleting a specific column: del

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,columns=list('abcd'))
print('Original df:')
print(df)

# del removes the named column in place
del df['a']
print('Data after deleting column a with del:')
print(df)

Output:

Original df:
           a          b          c          d
0   0.023600  55.164221  22.564766  88.598519
1  11.685332  31.011234  61.363767  48.929718
2  44.680275  46.587721  49.290019  32.071490
3  41.177778  62.977722   6.162506  75.664697

Data after deleting column a with del:
           b          c          d
0  55.164221  22.564766  88.598519
1  31.011234  61.363767  48.929718
2  46.587721  49.290019  32.071490
3  62.977722   6.162506  75.664697

6.2 Deleting specified rows or columns: drop()

"""
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
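Parameters:
labels: the row or column labels to drop; a single label or a list-like of labels.
axis: the axis to drop from; 0 means rows, 1 means columns.
index, columns: alternatives to labels that name the row labels or column labels directly, so no axis is needed.
level: for a MultiIndex, the level from which the labels are removed.
inplace: whether to modify the original object. Defaults to False.
errors: 'raise' (the default) raises a KeyError when a label does not exist; 'ignore' skips missing labels.
Returns:
By default, a new object with the specified rows/columns removed; None if inplace=True.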
"""

6.2.1 Deleting one row

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(20,size=(4,4))*10,columns=list('abcd'))
print('Original df:')
print(df)

# Method 1:
# Rows are dropped by default (axis=0)
print('Method 1: drop the row with labels=0 (default axis):')
drop_df = df.drop(0)
print(drop_df)

# Method 2:
print('Method 2: drop the row with index=1:')
drop_df = df.drop(index=1)
print(drop_df)

Output:

Original df:
     a   b    c    d
0   80  90   60  100
1   90  50   90   10
2   20  80   20  170
3  160  20  160  160

Method 1: drop the row with labels=0 (default axis):
     a   b    c    d
1   90  50   90   10
2   20  80   20  170
3  160  20  160  160

Method 2: drop the row with index=1:
     a   b    c    d
0   80  90   60  100
2   20  80   20  170
3  160  20  160  160

6.2.2 Deleting several rows

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(20,size=(4,4))*10,columns=list('abcd'))
print('Original df:')
print(df)

# Method 1:
# Rows are dropped by default (axis=0)
print('Method 1: drop several rows with labels=[0,2] (default axis):')
drop_df = df.drop([0,2])
print(drop_df)

# Method 2:
print('Method 2: drop several rows with index=[1,3]:')
drop_df = df.drop(index=[1,3])
print(drop_df)

Output:

Original df:
     a   b    c    d
0   20  30  110   90
1  150  40   70   60
2   50  40   50  170
3   50  70   30   60

Method 1: drop several rows with labels=[0,2] (default axis):
     a   b   c   d
1  150  40  70  60
3   50  70  30  60

Method 2: drop several rows with index=[1,3]:
    a   b    c    d
0  20  30  110   90
2  50  40   50  170

6.2.3 Deleting columns

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(20,size=(4,4))*10,columns=list('abcd'))
print('Original df:')
print(df)

print("Method 1, labels=['a','c']:")
# axis=0 drops rows, axis=1 drops columns
drop_df = df.drop(['a','c'],axis=1)
print(drop_df)

print("Method 2, columns=['b','d']:")
# With columns=..., axis is not needed
drop_df = df.drop(columns=['b','d'])
print(drop_df)

Output:

Original df:
     a    b    c    d
0  180   50  160   60
1   50   70   90  120
2    0  130   30   60
3  130   70   40   10

Method 1, labels=['a','c']:
     b    d
0   50   60
1   70  120
2  130   60
3   70   10

Method 2, columns=['b','d']:
     a    c
0  180  160
1   50   90
2    0   30
3  130   40

6.2.4 Deleting rows and columns at the same time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(20,size=(4,4))*10,columns=list('abcd'))
print('Original df:')
print(df)

# Drop several rows and several columns in one call
print('Rows and columns dropped together:')
drop_df = df.drop(index=[0,2],columns=['a','c'])
print(drop_df)

Output:

Original df:
     a    b    c    d
0   50   30   80  100
1   40  170  190  110
2  110  170   50  170
3  150   20  190  170

Rows and columns dropped together:
     b    d
1  170  110
3   20  170
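
All of the drop() calls above name labels that exist. If a label might be missing, the errors parameter decides what happens. A small sketch, where 'zzz' stands in for a column that is not there:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

try:
    df.drop(columns=['zzz'])            # 'zzz' is not a column
    raised = False
except KeyError:
    raised = True
print('KeyError raised:', raised)

# errors='ignore' drops the labels that exist ('a') and skips the rest ('zzz').
safe = df.drop(columns=['a', 'zzz'], errors='ignore')
print(safe)
```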

7. Querying data

7.1 The query() method

It works like a WHERE clause in SQL.

7.1.1 A single condition

import pandas as pd

data = {
    'A':[1,2,3,4,5],
    'B':[10,20,30,40,50]
}
df = pd.DataFrame(data)
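print('Original df:')
print(df)

# Rows where A is greater than 2; the query expression must be passed as a string
print('Rows where A > 2 (the expression is passed as a string):')
query_df = df.query('A>2')
print(query_df)

Output:

Original df:
   A   B
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50

Rows where A > 2 (the expression is passed as a string):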
   A   B
2  3  30
3  4  40
4  5  50

7.1.2 Multiple conditions

Use & or and when both conditions must hold.

Use | or or when either condition is enough.

import pandas as pd

data = {
    'A':[1,2,3,4,5],
    'B':[10,20,30,40,50]
}
df = pd.DataFrame(data)
print('Original df:')
print(df)

print('Rows where A > 2 and B < 40 (the expression is passed as a string):')
print('Using and:')
query_df = df.query('A>2 and B<40')
print(query_df)
# Or equivalently
print('Using &:')
query_df = df.query('A>2 & B<40')
print(query_df)

print('Rows where A > 2 or B < 20 (the expression is passed as a string):')
print('Using or:')
query_df = df.query('A>2 or B<20')
print(query_df)

print('Using |:')
query_df = df.query('A>2 | B<20')
print(query_df)

Output:

Original df:
   A   B
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50

Rows where A > 2 and B < 40 (the expression is passed as a string):

Using and:
   A   B
2  3  30

Using &:
   A   B
2  3  30

Rows where A > 2 or B < 20 (the expression is passed as a string):

Using or:
   A   B
0  1  10
2  3  30
3  4  40
4  5  50

Using |:
   A   B
0  1  10
2  3  30
3  4  40
4  5  50
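
For comparison, the same filter can be written with ordinary boolean indexing. The sketch below checks that both forms return the same rows:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

by_query = df.query('A > 2 and B < 40')
# With boolean indexing, use & / | (not and / or) and parenthesise each comparison.
by_mask = df[(df['A'] > 2) & (df['B'] < 40)]

print(by_mask)
print(by_query.equals(by_mask))
```

query() tends to read better for long conditions, while boolean indexing needs no string parsing.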

7.1.3 Finding rows whose 语文 (Chinese) score is above the average

import numpy as np
import pandas as pd

data = np.random.randint(0,100,(10,3))
index = [f'stu_{i}' for i in range(1,11)]
columns = ['语文','数学','英语']  # Chinese, Math, English
df = pd.DataFrame(data,index=index,columns=columns)
print('Original df:')
print(df)

# Find rows whose 语文 (Chinese) score is above the average 语文 score
average = df['语文'].mean()
# Inside the query string, refer to a Python variable as @name
print('Rows whose 语文 score is above the average:')
query_df = df.query('语文 > @average')
print(query_df)

Output:

Original df:
        语文  数学  英语
stu_1   45  98  41
stu_2   35  83  55
stu_3   57  36  83
stu_4   58  34  32
stu_5   32  53   1
stu_6   17  91  85
stu_7   34  75  22
stu_8   37  25  40
stu_9   74  12  57
stu_10   1  58  13

Rows whose 语文 score is above the average:
       语文  数学  英语
stu_1  45  98  41
stu_3  57  36  83
stu_4  58  34  32
stu_9  74  12  57

7.1.4 Finding rows where the 语文 (Chinese) score is above 30 or the 数学 (Math) score equals 90

import numpy as np
import pandas as pd

data = np.random.randint(0,100,(10,3))
index = [f'stu_{i}' for i in range(1,11)]
columns = ['语文','数学','英语']  # Chinese, Math, English
df = pd.DataFrame(data,index=index,columns=columns)
print('Original df:')
print(df)

# Find rows where the 语文 score is above 30 or the 数学 score equals 90
print('Rows where 语文 > 30 or 数学 == 90:')
query_df = df.query('语文 > 30 or 数学 == 90')
print(query_df)

Output:

Original df:
        语文  数学  英语
stu_1   38  46  99
stu_2   42  90  34
stu_3   21  61  25
stu_4   84  31  33
stu_5   10  88  52
stu_6    5  79  19
stu_7   83  16  37
stu_8   21  95  50
stu_9   13  84  81
stu_10  55  54  70

Rows where 语文 > 30 or 数学 == 90:
        语文  数学  英语
stu_1   38  46  99
stu_2   42  90  34
stu_4   84  31  33
stu_7   83  16  37
stu_10  55  54  70

7.1.5 Querying with in

import pandas as pd

data = {
    'A':[1,2,3,4,5],
    'B':[10,20,30,40,50]
}
df = pd.DataFrame(data)
print('Original df:')
print(df)

# Use in to find the rows where B is 20 or 40
print('Rows where B is 20 or 40, using in:')
query_df = df.query('B in (20,40)')
print(query_df)

Output:

Original df:
   A   B
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50

Rows where B is 20 or 40, using in:
   A   B
1  2  20
3  4  40

7.2 Testing membership in a set of values: isin()

import pandas as pd

data = {
    'City': ['长沙', '北京', '上海', '成都', '云南'],
    'Population (millions)': [84, 39, 27, 23, 15]
}
df = pd.DataFrame(data)
print('Original df:')
print(df)

# isin() returns a boolean Series
print('Boolean Series returned by isin():')
print(df['City'].isin(['长沙','成都']))
# Select the rows whose City is 长沙 or 成都
print('Rows whose City is 长沙 or 成都:')
isin_df = df[df['City'].isin(['长沙','成都'])]
print(isin_df)

Output:

Original df:
  City  Population (millions)
0   长沙                     84
1   北京                     39
2   上海                     27
3   成都                     23
4   云南                     15

Boolean Series returned by isin():
0     True
1    False
2    False
3     True
4    False
Name: City, dtype: bool

Rows whose City is 长沙 or 成都:
  City  Population (millions)
0   长沙                     84
3   成都                     23
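
Because isin() returns a boolean Series, it can be inverted with ~ to select everything outside the list. A sketch in which the Rank column is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'City': ['长沙', '北京', '上海', '成都'], 'Rank': [1, 2, 3, 4]})

mask = df['City'].isin(['长沙', '成都'])
others = df[~mask]      # ~ flips True/False, keeping the rows NOT in the list
print(others)
```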

8. Multi-level indexes

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(70, 100, size=(2, 4)),
                   index=['girl', 'boy'],
                   columns=[['English', 'English', 'Chinese', 'Chinese'],
                            ['like', 'dislike', 'like', 'dislike']])
print('Original df1:')
print(df1)
df2 = pd.DataFrame(np.random.randint(70, 120, size=(4, 2)),
                   columns=['girl', 'boy'],
                   index=pd.MultiIndex.from_product([['English', 'Chinese'], ['like', 'dislike']]))
print('Original df2:')
print(df2)
print('Multi-level index lookup: rows whose outer index is English:')
q_df = df2.loc['English',:]
print(q_df)

# Inner index labels: 期中 = midterm, 期末 = final
s = pd.Series(np.random.randint(0, 100, size=6), index=[['a','a','b','b','c','c'],
                                                        ['期中','期末','期中','期末','期中','期末']])
print('Original s:')
print(s)

# Use iloc for positional access; s[3] relies on a deprecated integer fallback
print('The 4th row of s:')
print(s.iloc[3])

print('The value at index (c, 期末):')
print(s['c','期末'])

print('s after resetting the index:')
s = s.reset_index()
print(s)

Output:

Original df1:
     English         Chinese        
        like dislike    like dislike
girl      92      92      98      96
boy       78      85      96      85

Original df2:
                 girl  boy
English like       72  102
        dislike    91  119
Chinese like       70  111
        dislike    90   70

Multi-level index lookup: rows whose outer index is English:
         girl  boy
like       72  102
dislike    91  119

Original s:
a  期中    20
   期末    37
b  期中    52
   期末    72
c  期中    34
   期末    22
dtype: int32

The 4th row of s:
72

The value at index (c, 期末):
22

s after resetting the index:
  level_0 level_1   0
0       a      期中  20
1       a      期末  37
2       b      期中  52
3       b      期末  72
4       c      期中  34
5       c      期末  22
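
The level_0, level_1 and 0 headers appear because neither the index levels nor the Series have names. Naming them gives a more readable result, and xs() can then select by the inner level. A sketch in which the names student, exam and score and the labels mid and final are all made up:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'b'], ['mid', 'final']],
                                   names=['student', 'exam'])   # hypothetical level names
s = pd.Series([20, 37, 52, 72], index=index, name='score')

finals = s.xs('final', level='exam')   # pick one label from the inner level
print(finals)

flat = s.reset_index()                 # columns: student, exam, score
print(flat)
```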

9. Exercises

"""
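# 1. Given df = pd.DataFrame(np.random.randint(10,20,(3,3)),index=['a','b','c']),
add an age column and fill it with data.
# 2. Using the df from exercise 1:
① sort by the age column in descending order
② sort by index in ascending order
③ add a column holding the rank by age.
# 3. Change the age in row c of df to 2
# 4. Add a priority column containing only 'yes' and 'no', then replace yes/no in that column with the booleans True/False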
"""

Exercise code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10,20,(3,3)),index=['a','b','c'])
print('Original df:')
print(df)

print('Add an age column and fill it with data:')
df['age'] = [28,19,20]
print(df)

print('Sort by the age column in descending order:')
sort_df = df.sort_values(by='age',ascending=False)
print(sort_df)

print('Sort by index in ascending order:')
sort_df = df.sort_index()
print(sort_df)

print('Add a column holding the rank by age:')
df['age_rank'] = df['age'].rank()
print(df)

print('Change the age in row c of df to 2:')
df.loc['c','age'] = 2
print(df)

print('Add a priority column containing only "yes" and "no":')
df['priority']= ['yes','no','yes']
print(df)

print('Replace yes/no in the priority column with True/False:')
print(df['priority']=='yes')
# Method 1:
# df['priority'] = [True if priority== 'yes' else False for priority in df['priority']]
# Method 2:
df['priority'] = df['priority']=='yes'
print(df)

Output:

Original df:
    0   1   2
a  11  13  19
b  16  11  16
c  19  11  11

Add an age column and fill it with data:
    0   1   2  age
a  11  13  19   28
b  16  11  16   19
c  19  11  11   20

Sort by the age column in descending order:
    0   1   2  age
a  11  13  19   28
c  19  11  11   20
b  16  11  16   19

Sort by index in ascending order:
    0   1   2  age
a  11  13  19   28
b  16  11  16   19
c  19  11  11   20

Add a column holding the rank by age:
    0   1   2  age  age_rank
a  11  13  19   28       3.0
b  16  11  16   19       1.0
c  19  11  11   20       2.0

Change the age in row c of df to 2:
    0   1   2  age  age_rank
a  11  13  19   28       3.0
b  16  11  16   19       1.0
c  19  11  11    2       2.0

Add a priority column containing only "yes" and "no":
    0   1   2  age  age_rank priority
a  11  13  19   28       3.0      yes
b  16  11  16   19       1.0       no
c  19  11  11    2       2.0      yes

Replace yes/no in the priority column with True/False:
a     True
b    False
c     True
Name: priority, dtype: bool
    0   1   2  age  age_rank  priority
a  11  13  19   28       3.0      True
b  16  11  16   19       1.0     False
c  19  11  11    2       2.0      True
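
A third option for exercise 4, not used in the solution above, is map() with a dict. This is only a sketch of that alternative on a one-column frame:

```python
import pandas as pd

df = pd.DataFrame({'priority': ['yes', 'no', 'yes']}, index=['a', 'b', 'c'])

# map() looks each value up in the dict; a value missing from the dict would become NaN.
df['priority'] = df['priority'].map({'yes': True, 'no': False})
print(df)
```

The dict makes the intended mapping explicit, which helps when the column may hold more than two distinct values.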

More will be added here as I keep learning.