文章最前: 我是Octopus,这个名字来源于我的中文名--章鱼;我热爱编程、热爱算法、热爱开源。所有源码在我的个人github ;这博客是记录我学习的点点滴滴,如果您对 Python、Java、AI、算法有兴趣,可以关注我的动态,一起学习,共同进步。
导入pandas和numpy
import pandas as pd
import numpy as np
一.数据的增删改
1.增加行
(1)手动输入新增行的内容
#示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
如何手动添加行?
df1.loc[5] = ["baby","shanghai",80]
df1
name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
5 | baby | shanghai | 80 |
---|
(2)将同字段的DataFrame添加进来
#示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df1_1 = pd.DataFrame({"name":["faker","lucy"],
"city":["guangzhou","shenzhen"],
"score":[70,75]},
columns=["name","city","score"])
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
df1_1
| name | city | score |
---|
0 | faker | guangzhou | 70 |
---|
1 | lucy | shenzhen | 75 |
---|
如何将同字段的DataFrame增加到原DataFrame中呢?
#如果直接添加进来,索引号不会顺接上去
df1.append(df1_1)
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
0 | faker | guangzhou | 70 |
---|
1 | lucy | shenzhen | 75 |
---|
df1.append?
#正确的写法如下,这样索引号就顺接上去了
df1.append(df1_1,ignore_index=True)
name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
5 | faker | guangzhou | 70 |
---|
6 | lucy | shenzhen | 75 |
---|
还有一种做法,通过concat拼接同字段的DataFrame
pd.concat([df1,df1_1],ignore_index=True)
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
5 | faker | guangzhou | 70 |
---|
6 | lucy | shenzhen | 75 |
---|
2.删除行
### 示例数据
df_concat = pd.concat([df1,df1_1],ignore_index=True)
df_concat
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
5 | faker | guangzhou | 70 |
---|
6 | lucy | shenzhen | 75 |
---|
如何删除行?
#删除第7行,也即是索引号为6的这一行
df_concat.drop(6,inplace=True)
df_concat
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
5 | faker | guangzhou | 70 |
---|
#删除第4行和第6行
df_concat.drop([3,5])
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
4 | candy | suzhou | 50 |
---|
3.修改行
若要修改行,则要先选出需要修改的一行或多行,再重新赋值
#示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
何将第一行的ray修改成demon,hangzhou改成wenzhou,10改成35?
df1.loc[0] = ["demon","hangzhou",35]
df1
| name | city | score |
---|
0 | demon | hangzhou | 35 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
如何进行多行修改?
df1.loc[0:2] = [["d","j","l"],["h","b","h"],[40,50,60]]
df1
| name | city | score |
---|
0 | d | j | l |
---|
1 | h | b | h |
---|
2 | 40 | 50 | 60 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
4.增加列
(1)在末尾插入列
#示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
如何末尾增加一列:gender(性别)
df1["gender"] = ["male","male","female","male","female"]
df1
| name | city | score | gender |
---|
0 | ray | hangzhou | 10 | male |
---|
1 | jack | beijing | 30 | male |
---|
2 | lucy | hangzhou | 20 | female |
---|
3 | bob | chengdu | 15 | male |
---|
4 | candy | suzhou | 50 | female |
---|
(2)在任意位置插入新列
我希望在第2列的位置插入新的一列:height(身高)
df1.insert(1,"height",[170,165,172,180,169]) #第1个参数1表示索引号即插入的位置,第2个参数填列的名称,第3个参数填值
df1
| name | height | city | score | gender |
---|
0 | ray | 170 | hangzhou | 10 | male |
---|
1 | jack | 165 | beijing | 30 | male |
---|
2 | lucy | 172 | hangzhou | 20 | female |
---|
3 | bob | 180 | chengdu | 15 | male |
---|
4 | candy | 169 | suzhou | 50 | female |
---|
5.删除列
#示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
(1)del DataFrame["colname"]
del df1["score"]
df1
| name | city |
---|
0 | ray | hangzhou |
---|
1 | jack | beijing |
---|
2 | lucy | hangzhou |
---|
3 | bob | chengdu |
---|
4 | candy | suzhou |
---|
(2)DataFrame.drop(["colname"],axis = 1)
先重新运行下生成df1的式子,初始化df1
df1.drop(["city"],axis=1)
| name | score |
---|
0 | ray | 10 |
---|
1 | jack | 30 |
---|
2 | lucy | 20 |
---|
3 | bob | 15 |
---|
4 | candy | 50 |
---|
6.修改列
若要修改列,则要先选出需要修改的一列或多列,再重新赋值
#示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
修改score列
df1["score"] = 50
df1
| name | city | score |
---|
0 | ray | hangzhou | 50 |
---|
1 | jack | beijing | 50 |
---|
2 | lucy | hangzhou | 50 |
---|
3 | bob | chengdu | 50 |
---|
4 | candy | suzhou | 50 |
---|
修改city和score列
df1[["city","score"]] = [["hz","bj","hz","cd","sz"],60]
df1
| name | city | score |
---|
0 | ray | hz | 60 |
---|
1 | jack | bj | 60 |
---|
2 | lucy | hz | 60 |
---|
3 | bob | cd | 60 |
---|
4 | candy | sz | 60 |
---|
二.数据集的合并
如何merge?
将列作为键合并
示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df2 = pd.DataFrame({"name":["ray","lucy","demon"],
"age":[15,17,16]},
columns=["name","age"])
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
df2
| name | age |
---|
0 | ray | 15 |
---|
1 | lucy | 17 |
---|
2 | demon | 16 |
---|
(1)inner连接(交集)
pd.merge(df1,df2) #默认连接方式是交集;若没有指定,则默认将重叠列的列名作为键
| name | city | score | age |
---|
0 | ray | hangzhou | 10 | 15 |
---|
1 | lucy | hangzhou | 20 | 17 |
---|
pd.merge(df1,df2,how="inner") #也可以显示指定连接方式为inner,等价于不填参数how="inner"
| name | city | score | age |
---|
0 | ray | hangzhou | 10 | 15 |
---|
1 | lucy | hangzhou | 20 | 17 |
---|
pd.merge(df1,df2,on="name") #也可以显式地指定键为“name”列
| name | city | score | age |
---|
0 | ray | hangzhou | 10 | 15 |
---|
1 | lucy | hangzhou | 20 | 17 |
---|
#因此完整地写法是
pd.merge(df1,df2,on="name",how="inner")
| name | city | score | age |
---|
0 | ray | hangzhou | 10 | 15 |
---|
1 | lucy | hangzhou | 20 | 17 |
---|
(2)outer连接(并集)
pd.merge(df1,df2,on="name",how="outer")
| name | city | score | age |
---|
0 | ray | hangzhou | 10.0 | 15.0 |
---|
1 | jack | beijing | 30.0 | NaN |
---|
2 | lucy | hangzhou | 20.0 | 17.0 |
---|
3 | bob | chengdu | 15.0 | NaN |
---|
4 | candy | suzhou | 50.0 | NaN |
---|
5 | demon | NaN | NaN | 16.0 |
---|
(3)left连接(保左加右)
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
df2
| name | age |
---|
0 | ray | 15 |
---|
1 | lucy | 17 |
---|
2 | demon | 16 |
---|
pd.merge(df1,df2,on="name",how="left")
| name | city | score | age |
---|
0 | ray | hangzhou | 10 | 15.0 |
---|
1 | jack | beijing | 30 | NaN |
---|
2 | lucy | hangzhou | 20 | 17.0 |
---|
3 | bob | chengdu | 15 | NaN |
---|
4 | candy | suzhou | 50 | NaN |
---|
(4)right连接(保右加左)
pd.merge(df1,df2,on="name",how="right")
| name | city | score | age |
---|
0 | ray | hangzhou | 10.0 | 15 |
---|
1 | lucy | hangzhou | 20.0 | 17 |
---|
2 | demon | NaN | NaN | 16 |
---|
一些可能遇到的问题:
Q1:如果两个数据指定列的列名不一样怎么办?
df3 = df2.rename(columns={"name":"name2"})
df3
| name2 | age |
---|
0 | ray | 15 |
---|
1 | lucy | 17 |
---|
2 | demon | 16 |
---|
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
pd.merge(df1,df3,left_on="name",right_on="name2",how="inner")
| name | city | score | name2 | age |
---|
0 | ray | hangzhou | 10 | ray | 15 |
---|
1 | lucy | hangzhou | 20 | lucy | 17 |
---|
Q2:如果需要多个键来进行合并怎么办呢?
#给df1增加新的一行,名称为已出现的ray
df1.loc[5] = ["ray","wuhan",80]
df1
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
5 | ray | wuhan | 80 |
---|
#给df2增加新的一列city
df2["city"] = ["hangzhou","hangzhou","heilongjiang"]
df2
| name | age | city |
---|
0 | ray | 15 | hangzhou |
---|
1 | lucy | 17 | hangzhou |
---|
2 | demon | 16 | heilongjiang |
---|
pd.merge(df1,df2,on=["name","city"],how="left")
| name | city | score | age |
---|
0 | ray | hangzhou | 10 | 15.0 |
---|
1 | jack | beijing | 30 | NaN |
---|
2 | lucy | hangzhou | 20 | 17.0 |
---|
3 | bob | chengdu | 15 | NaN |
---|
4 | candy | suzhou | 50 | NaN |
---|
5 | ray | wuhan | 80 | NaN |
---|
将索引作为键来合并
#示例数据
left1 = pd.DataFrame({"key":list("acba"),"value":range(4)})
right1 = pd.DataFrame({"value2":[10,20]},index=["a","b"])
pd.merge(left1,right1,left_on="key",right_index=True,how="inner")
| key | value | value2 |
---|
0 | a | 0 | 10 |
---|
3 | a | 3 | 10 |
---|
2 | b | 2 | 20 |
---|
三.数据的轴向连接
axis=0:表示在横轴上工作,所谓横轴也即是行,而行的方向是上下,因此你可以理解为在上下方向执行操作
axis=1:表示在纵轴上工作,所谓纵轴也即是列,而列的方向是左右,因此你可以理解为在左右方向直行操作
那么数据的轴向连接也就是指:当axis=0时,将两份或多份数据按照上下方向拼接起来;当axis=1时,将两份或多份数据按照左右方向拼接起来。
(1)横轴上的连接,axis=0时(concat默认axis=0)
两份数据的字段完全相同的情况:
#示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df2 = pd.DataFrame({"name":["faker","fizz"],
"city":["wenzhou","shanghai"],
"score":[55,80]},
columns=["name","city","score"])
按横轴连接df1和df2
pd.concat([df1,df2],ignore_index=True)
| name | city | score |
---|
0 | ray | hangzhou | 10 |
---|
1 | jack | beijing | 30 |
---|
2 | lucy | hangzhou | 20 |
---|
3 | bob | chengdu | 15 |
---|
4 | candy | suzhou | 50 |
---|
5 | faker | wenzhou | 55 |
---|
6 | fizz | shanghai | 80 |
---|
两份数据的字段存在不同的情况下:
#示例数据
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
"city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
"score":[10,30,20,15,50]},
columns=["name","city","score"])
df2 = pd.DataFrame({"name":["faker","fizz"],
"city":["wenzhou","shanghai"],
"gender":["male","female"]},
columns=["name","city","gender"])
按横轴连接df1和df2
pd.concat([df1,df2],ignore_index=True)
| city | gender | name | score |
---|
0 | hangzhou | NaN | ray | 10.0 |
---|
1 | beijing | NaN | jack | 30.0 |
---|
2 | hangzhou | NaN | lucy | 20.0 |
---|
3 | chengdu | NaN | bob | 15.0 |
---|
4 | suzhou | NaN | candy | 50.0 |
---|
5 | wenzhou | male | faker | NaN |
---|
6 | shanghai | female | fizz | NaN |
---|
会得到这两份数据的并集,没有的值会以NaN的方式填充
在连接轴上创建一个层次化索引
df_concat = pd.concat([df1,df2],keys=["df1","df2"])
df_concat
| | city | gender | name | score |
---|
df1 | 0 | hangzhou | NaN | ray | 10.0 |
---|
1 | beijing | NaN | jack | 30.0 |
---|
2 | hangzhou | NaN | lucy | 20.0 |
---|
3 | chengdu | NaN | bob | 15.0 |
---|
4 | suzhou | NaN | candy | 50.0 |
---|
df2 | 0 | wenzhou | male | faker | NaN |
---|
1 | shanghai | female | fizz | NaN |
---|
当要访问df1或df2时,可以从这个合并的数据集里提取
#访问df2
df_concat.loc["df2"]
city | gender | name | score |
---|
0 | wenzhou | male | faker | NaN |
---|
1 | shanghai | female | fizz | NaN |
---|
#进一步访问df2中的第2行
df_concat.loc["df2"].loc[1]
#返回的是Series
city shanghai
gender female
name fizz
score NaN
Name: 1, dtype: object
#进一步访问df2中的第2行
df_concat.loc["df2"].loc[[1]]
#返回的是DataFrame
(2)纵轴上的连接,axis=1时
按纵轴方向合并df1和df2
pd.concat([df1,df2],axis=1)
| name | city | score | name | city | gender |
---|
0 | ray | hangzhou | 10 | faker | wenzhou | male |
---|
1 | jack | beijing | 30 | fizz | shanghai | female |
---|
2 | lucy | hangzhou | 20 | NaN | NaN | NaN |
---|
3 | bob | chengdu | 15 | NaN | NaN | NaN |
---|
4 | candy | suzhou | 50 | NaN | NaN | NaN |
---|
四.合并重叠数据
#示例数据
data1 = pd.DataFrame({"score":[60,np.nan,75,80],
"level":[np.nan,"a",np.nan,"f"],
"cost":[1000,1500,np.nan,1200]})
data2 = pd.DataFrame({"score":[34,58,np.nan],
"level":[np.nan,"c","s"]})
data1.combine_first(data2)
| cost | level | score |
---|
0 | 1000.0 | NaN | 60.0 |
---|
1 | 1500.0 | a | 58.0 |
---|
2 | NaN | s | 75.0 |
---|
3 | 1200.0 | f | 80.0 |
---|
data1和data2有索引重叠的部分:即level列和score列的前三行。那么对于data1中的数据,如果data1已有数据,则继续用data1的数据,如果data1中有缺失数据,那么对于缺失数据用参数里的对象data2中的对应值来补充