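DataFrame: Data Processing 1

Up front: I'm Octopus; the name comes from my Chinese name, 章鱼 (octopus). I love programming, algorithms, and open source. All of the source code is on my personal GitHub. This blog records my learning bit by bit; if you are interested in Python, Java, AI, or algorithms, follow my updates and let's learn and improve together.


Import pandas and numpy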

import pandas as pd
import numpy as np

I. Adding, deleting, and modifying data

1. Adding rows

(1) Entering the new row's contents by hand
# Sample data
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50

How do we add a row by hand?

# Assigning to an index label that does not exist yet (5) appends a new row
df1.loc[5] = ["baby","shanghai",80]
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50
5   baby  shanghai     80
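
The hard-coded label 5 above only works because we can see that the last label is 4. A minimal sketch of the same idea without hard-coding, assuming the frame still has its default 0, 1, 2, ... index (the small frame `df` here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ray", "jack"],
                   "city": ["hangzhou", "beijing"],
                   "score": [10, 30]})

# Assumption: default RangeIndex, so len(df) is the next unused label.
df.loc[len(df)] = ["baby", "shanghai", 80]

# Assigning to a label that already exists overwrites that row instead of adding one.
df.loc[0] = ["ray", "hangzhou", 11]
print(df)
```

If the index has gaps or is not numeric, len(df) may collide with an existing label, so this shortcut is only safe under the stated assumption.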

(2) Appending a DataFrame with the same fields

# Sample data
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])
df1_1 = pd.DataFrame({"name":["faker","lucy"],
                    "city":["guangzhou","shenzhen"],
                    "score":[70,75]},
                  columns=["name","city","score"])
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50
df1_1
    name       city  score
0  faker  guangzhou     70
1   lucy   shenzhen     75

How do we add a DataFrame with the same fields to the original DataFrame?

# Appending directly keeps each frame's own index labels, so the numbering does not continue
# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions use pd.concat (shown further down)
df1.append(df1_1)
    name       city  score
0    ray   hangzhou     10
1   jack    beijing     30
2   lucy   hangzhou     20
3    bob    chengdu     15
4  candy     suzhou     50
0  faker  guangzhou     70
1   lucy   shenzhen     75
# The right way is to pass ignore_index=True, so the index is renumbered continuously
df1.append(df1_1,ignore_index=True)
    name       city  score
0    ray   hangzhou     10
1   jack    beijing     30
2   lucy   hangzhou     20
3    bob    chengdu     15
4  candy     suzhou     50
5  faker  guangzhou     70
6   lucy   shenzhen     75

Another way is to join DataFrames with the same fields using concat

pd.concat([df1,df1_1],ignore_index=True)
    name       city  score
0    ray   hangzhou     10
1   jack    beijing     30
2   lucy   hangzhou     20
3    bob    chengdu     15
4  candy     suzhou     50
5  faker  guangzhou     70
6   lucy   shenzhen     75
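
To see the effect of ignore_index in isolation, here is a small sketch with pd.concat only (the frames `base` and `extra` are illustrative, not from the article):

```python
import pandas as pd

base = pd.DataFrame({"name": ["ray", "jack"], "score": [10, 30]})
extra = pd.DataFrame({"name": ["faker", "lucy"], "score": [70, 75]})

# Without ignore_index each frame keeps its own labels: 0, 1, 0, 1
kept = pd.concat([base, extra])

# With ignore_index=True the result is renumbered: 0, 1, 2, 3
renumbered = pd.concat([base, extra], ignore_index=True)

print(list(kept.index), list(renumbered.index))
```

Duplicate labels such as the ones in `kept` make label-based selection ambiguous (kept.loc[0] returns two rows), which is usually why the renumbered form is preferred.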

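2. Deleting rows

# Sample data
df_concat = pd.concat([df1,df1_1],ignore_index=True)
df_concat
    name       city  score
0    ray   hangzhou     10
1   jack    beijing     30
2   lucy   hangzhou     20
3    bob    chengdu     15
4  candy     suzhou     50
5  faker  guangzhou     70
6   lucy   shenzhen     75

How do we delete rows?

# Delete the 7th row, i.e. the row with index label 6; inplace=True modifies df_concat itself
df_concat.drop(6,inplace=True)
df_concat
    name       city  score
0    ray   hangzhou     10
1   jack    beijing     30
2   lucy   hangzhou     20
3    bob    chengdu     15
4  candy     suzhou     50
5  faker  guangzhou     70
# Delete the 4th and 6th rows (index labels 3 and 5); without inplace=True this returns a new DataFrame and leaves df_concat unchanged
df_concat.drop([3,5])
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
4  candy    suzhou     50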
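
Two details of drop are easy to miss: it works on index labels rather than positions, and by default it returns a copy. A short sketch under those assumptions, which also shows filtering by condition as an alternative way to remove rows (the frame `df` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ray", "jack", "lucy", "bob"],
                   "score": [10, 30, 20, 15]})

# drop returns a new DataFrame; df itself still has four rows.
dropped = df.drop([1, 3])

# Rows can also be removed by keeping only those that satisfy a condition.
high = df[df["score"] >= 20]

# The surviving rows keep their old labels; reset_index renumbers them.
high_reset = high.reset_index(drop=True)
print(dropped, high_reset, sep="\n")
```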

3. Modifying rows

To modify rows, first select the row or rows to change, then assign new values

# Sample data
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50

How do we change ray to demon, hangzhou to wenzhou, and 10 to 35 in the first row?

df1.loc[0] = ["demon","wenzhou",35]
df1
    name      city  score
0  demon   wenzhou     35
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50

How do we modify several rows at once?

# Each inner list is one row, in column order (name, city, score); the label slice 0:2 includes row 2
df1.loc[0:2] = [["d","h",40],["j","b",50],["l","h",60]]
df1
    name     city  score
0      d        h     40
1      j        b     50
2      l        h     60
3    bob  chengdu     15
4  candy   suzhou     50
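
Selecting first and assigning second also works for rows picked by a condition, or for a block of rows and columns. A minimal sketch (the frame `df` and the replacement values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ray", "jack", "lucy"],
                   "city": ["hangzhou", "beijing", "hangzhou"],
                   "score": [10, 30, 20]})

# Update one column for every row that matches a condition.
df.loc[df["city"] == "hangzhou", "score"] = 100

# Update a block: rows 0 and 1 (label slices include both ends), two columns.
# Each inner list is one row, in the order of the selected columns.
df.loc[0:1, ["name", "city"]] = [["d", "h"], ["j", "b"]]
print(df)
```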

4. Adding columns

(1) Adding a column at the end
# Sample data
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50

How do we add a column, gender, at the end?

df1["gender"] = ["male","male","female","male","female"]
df1
    name      city  score  gender
0    ray  hangzhou     10    male
1   jack   beijing     30    male
2   lucy  hangzhou     20  female
3    bob   chengdu     15    male
4  candy    suzhou     50  female
(2) Inserting a new column at any position

Suppose we want to insert a new column, height, as the second column

df1.insert(1,"height",[170,165,172,180,169])  # 1st argument: the 0-based position to insert at; 2nd: the column name; 3rd: the values
df1
    name  height      city  score  gender
0    ray     170  hangzhou     10    male
1   jack     165   beijing     30    male
2   lucy     172  hangzhou     20  female
3    bob     180   chengdu     15    male
4  candy     169    suzhou     50  female
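
When the target position is "right before column X" rather than a known number, the position can be looked up instead of counted. A small sketch, assuming column names are unique so that get_loc returns a single integer (the frame `df` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ray", "jack"],
                   "city": ["hangzhou", "beijing"],
                   "score": [10, 30]})

# Look up the 0-based position of "city" and insert the new column there,
# which pushes "city" and everything after it one place to the right.
position = df.columns.get_loc("city")
df.insert(position, "height", [170, 165])
print(df.columns.tolist())
```

Note that insert modifies the DataFrame in place and returns None, unlike most of the methods used so far.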

5. Deleting columns

# Sample data
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50
(1) del DataFrame["colname"]
# del removes the column from df1 itself
del df1["score"]
df1
    name      city
0    ray  hangzhou
1   jack   beijing
2   lucy  hangzhou
3    bob   chengdu
4  candy    suzhou
(2) DataFrame.drop(["colname"],axis = 1)

First rerun the statement that builds df1 to reset it

# drop returns a new DataFrame without the column; df1 itself is unchanged
df1.drop(["city"],axis=1)
    name  score
0    ray     10
1   jack     30
2   lucy     20
3    bob     15
4  candy     50
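
The same drop call can be written with the columns keyword, which avoids having to remember the axis number. A brief sketch (the frame `df` is illustrative; "missing" is a deliberately nonexistent column name):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ray", "jack"],
                   "city": ["hangzhou", "beijing"],
                   "score": [10, 30]})

# Equivalent to df.drop(["city"], axis=1); returns a copy, df keeps "city".
trimmed = df.drop(columns=["city"])

# errors="ignore" skips labels that do not exist instead of raising KeyError.
tolerant = df.drop(columns=["city", "missing"], errors="ignore")
print(trimmed.columns.tolist(), tolerant.columns.tolist())
```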

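6. Modifying columns

To modify columns, first select the column or columns to change, then assign new values

# Sample data
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50

Modify the score column

# A scalar is broadcast to every row of the column
df1["score"] = 50
df1
    name      city  score
0    ray  hangzhou     50
1   jack   beijing     50
2   lucy  hangzhou     50
3    bob   chengdu     50
4  candy    suzhou     50

Modify the city and score columns

# Assign each column separately: a ragged list such as [[...five values...], 60] is not reliably accepted for a multi-column assignment
df1["city"] = ["hz","bj","hz","cd","sz"]
df1["score"] = 60
df1
    name  city  score
0    ray    hz     60
1   jack    bj     60
2   lucy    hz     60
3    bob    cd     60
4  candy    sz     60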
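
If several columns should be changed in one expression, DataFrame.assign is one option: it accepts one keyword per column and returns a new frame. A minimal sketch (the frame `df` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ray", "jack"],
                   "city": ["hangzhou", "beijing"],
                   "score": [10, 30]})

# One keyword per column: a list replaces the column value by value,
# a scalar is broadcast to every row. df itself is left untouched.
updated = df.assign(city=["hz", "bj"], score=60)
print(updated)
```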

II. Merging datasets

How do we merge?

Merging with a column as the key

Sample data

df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])

df2 = pd.DataFrame({"name":["ray","lucy","demon"],
                   "age":[15,17,16]},
                  columns=["name","age"])
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50
df2
    name  age
0    ray   15
1   lucy   17
2  demon   16

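(1) inner join (intersection)

pd.merge(df1,df2) # the default join type is inner (intersection); if no key is given, the columns shared by both frames are used as the key
   name      city  score  age
0   ray  hangzhou     10   15
1  lucy  hangzhou     20   17
pd.merge(df1,df2,how="inner")   # the join type can also be stated explicitly; how="inner" is the default, so this is the same as omitting it
   name      city  score  age
0   ray  hangzhou     10   15
1  lucy  hangzhou     20   17
pd.merge(df1,df2,on="name")   # the key can also be stated explicitly as the "name" column
   name      city  score  age
0   ray  hangzhou     10   15
1  lucy  hangzhou     20   17
# So the fully explicit form is
pd.merge(df1,df2,on="name",how="inner")
   name      city  score  age
0   ray  hangzhou     10   15
1  lucy  hangzhou     20   17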

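(2) outer join (union)

# Keys from either frame are kept; cells with no match become NaN, which turns the integer columns into floats
# The row order of an outer merge can differ between pandas versions
pd.merge(df1,df2,on="name",how="outer")
    name      city  score   age
0    ray  hangzhou   10.0  15.0
1   jack   beijing   30.0   NaN
2   lucy  hangzhou   20.0  17.0
3    bob   chengdu   15.0   NaN
4  candy    suzhou   50.0   NaN
5  demon       NaN    NaN  16.0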

(3) left join (keep every left row, add matches from the right)

df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50
df2
    name  age
0    ray   15
1   lucy   17
2  demon   16
pd.merge(df1,df2,on="name",how="left")
    name      city  score   age
0    ray  hangzhou     10  15.0
1   jack   beijing     30   NaN
2   lucy  hangzhou     20  17.0
3    bob   chengdu     15   NaN
4  candy    suzhou     50   NaN

(4) right join (keep every right row, add matches from the left)

pd.merge(df1,df2,on="name",how="right")
    name      city  score  age
0    ray  hangzhou   10.0   15
1   lucy  hangzhou   20.0   17
2  demon       NaN    NaN   16
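
To compare the four join types side by side, the sketch below counts the rows each one produces and uses merge's indicator=True option to label where each row of the outer join came from. The frames are trimmed copies of the sample data; the names `row_counts` and `origin` are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["ray", "jack", "lucy", "bob", "candy"],
                    "score": [10, 30, 20, 15, 50]})
df2 = pd.DataFrame({"name": ["ray", "lucy", "demon"],
                    "age": [15, 17, 16]})

# Number of result rows for each join type.
row_counts = {how: len(pd.merge(df1, df2, on="name", how=how))
              for how in ["inner", "outer", "left", "right"]}

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
outer = pd.merge(df1, df2, on="name", how="outer", indicator=True)
origin = outer.set_index("name")["_merge"]
print(row_counts)
print(origin)
```

Looking rows up by name rather than by position keeps the example independent of the row order that a particular pandas version produces.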

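Some problems you may run into:

Q1: What if the key column has a different name in each dataset?

df3 = df2.rename(columns={"name":"name2"})
df3
   name2  age
0    ray   15
1   lucy   17
2  demon   16
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50
# left_on names the key column in the left frame, right_on the key column in the right frame; both key columns are kept in the result
pd.merge(df1,df3,left_on="name",right_on="name2",how="inner")
   name      city  score  name2  age
0   ray  hangzhou     10    ray   15
1  lucy  hangzhou     20   lucy   17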
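
Because left_on/right_on keeps both key columns, the result carries the same information twice. Two ways to tidy that up are sketched below, using reduced versions of the sample frames; which one is preferable is a matter of taste.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["ray", "jack", "lucy"], "score": [10, 30, 20]})
df3 = pd.DataFrame({"name2": ["ray", "lucy", "demon"], "age": [15, 17, 16]})

# Option 1: merge on the differently named keys, then drop the duplicate key column.
merged = pd.merge(df1, df3, left_on="name", right_on="name2", how="inner")
tidy = merged.drop(columns="name2")

# Option 2: rename the key first so that a plain on="name" works.
renamed = pd.merge(df1, df3.rename(columns={"name2": "name"}), on="name")
print(tidy, renamed, sep="\n")
```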

Q2: What if several keys are needed for the merge?

# Add a new row to df1 whose name, ray, already appears
df1.loc[5] = ["ray","wuhan",80]
df1
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50
5    ray     wuhan     80
# Add a new column, city, to df2
df2["city"] = ["hangzhou","hangzhou","heilongjiang"]
df2
    name  age          city
0    ray   15      hangzhou
1   lucy   17      hangzhou
2  demon   16  heilongjiang
# A row matches only when both name and city agree, so the ray from wuhan gets no age
pd.merge(df1,df2,on=["name","city"],how="left")
    name      city  score   age
0    ray  hangzhou     10  15.0
1   jack   beijing     30   NaN
2   lucy  hangzhou     20  17.0
3    bob   chengdu     15   NaN
4  candy    suzhou     50   NaN
5    ray     wuhan     80   NaN
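
The point of the second key becomes clearer when the same frames are merged on name alone. A small sketch with reduced sample data (the names `by_name` and `by_both` are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["ray", "lucy", "ray"],
                    "city": ["hangzhou", "hangzhou", "wuhan"],
                    "score": [10, 20, 80]})
df2 = pd.DataFrame({"name": ["ray", "lucy"],
                    "age": [15, 17],
                    "city": ["hangzhou", "hangzhou"]})

# Key = name only: both "ray" rows match, and the two city columns
# are kept apart with the default suffixes _x and _y.
by_name = pd.merge(df1, df2, on="name", how="left")

# Key = (name, city): the ray from wuhan no longer matches anything.
by_both = pd.merge(df1, df2, on=["name", "city"], how="left")
print(by_name, by_both, sep="\n")
```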

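Merging with the index as the key

# Sample data
left1 = pd.DataFrame({"key":list("acba"),"value":range(4)})
right1 = pd.DataFrame({"value2":[10,20]},index=["a","b"])
# left_on="key" takes the key from a column of left1; right_index=True takes it from the index of right1
pd.merge(left1,right1,left_on="key",right_index=True,how="inner")
   key  value  value2
0    a      0      10
3    a      3      10
2    b      2      20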
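
For this column-to-index case, DataFrame.join offers a shorter spelling: its on argument names a column of the calling frame, and the other frame is always matched by its index. A sketch with the same sample data:

```python
import pandas as pd

left1 = pd.DataFrame({"key": list("acba"), "value": range(4)})
right1 = pd.DataFrame({"value2": [10, 20]}, index=["a", "b"])

# join defaults to a left join: the "c" row stays and gets NaN for value2.
joined = left1.join(right1, on="key")

# how="inner" keeps only the keys found in right1's index.
inner = left1.join(right1, on="key", how="inner")
print(joined, inner, sep="\n")
```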

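III. Concatenating data along an axis

axis=0: work along the row axis. Rows are stacked from top to bottom, so you can think of it as operating in the vertical direction

axis=1: work along the column axis. Columns are laid out from left to right, so you can think of it as operating in the horizontal direction

Concatenating along an axis therefore means: with axis=0, two or more datasets are stacked top to bottom; with axis=1, they are placed side by side, left to right.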
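
A quick way to check which direction an axis refers to is to look at the shape of the result. A minimal sketch with two made-up 2x2 frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

# axis=0 stacks top to bottom: more rows, same columns.
tall = pd.concat([a, b], axis=0)

# axis=1 places side by side: same rows, more columns.
wide = pd.concat([a, b], axis=1)

print(tall.shape, wide.shape)

# The same convention applies to reductions: axis=0 collapses the rows
# (one result per column), axis=1 collapses the columns (one result per row).
print(a.sum(axis=0).tolist(), a.sum(axis=1).tolist())
```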

(1) Concatenation along the row axis, axis=0 (the default for concat)

When both datasets have exactly the same fields:

# Sample data
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])
df2 = pd.DataFrame({"name":["faker","fizz"],
                    "city":["wenzhou","shanghai"],
                    "score":[55,80]},
                  columns=["name","city","score"])

Concatenate df1 and df2 along the row axis

pd.concat([df1,df2],ignore_index=True)
    name      city  score
0    ray  hangzhou     10
1   jack   beijing     30
2   lucy  hangzhou     20
3    bob   chengdu     15
4  candy    suzhou     50
5  faker   wenzhou     55
6   fizz  shanghai     80

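When the two datasets have different fields:

# Sample data
df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
                    "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
                    "score":[10,30,20,15,50]},
                  columns=["name","city","score"])
df2 = pd.DataFrame({"name":["faker","fizz"],
                    "city":["wenzhou","shanghai"],
                    "gender":["male","female"]},
                  columns=["name","city","gender"])

Concatenate df1 and df2 along the row axis

pd.concat([df1,df2],ignore_index=True)
       city  gender   name  score
0  hangzhou     NaN    ray   10.0
1   beijing     NaN   jack   30.0
2  hangzhou     NaN   lucy   20.0
3   chengdu     NaN    bob   15.0
4    suzhou     NaN  candy   50.0
5   wenzhou    male  faker    NaN
6  shanghai  female   fizz    NaN

The result contains the union of the two datasets' columns, and values that do not exist are filled with NaN. The output above comes from an older pandas version that sorted the columns alphabetically; since pandas 1.0 the columns keep their order of appearance (name, city, score, gender) unless sort=True is passed.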

Creating a hierarchical index on the concatenation axis

df_concat = pd.concat([df1,df2],keys=["df1","df2"])
df_concat
           city  gender   name  score
df1 0  hangzhou     NaN    ray   10.0
    1   beijing     NaN   jack   30.0
    2  hangzhou     NaN   lucy   20.0
    3   chengdu     NaN    bob   15.0
    4    suzhou     NaN  candy   50.0
df2 0   wenzhou    male  faker    NaN
    1  shanghai  female   fizz    NaN

When we need df1 or df2 again, it can be extracted from the combined dataset

# Access df2
df_concat.loc["df2"]
       city  gender   name  score
0   wenzhou    male  faker    NaN
1  shanghai  female   fizz    NaN
# Go one step further and access the 2nd row of df2
df_concat.loc["df2"].loc[1]
# A single label returns a Series

city      shanghai
gender      female
name          fizz
score          NaN
Name: 1, dtype: object
# Access the same row with a list of labels
df_concat.loc["df2"].loc[[1]]
# A list of labels returns a DataFrame
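
The chained .loc calls above can also be written as a single lookup with a tuple, and xs can slice across the outer keys. A small sketch, assuming two frames with identical columns (reduced from the sample data; `stacked` is an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["ray", "jack"], "score": [10, 30]})
df2 = pd.DataFrame({"name": ["faker", "fizz"], "score": [55, 80]})

stacked = pd.concat([df1, df2], keys=["df1", "df2"])

# A tuple addresses both index levels at once and returns one row as a Series.
row = stacked.loc[("df2", 1)]

# xs selects on the inner level: the row labelled 1 from every source frame.
second_rows = stacked.xs(1, level=1)
print(row, second_rows, sep="\n")
```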

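(2) Concatenation along the column axis, axis=1

Combine df1 and df2 in the column direction

# Rows are aligned by index label; df2 has no rows labelled 2, 3, 4, so its columns are NaN there
pd.concat([df1,df2],axis=1)
    name      city  score   name      city  gender
0    ray  hangzhou     10  faker   wenzhou    male
1   jack   beijing     30   fizz  shanghai  female
2   lucy  hangzhou     20    NaN       NaN     NaN
3    bob   chengdu     15    NaN       NaN     NaN
4  candy    suzhou     50    NaN       NaN     NaN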
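
By default concat keeps the union of the index labels (join="outer"), which is where the NaN rows come from. Passing join="inner" keeps only the labels present in every frame. A short sketch with reduced sample data:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["ray", "jack", "lucy"], "score": [10, 30, 20]})
df2 = pd.DataFrame({"name": ["faker", "fizz"], "gender": ["male", "female"]})

# Default outer join on the index: label 2 exists only in df1.
wide = pd.concat([df1, df2], axis=1)

# Inner join on the index: only labels 0 and 1 survive.
common = pd.concat([df1, df2], axis=1, join="inner")
print(wide, common, sep="\n")
```

Both results contain two columns called name; concat along axis=1 does not rename or merge columns that share a name.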

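IV. Combining overlapping data

# Sample data
data1 = pd.DataFrame({"score":[60,np.nan,75,80],
                     "level":[np.nan,"a",np.nan,"f"],
                    "cost":[1000,1500,np.nan,1200]})
data2 = pd.DataFrame({"score":[34,58,np.nan],
                    "level":[np.nan,"c","s"]})
data1.combine_first(data2)
     cost  level  score
0  1000.0    NaN   60.0
1  1500.0      a   58.0
2     NaN      s   75.0
3  1200.0      f   80.0

data1 and data2 overlap on part of their index: the first three rows of the level and score columns. For each cell, combine_first keeps data1's value where data1 has one; where data1 is missing, the gap is filled with the corresponding value from the argument data2. A cell stays NaN when both are missing (level in row 0) or when data2 has no such column (cost in row 2).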
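
Since the caller's values always win, the order of the two frames matters. The sketch below repeats the sample data and compares both directions (the names `combined` and `flipped` are illustrative):

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame({"score": [60, np.nan, 75, 80],
                      "level": [np.nan, "a", np.nan, "f"],
                      "cost": [1000, 1500, np.nan, 1200]})
data2 = pd.DataFrame({"score": [34, 58, np.nan],
                      "level": [np.nan, "c", "s"]})

# data1 has priority; data2 only fills data1's gaps.
combined = data1.combine_first(data2)

# Flipped: data2 has priority; data1 fills data2's gaps,
# including row 3 and the cost column, which data2 lacks entirely.
flipped = data2.combine_first(data1)
print(combined, flipped, sep="\n")
```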
