pandas Data Analysis Tutorial (Complete Edition), Part 5

Concatenating and merging in pandas

Previous: pandas Data Analysis Tutorial (Complete Edition), Part 4
Next: pandas Data Analysis Tutorial (Complete Edition), Part 6

pandas offers two kinds of combining operations:

  • Concatenation: pd.concat, pd.append
  • Merging: pd.merge
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt

0. A review of NumPy concatenation

============================================

Exercise 12:

  1. Generate two 3*3 matrices and concatenate them along each of the two dimensions

============================================

n1 = np.random.randint(0,100, size=(3,3))
n2 = np.random.randint(0,100, size=(3,3))
display(n1,n2)
array([[99, 49, 30],
       [36, 48, 70],
       [18, 90, 77]])



array([[82, 12,  2],
       [40, 46, 99],
       [55,  5, 77]])
np.concatenate((n1,n2), axis=0)
array([[99, 49, 30],
       [36, 48, 70],
       [18, 90, 77],
       [82, 12,  2],
       [40, 46, 99],
       [55,  5, 77]])
np.concatenate((n1,n2), axis=1)
array([[99, 49, 30, 82, 12,  2],
       [36, 48, 70, 40, 46, 99],
       [18, 90, 77, 55,  5, 77]])
np.hstack((n1,n2))
array([[99, 49, 30, 82, 12,  2],
       [36, 48, 70, 40, 46, 99],
       [18, 90, 77, 55,  5, 77]])
np.vstack((n1,n2))
array([[99, 49, 30],
       [36, 48, 70],
       [18, 90, 77],
       [82, 12,  2],
       [40, 46, 99],
       [55,  5, 77]])

To make the examples easier to follow, first define a helper function that generates a DataFrame:

def make_df(index, cols):
    # each cell is the column label followed by the row label, e.g. 'A1'
    df = DataFrame({col: [col + str(i) for i in index] for col in cols})
    df.index = index
    return df
make_df([1,2,3,4], list('ABCD'))
    A   B   C   D
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4

1. Concatenating with pd.concat()

pandas provides the pd.concat function, which is similar to np.concatenate but takes a few extra parameters:

objs
axis=0
join='outer'
join_axes=None
ignore_index=False
keys=[value1, value2, ...]

1) Simple concatenation

df1 = make_df([1,2,3,4], list('ABCD'))
df2 = make_df([1,2,3,4], list('ABCD'))

Like np.concatenate, rows are added by default (axis=0)

# axis=0: rows are added by default
pd.concat((df1,df2))
    A   B   C   D
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4

pd.concat((df1,df2), axis=1, ignore_index=True)
    0   1   2   3   4   5   6   7
1  A1  B1  C1  D1  A1  B1  C1  D1
2  A2  B2  C2  D2  A2  B2  C2  D2
3  A3  B3  C3  D3  A3  B3  C3  D3
4  A4  B4  C4  D4  A4  B4  C4  D4

The concatenation direction can be changed by setting axis.

Concatenation places elements with the same index along that direction into the same row (or column); index/columns may contain duplicates after concatenating.

You can also pass ignore_index=True to discard the original index and renumber.

pd.concat((df1,df2), ignore_index=True)
    A   B   C   D
0  A1  B1  C1  D1
1  A2  B2  C2  D2
2  A3  B3  C3  D3
3  A4  B4  C4  D4
4  A1  B1  C1  D1
5  A2  B2  C2  D2
6  A3  B3  C3  D3
7  A4  B4  C4  D4

Or build a hierarchical (multi-level) index with keys:

concat([x,y], keys=['x','y'])

pd.concat((df1,df2), keys=['df1', 'df2'])
        A   B   C   D
df1 1  A1  B1  C1  D1
    2  A2  B2  C2  D2
    3  A3  B3  C3  D3
    4  A4  B4  C4  D4
df2 1  A1  B1  C1  D1
    2  A2  B2  C2  D2
    3  A3  B3  C3  D3
    4  A4  B4  C4  D4

============================================

Exercise 13:

  1. Think of some use cases for concatenation.

  2. Using yesterday's material, build a midterm score table ddd for 张三 and 李四.

  3. Suppose a new exam subject "计算机" (computer science) is added. How would you add it?

  4. A new student 王老五 takes the exam. How would you add his scores?

============================================

data = np.random.randint(0,150, size=(4,3))
index = ['张三', '李四', '王五', '赵六']
columns = ['语文', '数学', '英语']
df = DataFrame(index=index, data=data, columns=columns)
df
     语文   数学   英语
张三   29  115   36
李四   73  109  100
王五   79  140   71
赵六    9   21   77
# one option: assign a new column of random scores directly
# df['计算机'] = np.random.randint(0, 150, size=4)
data = np.random.randint(0,150, size=(4,1))
index = ['张三', '李四', '王五', '赵六']
columns = ['计算机']
computer = DataFrame(index=index, data=data, columns=columns)
computer 
     计算机
张三  134
李四   69
王五  117
赵六   21
df = pd.concat((df, computer), axis=1)
data = np.random.randint(0,150, size=(1,4))
index = ['老王']
columns = ['语文', '数学', '英语', '计算机']
wang = DataFrame(index=index, data=data, columns=columns)
wang
     语文   数学  英语  计算机
老王   55  134    2  119
pd.concat((df, wang))
     语文   数学   英语  计算机
张三   29  115   36  134
李四   73  109  100   69
王五   79  140   71  117
赵六    9   21   77   21
老王   55  134    2  119

2) Mismatched concatenation

"Mismatched" means the indices along the concatenation dimension differ: the column indices differ in a vertical concatenation, or the row indices differ in a horizontal one.

df1
    A   B   C   D
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
df2 = make_df([2,3,4,5], list('BCDE'))
df2
    B   C   D   E
2  B2  C2  D2  E2
3  B3  C3  D3  E3
4  B4  C4  D4  E4
5  B5  C5  D5  E5
# mismatched concatenation: matching labels are aligned; missing labels are filled with NaN
pd.concat((df1, df2), sort=True)
     A   B   C   D    E
1   A1  B1  C1  D1  NaN
2   A2  B2  C2  D2  NaN
3   A3  B3  C3  D3  NaN
4   A4  B4  C4  D4  NaN
2  NaN  B2  C2  D2   E2
3  NaN  B3  C3  D3   E3
4  NaN  B4  C4  D4   E4
5  NaN  B5  C5  D5   E5
pd.concat((df1, df2), sort=True, axis=1)
     A    B    C    D    B    C    D    E
1   A1   B1   C1   D1  NaN  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   C2   D2   E2
3   A3   B3   C3   D3   B3   C3   D3   E3
4   A4   B4   C4   D4   B4   C4   D4   E4
5  NaN  NaN  NaN  NaN   B5   C5   D5   E5

There are 3 join modes:

  • Outer join: fills NaN (the default)
  • Inner join: keeps only the matching labels
  • join_axes: join along a specified axis

pd.concat((df1, df2), sort=True, axis=1, join='inner') # inner join
    A   B   C   D   B   C   D   E
2  A2  B2  C2  D2  B2  C2  D2  E2
3  A3  B3  C3  D3  B3  C3  D3  E3
4  A4  B4  C4  D4  B4  C4  D4  E4

pd.concat((df1, df2), sort=True, axis=1, join_axes=[df1.index])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
  """Entry point for launching an IPython kernel.
    A   B   C   D    B    C    D    E
1  A1  B1  C1  D1  NaN  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   C2   D2   E2
3  A3  B3  C3  D3   B3   C3   D3   E3
4  A4  B4  C4  D4   B4   C4   D4   E4
pd.concat((df1, df2), sort=True, axis=1, join_axes=[df2.index])
(same FutureWarning about join_axes as above)
     A    B    C    D   B   C   D   E
2   A2   B2   C2   D2  B2  C2  D2  E2
3   A3   B3   C3   D3  B3  C3  D3  E3
4   A4   B4   C4   D4  B4  C4  D4  E4
5  NaN  NaN  NaN  NaN  B5  C5  D5  E5
pd.concat((df1, df2), sort=True, axis=0, join_axes=[df2.columns])
(same FutureWarning about join_axes as above)
    B   C   D    E
1  B1  C1  D1  NaN
2  B2  C2  D2  NaN
3  B3  C3  D3  NaN
4  B4  C4  D4  NaN
2  B2  C2  D2   E2
3  B3  C3  D3   E3
4  B4  C4  D4   E4
5  B5  C5  D5   E5

============================================

Exercise 14:

Suppose the final-exam score table ddd2 has no scores for 张三, only for 李四, 王老五, and 赵小六. Concatenate it with df in several different ways.

============================================

data = np.random.randint(0,150, size=(4,3))
index = ['张三', '李四', '王老五', '赵小六']
columns = ['语文', '数学', '英语']
df = DataFrame(index=index, data=data, columns=columns)
df
      语文   数学   英语
张三    55  111  131
李四   133  122   40
王老五   67   12   33
赵小六  121   40  101
data = np.random.randint(0,150, size=(3,3))
index = ['李四', '王老五', '赵小六']
columns = ['语文', '数学', '英语']
ddd = DataFrame(index=index, data=data, columns=columns)
ddd
      语文   数学   英语
李四    52   23  124
王老五   33  106   36
赵小六   58   59   58
# add rows
pd.concat((df, ddd), keys=['期中', '期末'])
          语文   数学   英语
期中 张三    55  111  131
     李四   133  122   40
     王老五   67   12   33
     赵小六  121   40  101
期末 李四    52   23  124
     王老五   33  106   36
     赵小六   58   59   58
pd.concat((df, ddd), keys=['期中', '期末'], axis=1, sort=True)
       期中              期末
      语文   数学   英语    语文    数学     英语
张三    55  111  131   NaN    NaN    NaN
李四   133  122   40  52.0   23.0  124.0
王老五   67   12   33  33.0  106.0   36.0
赵小六  121   40  101  58.0   59.0   58.0
pd.concat((df, ddd), keys=['期中', '期末'], axis=1, sort=True, join_axes=[ddd.index])
(same FutureWarning about join_axes as above)
       期中              期末
      语文   数学   英语   语文   数学   英语
李四   133  122   40   52   23  124
王老五   67   12   33   33  106   36
赵小六  121   40  101   58   59   58

3) Adding rows with append()

Because appending at the end is so common, there is a dedicated function, append, for adding rows at the end.

Note: concat is a function of the pd module, while append is a method on the DataFrame object.
df.append(ddd, verify_integrity=False)
      语文   数学   英语
张三    55  111  131
李四   133  122   40
王老五   67   12   33
赵小六  121   40  101
李四    52   23  124
王老五   33  106   36
赵小六   58   59   58
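Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so in current pandas the same row-append is written with pd.concat. A small sketch (toy single-row frames, not the ddd above):

```python
import pandas as pd

df = pd.DataFrame({'语文': [55], '数学': [111]}, index=['张三'])
ddd = pd.DataFrame({'语文': [52], '数学': [23]}, index=['李四'])

# df.append(ddd) in old pandas is equivalent to:
result = pd.concat((df, ddd))
```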

============================================

Exercise 15:

Build a final-exam score table ddd3 containing only 张三, 李四, and 王老五, and use append() to concatenate it with the midterm table ddd.

============================================

Summary:
    pandas concatenation: pd.concat((df1, df2, df3, ...), ignore_index=<drop the original index>, keys=<label each piece>, join=<inner/outer join>, join_axes=<join along a specified axis>)
    DataFrame.append(other, ignore_index, verify_integrity, sort)

2. Merging with pd.merge()

merge differs from concat in that merge combines tables based on a shared column.

When merging with pd.merge(), the column whose name appears in both tables is automatically used as the key.

Note that the elements of the key columns do not have to be in the same order.
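That row order does not matter can be checked directly: shuffling one table before merging still pairs rows by key value. A small sketch with made-up id/name/score tables:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [3, 1, 2], 'score': [30, 10, 20]})  # different row order

# merge aligns on the values of the shared 'id' column, not on position
merged = pd.merge(left, right)
```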

1) One-to-one merge

df1 = DataFrame({'name':['张三','李四','Chales'],'id':[1,2,3],'age':[22,21,25]})

df2 = DataFrame({'sex':['男','男','女'],'id':[2,3,4],'group':['sale','search','service']})
display(df1, df2)
     name  id  age
0      张三   1   22
1      李四   2   21
2  Chales   3   25
  sex  id    group
0   男   2     sale
1   男   3   search
2   女   4  service
pd.merge(df1, df2)
     name  id  age sex   group
0      李四   2   21   男    sale
1  Chales   3   25   男  search
df1.merge(df2)
     name  id  age sex   group
0      李四   2   21   男    sale
1  Chales   3   25   男  search

2) Many-to-one merge

df1 = DataFrame({'name':['张三','李四','Chales'],'id':[1,2,2],'age':[22,21,25]})

df2 = DataFrame({'sex':['男','男','女'],'id':[2,3,4],'group':['sale','search','service']})
display(df1,df2)
     name  id  age
0      张三   1   22
1      李四   2   21
2  Chales   2   25
  sex  id    group
0   男   2     sale
1   男   3   search
2   女   4  service
pd.merge(df1,df2)
     name  id  age sex group
0      李四   2   21   男  sale
1  Chales   2   25   男  sale

3) Many-to-many merge

df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})

df2 = DataFrame({'sex':['男','男','女'],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
  name  salary  age
0   张三   10000   22
1   李四   12000   21
2   张三   20000   25
  sex name    group
0   男   张三     sale
1   男   张三   search
2   女   凡凡  service
pd.merge(df1,df2)
  name  salary  age sex   group
0   张三   10000   22   男    sale
1   张三   10000   22   男  search
2   张三   20000   25   男    sale
3   张三   20000   25   男  search

4) Normalizing the key

  • Use on= to state explicitly which column is the key; use it when several columns share the same name
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})

df2 = DataFrame({'age':[21,18,29],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
  name  salary  age
0   张三   10000   22
1   李四   12000   21
2   张三   20000   25
   age name    group
0   21   张三     sale
1   18   张三   search
2   29   凡凡  service
# when several columns share the same name, use on= to pick the merge key
pd.merge(df1, df2, on='name')
  name  salary  age_x  age_y   group
0   张三   10000     22     21    sale
1   张三   10000     22     18  search
2   张三   20000     25     21    sale
3   张三   20000     25     18  search
  • Use left_on and right_on to name the key column on each side; use them when the two key columns have different names
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})

df2 = DataFrame({'年龄':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
  name  salary  age
0   张三   10000   22
1   李四   12000   21
2   张三   20000   25
   年龄  名字    group
0  21   张三     sale
1  18   张三   search
2  29   凡凡  service
pd.merge(df1, df2, left_on='name', right_on='名字')
  name  salary  age  年龄  名字   group
0   张三   10000   22  21   张三    sale
1   张三   10000   22  18   张三  search
2   张三   20000   25  21   张三    sale
3   张三   20000   25  18   张三  search

When a column on the left corresponds to the index on the right, use right_index=True.

df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})

df2 = DataFrame({'年龄':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']},
                index = [22,21,25])
display(df1,df2)
  name  salary  age
0   张三   10000   22
1   李四   12000   21
2   张三   20000   25
    年龄  名字    group
22  21   张三     sale
21  18   张三   search
25  29   凡凡  service
pd.merge(df1, df2, left_on='age', right_index=True)
  name  salary  age  年龄  名字    group
0   张三   10000   22  21   张三     sale
1   李四   12000   21  18   张三   search
2   张三   20000   25  29   凡凡  service
pd.merge(df2, df1, left_index=True, right_on='age')
   年龄  名字    group name  salary  age
0  21   张三     sale   张三   10000   22
1  18   张三   search   李四   12000   21
2  29   凡凡  service   张三   20000   25

============================================

Exercise 16:

  1. Suppose there are two score tables: ddd with 张三, 李四, and 王老五, and ddd4 with 张三 and 赵小六. How would you merge them?

  2. What if 张三's name in ddd4 was mistyped as 张十三?

  3. Practice the many-to-one and many-to-many cases on your own.

  4. Study left_index and right_index on your own.

============================================

5) Inner and outer merges

  • Inner merge: keeps only the keys present in both tables (the default)
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})

df2 = DataFrame({'age':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1, df2)
  name  salary  age
0   张三   10000   22
1   李四   12000   21
2   张三   20000   25
   age  名字    group
0   21   张三     sale
1   18   张三   search
2   29   凡凡  service
# the default is an inner merge
pd.merge(df1, df2, left_on='name', right_on='名字')
  name  salary  age_x  age_y  名字   group
0   张三   10000     22     21   张三    sale
1   张三   10000     22     18   张三  search
2   张三   20000     25     21   张三    sale
3   张三   20000     25     18   张三  search
  • Outer merge, how='outer': fills NaN
pd.merge(df1, df2, left_on='name', right_on='名字', how='outer')
  name   salary  age_x  age_y   名字    group
0   张三  10000.0   22.0   21.0   张三     sale
1   张三  10000.0   22.0   18.0   张三   search
2   张三  20000.0   25.0   21.0   张三     sale
3   张三  20000.0   25.0   18.0   张三   search
4   李四  12000.0   21.0    NaN  NaN      NaN
5  NaN      NaN    NaN   29.0   凡凡  service
  • Left and right merges: how='left', how='right'
pd.merge(df1, df2, left_on='name', right_on='名字', how='left')
  name  salary  age_x  age_y   名字   group
0   张三   10000     22   21.0   张三    sale
1   张三   10000     22   18.0   张三  search
2   李四   12000     21    NaN  NaN     NaN
3   张三   20000     25   21.0   张三    sale
4   张三   20000     25   18.0   张三  search
# right merge
pd.merge(df1, df2, left_on='name', right_on='名字', how='right')
  name   salary  age_x  age_y   名字    group
0   张三  10000.0   22.0     21   张三     sale
1   张三  20000.0   25.0     21   张三     sale
2   张三  10000.0   22.0     18   张三   search
3   张三  20000.0   25.0     18   张三   search
4  NaN      NaN    NaN     29   凡凡  service

============================================

Exercise 17:

  1. Think about realistic scenarios, and merge ddd with ddd4 in several different ways.

============================================

6) Resolving column conflicts

When columns conflict, i.e. several columns share the same name, use on= to specify which column serves as the key, and use suffixes to rename the conflicting columns.

# midterm
df1 = DataFrame({'name':['张三','李四','张三'],'degree':[120,118,149],'age':[22,21,25]})

# final exam
df2 = DataFrame({'degree':[99,97,129],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1, df2)
  name  degree  age
0   张三     120   22
1   李四     118   21
2   张三     149   25
   degree name    group
0      99   张三     sale
1      97   张三   search
2     129   凡凡  service
# suffixes are appended to the conflicting column names (cf. prefix)
pd.merge(df1, df2, on='name', suffixes=['_df1', '_df2'])
  name  degree_df1  age  degree_df2   group
0   张三         120   22          99    sale
1   张三         120   22          97  search
2   张三         149   25          99    sale
3   张三         149   25          97  search

You can choose your own suffixes with suffixes=.

============================================

Exercise 18:

Suppose two different students are both named 李四, and ddd5 and ddd6 are both score tables for 张三 and 李四. How would you merge them?

============================================

pd.merge(how=<merge mode: inner/outer/left/right>, 
         on=<the column to merge on>, 
         left_on=<the merge column of the left table>, 
         right_on=<the merge column of the right table>, 
         left_index/right_index=<merge on the index instead of a column>, 
         suffixes=<suffixes for identically named columns>)
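These parameters can be combined in a single call; the sketch below uses made-up two-column tables and an outer merge, so unmatched rows from both sides are kept with NaN fill:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['张三', '李四'], 'salary': [10000, 12000]})
df2 = pd.DataFrame({'名字': ['张三', '凡凡'], 'group': ['sale', 'service']})

# how='outer' keeps unmatched rows from both sides;
# left_on/right_on name the key column on each side
result = pd.merge(df1, df2, left_on='name', right_on='名字', how='outer')
```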
Worked answers to Exercise 16: suppose there are two score tables, ddd with 张三, 李四, and 王老五, and ddd4 with 张三 and 赵小六. How do we merge them? And what if 张三's name in ddd4 was mistyped as 张十三?
# use the students' names as column labels
index = ['语文', '数学', '英语']
columns = ['张三', '李四', '王老五']
data = np.random.randint(0,150, size=(3,3))
ddd = DataFrame(data=data, index=index, columns=columns)
ddd
     张三   李四  王老五
语文   97   99   63
数学  116  110   14
英语    6  108   34
index = ['语文', '数学', '英语']
columns = ['张三', '赵小六']
data = np.random.randint(0,150, size=(3,2))
ddd4 = DataFrame(data=data, index=index, columns=columns)
ddd4
     张三  赵小六
语文   57   82
数学  139   47
英语  120   77
ddd4['张三'] = ddd['张三']
ddd4
     张三  赵小六
语文   97   82
数学  116   47
英语    6   77
# merge on 张三, keeping the left table's index
result = pd.merge(ddd, ddd4, on='张三', left_index=True)
result
     张三   李四  王老五  赵小六
语文   97   99   63   82
数学  116  110   14   47
英语    6  108   34   77
ddd4.rename({'张三': '张十三'}, axis=1, inplace=True)
ddd4
     张十三  赵小六
语文    97   82
数学   116   47
英语     6   77
result = pd.merge(ddd, ddd4, left_on='张三', right_on='张十三', left_index=True)
result.drop(columns='张十三', inplace=True)
result
     张三   李四  王老五  赵小六
语文   97   99   63   82
数学  116  110   14   47
英语    6  108   34   77

Homework

3. Case study: analyzing US state population data

Supplementary knowledge for the homework

# unique(): deduplication
s = Series(['Tom','Lucy','Tom','dancer','Lucy'])
s.unique()
array(['Tom', 'Lucy', 'dancer'], dtype=object)
n = DataFrame({'name':['Tom','Lucy','Tom','dancer','Lucy'],'age':[12,13,12,11,15]})
n
     name  age
0     Tom   12
1    Lucy   13
2     Tom   12
3  dancer   11
4    Lucy   15
# select the rows where the name is Lucy and the age is greater than 13
n.query('name=="Lucy" & age>13')
   name  age
4  Lucy   15

First load the files and look at a sample of the data.

abb = pd.read_csv('../data/state-abbrevs.csv')
abb.head()
        state abbreviation
0     Alabama           AL
1      Alaska           AK
2     Arizona           AZ
3    Arkansas           AR
4  California           CA
pop = pd.read_csv('../data/state-population.csv')
pop.head()
  state/region     ages  year  population
0           AL  under18  2012   1117489.0
1           AL    total  2012   4817528.0
2           AL  under18  2010   1130966.0
3           AL    total  2010   4785570.0
4           AL  under18  2011   1125763.0
areas = pd.read_csv('../data/state-areas.csv')
areas.head()
        state  area (sq. mi)
0     Alabama          52423
1      Alaska         656425
2     Arizona         114006
3    Arkansas          53182
4  California         163707

Merge the pop and abbrevs DataFrames, joining the state/region column of one to the abbreviation column of the other.

Use an outer merge so that no information is lost.

abb_pop = pd.merge(abb, pop, left_on='abbreviation', right_on='state/region', how='outer')

Drop the abbreviation column (axis=1).

abb_pop.drop(columns='abbreviation', inplace=True)

Look for columns with missing data.

Use .isnull().any(): it reports True for a column as soon as that column contains even one missing value.
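The axis argument controls the direction: any() over the default axis reports per column, while any(axis=1) reports per row. A two-line sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan], 'b': [3, 4]})

by_col = df.isnull().any()        # which columns contain NaN
by_row = df.isnull().any(axis=1)  # which rows contain NaN
```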

abb_pop.isnull().any()
state            True
state/region    False
ages            False
year            False
population       True
dtype: bool

View the missing data.

abb_pop
        state state/region     ages  year   population
0     Alabama           AL  under18  2012    1117489.0
1     Alabama           AL    total  2012    4817528.0
2     Alabama           AL  under18  2010    1130966.0
3     Alabama           AL    total  2010    4785570.0
4     Alabama           AL  under18  2011    1125763.0
...       ...          ...      ...   ...          ...
2539      NaN          USA    total  2010  309326295.0
2540      NaN          USA  under18  2011   73902222.0
2541      NaN          USA    total  2011  311582564.0
2542      NaN          USA  under18  2012   73708179.0
2543      NaN          USA    total  2012  313873685.0

2544 rows × 5 columns

abb_pop.isnull().any(axis=1)
0       False
1       False
2       False
3       False
4       False
        ...  
2539     True
2540     True
2541     True
2542     True
2543     True
Length: 2544, dtype: bool
abb_pop[abb_pop.isnull().any(axis=1)]
     state state/region     ages  year   population
2448   NaN           PR  under18  1990          NaN
2449   NaN           PR    total  1990          NaN
2450   NaN           PR    total  1991          NaN
2451   NaN           PR  under18  1991          NaN
2452   NaN           PR    total  1993          NaN
...    ...          ...      ...   ...          ...
2539   NaN          USA    total  2010  309326295.0
2540   NaN          USA  under18  2011   73902222.0
2541   NaN          USA    total  2011  311582564.0
2542   NaN          USA  under18  2012   73708179.0
2543   NaN          USA    total  2012  313873685.0

96 rows × 5 columns

Display rows according to whether data is missing: a row is shown when its mask value is True.

Find which state/region values leave state as NaN; use unique() to list the distinct values.

abb_pop[abb_pop.isnull().any(axis=1)]['state/region'].unique()
array(['PR', 'USA'], dtype=object)

Fill in the correct state value for these state/region entries, thereby removing all the NaN values from the state column!

Remember this way of clearing out missing NaN data!
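The chained-indexing attempts below trigger SettingWithCopyWarning and do not actually modify abb_pop; a single .loc assignment on the original frame avoids both problems. A minimal sketch on a small stand-in frame (the real abb_pop comes from the merge above):

```python
import numpy as np
import pandas as pd

# toy stand-in for abb_pop, with the same two columns that matter here
abb_pop = pd.DataFrame({'state': [np.nan, np.nan, 'Alabama'],
                        'state/region': ['PR', 'USA', 'AL']})

# one .loc call on the original frame: no intermediate copy, no warning
abb_pop.loc[abb_pop['state/region'] == 'PR', 'state'] = 'Puerto Rico'
abb_pop.loc[abb_pop['state/region'] == 'USA', 'state'] = 'United States'
```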

cond = abb_pop[abb_pop.isnull().any(axis=1)]['state/region'] == 'PR'
abb_pop[abb_pop.isnull().any(axis=1)]['state'][cond]= 'Puerto Rico'
d:\1903\.venv\lib\site-packages\pandas\core\generic.py:9116: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
d:\1903\.venv\lib\site-packages\IPython\core\interactiveshell.py:3326: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exec(code_obj, self.user_global_ns, self.user_ns)

abb_pop[abb_pop.isnull().any(axis=1)][cond]
     state state/region     ages  year  population
2448   NaN           PR  under18  1990         NaN
2449   NaN           PR    total  1990         NaN
2450   NaN           PR    total  1991         NaN
...    ...          ...      ...   ...         ...
2493   NaN           PR    total  2011   3686580.0
2494   NaN           PR  under18  2012    841740.0
2495   NaN           PR    total  2012   3651545.0
abb_pop[abb_pop.isnull().any(axis=1)].loc[cond, 'state'] = 'Puerto Rico'
d:\1903\.venv\lib\site-packages\pandas\core\indexing.py:480: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
abb_pop[abb_pop.isnull().any(axis=1)][cond]
     state state/region     ages  year  population
2448   NaN           PR  under18  1990         NaN
2449   NaN           PR    total  1990         NaN
2450   NaN           PR    total  1991         NaN
...    ...          ...      ...   ...         ...
2493   NaN           PR    total  2011   3686580.0
2494   NaN           PR  under18  2012    841740.0
2495   NaN           PR    total  2012   3651545.0
temp = abb_pop[abb_pop.isnull().any(axis=1)].copy()
temp.loc[cond, 'state'] = 'Puerto Rico'
temp
            state state/region     ages  year   population
2448  Puerto Rico           PR  under18  1990          NaN
2449  Puerto Rico           PR    total  1990          NaN
2450  Puerto Rico           PR    total  1991          NaN
...           ...          ...      ...   ...          ...
2541          NaN          USA    total  2011  311582564.0
2542          NaN          USA  under18  2012   73708179.0
2543          NaN          USA    total  2012  313873685.0

96 rows × 5 columns

cond = abb_pop[abb_pop.isnull().any(axis=1)]['state/region'] == 'USA'
temp.loc[cond, 'state'] = 'United State of American'
temp
                         state state/region     ages  year   population
2448               Puerto Rico           PR  under18  1990          NaN
2449               Puerto Rico           PR    total  1990          NaN
2450               Puerto Rico           PR    total  1991          NaN
...                        ...          ...      ...   ...          ...
2541  United State of American          USA    total  2011  311582564.0
2542  United State of American          USA  under18  2012   73708179.0
2543  United State of American          USA    total  2012  313873685.0

96 rows × 5 columns

abb_pop[abb_pop.isnull().any(axis=1)] = temp
abb_pop[abb_pop.isnull().any(axis=1)] 
            state state/region     ages  year  population
2448  Puerto Rico           PR  under18  1990         NaN
2449  Puerto Rico           PR    total  1990         NaN
2450  Puerto Rico           PR    total  1991         NaN
...           ...          ...      ...   ...         ...
2465  Puerto Rico           PR  under18  1997         NaN
2466  Puerto Rico           PR    total  1999         NaN
2467  Puerto Rico           PR  under18  1999         NaN

Merge in the state area data areas, using an outer merge.

Think about why an outer merge is used here.

abb_pop.head()
     state state/region     ages  year  population
0  Alabama           AL  under18  2012   1117489.0
1  Alabama           AL    total  2012   4817528.0
2  Alabama           AL  under18  2010   1130966.0
3  Alabama           AL    total  2010   4785570.0
4  Alabama           AL  under18  2011   1125763.0
areas.head()
        state  area (sq. mi)
0     Alabama          52423
1      Alaska         656425
2     Arizona         114006
3    Arkansas          53182
4  California         163707
abb_pop_areas = abb_pop.merge(areas, how='outer')
abb_pop_areas.head()
     state state/region     ages  year  population  area (sq. mi)
0  Alabama           AL  under18  2012   1117489.0        52423.0
1  Alabama           AL    total  2012   4817528.0        52423.0
2  Alabama           AL  under18  2010   1130966.0        52423.0
3  Alabama           AL    total  2010   4785570.0        52423.0
4  Alabama           AL  under18  2011   1125763.0        52423.0

Keep looking for columns with missing data.

abb_pop_areas.isnull().any()
state            False
state/region     False
ages             False
year             False
population        True
area (sq. mi)     True
dtype: bool

We find that the area (sq. mi) column has missing data; to locate the rows, we need to find which state has no area data.

abb_pop_areas[abb_pop_areas.isnull().any(axis=1)]
                         state state/region     ages  year   population  area (sq. mi)
2448               Puerto Rico           PR  under18  1990          NaN         3515.0
2449               Puerto Rico           PR    total  1990          NaN         3515.0
2450               Puerto Rico           PR    total  1991          NaN         3515.0
...                        ...          ...      ...   ...          ...            ...
2541  United State of American          USA    total  2011  311582564.0            NaN
2542  United State of American          USA  under18  2012   73708179.0            NaN
2543  United State of American          USA    total  2012  313873685.0            NaN

68 rows × 6 columns

Drop the rows that contain missing data.

abb_pop_areas.dropna(axis=0, how='any', inplace=True)

Check whether any data is still missing.

abb_pop_areas.isnull().any()
state            False
state/region     False
ages             False
year             False
population       False
area (sq. mi)    False
dtype: bool
abb_pop_areas.head()
     state state/region     ages  year  population  area (sq. mi)
0  Alabama           AL  under18  2012   1117489.0        52423.0
1  Alabama           AL    total  2012   4817528.0        52423.0
2  Alabama           AL  under18  2010   1130966.0        52423.0
3  Alabama           AL    total  2010   4785570.0        52423.0
4  Alabama           AL  under18  2011   1125763.0        52423.0
abb_pop_areas.dtypes
state             object
state/region      object
ages              object
year               int64
population       float64
area (sq. mi)    float64
dtype: object

Find the total population data for 2010, using df.query(query string).

result = abb_pop_areas.query('ages=="total" & year==2010')

Process the query result: make the state column the new row index, with set_index.

result.set_index(keys=['state'], inplace=True)
result.head()
           state/region   ages  year  population  area (sq. mi)
state
Alabama              AL  total  2010   4785570.0        52423.0
Alaska               AK  total  2010    713868.0       656425.0
Arizona              AZ  total  2010   6408790.0       114006.0
Arkansas             AR  total  2010   2922280.0        53182.0
California           CA  total  2010  37333601.0       163707.0

Compute the population density. Note that this is a Series divided by a Series, and the result is again a Series.

density = result['population'] / result['area (sq. mi)']
density.head()
state
Alabama        91.287603
Alaska          1.087509
Arizona        56.214497
Arkansas       54.948667
California    228.051342
dtype: float64

Sort, and find the five states with the highest population density, using sort_values().

density.sort_values(inplace=True)
density.tail()
state
Connecticut              645.600649
Rhode Island             681.339159
New Jersey              1009.253268
Puerto Rico             1058.665149
District of Columbia    8898.897059
dtype: float64

Find the five states with the lowest population density.

density.head()
state
Alaska           1.087509
Wyoming          5.768079
Montana          6.736171
North Dakota     9.537565
South Dakota    10.583512
dtype: float64

Key takeaways:

  • Index consistently with .loc
  • Use .isnull().any() to find the columns that contain NaN
  • Use .unique() to determine which keys in a column matter
  • Generally use outer or left merges, for one reason: better to have NaN in one column than to throw away the information in the other columns

Recap: how Series/DataFrame arithmetic differs from ndarray arithmetic

  • Series and DataFrame do not broadcast; where an index label has no counterpart, the result is NaN, or you can use add with fill_value to fill in the missing side
  • ndarray broadcasts, repeating existing values in the computation
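The difference shows up in two lines: under plain +, labels present on only one side become NaN, while add(..., fill_value=0) treats the missing side as 0:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])

plain = s1 + s2                    # 'a' and 'c' become NaN
filled = s1.add(s2, fill_value=0)  # missing side treated as 0
```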