Concatenation and Merging in pandas
pandas offers two kinds of combining operations:
- Concatenation: pd.concat, pd.append
- Merging: pd.merge
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
0. Review: concatenation in NumPy
============================================
Exercise 12:
- Generate two 3×3 matrices and concatenate them along each of the two axes
============================================
n1 = np.random.randint(0,100, size=(3,3))
n2 = np.random.randint(0,100, size=(3,3))
display(n1,n2)
array([[99, 49, 30],
[36, 48, 70],
[18, 90, 77]])
array([[82, 12, 2],
[40, 46, 99],
[55, 5, 77]])
np.concatenate((n1,n2), axis=0)
array([[99, 49, 30],
[36, 48, 70],
[18, 90, 77],
[82, 12, 2],
[40, 46, 99],
[55, 5, 77]])
np.concatenate((n1,n2), axis=1)
array([[99, 49, 30, 82, 12, 2],
[36, 48, 70, 40, 46, 99],
[18, 90, 77, 55, 5, 77]])
np.hstack((n1,n2))
array([[99, 49, 30, 82, 12, 2],
[36, 48, 70, 40, 46, 99],
[18, 90, 77, 55, 5, 77]])
np.vstack((n1,n2))
array([[99, 49, 30],
[36, 48, 70],
[18, 90, 77],
[82, 12, 2],
[40, 46, 99],
[55, 5, 77]])
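As the outputs above show, hstack and vstack produce exactly what concatenate gives along axis 1 and axis 0 respectively; a quick check:

```python
import numpy as np

n1 = np.random.randint(0, 100, size=(3, 3))
n2 = np.random.randint(0, 100, size=(3, 3))

# hstack is concatenate along axis=1; vstack is concatenate along axis=0
assert np.array_equal(np.hstack((n1, n2)), np.concatenate((n1, n2), axis=1))
assert np.array_equal(np.vstack((n1, n2)), np.concatenate((n1, n2), axis=0))
```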
To make the examples easier to follow, we first define a helper function that generates a DataFrame:
def make_df(index, cols):
    df = DataFrame({col: [col + str(i) for i in index] for col in cols})
    df.index = index
    return df
make_df([1,2,3,4], list('ABCD'))
A | B | C | D | |
---|---|---|---|---|
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
1. Concatenation with pd.concat()
pandas provides the pd.concat function, which works like np.concatenate but takes a few extra parameters:
objs
axis=0
join='outer'
join_axes=None
ignore_index=False
keys = [value1,value2...]
1) Simple concatenation
df1 = make_df([1,2,3,4], list('ABCD'))
df2 = make_df([1,2,3,4], list('ABCD'))
Like np.concatenate, rows are stacked by default (axis=0).
# axis=0 (default): stack rows
pd.concat((df1,df2))
A | B | C | D | |
---|---|---|---|---|
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
pd.concat((df1,df2), axis=1, ignore_index=True)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | D1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 | A4 | B4 | C4 | D4 |
The concatenation direction can be changed by setting axis.
Concatenation places elements that share an index label along that direction in the same row (or column); index/columns labels may repeat in the result.
Alternatively, pass ignore_index=True to discard the original labels and renumber.
pd.concat((df1,df2), ignore_index=True)
A | B | C | D | |
---|---|---|---|---|
0 | A1 | B1 | C1 | D1 |
1 | A2 | B2 | C2 | D2 |
2 | A3 | B3 | C3 | D3 |
3 | A4 | B4 | C4 | D4 |
4 | A1 | B1 | C1 | D1 |
5 | A2 | B2 | C2 | D2 |
6 | A3 | B3 | C3 | D3 |
7 | A4 | B4 | C4 | D4 |
Or build a hierarchical index with keys:
pd.concat([x, y], keys=['x', 'y'])
pd.concat((df1,df2), keys=['df1', 'df2'])
A | B | C | D | ||
---|---|---|---|---|---|
df1 | 1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 | |
3 | A3 | B3 | C3 | D3 | |
4 | A4 | B4 | C4 | D4 | |
df2 | 1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 | |
3 | A3 | B3 | C3 | D3 | |
4 | A4 | B4 | C4 | D4 |
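A practical payoff of keys= is that the resulting MultiIndex lets you recover each source frame with .loc; a small sketch using the make_df helper defined above:

```python
import pandas as pd
from pandas import DataFrame

def make_df(index, cols):
    df = DataFrame({col: [col + str(i) for i in index] for col in cols})
    df.index = index
    return df

df1 = make_df([1, 2, 3, 4], list('ABCD'))
df2 = make_df([1, 2, 3, 4], list('ABCD'))

# keys= builds a hierarchical (MultiIndex) row index;
# selecting on the outer level recovers the original frame
result = pd.concat((df1, df2), keys=['df1', 'df2'])
sub = result.loc['df1']
```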
============================================
Exercise 13:
- Think of some use cases for concatenation.
- Using yesterday's material, build a midterm grade table ddd for 张三 and 李四.
- Suppose a new exam subject, 计算机 (computer science), is added; how would you implement that?
- A new student, 王老五, takes the exam; how would you add his scores?
============================================
data = np.random.randint(0,150, size=(4,3))
index = ['张三', '李四', '王五', '赵六']
columns = ['语文', '数学', '英语']
df = DataFrame(index=index, data=data, columns=columns)
df
语文 | 数学 | 英语 | |
---|---|---|---|
张三 | 29 | 115 | 36 |
李四 | 73 | 109 | 100 |
王五 | 79 | 140 | 71 |
赵六 | 9 | 21 | 77 |
# Option 1: assign a new column directly, e.g. df['计算机'] = np.random.randint(0, 150, size=4)
# Option 2 (used below): build a separate DataFrame and concatenate along axis=1
data = np.random.randint(0,150, size=(4,1))
index = ['张三', '李四', '王五', '赵六']
columns = ['计算机']
computer = DataFrame(index=index, data=data, columns=columns)
computer
计算机 | |
---|---|
张三 | 134 |
李四 | 69 |
王五 | 117 |
赵六 | 21 |
df = pd.concat((df, computer), axis=1)
data = np.random.randint(0,150, size=(1,4))
index = ['老王']
columns = ['语文', '数学', '英语', '计算机']
wang = DataFrame(index=index, data=data, columns=columns)
wang
语文 | 数学 | 英语 | 计算机 | |
---|---|---|---|---|
老王 | 55 | 134 | 21 | 19 |
pd.concat((df, wang))
语文 | 数学 | 英语 | 计算机 | |
---|---|---|---|---|
张三 | 29 | 115 | 36 | 134 |
李四 | 73 | 109 | 100 | 69 |
王五 | 79 | 140 | 71 | 117 |
赵六 | 9 | 21 | 77 | 21 |
老王 | 55 | 134 | 21 | 19 |
2) Concatenation with mismatched indexes
"Mismatched" means the index labels along the concatenated dimension differ: the column labels differ when concatenating vertically, or the row labels differ when concatenating horizontally.
df1
A | B | C | D | |
---|---|---|---|---|
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
df2 = make_df([2,3,4,5], list('BCDE'))
df2
B | C | D | E | |
---|---|---|---|---|
2 | B2 | C2 | D2 | E2 |
3 | B3 | C3 | D3 | E3 |
4 | B4 | C4 | D4 | E4 |
5 | B5 | C5 | D5 | E5 |
# Mismatched concatenation: matching labels are aligned; missing entries are filled with NaN
pd.concat((df1, df2), sort=True)
A | B | C | D | E | |
---|---|---|---|---|---|
1 | A1 | B1 | C1 | D1 | NaN |
2 | A2 | B2 | C2 | D2 | NaN |
3 | A3 | B3 | C3 | D3 | NaN |
4 | A4 | B4 | C4 | D4 | NaN |
2 | NaN | B2 | C2 | D2 | E2 |
3 | NaN | B3 | C3 | D3 | E3 |
4 | NaN | B4 | C4 | D4 | E4 |
5 | NaN | B5 | C5 | D5 | E5 |
pd.concat((df1, df2), sort=True, axis=1)
A | B | C | D | B | C | D | E | |
---|---|---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | B2 | C2 | D2 | E2 |
3 | A3 | B3 | C3 | D3 | B3 | C3 | D3 | E3 |
4 | A4 | B4 | C4 | D4 | B4 | C4 | D4 | E4 |
5 | NaN | NaN | NaN | NaN | B5 | C5 | D5 | E5 |
There are three join modes:
- Outer join: fill missing entries with NaN (the default)
pd.concat((df1, df2), sort=True, axis=1, join='inner')  # inner join
A | B | C | D | B | C | D | E | |
---|---|---|---|---|---|---|---|---|
2 | A2 | B2 | C2 | D2 | B2 | C2 | D2 | E2 |
3 | A3 | B3 | C3 | D3 | B3 | C3 | D3 | E3 |
4 | A4 | B4 | C4 | D4 | B4 | C4 | D4 | E4 |
- Inner join: keep only the matching labels
- Join along a specified axis: join_axes
pd.concat((df1, df2), sort=True, axis=1, join_axes=[df1.index])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
"""Entry point for launching an IPython kernel.
A | B | C | D | B | C | D | E | |
---|---|---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | B2 | C2 | D2 | E2 |
3 | A3 | B3 | C3 | D3 | B3 | C3 | D3 | E3 |
4 | A4 | B4 | C4 | D4 | B4 | C4 | D4 | E4 |
pd.concat((df1, df2), sort=True, axis=1, join_axes=[df2.index])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
"""Entry point for launching an IPython kernel.
A | B | C | D | B | C | D | E | |
---|---|---|---|---|---|---|---|---|
2 | A2 | B2 | C2 | D2 | B2 | C2 | D2 | E2 |
3 | A3 | B3 | C3 | D3 | B3 | C3 | D3 | E3 |
4 | A4 | B4 | C4 | D4 | B4 | C4 | D4 | E4 |
5 | NaN | NaN | NaN | NaN | B5 | C5 | D5 | E5 |
pd.concat((df1, df2), sort=True, axis=0, join_axes=[df2.columns])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
"""Entry point for launching an IPython kernel.
B | C | D | E | |
---|---|---|---|---|
1 | B1 | C1 | D1 | NaN |
2 | B2 | C2 | D2 | NaN |
3 | B3 | C3 | D3 | NaN |
4 | B4 | C4 | D4 | NaN |
2 | B2 | C2 | D2 | E2 |
3 | B3 | C3 | D3 | E3 |
4 | B4 | C4 | D4 | E4 |
5 | B5 | C5 | D5 | E5 |
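Since join_axes is deprecated (as the FutureWarning above notes), the same result can be obtained with .reindex on the concatenated frame; a sketch using the make_df helper from above:

```python
import pandas as pd
from pandas import DataFrame

def make_df(index, cols):
    df = DataFrame({col: [col + str(i) for i in index] for col in cols})
    df.index = index
    return df

df1 = make_df([1, 2, 3, 4], list('ABCD'))
df2 = make_df([2, 3, 4, 5], list('BCDE'))

# Equivalent of pd.concat(..., axis=1, join_axes=[df1.index]):
# concatenate with the default outer join, then restrict rows to df1's index
result = pd.concat((df1, df2), axis=1, sort=True).reindex(df1.index)
```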
============================================
Exercise 14:
Suppose the final-exam table ddd2 has no scores for 张三, only for 李四, 王老五, and 赵小六; concatenate it with df using several methods.
============================================
data = np.random.randint(0,150, size=(4,3))
index = ['张三', '李四', '王老五', '赵小六']
columns = ['语文', '数学', '英语']
df = DataFrame(index=index, data=data, columns=columns)
df
语文 | 数学 | 英语 | |
---|---|---|---|
张三 | 55 | 111 | 131 |
李四 | 133 | 122 | 40 |
王老五 | 67 | 12 | 33 |
赵小六 | 12 | 140 | 101 |
data = np.random.randint(0,150, size=(3,3))
index = ['李四', '王老五', '赵小六']
columns = ['语文', '数学', '英语']
ddd = DataFrame(index=index, data=data, columns=columns)
ddd
语文 | 数学 | 英语 | |
---|---|---|---|
李四 | 52 | 23 | 124 |
王老五 | 33 | 106 | 36 |
赵小六 | 58 | 59 | 58 |
# stack rows
pd.concat((df, ddd), keys=['期中', '期末'])
语文 | 数学 | 英语 | ||
---|---|---|---|---|
期中 | 张三 | 55 | 111 | 131 |
李四 | 133 | 122 | 40 | |
王老五 | 67 | 12 | 33 | |
赵小六 | 12 | 140 | 101 | |
期末 | 李四 | 52 | 23 | 124 |
王老五 | 33 | 106 | 36 | |
赵小六 | 58 | 59 | 58 |
pd.concat((df, ddd), keys=['期中', '期末'], axis=1, sort=True)
期中 | 期末 | |||||
---|---|---|---|---|---|---|
语文 | 数学 | 英语 | 语文 | 数学 | 英语 | |
张三 | 55 | 111 | 131 | NaN | NaN | NaN |
李四 | 133 | 122 | 40 | 52.0 | 23.0 | 124.0 |
王老五 | 67 | 12 | 33 | 33.0 | 106.0 | 36.0 |
赵小六 | 12 | 140 | 101 | 58.0 | 59.0 | 58.0 |
pd.concat((df, ddd), keys=['期中', '期末'], axis=1, sort=True, join_axes=[ddd.index])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
"""Entry point for launching an IPython kernel.
期中 | 期末 | |||||
---|---|---|---|---|---|---|
语文 | 数学 | 英语 | 语文 | 数学 | 英语 | |
李四 | 133 | 122 | 40 | 52 | 23 | 124 |
王老五 | 67 | 12 | 33 | 33 | 106 | 36 |
赵小六 | 12 | 140 | 101 | 58 | 59 | 58 |
3) Appending with append()
Appending at the end is so common that there is a dedicated function, append, for it.
Note: concat is a function in the pd namespace, while append is a method on the DataFrame object.
df.append(ddd, verify_integrity=False)
语文 | 数学 | 英语 | |
---|---|---|---|
张三 | 55 | 111 | 131 |
李四 | 133 | 122 | 40 |
王老五 | 67 | 12 | 33 |
赵小六 | 12 | 140 | 101 |
李四 | 52 | 23 | 124 |
王老五 | 33 | 106 | 36 |
赵小六 | 58 | 59 | 58 |
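Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; in current versions the same row-wise append is written with pd.concat. A minimal sketch:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame(np.arange(6).reshape(2, 3), columns=list('ABC'))
extra = DataFrame(np.arange(6, 12).reshape(2, 3), columns=list('ABC'))

# old style (removed in pandas 2.0): df.append(extra, ignore_index=True)
# current equivalent:
result = pd.concat((df, extra), ignore_index=True)
```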
============================================
Exercise 15:
Create a final-exam grade table ddd3 containing only 张三, 李四, and 王老五, and use append() to concatenate it with the midterm table ddd.
============================================
Summary:
pandas concatenation: pd.concat((df1, df2, df3, ...), ignore_index=<discard the original index>, keys=<label each source frame>, join=<'inner' or 'outer'>, join_axes=<restrict the result to a given axis (deprecated)>)
DataFrame.append(other, ignore_index, verify_integrity, sort)
2. Merging with pd.merge()
merge differs from concat in that it combines frames based on a shared column.
pd.merge() automatically uses the column whose name appears in both frames as the key.
Note that the values in the key column need not appear in the same order.
1) One-to-one merge
df1 = DataFrame({'name':['张三','李四','Chales'],'id':[1,2,3],'age':[22,21,25]})
df2 = DataFrame({'sex':['男','男','女'],'id':[2,3,4],'group':['sale','search','service']})
display(df1, df2)
name | id | age | |
---|---|---|---|
0 | 张三 | 1 | 22 |
1 | 李四 | 2 | 21 |
2 | Chales | 3 | 25 |
sex | id | group | |
---|---|---|---|
0 | 男 | 2 | sale |
1 | 男 | 3 | search |
2 | 女 | 4 | service |
pd.merge(df1, df2)
name | id | age | sex | group | |
---|---|---|---|---|---|
0 | 李四 | 2 | 21 | 男 | sale |
1 | Chales | 3 | 25 | 男 | search |
df1.merge(df2)
name | id | age | sex | group | |
---|---|---|---|---|---|
0 | 李四 | 2 | 21 | 男 | sale |
1 | Chales | 3 | 25 | 男 | search |
2) Many-to-one merge
df1 = DataFrame({'name':['张三','李四','Chales'],'id':[1,2,2],'age':[22,21,25]})
df2 = DataFrame({'sex':['男','男','女'],'id':[2,3,4],'group':['sale','search','service']})
display(df1,df2)
name | id | age | |
---|---|---|---|
0 | 张三 | 1 | 22 |
1 | 李四 | 2 | 21 |
2 | Chales | 2 | 25 |
sex | id | group | |
---|---|---|---|
0 | 男 | 2 | sale |
1 | 男 | 3 | search |
2 | 女 | 4 | service |
pd.merge(df1,df2)
name | id | age | sex | group | |
---|---|---|---|---|---|
0 | 李四 | 2 | 21 | 男 | sale |
1 | Chales | 2 | 25 | 男 | sale |
3) Many-to-many merge
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'sex':['男','男','女'],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
sex | name | group | |
---|---|---|---|
0 | 男 | 张三 | sale |
1 | 男 | 张三 | search |
2 | 女 | 凡凡 | service |
pd.merge(df1,df2)
name | salary | age | sex | group | |
---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 男 | sale |
1 | 张三 | 10000 | 22 | 男 | search |
2 | 张三 | 20000 | 25 | 男 | sale |
3 | 张三 | 20000 | 25 | 男 | search |
4) Specifying the key
- Use on= to state explicitly which column is the key; needed when the frames share more than one column name
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'age':[21,18,29],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
age | name | group | |
---|---|---|---|
0 | 21 | 张三 | sale |
1 | 18 | 张三 | search |
2 | 29 | 凡凡 | service |
# When several column names are shared, use on= to choose the merge key
pd.merge(df1, df2, on='name')
name | salary | age_x | age_y | group | |
---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21 | sale |
1 | 张三 | 10000 | 22 | 18 | search |
2 | 张三 | 20000 | 25 | 21 | sale |
3 | 张三 | 20000 | 25 | 18 | search |
- Use left_on and right_on to name the key column on each side; needed when the two frames share no column name
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'年龄':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
年龄 | 名字 | group | |
---|---|---|---|
0 | 21 | 张三 | sale |
1 | 18 | 张三 | search |
2 | 29 | 凡凡 | service |
pd.merge(df1, df2, left_on='name', right_on='名字')
name | salary | age | 年龄 | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21 | 张三 | sale |
1 | 张三 | 10000 | 22 | 18 | 张三 | search |
2 | 张三 | 20000 | 25 | 21 | 张三 | sale |
3 | 张三 | 20000 | 25 | 18 | 张三 | search |
When a column on the left matches the index on the right, use right_index=True.
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'年龄':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']},
index = [22,21,25])
display(df1,df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
年龄 | 名字 | group | |
---|---|---|---|
22 | 21 | 张三 | sale |
21 | 18 | 张三 | search |
25 | 29 | 凡凡 | service |
pd.merge(df1, df2, left_on='age', right_index=True)
name | salary | age | 年龄 | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21 | 张三 | sale |
1 | 李四 | 12000 | 21 | 18 | 张三 | search |
2 | 张三 | 20000 | 25 | 29 | 凡凡 | service |
pd.merge(df2, df1, left_index=True, right_on='age')
年龄 | 名字 | group | name | salary | age | |
---|---|---|---|---|---|---|
0 | 21 | 张三 | sale | 张三 | 10000 | 22 |
1 | 18 | 张三 | search | 李四 | 12000 | 21 |
2 | 29 | 凡凡 | service | 张三 | 20000 | 25 |
============================================
Exercise 16:
- Suppose there are two grade tables: ddd for 张三, 李四, and 王老五, and ddd4 for 张三 and 赵小六. How would you merge them?
- What if 张三's name in ddd4 had been mistyped as 张十三?
- Practice the many-to-one and many-to-many cases on your own.
- Study left_index and right_index on your own.
============================================
5) Inner and outer merges
- Inner merge: keep only the keys present in both frames (the default)
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'age':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1, df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
age | 名字 | group | |
---|---|---|---|
0 | 21 | 张三 | sale |
1 | 18 | 张三 | search |
2 | 29 | 凡凡 | service |
# the default is an inner merge
pd.merge(df1, df2, left_on='name', right_on='名字')
name | salary | age_x | age_y | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21 | 张三 | sale |
1 | 张三 | 10000 | 22 | 18 | 张三 | search |
2 | 张三 | 20000 | 25 | 21 | 张三 | sale |
3 | 张三 | 20000 | 25 | 18 | 张三 | search |
- Outer merge, how='outer': fill missing entries with NaN
pd.merge(df1, df2, left_on='name', right_on='名字', how='outer')
name | salary | age_x | age_y | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000.0 | 22.0 | 21.0 | 张三 | sale |
1 | 张三 | 10000.0 | 22.0 | 18.0 | 张三 | search |
2 | 张三 | 20000.0 | 25.0 | 21.0 | 张三 | sale |
3 | 张三 | 20000.0 | 25.0 | 18.0 | 张三 | search |
4 | 李四 | 12000.0 | 21.0 | NaN | NaN | NaN |
5 | NaN | NaN | NaN | 29.0 | 凡凡 | service |
- Left and right merges: how='left', how='right'
pd.merge(df1, df2, left_on='name', right_on='名字', how='left')
name | salary | age_x | age_y | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21.0 | 张三 | sale |
1 | 张三 | 10000 | 22 | 18.0 | 张三 | search |
2 | 李四 | 12000 | 21 | NaN | NaN | NaN |
3 | 张三 | 20000 | 25 | 21.0 | 张三 | sale |
4 | 张三 | 20000 | 25 | 18.0 | 张三 | search |
# right merge
pd.merge(df1, df2, left_on='name', right_on='名字', how='right')
name | salary | age_x | age_y | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000.0 | 22.0 | 21 | 张三 | sale |
1 | 张三 | 20000.0 | 25.0 | 21 | 张三 | sale |
2 | 张三 | 10000.0 | 22.0 | 18 | 张三 | search |
3 | 张三 | 20000.0 | 25.0 | 18 | 张三 | search |
4 | NaN | NaN | NaN | 29 | 凡凡 | service |
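When choosing between these merge modes, it often helps to see where each row came from. pd.merge accepts indicator=True, which adds a _merge column marking each row as 'left_only', 'right_only', or 'both'; a small sketch:

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'name': ['张三', '李四'], 'salary': [10000, 12000]})
df2 = DataFrame({'name': ['张三', '凡凡'], 'group': ['sale', 'service']})

# indicator=True adds a _merge column showing each row's origin
result = pd.merge(df1, df2, on='name', how='outer', indicator=True)
```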
============================================
Exercise 17:
- Consider realistic scenarios and merge ddd with ddd4 in several ways.
============================================
6) Resolving column conflicts
When columns conflict, i.e. several column names are shared, use on= to choose the key column, together with suffixes= to rename the conflicting columns.
# midterm exam
df1 = DataFrame({'name':['张三','李四','张三'],'degree':[120,118,149],'age':[22,21,25]})
# final exam
df2 = DataFrame({'degree':[99,97,129],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1, df2)
name | degree | age | |
---|---|---|---|
0 | 张三 | 120 | 22 |
1 | 李四 | 118 | 21 |
2 | 张三 | 149 | 25 |
degree | name | group | |
---|---|---|---|
0 | 99 | 张三 | sale |
1 | 97 | 张三 | search |
2 | 129 | 凡凡 | service |
# suffixes: labels appended to the conflicting column names
pd.merge(df1, df2, on='name', suffixes=['_df1', '_df2'])
name | degree_df1 | age | degree_df2 | group | |
---|---|---|---|---|---|
0 | 张三 | 120 | 22 | 99 | sale |
1 | 张三 | 120 | 22 | 97 | search |
2 | 张三 | 149 | 25 | 99 | sale |
3 | 张三 | 149 | 25 | 97 | search |
You can supply your own suffixes with suffixes=.
============================================
Exercise 18:
Suppose two students are both named 李四, and ddd5 and ddd6 are both grade tables for 张三 and 李四. How would you merge them?
============================================
pd.merge(how=<merge mode: inner, outer, left, right>,
         on=<column to merge on>,
         left_on=<key column of the left frame>,
         right_on=<key column of the right frame>,
         left_index/right_index=<use the index as the merge key>,
         suffixes=<suffixes for identically named columns>)
Suppose there are two grade tables: ddd for 张三, 李四, and 王老五, and ddd4 for 张三 and 赵小六. How would you merge them?
What if 张三's name in ddd4 had been mistyped as 张十三?
# use the students' names as column names
index = ['语文', '数学', '英语']
columns = ['张三', '李四', '王老五']
data = np.random.randint(0,150, size=(3,3))
ddd = DataFrame(data=data, index=index, columns=columns)
ddd
张三 | 李四 | 王老五 | |
---|---|---|---|
语文 | 97 | 99 | 63 |
数学 | 116 | 110 | 14 |
英语 | 6 | 108 | 34 |
index = ['语文', '数学', '英语']
columns = ['张三', '赵小六']
data = np.random.randint(0,150, size=(3,2))
ddd4 = DataFrame(data=data, index=index, columns=columns)
ddd4
张三 | 赵小六 | |
---|---|---|
语文 | 57 | 82 |
数学 | 139 | 47 |
英语 | 120 | 77 |
ddd4['张三'] = ddd['张三']
ddd4
张三 | 赵小六 | |
---|---|---|
语文 | 97 | 82 |
数学 | 116 | 47 |
英语 | 6 | 77 |
# use the right frame's index
result = pd.merge(ddd, ddd4, on='张三', left_index=True)
result
张三 | 李四 | 王老五 | 赵小六 | |
---|---|---|---|---|
语文 | 97 | 99 | 63 | 82 |
数学 | 116 | 110 | 14 | 47 |
英语 | 6 | 108 | 34 | 77 |
ddd4.rename({'张三': '张十三'}, axis=1, inplace=True)
ddd4
张十三 | 赵小六 | |
---|---|---|
语文 | 97 | 82 |
数学 | 116 | 47 |
英语 | 6 | 77 |
result = pd.merge(ddd, ddd4, left_on='张三', right_on='张十三', left_index=True)
result.drop(columns='张十三', inplace=True)
result
张三 | 李四 | 王老五 | 赵小六 | |
---|---|---|---|---|
语文 | 97 | 99 | 63 | 82 |
数学 | 116 | 110 | 14 | 47 |
英语 | 6 | 108 | 34 | 77 |
Homework
3. Case study: analyzing US state population data
Background knowledge for the homework
# unique(): return the distinct values
s = Series(['Tom','Lucy','Tom','dancer','Lucy'])
s.unique()
array(['Tom', 'Lucy', 'dancer'], dtype=object)
n = DataFrame({'name':['Tom','Lucy','Tom','dancer','Lucy'],'age':[12,13,12,11,15]})
n
name | age | |
---|---|---|
0 | Tom | 12 |
1 | Lucy | 13 |
2 | Tom | 12 |
3 | dancer | 11 |
4 | Lucy | 15 |
# query the rows where the name is Lucy and the age is greater than 13
n.query('name=="Lucy" & age>13')
name | age | |
---|---|---|
4 | Lucy | 15 |
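query() can also reference local Python variables by prefixing them with @; a short sketch:

```python
from pandas import DataFrame

n = DataFrame({'name': ['Tom', 'Lucy', 'Tom', 'dancer', 'Lucy'],
               'age': [12, 13, 12, 11, 15]})

# @ pulls a variable from the surrounding Python scope into the query string
min_age = 13
result = n.query('age > @min_age')
```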
First load the files and inspect a sample of the data.
abb = pd.read_csv('../data/state-abbrevs.csv')
abb.head()
state | abbreviation | |
---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
2 | Arizona | AZ |
3 | Arkansas | AR |
4 | California | CA |
pop = pd.read_csv('../data/state-population.csv')
pop.head()
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
areas = pd.read_csv('../data/state-areas.csv')
areas.head()
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
Merge the pop and abbrevs DataFrames, keyed on the state/region column and the abbreviation column respectively.
Use an outer merge so that no information is lost.
abb_pop = pd.merge(abb, pop, left_on='abbreviation', right_on='state/region', how='outer')
Drop the abbreviation column (axis=1).
abb_pop.drop(columns='abbreviation', inplace=True)
Check which columns contain missing data.
With .isnull().any(), a column shows True if it contains even a single missing value.
abb_pop.isnull().any()
state True
state/region False
ages False
year False
population True
dtype: bool
Inspect the missing data.
abb_pop
state | state/region | ages | year | population | |
---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 |
5 | Alabama | AL | total | 2011 | 4801627.0 |
6 | Alabama | AL | total | 2009 | 4757938.0 |
7 | Alabama | AL | under18 | 2009 | 1134192.0 |
8 | Alabama | AL | under18 | 2013 | 1111481.0 |
9 | Alabama | AL | total | 2013 | 4833722.0 |
10 | Alabama | AL | total | 2007 | 4672840.0 |
11 | Alabama | AL | under18 | 2007 | 1132296.0 |
12 | Alabama | AL | total | 2008 | 4718206.0 |
13 | Alabama | AL | under18 | 2008 | 1134927.0 |
14 | Alabama | AL | total | 2005 | 4569805.0 |
15 | Alabama | AL | under18 | 2005 | 1117229.0 |
16 | Alabama | AL | total | 2006 | 4628981.0 |
17 | Alabama | AL | under18 | 2006 | 1126798.0 |
18 | Alabama | AL | total | 2004 | 4530729.0 |
19 | Alabama | AL | under18 | 2004 | 1113662.0 |
20 | Alabama | AL | total | 2003 | 4503491.0 |
21 | Alabama | AL | under18 | 2003 | 1113083.0 |
22 | Alabama | AL | total | 2001 | 4467634.0 |
23 | Alabama | AL | under18 | 2001 | 1120409.0 |
24 | Alabama | AL | total | 2002 | 4480089.0 |
25 | Alabama | AL | under18 | 2002 | 1116590.0 |
26 | Alabama | AL | under18 | 1999 | 1121287.0 |
27 | Alabama | AL | total | 1999 | 4430141.0 |
28 | Alabama | AL | total | 2000 | 4452173.0 |
29 | Alabama | AL | under18 | 2000 | 1122273.0 |
... | ... | ... | ... | ... | ... |
2514 | NaN | USA | under18 | 1999 | 71946051.0 |
2515 | NaN | USA | total | 2000 | 282162411.0 |
2516 | NaN | USA | under18 | 2000 | 72376189.0 |
2517 | NaN | USA | total | 1999 | 279040181.0 |
2518 | NaN | USA | total | 2001 | 284968955.0 |
2519 | NaN | USA | under18 | 2001 | 72671175.0 |
2520 | NaN | USA | total | 2002 | 287625193.0 |
2521 | NaN | USA | under18 | 2002 | 72936457.0 |
2522 | NaN | USA | total | 2003 | 290107933.0 |
2523 | NaN | USA | under18 | 2003 | 73100758.0 |
2524 | NaN | USA | total | 2004 | 292805298.0 |
2525 | NaN | USA | under18 | 2004 | 73297735.0 |
2526 | NaN | USA | total | 2005 | 295516599.0 |
2527 | NaN | USA | under18 | 2005 | 73523669.0 |
2528 | NaN | USA | total | 2006 | 298379912.0 |
2529 | NaN | USA | under18 | 2006 | 73757714.0 |
2530 | NaN | USA | total | 2007 | 301231207.0 |
2531 | NaN | USA | under18 | 2007 | 74019405.0 |
2532 | NaN | USA | total | 2008 | 304093966.0 |
2533 | NaN | USA | under18 | 2008 | 74104602.0 |
2534 | NaN | USA | under18 | 2013 | 73585872.0 |
2535 | NaN | USA | total | 2013 | 316128839.0 |
2536 | NaN | USA | total | 2009 | 306771529.0 |
2537 | NaN | USA | under18 | 2009 | 74134167.0 |
2538 | NaN | USA | under18 | 2010 | 74119556.0 |
2539 | NaN | USA | total | 2010 | 309326295.0 |
2540 | NaN | USA | under18 | 2011 | 73902222.0 |
2541 | NaN | USA | total | 2011 | 311582564.0 |
2542 | NaN | USA | under18 | 2012 | 73708179.0 |
2543 | NaN | USA | total | 2012 | 313873685.0 |
2544 rows × 5 columns
abb_pop.isnull().any(axis=1)
0 False
1 False
2 False
3 False
4 False
...
2539 True
2540 True
2541 True
2542 True
2543 True
Length: 2544, dtype: bool
abb_pop[abb_pop.isnull().any(axis=1)]
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | NaN | PR | under18 | 1990 | NaN |
2449 | NaN | PR | total | 1990 | NaN |
2450 | NaN | PR | total | 1991 | NaN |
2451 | NaN | PR | under18 | 1991 | NaN |
2452 | NaN | PR | total | 1993 | NaN |
2453 | NaN | PR | under18 | 1993 | NaN |
2454 | NaN | PR | under18 | 1992 | NaN |
2455 | NaN | PR | total | 1992 | NaN |
2456 | NaN | PR | under18 | 1994 | NaN |
2457 | NaN | PR | total | 1994 | NaN |
2458 | NaN | PR | total | 1995 | NaN |
2459 | NaN | PR | under18 | 1995 | NaN |
2460 | NaN | PR | under18 | 1996 | NaN |
2461 | NaN | PR | total | 1996 | NaN |
2462 | NaN | PR | under18 | 1998 | NaN |
2463 | NaN | PR | total | 1998 | NaN |
2464 | NaN | PR | total | 1997 | NaN |
2465 | NaN | PR | under18 | 1997 | NaN |
2466 | NaN | PR | total | 1999 | NaN |
2467 | NaN | PR | under18 | 1999 | NaN |
2468 | NaN | PR | total | 2000 | 3810605.0 |
2469 | NaN | PR | under18 | 2000 | 1089063.0 |
2470 | NaN | PR | total | 2001 | 3818774.0 |
2471 | NaN | PR | under18 | 2001 | 1077566.0 |
2472 | NaN | PR | total | 2002 | 3823701.0 |
2473 | NaN | PR | under18 | 2002 | 1065051.0 |
2474 | NaN | PR | total | 2004 | 3826878.0 |
2475 | NaN | PR | under18 | 2004 | 1035919.0 |
2476 | NaN | PR | total | 2003 | 3826095.0 |
2477 | NaN | PR | under18 | 2003 | 1050615.0 |
... | ... | ... | ... | ... | ... |
2514 | NaN | USA | under18 | 1999 | 71946051.0 |
2515 | NaN | USA | total | 2000 | 282162411.0 |
2516 | NaN | USA | under18 | 2000 | 72376189.0 |
2517 | NaN | USA | total | 1999 | 279040181.0 |
2518 | NaN | USA | total | 2001 | 284968955.0 |
2519 | NaN | USA | under18 | 2001 | 72671175.0 |
2520 | NaN | USA | total | 2002 | 287625193.0 |
2521 | NaN | USA | under18 | 2002 | 72936457.0 |
2522 | NaN | USA | total | 2003 | 290107933.0 |
2523 | NaN | USA | under18 | 2003 | 73100758.0 |
2524 | NaN | USA | total | 2004 | 292805298.0 |
2525 | NaN | USA | under18 | 2004 | 73297735.0 |
2526 | NaN | USA | total | 2005 | 295516599.0 |
2527 | NaN | USA | under18 | 2005 | 73523669.0 |
2528 | NaN | USA | total | 2006 | 298379912.0 |
2529 | NaN | USA | under18 | 2006 | 73757714.0 |
2530 | NaN | USA | total | 2007 | 301231207.0 |
2531 | NaN | USA | under18 | 2007 | 74019405.0 |
2532 | NaN | USA | total | 2008 | 304093966.0 |
2533 | NaN | USA | under18 | 2008 | 74104602.0 |
2534 | NaN | USA | under18 | 2013 | 73585872.0 |
2535 | NaN | USA | total | 2013 | 316128839.0 |
2536 | NaN | USA | total | 2009 | 306771529.0 |
2537 | NaN | USA | under18 | 2009 | 74134167.0 |
2538 | NaN | USA | under18 | 2010 | 74119556.0 |
2539 | NaN | USA | total | 2010 | 309326295.0 |
2540 | NaN | USA | under18 | 2011 | 73902222.0 |
2541 | NaN | USA | total | 2011 | 311582564.0 |
2542 | NaN | USA | under18 | 2012 | 73708179.0 |
2543 | NaN | USA | total | 2012 | 313873685.0 |
96 rows × 5 columns
Filter the rows by missingness: rows where any value is missing (True) are displayed.
Find which state/region values leave state as NaN; use unique() to list the distinct values.
abb_pop[abb_pop.isnull().any(axis=1)]['state/region'].unique()
array(['PR', 'USA'], dtype=object)
Fill in the correct state value for those state/region entries, removing every NaN from the state column!
Remember this technique for cleaning missing (NaN) data!
cond = abb_pop[abb_pop.isnull().any(axis=1)]['state/region'] == 'PR'
abb_pop[abb_pop.isnull().any(axis=1)]['state'][cond]= 'Puerto Rico'
d:\1903\.venv\lib\site-packages\pandas\core\generic.py:9116: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._update_inplace(new_data)
d:\1903\.venv\lib\site-packages\IPython\core\interactiveshell.py:3326: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
exec(code_obj, self.user_global_ns, self.user_ns)
abb_pop[abb_pop.isnull().any(axis=1)][cond]
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | NaN | PR | under18 | 1990 | NaN |
2449 | NaN | PR | total | 1990 | NaN |
2450 | NaN | PR | total | 1991 | NaN |
2451 | NaN | PR | under18 | 1991 | NaN |
2452 | NaN | PR | total | 1993 | NaN |
2453 | NaN | PR | under18 | 1993 | NaN |
2454 | NaN | PR | under18 | 1992 | NaN |
2455 | NaN | PR | total | 1992 | NaN |
2456 | NaN | PR | under18 | 1994 | NaN |
2457 | NaN | PR | total | 1994 | NaN |
2458 | NaN | PR | total | 1995 | NaN |
2459 | NaN | PR | under18 | 1995 | NaN |
2460 | NaN | PR | under18 | 1996 | NaN |
2461 | NaN | PR | total | 1996 | NaN |
2462 | NaN | PR | under18 | 1998 | NaN |
2463 | NaN | PR | total | 1998 | NaN |
2464 | NaN | PR | total | 1997 | NaN |
2465 | NaN | PR | under18 | 1997 | NaN |
2466 | NaN | PR | total | 1999 | NaN |
2467 | NaN | PR | under18 | 1999 | NaN |
2468 | NaN | PR | total | 2000 | 3810605.0 |
2469 | NaN | PR | under18 | 2000 | 1089063.0 |
2470 | NaN | PR | total | 2001 | 3818774.0 |
2471 | NaN | PR | under18 | 2001 | 1077566.0 |
2472 | NaN | PR | total | 2002 | 3823701.0 |
2473 | NaN | PR | under18 | 2002 | 1065051.0 |
2474 | NaN | PR | total | 2004 | 3826878.0 |
2475 | NaN | PR | under18 | 2004 | 1035919.0 |
2476 | NaN | PR | total | 2003 | 3826095.0 |
2477 | NaN | PR | under18 | 2003 | 1050615.0 |
2478 | NaN | PR | total | 2005 | 3821362.0 |
2479 | NaN | PR | under18 | 2005 | 1019447.0 |
2480 | NaN | PR | total | 2006 | 3805214.0 |
2481 | NaN | PR | under18 | 2006 | 998543.0 |
2482 | NaN | PR | total | 2007 | 3782995.0 |
2483 | NaN | PR | under18 | 2007 | 973613.0 |
2484 | NaN | PR | total | 2008 | 3760866.0 |
2485 | NaN | PR | under18 | 2008 | 945705.0 |
2486 | NaN | PR | under18 | 2013 | 814068.0 |
2487 | NaN | PR | total | 2013 | 3615086.0 |
2488 | NaN | PR | total | 2009 | 3740410.0 |
2489 | NaN | PR | under18 | 2009 | 920794.0 |
2490 | NaN | PR | total | 2010 | 3721208.0 |
2491 | NaN | PR | under18 | 2010 | 896945.0 |
2492 | NaN | PR | under18 | 2011 | 869327.0 |
2493 | NaN | PR | total | 2011 | 3686580.0 |
2494 | NaN | PR | under18 | 2012 | 841740.0 |
2495 | NaN | PR | total | 2012 | 3651545.0 |
abb_pop[abb_pop.isnull().any(axis=1)].loc[cond, 'state'] = 'Puerto Rico'
d:\1903\.venv\lib\site-packages\pandas\core\indexing.py:480: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[item] = s
abb_pop[abb_pop.isnull().any(axis=1)][cond]
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | NaN | PR | under18 | 1990 | NaN |
2449 | NaN | PR | total | 1990 | NaN |
2450 | NaN | PR | total | 1991 | NaN |
2451 | NaN | PR | under18 | 1991 | NaN |
2452 | NaN | PR | total | 1993 | NaN |
2453 | NaN | PR | under18 | 1993 | NaN |
2454 | NaN | PR | under18 | 1992 | NaN |
2455 | NaN | PR | total | 1992 | NaN |
2456 | NaN | PR | under18 | 1994 | NaN |
2457 | NaN | PR | total | 1994 | NaN |
2458 | NaN | PR | total | 1995 | NaN |
2459 | NaN | PR | under18 | 1995 | NaN |
2460 | NaN | PR | under18 | 1996 | NaN |
2461 | NaN | PR | total | 1996 | NaN |
2462 | NaN | PR | under18 | 1998 | NaN |
2463 | NaN | PR | total | 1998 | NaN |
2464 | NaN | PR | total | 1997 | NaN |
2465 | NaN | PR | under18 | 1997 | NaN |
2466 | NaN | PR | total | 1999 | NaN |
2467 | NaN | PR | under18 | 1999 | NaN |
2468 | NaN | PR | total | 2000 | 3810605.0 |
2469 | NaN | PR | under18 | 2000 | 1089063.0 |
2470 | NaN | PR | total | 2001 | 3818774.0 |
2471 | NaN | PR | under18 | 2001 | 1077566.0 |
2472 | NaN | PR | total | 2002 | 3823701.0 |
2473 | NaN | PR | under18 | 2002 | 1065051.0 |
2474 | NaN | PR | total | 2004 | 3826878.0 |
2475 | NaN | PR | under18 | 2004 | 1035919.0 |
2476 | NaN | PR | total | 2003 | 3826095.0 |
2477 | NaN | PR | under18 | 2003 | 1050615.0 |
2478 | NaN | PR | total | 2005 | 3821362.0 |
2479 | NaN | PR | under18 | 2005 | 1019447.0 |
2480 | NaN | PR | total | 2006 | 3805214.0 |
2481 | NaN | PR | under18 | 2006 | 998543.0 |
2482 | NaN | PR | total | 2007 | 3782995.0 |
2483 | NaN | PR | under18 | 2007 | 973613.0 |
2484 | NaN | PR | total | 2008 | 3760866.0 |
2485 | NaN | PR | under18 | 2008 | 945705.0 |
2486 | NaN | PR | under18 | 2013 | 814068.0 |
2487 | NaN | PR | total | 2013 | 3615086.0 |
2488 | NaN | PR | total | 2009 | 3740410.0 |
2489 | NaN | PR | under18 | 2009 | 920794.0 |
2490 | NaN | PR | total | 2010 | 3721208.0 |
2491 | NaN | PR | under18 | 2010 | 896945.0 |
2492 | NaN | PR | under18 | 2011 | 869327.0 |
2493 | NaN | PR | total | 2011 | 3686580.0 |
2494 | NaN | PR | under18 | 2012 | 841740.0 |
2495 | NaN | PR | total | 2012 | 3651545.0 |
temp = abb_pop[abb_pop.isnull().any(axis=1)].copy()
temp.loc[cond, 'state'] = 'Puerto Rico'
temp
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | Puerto Rico | PR | under18 | 1990 | NaN |
2449 | Puerto Rico | PR | total | 1990 | NaN |
2450 | Puerto Rico | PR | total | 1991 | NaN |
2451 | Puerto Rico | PR | under18 | 1991 | NaN |
2452 | Puerto Rico | PR | total | 1993 | NaN |
2453 | Puerto Rico | PR | under18 | 1993 | NaN |
2454 | Puerto Rico | PR | under18 | 1992 | NaN |
2455 | Puerto Rico | PR | total | 1992 | NaN |
2456 | Puerto Rico | PR | under18 | 1994 | NaN |
2457 | Puerto Rico | PR | total | 1994 | NaN |
2458 | Puerto Rico | PR | total | 1995 | NaN |
2459 | Puerto Rico | PR | under18 | 1995 | NaN |
2460 | Puerto Rico | PR | under18 | 1996 | NaN |
2461 | Puerto Rico | PR | total | 1996 | NaN |
2462 | Puerto Rico | PR | under18 | 1998 | NaN |
2463 | Puerto Rico | PR | total | 1998 | NaN |
2464 | Puerto Rico | PR | total | 1997 | NaN |
2465 | Puerto Rico | PR | under18 | 1997 | NaN |
2466 | Puerto Rico | PR | total | 1999 | NaN |
2467 | Puerto Rico | PR | under18 | 1999 | NaN |
2468 | Puerto Rico | PR | total | 2000 | 3810605.0 |
2469 | Puerto Rico | PR | under18 | 2000 | 1089063.0 |
2470 | Puerto Rico | PR | total | 2001 | 3818774.0 |
2471 | Puerto Rico | PR | under18 | 2001 | 1077566.0 |
2472 | Puerto Rico | PR | total | 2002 | 3823701.0 |
2473 | Puerto Rico | PR | under18 | 2002 | 1065051.0 |
2474 | Puerto Rico | PR | total | 2004 | 3826878.0 |
2475 | Puerto Rico | PR | under18 | 2004 | 1035919.0 |
2476 | Puerto Rico | PR | total | 2003 | 3826095.0 |
2477 | Puerto Rico | PR | under18 | 2003 | 1050615.0 |
... | ... | ... | ... | ... | ... |
2514 | NaN | USA | under18 | 1999 | 71946051.0 |
2515 | NaN | USA | total | 2000 | 282162411.0 |
2516 | NaN | USA | under18 | 2000 | 72376189.0 |
2517 | NaN | USA | total | 1999 | 279040181.0 |
2518 | NaN | USA | total | 2001 | 284968955.0 |
2519 | NaN | USA | under18 | 2001 | 72671175.0 |
2520 | NaN | USA | total | 2002 | 287625193.0 |
2521 | NaN | USA | under18 | 2002 | 72936457.0 |
2522 | NaN | USA | total | 2003 | 290107933.0 |
2523 | NaN | USA | under18 | 2003 | 73100758.0 |
2524 | NaN | USA | total | 2004 | 292805298.0 |
2525 | NaN | USA | under18 | 2004 | 73297735.0 |
2526 | NaN | USA | total | 2005 | 295516599.0 |
2527 | NaN | USA | under18 | 2005 | 73523669.0 |
2528 | NaN | USA | total | 2006 | 298379912.0 |
2529 | NaN | USA | under18 | 2006 | 73757714.0 |
2530 | NaN | USA | total | 2007 | 301231207.0 |
2531 | NaN | USA | under18 | 2007 | 74019405.0 |
2532 | NaN | USA | total | 2008 | 304093966.0 |
2533 | NaN | USA | under18 | 2008 | 74104602.0 |
2534 | NaN | USA | under18 | 2013 | 73585872.0 |
2535 | NaN | USA | total | 2013 | 316128839.0 |
2536 | NaN | USA | total | 2009 | 306771529.0 |
2537 | NaN | USA | under18 | 2009 | 74134167.0 |
2538 | NaN | USA | under18 | 2010 | 74119556.0 |
2539 | NaN | USA | total | 2010 | 309326295.0 |
2540 | NaN | USA | under18 | 2011 | 73902222.0 |
2541 | NaN | USA | total | 2011 | 311582564.0 |
2542 | NaN | USA | under18 | 2012 | 73708179.0 |
2543 | NaN | USA | total | 2012 | 313873685.0 |
96 rows × 5 columns
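Instead of assigning through a boolean mask, the missing state names could also be filled with `fillna()` plus a `map()` from the abbreviation column; a minimal sketch with hypothetical stand-in data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for abb_pop: same two columns, two missing names
df = pd.DataFrame({
    'state': [np.nan, np.nan, 'Alabama'],
    'state/region': ['PR', 'USA', 'AL'],
})
# Map each abbreviation to its full name, then fill only the NaN entries
name_map = {'PR': 'Puerto Rico', 'USA': 'United States'}
df['state'] = df['state'].fillna(df['state/region'].map(name_map))
print(df['state'].tolist())  # ['Puerto Rico', 'United States', 'Alabama']
```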
cond = abb_pop[abb_pop.isnull().any(axis=1)]['state/region'] == 'USA'
temp.loc[cond, 'state'] = 'United State of American'
temp
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | Puerto Rico | PR | under18 | 1990 | NaN |
2449 | Puerto Rico | PR | total | 1990 | NaN |
2450 | Puerto Rico | PR | total | 1991 | NaN |
2451 | Puerto Rico | PR | under18 | 1991 | NaN |
2452 | Puerto Rico | PR | total | 1993 | NaN |
2453 | Puerto Rico | PR | under18 | 1993 | NaN |
2454 | Puerto Rico | PR | under18 | 1992 | NaN |
2455 | Puerto Rico | PR | total | 1992 | NaN |
2456 | Puerto Rico | PR | under18 | 1994 | NaN |
2457 | Puerto Rico | PR | total | 1994 | NaN |
2458 | Puerto Rico | PR | total | 1995 | NaN |
2459 | Puerto Rico | PR | under18 | 1995 | NaN |
2460 | Puerto Rico | PR | under18 | 1996 | NaN |
2461 | Puerto Rico | PR | total | 1996 | NaN |
2462 | Puerto Rico | PR | under18 | 1998 | NaN |
2463 | Puerto Rico | PR | total | 1998 | NaN |
2464 | Puerto Rico | PR | total | 1997 | NaN |
2465 | Puerto Rico | PR | under18 | 1997 | NaN |
2466 | Puerto Rico | PR | total | 1999 | NaN |
2467 | Puerto Rico | PR | under18 | 1999 | NaN |
2468 | Puerto Rico | PR | total | 2000 | 3810605.0 |
2469 | Puerto Rico | PR | under18 | 2000 | 1089063.0 |
2470 | Puerto Rico | PR | total | 2001 | 3818774.0 |
2471 | Puerto Rico | PR | under18 | 2001 | 1077566.0 |
2472 | Puerto Rico | PR | total | 2002 | 3823701.0 |
2473 | Puerto Rico | PR | under18 | 2002 | 1065051.0 |
2474 | Puerto Rico | PR | total | 2004 | 3826878.0 |
2475 | Puerto Rico | PR | under18 | 2004 | 1035919.0 |
2476 | Puerto Rico | PR | total | 2003 | 3826095.0 |
2477 | Puerto Rico | PR | under18 | 2003 | 1050615.0 |
... | ... | ... | ... | ... | ... |
2514 | United State of American | USA | under18 | 1999 | 71946051.0 |
2515 | United State of American | USA | total | 2000 | 282162411.0 |
2516 | United State of American | USA | under18 | 2000 | 72376189.0 |
2517 | United State of American | USA | total | 1999 | 279040181.0 |
2518 | United State of American | USA | total | 2001 | 284968955.0 |
2519 | United State of American | USA | under18 | 2001 | 72671175.0 |
2520 | United State of American | USA | total | 2002 | 287625193.0 |
2521 | United State of American | USA | under18 | 2002 | 72936457.0 |
2522 | United State of American | USA | total | 2003 | 290107933.0 |
2523 | United State of American | USA | under18 | 2003 | 73100758.0 |
2524 | United State of American | USA | total | 2004 | 292805298.0 |
2525 | United State of American | USA | under18 | 2004 | 73297735.0 |
2526 | United State of American | USA | total | 2005 | 295516599.0 |
2527 | United State of American | USA | under18 | 2005 | 73523669.0 |
2528 | United State of American | USA | total | 2006 | 298379912.0 |
2529 | United State of American | USA | under18 | 2006 | 73757714.0 |
2530 | United State of American | USA | total | 2007 | 301231207.0 |
2531 | United State of American | USA | under18 | 2007 | 74019405.0 |
2532 | United State of American | USA | total | 2008 | 304093966.0 |
2533 | United State of American | USA | under18 | 2008 | 74104602.0 |
2534 | United State of American | USA | under18 | 2013 | 73585872.0 |
2535 | United State of American | USA | total | 2013 | 316128839.0 |
2536 | United State of American | USA | total | 2009 | 306771529.0 |
2537 | United State of American | USA | under18 | 2009 | 74134167.0 |
2538 | United State of American | USA | under18 | 2010 | 74119556.0 |
2539 | United State of American | USA | total | 2010 | 309326295.0 |
2540 | United State of American | USA | under18 | 2011 | 73902222.0 |
2541 | United State of American | USA | total | 2011 | 311582564.0 |
2542 | United State of American | USA | under18 | 2012 | 73708179.0 |
2543 | United State of American | USA | total | 2012 | 313873685.0 |
96 rows × 5 columns
abb_pop[abb_pop.isnull().any(axis=1)] = temp
abb_pop[abb_pop.isnull().any(axis=1)]
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | Puerto Rico | PR | under18 | 1990 | NaN |
2449 | Puerto Rico | PR | total | 1990 | NaN |
2450 | Puerto Rico | PR | total | 1991 | NaN |
2451 | Puerto Rico | PR | under18 | 1991 | NaN |
2452 | Puerto Rico | PR | total | 1993 | NaN |
2453 | Puerto Rico | PR | under18 | 1993 | NaN |
2454 | Puerto Rico | PR | under18 | 1992 | NaN |
2455 | Puerto Rico | PR | total | 1992 | NaN |
2456 | Puerto Rico | PR | under18 | 1994 | NaN |
2457 | Puerto Rico | PR | total | 1994 | NaN |
2458 | Puerto Rico | PR | total | 1995 | NaN |
2459 | Puerto Rico | PR | under18 | 1995 | NaN |
2460 | Puerto Rico | PR | under18 | 1996 | NaN |
2461 | Puerto Rico | PR | total | 1996 | NaN |
2462 | Puerto Rico | PR | under18 | 1998 | NaN |
2463 | Puerto Rico | PR | total | 1998 | NaN |
2464 | Puerto Rico | PR | total | 1997 | NaN |
2465 | Puerto Rico | PR | under18 | 1997 | NaN |
2466 | Puerto Rico | PR | total | 1999 | NaN |
2467 | Puerto Rico | PR | under18 | 1999 | NaN |
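The write-back above works because pandas aligns the repaired rows with the original frame by index label; a minimal sketch with toy data (written here with `.loc` for the assignment):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 has a missing value
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', 'z']})
mask = df.isnull().any(axis=1)

# Repair a copy of the bad rows, then assign it back through the mask;
# pandas matches the rows up again by index label
fixed = df[mask].copy()
fixed['a'] = 0.0
df.loc[mask] = fixed
print(df['a'].tolist())  # [1.0, 0.0, 3.0]
```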
Merge in the per-state area data areas, using an outer merge.
Think about why an outer merge is used here.
abb_pop.head()
state | state/region | ages | year | population | |
---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 |
areas.head()
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
abb_pop_areas = abb_pop.merge(areas, how='outer')
abb_pop_areas.head()
state | state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 | 52423.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 | 52423.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 | 52423.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 | 52423.0 |
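To see how `how='outer'` behaves when a key appears on only one side, here is a sketch with made-up data:

```python
import pandas as pd

# Made-up data: 'Nation' has a population row but no area row
left = pd.DataFrame({'state': ['Alabama', 'Alaska', 'Nation'],
                     'pop': [4.8, 0.7, 313.9]})
right = pd.DataFrame({'state': ['Alabama', 'Alaska'],
                      'area': [52423, 656425]})

# An inner merge would silently drop 'Nation'; outer keeps it with NaN area
outer = left.merge(right, how='outer')
print(outer.shape)                   # (3, 3)
print(outer['area'].isnull().sum())  # 1
```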
Next, look for columns that still contain missing data:
abb_pop_areas.isnull().any()
state False
state/region False
ages False
year False
population True
area (sq. mi) True
dtype: bool
We find that the area (sq. mi) column contains missing data. To locate those rows, we need to find out which state is missing data.
abb_pop_areas[abb_pop_areas.isnull().any(axis=1)]
state | state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|---|
2448 | Puerto Rico | PR | under18 | 1990 | NaN | 3515.0 |
2449 | Puerto Rico | PR | total | 1990 | NaN | 3515.0 |
2450 | Puerto Rico | PR | total | 1991 | NaN | 3515.0 |
2451 | Puerto Rico | PR | under18 | 1991 | NaN | 3515.0 |
2452 | Puerto Rico | PR | total | 1993 | NaN | 3515.0 |
2453 | Puerto Rico | PR | under18 | 1993 | NaN | 3515.0 |
2454 | Puerto Rico | PR | under18 | 1992 | NaN | 3515.0 |
2455 | Puerto Rico | PR | total | 1992 | NaN | 3515.0 |
2456 | Puerto Rico | PR | under18 | 1994 | NaN | 3515.0 |
2457 | Puerto Rico | PR | total | 1994 | NaN | 3515.0 |
2458 | Puerto Rico | PR | total | 1995 | NaN | 3515.0 |
2459 | Puerto Rico | PR | under18 | 1995 | NaN | 3515.0 |
2460 | Puerto Rico | PR | under18 | 1996 | NaN | 3515.0 |
2461 | Puerto Rico | PR | total | 1996 | NaN | 3515.0 |
2462 | Puerto Rico | PR | under18 | 1998 | NaN | 3515.0 |
2463 | Puerto Rico | PR | total | 1998 | NaN | 3515.0 |
2464 | Puerto Rico | PR | total | 1997 | NaN | 3515.0 |
2465 | Puerto Rico | PR | under18 | 1997 | NaN | 3515.0 |
2466 | Puerto Rico | PR | total | 1999 | NaN | 3515.0 |
2467 | Puerto Rico | PR | under18 | 1999 | NaN | 3515.0 |
2496 | United State of American | USA | under18 | 1990 | 64218512.0 | NaN |
2497 | United State of American | USA | total | 1990 | 249622814.0 | NaN |
2498 | United State of American | USA | total | 1991 | 252980942.0 | NaN |
2499 | United State of American | USA | under18 | 1991 | 65313018.0 | NaN |
2500 | United State of American | USA | under18 | 1992 | 66509177.0 | NaN |
2501 | United State of American | USA | total | 1992 | 256514231.0 | NaN |
2502 | United State of American | USA | total | 1993 | 259918595.0 | NaN |
2503 | United State of American | USA | under18 | 1993 | 67594938.0 | NaN |
2504 | United State of American | USA | under18 | 1994 | 68640936.0 | NaN |
2505 | United State of American | USA | total | 1994 | 263125826.0 | NaN |
... | ... | ... | ... | ... | ... | ... |
2514 | United State of American | USA | under18 | 1999 | 71946051.0 | NaN |
2515 | United State of American | USA | total | 2000 | 282162411.0 | NaN |
2516 | United State of American | USA | under18 | 2000 | 72376189.0 | NaN |
2517 | United State of American | USA | total | 1999 | 279040181.0 | NaN |
2518 | United State of American | USA | total | 2001 | 284968955.0 | NaN |
2519 | United State of American | USA | under18 | 2001 | 72671175.0 | NaN |
2520 | United State of American | USA | total | 2002 | 287625193.0 | NaN |
2521 | United State of American | USA | under18 | 2002 | 72936457.0 | NaN |
2522 | United State of American | USA | total | 2003 | 290107933.0 | NaN |
2523 | United State of American | USA | under18 | 2003 | 73100758.0 | NaN |
2524 | United State of American | USA | total | 2004 | 292805298.0 | NaN |
2525 | United State of American | USA | under18 | 2004 | 73297735.0 | NaN |
2526 | United State of American | USA | total | 2005 | 295516599.0 | NaN |
2527 | United State of American | USA | under18 | 2005 | 73523669.0 | NaN |
2528 | United State of American | USA | total | 2006 | 298379912.0 | NaN |
2529 | United State of American | USA | under18 | 2006 | 73757714.0 | NaN |
2530 | United State of American | USA | total | 2007 | 301231207.0 | NaN |
2531 | United State of American | USA | under18 | 2007 | 74019405.0 | NaN |
2532 | United State of American | USA | total | 2008 | 304093966.0 | NaN |
2533 | United State of American | USA | under18 | 2008 | 74104602.0 | NaN |
2534 | United State of American | USA | under18 | 2013 | 73585872.0 | NaN |
2535 | United State of American | USA | total | 2013 | 316128839.0 | NaN |
2536 | United State of American | USA | total | 2009 | 306771529.0 | NaN |
2537 | United State of American | USA | under18 | 2009 | 74134167.0 | NaN |
2538 | United State of American | USA | under18 | 2010 | 74119556.0 | NaN |
2539 | United State of American | USA | total | 2010 | 309326295.0 | NaN |
2540 | United State of American | USA | under18 | 2011 | 73902222.0 | NaN |
2541 | United State of American | USA | total | 2011 | 311582564.0 | NaN |
2542 | United State of American | USA | under18 | 2012 | 73708179.0 | NaN |
2543 | United State of American | USA | total | 2012 | 313873685.0 | NaN |
68 rows × 6 columns
Drop the rows that contain missing data:
abb_pop_areas.dropna(axis=0, how='any', inplace=True)
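A quick sketch of what dropna's axis and how parameters do, on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
# how='any' drops a row if any value in it is missing;
# how='all' drops it only when every value in the row is missing
print(df.dropna(axis=0, how='any').shape)  # (1, 2)
print(df.dropna(axis=0, how='all').shape)  # (3, 2)
```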
Check whether any missing data remains:
abb_pop_areas.isnull().any()
state False
state/region False
ages False
year False
population False
area (sq. mi) False
dtype: bool
abb_pop_areas.head()
state | state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 | 52423.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 | 52423.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 | 52423.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 | 52423.0 |
abb_pop_areas.dtypes
state object
state/region object
ages object
year int64
population float64
area (sq. mi) float64
dtype: object
Find the total population data for 2010, using df.query(query_string):
result = abb_pop_areas.query('ages=="total" & year==2010')
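query() is shorthand for ordinary boolean indexing; a sketch with hypothetical data showing the equivalence:

```python
import pandas as pd

# Hypothetical slice of the population table
df = pd.DataFrame({'ages': ['total', 'under18', 'total'],
                   'year': [2010, 2010, 2011],
                   'population': [100, 30, 105]})
q = df.query('ages=="total" & year==2010')
# The same selection written as boolean indexing
b = df[(df['ages'] == 'total') & (df['year'] == 2010)]
print(q.equals(b))  # True
```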
Process the query result: make the state column the new row index with set_index:
result.set_index(keys=['state'], inplace=True)
result.head()
state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|
state | |||||
Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
Alaska | AK | total | 2010 | 713868.0 | 656425.0 |
Arizona | AZ | total | 2010 | 6408790.0 | 114006.0 |
Arkansas | AR | total | 2010 | 2922280.0 | 53182.0 |
California | CA | total | 2010 | 37333601.0 | 163707.0 |
Compute the population density. Note that this is a Series divided by a Series, and the result is still a Series.
density = result['population'] / result['area (sq. mi)']
density.head()
state
Alabama 91.287603
Alaska 1.087509
Arizona 56.214497
Arkansas 54.948667
California 228.051342
dtype: float64
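Series division pairs values up by index label; a tiny sketch with made-up numbers:

```python
import pandas as pd

pop = pd.Series({'A': 100.0, 'B': 50.0})
area = pd.Series({'A': 10.0, 'B': 25.0})
# Division aligns by index label; the result is another Series
density = pop / area
print(density['A'], density['B'])  # 10.0 2.0
```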
Sort with sort_values() and find the five states with the highest population density:
density.sort_values(inplace=True)
density.tail()
state
Connecticut 645.600649
Rhode Island 681.339159
New Jersey 1009.253268
Puerto Rico 1058.665149
District of Columbia 8898.897059
dtype: float64
Find the five states with the lowest population density:
density.head()
state
Alaska 1.087509
Wyoming 5.768079
Montana 6.736171
North Dakota 9.537565
South Dakota 10.583512
dtype: float64
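As an alternative to sorting the whole Series in place, nlargest()/nsmallest() return just the top or bottom entries; a sketch with a few made-up densities:

```python
import pandas as pd

# A few made-up densities keyed by state abbreviation
density = pd.Series({'DC': 8898.9, 'NJ': 1009.3, 'AK': 1.1, 'WY': 5.8})
print(density.nlargest(2).index.tolist())   # ['DC', 'NJ']
print(density.nsmallest(2).index.tolist())  # ['AK', 'WY']
```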
Key takeaways:
- Index consistently with .loc[]
- Use .isnull().any() to find the columns that contain NaN
- Use .unique() to determine which keys in a column are the ones we need
- In general, use outer or left merges, for one simple reason: better to end up with NaN in a column than to throw away information from the other columns
Review: how Series/DataFrame arithmetic differs from ndarray arithmetic
- Series and DataFrame do not broadcast: if an index label has no matching value on the other side, the result is NaN; alternatively, use add() with fill_value to fill in the missing values
- ndarrays do broadcast, repeating existing values to carry out the computation
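The alignment-versus-broadcasting contrast can be seen directly in a small sketch:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])
# Labels present on only one side come out as NaN...
print((s1 + s2)['a'])                 # nan
# ...unless fill_value supplies a default for the missing side
print(s1.add(s2, fill_value=0)['c'])  # 20.0
# ndarrays instead broadcast, repeating values to match shapes
print((np.array([[1], [2]]) + np.array([10, 20])).shape)  # (2, 2)
```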