Concatenation and Merging in pandas
pandas offers two kinds of combining operations:
- Concatenation: pd.concat, pd.append
- Merging: pd.merge
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
0. Review: concatenation in NumPy
============================================
Exercise 12:
- Generate two 3×3 matrices and concatenate them along each of the two axes
============================================
n1 = np.random.randint(0,100, size=(3,3))
n2 = np.random.randint(0,100, size=(3,3))
display(n1,n2)
array([[99, 49, 30],
[36, 48, 70],
[18, 90, 77]])
array([[82, 12, 2],
[40, 46, 99],
[55, 5, 77]])
np.concatenate((n1,n2), axis=0)
array([[99, 49, 30],
[36, 48, 70],
[18, 90, 77],
[82, 12, 2],
[40, 46, 99],
[55, 5, 77]])
np.concatenate((n1,n2), axis=1)
array([[99, 49, 30, 82, 12, 2],
[36, 48, 70, 40, 46, 99],
[18, 90, 77, 55, 5, 77]])
np.hstack((n1,n2))
array([[99, 49, 30, 82, 12, 2],
[36, 48, 70, 40, 46, 99],
[18, 90, 77, 55, 5, 77]])
np.vstack((n1,n2))
array([[99, 49, 30],
[36, 48, 70],
[18, 90, 77],
[82, 12, 2],
[40, 46, 99],
[55, 5, 77]])
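As the outputs above show, hstack and vstack produce exactly what concatenate gives along axis 1 and axis 0 respectively; a quick check:

```python
import numpy as np

n1 = np.random.randint(0, 100, size=(3, 3))
n2 = np.random.randint(0, 100, size=(3, 3))

# hstack is concatenate along axis=1; vstack is concatenate along axis=0
assert np.array_equal(np.hstack((n1, n2)), np.concatenate((n1, n2), axis=1))
assert np.array_equal(np.vstack((n1, n2)), np.concatenate((n1, n2), axis=0))
```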
To make the examples easier to follow, we first define a helper function that generates a DataFrame:
def make_df(index, cols):
    df = DataFrame({col: [col + str(i) for i in index] for col in cols})
    df.index = index
    return df
make_df([1,2,3,4], list('ABCD'))
A | B | C | D | |
---|---|---|---|---|
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
1. Concatenation with pd.concat()
pandas provides the pd.concat function, which works like np.concatenate but takes a few extra parameters:
objs
axis=0
join='outer'
join_axes=None
ignore_index=False
keys = [value1,value2...]
1) Simple concatenation
df1 = make_df([1,2,3,4], list('ABCD'))
df2 = make_df([1,2,3,4], list('ABCD'))
Like np.concatenate, rows are stacked by default (axis=0).
# axis=0 (default): stack rows
pd.concat((df1,df2))
A | B | C | D | |
---|---|---|---|---|
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
pd.concat((df1,df2), axis=1, ignore_index=True)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | D1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 | A4 | B4 | C4 | D4 |
The concatenation direction can be changed by setting axis.
Concatenation places elements that share an index label along that direction in the same row (or column); index/columns labels may repeat in the result.
Alternatively, pass ignore_index=True to discard the original labels and renumber.
pd.concat((df1,df2), ignore_index=True)
A | B | C | D | |
---|---|---|---|---|
0 | A1 | B1 | C1 | D1 |
1 | A2 | B2 | C2 | D2 |
2 | A3 | B3 | C3 | D3 |
3 | A4 | B4 | C4 | D4 |
4 | A1 | B1 | C1 | D1 |
5 | A2 | B2 | C2 | D2 |
6 | A3 | B3 | C3 | D3 |
7 | A4 | B4 | C4 | D4 |
Or build a hierarchical index with keys:
pd.concat([x, y], keys=['x', 'y'])
pd.concat((df1,df2), keys=['df1', 'df2'])
A | B | C | D | ||
---|---|---|---|---|---|
df1 | 1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 | |
3 | A3 | B3 | C3 | D3 | |
4 | A4 | B4 | C4 | D4 | |
df2 | 1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 | |
3 | A3 | B3 | C3 | D3 | |
4 | A4 | B4 | C4 | D4 |
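A practical payoff of keys= is that the resulting MultiIndex lets you recover each source frame with .loc; a small sketch using the make_df helper defined above:

```python
import pandas as pd
from pandas import DataFrame

def make_df(index, cols):
    df = DataFrame({col: [col + str(i) for i in index] for col in cols})
    df.index = index
    return df

df1 = make_df([1, 2, 3, 4], list('ABCD'))
df2 = make_df([1, 2, 3, 4], list('ABCD'))

# keys= builds a hierarchical (MultiIndex) row index;
# selecting on the outer level recovers the original frame
result = pd.concat((df1, df2), keys=['df1', 'df2'])
sub = result.loc['df1']
```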
============================================
Exercise 13:
- Think of some use cases for concatenation.
- Using yesterday's material, build a midterm grade table ddd for 张三 and 李四.
- Suppose a new exam subject, 计算机 (computer science), is added; how would you implement that?
- A new student, 王老五, takes the exam; how would you add his scores?
============================================
data = np.random.randint(0,150, size=(4,3))
index = ['张三', '李四', '王五', '赵六']
columns = ['语文', '数学', '英语']
df = DataFrame(index=index, data=data, columns=columns)
df
语文 | 数学 | 英语 | |
---|---|---|---|
张三 | 29 | 115 | 36 |
李四 | 73 | 109 | 100 |
王五 | 79 | 140 | 71 |
赵六 | 9 | 21 | 77 |
# Option 1: assign a new column directly, e.g. df['计算机'] = np.random.randint(0, 150, size=4)
# Option 2 (used below): build a separate DataFrame and concatenate along axis=1
data = np.random.randint(0,150, size=(4,1))
index = ['张三', '李四', '王五', '赵六']
columns = ['计算机']
computer = DataFrame(index=index, data=data, columns=columns)
computer
计算机 | |
---|---|
张三 | 134 |
李四 | 69 |
王五 | 117 |
赵六 | 21 |
df = pd.concat((df, computer), axis=1)
data = np.random.randint(0,150, size=(1,4))
index = ['老王']
columns = ['语文', '数学', '英语', '计算机']
wang = DataFrame(index=index, data=data, columns=columns)
wang
语文 | 数学 | 英语 | 计算机 | |
---|---|---|---|---|
老王 | 55 | 134 | 21 | 19 |
pd.concat((df, wang))
语文 | 数学 | 英语 | 计算机 | |
---|---|---|---|---|
张三 | 29 | 115 | 36 | 134 |
李四 | 73 | 109 | 100 | 69 |
王五 | 79 | 140 | 71 | 117 |
赵六 | 9 | 21 | 77 | 21 |
老王 | 55 | 134 | 21 | 19 |
2) Concatenation with mismatched indexes
"Mismatched" means the index labels along the concatenated dimension differ: the column labels differ when concatenating vertically, or the row labels differ when concatenating horizontally.
df1
A | B | C | D | |
---|---|---|---|---|
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
df2 = make_df([2,3,4,5], list('BCDE'))
df2
B | C | D | E | |
---|---|---|---|---|
2 | B2 | C2 | D2 | E2 |
3 | B3 | C3 | D3 | E3 |
4 | B4 | C4 | D4 | E4 |
5 | B5 | C5 | D5 | E5 |
# Mismatched concatenation: matching labels are aligned; missing entries are filled with NaN
pd.concat((df1, df2), sort=True)
A | B | C | D | E | |
---|---|---|---|---|---|
1 | A1 | B1 | C1 | D1 | NaN |
2 | A2 | B2 | C2 | D2 | NaN |
3 | A3 | B3 | C3 | D3 | NaN |
4 | A4 | B4 | C4 | D4 | NaN |
2 | NaN | B2 | C2 | D2 | E2 |
3 | NaN | B3 | C3 | D3 | E3 |
4 | NaN | B4 | C4 | D4 | E4 |
5 | NaN | B5 | C5 | D5 | E5 |
pd.concat((df1, df2), sort=True, axis=1)
A | B | C | D | B | C | D | E | |
---|---|---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | B2 | C2 | D2 | E2 |
3 | A3 | B3 | C3 | D3 | B3 | C3 | D3 | E3 |
4 | A4 | B4 | C4 | D4 | B4 | C4 | D4 | E4 |
5 | NaN | NaN | NaN | NaN | B5 | C5 | D5 | E5 |
There are three join modes:
- Outer join: fill missing entries with NaN (the default)
pd.concat((df1, df2), sort=True, axis=1, join='inner')  # inner join
A | B | C | D | B | C | D | E | |
---|---|---|---|---|---|---|---|---|
2 | A2 | B2 | C2 | D2 | B2 | C2 | D2 | E2 |
3 | A3 | B3 | C3 | D3 | B3 | C3 | D3 | E3 |
4 | A4 | B4 | C4 | D4 | B4 | C4 | D4 | E4 |
- Inner join: keep only the matching labels
- Join along a specified axis: join_axes
pd.concat((df1, df2), sort=True, axis=1, join_axes=[df1.index])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
"""Entry point for launching an IPython kernel.
A | B | C | D | B | C | D | E | |
---|---|---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | B2 | C2 | D2 | E2 |
3 | A3 | B3 | C3 | D3 | B3 | C3 | D3 | E3 |
4 | A4 | B4 | C4 | D4 | B4 | C4 | D4 | E4 |
pd.concat((df1, df2), sort=True, axis=1, join_axes=[df2.index])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
"""Entry point for launching an IPython kernel.
A | B | C | D | B | C | D | E | |
---|---|---|---|---|---|---|---|---|
2 | A2 | B2 | C2 | D2 | B2 | C2 | D2 | E2 |
3 | A3 | B3 | C3 | D3 | B3 | C3 | D3 | E3 |
4 | A4 | B4 | C4 | D4 | B4 | C4 | D4 | E4 |
5 | NaN | NaN | NaN | NaN | B5 | C5 | D5 | E5 |
pd.concat((df1, df2), sort=True, axis=0, join_axes=[df2.columns])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
"""Entry point for launching an IPython kernel.
B | C | D | E | |
---|---|---|---|---|
1 | B1 | C1 | D1 | NaN |
2 | B2 | C2 | D2 | NaN |
3 | B3 | C3 | D3 | NaN |
4 | B4 | C4 | D4 | NaN |
2 | B2 | C2 | D2 | E2 |
3 | B3 | C3 | D3 | E3 |
4 | B4 | C4 | D4 | E4 |
5 | B5 | C5 | D5 | E5 |
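Since join_axes is deprecated (as the FutureWarning above notes), the same result can be obtained with .reindex on the concatenated frame; a sketch using the make_df helper from above:

```python
import pandas as pd
from pandas import DataFrame

def make_df(index, cols):
    df = DataFrame({col: [col + str(i) for i in index] for col in cols})
    df.index = index
    return df

df1 = make_df([1, 2, 3, 4], list('ABCD'))
df2 = make_df([2, 3, 4, 5], list('BCDE'))

# Equivalent of pd.concat(..., axis=1, join_axes=[df1.index]):
# concatenate with the default outer join, then restrict rows to df1's index
result = pd.concat((df1, df2), axis=1, sort=True).reindex(df1.index)
```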
============================================
Exercise 14:
Suppose the final-exam table ddd2 has no scores for 张三, only for 李四, 王老五, and 赵小六; concatenate it with df using several methods.
============================================
data = np.random.randint(0,150, size=(4,3))
index = ['张三', '李四', '王老五', '赵小六']
columns = ['语文', '数学', '英语']
df = DataFrame(index=index, data=data, columns=columns)
df
语文 | 数学 | 英语 | |
---|---|---|---|
张三 | 55 | 111 | 131 |
李四 | 133 | 122 | 40 |
王老五 | 67 | 12 | 33 |
赵小六 | 12 | 140 | 101 |
data = np.random.randint(0,150, size=(3,3))
index = ['李四', '王老五', '赵小六']
columns = ['语文', '数学', '英语']
ddd = DataFrame(index=index, data=data, columns=columns)
ddd
语文 | 数学 | 英语 | |
---|---|---|---|
李四 | 52 | 23 | 124 |
王老五 | 33 | 106 | 36 |
赵小六 | 58 | 59 | 58 |
# stack rows
pd.concat((df, ddd), keys=['期中', '期末'])
语文 | 数学 | 英语 | ||
---|---|---|---|---|
期中 | 张三 | 55 | 111 | 131 |
李四 | 133 | 122 | 40 | |
王老五 | 67 | 12 | 33 | |
赵小六 | 12 | 140 | 101 | |
期末 | 李四 | 52 | 23 | 124 |
王老五 | 33 | 106 | 36 | |
赵小六 | 58 | 59 | 58 |
pd.concat((df, ddd), keys=['期中', '期末'], axis=1, sort=True)
期中 | 期末 | |||||
---|---|---|---|---|---|---|
语文 | 数学 | 英语 | 语文 | 数学 | 英语 | |
张三 | 55 | 111 | 131 | NaN | NaN | NaN |
李四 | 133 | 122 | 40 | 52.0 | 23.0 | 124.0 |
王老五 | 67 | 12 | 33 | 33.0 | 106.0 | 36.0 |
赵小六 | 12 | 140 | 101 | 58.0 | 59.0 | 58.0 |
pd.concat((df, ddd), keys=['期中', '期末'], axis=1, sort=True, join_axes=[ddd.index])
d:\1903\.venv\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
"""Entry point for launching an IPython kernel.
期中 | 期末 | |||||
---|---|---|---|---|---|---|
语文 | 数学 | 英语 | 语文 | 数学 | 英语 | |
李四 | 133 | 122 | 40 | 52 | 23 | 124 |
王老五 | 67 | 12 | 33 | 33 | 106 | 36 |
赵小六 | 12 | 140 | 101 | 58 | 59 | 58 |
3) Appending with append()
Appending at the end is so common that there is a dedicated function, append, for it.
Note: concat is a function in the pd namespace, while append is a method on the DataFrame object.
df.append(ddd, verify_integrity=False)
语文 | 数学 | 英语 | |
---|---|---|---|
张三 | 55 | 111 | 131 |
李四 | 133 | 122 | 40 |
王老五 | 67 | 12 | 33 |
赵小六 | 12 | 140 | 101 |
李四 | 52 | 23 | 124 |
王老五 | 33 | 106 | 36 |
赵小六 | 58 | 59 | 58 |
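Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; in current versions the same row-wise append is written with pd.concat. A minimal sketch:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame(np.arange(6).reshape(2, 3), columns=list('ABC'))
extra = DataFrame(np.arange(6, 12).reshape(2, 3), columns=list('ABC'))

# old style (removed in pandas 2.0): df.append(extra, ignore_index=True)
# current equivalent:
result = pd.concat((df, extra), ignore_index=True)
```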
============================================
Exercise 15:
Create a final-exam grade table ddd3 containing only 张三, 李四, and 王老五, and use append() to concatenate it with the midterm table ddd.
============================================
Summary:
pandas concatenation: pd.concat((df1, df2, df3, ...), ignore_index=<discard the original index>, keys=<label each source frame>, join=<'inner' or 'outer'>, join_axes=<restrict the result to a given axis (deprecated)>)
DataFrame.append(other, ignore_index, verify_integrity, sort)
2. Merging with pd.merge()
merge differs from concat in that it combines frames based on a shared column.
pd.merge() automatically uses the column whose name appears in both frames as the key.
Note that the values in the key column need not appear in the same order.
1) One-to-one merge
df1 = DataFrame({'name':['张三','李四','Chales'],'id':[1,2,3],'age':[22,21,25]})
df2 = DataFrame({'sex':['男','男','女'],'id':[2,3,4],'group':['sale','search','service']})
display(df1, df2)
name | id | age | |
---|---|---|---|
0 | 张三 | 1 | 22 |
1 | 李四 | 2 | 21 |
2 | Chales | 3 | 25 |
sex | id | group | |
---|---|---|---|
0 | 男 | 2 | sale |
1 | 男 | 3 | search |
2 | 女 | 4 | service |
pd.merge(df1, df2)
name | id | age | sex | group | |
---|---|---|---|---|---|
0 | 李四 | 2 | 21 | 男 | sale |
1 | Chales | 3 | 25 | 男 | search |
df1.merge(df2)
name | id | age | sex | group | |
---|---|---|---|---|---|
0 | 李四 | 2 | 21 | 男 | sale |
1 | Chales | 3 | 25 | 男 | search |
2) Many-to-one merge
df1 = DataFrame({'name':['张三','李四','Chales'],'id':[1,2,2],'age':[22,21,25]})
df2 = DataFrame({'sex':['男','男','女'],'id':[2,3,4],'group':['sale','search','service']})
display(df1,df2)
name | id | age | |
---|---|---|---|
0 | 张三 | 1 | 22 |
1 | 李四 | 2 | 21 |
2 | Chales | 2 | 25 |
sex | id | group | |
---|---|---|---|
0 | 男 | 2 | sale |
1 | 男 | 3 | search |
2 | 女 | 4 | service |
pd.merge(df1,df2)
name | id | age | sex | group | |
---|---|---|---|---|---|
0 | 李四 | 2 | 21 | 男 | sale |
1 | Chales | 2 | 25 | 男 | sale |
3) Many-to-many merge
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'sex':['男','男','女'],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
sex | name | group | |
---|---|---|---|
0 | 男 | 张三 | sale |
1 | 男 | 张三 | search |
2 | 女 | 凡凡 | service |
pd.merge(df1,df2)
name | salary | age | sex | group | |
---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 男 | sale |
1 | 张三 | 10000 | 22 | 男 | search |
2 | 张三 | 20000 | 25 | 男 | sale |
3 | 张三 | 20000 | 25 | 男 | search |
4) Specifying the key
- Use on= to state explicitly which column is the key; needed when the frames share more than one column name
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'age':[21,18,29],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
age | name | group | |
---|---|---|---|
0 | 21 | 张三 | sale |
1 | 18 | 张三 | search |
2 | 29 | 凡凡 | service |
# When several column names are shared, use on= to choose the merge key
pd.merge(df1, df2, on='name')
name | salary | age_x | age_y | group | |
---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21 | sale |
1 | 张三 | 10000 | 22 | 18 | search |
2 | 张三 | 20000 | 25 | 21 | sale |
3 | 张三 | 20000 | 25 | 18 | search |
- Use left_on and right_on to name the key column on each side; needed when the two frames share no column name
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'年龄':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1,df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
年龄 | 名字 | group | |
---|---|---|---|
0 | 21 | 张三 | sale |
1 | 18 | 张三 | search |
2 | 29 | 凡凡 | service |
pd.merge(df1, df2, left_on='name', right_on='名字')
name | salary | age | 年龄 | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21 | 张三 | sale |
1 | 张三 | 10000 | 22 | 18 | 张三 | search |
2 | 张三 | 20000 | 25 | 21 | 张三 | sale |
3 | 张三 | 20000 | 25 | 18 | 张三 | search |
When a column on the left matches the index on the right, use right_index=True.
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'年龄':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']},
index = [22,21,25])
display(df1,df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
年龄 | 名字 | group | |
---|---|---|---|
22 | 21 | 张三 | sale |
21 | 18 | 张三 | search |
25 | 29 | 凡凡 | service |
pd.merge(df1, df2, left_on='age', right_index=True)
name | salary | age | 年龄 | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21 | 张三 | sale |
1 | 李四 | 12000 | 21 | 18 | 张三 | search |
2 | 张三 | 20000 | 25 | 29 | 凡凡 | service |
pd.merge(df2, df1, left_index=True, right_on='age')
年龄 | 名字 | group | name | salary | age | |
---|---|---|---|---|---|---|
0 | 21 | 张三 | sale | 张三 | 10000 | 22 |
1 | 18 | 张三 | search | 李四 | 12000 | 21 |
2 | 29 | 凡凡 | service | 张三 | 20000 | 25 |
============================================
Exercise 16:
- Suppose there are two grade tables: ddd for 张三, 李四, and 王老五, and ddd4 for 张三 and 赵小六. How would you merge them?
- What if 张三's name in ddd4 had been mistyped as 张十三?
- Practice the many-to-one and many-to-many cases on your own.
- Study left_index and right_index on your own.
============================================
5) Inner and outer merges
- Inner merge: keep only the keys present in both frames (the default)
df1 = DataFrame({'name':['张三','李四','张三'],'salary':[10000,12000,20000],'age':[22,21,25]})
df2 = DataFrame({'age':[21,18,29],'名字':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1, df2)
name | salary | age | |
---|---|---|---|
0 | 张三 | 10000 | 22 |
1 | 李四 | 12000 | 21 |
2 | 张三 | 20000 | 25 |
age | 名字 | group | |
---|---|---|---|
0 | 21 | 张三 | sale |
1 | 18 | 张三 | search |
2 | 29 | 凡凡 | service |
# the default is an inner merge
pd.merge(df1, df2, left_on='name', right_on='名字')
name | salary | age_x | age_y | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21 | 张三 | sale |
1 | 张三 | 10000 | 22 | 18 | 张三 | search |
2 | 张三 | 20000 | 25 | 21 | 张三 | sale |
3 | 张三 | 20000 | 25 | 18 | 张三 | search |
- Outer merge, how='outer': fill missing entries with NaN
pd.merge(df1, df2, left_on='name', right_on='名字', how='outer')
name | salary | age_x | age_y | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000.0 | 22.0 | 21.0 | 张三 | sale |
1 | 张三 | 10000.0 | 22.0 | 18.0 | 张三 | search |
2 | 张三 | 20000.0 | 25.0 | 21.0 | 张三 | sale |
3 | 张三 | 20000.0 | 25.0 | 18.0 | 张三 | search |
4 | 李四 | 12000.0 | 21.0 | NaN | NaN | NaN |
5 | NaN | NaN | NaN | 29.0 | 凡凡 | service |
- Left and right merges: how='left', how='right'
pd.merge(df1, df2, left_on='name', right_on='名字', how='left')
name | salary | age_x | age_y | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000 | 22 | 21.0 | 张三 | sale |
1 | 张三 | 10000 | 22 | 18.0 | 张三 | search |
2 | 李四 | 12000 | 21 | NaN | NaN | NaN |
3 | 张三 | 20000 | 25 | 21.0 | 张三 | sale |
4 | 张三 | 20000 | 25 | 18.0 | 张三 | search |
# right merge
pd.merge(df1, df2, left_on='name', right_on='名字', how='right')
name | salary | age_x | age_y | 名字 | group | |
---|---|---|---|---|---|---|
0 | 张三 | 10000.0 | 22.0 | 21 | 张三 | sale |
1 | 张三 | 20000.0 | 25.0 | 21 | 张三 | sale |
2 | 张三 | 10000.0 | 22.0 | 18 | 张三 | search |
3 | 张三 | 20000.0 | 25.0 | 18 | 张三 | search |
4 | NaN | NaN | NaN | 29 | 凡凡 | service |
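When choosing between these merge modes, it often helps to see where each row came from. pd.merge accepts indicator=True, which adds a _merge column marking each row as 'left_only', 'right_only', or 'both'; a small sketch:

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'name': ['张三', '李四'], 'salary': [10000, 12000]})
df2 = DataFrame({'name': ['张三', '凡凡'], 'group': ['sale', 'service']})

# indicator=True adds a _merge column showing each row's origin
result = pd.merge(df1, df2, on='name', how='outer', indicator=True)
```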
============================================
Exercise 17:
- Consider realistic scenarios and merge ddd with ddd4 in several ways.
============================================
6) Resolving column conflicts
When columns conflict, i.e. several column names are shared, use on= to choose the key column, together with suffixes= to rename the conflicting columns.
# midterm exam
df1 = DataFrame({'name':['张三','李四','张三'],'degree':[120,118,149],'age':[22,21,25]})
# final exam
df2 = DataFrame({'degree':[99,97,129],'name':['张三','张三','凡凡'],'group':['sale','search','service']})
display(df1, df2)
name | degree | age | |
---|---|---|---|
0 | 张三 | 120 | 22 |
1 | 李四 | 118 | 21 |
2 | 张三 | 149 | 25 |
degree | name | group | |
---|---|---|---|
0 | 99 | 张三 | sale |
1 | 97 | 张三 | search |
2 | 129 | 凡凡 | service |
# suffixes: labels appended to the conflicting column names
pd.merge(df1, df2, on='name', suffixes=['_df1', '_df2'])
name | degree_df1 | age | degree_df2 | group | |
---|---|---|---|---|---|
0 | 张三 | 120 | 22 | 99 | sale |
1 | 张三 | 120 | 22 | 97 | search |
2 | 张三 | 149 | 25 | 99 | sale |
3 | 张三 | 149 | 25 | 97 | search |
You can supply your own suffixes with suffixes=.
============================================
Exercise 18:
Suppose two students are both named 李四, and ddd5 and ddd6 are both grade tables for 张三 and 李四. How would you merge them?
============================================
pd.merge(how=<merge mode: inner, outer, left, right>,
         on=<column to merge on>,
         left_on=<key column of the left frame>,
         right_on=<key column of the right frame>,
         left_index/right_index=<use the index as the merge key>,
         suffixes=<suffixes for identically named columns>)
Suppose there are two grade tables: ddd for 张三, 李四, and 王老五, and ddd4 for 张三 and 赵小六. How would you merge them?
What if 张三's name in ddd4 had been mistyped as 张十三?
# use the students' names as column names
index = ['语文', '数学', '英语']
columns = ['张三', '李四', '王老五']
data = np.random.randint(0,150, size=(3,3))
ddd = DataFrame(data=data, index=index, columns=columns)
ddd
张三 | 李四 | 王老五 | |
---|---|---|---|
语文 | 97 | 99 | 63 |
数学 | 116 | 110 | 14 |
英语 | 6 | 108 | 34 |
index = ['语文', '数学', '英语']
columns = ['张三', '赵小六']
data = np.random.randint(0,150, size=(3,2))
ddd4 = DataFrame(data=data, index=index, columns=columns)
ddd4
张三 | 赵小六 | |
---|---|---|
语文 | 57 | 82 |
数学 | 139 | 47 |
英语 | 120 | 77 |
ddd4['张三'] = ddd['张三']
ddd4
张三 | 赵小六 | |
---|---|---|
语文 | 97 | 82 |
数学 | 116 | 47 |
英语 | 6 | 77 |
# use the right frame's index
result = pd.merge(ddd, ddd4, on='张三', left_index=True)
result
张三 | 李四 | 王老五 | 赵小六 | |
---|---|---|---|---|
语文 | 97 | 99 | 63 | 82 |
数学 | 116 | 110 | 14 | 47 |
英语 | 6 | 108 | 34 | 77 |
ddd4.rename({'张三': '张十三'}, axis=1, inplace=True)
ddd4
张十三 | 赵小六 | |
---|---|---|
语文 | 97 | 82 |
数学 | 116 | 47 |
英语 | 6 | 77 |
result = pd.merge(ddd, ddd4, left_on='张三', right_on='张十三', left_index=True)
result.drop(columns='张十三', inplace=True)
result
张三 | 李四 | 王老五 | 赵小六 | |
---|---|---|---|---|
语文 | 97 | 99 | 63 | 82 |
数学 | 116 | 110 | 14 | 47 |
英语 | 6 | 108 | 34 | 77 |
Homework
3. Case study: analyzing US state population data
Background knowledge for the homework
# unique(): return the distinct values
s = Series(['Tom','Lucy','Tom','dancer','Lucy'])
s.unique()
array(['Tom', 'Lucy', 'dancer'], dtype=object)
n = DataFrame({'name':['Tom','Lucy','Tom','dancer','Lucy'],'age':[12,13,12,11,15]})
n
name | age | |
---|---|---|
0 | Tom | 12 |
1 | Lucy | 13 |
2 | Tom | 12 |
3 | dancer | 11 |
4 | Lucy | 15 |
# query the rows where the name is Lucy and the age is greater than 13
n.query('name=="Lucy" & age>13')
name | age | |
---|---|---|
4 | Lucy | 15 |
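query() can also reference local Python variables by prefixing them with @; a short sketch:

```python
from pandas import DataFrame

n = DataFrame({'name': ['Tom', 'Lucy', 'Tom', 'dancer', 'Lucy'],
               'age': [12, 13, 12, 11, 15]})

# @ pulls a variable from the surrounding Python scope into the query string
min_age = 13
result = n.query('age > @min_age')
```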
First load the files and inspect a sample of the data.
abb = pd.read_csv('../data/state-abbrevs.csv')
abb.head()
state | abbreviation | |
---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
2 | Arizona | AZ |
3 | Arkansas | AR |
4 | California | CA |
pop = pd.read_csv('../data/state-population.csv')
pop.head()
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
areas = pd.read_csv('../data/state-areas.csv')
areas.head()
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
Merge the pop and abbrevs DataFrames, keyed on the state/region column and the abbreviation column respectively.
Use an outer merge so that no information is lost.
abb_pop = pd.merge(abb, pop, left_on='abbreviation', right_on='state/region', how='outer')
Drop the abbreviation column (axis=1).
abb_pop.drop(columns='abbreviation', inplace=True)
Check which columns contain missing data.
With .isnull().any(), a column shows True if it contains even a single missing value.
abb_pop.isnull().any()
state True
state/region False
ages False
year False
population True
dtype: bool
Inspect the missing data.
abb_pop
state | state/region | ages | year | population | |
---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 |
5 | Alabama | AL | total | 2011 | 4801627.0 |
6 | Alabama | AL | total | 2009 | 4757938.0 |
7 | Alabama | AL | under18 | 2009 | 1134192.0 |
8 | Alabama | AL | under18 | 2013 | 1111481.0 |
9 | Alabama | AL | total | 2013 | 4833722.0 |
10 | Alabama | AL | total | 2007 | 4672840.0 |
11 | Alabama | AL | under18 | 2007 | 1132296.0 |
12 | Alabama | AL | total | 2008 | 4718206.0 |
13 | Alabama | AL | under18 | 2008 | 1134927.0 |
14 | Alabama | AL | total | 2005 | 4569805.0 |
15 | Alabama | AL | under18 | 2005 | 1117229.0 |
16 | Alabama | AL | total | 2006 | 4628981.0 |
17 | Alabama | AL | under18 | 2006 | 1126798.0 |
18 | Alabama | AL | total | 2004 | 4530729.0 |
19 | Alabama | AL | under18 | 2004 | 1113662.0 |
20 | Alabama | AL | total | 2003 | 4503491.0 |
21 | Alabama | AL | under18 | 2003 | 1113083.0 |
22 | Alabama | AL | total | 2001 | 4467634.0 |
23 | Alabama | AL | under18 | 2001 | 1120409.0 |
24 | Alabama | AL | total | 2002 | 4480089.0 |
25 | Alabama | AL | under18 | 2002 | 1116590.0 |
26 | Alabama | AL | under18 | 1999 | 1121287.0 |
27 | Alabama | AL | total | 1999 | 4430141.0 |
28 | Alabama | AL | total | 2000 | 4452173.0 |
29 | Alabama | AL | under18 | 2000 | 1122273.0 |
... | ... | ... | ... | ... | ... |
2514 | NaN | USA | under18 | 1999 | 71946051.0 |
2515 | NaN | USA | total | 2000 | 282162411.0 |
2516 | NaN | USA | under18 | 2000 | 72376189.0 |
2517 | NaN | USA | total | 1999 | 279040181.0 |
2518 | NaN | USA | total | 2001 | 284968955.0 |
2519 | NaN | USA | under18 | 2001 | 72671175.0 |
2520 | NaN | USA | total | 2002 | 287625193.0 |
2521 | NaN | USA | under18 | 2002 | 72936457.0 |
2522 | NaN | USA | total | 2003 | 290107933.0 |
2523 | NaN | USA | under18 | 2003 | 73100758.0 |
2524 | NaN | USA | total | 2004 | 292805298.0 |
2525 | NaN | USA | under18 | 2004 | 73297735.0 |
2526 | NaN | USA | total | 2005 | 295516599.0 |
2527 | NaN | USA | under18 | 2005 | 73523669.0 |
2528 | NaN | USA | total | 2006 | 298379912.0 |
2529 | NaN | USA | under18 | 2006 | 73757714.0 |
2530 | NaN | USA | total | 2007 | 301231207.0 |
2531 | NaN | USA | under18 | 2007 | 74019405.0 |
2532 | NaN | USA | total | 2008 | 304093966.0 |
2533 | NaN | USA | under18 | 2008 | 74104602.0 |
2534 | NaN | USA | under18 | 2013 | 73585872.0 |
2535 | NaN | USA | total | 2013 | 316128839.0 |
2536 | NaN | USA | total | 2009 | 306771529.0 |
2537 | NaN | USA | under18 | 2009 | 74134167.0 |
2538 | NaN | USA | under18 | 2010 | 74119556.0 |
2539 | NaN | USA | total | 2010 | 309326295.0 |
2540 | NaN | USA | under18 | 2011 | 73902222.0 |
2541 | NaN | USA | total | 2011 | 311582564.0 |
2542 | NaN | USA | under18 | 2012 | 73708179.0 |
2543 | NaN | USA | total | 2012 | 313873685.0 |
2544 rows × 5 columns
abb_pop.isnull().any(axis=1)
0 False
1 False
2 False
3 False
4 False
...
2539 True
2540 True
2541 True
2542 True
2543 True
Length: 2544, dtype: bool
abb_pop[abb_pop.isnull().any(axis=1)]
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | NaN | PR | under18 | 1990 | NaN |
2449 | NaN | PR | total | 1990 | NaN |
2450 | NaN | PR | total | 1991 | NaN |
2451 | NaN | PR | under18 | 1991 | NaN |
2452 | NaN | PR | total | 1993 | NaN |
2453 | NaN | PR | under18 | 1993 | NaN |
2454 | NaN | PR | under18 | 1992 | NaN |
2455 | NaN | PR | total | 1992 | NaN |
2456 | NaN | PR | under18 | 1994 | NaN |
2457 | NaN | PR | total | 1994 | NaN |
2458 | NaN | PR | total | 1995 | NaN |
2459 | NaN | PR | under18 | 1995 | NaN |
2460 | NaN | PR | under18 | 1996 | NaN |
2461 | NaN | PR | total | 1996 | NaN |
2462 | NaN | PR | under18 | 1998 | NaN |
2463 | NaN | PR | total | 1998 | NaN |
2464 | NaN | PR | total | 1997 | NaN |
2465 | NaN | PR | under18 | 1997 | NaN |
2466 | NaN | PR | total | 1999 | NaN |
2467 | NaN | PR | under18 | 1999 | NaN |
2468 | NaN | PR | total | 2000 | 3810605.0 |
2469 | NaN | PR | under18 | 2000 | 1089063.0 |
2470 | NaN | PR | total | 2001 | 3818774.0 |
2471 | NaN | PR | under18 | 2001 | 1077566.0 |
2472 | NaN | PR | total | 2002 | 3823701.0 |
2473 | NaN | PR | under18 | 2002 | 1065051.0 |
2474 | NaN | PR | total | 2004 | 3826878.0 |
2475 | NaN | PR | under18 | 2004 | 1035919.0 |
2476 | NaN | PR | total | 2003 | 3826095.0 |
2477 | NaN | PR | under18 | 2003 | 1050615.0 |
... | ... | ... | ... | ... | ... |
2514 | NaN | USA | under18 | 1999 | 71946051.0 |
2515 | NaN | USA | total | 2000 | 282162411.0 |
2516 | NaN | USA | under18 | 2000 | 72376189.0 |
2517 | NaN | USA | total | 1999 | 279040181.0 |
2518 | NaN | USA | total | 2001 | 284968955.0 |
2519 | NaN | USA | under18 | 2001 | 72671175.0 |
2520 | NaN | USA | total | 2002 | 287625193.0 |
2521 | NaN | USA | under18 | 2002 | 72936457.0 |
2522 | NaN | USA | total | 2003 | 290107933.0 |
2523 | NaN | USA | under18 | 2003 | 73100758.0 |
2524 | NaN | USA | total | 2004 | 292805298.0 |
2525 | NaN | USA | under18 | 2004 | 73297735.0 |
2526 | NaN | USA | total | 2005 | 295516599.0 |
2527 | NaN | USA | under18 | 2005 | 73523669.0 |
2528 | NaN | USA | total | 2006 | 298379912.0 |
2529 | NaN | USA | under18 | 2006 | 73757714.0 |
2530 | NaN | USA | total | 2007 | 301231207.0 |
2531 | NaN | USA | under18 | 2007 | 74019405.0 |
2532 | NaN | USA | total | 2008 | 304093966.0 |
2533 | NaN | USA | under18 | 2008 | 74104602.0 |
2534 | NaN | USA | under18 | 2013 | 73585872.0 |
2535 | NaN | USA | total | 2013 | 316128839.0 |
2536 | NaN | USA | total | 2009 | 306771529.0 |
2537 | NaN | USA | under18 | 2009 | 74134167.0 |
2538 | NaN | USA | under18 | 2010 | 74119556.0 |
2539 | NaN | USA | total | 2010 | 309326295.0 |
2540 | NaN | USA | under18 | 2011 | 73902222.0 |
2541 | NaN | USA | total | 2011 | 311582564.0 |
2542 | NaN | USA | under18 | 2012 | 73708179.0 |
2543 | NaN | USA | total | 2012 | 313873685.0 |
96 rows × 5 columns
Filter the rows by missingness: rows where any value is missing (True) are displayed.
Find which state/region values leave state as NaN; use unique() to list the distinct values.
abb_pop[abb_pop.isnull().any(axis=1)]['state/region'].unique()
array(['PR', 'USA'], dtype=object)
Fill in the correct state value for those state/region entries, removing every NaN from the state column!
Remember this technique for cleaning missing (NaN) data!
cond = abb_pop[abb_pop.isnull().any(axis=1)]['state/region'] == 'PR'
abb_pop[abb_pop.isnull().any(axis=1)]['state'][cond]= 'Puerto Rico'
d:\1903\.venv\lib\site-packages\pandas\core\generic.py:9116: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._update_inplace(new_data)
d:\1903\.venv\lib\site-packages\IPython\core\interactiveshell.py:3326: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
exec(code_obj, self.user_global_ns, self.user_ns)
abb_pop[abb_pop.isnull().any(axis=1)][cond]
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | NaN | PR | under18 | 1990 | NaN |
2449 | NaN | PR | total | 1990 | NaN |
2450 | NaN | PR | total | 1991 | NaN |
2451 | NaN | PR | under18 | 1991 | NaN |
2452 | NaN | PR | total | 1993 | NaN |
2453 | NaN | PR | under18 | 1993 | NaN |
2454 | NaN | PR | under18 | 1992 | NaN |
2455 | NaN | PR | total | 1992 | NaN |
2456 | NaN | PR | under18 | 1994 | NaN |
2457 | NaN | PR | total | 1994 | NaN |
2458 | NaN | PR | total | 1995 | NaN |
2459 | NaN | PR | under18 | 1995 | NaN |
2460 | NaN | PR | under18 | 1996 | NaN |
2461 | NaN | PR | total | 1996 | NaN |
2462 | NaN | PR | under18 | 1998 | NaN |
2463 | NaN | PR | total | 1998 | NaN |
2464 | NaN | PR | total | 1997 | NaN |
2465 | NaN | PR | under18 | 1997 | NaN |
2466 | NaN | PR | total | 1999 | NaN |
2467 | NaN | PR | under18 | 1999 | NaN |
2468 | NaN | PR | total | 2000 | 3810605.0 |
2469 | NaN | PR | under18 | 2000 | 1089063.0 |
2470 | NaN | PR | total | 2001 | 3818774.0 |
2471 | NaN | PR | under18 | 2001 | 1077566.0 |
2472 | NaN | PR | total | 2002 | 3823701.0 |
2473 | NaN | PR | under18 | 2002 | 1065051.0 |
2474 | NaN | PR | total | 2004 | 3826878.0 |
2475 | NaN | PR | under18 | 2004 | 1035919.0 |
2476 | NaN | PR | total | 2003 | 3826095.0 |
2477 | NaN | PR | under18 | 2003 | 1050615.0 |
2478 | NaN | PR | total | 2005 | 3821362.0 |
2479 | NaN | PR | under18 | 2005 | 1019447.0 |
2480 | NaN | PR | total | 2006 | 3805214.0 |
2481 | NaN | PR | under18 | 2006 | 998543.0 |
2482 | NaN | PR | total | 2007 | 3782995.0 |
2483 | NaN | PR | under18 | 2007 | 973613.0 |
2484 | NaN | PR | total | 2008 | 3760866.0 |
2485 | NaN | PR | under18 | 2008 | 945705.0 |
2486 | NaN | PR | under18 | 2013 | 814068.0 |
2487 | NaN | PR | total | 2013 | 3615086.0 |
2488 | NaN | PR | total | 2009 | 3740410.0 |
2489 | NaN | PR | under18 | 2009 | 920794.0 |
2490 | NaN | PR | total | 2010 | 3721208.0 |
2491 | NaN | PR | under18 | 2010 | 896945.0 |
2492 | NaN | PR | under18 | 2011 | 869327.0 |
2493 | NaN | PR | total | 2011 | 3686580.0 |
2494 | NaN | PR | under18 | 2012 | 841740.0 |
2495 | NaN | PR | total | 2012 | 3651545.0 |
abb_pop[abb_pop.isnull().any(axis=1)].loc[cond, 'state'] = 'Puerto Rico'
d:\1903\.venv\lib\site-packages\pandas\core\indexing.py:480: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[item] = s
abb_pop[abb_pop.isnull().any(axis=1)][cond]
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | NaN | PR | under18 | 1990 | NaN |
2449 | NaN | PR | total | 1990 | NaN |
2450 | NaN | PR | total | 1991 | NaN |
2451 | NaN | PR | under18 | 1991 | NaN |
2452 | NaN | PR | total | 1993 | NaN |
2453 | NaN | PR | under18 | 1993 | NaN |
2454 | NaN | PR | under18 | 1992 | NaN |
2455 | NaN | PR | total | 1992 | NaN |
2456 | NaN | PR | under18 | 1994 | NaN |
2457 | NaN | PR | total | 1994 | NaN |
2458 | NaN | PR | total | 1995 | NaN |
2459 | NaN | PR | under18 | 1995 | NaN |
2460 | NaN | PR | under18 | 1996 | NaN |
2461 | NaN | PR | total | 1996 | NaN |
2462 | NaN | PR | under18 | 1998 | NaN |
2463 | NaN | PR | total | 1998 | NaN |
2464 | NaN | PR | total | 1997 | NaN |
2465 | NaN | PR | under18 | 1997 | NaN |
2466 | NaN | PR | total | 1999 | NaN |
2467 | NaN | PR | under18 | 1999 | NaN |
2468 | NaN | PR | total | 2000 | 3810605.0 |
2469 | NaN | PR | under18 | 2000 | 1089063.0 |
2470 | NaN | PR | total | 2001 | 3818774.0 |
2471 | NaN | PR | under18 | 2001 | 1077566.0 |
2472 | NaN | PR | total | 2002 | 3823701.0 |
2473 | NaN | PR | under18 | 2002 | 1065051.0 |
2474 | NaN | PR | total | 2004 | 3826878.0 |
2475 | NaN | PR | under18 | 2004 | 1035919.0 |
2476 | NaN | PR | total | 2003 | 3826095.0 |
2477 | NaN | PR | under18 | 2003 | 1050615.0 |
2478 | NaN | PR | total | 2005 | 3821362.0 |
2479 | NaN | PR | under18 | 2005 | 1019447.0 |
2480 | NaN | PR | total | 2006 | 3805214.0 |
2481 | NaN | PR | under18 | 2006 | 998543.0 |
2482 | NaN | PR | total | 2007 | 3782995.0 |
2483 | NaN | PR | under18 | 2007 | 973613.0 |
2484 | NaN | PR | total | 2008 | 3760866.0 |
2485 | NaN | PR | under18 | 2008 | 945705.0 |
2486 | NaN | PR | under18 | 2013 | 814068.0 |
2487 | NaN | PR | total | 2013 | 3615086.0 |
2488 | NaN | PR | total | 2009 | 3740410.0 |
2489 | NaN | PR | under18 | 2009 | 920794.0 |
2490 | NaN | PR | total | 2010 | 3721208.0 |
2491 | NaN | PR | under18 | 2010 | 896945.0 |
2492 | NaN | PR | under18 | 2011 | 869327.0 |
2493 | NaN | PR | total | 2011 | 3686580.0 |
2494 | NaN | PR | under18 | 2012 | 841740.0 |
2495 | NaN | PR | total | 2012 | 3651545.0 |
temp = abb_pop[abb_pop.isnull().any(axis=1)].copy()
temp.loc[cond, 'state'] = 'Puerto Rico'
temp
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | Puerto Rico | PR | under18 | 1990 | NaN |
2449 | Puerto Rico | PR | total | 1990 | NaN |
2450 | Puerto Rico | PR | total | 1991 | NaN |
2451 | Puerto Rico | PR | under18 | 1991 | NaN |
2452 | Puerto Rico | PR | total | 1993 | NaN |
2453 | Puerto Rico | PR | under18 | 1993 | NaN |
2454 | Puerto Rico | PR | under18 | 1992 | NaN |
2455 | Puerto Rico | PR | total | 1992 | NaN |
2456 | Puerto Rico | PR | under18 | 1994 | NaN |
2457 | Puerto Rico | PR | total | 1994 | NaN |
2458 | Puerto Rico | PR | total | 1995 | NaN |
2459 | Puerto Rico | PR | under18 | 1995 | NaN |
2460 | Puerto Rico | PR | under18 | 1996 | NaN |
2461 | Puerto Rico | PR | total | 1996 | NaN |
2462 | Puerto Rico | PR | under18 | 1998 | NaN |
2463 | Puerto Rico | PR | total | 1998 | NaN |
2464 | Puerto Rico | PR | total | 1997 | NaN |
2465 | Puerto Rico | PR | under18 | 1997 | NaN |
2466 | Puerto Rico | PR | total | 1999 | NaN |
2467 | Puerto Rico | PR | under18 | 1999 | NaN |
2468 | Puerto Rico | PR | total | 2000 | 3810605.0 |
2469 | Puerto Rico | PR | under18 | 2000 | 1089063.0 |
2470 | Puerto Rico | PR | total | 2001 | 3818774.0 |
2471 | Puerto Rico | PR | under18 | 2001 | 1077566.0 |
2472 | Puerto Rico | PR | total | 2002 | 3823701.0 |
2473 | Puerto Rico | PR | under18 | 2002 | 1065051.0 |
2474 | Puerto Rico | PR | total | 2004 | 3826878.0 |
2475 | Puerto Rico | PR | under18 | 2004 | 1035919.0 |
2476 | Puerto Rico | PR | total | 2003 | 3826095.0 |
2477 | Puerto Rico | PR | under18 | 2003 | 1050615.0 |
... | ... | ... | ... | ... | ... |
2514 | NaN | USA | under18 | 1999 | 71946051.0 |
2515 | NaN | USA | total | 2000 | 282162411.0 |
2516 | NaN | USA | under18 | 2000 | 72376189.0 |
2517 | NaN | USA | total | 1999 | 279040181.0 |
2518 | NaN | USA | total | 2001 | 284968955.0 |
2519 | NaN | USA | under18 | 2001 | 72671175.0 |
2520 | NaN | USA | total | 2002 | 287625193.0 |
2521 | NaN | USA | under18 | 2002 | 72936457.0 |
2522 | NaN | USA | total | 2003 | 290107933.0 |
2523 | NaN | USA | under18 | 2003 | 73100758.0 |
2524 | NaN | USA | total | 2004 | 292805298.0 |
2525 | NaN | USA | under18 | 2004 | 73297735.0 |
2526 | NaN | USA | total | 2005 | 295516599.0 |
2527 | NaN | USA | under18 | 2005 | 73523669.0 |
2528 | NaN | USA | total | 2006 | 298379912.0 |
2529 | NaN | USA | under18 | 2006 | 73757714.0 |
2530 | NaN | USA | total | 2007 | 301231207.0 |
2531 | NaN | USA | under18 | 2007 | 74019405.0 |
2532 | NaN | USA | total | 2008 | 304093966.0 |
2533 | NaN | USA | under18 | 2008 | 74104602.0 |
2534 | NaN | USA | under18 | 2013 | 73585872.0 |
2535 | NaN | USA | total | 2013 | 316128839.0 |
2536 | NaN | USA | total | 2009 | 306771529.0 |
2537 | NaN | USA | under18 | 2009 | 74134167.0 |
2538 | NaN | USA | under18 | 2010 | 74119556.0 |
2539 | NaN | USA | total | 2010 | 309326295.0 |
2540 | NaN | USA | under18 | 2011 | 73902222.0 |
2541 | NaN | USA | total | 2011 | 311582564.0 |
2542 | NaN | USA | under18 | 2012 | 73708179.0 |
2543 | NaN | USA | total | 2012 | 313873685.0 |
96 rows × 5 columns
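Instead of assigning through a boolean mask, the missing state names could also be filled with `fillna()` plus a `map()` from the abbreviation column; a minimal sketch with hypothetical stand-in data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for abb_pop: same two columns, two missing names
df = pd.DataFrame({
    'state': [np.nan, np.nan, 'Alabama'],
    'state/region': ['PR', 'USA', 'AL'],
})
# Map each abbreviation to its full name, then fill only the NaN entries
name_map = {'PR': 'Puerto Rico', 'USA': 'United States'}
df['state'] = df['state'].fillna(df['state/region'].map(name_map))
print(df['state'].tolist())  # ['Puerto Rico', 'United States', 'Alabama']
```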
cond = abb_pop[abb_pop.isnull().any(axis=1)]['state/region'] == 'USA'
temp.loc[cond, 'state'] = 'United State of American'
temp
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | Puerto Rico | PR | under18 | 1990 | NaN |
2449 | Puerto Rico | PR | total | 1990 | NaN |
2450 | Puerto Rico | PR | total | 1991 | NaN |
2451 | Puerto Rico | PR | under18 | 1991 | NaN |
2452 | Puerto Rico | PR | total | 1993 | NaN |
2453 | Puerto Rico | PR | under18 | 1993 | NaN |
2454 | Puerto Rico | PR | under18 | 1992 | NaN |
2455 | Puerto Rico | PR | total | 1992 | NaN |
2456 | Puerto Rico | PR | under18 | 1994 | NaN |
2457 | Puerto Rico | PR | total | 1994 | NaN |
2458 | Puerto Rico | PR | total | 1995 | NaN |
2459 | Puerto Rico | PR | under18 | 1995 | NaN |
2460 | Puerto Rico | PR | under18 | 1996 | NaN |
2461 | Puerto Rico | PR | total | 1996 | NaN |
2462 | Puerto Rico | PR | under18 | 1998 | NaN |
2463 | Puerto Rico | PR | total | 1998 | NaN |
2464 | Puerto Rico | PR | total | 1997 | NaN |
2465 | Puerto Rico | PR | under18 | 1997 | NaN |
2466 | Puerto Rico | PR | total | 1999 | NaN |
2467 | Puerto Rico | PR | under18 | 1999 | NaN |
2468 | Puerto Rico | PR | total | 2000 | 3810605.0 |
2469 | Puerto Rico | PR | under18 | 2000 | 1089063.0 |
2470 | Puerto Rico | PR | total | 2001 | 3818774.0 |
2471 | Puerto Rico | PR | under18 | 2001 | 1077566.0 |
2472 | Puerto Rico | PR | total | 2002 | 3823701.0 |
2473 | Puerto Rico | PR | under18 | 2002 | 1065051.0 |
2474 | Puerto Rico | PR | total | 2004 | 3826878.0 |
2475 | Puerto Rico | PR | under18 | 2004 | 1035919.0 |
2476 | Puerto Rico | PR | total | 2003 | 3826095.0 |
2477 | Puerto Rico | PR | under18 | 2003 | 1050615.0 |
... | ... | ... | ... | ... | ... |
2514 | United State of American | USA | under18 | 1999 | 71946051.0 |
2515 | United State of American | USA | total | 2000 | 282162411.0 |
2516 | United State of American | USA | under18 | 2000 | 72376189.0 |
2517 | United State of American | USA | total | 1999 | 279040181.0 |
2518 | United State of American | USA | total | 2001 | 284968955.0 |
2519 | United State of American | USA | under18 | 2001 | 72671175.0 |
2520 | United State of American | USA | total | 2002 | 287625193.0 |
2521 | United State of American | USA | under18 | 2002 | 72936457.0 |
2522 | United State of American | USA | total | 2003 | 290107933.0 |
2523 | United State of American | USA | under18 | 2003 | 73100758.0 |
2524 | United State of American | USA | total | 2004 | 292805298.0 |
2525 | United State of American | USA | under18 | 2004 | 73297735.0 |
2526 | United State of American | USA | total | 2005 | 295516599.0 |
2527 | United State of American | USA | under18 | 2005 | 73523669.0 |
2528 | United State of American | USA | total | 2006 | 298379912.0 |
2529 | United State of American | USA | under18 | 2006 | 73757714.0 |
2530 | United State of American | USA | total | 2007 | 301231207.0 |
2531 | United State of American | USA | under18 | 2007 | 74019405.0 |
2532 | United State of American | USA | total | 2008 | 304093966.0 |
2533 | United State of American | USA | under18 | 2008 | 74104602.0 |
2534 | United State of American | USA | under18 | 2013 | 73585872.0 |
2535 | United State of American | USA | total | 2013 | 316128839.0 |
2536 | United State of American | USA | total | 2009 | 306771529.0 |
2537 | United State of American | USA | under18 | 2009 | 74134167.0 |
2538 | United State of American | USA | under18 | 2010 | 74119556.0 |
2539 | United State of American | USA | total | 2010 | 309326295.0 |
2540 | United State of American | USA | under18 | 2011 | 73902222.0 |
2541 | United State of American | USA | total | 2011 | 311582564.0 |
2542 | United State of American | USA | under18 | 2012 | 73708179.0 |
2543 | United State of American | USA | total | 2012 | 313873685.0 |
96 rows × 5 columns
abb_pop[abb_pop.isnull().any(axis=1)] = temp
abb_pop[abb_pop.isnull().any(axis=1)]
state | state/region | ages | year | population | |
---|---|---|---|---|---|
2448 | Puerto Rico | PR | under18 | 1990 | NaN |
2449 | Puerto Rico | PR | total | 1990 | NaN |
2450 | Puerto Rico | PR | total | 1991 | NaN |
2451 | Puerto Rico | PR | under18 | 1991 | NaN |
2452 | Puerto Rico | PR | total | 1993 | NaN |
2453 | Puerto Rico | PR | under18 | 1993 | NaN |
2454 | Puerto Rico | PR | under18 | 1992 | NaN |
2455 | Puerto Rico | PR | total | 1992 | NaN |
2456 | Puerto Rico | PR | under18 | 1994 | NaN |
2457 | Puerto Rico | PR | total | 1994 | NaN |
2458 | Puerto Rico | PR | total | 1995 | NaN |
2459 | Puerto Rico | PR | under18 | 1995 | NaN |
2460 | Puerto Rico | PR | under18 | 1996 | NaN |
2461 | Puerto Rico | PR | total | 1996 | NaN |
2462 | Puerto Rico | PR | under18 | 1998 | NaN |
2463 | Puerto Rico | PR | total | 1998 | NaN |
2464 | Puerto Rico | PR | total | 1997 | NaN |
2465 | Puerto Rico | PR | under18 | 1997 | NaN |
2466 | Puerto Rico | PR | total | 1999 | NaN |
2467 | Puerto Rico | PR | under18 | 1999 | NaN |
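The write-back above works because pandas aligns the repaired rows with the original frame by index label; a minimal sketch with toy data (written here with `.loc` for the assignment):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 has a missing value
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', 'z']})
mask = df.isnull().any(axis=1)

# Repair a copy of the bad rows, then assign it back through the mask;
# pandas matches the rows up again by index label
fixed = df[mask].copy()
fixed['a'] = 0.0
df.loc[mask] = fixed
print(df['a'].tolist())  # [1.0, 0.0, 3.0]
```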
Merge in the per-state area data areas, using an outer merge.
Think about why an outer merge is used here.
abb_pop.head()
state | state/region | ages | year | population | |
---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 |
areas.head()
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
abb_pop_areas = abb_pop.merge(areas, how='outer')
abb_pop_areas.head()
state | state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 | 52423.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 | 52423.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 | 52423.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 | 52423.0 |
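To see how `how='outer'` behaves when a key appears on only one side, here is a sketch with made-up data:

```python
import pandas as pd

# Made-up data: 'Nation' has a population row but no area row
left = pd.DataFrame({'state': ['Alabama', 'Alaska', 'Nation'],
                     'pop': [4.8, 0.7, 313.9]})
right = pd.DataFrame({'state': ['Alabama', 'Alaska'],
                      'area': [52423, 656425]})

# An inner merge would silently drop 'Nation'; outer keeps it with NaN area
outer = left.merge(right, how='outer')
print(outer.shape)                   # (3, 3)
print(outer['area'].isnull().sum())  # 1
```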
Next, look for columns that still contain missing data:
abb_pop_areas.isnull().any()
state False
state/region False
ages False
year False
population True
area (sq. mi) True
dtype: bool
We find that the area (sq. mi) column contains missing data. To locate those rows, we need to find out which state is missing data.
abb_pop_areas[abb_pop_areas.isnull().any(axis=1)]
state | state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|---|
2448 | Puerto Rico | PR | under18 | 1990 | NaN | 3515.0 |
2449 | Puerto Rico | PR | total | 1990 | NaN | 3515.0 |
2450 | Puerto Rico | PR | total | 1991 | NaN | 3515.0 |
2451 | Puerto Rico | PR | under18 | 1991 | NaN | 3515.0 |
2452 | Puerto Rico | PR | total | 1993 | NaN | 3515.0 |
2453 | Puerto Rico | PR | under18 | 1993 | NaN | 3515.0 |
2454 | Puerto Rico | PR | under18 | 1992 | NaN | 3515.0 |
2455 | Puerto Rico | PR | total | 1992 | NaN | 3515.0 |
2456 | Puerto Rico | PR | under18 | 1994 | NaN | 3515.0 |
2457 | Puerto Rico | PR | total | 1994 | NaN | 3515.0 |
2458 | Puerto Rico | PR | total | 1995 | NaN | 3515.0 |
2459 | Puerto Rico | PR | under18 | 1995 | NaN | 3515.0 |
2460 | Puerto Rico | PR | under18 | 1996 | NaN | 3515.0 |
2461 | Puerto Rico | PR | total | 1996 | NaN | 3515.0 |
2462 | Puerto Rico | PR | under18 | 1998 | NaN | 3515.0 |
2463 | Puerto Rico | PR | total | 1998 | NaN | 3515.0 |
2464 | Puerto Rico | PR | total | 1997 | NaN | 3515.0 |
2465 | Puerto Rico | PR | under18 | 1997 | NaN | 3515.0 |
2466 | Puerto Rico | PR | total | 1999 | NaN | 3515.0 |
2467 | Puerto Rico | PR | under18 | 1999 | NaN | 3515.0 |
2496 | United State of American | USA | under18 | 1990 | 64218512.0 | NaN |
2497 | United State of American | USA | total | 1990 | 249622814.0 | NaN |
2498 | United State of American | USA | total | 1991 | 252980942.0 | NaN |
2499 | United State of American | USA | under18 | 1991 | 65313018.0 | NaN |
2500 | United State of American | USA | under18 | 1992 | 66509177.0 | NaN |
2501 | United State of American | USA | total | 1992 | 256514231.0 | NaN |
2502 | United State of American | USA | total | 1993 | 259918595.0 | NaN |
2503 | United State of American | USA | under18 | 1993 | 67594938.0 | NaN |
2504 | United State of American | USA | under18 | 1994 | 68640936.0 | NaN |
2505 | United State of American | USA | total | 1994 | 263125826.0 | NaN |
... | ... | ... | ... | ... | ... | ... |
2514 | United State of American | USA | under18 | 1999 | 71946051.0 | NaN |
2515 | United State of American | USA | total | 2000 | 282162411.0 | NaN |
2516 | United State of American | USA | under18 | 2000 | 72376189.0 | NaN |
2517 | United State of American | USA | total | 1999 | 279040181.0 | NaN |
2518 | United State of American | USA | total | 2001 | 284968955.0 | NaN |
2519 | United State of American | USA | under18 | 2001 | 72671175.0 | NaN |
2520 | United State of American | USA | total | 2002 | 287625193.0 | NaN |
2521 | United State of American | USA | under18 | 2002 | 72936457.0 | NaN |
2522 | United State of American | USA | total | 2003 | 290107933.0 | NaN |
2523 | United State of American | USA | under18 | 2003 | 73100758.0 | NaN |
2524 | United State of American | USA | total | 2004 | 292805298.0 | NaN |
2525 | United State of American | USA | under18 | 2004 | 73297735.0 | NaN |
2526 | United State of American | USA | total | 2005 | 295516599.0 | NaN |
2527 | United State of American | USA | under18 | 2005 | 73523669.0 | NaN |
2528 | United State of American | USA | total | 2006 | 298379912.0 | NaN |
2529 | United State of American | USA | under18 | 2006 | 73757714.0 | NaN |
2530 | United State of American | USA | total | 2007 | 301231207.0 | NaN |
2531 | United State of American | USA | under18 | 2007 | 74019405.0 | NaN |
2532 | United State of American | USA | total | 2008 | 304093966.0 | NaN |
2533 | United State of American | USA | under18 | 2008 | 74104602.0 | NaN |
2534 | United State of American | USA | under18 | 2013 | 73585872.0 | NaN |
2535 | United State of American | USA | total | 2013 | 316128839.0 | NaN |
2536 | United State of American | USA | total | 2009 | 306771529.0 | NaN |
2537 | United State of American | USA | under18 | 2009 | 74134167.0 | NaN |
2538 | United State of American | USA | under18 | 2010 | 74119556.0 | NaN |
2539 | United State of American | USA | total | 2010 | 309326295.0 | NaN |
2540 | United State of American | USA | under18 | 2011 | 73902222.0 | NaN |
2541 | United State of American | USA | total | 2011 | 311582564.0 | NaN |
2542 | United State of American | USA | under18 | 2012 | 73708179.0 | NaN |
2543 | United State of American | USA | total | 2012 | 313873685.0 | NaN |
68 rows × 6 columns
Drop the rows that contain missing data:
abb_pop_areas.dropna(axis=0, how='any', inplace=True)
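A quick sketch of what dropna's axis and how parameters do, on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
# how='any' drops a row if any value in it is missing;
# how='all' drops it only when every value in the row is missing
print(df.dropna(axis=0, how='any').shape)  # (1, 2)
print(df.dropna(axis=0, how='all').shape)  # (3, 2)
```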
Check whether any missing data remains:
abb_pop_areas.isnull().any()
state False
state/region False
ages False
year False
population False
area (sq. mi) False
dtype: bool
abb_pop_areas.head()
state | state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 | 52423.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 | 52423.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 | 52423.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 | 52423.0 |
abb_pop_areas.dtypes
state object
state/region object
ages object
year int64
population float64
area (sq. mi) float64
dtype: object
Find the total population data for 2010, using df.query(query_string):
result = abb_pop_areas.query('ages=="total" & year==2010')
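query() is shorthand for ordinary boolean indexing; a sketch with hypothetical data showing the equivalence:

```python
import pandas as pd

# Hypothetical slice of the population table
df = pd.DataFrame({'ages': ['total', 'under18', 'total'],
                   'year': [2010, 2010, 2011],
                   'population': [100, 30, 105]})
q = df.query('ages=="total" & year==2010')
# The same selection written as boolean indexing
b = df[(df['ages'] == 'total') & (df['year'] == 2010)]
print(q.equals(b))  # True
```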
Process the query result: make the state column the new row index with set_index:
result.set_index(keys=['state'], inplace=True)
result.head()
state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|
state | |||||
Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
Alaska | AK | total | 2010 | 713868.0 | 656425.0 |
Arizona | AZ | total | 2010 | 6408790.0 | 114006.0 |
Arkansas | AR | total | 2010 | 2922280.0 | 53182.0 |
California | CA | total | 2010 | 37333601.0 | 163707.0 |
Compute the population density. Note that this is a Series divided by a Series, and the result is still a Series.
density = result['population'] / result['area (sq. mi)']
density.head()
state
Alabama 91.287603
Alaska 1.087509
Arizona 56.214497
Arkansas 54.948667
California 228.051342
dtype: float64
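Series division pairs values up by index label; a tiny sketch with made-up numbers:

```python
import pandas as pd

pop = pd.Series({'A': 100.0, 'B': 50.0})
area = pd.Series({'A': 10.0, 'B': 25.0})
# Division aligns by index label; the result is another Series
density = pop / area
print(density['A'], density['B'])  # 10.0 2.0
```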
Sort with sort_values() and find the five states with the highest population density:
density.sort_values(inplace=True)
density.tail()
state
Connecticut 645.600649
Rhode Island 681.339159
New Jersey 1009.253268
Puerto Rico 1058.665149
District of Columbia 8898.897059
dtype: float64
Find the five states with the lowest population density:
density.head()
state
Alaska 1.087509
Wyoming 5.768079
Montana 6.736171
North Dakota 9.537565
South Dakota 10.583512
dtype: float64
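As an alternative to sorting the whole Series in place, nlargest()/nsmallest() return just the top or bottom entries; a sketch with a few made-up densities:

```python
import pandas as pd

# A few made-up densities keyed by state abbreviation
density = pd.Series({'DC': 8898.9, 'NJ': 1009.3, 'AK': 1.1, 'WY': 5.8})
print(density.nlargest(2).index.tolist())   # ['DC', 'NJ']
print(density.nsmallest(2).index.tolist())  # ['AK', 'WY']
```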
Key takeaways:
- Index consistently with .loc[]
- Use .isnull().any() to find the columns that contain NaN
- Use .unique() to determine which keys in a column are the ones we need
- In general, use outer or left merges, for one simple reason: better to end up with NaN in a column than to throw away information from the other columns
Review: how Series/DataFrame arithmetic differs from ndarray arithmetic
- Series and DataFrame do not broadcast: if an index label has no matching value on the other side, the result is NaN; alternatively, use add() with fill_value to fill in the missing values
- ndarrays do broadcast, repeating existing values to carry out the computation
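The alignment-versus-broadcasting contrast can be seen directly in a small sketch:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])
# Labels present on only one side come out as NaN...
print((s1 + s2)['a'])                 # nan
# ...unless fill_value supplies a default for the missing side
print(s1.add(s2, fill_value=0)['c'])  # 20.0
# ndarrays instead broadcast, repeating values to match shapes
print((np.array([[1], [2]]) + np.array([10, 20])).shape)  # (2, 2)
```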