Python基础学习笔记-12.Pandas库

最新推荐文章于 2022-12-05 17:10:45 发布

ruoqi23

最新推荐文章于 2022-12-05 17:10:45 发布

阅读量883

点赞数

分类专栏：笔记文章标签： python 机器学习

本文链接：https://blog.csdn.net/ruoqi23/article/details/107724425

版权

笔记专栏收录该内容

61 篇文章 52 订阅

订阅专栏

12.Pandas库

12.1.对象创建

12.1.1.Pandas Series对象

基于Numpy构建的Pandas库，提供了使得数据分析变得更快更简单的高级数据结构和操作工具

Series 是带标签数据的一维数组

Series对象的创建

通用结构: pd.Series(data, index=index, dtype=dtype)

data：数据，可以是列表，字典或Numpy数组

index：索引，为可选参数

dtype: 数据类型，为可选参数

1、用列表创建

index缺省，默认为整数序列

import pandas as pd

data = pd.Series([1.5, 3, 4.5, 6])

data

0 1.5

1 3.0

2 4.5

3 6.0

dtype: float64

增加index

data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])

data

a 1.5

b 3.0

c 4.5

d 6.0

dtype: float64

增加数据类型

缺省则从传入的数据自动判断

data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

data

a 1

b 2

c 3

d 4

dtype: int64

data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")

data

a 1.0

b 2.0

c 3.0

d 4.0

dtype: float64

注意：数据支持多种类型

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])

data

a 1

b 2

c 3

d 4

dtype: object

data["a"]

data["c"]

'3'

数据类型可被强制改变

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)

data

a 1.0

b 2.0

c 3.0

d 4.0

dtype: float64

data["c"]

3.0

data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)

data

ValueError: could not convert string to float: 'a'

2、用一维numpy数组创建

import numpy as np

x = np.arange(5)

pd.Series(x)

0 0

1 1

2 2

3 3

4 4

dtype: int32

3、用字典创建

默认以键为index 值为data

population_dict = {"BeiJing": 2154,

"ShangHai": 2424,

"ShenZhen": 1303,

"HangZhou": 981 }

population = pd.Series(population_dict)

population

BeiJing 2154

ShangHai 2424

ShenZhen 1303

HangZhou 981

dtype: int64

字典创建，如果指定index，则会到字典的键中筛选，找不到的，值设为NaN

population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])

population

BeiJing 2154.0

HangZhou 981.0

c NaN

d NaN

dtype: float64

4、data为标量的情况

pd.Series(5, index=[100, 200, 300])

100 5

200 5

300 5

dtype: int64

12.1.2.Pandas DataFrame对象

DataFrame 是带标签数据的多维数组

DataFrame对象的创建

通用结构: pd.DataFrame(data, index=index, columns=columns)

data：数据，可以是列表，字典或Numpy数组

index：索引，为可选参数

columns: 列标签，为可选参数

1、通过Series对象创建

population_dict = {"BeiJing": 2154,

"ShangHai": 2424,

"ShenZhen": 1303,

"HangZhou": 981 }

population = pd.Series(population_dict)

pd.DataFrame(population)

	0
BeiJing	2154
ShangHai	2424
ShenZhen	1303
HangZhou	981

pd.DataFrame(population, columns=["population"])

	population
BeiJing	2154
ShangHai	2424
ShenZhen	1303
HangZhou	981

2、通过Series对象字典创建

GDP_dict = {"BeiJing": 30320,

"ShangHai": 32680,

"ShenZhen": 24222,

"HangZhou": 13468 }

GDP = pd.Series(GDP_dict)

GDP

BeiJing 30320

ShangHai 32680

ShenZhen 24222

HangZhou 13468

dtype: int64

pd.DataFrame({"population": population,

"GDP": GDP})

	population	GDP
BeiJing	2154	30320
ShangHai	2424	32680
ShenZhen	1303	24222
HangZhou	981	13468

注意：数量不够的会自动补齐

pd.DataFrame({"population": population,

"GDP": GDP,

"country": "China"})

	population	GDP	country
BeiJing	2154	30320	China
ShangHai	2424	32680	China
ShenZhen	1303	24222	China
HangZhou	981	13468	China

3、通过字典列表对象创建

字典索引作为index，字典键作为columns

import numpy as np

import pandas as pd

data = [{"a": i, "b": 2*i} for i in range(3)]

data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

data = pd.DataFrame(data)

data

	a	b
0	0	0
1	1	2
2	2	4

data1 = data["a"].copy()

data1

0 0

1 1

2 2

Name: a, dtype: int64

data1[0] = 10

data1

0 10

1 1

2 2

Name: a, dtype: int64

data

	a	b
0	0	0
1	1	2
2	2	4

不存在的键，会默认值为NaN

data = [{"a": 1, "b":1},{"b": 3, "c":4}]

data

[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]

pd.DataFrame(data)

	a	b	c
0	1.0	1	NaN
1	NaN	3	4.0

4、通过Numpy二维数组创建

data = np.random.randint(10, size=(3, 2))

data

array([[1, 6],

[2, 9],

[4, 0]])

pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])

	foo	bar
a	1	6
b	2	9
c	4	0

12.2.DataFrame性质

12.2.1.属性

data = pd.DataFrame({"pop": population, "GDP": GDP})

data

	pop	GDP
BeiJing	2154	30320
ShangHai	2424	32680
ShenZhen	1303	24222
HangZhou	981	13468

（1）df.values 返回numpy数组的数据

data.values

array([[ 2154, 30320],

[ 2424, 32680],

[ 1303, 24222],

[ 981, 13468]], dtype=int64)

（2）df.index 返回行索引

data.index

Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')

（3）df.columns 返回列索引

data.columns

Index(['pop', 'GDP'], dtype='object')

（4）df.shape 形状

data.shape

(4, 2)

（5） pd.size 大小

data.size

（6）pd.dtypes 返回每列数据类型

data.dtypes

pop int64

GDP int64

dtype: object

12.2.2.索引

data

	pop	GDP
BeiJing	2154	30320
ShangHai	2424	32680
ShenZhen	1303	24222
HangZhou	981	13468

（1）获取列

字典式

data["pop"]

BeiJing 2154

ShangHai 2424

ShenZhen 1303

HangZhou 981

Name: pop, dtype: int64

data[["GDP", "pop"]]

	GDP	pop
BeiJing	30320	2154
ShangHai	32680	2424
ShenZhen	24222	1303
HangZhou	13468	981

对象属性式

data.GDP

BeiJing 30320

ShangHai 32680

ShenZhen 24222

HangZhou 13468

Name: GDP, dtype: int64

（2）获取行

绝对索引 df.loc

data.loc["BeiJing"]

pop 2154

GDP 30320

Name: BeiJing, dtype: int64

data.loc[["BeiJing", "HangZhou"]]

	pop	GDP
BeiJing	2154	30320
HangZhou	981	13468

相对索引 df.iloc

data.iloc[0]

pop 2154

GDP 30320

Name: BeiJing, dtype: int64

data.iloc[[1, 3]]

	pop	GDP
ShangHai	2424	32680
HangZhou	981	13468

（3）获取标量

data.loc["BeiJing", "GDP"]

30320

data.iloc[0, 1]

30320

data.values[0][1]

30320

（4）Series对象的索引

type(data.GDP)

pandas.core.series.Series

GDP

BeiJing 30320

ShangHai 32680

ShenZhen 24222

HangZhou 13468

dtype: int64

GDP["BeiJing"]

30320

12.2.3.切片

dates = pd.date_range(start='2019-01-01', periods=6)

dates

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',

'2019-01-05', '2019-01-06'],

dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])

	A	B	C	D
2019-01-01	-0.935378	-0.190742	0.925984	-0.818969
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395
2019-01-03	-0.141572	0.058118	1.102248	1.207726
2019-01-04	0.305088	0.535920	-0.978434	0.177251
2019-01-05	0.313383	0.234041	0.163155	-0.296649
2019-01-06	0.250613	-0.904400	-0.858240	-1.573342

（1）行切片

df["2019-01-01": "2019-01-03"]

	A	B	C	D
2019-01-01	-0.935378	-0.190742	0.925984	-0.818969
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395
2019-01-03	-0.141572	0.058118	1.102248	1.207726

df.loc["2019-01-01": "2019-01-03"]

	A	B	C	D
2019-01-01	-0.935378	-0.190742	0.925984	-0.818969
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395
2019-01-03	-0.141572	0.058118	1.102248	1.207726

df.iloc[0: 3]

	A	B	C	D
2019-01-01	-0.935378	-0.190742	0.925984	-0.818969
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395
2019-01-03	-0.141572	0.058118	1.102248	1.207726

（2）列切片

df.loc[:, "A": "C"]

	A	B	C
2019-01-01	-0.935378	-0.190742	0.925984
2019-01-02	-0.234414	-1.194674	1.080779
2019-01-03	-0.141572	0.058118	1.102248
2019-01-04	0.305088	0.535920	-0.978434
2019-01-05	0.313383	0.234041	0.163155
2019-01-06	0.250613	-0.904400	-0.858240

df.iloc[:, 0: 3]

	A	B	C
2019-01-01	-0.935378	-0.190742	0.925984
2019-01-02	-0.234414	-1.194674	1.080779
2019-01-03	-0.141572	0.058118	1.102248
2019-01-04	0.305088	0.535920	-0.978434
2019-01-05	0.313383	0.234041	0.163155
2019-01-06	0.250613	-0.904400	-0.858240

（3）多种多样的取值

行、列同时切片

df.loc["2019-01-02": "2019-01-03", "C":"D"]

	C	D
2019-01-02	1.080779	-2.294395
2019-01-03	1.102248	1.207726

df.iloc[1: 3, 2:]

	C	D
2019-01-02	1.080779	-2.294395
2019-01-03	1.102248	1.207726

行切片，列分散取值

df.loc["2019-01-04": "2019-01-06", ["A", "C"]]

	A	C
2019-01-04	0.305088	-0.978434
2019-01-05	0.313383	0.163155
2019-01-06	0.250613	-0.858240

df.iloc[3:, [0, 2]]

	A	C
2019-01-04	0.305088	-0.978434
2019-01-05	0.313383	0.163155
2019-01-06	0.250613	-0.858240

行分散取值，列切片

df.loc[["2019-01-02", "2019-01-06"], "C": "D"]

df.iloc[[1, 5], 0: 3]

	A	B	C
2019-01-02	-0.234414	-1.194674	1.080779
2019-01-06	0.250613	-0.904400	-0.858240

行、列均分散取值

df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]

df.iloc[[1, 5], [0, 3]]

	A	D
2019-01-02	-0.234414	-2.294395
2019-01-06	0.250613	-1.573342

12.2.4.布尔索引

	A	B	C	D
2019-01-01	-0.935378	-0.190742	0.925984	-0.818969
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395
2019-01-03	-0.141572	0.058118	1.102248	1.207726
2019-01-04	0.305088	0.535920	-0.978434	0.177251
2019-01-05	0.313383	0.234041	0.163155	-0.296649
2019-01-06	0.250613	-0.904400	-0.858240	-1.573342

df > 0

	A	B	C	D
2019-01-01	False	False	True	False
2019-01-02	False	False	True	False
2019-01-03	False	True	True	True
2019-01-04	True	True	False	True
2019-01-05	True	True	True	False
2019-01-06	True	False	False	False

df[df > 0]

	A	B	C	D
2019-01-01	NaN	NaN	0.925984	NaN
2019-01-02	NaN	NaN	1.080779	NaN
2019-01-03	NaN	0.058118	1.102248	1.207726
2019-01-04	0.305088	0.535920	NaN	0.177251
2019-01-05	0.313383	0.234041	0.163155	NaN
2019-01-06	0.250613	NaN	NaN	NaN

df.A > 0

2019-01-01 False

2019-01-02 False

2019-01-03 False

2019-01-04 True

2019-01-05 True

2019-01-06 True

Freq: D, Name: A, dtype: bool

df[df.A > 0]

	A	B	C	D
2019-01-04	0.305088	0.535920	-0.978434	0.177251
2019-01-05	0.313383	0.234041	0.163155	-0.296649
2019-01-06	0.250613	-0.904400	-0.858240	-1.573342

isin（）方法

df2 = df.copy()

df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

df2

	A	B	C	D	E
2019-01-01	-0.935378	-0.190742	0.925984	-0.818969	one
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395	one
2019-01-03	-0.141572	0.058118	1.102248	1.207726	two
2019-01-04	0.305088	0.535920	-0.978434	0.177251	three
2019-01-05	0.313383	0.234041	0.163155	-0.296649	four
2019-01-06	0.250613	-0.904400	-0.858240	-1.573342	three

ind = df2["E"].isin(["two", "four"])

ind

2019-01-01 False

2019-01-02 False

2019-01-03 True

2019-01-04 False

2019-01-05 True

2019-01-06 False

Freq: D, Name: E, dtype: bool

df2[ind]

	A	B	C	D	E
2019-01-03	-0.141572	0.058118	1.102248	1.207726	two
2019-01-05	0.313383	0.234041	0.163155	-0.296649	four

12.2.5.赋值

DataFrame 增加新列

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))

2019-01-01 1

2019-01-02 2

2019-01-03 3

2019-01-04 4

2019-01-05 5

2019-01-06 6

Freq: D, dtype: int64

df["E"] = s1

	A	B	C	D	E
2019-01-01	-0.935378	-0.190742	0.925984	-0.818969	1
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395	2
2019-01-03	-0.141572	0.058118	1.102248	1.207726	3
2019-01-04	0.305088	0.535920	-0.978434	0.177251	4
2019-01-05	0.313383	0.234041	0.163155	-0.296649	5
2019-01-06	0.250613	-0.904400	-0.858240	-1.573342	6

修改赋值

df.loc["2019-01-01", "A"] = 0

	A	B	C	D	E
2019-01-01	0.000000	-0.190742	0.925984	-0.818969	1
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395	2
2019-01-03	-0.141572	0.058118	1.102248	1.207726	3
2019-01-04	0.305088	0.535920	-0.978434	0.177251	4
2019-01-05	0.313383	0.234041	0.163155	-0.296649	5
2019-01-06	0.250613	-0.904400	-0.858240	-1.573342	6

df.iloc[0, 1] = 0

	A	B	C	D	E
2019-01-01	0.000000	0.000000	0.925984	-0.818969	1
2019-01-02	-0.234414	-1.194674	1.080779	-2.294395	2
2019-01-03	-0.141572	0.058118	1.102248	1.207726	3
2019-01-04	0.305088	0.535920	-0.978434	0.177251	4
2019-01-05	0.313383	0.234041	0.163155	-0.296649	5
2019-01-06	0.250613	-0.904400	-0.858240	-1.573342	6

df["D"] = np.array([5]*len(df)) # 可简化成df["D"] = 5

	A	B	C	D	E
2019-01-01	0.000000	0.000000	0.925984	5	1
2019-01-02	-0.234414	-1.194674	1.080779	5	2
2019-01-03	-0.141572	0.058118	1.102248	5	3
2019-01-04	0.305088	0.535920	-0.978434	5	4
2019-01-05	0.313383	0.234041	0.163155	5	5
2019-01-06	0.250613	-0.904400	-0.858240	5	6

修改index和columns

df.index = [i for i in range(len(df))]

	A	B	C	D	E
0	0.000000	0.000000	0.925984	5	1
1	-0.234414	-1.194674	1.080779	5	2
2	-0.141572	0.058118	1.102248	5	3
3	0.305088	0.535920	-0.978434	5	4
4	0.313383	0.234041	0.163155	5	5
5	0.250613	-0.904400	-0.858240	5	6

df.columns = [i for i in range(df.shape[1])]

	0	1	2	3	4
0	0.000000	0.000000	0.925984	5	1
1	-0.234414	-1.194674	1.080779	5	2
2	-0.141572	0.058118	1.102248	5	3
3	0.305088	0.535920	-0.978434	5	4
4	0.313383	0.234041	0.163155	5	5
5	0.250613	-0.904400	-0.858240	5	6

12.3.数值运算及统计分析

12.3.1.数据的查看

import pandas as pd

import numpy as np

dates = pd.date_range(start='2019-01-01', periods=6)

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])

	A	B	C	D
2019-01-01	-0.854043	0.412345	-2.296051	-0.048964
2019-01-02	1.371364	-0.121454	-0.299653	1.095375
2019-01-03	-0.714591	-1.103224	0.979250	0.319455
2019-01-04	-1.397557	0.426008	0.233861	-1.651887
2019-01-05	0.434026	0.459830	-0.095444	1.220302
2019-01-06	-0.133876	0.074500	-1.028147	0.605402

（1）查看前面的行

df.head() # 默认5行

	A	B	C	D
2019-01-01	-0.854043	0.412345	-2.296051	-0.048964
2019-01-02	1.371364	-0.121454	-0.299653	1.095375
2019-01-03	-0.714591	-1.103224	0.979250	0.319455
2019-01-04	-1.397557	0.426008	0.233861	-1.651887
2019-01-05	0.434026	0.459830	-0.095444	1.220302

df.head(2)

	A	B	C	D
2019-01-01	-0.854043	0.412345	-2.296051	-0.048964
2019-01-02	1.371364	-0.121454	-0.299653	1.095375

（2）查看后面的行

df.tail() # 默认5行

	A	B	C	D
2019-01-02	1.371364	-0.121454	-0.299653	1.095375
2019-01-03	-0.714591	-1.103224	0.979250	0.319455
2019-01-04	-1.397557	0.426008	0.233861	-1.651887
2019-01-05	0.434026	0.459830	-0.095444	1.220302
2019-01-06	-0.133876	0.074500	-1.028147	0.605402

df.tail(3)

	A	B	C	D
2019-01-04	-1.397557	0.426008	0.233861	-1.651887
2019-01-05	0.434026	0.459830	-0.095444	1.220302
2019-01-06	-0.133876	0.074500	-1.028147	0.605402

（3）查看总体信息

df.iloc[0, 3] = np.nan

	A	B	C	D
2019-01-01	-0.854043	0.412345	-2.296051	NaN
2019-01-02	1.371364	-0.121454	-0.299653	1.095375
2019-01-03	-0.714591	-1.103224	0.979250	0.319455
2019-01-04	-1.397557	0.426008	0.233861	-1.651887
2019-01-05	0.434026	0.459830	-0.095444	1.220302
2019-01-06	-0.133876	0.074500	-1.028147	0.605402

df.info()

DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06

Freq: D

Data columns (total 4 columns):

A 6 non-null float64

B 6 non-null float64

C 6 non-null float64

D 5 non-null float64

dtypes: float64(4)

memory usage: 240.0 bytes

12.3.2.Numpy通用函数同样适用于Pandas

（1）向量化运算

x = pd.DataFrame(np.arange(4).reshape(1, 4))

	0	1	2	3
0	0	1	2	3

x+5

	0	1	2	3
0	5	6	7	8

np.exp(x)

	0	1	2	3
0	1.0	2.718282	7.389056	20.085537

y = pd.DataFrame(np.arange(4,8).reshape(1, 4))

	0	1	2	3
0	4	5	6	7

x*y

	0	1	2	3
0	0	5	12	21

（2）矩阵化运算

np.random.seed(42)

x = pd.DataFrame(np.random.randint(10, size=(3, 3)))

	0	1	2
0	6	3	7
1	4	6	9
2	2	6	7

转置

z = x.Tz

	0	1	2
0	6	4	2
1	3	6	6
2	7	9	7

np.random.seed(1)

y = pd.DataFrame(np.random.randint(10, size=(3, 3)))

	0	1	2
0	5	8	9
1	5	0	0
2	1	7	6

x.dot(y)

	0	1	2
0	52	97	96
1	59	95	90
2	47	65	60

%timeit x.dot(y)

133 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.dot(x, y)

40.2 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

执行相同运算，Numpy与Pandas的对比

x1 = np.array(x)

array([[6, 3, 7],

[4, 6, 9],

[2, 6, 7]])

y1 = np.array(y)

array([[5, 8, 9],

[5, 0, 0],

[1, 7, 6]])

%timeit x1.dot(y1)

22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.dot(x1, y1)

22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.dot(x.values, y.values)

42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

x2 = list(x1)y2 = list(y1)x3 = []y3 = []for i in x2:

res = []

for j in i:

res.append(int(j))

x3.append(res)for i in y2:

res = []

for j in i:

res.append(int(j))

y3.append(res)

def f(x, y):

res = []

for i in range(len(x)):

row = []

for j in range(len(y[0])):

sum_row = 0

for k in range(len(x[0])):

sum_row += x[i][k]*y[k][j]

row.append(sum_row)

res.append(row)

return res

%timeit f(x3, y3)

4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

一般来说，纯粹的计算在Numpy里执行的更快

Numpy更侧重于计算，Pandas更侧重于数据处理

（3）广播运算

np.random.seed(42)

x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))

	A	B	C
0	6	3	7
1	4	6	9
2	2	6	7

按行广播

x.iloc[0]

A 6

B 3

C 7

Name: 0, dtype: int32

x/x.iloc[0]

	A	B	C
0	1.000000	1.0	1.000000
1	0.666667	2.0	1.285714
2	0.333333	2.0	1.000000

按列广播

x.A

0 6

1 4

2 2

Name: A, dtype: int32

x.div(x.A, axis=0) # add sub div mul

	A	B	C
0	1.0	0.5	1.166667
1	1.0	1.5	2.250000
2	1.0	3.0	3.500000

x.div(x.iloc[0], axis=1)

	A	B	C
0	1.000000	1.0	1.000000
1	0.666667	2.0	1.285714
2	0.333333	2.0	1.000000

12.3.3.其他用法

（1）索引对齐

A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))

	A	B
0	3	7
1	2	1

B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))

	A	B	C
0	7	5	1
1	4	0	9
2	5	8	0

pandas会自动对齐两个对象的索引，没有的值用np.nan表示

A+B

	A	B	C
0	10.0	12.0	NaN
1	6.0	1.0	NaN
2	NaN	NaN	NaN

缺省值也可用fill_value来填充

A.add(B, fill_value=0)

	A	B	C
0	10.0	12.0	1.0
1	6.0	1.0	9.0
2	5.0	8.0	0.0

A*B

	A	B	C
0	21.0	35.0	NaN
1	8.0	0.0	NaN
2	NaN	NaN	NaN

（2）统计相关

数据种类统计

y = np.random.randint(3, size=20)

array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])

np.unique(y)

array([0, 1, 2])

from collections import Counter

Counter(y)

Counter({2: 11, 1: 5, 0: 4})

y1 = pd.DataFrame(y, columns=["A"])

	A
0	2
...	...
19	1

np.unique(y1)

array([0, 1, 2])

y1["A"].value_counts()

2 11

1 5

0 4

Name: A, dtype: int64

产生新的结果，并进行排序

population_dict = {"BeiJing": 2154,

"ShangHai": 2424,

"ShenZhen": 1303,

"HangZhou": 981 }

population = pd.Series(population_dict)

GDP_dict = {"BeiJing": 30320,

"ShangHai": 32680,

"ShenZhen": 24222,

"HangZhou": 13468 }

GDP = pd.Series(GDP_dict)

city_info = pd.DataFrame({"population": population,"GDP": GDP})

city_info

	population	GDP
BeiJing	2154	30320
ShangHai	2424	32680
ShenZhen	1303	24222
HangZhou	981	13468

city_info["per_GDP"] = city_info["GDP"]/city_info["population"]

city_info

	population	GDP	per_GDP
BeiJing	2154	30320	14.076137
ShangHai	2424	32680	13.481848
ShenZhen	1303	24222	18.589409
HangZhou	981	13468	13.728848

递增排序

city_info.sort_values(by="per_GDP")

	population	GDP	per_GDP
ShangHai	2424	32680	13.481848
HangZhou	981	13468	13.728848
BeiJing	2154	30320	14.076137
ShenZhen	1303	24222	18.589409

递减排序

city_info.sort_values(by="per_GDP", ascending=False)

	population	GDP	per_GDP
ShenZhen	1303	24222	18.589409
BeiJing	2154	30320	14.076137
HangZhou	981	13468	13.728848
ShangHai	2424	32680	13.481848

按轴进行排序

data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))

data

	C	B	A	D
2	3	13	17	8
1	1	19	14	6
0	11	7	14	2

行排序

data.sort_index()

	C	B	A	D
0	11	7	14	2
1	1	19	14	6
2	3	13	17	8

列排序

data.sort_index(axis=1)

	A	B	C	D
2	17	13	3	8
1	14	19	1	6
0	14	7	11	2

data.sort_index(axis=1, ascending=False)

	D	C	B	A
2	8	3	13	17
1	6	1	19	14
0	2	11	7	14

统计方法

df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))

	A	B	C	D
0	1.082198	3.557396	-3.060476	6.367969
1	13.113252	6.774559	2.874553	5.527044
2	-2.036341	-4.333177	5.094802	-0.152567
3	-3.386712	-1.522365	-2.522209	2.537716
4	4.328491	5.550994	5.577329	5.019991
5	1.171336	-0.493910	-4.032613	6.398588

非空个数

df.count()

A 6

B 6

C 6

D 6

dtype: int64

求和

df.sum()

A 14.272224

B 9.533497

C 3.931385

D 25.698741

dtype: float64

df.sum(axis=1)

0 7.947086

1 28.289408

2 -1.427283

3 -4.893571

4 20.476806

5 3.043402

dtype: float64

最大值最小值

df.min()

A -3.386712

B -4.333177

C -4.032613

D -0.152567

dtype: float64

df.max(axis=1)

0 6.367969

1 13.113252

2 5.094802

3 2.537716

4 5.577329

5 6.398588

dtype: float64

df.idxmax()

A 1

B 1

C 4

D 5

dtype: int64

均值

df.mean()

A 2.378704

B 1.588916

C 0.655231

D 4.283124

dtype: float64

方差

df.var()

A 34.980702

B 19.110656

C 18.948144

D 6.726776

dtype: float64

标准差

df.std()

A 5.914449

B 4.371574

C 4.352947

D 2.593603

dtype: float64

中位数

df.median()

A 1.126767

B 1.531743

C 0.176172

D 5.273518

dtype: float64

众数

data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))

data

	A	B
0	4	2
1	3	2
2	2	0
3	2	4
4	2	0
5	4	1
6	2	0
7	1	1
8	3	4
9	2	0

data.mode()

	A	B
0	2	0

75%分位数

df.quantile(0.75)

A 3.539202

B 5.052594

C 4.539740

D 6.157738

Name: 0.75, dtype: float64

统计所有

df.describe()

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	2.378704	1.588916	0.655231	4.283124
std	5.914449	4.371574	4.352947	2.593603
min	-3.386712	-4.333177	-4.032613	-0.152567
25%	-1.256706	-1.265251	-2.925910	3.158284
50%	1.126767	1.531743	0.176172	5.273518
75%	3.539202	5.052594	4.539740	6.157738
max	13.113252	6.774559	5.577329	6.398588

data_2 = pd.DataFrame([["a", "a", "c", "d"],

["c", "a", "c", "b"],

["a", "a", "d", "c"]], columns=list("ABCD"))

data_2

	A	B	C	D
0	a	a	c	d
1	c	a	c	b
2	a	a	d	c

data_2.describe()

	A	B	C	D
count	3	3	3	3
unique	2	1	2	3
top	a	a	c	d
freq	2	3	2	1

12.4.缺失值处理

12.4.1.发现缺失值

import pandas as pd

import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],

[np.nan, 3, 4],

[5, 6, None]]), columns=["A", "B", "C"])

data

	A	B	C
0	1	NaN	2
1	NaN	3	4
2	5	6	None

注意：有None、字符串等，数据类型全部变为object，它比int和float更消耗资源

data.dtypes

A object

B object

C object

dtype: object

data.isnull()

	A	B	C
0	False	True	False
1	True	False	False
2	False	False	True

data.notnull()

	A	B	C
0	True	False	True
1	False	True	True
2	True	True	False

12.4.2.删除缺失值

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],

[np.nan, 4, 5, 6],

[7, 8, np.nan, 9],

[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])

data

	A	B	C	D
0	1.0	NaN	2.0	3.0
1	NaN	4.0	5.0	6.0
2	7.0	8.0	NaN	9.0
3	10.0	11.0	12.0	13.0

注意：np.nan是一种特殊的浮点数

data.dtypes

A float64

B float64

C float64

D float64

dtype: object

（1）删除整行

data.dropna()

	A	B	C	D
3	10.0	11.0	12.0	13.0

（2）删除整列

data.dropna(axis="columns")

	D
0	3.0
1	6.0
2	9.0
3	13.0

data["D"] = np.nan

data

	A	B	C	D
0	1.0	NaN	2.0	NaN
1	NaN	4.0	5.0	NaN
2	7.0	8.0	NaN	NaN
3	10.0	11.0	12.0	NaN

data.dropna(axis="columns", how="all")

	A	B	C
0	1.0	NaN	2.0
1	NaN	4.0	5.0
2	7.0	8.0	NaN
3	10.0	11.0	12.0

data.dropna(axis="columns", how="any")

data.loc[3] = np.nan

data

	A	B	C	D
0	1.0	NaN	2.0	NaN
1	NaN	4.0	5.0	NaN
2	7.0	8.0	NaN	NaN
3	NaN	NaN	NaN	NaN

data.dropna(how="all")

	A	B	C	D
0	1.0	NaN	2.0	NaN
1	NaN	4.0	5.0	NaN
2	7.0	8.0	NaN	NaN

12.4.3.填充缺失值

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],

[np.nan, 4, 5, 6],

[7, 8, np.nan, 9],

[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])

data

	A	B	C	D
0	1.0	NaN	2.0	3.0
1	NaN	4.0	5.0	6.0
2	7.0	8.0	NaN	9.0
3	10.0	11.0	12.0	13.0

data.fillna(value=5)

	A	B	C	D
0	1.0	5.0	2.0	3.0
1	5.0	4.0	5.0	6.0
2	7.0	8.0	5.0	9.0
3	10.0	11.0	12.0	13.0

用均值进行替换

fill = data.mean()

fill

A 6.000000

B 7.666667

C 6.333333

D 7.750000

dtype: float64

data.fillna(value=fill)

	A	B	C	D
0	1.0	7.666667	2.000000	3.0
1	6.0	4.000000	5.000000	6.0
2	7.0	8.000000	6.333333	9.0
3	10.0	11.000000	12.000000	13.0

fill = data.stack().mean()

fill

7.0

data.fillna(value=fill)

	A	B	C	D
0	1.0	7.0	2.0	3.0
1	7.0	4.0	5.0	6.0
2	7.0	8.0	7.0	9.0
3	10.0	11.0	12.0	13.0

12.5.合并数据

构造一个生产DataFrame的函数

import pandas as pd

import numpy as np

def make_df(cols, ind):

"一个简单的DataFrame"

data = {c: [str(c)+str(i) for i in ind] for c in cols}

return pd.DataFrame(data, ind)

make_df("ABC", range(3))

	A	B	C
0	A0	B0	C0
1	A1	B1	C1
2	A2	B2	C2

垂直合并

df_1 = make_df("AB", [1, 2])

df_2 = make_df("AB", [3, 4])

print(df_1)

print(df_2)

A B

1 A1 B1

2 A2 B2

A B

3 A3 B3

4 A4 B4

pd.concat([df_1, df_2])

	A	B
1	A1	B1
2	A2	B2
3	A3	B3
4	A4	B4

水平合并

pd.concat([df_1, df_2], axis=1)

	A	B	A	B
1	A1	B1	A1	B1
2	A2	B2	A2	B2

索引重叠

行重叠

df_5 = make_df("AB", [1, 2])

df_6 = make_df("AB", [1, 2])

print(df_5)

print(df_6)

A B

1 A1 B1

2 A2 B2

A B

1 A1 B1

2 A2 B2

pd.concat([df_5, df_6])

	A	B
1	A1	B1
2	A2	B2
1	A1	B1
2	A2	B2

pd.concat([df_5, df_6],ignore_index=True)

	A	B
0	A1	B1
1	A2	B2
2	A1	B1
3	A2	B2

列重叠

df_7 = make_df("ABC", [1, 2])

df_8 = make_df("BCD", [1, 2])

print(df_7)

print(df_8)

A B C

1 A1 B1 C1

2 A2 B2 C2

B C D

1 B1 C1 D1

2 B2 C2 D2

pd.concat([df_7, df_8], axis=1)

	A	B	C	B	C	D
1	A1	B1	C1	B1	C1	D1
2	A2	B2	C2	B2	C2	D2

pd.concat([df_7, df_8],axis=1, ignore_index=True)

	0	1	2	3	4	5
1	A1	B1	C1	B1	C1	D1
2	A2	B2	C2	B2	C2	D2

对齐合并merge()

df_9 = make_df("AB", [1, 2])

df_10 = make_df("BC", [1, 2])

print(df_9)

print(df_10)

A B

1 A1 B1

2 A2 B2

B C

1 B1 C1

2 B2 C2

pd.merge(df_9, df_10)

	A	B	C
0	A1	B1	C1
1	A2	B2	C2

df_9 = make_df("AB", [1, 2])

df_10 = make_df("CB", [2, 1])

print(df_9)

print(df_10)

A B

1 A1 B1

2 A2 B2

C B

2 C2 B2

1 C1 B1

pd.merge(df_9, df_10)

	A	B	C
0	A1	B1	C1
1	A2	B2	C2

【例】合并城市信息

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),

"pop": (2154, 981, 1303)}

population = pd.DataFrame(population_dict)

population

	city	pop
0	BeiJing	2154
1	HangZhou	981
2	ShenZhen	1303

GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),

"GDP": (30320, 32680, 13468)}

GDP = pd.DataFrame(GDP_dict)

GDP

	city	GDP
0	BeiJing	30320
1	ShangHai	32680
2	HangZhou	13468

city_info = pd.merge(population, GDP)

city_info

	city	pop	GDP
0	BeiJing	2154	30320
1	HangZhou	981	13468

city_info = pd.merge(population, GDP, how="outer")

city_info

	city	pop	GDP
0	BeiJing	2154.0	30320.0
1	HangZhou	981.0	13468.0
2	ShenZhen	1303.0	NaN
3	ShangHai	NaN	32680.0

12.6.分组和数据透视表

12.6.1.分组

df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],

"data1": range(6),

"data2": np.random.randint(0, 10, size=6)})

	key	data1	data2
0	A	0	1
1	B	1	4
2	C	2	9
3	C	3	9
4	B	4	1
5	A	5	9

延迟计算

df.groupby("key")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>

df.groupby("key").sum()

	data1	data2
key
A	5	10
B	5	6
C	5	11

df.groupby("key").mean()

	data1	data2
key
A	2.5	5.0
B	2.5	3.0
C	2.5	5.5

for i in df.groupby("key"):

print(str(i))

('A', key data1 data2

0 A 0 2

5 A 5 8)

('B', key data1 data2

1 B 1 2

4 B 4 4)

('C', key data1 data2

2 C 2 8

3 C 3 3)

按列取值

df.groupby("key")["data2"].sum()

key

A 10

B 6

C 11

Name: data2, dtype: int32

按组迭代

for data, group in df.groupby("key"):

print("{0:5} shape={1}".format(data, group.shape))

A shape=(2, 3)

B shape=(2, 3)

C shape=(2, 3)

调用方法

df.groupby("key")["data1"].describe()

	count	mean	std	min	25%	50%	75%	max
key
A	2.0	2.5	3.535534	0.0	1.25	2.5	3.75	5.0
B	2.0	2.5	2.121320	1.0	1.75	2.5	3.25	4.0
C	2.0	2.5	0.707107	2.0	2.25	2.5	2.75	3.0

支持更复杂的操作

df.groupby("key").aggregate(["min", "median", "max"])

	data1			data2
	min	median	max	min	median	max
key
A	0	2.5	5	2	5.0	8
B	1	2.5	4	2	3.0	4
C	2	2.5	3	3	5.5	8

过滤

def filter_func(x):

return x["data2"].std() > 3

df.groupby("key")["data2"].std()

key

A 4.242641

B 1.414214

C 3.535534

Name: data2, dtype: float64

df.groupby("key").filter(filter_func)

	key	data1	data2
0	A	0	2
2	C	2	8
3	C	3	3
5	A	5	8

apply（）方法

	key	data1	data2
0	A	0	2
1	B	1	2
2	C	2	8
3	C	3	3
4	B	4	4
5	A	5	8

df.groupby("key").apply(lambda x: x-x.mean())

	data1	data2
0	-2.5	-3.0
1	-1.5	-1.0
2	-0.5	2.5
3	0.5	-2.5
4	1.5	1.0
5	2.5	3.0

def norm_by_data2(x):

x["data1"] /= x["data2"].sum()

return x

df.groupby("key").apply(norm_by_data2)

	key	data1	data2
0	A	0.000000	2
1	B	0.166667	2
2	C	0.181818	8
3	C	0.272727	3
4	B	0.666667	4
5	A	0.500000	8

将列表、数组设为分组键

L = [0, 1, 0, 1, 2, 0]

df.groupby(L).sum()

	data1	data2
0	7	18
1	4	5
2	4	4

用字典将索引映射到分组

df2 = df.set_index("key")

df2

	data1	data2
key
A	0	2
B	1	2
C	2	8
C	3	3
B	4	4
A	5	8

mapping = {"A": "first", "B": "constant", "C": "constant"}

df2.groupby(mapping).sum()

	data1	data2
constant	10	17
first	5	10

任意Python函数

df2.groupby(str.lower).mean()

	data1	data2
a	2.5	5.0
b	2.5	3.0
c	2.5	5.5

多个有效值组成的列表

df2.groupby([str.lower, mapping]).mean()

		data1	data2
a	first	2.5	5.0
b	constant	2.5	3.0
c	constant	2.5	5.5

12.6.2.案例

【例1】行星观测数据处理

import seaborn as sns

import pandas as pd

# planets = sns.load_dataset("planets")

planets = pd.read_csv("data/planets.csv") # 读取本地的csv文件

planets.shape

(1035, 6)

planets.head()

	method	number	orbital_period	mass	distance	year
0	Radial Velocity	1	269.300	7.10	77.40	2006
1	Radial Velocity	1	874.774	2.21	56.95	2008
2	Radial Velocity	1	763.000	2.60	19.84	2011
3	Radial Velocity	1	326.030	19.40	110.62	2007
4	Radial Velocity	1	516.220	10.50	119.47	2009

planets.describe()

	number	orbital_period	mass	distance	year
count	1035.000000	992.000000	513.000000	808.000000	1035.000000
mean	1.785507	2002.917596	2.638161	264.069282	2009.070531
std	1.240976	26014.728304	3.818617	733.116493	3.972567
min	1.000000	0.090706	0.003600	1.350000	1989.000000
25%	1.000000	5.442540	0.229000	32.560000	2007.000000
50%	1.000000	39.979500	1.260000	55.250000	2010.000000
75%	2.000000	526.005000	3.040000	178.500000	2012.000000
max	7.000000	730000.000000	25.000000	8500.000000	2014.000000

decade = 10 * (planets["year"] // 10)

decade.head()

0 2000

1 2000

2 2010

3 2000

4 2000

Name: year, dtype: int64

decade = decade.astype(str) + "s"

decade.name = "decade"

decade.head()

0 2000s

1 2000s

2 2010s

3 2000s

4 2000s

Name: decade, dtype: object

planets.groupby(["method", decade]).sum()

		number	orbital_period	mass	distance	year
method	decade
Astrometry	2010s	2	1.262360e+03	0.00000	35.75	4023
Eclipse Timing Variations	2000s	5	1.930800e+04	6.05000	261.44	6025
Eclipse Timing Variations	2010s	10	2.345680e+04	4.20000	1000.00	12065
Imaging	2000s	29	1.350935e+06	0.00000	956.83	40139
Imaging	2010s	21	6.803750e+04	0.00000	1210.08	36208
Microlensing	2000s	12	1.732500e+04	0.00000	0.00	20070
Microlensing	2010s	15	4.750000e+03	0.00000	41440.00	26155
Orbital Brightness Modulation	2010s	5	2.127920e+00	0.00000	2360.00	6035
Pulsar Timing	1990s	9	1.900153e+02	0.00000	0.00	5978
	2000s	1	3.652500e+04	0.00000	0.00	2003
	2010s	1	9.070629e-02	0.00000	1200.00	2011
Pulsation Timing Variations	2000s	1	1.170000e+03	0.00000	0.00	2007
Radial Velocity	1980s	1	8.388800e+01	11.68000	40.57	1989
	1990s	52	1.091561e+04	68.17820	723.71	55943
	2000s	475	2.633526e+05	945.31928	15201.16	619775
	2010s	424	1.809630e+05	316.47890	11382.67	432451
Transit	2000s	64	2.897102e+02	0.00000	31823.31	124462
Transit	2010s	712	8.087813e+03	1.47000	102419.46	673999
Transit Timing Variations	2010s	9	2.393505e+02	0.00000	3313.00	8050

planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)

	number
decade	1980s	1990s	2000s	2010s
method
Astrometry	0.0	0.0	0.0	2.0
Eclipse Timing Variations	0.0	0.0	5.0	10.0
Imaging	0.0	0.0	29.0	21.0
Microlensing	0.0	0.0	12.0	15.0
Orbital Brightness Modulation	0.0	0.0	0.0	5.0
Pulsar Timing	0.0	9.0	1.0	1.0
Pulsation Timing Variations	0.0	0.0	1.0	0.0
Radial Velocity	1.0	52.0	475.0	424.0
Transit	0.0	0.0	64.0	712.0
Transit Timing Variations	0.0	0.0	0.0	9.0

【例2】泰坦尼克号乘客数据分析

import seaborn as sns

# titanic = sns.load_dataset("titanic")

titanic = pd.read_csv("data/titanic.csv") # 读取本地的csv文件

titanic.head()

survived

pclass

sex

age

sibsp

parch

fare

embarked

class

who

adult_male

deck

embark_town

alive

alone

male

22.0

7.2500

Third

man

True

NaN

Southampton

False

female

38.0

71.2833

First

woman

False

Cherbourg

yes

False

female

26.0

7.9250

Third

woman

False

NaN

Southampton

yes

True

female

35.0

53.1000

First

woman

False

Southampton

yes

False

male

35.0

8.0500

Third

man

True

NaN

Southampton

True

titanic.describe()

	survived	pclass	age	sibsp	parch	fare
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

titanic.groupby("sex")[["survived"]].mean()

	survived
sex
female	0.742038
male	0.188908

titanic.groupby("sex")["survived"].mean()

sex

female 0.742038

male 0.188908

Name: survived, dtype: float64

titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()

class	First	Second	Third
sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

数据透视表

titanic.pivot_table("survived", index="sex", columns="class")

class	First	Second	Third
sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True)

class	First	Second	Third	All
sex
female	0.968085	0.921053	0.500000	0.742038
male	0.368852	0.157407	0.135447	0.188908
All	0.629630	0.472826	0.242363	0.383838

titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})

	fare			survived
class	First	Second	Third	First	Second	Third
sex
female	106.125798	21.970121	16.118810	91	70	72
male	67.226127	19.741782	12.661633	45	17	47

12.7.其他

（1）多级索引：用于多维数据

base_data = np.array([[1771, 11115 ],

[2154, 30320],

[2141, 14070],

[2424, 32680],

[1077, 7806],

[1303, 24222],

[798, 4789],

[981, 13468]])

data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"], [2008, 2018]*4], columns=["population", "GDP"])

data

		population	GDP
BeiJing	2008	1771	11115
BeiJing	2018	2154	30320
ShangHai	2008	2141	14070
ShangHai	2018	2424	32680
ShenZhen	2008	1077	7806
ShenZhen	2018	1303	24222
HangZhou	2008	798	4789
HangZhou	2018	981	13468

data.index.names = ["city", "year"]

data

		population	GDP
city	year
BeiJing	2008	1771	11115
BeiJing	2018	2154	30320
ShangHai	2008	2141	14070
ShangHai	2018	2424	32680
ShenZhen	2008	1077	7806
ShenZhen	2018	1303	24222
HangZhou	2008	798	4789
HangZhou	2018	981	13468

data["GDP"]

city year

BeiJing 2008 11115

2018 30320

ShangHai 2008 14070

2018 32680

ShenZhen 2008 7806

2018 24222

HangZhou 2008 4789

2018 13468

Name: GDP, dtype: int32

data.loc["ShangHai", "GDP"]

year

2008 14070

2018 32680

Name: GDP, dtype: int32

data.loc["ShangHai", 2018]["GDP"]

32680

（2）高性能的Pandas：eval（）

df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))

%timeit (df1+df2)/(df3+df4)

17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

减少了复合代数式计算中间过程的内存分配

%timeit pd.eval("(df1+df2)/(df3+df4)")

10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))

True

（3）高性能的Pandas：query（）

df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))

df.head()

	A	B	C
0	0.418071	0.381836	0.500556
1	0.059432	0.749066	0.302429
2	0.489147	0.739153	0.777161
3	0.175441	0.016556	0.348979
4	0.766534	0.559252	0.310635

df.eval("D=(A+B)/(C-1)", inplace=True)

df.head()

	A	B	C	D
0	0.418071	0.381836	0.500556	-1.601593
1	0.059432	0.749066	0.302429	-1.159019
2	0.489147	0.739153	0.777161	-5.512052
3	0.175441	0.016556	0.348979	-0.294917
4	0.766534	0.559252	0.310635	-1.923199

%timeit df[(df.A < 0.5) & (df.B > 0.5)]

1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.query("(A < 0.5)&(B > 0.5)")

2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

df.query("(A < 0.5)&(B > 0.5)").head()

	A	B	C	D
1	0.059432	0.749066	0.302429	-1.159019
2	0.489147	0.739153	0.777161	-5.512052
7	0.073950	0.730144	0.646190	-2.272672
10	0.393200	0.610467	0.697096	-3.313485
11	0.065734	0.764699	0.179380	-1.011958

np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))

True

（4）eval（）和query（）的使用时机

小数组时，普通方法反而更快

df.values.nbytes

32000

df1.values.nbytes

8000000

12.8.作业练习

创建DataFrame数组并进行相应操作：

1、创建一个30*6的DataFrame数组，元素由70~100之间均匀分布的随机整数构成，行标签按030201（初三.二班1号）~030230格式顺序排列，列标签分别为语文、数学、英语、物理、化学、计算机。

2、输出其纯数据、行标签、列标签、形状、大小和数据类型

3、获取全班数学成绩、获取学号为030205的同学的所有成绩；

4、增加总成绩的新列，并建立按总成绩降序排列的副本（注意是获得副本，不是获得视图），切片获得前十名学生的全部成绩；

5、创建一个DataFrame对象（记为B），行标签与上文DataFrame对象（记为A）一致，列标签为性别，数据为30个学生的随机性别，将A和B进行水平合并，获得新的DataFrame对象（记为C）；

6、输出数据C的info和describe信息，尝试自定义my_describe，输出自己感兴趣的统计信息；

7、按性别进行分组，对比男生女生所有科目及总成绩的平均值。

DataFrame数组操作

8.下载titanic数据集，执行下列操作：

import seaborn as sns

# titanic = sns.load_dataset("titanic")

titanic = pd.read_csv("data/titanic.csv") # 读取本地的csv文件

titanic.head()

survived

pclass

sex

age

sibsp

parch

fare

embarked

class

who

adult_male

deck

embark_town

alive

alone

male

22.0

7.2500

Third

man

True

NaN

Southampton

False

female

38.0

71.2833

First

woman

False

Cherbourg

yes

False

female

26.0

7.9250

Third

woman

False

NaN

Southampton

yes

True

female

35.0

53.1000

First

woman

False

Southampton

yes

False

male

35.0

8.0500

Third

man

True

NaN

Southampton

True

（1）获得一个删除了无年龄数据的所有行的副本；

（2）创建一个名为Age的Series对象，其数据来源于对数据集中的年龄按下列规则进行映射（参照行星数据集案例中decade的处理办法）：

If <10 : “0s”

elif <20 : “10s”

…

elif <60 : “50s”

…

（3）通过sex和Age对titanic数据集进行分组，获得不同性别、不同年龄段乘客的幸存比例，请分别使用groupby和pivot_table（如果直接用Age不行的话，换个思路）两种方法。

答案：

import pandas as pd

import numpy as np

classes = ["03020" + str(i) for i in range(1, 10)] + ["0302" + str(i) for i in range(10, 31)]

classes

['030201',

'030202',

...

'030230']

objectes = ["语文", "数学", "英语", "物理", "化学", "计算机"]

df = pd.DataFrame(np.random.randint(70, 100, (30, 6)), classes, objectes)

	语文	数学	英语	物理	化学	计算机
030201	91	74	85	92	84	78
030202	71	72	75	93	97	99
...	...	...	...	...	...	...
030230	74	86	90	85	80	82

print("row_index : {}, \ncol_index : {}, \nshape : {}, \nsize : {}, \ndtype : {}".format(df.index, df.columns,

df.shape, df.size,

df.dtypes))

row_index : Index(['030201', '030202', '030203', '030204', '030205', '030206', '030207',

'030208', '030209', '030210', '030211', '030212', '030213', '030214',

'030215', '030216', '030217', '030218', '030219', '030220', '030221',

'030222', '030223', '030224', '030225', '030226', '030227', '030228',

'030229', '030230'],

dtype='object'),

col_index : Index(['语文', '数学', '英语', '物理', '化学', '计算机'], dtype='object'),

shape : (30, 6),

size : 180,

dtype : 语文 int32

数学 int32

英语 int32

物理 int32

化学 int32

计算机 int32

dtype: object

print("math_score = \n{}, \n030205_score = \n{}".format(df["数学"], df.loc["030205"]))

math_score =

030201 74

030202 72

...

030230 86

Name: 数学, dtype: int32,

030205_score =

语文 76

数学 96

英语 77

物理 87

化学 82

计算机 93

Name: 030205, dtype: int32

df_cp = df.copy()

df_cp['总成绩'] = df_cp.apply(lambda x : x.sum(), axis=1)

df_cp

	语文	数学	英语	物理	化学	计算机	总成绩
030201	91	74	85	92	84	78	504
030202	71	72	75	93	97	99	507
...	...	...	...	...	...	...	...
030230	74	86	90	85	80	82	497

# 按总成绩降序排列的副本，切片获得前十名学生的全部成绩

df2 = df_cp.sort_values(by=['总成绩'], ascending=False)

# print(df2)

print(df2.iloc[0: 10, :])

语文数学英语物理化学计算机总成绩

030208 80 93 92 85 98 98 546

030203 95 95 81 80 97 90 538

030216 72 97 92 83 95 97 536

030220 89 96 80 83 93 90 531

030222 81 96 97 81 73 98 526

030226 89 99 90 85 86 76 525

030211 88 94 95 73 83 91 524

030213 98 95 76 85 94 75 523

030207 91 86 81 98 87 76 519

030210 76 99 82 88 95 78 518

gender = ['男' if np.random.random() < 0.5 else '女' for i in range(30)]

gender

['男',

'男',

...

'女']

df2 = pd.DataFrame(gender, index=classes, columns=["性别"])

print(df2)

性别

030201 男

030202 男

...

030230 女

c = pd.concat([df_cp, df2], axis=1)

print(c)

语文数学英语物理化学计算机总成绩性别

030201 91 74 85 92 84 78 504 男

030202 71 72 75 93 97 99 507 男

...

030230 74 86 90 85 80 82 497 女

print("C_info={}, C_describe={}".format(c.info(), c.describe()))

Index: 30 entries, 030201 to 030230

Data columns (total 8 columns):

语文 30 non-null int32

数学 30 non-null int32

...

总成绩

count 30.000000

...

max 546.000000

def my_describe(x):

return pd.Series([x.count()], index=["Count"])

c.apply(my_describe)

	语文	数学	英语	物理	化学	计算机	总成绩	性别
Count	30	30	30	30	30	30	30	30

c.groupby('性别').sum()

	语文	数学	英语	物理	化学	计算机	总成绩
性别
女	1189	1223	1155	1188	1220	1185	7160
男	1305	1353	1376	1356	1403	1345	8138

c.groupby('性别').mean()

	语文	数学	英语	物理	化学	计算机	总成绩
性别
女	84.928571	87.357143	82.5	84.857143	87.142857	84.642857	511.428571
男	81.562500	84.562500	86.0	84.750000	87.687500	84.062500	508.625000

8.（1）

data_no_age = titanic.drop(columns='age').copy()

data_no_age.head(10)

	survived	pclass	sex	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9	1	2	female	1	0	30.0708	C	Second	child	False	NaN	Cherbourg	yes	False

（2）

Age = titanic[titanic['age'].notnull()]

Age = 10 * (Age['age'] // 10)

Age = Age.astype(int)

Age = Age.astype(str) + 's'

print(Age)

0 20s

1 30s

...

890 30s

Name: age, Length: 714, dtype: object

（3）

t = titanic[titanic.age.notnull()]

t.groupby(["sex", Age])["survived"].mean().unstack()

age	0s	10s	20s	30s	40s	50s	60s	70s	80s
sex
female	0.633333	0.755556	0.722222	0.833333	0.687500	0.888889	1.000000	NaN	NaN
male	0.593750	0.122807	0.168919	0.214953	0.210526	0.133333	0.133333	0.0	1.0

t.age = Age.copy()

t.pivot_table("survived", index="sex", columns="age")

age	0s	10s	20s	30s	40s	50s	60s	70s	80s
sex
female	0.633333	0.755556	0.722222	0.833333	0.687500	0.888889	1.000000	NaN	NaN
male	0.593750	0.122807	0.168919	0.214953	0.210526	0.133333	0.133333	0.0	1.0

ruoqi23

关注

0
点赞
踩
9

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录

	A	B	C	D
A	1.000000	0.831063	0.331060	0.510821
B	0.831063	1.000000	0.179244	0.719112
C	0.331060	0.179244	1.000000	-0.450365
D	0.510821	0.719112	-0.450365	1.000000

	A	B	C	D
0	1.082198	4.639594	1.579117	7.947086
1	13.113252	19.887811	22.762364	28.289408
2	-2.036341	-6.369518	-1.274717	-1.427283
3	-3.386712	-4.909077	-7.431287	-4.893571
4	4.328491	9.879485	15.456814	20.476806
5	1.171336	0.677427	-3.355186	3.043402