Task overview:
NumPy performs very well, but its limits show quickly once we need more flexible data tasks, such as labeling data, handling missing values, and grouping. The Pandas library, built on top of NumPy, provides DataFrame, an efficient data structure with row and column labels, which solves these problems neatly.
Task goals:
1. Master creating the DataFrame data structure and its basic properties.
2. Master numerical computation and statistical analysis with the Pandas library.
3. Master handling missing values, merging datasets, and related DataFrame operations.
4. Master aggregation and grouping operations on a DataFrame.
5. Use eval and query for more efficient computation.
I. Introduction
NumPy excels at vectorized numerical computation,
but for more flexible, complex data tasks —
such as labeling data, handling missing values, grouping, and pivot tables —
NumPy falls short.
Pandas, built on top of NumPy, provides high-level data structures and tools that make data analysis faster and simpler.
II. Creating Pandas Objects
1. The Pandas Series object
A Series is a one-dimensional array of labeled data.
Creating a Series:
General form: pd.Series(data, index=index, dtype=dtype)
data: the data; can be a list, dict, or NumPy array
index: the index; optional
dtype: the data type; optional
- From a list
- If index is omitted, it defaults to an integer sequence
Input:
import pandas as pd
data = pd.Series([1.5, 3, 4.5, 6])
data
Output:
0 1.5
1 3.0
2 4.5
3 6.0
dtype: float64
- Adding an index
Input:
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
Output:
a 1.5
b 3.0
c 4.5
d 6.0
dtype: float64
- Specifying the data type
If omitted, the dtype is inferred from the data
Input:
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
data
Output:
a 1
b 2
c 3
d 4
dtype: int64
Input:
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
Output:
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
Note: the data may mix several types
Input:
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
Output:
a 1
b 2
c 3
d 4
dtype: object
Input:
data["a"]
Output:
1
Input:
data["c"]
Output:
'3'
The data type can be coerced
Input:
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
Output:
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
Input:
data["c"]
Output:
3.0
Input:
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data
Output:
ValueError: could not convert string to float: 'a'
- From a one-dimensional NumPy array
Input:
import numpy as np
x = np.arange(5)
pd.Series(x)
Output:
0 0
1 1
2 2
3 3
4 4
dtype: int32
- From a dict
- By default the dict keys become the index and the dict values the data
Input:
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)
population
Output:
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
dtype: int64
- When creating from a dict with an explicit index, the labels are looked up among the dict keys; labels not found get NaN
Input:
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])
population
Output:
BeiJing 2154.0
HangZhou 981.0
c NaN
d NaN
dtype: float64
- When data is a scalar
Input:
pd.Series(5, index=[100, 200, 300])
Output:
100 5
200 5
300 5
dtype: int64
2. The Pandas DataFrame object
A DataFrame is a two-dimensional array of labeled data.
Creating a DataFrame:
General form: pd.DataFrame(data, index=index, columns=columns)
data: the data; can be a list, dict, or NumPy array
index: the row index; optional
columns: the column labels; optional
- From a Series object
Input:
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)
pd.DataFrame(population)
Output:
Input:
pd.DataFrame(population, columns=["population"])
Output:
- From a dict of Series objects
Input:
GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468}
GDP = pd.Series(GDP_dict)
GDP
Output:
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
Input:
pd.DataFrame({"population": population,
              "GDP": GDP})
Output:
Note: a value that is too short (here a scalar) is broadcast to fill the whole column
Input:
pd.DataFrame({"population": population,
              "GDP": GDP,
              "country": "China"})
Output:
- From a list of dicts
- The list positions become the index and the dict keys become the columns
Input:
import numpy as np
import pandas as pd
data = [{"a": i, "b": 2*i} for i in range(3)]
data
Output:
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
Input:
data = pd.DataFrame(data)
data
Output:
Input:
data1 = data["a"].copy()
data1 # becomes a Series
Output:
0 0
1 1
2 2
Name: a, dtype: int64
Input:
data1[0] = 10
data1
Output:
0 10
1 1
2 2
Name: a, dtype: int64
Input:
data
Output:
- Keys that do not exist default to NaN
Input:
data = [{"a": 1, "b": 1}, {"b": 3, "c": 4}]
data
Output:
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
Input:
pd.DataFrame(data)
Output:
- From a two-dimensional NumPy array
Input:
data = np.random.randint(10, size=(3, 2))
data
Output:
array([[8, 5],
       [1, 5],
       [5, 8]])
Input:
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])
Output:
III. DataFrame Properties
1. Attributes
Input:
data = pd.DataFrame({"pop": population, "GDP": GDP})
data
Output:
(1) df.values returns the data as a NumPy array
Input:
data.values
Output:
array([[ 2154, 30320],
       [ 2424, 32680],
       [ 1303, 24222],
       [  981, 13468]], dtype=int64)
(2) df.index returns the row index
Input:
data.index
Output:
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')
(3) df.columns returns the column index
Input:
data.columns
Output:
Index(['pop', 'GDP'], dtype='object')
(4) df.shape: the shape
Input:
data.shape
Output:
(4, 2)
(5) df.size: the number of elements
Input:
data.size
Output:
8
(6) df.dtypes returns the data type of each column
Input:
data.dtypes
Output:
pop int64
GDP int64
dtype: object
2. Indexing
Input:
data
Output:
(1) Getting columns
- Dictionary style
Input:
data["pop"]
Output:
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
Name: pop, dtype: int64
Input:
data[["GDP", "pop"]]
Output:
- Attribute style
Input:
data.GDP
Output:
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
Name: GDP, dtype: int64
(2) Getting rows
- Label-based indexing: df.loc
Input:
data.loc["BeiJing"]
Output:
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
Input:
data.loc[["BeiJing", "HangZhou"]]
Output:
- Position-based indexing: df.iloc
Input:
data
Output:
Input:
data.iloc[0]
Output:
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
Input:
data.iloc[[1, 3]]
Output:
(3) Getting a scalar
Input:
data
Output:
Input:
data.loc["BeiJing", "GDP"]
Output:
30320
Input:
data.iloc[0, 1]
Output:
30320
Input:
data.values[0][1]
Output:
30320
(4) Indexing a Series object
Input:
type(data.GDP)
Output:
pandas.core.series.Series
Input:
GDP
Output:
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
Input:
GDP["BeiJing"]
Output:
30320
3. Slicing
Input:
dates = pd.date_range(start='2021-08-01', periods=6)
dates
Output:
DatetimeIndex(['2021-08-01', '2021-08-02', '2021-08-03', '2021-08-04',
               '2021-08-05', '2021-08-06'],
              dtype='datetime64[ns]', freq='D')
Input:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
Output:
(1) Row slices
Input:
df["2021-08-01": "2021-08-03"]
Output:
Input:
df.loc["2021-08-01": "2021-08-03"]
Output:
Input:
df.iloc[0: 3]
Output:
(2) Column slices
Input:
df
Output:
Input:
df.loc[:, "A": "C"]
Output:
Input:
df.iloc[:, 0: 3]
Output:
(3) Various ways of selecting values
Input:
df
Output:
- Slicing rows and columns at the same time
Input:
df.loc["2021-08-02": "2021-08-03", "C":"D"]
Output:
Input:
df.iloc[1: 3, 2:]
Output:
- Row slice with scattered column selection
Input:
df.loc["2021-08-04": "2021-08-06", ["A", "C"]]
Output:
Input:
df.iloc[3:, [0, 2]]
Output:
- Scattered row selection with a column slice
Input:
df.loc[["2021-08-02", "2021-08-06"], "C": "D"]
Output:
Input:
df.iloc[[1, 5], 0: 3]
Output:
- Scattered selection on both rows and columns
Input:
df.loc[["2021-08-04", "2021-08-06"], ["A", "D"]]
Output:
Input:
df.iloc[[1, 5], [0, 3]]
Output:
4. Boolean indexing
Input:
df
Output:
Input:
df > 0
Output:
Input:
df[df > 0]
Output:
Input:
df.A > 0
Output:
2021-08-01 True
2021-08-02 True
2021-08-03 False
2021-08-04 True
2021-08-05 True
2021-08-06 False
Freq: D, Name: A, dtype: bool
Input:
df[df.A > 0]
Output:
- The isin() method
Note: from here on, the notebook had been closed for a while and df was regenerated, so the values differ from the df above.
Input:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
Output:
Input:
ind = df2["E"].isin(["two", "four"])
ind
Output:
2021-08-01 False
2021-08-02 False
2021-08-03 True
2021-08-04 False
2021-08-05 True
2021-08-06 False
Freq: D, Name: E, dtype: bool
Input:
df2[ind]
Output:
(5)赋值
输入:
df
输出:
- DataFrame 增加新列
输入:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210801', periods=6))
s1
输出:
2021-08-01 1
2021-08-02 2
2021-08-03 3
2021-08-04 4
2021-08-05 5
2021-08-06 6
Freq: D, dtype: int64
输入:
df["E"] = s1
df
输出:
- 修改赋值
输入:
df.loc["2021-08-01", "A"] = 0
df
输出:
输入:
df.iloc[0, 1] = 0
df
输出:
输入:
df["D"] = np.array([5]*len(df)) # 可简化成df["D"] = 5
df
输出:
- 修改index和columns
输入:
df.index = [i for i in range(len(df))]
df
输出:
输入:
df.columns = [i for i in range(df.shape[1])]
df
输出:
IV. Numerical Computation and Statistical Analysis
1. Inspecting the data
Input:
dates = pd.date_range(start='2021-08-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
Output:
(1) Viewing the first rows
Input:
df.head() # 5 rows by default
Output:
Input:
df.head(2)
Output:
(2) Viewing the last rows
Input:
df.tail() # 5 rows by default
Output:
Input:
df.tail(3)
Output:
(3) Viewing overall information
Input:
df.iloc[0, 3] = np.nan
df
Output:
Input:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2021-08-01 to 2021-08-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       5 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes
2. NumPy universal functions also work on Pandas objects
(1) Vectorized operations
Input:
x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
Output:
Input:
x+5
Output:
Input:
np.exp(x)
Output:
Input:
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y
Output:
Input:
x*y
Output:
(2) Matrix operations
Input:
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x
Output:
- Transpose
Input:
z = x.T
z
Output:
Input:
np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y
Output:
Input:
x.dot(y)
Output:
Input:
%timeit x.dot(y)
Output:
80.1 µs ± 328 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Input:
%timeit np.dot(x, y)
Output:
40.2 µs ± 856 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
- Comparing NumPy and Pandas on the same computation
Input:
x1 = np.array(x)
x1
Output:
array([[6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7, 7, 2, 5, 4, 1, 7, 5, 1, 4, 0,
9, 5, 8, 0, 9, 2, 6, 3],
[8, 2, 4, 2, 6, 4, 8, 6, 1, 3, 8, 1, 9, 8, 9, 4, 1, 3, 6, 7, 2, 0,
3, 1, 7, 3, 1, 5, 5, 9],
[3, 5, 1, 9, 1, 9, 3, 7, 6, 8, 7, 4, 1, 4, 7, 9, 8, 8, 0, 8, 6, 8,
7, 0, 7, 7, 2, 0, 7, 2],
[2, 0, 4, 9, 6, 9, 8, 6, 8, 7, 1, 0, 6, 6, 7, 4, 2, 7, 5, 2, 0, 2,
4, 2, 0, 4, 9, 6, 6, 8],
[9, 9, 2, 6, 0, 3, 3, 4, 6, 6, 3, 6, 2, 5, 1, 9, 8, 4, 5, 3, 9, 6,
8, 6, 0, 0, 8, 8, 3, 8],
[2, 6, 5, 7, 8, 4, 0, 2, 9, 7, 5, 7, 8, 3, 0, 0, 9, 3, 6, 1, 2, 0,
4, 0, 7, 0, 0, 1, 1, 5],
[6, 4, 0, 0, 2, 1, 4, 9, 5, 6, 3, 6, 7, 0, 5, 7, 4, 3, 1, 5, 5, 0,
8, 5, 2, 3, 3, 2, 9, 2],
[2, 3, 6, 3, 8, 0, 7, 6, 1, 7, 0, 8, 8, 1, 6, 9, 2, 6, 9, 8, 3, 0,
1, 0, 4, 4, 6, 8, 8, 2],
[2, 2, 3, 7, 5, 7, 0, 7, 3, 0, 7, 3, 5, 7, 3, 2, 8, 2, 8, 1, 1, 1,
5, 2, 8, 3, 0, 3, 0, 4],
[3, 7, 7, 6, 2, 0, 0, 2, 5, 6, 5, 5, 5, 2, 5, 7, 1, 4, 0, 0, 4, 2,
3, 2, 0, 0, 4, 5, 2, 8],
[4, 7, 0, 4, 2, 0, 3, 4, 6, 0, 2, 1, 8, 9, 5, 9, 2, 7, 7, 1, 5, 6,
1, 9, 1, 9, 0, 7, 0, 8],
[5, 6, 9, 6, 9, 2, 1, 8, 7, 9, 6, 8, 3, 3, 0, 7, 2, 6, 1, 1, 6, 5,
2, 8, 9, 5, 9, 9, 5, 0],
[3, 9, 5, 5, 4, 0, 7, 4, 4, 6, 3, 5, 3, 2, 6, 7, 3, 1, 9, 2, 0, 7,
2, 9, 6, 9, 4, 9, 4, 6],
[8, 4, 0, 9, 9, 0, 1, 5, 8, 7, 4, 0, 6, 4, 5, 6, 2, 9, 2, 4, 5, 8,
4, 0, 3, 4, 9, 9, 4, 6],
[3, 0, 4, 6, 9, 9, 5, 4, 3, 1, 3, 9, 9, 2, 9, 0, 7, 4, 3, 7, 6, 1,
0, 3, 7, 1, 2, 0, 0, 2],
[4, 2, 0, 0, 7, 9, 1, 2, 1, 2, 6, 0, 9, 7, 9, 9, 9, 1, 2, 8, 6, 3,
9, 4, 1, 7, 3, 8, 4, 8],
[3, 9, 4, 8, 7, 2, 0, 2, 3, 1, 0, 6, 7, 6, 4, 0, 6, 6, 8, 2, 8, 0,
0, 3, 8, 5, 2, 0, 3, 8],
[2, 8, 6, 3, 2, 9, 4, 4, 2, 8, 3, 4, 3, 4, 6, 8, 6, 4, 9, 9, 6, 9,
4, 2, 6, 1, 8, 9, 9, 0],
[5, 6, 7, 9, 8, 1, 9, 1, 4, 4, 5, 2, 7, 0, 5, 3, 0, 6, 8, 3, 3, 5,
2, 5, 6, 9, 9, 2, 6, 2],
[1, 9, 3, 7, 8, 6, 0, 2, 8, 0, 8, 7, 0, 5, 4, 5, 9, 4, 5, 4, 4, 3,
2, 2, 3, 8, 1, 8, 0, 0],
[4, 5, 5, 2, 6, 8, 9, 7, 5, 7, 4, 7, 9, 3, 9, 7, 9, 1, 4, 8, 3, 5,
0, 8, 0, 4, 3, 2, 5, 1],
[2, 4, 8, 1, 9, 7, 1, 4, 6, 7, 0, 5, 0, 1, 0, 4, 9, 8, 5, 0, 0, 1,
8, 2, 0, 4, 6, 5, 0, 4],
[4, 5, 2, 4, 6, 4, 4, 4, 9, 9, 2, 0, 4, 8, 0, 2, 3, 0, 0, 7, 1, 7,
6, 9, 9, 1, 5, 5, 2, 1],
[0, 5, 4, 8, 0, 6, 4, 4, 1, 2, 6, 5, 1, 5, 1, 1, 1, 2, 1, 3, 8, 5,
0, 7, 6, 9, 2, 0, 4, 3],
[9, 7, 0, 9, 0, 3, 7, 4, 1, 5, 4, 1, 2, 8, 6, 6, 5, 7, 3, 7, 3, 7,
8, 2, 2, 1, 9, 2, 2, 4],
[4, 1, 9, 5, 4, 5, 0, 4, 8, 9, 1, 0, 9, 8, 9, 8, 8, 5, 7, 0, 9, 3,
0, 7, 0, 2, 3, 7, 5, 9],
[6, 7, 1, 9, 7, 2, 6, 2, 6, 1, 9, 5, 2, 2, 8, 6, 4, 9, 6, 8, 0, 6,
5, 9, 8, 0, 3, 8, 3, 9],
[2, 8, 1, 3, 5, 1, 7, 7, 0, 2, 9, 8, 4, 5, 3, 9, 1, 7, 5, 4, 8, 0,
4, 5, 4, 5, 5, 6, 3, 7],
[6, 8, 6, 2, 2, 7, 4, 3, 7, 5, 1, 3, 3, 5, 5, 0, 7, 5, 2, 8, 1, 7,
9, 2, 4, 5, 9, 5, 3, 2],
[3, 0, 3, 0, 0, 9, 5, 4, 3, 2, 0, 5, 1, 7, 9, 4, 6, 9, 1, 7, 1, 3,
0, 4, 8, 0, 8, 7, 5, 6]])
Input:
y1 = np.array(y)
y1
Output:
array([[5, 8, 9, 5, 0, 0, 1, 7, 6, 9, 2, 4, 5, 2, 4, 2, 4, 7, 7, 9, 1, 7,
0, 6, 9, 9, 7, 6, 9, 1],
[0, 1, 8, 8, 3, 9, 8, 7, 3, 6, 5, 1, 9, 3, 4, 8, 1, 4, 0, 3, 9, 2,
0, 4, 9, 2, 7, 7, 9, 8],
[6, 9, 3, 7, 7, 4, 5, 9, 3, 6, 8, 0, 2, 7, 7, 9, 7, 3, 0, 8, 7, 7,
1, 1, 3, 0, 8, 6, 4, 5],
[6, 2, 5, 7, 8, 4, 4, 7, 7, 4, 9, 0, 2, 0, 7, 1, 7, 9, 8, 4, 0, 1,
9, 8, 2, 3, 1, 2, 7, 2],
[6, 0, 9, 2, 6, 6, 2, 7, 7, 0, 6, 5, 1, 4, 6, 0, 6, 5, 1, 2, 1, 5,
4, 0, 7, 8, 9, 5, 7, 0],
[9, 3, 9, 1, 4, 4, 6, 8, 8, 9, 2, 7, 5, 5, 4, 5, 8, 5, 8, 1, 1, 8,
7, 0, 3, 4, 2, 0, 3, 5],
[1, 2, 4, 3, 0, 6, 0, 7, 2, 8, 3, 0, 8, 4, 2, 9, 0, 3, 8, 1, 4, 3,
3, 6, 7, 3, 5, 3, 2, 4],
[4, 0, 3, 3, 8, 3, 5, 6, 7, 5, 1, 7, 0, 2, 8, 2, 1, 4, 0, 4, 1, 7,
3, 1, 6, 6, 9, 6, 9, 6],
[0, 0, 2, 9, 6, 0, 6, 7, 0, 3, 9, 0, 3, 4, 7, 5, 3, 8, 8, 0, 6, 7,
9, 5, 4, 9, 5, 2, 5, 6],
[6, 8, 7, 7, 7, 2, 6, 0, 5, 2, 1, 8, 5, 9, 4, 9, 1, 2, 0, 4, 7, 0,
6, 2, 4, 3, 6, 7, 6, 3],
[0, 6, 4, 7, 6, 2, 9, 5, 9, 9, 9, 8, 6, 4, 2, 9, 4, 0, 0, 3, 4, 9,
3, 9, 1, 2, 5, 4, 0, 8],
[2, 3, 9, 9, 4, 4, 8, 2, 1, 6, 3, 8, 9, 7, 0, 5, 2, 2, 8, 5, 0, 5,
9, 8, 6, 6, 0, 4, 7, 3],
[0, 1, 6, 0, 6, 1, 6, 4, 2, 5, 4, 6, 2, 9, 2, 7, 5, 0, 7, 8, 8, 8,
0, 7, 2, 0, 7, 1, 1, 9],
[5, 1, 5, 9, 6, 4, 9, 8, 7, 5, 1, 8, 0, 5, 3, 9, 0, 4, 8, 6, 2, 4,
3, 2, 0, 0, 4, 2, 5, 0],
[0, 3, 8, 5, 3, 1, 4, 7, 3, 2, 2, 2, 6, 6, 0, 1, 5, 6, 5, 8, 8, 5,
5, 7, 5, 9, 1, 3, 9, 3],
[3, 3, 6, 1, 3, 0, 5, 0, 5, 2, 7, 6, 4, 0, 2, 4, 8, 7, 6, 7, 7, 1,
7, 7, 3, 8, 3, 0, 6, 3],
[0, 6, 5, 9, 6, 4, 6, 6, 2, 2, 4, 1, 2, 3, 9, 3, 6, 7, 0, 3, 3, 6,
8, 6, 5, 1, 3, 2, 6, 3],
[6, 7, 2, 8, 0, 1, 8, 6, 0, 0, 1, 2, 7, 7, 4, 4, 0, 1, 0, 8, 5, 6,
2, 5, 4, 3, 0, 6, 2, 1],
[9, 4, 4, 0, 9, 8, 7, 7, 6, 1, 7, 2, 4, 5, 6, 7, 2, 0, 5, 2, 7, 9,
9, 7, 1, 1, 4, 6, 5, 6],
[4, 1, 1, 5, 1, 2, 6, 2, 3, 3, 2, 3, 0, 0, 1, 5, 0, 5, 8, 8, 0, 0,
0, 9, 8, 5, 9, 3, 4, 0],
[9, 8, 6, 3, 9, 9, 0, 8, 1, 6, 6, 1, 3, 7, 3, 2, 3, 0, 2, 8, 2, 9,
0, 1, 3, 9, 4, 8, 8, 8],
[2, 8, 6, 4, 9, 0, 5, 5, 6, 1, 7, 6, 5, 7, 1, 9, 7, 5, 6, 7, 6, 7,
5, 6, 8, 7, 4, 2, 4, 0],
[0, 3, 5, 9, 0, 3, 6, 5, 1, 1, 8, 4, 7, 0, 1, 7, 1, 5, 4, 5, 6, 2,
5, 3, 9, 3, 9, 5, 1, 9],
[7, 7, 0, 8, 6, 1, 2, 0, 4, 4, 6, 1, 0, 9, 4, 9, 3, 0, 5, 1, 1, 9,
6, 0, 2, 8, 3, 7, 2, 5],
[6, 0, 4, 2, 3, 1, 0, 5, 7, 0, 2, 2, 0, 9, 5, 1, 1, 6, 3, 2, 1, 1,
2, 7, 5, 2, 9, 4, 7, 3],
[5, 0, 2, 1, 4, 9, 4, 6, 9, 3, 8, 8, 7, 1, 8, 7, 8, 9, 2, 2, 5, 5,
3, 5, 9, 2, 7, 4, 1, 6],
[9, 8, 1, 8, 1, 6, 2, 6, 1, 8, 3, 7, 0, 7, 0, 0, 7, 3, 9, 5, 2, 5,
1, 2, 5, 3, 3, 6, 1, 8],
[1, 8, 6, 4, 6, 9, 5, 4, 7, 2, 0, 2, 0, 5, 9, 4, 1, 4, 5, 2, 9, 3,
1, 5, 1, 1, 7, 1, 2, 6],
[0, 7, 7, 4, 3, 2, 7, 8, 5, 2, 4, 9, 2, 2, 3, 5, 9, 6, 4, 9, 0, 2,
8, 3, 7, 3, 9, 2, 3, 8],
[8, 0, 2, 6, 8, 3, 6, 4, 9, 7, 6, 3, 2, 9, 1, 5, 5, 6, 9, 4, 6, 7,
8, 5, 7, 2, 5, 3, 4, 5]])
Input:
%timeit x1.dot(y1)
Output:
14.5 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Input:
%timeit np.dot(x1, y1)
Output:
14.8 µs ± 385 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Input:
%timeit np.dot(x.values, y.values)
Output:
21.3 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Input:
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)
def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res
%timeit f(x3, y3)
Output:
2.85 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In general, pure computation runs faster in NumPy.
NumPy leans toward computation; Pandas leans toward data handling.
(3) Broadcasting
Input:
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
Output:
- Broadcasting across rows
Input:
x.iloc[0]
Output:
A 6
B 3
C 7
Name: 0, dtype: int32
Input:
x/x.iloc[0]
Output:
- Broadcasting across columns
Input:
x.A
Output:
0 6
1 4
2 2
Name: A, dtype: int32
Input:
x.div(x.A, axis=0) # also: add, sub, mul
Output:
Input:
x.div(x.iloc[0], axis=1)
Output:
3. New behaviors
(1) Index alignment
Input:
A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
Output:
Input:
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
Output:
- Pandas automatically aligns the indices of the two objects; positions without a value become np.nan
Input:
A+B
Output:
- Missing values can instead be filled with fill_value
Input:
A.add(B, fill_value=0)
Output:
Input:
A*B
Output:
(2) Statistics
- Counting distinct values
Input:
y = np.random.randint(3, size=20)
y
Output:
array([0, 2, 1, 0, 1, 1, 1, 0, 1, 0, 1, 2, 2, 0, 2, 2, 1, 0, 1, 1])
Input:
np.unique(y)
Output:
array([0, 1, 2])
Input:
from collections import Counter
Counter(y)
Output:
Counter({0: 6, 2: 5, 1: 9})
Input:
y1 = pd.DataFrame(y, columns=["A"])
y1
Output:
Input:
np.unique(y1)
Output:
array([0, 1, 2])
Input:
y1["A"].value_counts()
Output:
1 9
0 6
2 5
Name: A, dtype: int64
- Deriving new results and sorting
Input:
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)
GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468}
GDP = pd.Series(GDP_dict)
city_info = pd.DataFrame({"population": population, "GDP": GDP})
city_info
Output:
Input:
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info
Output:
Ascending sort
Input:
city_info.sort_values(by="per_GDP")
Output:
Descending sort
Input:
city_info.sort_values(by="per_GDP", ascending=False)
Output:
Sorting by index along an axis
Input:
data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
Output:
Row sort
Input:
data.sort_index()
Output:
Column sort
Input:
data.sort_index(axis=1)
Output:
Input:
data.sort_index(axis=1, ascending=False)
Output:
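sort_values can also take several keys at once, with a separate sort direction for each. A minimal sketch (the small frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 1], "B": [1, 3, 2]})
# sort by A ascending first, then break ties by B descending
out = df.sort_values(by=["A", "B"], ascending=[True, False])
print(out)
```

The two rows with A=1 stay together, ordered by B from largest to smallest.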
- Statistical methods
Input:
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)), columns=list("ABCD"))
df
Output:
Non-null count
Input:
df.count()
Output:
A 6
B 6
C 6
D 6
dtype: int64
Sum
Input:
df.sum()
Output:
A 9.480388
B 23.085391
C 12.228453
D 20.948778
dtype: float64
Input:
df.sum(axis=1)
Output:
0 29.387197
1 7.033140
2 1.297668
3 4.998616
4 15.101796
5 7.924592
dtype: float64
Minimum and maximum
Input:
df.min()
Output:
A -4.252267
B -1.182524
C -4.039829
D -1.078671
dtype: float64
Input:
df.max(axis=1)
Output:
0 12.624040
1 4.886010
2 1.825853
3 5.722338
4 4.793761
5 6.659075
dtype: float64
Input:
df
Output:
Input:
df.idxmax()
Output:
A 0
B 0
C 3
D 0
dtype: int64
Mean
Input:
df.mean()
Output:
A 1.580065
B 3.847565
C 2.038075
D 3.491463
dtype: float64
Variance
Input:
df.var()
Output:
A 14.188574
B 25.808212
C 11.890393
D 11.415976
dtype: float64
Standard deviation
Input:
df.std()
Output:
A 3.766772
B 5.080178
C 3.448245
D 3.378754
dtype: float64
Median
Input:
df.median()
Output:
A 2.585222
B 2.519051
C 2.439898
D 3.837779
dtype: float64
Mode
Input:
data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
Output:
Input:
data.mode()
Output:
75th percentile
Input:
df.quantile(0.75)
Output:
A 4.556137
B 5.667908
C 4.250340
D 4.842275
Name: 0.75, dtype: float64
Everything at once
Input:
df.describe()
Output:
Input:
data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
Output:
Input:
data_2.describe()
Output:
Correlation coefficients and covariance
Input:
df.corr() # e.g. the correlation of column A with A, A with B, ...
Output:
Input:
df.corrwith(df["A"])
Output:
A 1.000000
B 0.672049
C -0.344121
D 0.333825
dtype: float64
Custom outputs
Usage of apply(method): by default, applies method to each column
Input:
df
Output:
Input:
df.apply(np.cumsum)
Output:
Input:
df.apply(np.cumsum, axis=1)
Output:
Input:
df.apply(sum)
Output:
A 9.480388
B 23.085391
C 12.228453
D 20.948778
dtype: float64
Input:
df.sum()
Output:
A 9.480388
B 23.085391
C 12.228453
D 20.948778
dtype: float64
Input:
df.apply(lambda x: x.max()-x.min())
Output:
A 9.394396
B 13.806564
C 9.762167
D 9.645756
dtype: float64
Input:
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()],
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)
Output:
V. Handling Missing Values
1. Detecting missing values
Input:
data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
data
Output:
Note: when None, strings, etc. are present, every column's dtype becomes object, which uses more resources than int or float
Input:
data.dtypes
Output:
A object
B object
C object
dtype: object
Input:
data.isnull()
Output:
Input:
data.notnull()
Output:
2. Dropping missing values
Input:
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]), columns=["A", "B", "C", "D"])
data
Output:
Note: np.nan is a special floating-point value
Input:
data.dtypes
Output:
A float64
B float64
C float64
D float64
dtype: object
(1) Dropping whole rows
Input:
data.dropna()
Output:
(2) Dropping whole columns
Input:
data.dropna(axis="columns")
Output:
Input:
data["D"] = np.nan
data
Output:
Input:
data.dropna(axis="columns", how="all")
Output:
Input:
data.dropna(axis="columns", how="any")
Output:
Input:
data.loc[3] = np.nan
data
Output:
Input:
data.dropna(how="all")
Output:
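Besides how="any"/"all", dropna also accepts a thresh parameter: a row (or column) survives only if it has at least that many non-missing values. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, np.nan, 2, 3],
                     [np.nan, 4, 5, 6],
                     [np.nan, np.nan, np.nan, 9]], columns=list("ABCD"))
# thresh=3: keep only rows with at least 3 non-null values
kept = data.dropna(thresh=3)
print(kept)
```

Here the first two rows each have three valid values and survive; the last row has only one and is dropped.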
3. Filling missing values
Input:
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]), columns=["A", "B", "C", "D"])
data
Output:
Input:
data.fillna(value=5)
Output:
- Replacing with the mean
Input:
fill = data.mean()
fill
Output:
A 6.000000
B 7.666667
C 6.333333
D 7.750000
dtype: float64
Input:
data.fillna(value=fill)
Output:
Input:
data.stack()
Output:
0  A     1.0
   C     2.0
   D     3.0
1  B     4.0
   C     5.0
   D     6.0
2  A     7.0
   B     8.0
   D     9.0
3  A    10.0
   B    11.0
   C    12.0
   D    13.0
dtype: float64
Input:
fill = data.stack().mean()
fill
Output:
7.0
Input:
data.fillna(value=fill)
Output:
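Missing values can also be filled from their neighbors instead of a constant or the mean; the standard pandas methods for this are ffill() and bfill(). A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
forward = s.ffill()   # carry the last valid value forward
backward = s.bfill()  # pull the next valid value backward
print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
```

Forward fill is common for time series, where the last observed value is the natural stand-in.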
VI. Merging Data
- A helper function that builds a DataFrame
Input:
def make_df(cols, ind):
    "A simple DataFrame"
    data = {c: [str(c)+str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
make_df("ABC", range(3))
Output:
1. Vertical concatenation
Input:
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
Output:
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
Input:
pd.concat([df_1, df_2])
Output:
2. Horizontal concatenation
Input:
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
Output:
A B
0 A0 B0
1 A1 B1
C D
0 C0 D0
1 C1 D1
Input:
pd.concat([df_3, df_4], axis=1)
Output:
3. Overlapping indices
- Overlapping rows
Input:
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
Output:
A B
1 A1 B1
2 A2 B2
A B
1 A1 B1
2 A2 B2
Input:
pd.concat([df_5, df_6])
Output:
Input:
pd.concat([df_5, df_6], ignore_index=True)
Output:
- Overlapping columns
Input:
df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
1 B1 C1 D1
2 B2 C2 D2
Input:
pd.concat([df_7, df_8], axis=1)
Output:
Input:
pd.concat([df_7, df_8], axis=1, ignore_index=True)
Output:
4. Aligned merging with merge()
Input:
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
Output:
A B
1 A1 B1
2 A2 B2
B C
1 B1 C1
2 B2 C2
Input:
pd.merge(df_9, df_10)
Output:
Input:
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
Output:
A B
1 A1 B1
2 A2 B2
C B
2 C2 B2
1 C1 B1
Input:
pd.merge(df_9, df_10)
Output:
Example: merging city information
Input:
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
Output:
Input:
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
Output:
Input:
city_info = pd.merge(population, GDP)
city_info
Output:
Input:
city_info = pd.merge(population, GDP, how="outer")
city_info
Output:
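In the example above merge() discovers the shared city column on its own. When the key columns have different names in the two frames, left_on/right_on can name them explicitly. A sketch with made-up frames:

```python
import pandas as pd

pop = pd.DataFrame({"city": ["BeiJing", "HangZhou"], "pop": [2154, 981]})
gdp = pd.DataFrame({"name": ["BeiJing", "ShangHai"], "GDP": [30320, 32680]})
# key columns are named differently, so point merge at each one explicitly;
# how="left" keeps every row of pop, filling unmatched GDP with NaN
merged = pd.merge(pop, gdp, left_on="city", right_on="name", how="left")
print(merged)
```

Both key columns survive in the result; the redundant one can be dropped afterwards with drop(columns="name").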
VII. Grouping and Pivot Tables
Input:
df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                   "data1": range(6),
                   "data2": np.random.randint(0, 10, size=6)})
df
Output:
1. Grouping
- Lazy evaluation
Input:
df.groupby("key")
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001CF610047C0>
Input:
df.groupby("key").sum()
Output:
Input:
df.groupby("key").mean()
Output:
Input:
for i in df.groupby("key"):
    print(str(i))
    print("------------------------")
Output:
('A', key data1 data2
0 A 0 4
5 A 5 8)
------------------------
('B', key data1 data2
1 B 1 5
4 B 4 5)
------------------------
('C', key data1 data2
2 C 2 2
3 C 3 4)
------------------------
- Selecting a column
Input:
df.groupby("key")["data2"].sum()
Output:
key
A 12
B 10
C 6
Name: data2, dtype: int32
- Iterating over groups
Input:
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
Output:
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
- Calling methods
Input:
df.groupby("key")["data1"].describe()
Output:
- More complex operations are supported
Input:
df.groupby("key").aggregate(["min", "median", "max"])
Output:
- Filtering
Input:
def filter_func(x):
    return x["data2"].std() > 2
df.groupby("key")["data2"].std()
Output:
key
A 2.828427
B 0.000000
C 1.414214
Name: data2, dtype: float64
Input:
df.groupby("key").filter(filter_func)
Output:
- Transformation
Input:
df
Output:
Input:
df.groupby("key").transform(lambda x: x-x.mean())
Output:
Input:
df.groupby("key").apply(lambda x: x-x.mean())
Output:
- The apply() method
Input:
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x
df.groupby("key").apply(norm_by_data2)
Output:
- Using a list or array as the grouping key
Input:
L = [0, 1, 0, 1, 2, 0] # hand-made labels; the grouping follows these labels
df
Output:
Input:
df.groupby(L).sum() # hand-made labels; the grouping follows these labels
Output:
- Mapping the index to groups with a dict
Input:
df2 = df.set_index("key")
df2
Output:
Input:
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
Output:
- Any Python function
Input:
df2.groupby(str.lower).mean()
Output:
- A list combining any of the valid keys
Input:
df2.groupby([str.lower, mapping]).mean()
Output:
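aggregate() also accepts a dict mapping each column to its own aggregation, which is handy when different columns need different summaries. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "A", "B"],
                   "data1": [0, 1, 2, 3],
                   "data2": [4, 5, 6, 7]})
# per-column aggregations: sum data1, take the max of data2
out = df.groupby("key").aggregate({"data1": "sum", "data2": "max"})
print(out)
```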
Example 1: processing planet-observation data
Input:
import seaborn as sns
planets = sns.load_dataset("planets")
planets.shape
Output:
(1035, 6)
Input:
planets.head()
Output:
Input:
planets.describe()
Output:
Input:
decade = 10 * (planets["year"] // 10) # floor (integer) division
decade.head()
Output:
0 2000
1 2000
2 2010
3 2000
4 2000
Name: year, dtype: int64
Input:
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
Output:
0 2000s
1 2000s
2 2010s
3 2000s
4 2000s
Name: decade, dtype: object
Input:
planets.head()
Output:
Input:
planets.groupby(["method", decade]).sum()
Output:
Input:
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
Output:
2. Pivot tables
Example 2: analyzing Titanic passenger data
Input:
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()
Output:
Input:
T = titanic[titanic.age.notnull()].copy()
T.age.apply(lambda x: 60 if x>=60 else x) # note: the result is not assigned back, so T.age is unchanged
T.age.value_counts()
Output:
24.00 30
22.00 27
18.00 26
19.00 25
30.00 25
..
55.50 1
70.50 1
66.00 1
23.50 1
0.42 1
Name: age, Length: 88, dtype: int64
Input:
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.value_counts()
Output:
20 220
30 167
10 102
40 89
0 62
50 48
60 19
70 6
80 1
Name: age, dtype: int64
Input:
Age.astype(str)+"s"
Output:
0 20s
1 30s
2 20s
3 30s
4 30s
...
885 30s
886 20s
887 10s
889 20s
890 30s
Name: age, Length: 714, dtype: object
Input:
T.groupby(["sex", Age])["survived"].mean().unstack()
Output:
Input:
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
Output:
Input:
titanic.describe()
Output:
Input:
titanic.groupby("sex")[["survived"]].mean() # double brackets return a DataFrame
Output:
Input:
titanic.groupby("sex")["survived"].mean() # single brackets return a Series
Output:
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
Input:
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
Output:
- Pivot tables
Input:
titanic.pivot_table("survived", index="sex", columns="class") # value column, row labels, column labels; more convenient and readable than the groupby approach above
Output:
Input:
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True) # margins adds overall totals
Output:
Input:
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})
Output:
VIII. Other Topics
1. Vectorized string operations
To be filled in later through further study.
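As a starting point for that later study, a minimal sketch of the .str accessor, which applies string methods element-wise across a Series (the names here are made up):

```python
import pandas as pd

names = pd.Series(["peter Parker", "bruce WAYNE", "tony stark"])
print(names.str.title().tolist())                        # normalize capitalization
print(names.str.contains("stark", case=False).tolist())  # vectorized matching
```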
2. Working with time series
To be filled in later through further study.
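Likewise, a minimal time-series sketch: date_range builds a DatetimeIndex (as in the slicing section above), and resample() aggregates over time bins:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-08-01", periods=6, freq="D")
ts = pd.Series(np.arange(6), index=idx)
# group the daily values into 3-day bins and sum within each bin
binned = ts.resample("3D").sum()
print(binned)
```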
3. Hierarchical indexing: for multidimensional data
Input:
base_data = np.array([[1771, 11115],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]])
data = pd.DataFrame(base_data,
                    index=[["BeiJing", "BeiJing", "ShangHai", "ShangHai",
                            "ShenZhen", "ShenZhen", "HangZhou", "HangZhou"],
                           [2008, 2018]*4],
                    columns=["population", "GDP"])
data
Output:
Input:
data.index.names = ["city", "year"]
data
Output:
Input:
data["GDP"]
Output:
city      year
BeiJing   2008    11115
          2018    30320
ShangHai  2008    14070
          2018    32680
ShenZhen  2008     7806
          2018    24222
HangZhou  2008     4789
          2018    13468
Name: GDP, dtype: int32
Input:
data.loc["ShangHai", "GDP"]
Output:
year
2008 14070
2018 32680
Name: GDP, dtype: int32
Input:
data.loc["ShangHai", 2018]["GDP"]
Output:
32680
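A hierarchical index can be pivoted back into columns with unstack(); a sketch that rebuilds a small version of the table above:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["BeiJing", "ShangHai"], [2008, 2018]],
                                 names=["city", "year"])
gdp = pd.Series([11115, 30320, 14070, 32680], index=idx, name="GDP")
# unstack() moves the inner level (year) into the columns
wide = gdp.unstack()
print(wide)
```

stack() performs the inverse, pushing columns back into the inner index level.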
4. High-performance Pandas: eval()
Input:
df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
Output:
8.1 ms ± 83.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- Avoids allocating memory for the intermediate results of compound expressions
Input:
%timeit pd.eval("(df1+df2)/(df3+df4)")
Output:
5.73 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Input:
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)")) # check that the two results agree
Output:
True
- Column-wise operations
Input:
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
Output:
Input:
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
res_1
Output:
0 -1.563409
1 -2.279781
2 -0.478039
3 -3.997710
4 -1.820200
...
995 -18.328017
996 -2.211965
997 -1.651604
998 -193.146853
999 -1.054839
Length: 1000, dtype: float64
Input:
res_2 = df.eval("(A+B)/(C-1)")
res_2
Output:
0 -1.563409
1 -2.279781
2 -0.478039
3 -3.997710
4 -1.820200
...
995 -18.328017
996 -2.211965
997 -1.651604
998 -193.146853
999 -1.054839
Length: 1000, dtype: float64
Input:
np.allclose(res_1, res_2)
Output:
True
Input:
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
Output:
Input:
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
Output:
- Using local variables
Input:
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()
Output:
0 0.077001
1 0.728563
2 0.079859
3 0.433338
4 0.362091
dtype: float64
5. High-performance Pandas: query()
Input:
df.head()
Output:
Input:
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
Output:
385 µs ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Input:
%timeit df.query("(A < 0.5)&(B > 0.5)")
Output:
1.57 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Observation: on small data, the ordinary method is actually faster, because eval() and query() both carry a fixed memory overhead.
Input:
df.query("(A < 0.5)&(B > 0.5)").head()
Output:
Input:
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
Output:
True
6. When to use eval() and query()
For small arrays, ordinary methods are faster
Input:
df.values.nbytes # check the data size in bytes
Output:
32000
Input:
df1.values.nbytes # check the data size in bytes
Output:
8000000
IX. Summary