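Preface
This post is translated from "10 minutes to pandas".
pandas version used: 0.23.0
I. What is pandas
pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.
pandas provides two main data structures:
- Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, and so on).
- DataFrame is a two-dimensional labeled data structure whose columns may have different types. Think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is used more often than Series.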
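To make the "dict of Series" view concrete, here is a minimal sketch. The names height, age, and people and the values in them are made up for illustration and are not part of the original guide.

```python
import pandas as pd

# Two Series that share the same labels (hypothetical example data)
height = pd.Series([1.70, 1.82], index=["ann", "bob"])
age = pd.Series([31, 45], index=["ann", "bob"])

# A DataFrame can be built from a dict of Series; the keys become column names
people = pd.DataFrame({"height": height, "age": age})

# Selecting one column gives back a Series
age_column = people["age"]
print(type(age_column).__name__)
print(people.shape)
```

Each column keeps its own dtype, which is why a DataFrame can mix floats and integers side by side.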
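II. Usage
1. Object creation
Pass a list to create a Series; pandas assigns a default integer index. Because the list contains np.nan, the values are stored as float64.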
Series
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
DataFrame
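- Pass a NumPy array, use the DatetimeIndex dates as the index, and set the column labels with columns to create a DataFrame. np.random.randn returns different values on every run, so your numbers will differ from the output shown here.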
import numpy as np
import pandas as pd
dates = pd.date_range("20130101", periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
A B C D
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969
2013-01-02 0.865408 -2.301539 1.744812 -0.761207
2013-01-03 0.319039 -0.249370 1.462108 -2.060141
2013-01-04 -0.322417 -0.384054 1.133769 -1.099891
2013-01-05 -0.172428 -0.877858 0.042214 0.582815
2013-01-06 -1.100619 1.144724 0.901591 0.502494
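- Pass a dict to create a DataFrame. The dict keys become the column names, the rows get the integer index 0 to 3, and scalar values are repeated for every row.
Different columns of a DataFrame can have different dtypes.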
f2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
print(f2)
print(f2.dtypes)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
2. Viewing data
View the first and last few rows
f2.head(2)
# 2 is the number of rows; if omitted, head() returns the first 5 rows
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
f2.tail(2)
# 2 is the number of rows; if omitted, tail() returns the last 5 rows
A B C D E F
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
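A quick way to confirm the default row count of head() and tail() is to call them on a frame with more than five rows. The 8-row frame below is a made-up example.

```python
import pandas as pd

frame = pd.DataFrame({"x": range(8)})  # hypothetical 8-row frame

first_default = frame.head()   # no argument: first 5 rows
last_default = frame.tail()    # no argument: last 5 rows
first_two = frame.head(2)      # explicit row count

print(len(first_default), len(last_default), len(first_two))  # 5 5 2
```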
View the index and the column names
f2.index
Int64Index([0, 1, 2, 3], dtype='int64')
f2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
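Get summary statistics
describe() returns the count, mean, standard deviation, minimum, quartiles, and maximum of the numeric columns of a DataFrame.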
print(f2.describe())
# Only the numeric columns A, C, and D (float64, float32, int32) are summarized
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
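The sketch below shows the same behavior on a small frame with one numeric and one string column (both invented for this example): by default only the numeric column is summarized, and include="all" asks describe() to report on every column.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column
frame = pd.DataFrame({"num": [1.0, 2.0, 3.0, 4.0],
                      "label": ["a", "b", "a", "b"]})

summary = frame.describe()                    # numeric columns only
full_summary = frame.describe(include="all")  # every column

print(list(summary.columns))       # ['num']
print(summary.loc["50%", "num"])   # 2.5 (the median)
```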
Transpose
print(f2.T)
Sorting
Original data:
A B C D
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969
2013-01-02 0.865408 -2.301539 1.744812 -0.761207
2013-01-03 0.319039 -0.249370 1.462108 -2.060141
2013-01-04 -0.322417 -0.384054 1.133769 -1.099891
2013-01-05 -0.172428 -0.877858 0.042214 0.582815
2013-01-06 -1.100619 1.144724 0.901591 0.502494
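- Sort by axis labels: axis=1 sorts the columns by column name, axis=0 sorts the rows by index label, and ascending controls whether the order is ascending.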
print(df.sort_index(axis=1, ascending=False))
# Columns in descending order of column name
D C B A
2013-01-01 -1.072969 -0.528172 -0.611756 1.624345
2013-01-02 -0.761207 1.744812 -2.301539 0.865408
2013-01-03 -2.060141 1.462108 -0.249370 0.319039
2013-01-04 -1.099891 1.133769 -0.384054 -0.322417
2013-01-05 0.582815 0.042214 -0.877858 -0.172428
2013-01-06 0.502494 0.901591 1.144724 -1.100619
- Sort by the values of one column
print(df.sort_values(by="B"))
A B C D
2013-01-02 0.865408 -2.301539 1.744812 -0.761207
2013-01-05 -0.172428 -0.877858 0.042214 0.582815
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969
2013-01-04 -0.322417 -0.384054 1.133769 -1.099891
2013-01-03 0.319039 -0.249370 1.462108 -2.060141
2013-01-06 -1.100619 1.144724 0.901591 0.502494
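Because the random data makes sorting hard to check by eye, here is a small deterministic sketch of sort_index and sort_values. The frame and its labels are made up for the example.

```python
import pandas as pd

# Hypothetical frame with unordered row labels and column names
frame = pd.DataFrame({"B": [1, 3, 2], "A": [9, 8, 7]},
                     index=["r2", "r0", "r1"])

by_value = frame.sort_values(by="B")                        # rows ordered by the values in B
by_value_desc = frame.sort_values(by="B", ascending=False)  # largest B first
by_row_label = frame.sort_index(axis=0)                     # rows ordered by index label
by_col_label = frame.sort_index(axis=1)                     # columns ordered by column name

print(list(by_value.index))        # ['r2', 'r1', 'r0']
print(list(by_row_label.index))    # ['r0', 'r1', 'r2']
print(list(by_col_label.columns))  # ['A', 'B']
```

None of these calls modifies frame; each returns a new, sorted DataFrame.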
3. Selection
Original data:
A B C D
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969
2013-01-02 0.865408 -2.301539 1.744812 -0.761207
2013-01-03 0.319039 -0.249370 1.462108 -2.060141
2013-01-04 -0.322417 -0.384054 1.133769 -1.099891
2013-01-05 -0.172428 -0.877858 0.042214 0.582815
2013-01-06 -1.100619 1.144724 0.901591 0.502494
Selecting by column
Select a single column by name; this returns a Series.
print(df["A"])
2013-01-01 1.624345
2013-01-02 0.865408
2013-01-03 0.319039
2013-01-04 -0.322417
2013-01-05 -0.172428
2013-01-06 -1.100619
Freq: D, Name: A, dtype: float64
Select multiple columns by name; this returns a DataFrame.
print(df.loc[:, ["A", "B"]])
A B
2013-01-01 1.624345 -0.611756
2013-01-02 0.865408 -2.301539
2013-01-03 0.319039 -0.249370
2013-01-04 -0.322417 -0.384054
2013-01-05 -0.172428 -0.877858
2013-01-06 -1.100619 1.144724
Selecting by row
Select a single row by its label
print(df.loc["2013-01-03"])
A 0.319039
B -0.249370
C 1.462108
D -2.060141
Name: 2013-01-03 00:00:00, dtype: float64
Select multiple rows by slicing on row position (the stop position is excluded)
print(df[0:2])
A B C D
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969
2013-01-02 0.865408 -2.301539 1.744812 -0.761207
Selecting by row and column
Select a single value from the DataFrame with one row label and one column name
print(df.loc[dates[0], "A"])
print(df.at[dates[0], "A"])
# The two statements above return the same value
1.6243453636632417
1.6243453636632417
Select data with one row label and multiple column names
print(df.loc["20130102", ["A", "B"]])
A 0.865408
B -2.301539
Name: 2013-01-02 00:00:00, dtype: float64
Select data with a start row label, an end row label, and multiple column names; both endpoints are included
print(df.loc["20130102":"20130104", ["A", "B"]])
A B
2013-01-02 0.865408 -2.301539
2013-01-03 0.319039 -0.249370
2013-01-04 -0.322417 -0.384054
Selecting by position
Select the row at position 3, i.e. the fourth row
print(df.iloc[3])
# Same as df.loc["2013-01-04"]
# df[3] does not work here: it looks for a column named 3 and raises a KeyError
A -0.322417
B -0.384054
C 1.133769
D -1.099891
Name: 2013-01-04 00:00:00, dtype: float64
Select the first and second columns (positions 0 and 1) of the rows at positions 3 and 4
print(df.iloc[3:5, 0:2])
A B
2013-01-04 -0.322417 -0.384054
2013-01-05 -0.172428 -0.877858
Select the first and third columns (positions 0 and 2) of the rows at positions 1, 2, and 4
print(df.iloc[[1, 2, 4], [0, 2]])
A C
2013-01-02 0.865408 1.744812
2013-01-03 0.319039 1.462108
2013-01-05 -0.172428 0.042214
A bare : selects all rows or all columns
print(df.iloc[1:3, :])
A B C D
2013-01-02 0.865408 -2.301539 1.744812 -0.761207
2013-01-03 0.319039 -0.249370 1.462108 -2.060141
print(df.iloc[:, 1:3])
B C
2013-01-01 -0.611756 -0.528172
2013-01-02 -2.301539 1.744812
2013-01-03 -0.249370 1.462108
2013-01-04 -0.384054 1.133769
2013-01-05 -0.877858 0.042214
2013-01-06 1.144724 0.901591
Select a single value by row position and column position
print(df.iloc[1, 1])
print(df.iat[1, 1])
-2.3015386968802827
-2.3015386968802827
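One difference between label-based and position-based selection is easy to miss: a loc slice includes its end label, while an iloc slice excludes its stop position. The sketch below uses a made-up single-column frame to show this.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)  # hypothetical data

by_label = frame.loc["20130102":"20130104"]  # label slice: both endpoints included
by_position = frame.iloc[1:3]                # position slice: stop position excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
print(frame.iat[1, 0] == frame.loc[dates[1], "A"])  # True: same cell, two accessors
```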
Boolean indexing
Select data according to boolean values
Select the rows of df where the value in column A is greater than 0
print(df[df["A"] > 0])
A B C D
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969
2013-01-02 0.865408 -2.301539 1.744812 -0.761207
2013-01-03 0.319039 -0.249370 1.462108 -2.060141
Select the values in df that are greater than 0; all other values are replaced with NaN
print(df[df > 0])
A B C D
2013-01-01 1.624345 NaN NaN NaN
2013-01-02 0.865408 NaN 1.744812 NaN
2013-01-03 0.319039 NaN 1.462108 NaN
2013-01-04 NaN NaN 1.133769 NaN
2013-01-05 NaN NaN 0.042214 0.582815
2013-01-06 NaN 1.144724 0.901591 0.502494
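Boolean masks can also be combined or built with isin(). The following is a small sketch on invented data; note that each condition needs its own parentheses when combined with & or |.

```python
import pandas as pd

# Hypothetical frame for illustration
frame = pd.DataFrame({"A": [1, -2, 3, -4],
                      "E": ["one", "two", "three", "two"]})

# Rows where A is negative AND E equals "two"
negative_two = frame[(frame["A"] < 0) & (frame["E"] == "two")]

# Rows whose E value is one of the listed labels
in_set = frame[frame["E"].isin(["two", "three"])]

print(list(negative_two.index))  # [1, 3]
print(list(in_set.index))        # [1, 2, 3]
```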
Modifying the original data
Add a column named F to df. Values are aligned on the index; labels with no match get NaN.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print(s1)
df["F"] = s1
print(df)
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
A B C D F
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969 NaN
2013-01-02 0.865408 -2.301539 1.744812 -0.761207 1.0
2013-01-03 0.319039 -0.249370 1.462108 -2.060141 2.0
2013-01-04 -0.322417 -0.384054 1.133769 -1.099891 3.0
2013-01-05 -0.172428 -0.877858 0.042214 0.582815 4.0
2013-01-06 -1.100619 1.144724 0.901591 0.502494 5.0
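The alignment rule is easier to see with short string labels. In this sketch (made-up data), the label "a" has no match in the Series and becomes NaN, while the extra label "d" in the Series is simply dropped.

```python
import pandas as pd

frame = pd.DataFrame({"A": [10, 20, 30]}, index=["a", "b", "c"])  # hypothetical
extra = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Assignment aligns on index labels, not on position
frame["F"] = extra

print(frame)
```

The new column is float64 even though extra holds integers, because NaN requires a floating-point dtype here.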
Change existing values in df (the output below was produced from a df without column F)
df.at[dates[0], "A"] = 0
df.iat[0, 1] = 0
df.loc[:, "D"] = np.array([5] * len(df))
print(df)
A B C D
2013-01-01 0.000000 0.000000 -0.528172 5
2013-01-02 0.865408 -2.301539 1.744812 5
2013-01-03 0.319039 -0.249370 1.462108 5
2013-01-04 -0.322417 -0.384054 1.133769 5
2013-01-05 -0.172428 -0.877858 0.042214 5
2013-01-06 -1.100619 1.144724 0.901591 5
4. Missing data
pandas uses np.nan to represent missing data; by default it is excluded from computations.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
print(df1)
A B C D E
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969 1.0
2013-01-02 0.865408 -2.301539 1.744812 -0.761207 1.0
2013-01-03 0.319039 -0.249370 1.462108 -2.060141 NaN
2013-01-04 -0.322417 -0.384054 1.133769 -1.099891 NaN
Drop the rows that contain missing values
print(df1.dropna(axis=0, how="any"))
A B C D E
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969 1.0
2013-01-02 0.865408 -2.301539 1.744812 -0.761207 1.0
Drop the columns that contain missing values
print(df1.dropna(axis=1, how="any"))
A B C D
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969
2013-01-02 0.865408 -2.301539 1.744812 -0.761207
2013-01-03 0.319039 -0.249370 1.462108 -2.060141
2013-01-04 -0.322417 -0.384054 1.133769 -1.099891
Replace missing values with a given value
print(df1.fillna(value=5))
A B C D E
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969 1.0
2013-01-02 0.865408 -2.301539 1.744812 -0.761207 1.0
2013-01-03 0.319039 -0.249370 1.462108 -2.060141 5.0
2013-01-04 -0.322417 -0.384054 1.133769 -1.099891 5.0
Get a boolean mask that marks the missing values
print(pd.isna(df1))
A B C D E
2013-01-01 False False False False False
2013-01-02 False False False False False
2013-01-03 False False False False True
2013-01-04 False False False False True
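The missing-data calls above can be checked on a tiny frame. This is a sketch with invented values; it also shows that NaN is skipped when computing a mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column E has two missing values
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                      "E": [1.0, np.nan, np.nan]})

rows_kept = frame.dropna(axis=0, how="any")  # keeps only the first row
cols_kept = frame.dropna(axis=1, how="any")  # keeps only column A
filled = frame.fillna(value=0)               # NaN replaced by 0
mean_e = frame["E"].mean()                   # NaN skipped, so the mean is 1.0
n_missing = pd.isna(frame).sum().sum()       # total number of missing cells

print(len(rows_kept), list(cols_kept.columns), mean_e, n_missing)
```

dropna and fillna return new objects, so frame itself still contains the NaN values afterwards.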
5. Operations
Statistics
Mean of each column (axis=0 aggregates down the rows)
print(df.mean(0))
A 0.202221
B -0.546642
C 0.792720
D -0.651483
dtype: float64
Mean of each row (axis=1 aggregates across the columns)
print(df.mean(1))
2013-01-01 -0.147138
2013-01-02 -0.113132
2013-01-03 -0.132091
2013-01-04 -0.168148
2013-01-05 -0.106314
2013-01-06 0.362047
Freq: D, dtype: float64
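The axis argument is a common source of confusion, so here is a minimal sketch on a 2x2 frame (values invented) where the results are easy to verify by hand.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 3], "B": [5, 7]}, index=["r0", "r1"])  # hypothetical

col_means = frame.mean(axis=0)  # one value per column: A 2.0, B 6.0
row_means = frame.mean(axis=1)  # one value per row: r0 3.0, r1 5.0

print(col_means)
print(row_means)
```

A useful reading: axis names the direction that is collapsed, so axis=0 collapses the rows and leaves one result per column.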
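s is aligned with df on the index and broadcast across the columns, then subtracted element by element. shift(2) moves the values down two positions, so rows where s has no value yield NaN.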
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print(s)
print(df.sub(s, axis="index"))
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 -0.680961 -1.249370 0.462108 -3.060141
2013-01-04 -3.322417 -3.384054 -1.866231 -4.099891
2013-01-05 -5.172428 -5.877858 -4.957786 -4.417185
2013-01-06 NaN NaN NaN NaN
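A smaller version of the same subtraction, with made-up numbers, shows how one Series value is applied to an entire row and how NaN propagates.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one offset per row label
frame = pd.DataFrame({"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]},
                     index=["x", "y", "z"])
offsets = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])

# axis="index" matches the Series labels against the row labels
result = frame.sub(offsets, axis="index")

print(result)
```

Row x becomes 9.0 and 0.0, row z becomes 27.0 and 0.0, and every value in row y is NaN because its offset is missing.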
Apply a function along a given axis
Cumulative sum down each column (axis=0)
print(df.apply(np.cumsum,axis=0))
A B C D
2013-01-01 1.624345 -0.611756 -0.528172 -1.072969
2013-01-02 2.489753 -2.913295 1.216640 -1.834176
2013-01-03 2.808792 -3.162665 2.678748 -3.894316
2013-01-04 2.486375 -3.546720 3.812517 -4.994207
2013-01-05 2.313947 -4.424578 3.854731 -4.411392
2013-01-06 1.213328 -3.279855 4.756322 -3.908898
Cumulative sum across each row (axis=1)
print(df.apply(np.cumsum,axis=1))
A B C D
2013-01-01 1.624345 1.012589 0.484417 -0.588551
2013-01-02 0.865408 -1.436131 0.308681 -0.452526
2013-01-03 0.319039 0.069669 1.531777 -0.528364
2013-01-04 -0.322417 -0.706472 0.427298 -0.672593
2013-01-05 -0.172428 -1.050287 -1.008073 -0.425258
2013-01-06 -1.100619 0.044105 0.945695 1.448190
Difference between the maximum and minimum of each column (axis=0)
print(df.apply(lambda x: x.max() - x.min(),axis=0))
A 2.724965
B 3.446262
C 2.272984
D 2.642956
dtype: float64
Difference between the maximum and minimum of each row (axis=1)
print(df.apply(lambda x: x.max() - x.min(),axis=1))
2013-01-01 2.697314
2013-01-02 4.046350
2013-01-03 3.522249
2013-01-04 2.233661
2013-01-05 1.460674
2013-01-06 2.245343
Freq: D, dtype: float64
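The sketch below repeats the apply() calls on small integer data (invented for this example) so the results can be checked by hand. For the running totals it uses the built-in cumsum method, which gives the same values as apply(np.cumsum) on this data.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})  # hypothetical

# With axis=0 the function receives one column at a time
col_range = frame.apply(lambda col: col.max() - col.min(), axis=0)

# With axis=1 the function receives one row at a time
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

running_down = frame.cumsum(axis=0)    # running total down each column
running_across = frame.cumsum(axis=1)  # running total across each row

print(col_range.tolist())  # [2, 20]
print(row_range.tolist())  # [9, 18, 27]
```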
Histogramming
Count how many times each value occurs
s = pd.Series(np.random.randint(0, 7, size=10))
# default integer index
print(s)
print(s.value_counts())
0 5
1 3
2 1
3 2
4 0
5 4
6 1
7 2
8 2
9 1
dtype: int32
2 3
1 3
5 1
4 1
3 1
0 1
dtype: int64
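Since np.random.randint gives different values on every run, this sketch hard-codes the ten values printed above so the counts are reproducible. The variable names are my own.

```python
import pandas as pd

# The ten values from the printed output above, fixed for reproducibility
values = pd.Series([5, 3, 1, 2, 0, 4, 1, 2, 2, 1])

counts = values.value_counts()  # index: distinct values, data: occurrences

print(counts.loc[1], counts.loc[2], counts.loc[5])  # 3 3 1
print(counts.sum())                                 # 10
```

The result is sorted by count in descending order, so the most frequent values come first.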
String operations
Convert strings to lowercase
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
print(s)
print(s.str.lower())
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
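The str accessor offers more vectorized string methods than lower(), and all of them pass missing values through unchanged. A short sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

words = pd.Series(["A", "Aaba", np.nan, "CABA"])  # hypothetical data

lowered = words.str.lower()  # lowercase each string, NaN stays NaN
uppered = words.str.upper()  # uppercase each string
lengths = words.str.len()    # number of characters in each string

print(lowered.tolist()[:2])  # ['a', 'aaba']
print(pd.isna(lowered[2]))   # True
```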
6. Merge
Concatenate row-wise
df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]
print(pd.concat(pieces))
# The concatenated result is identical to the original df
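To verify that claim without comparing printed output by eye, the sketch below uses deterministic data instead of random numbers and checks the result with DataFrame.equals.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random 10x4 frame
frame = pd.DataFrame(np.arange(40).reshape(10, 4))

# Split into three row blocks, then glue them back together
pieces = [frame[:3], frame[3:7], frame[7:]]
rebuilt = pd.concat(pieces)

print(rebuilt.equals(frame))  # True
```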
Join on a specified column
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print(left)
print(right)
print(pd.merge(left, right, on="key"))
key lval
0 foo 1
1 bar 2
key rval
0 foo 4
1 bar 5
key lval rval
0 foo 1 4
1 bar 2 5
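Two cases are worth knowing beyond the one-to-one match: duplicate keys, and keys that appear on one side only. The frames below are invented for this sketch; how="left" is a standard pd.merge option.

```python
import pandas as pd

# Duplicate keys: every left "foo" row pairs with every right "foo" row
left_dup = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right_dup = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
merged_dup = pd.merge(left_dup, right_dup, on="key")  # 2 x 2 = 4 rows

# Unmatched keys: the default inner join drops "baz", a left join keeps it
left = pd.DataFrame({"key": ["foo", "baz"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo"], "rval": [4]})
inner = pd.merge(left, right, on="key")
left_join = pd.merge(left, right, on="key", how="left")

print(len(merged_dup), len(inner), len(left_join))  # 4 1 2
```

In the left join, the rval cell for "baz" is NaN because the right frame has no row for that key.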
7. Grouping
- Split the data into groups based on some criteria
- Apply a function to each group independently
- Combine the results into a data structure
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8)
})
print(df)
print(df.groupby("A").sum())
print(df.groupby(["A", "B"]).sum())
C D
A
bar -3.986264 -2.693565
foo 2.945186 1.492608
C D
A B
bar one -0.611756 -0.249370
three -1.072969 -2.060141
two -2.301539 -0.384054
foo one 3.369157 1.452809
three -0.761207 -1.099891
two 0.337236 1.139691
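The split, apply, and combine steps are easier to follow with numbers that can be added up by hand. This sketch uses invented data and selects column C explicitly, which keeps the result independent of how a given pandas version treats the string column B when summing.

```python
import pandas as pd

# Hypothetical data for illustration
frame = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1, 2, 3, 4],
})

totals = frame.groupby("A")["C"].sum()         # one sum per value of A
nested = frame.groupby(["A", "B"])["C"].sum()  # one sum per (A, B) pair

print(totals)
print(nested)
```

Grouping by two columns produces a result with a two-level index, so individual sums are looked up with a tuple such as ("foo", "two").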
- Other
The following less frequently used topics are not covered yet and will be added later:
Reshaping
Time series
Categoricals
Plotting
Getting data in/out
Gotchas
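Summary
This post is only a brief introduction to pandas, which offers many more functions and methods for processing data quickly and conveniently.
For more advanced usage, see https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook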