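Pandas is Python's core data-analysis library. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Pandas is built on NumPy and integrates well with other third-party scientific computing libraries.
The two main data structures are Series (one-dimensional) and DataFrame (two-dimensional). Together they cover most typical use cases in finance, statistics, social science, engineering, and similar fields.
Pandas can read data directly and process it simply and efficiently; it works with many databases and supports a wide range of analysis operations.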
import numpy as np
import pandas as pd
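I. The Series data structure
A Series is an ordered, labeled one-dimensional array that can hold integers, floats, strings, Python objects, and other data types. The axis labels are collectively called the index.
1. Creation: pd.Series(data, index=[])
1) From an ndarray
When data is an ndarray, index must have the same length as data. If index is not given, a numeric index [0, …, len(data) - 1] is created.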
import pandas as pd
s = pd.Series(np.random.randn(5))
s # <class 'pandas.core.series.Series'>
'''
0 -1.335396
1 -1.302328
2 0.048324
3 -0.145933
4 -0.821122
dtype: float64
'''
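# .index shows the Series index; its type is <class 'pandas.indexes.range.RangeIndex'> (pandas.core.indexes.range.RangeIndex in newer versions)
s.index
# .values shows the Series values; its type is <class 'numpy.ndarray'>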
s.values
pd.Series(np.random.randn(5), index = list('abcde'))
'''
a -0.290481
b 1.893950
c 0.927622
d 0.233786
e 1.502956
dtype: float64
'''
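2) From a dict
When data is a dict and index is not given, the Series index follows the dict's insertion order (Python >= 3.6 and Pandas >= 0.23); on older versions the index is sorted by key.
If index is given, the values in data that match the index labels are extracted.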
d = {'b': 1, 'd': 4, 'a': 0, 'c': 2}
pd.Series(d)
'''
Sorted by key (output from an older version; with Python >= 3.6 and Pandas >= 0.23 the order is b, d, a, c)
a 0
b 1
c 2
d 4
dtype: int64
'''
pd.Series(d, index = ['a', 'c', 'b', 'b'])
'''
a 0
c 2
b 1
b 1
dtype: int64
'''
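The ordering rule and the label lookup described above can be checked with a small sketch. It assumes Python >= 3.6 and Pandas >= 0.23; the label 'z' and the variable name picked are illustrative and not from the original.

```python
import pandas as pd

d = {'b': 1, 'd': 4, 'a': 0, 'c': 2}

# Without index: the keys keep the dict's insertion order
s = pd.Series(d)
print(list(s.index))

# With index: values are looked up by label;
# a label that is not a key of the dict ('z' here) becomes NaN
picked = pd.Series(d, index=['a', 'z'])
print(picked)
```

Because of the NaN, the second Series is stored as float64 rather than int64.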
3) From a scalar value
When data is a scalar value, an index must be provided. The scalar is repeated to match the length of the index.
pd.Series('cong', index = range(5))
'''
0 cong
1 cong
2 cong
3 cong
4 cong
dtype: object
'''
2. Indexing and slicing
s = pd.Series(np.random.randn(6), index = ['a', 'b', 'c', 'd', 'e', 'f'])
'''
a 1.367450
b 0.435309
c 1.596586
d -0.083855
e -1.464273
f -2.672955
dtype: float64
'''
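1) Positional indexing and slicing
Positional slices are half-open, [start, end): the start is included and the end is excluded. With a non-integer index, a bare integer key such as s[3] is treated as a position; newer Pandas releases deprecate this usage, and s.iloc[3] is the explicit form.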
s[3]
# -0.0838549595595
s[-1]
# -2.67295467388
s[1:-2]
'''
b 0.435309
c 1.596586
d -0.083855
dtype: float64
'''
s[::2]
'''
a 1.367450
c 1.596586
e -1.464273
dtype: float64
'''
2) Label indexing and slicing
Label slices are closed on both ends, [start, end]: the end label is included.
s['b']
# 0.435309196563
s[['b','a','d']]
'''
b 0.435309
a 1.367450
d -0.083855
dtype: float64
'''
s['b':'e']
'''
b 0.435309
c 1.596586
d -0.083855
e -1.464273
dtype: float64
'''
s['b'::2]
'''
b 0.435309
d -0.083855
f -2.672955
dtype: float64
'''
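The difference between the two slicing rules is easiest to see with the explicit accessors. This is a minimal sketch with made-up values; .iloc and .loc are standard Pandas accessors, while the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Positional slice: positions 1 and 2, the end position 3 is excluded
by_position = s.iloc[1:3]

# Label slice: labels 'b' through 'c', the end label is included
by_label = s.loc['b':'c']

# Negative positions count from the end
last = s.iloc[-1]

print(by_position)
print(by_label)
print(last)
```

Both slices select the same two elements here, although the positional slice names an end one step further along.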
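3) Boolean indexing
A comparison on a Series returns a new Series of boolean values, which can then be used to select elements.
.isnull() / .notnull() test whether a value is missing. None (empty value) and NaN (missing value) are both recognized as missing.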
s = pd.Series(np.random.rand(3)*100)
s[2] = None
s[5] = None
s
'''
0 3.35074
1 76.1909
2 NaN
5 None
dtype: object
'''
s > 50
'''
0 False
1 True
2 False
5 False
dtype: bool
'''
s.isnull()
'''
0 False
1 False
2 True
5 True
dtype: bool
'''
s.notnull()
'''
0 True
1 True
2 False
5 False
dtype: bool
'''
s[s > 50]
'''
1 76.1909
dtype: object
'''
s[s.notnull()]
'''
0 3.35074
1 76.1909
dtype: object
'''
3. Basic operations
1) Viewing the first / last rows: .head() / .tail()
.head() shows the first rows
.tail() shows the last rows
Both show 5 rows by default
s = pd.Series(np.random.rand(50))
s.head()
'''
0 0.459775
1 0.630393
2 0.883201
3 0.976862
4 0.157557
dtype: float64
'''
s.tail(2)
'''
48 0.818718
49 0.776736
dtype: float64
'''
2) .reindex()
.reindex() extracts data by the given index labels; a label that does not exist in the current index gets NaN, or fill_value if one is provided.
s = pd.Series(np.random.rand(3), index = ['a', 'b', 'c'])
s1 = s.reindex(['c', 'a', 'd'])
'''
c 0.779870
a 0.128787
d NaN
dtype: float64
'''
s2 = s.reindex(['c', 'a', 'd'], fill_value = 0)
'''
c 0.779870
a 0.128787
d 0.000000
dtype: float64
'''
3) Alignment
Operations between Series align automatically on labels. A label present in only one operand produces NaN.
chn = pd.Series([88, 72, 95], index = ['Jack', 'Mary', 'Tom'])
math = pd.Series([86, 90, 95], index = ['Lily', 'Mary', 'Jack'])
chn + math
'''
Jack 183.0
Lily NaN
Mary 162.0
Tom NaN
dtype: float64
'''
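When the NaN results are not wanted, one option is the method form of the operator. The sketch below reuses the two Series from the example and assumes that a missing score should count as 0, which is an assumption made here for illustration; Series.add with fill_value is a standard Pandas method.

```python
import pandas as pd

chn = pd.Series([88, 72, 95], index=['Jack', 'Mary', 'Tom'])
math = pd.Series([86, 90, 95], index=['Lily', 'Mary', 'Jack'])

# Labels are still aligned, but a label missing from one side
# is treated as 0 instead of producing NaN (assumption: missing means 0)
total = chn.add(math, fill_value=0)
print(total)
```

The result still has the union of both indexes; only the handling of one-sided labels changes.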
4) Deleting: .drop()
The inplace parameter defaults to False: .drop() leaves the Series unchanged and returns a new Series.
With inplace=True, the Series itself is modified and .drop() returns None.
s = pd.Series(np.random.rand(3))
s.drop(1)
'''
0 0.286693
2 0.542345
dtype: float64
'''
s.drop([0,1])
'''
2 0.542345
dtype: float64
'''
s.drop(1, inplace = True) # None
s
'''
0 0.286693
2 0.542345
dtype: float64
'''
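5) Adding
Add a value directly by assigning to a new integer key or label;
Use .append() to add another Series, which produces a new Series and leaves the original unchanged. (.append() was removed in Pandas 2.0; pd.concat([s1, s2]) is the replacement.)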
s = pd.Series(np.random.rand(2))
s[5] = 100
s['a'] = 100
s
'''
0 0.864331
1 0.662573
5 100.000000
a 100.000000
dtype: float64
'''
s1 = pd.Series(np.random.rand(2), index = ['a', 'c'])
s2 = pd.Series(np.random.rand(2), index = ['b', 'd'])
pd.concat([s1, s2]) # written as s1.append(s2) before Pandas 2.0
'''
a 0.443560
c 0.406555
b 0.899235
d 0.656586
dtype: float64
'''
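For current Pandas versions, the same concatenation can be sketched with pd.concat, which accepts a list of Series and returns a new one. The fixed values and the name combined are illustrative, chosen so that the result is easy to check.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'c'])
s2 = pd.Series([3.0, 4.0], index=['b', 'd'])

# Returns a new Series; s1 and s2 are left unchanged
combined = pd.concat([s1, s2])
print(combined)
```

The index is not sorted: the labels of s1 come first, followed by those of s2.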
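6) The .name attribute and .rename()
A Series has a name attribute. In many cases the name is assigned automatically, in particular when a one-dimensional slice is taken from a DataFrame.
.rename() renames a Series and returns a new Series, which is a different object from the original.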
s = pd.Series(np.random.rand(2), name = "test")
s
'''
0 0.311986
1 0.313821
Name: test, dtype: float64
'''
s.name # test
s2 = s.rename("test2")
s2[1] = 100
s2
'''
0 0.22976
1 100.00000
Name: test2, dtype: float64
'''
s
'''
0 0.229760
1 0.483885
Name: test, dtype: float64
'''
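Since the outputs above come from random data, the same point can be restated with fixed values. This is a small sketch, not the original example: it checks that the renamed Series is a separate object and that writing to it leaves the original untouched.

```python
import pandas as pd

s = pd.Series([0.1, 0.2], name="test")

# rename() with a scalar returns a new Series carrying the new name
s2 = s.rename("test2")
s2[1] = 100

print(s.name, s2.name)
print(s[1], s2[1])
```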
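II. The DataFrame data structure
A DataFrame is a two-dimensional labeled data structure whose columns can have different types.
1. Creation: pd.DataFrame(data, index=[], columns=[])
Summary:
- For a list or ndarray, a specified index must have the same length as the list or ndarray (index relabels the rows; it does not select a subset of them);
- For a dict or Series, a specified index extracts the values for the matching keys.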
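The two rules in the summary can be contrasted in a short sketch. The column name x and the variable names are illustrative; the only behavior relied on is that a length mismatch raises ValueError, while a Series value is selected by label.

```python
import pandas as pd

# list values: index relabels the rows, so its length must match the data
try:
    pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b"])
    length_error = False
except ValueError:
    length_error = True

# Series values: index selects rows by label, so it may be shorter or reordered
ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
picked = pd.DataFrame({"x": ser}, index=["c", "a"])

print(length_error)
print(picked)
```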
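1) Dict of lists / dict of ndarrays
When creating a DataFrame from a dict of lists or a dict of ndarrays:
- columns defaults to the dict keys. The columns can be specified explicitly; a column whose key is not in the dict is filled with NaN.
- index defaults to integer labels. If index is passed, its length must match the length of the lists / ndarrays.
- All the lists / ndarrays must have the same length.
# Create from a dict of lists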
dic = {"name": ["Mary", "Tom", "Jack"],
"age": [18, 25, 17]}
df = pd.DataFrame(dic) # <class 'pandas.core.frame.DataFrame'>
'''
age name
0 18 Mary
1 25 Tom
2 17 Jack
'''
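# .index shows the row labels; the value is RangeIndex(start=0, stop=3, step=1) and the type is <class 'pandas.indexes.range.RangeIndex'> (pandas.core.indexes.range.RangeIndex in newer versions)
df.index
# .columns shows the column labels; the value is Index(['age', 'name'], dtype='object') and the type is <class 'pandas.indexes.base.Index'> (pandas.core.indexes.base.Index in newer versions)
df.columns
# .values shows the values; the type is <class 'numpy.ndarray'>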
df.values
'''
[[18 'Mary']
[25 'Tom']
[17 'Jack']]
'''
# columns defaults to the dict keys; columns can be specified explicitly, and a column whose key is not in the dict is filled with NaN.
dic = {"name": ["Mary", "Tom", "Jack"],
"age": [18, 25, 17]}
df = pd.DataFrame(dic, columns = ["name", "gender"])
'''
name gender
0 Mary NaN
1 Tom NaN
2 Jack NaN
'''
# Create from a dict of ndarrays
dic = {"one":np.random.rand(3),
"two":np.random.rand(3)}
df = pd.DataFrame(dic)
'''
one two
0 0.281193 0.777831
1 0.528591 0.618943
2 0.511128 0.215831
'''
df2 = pd.DataFrame(dic, index = list("bac"))
'''
one two
b 0.586689 0.654119
a 0.247894 0.696588
c 0.797849 0.459429
'''
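2) Dict of Series
When creating a DataFrame from a dict of Series:
- columns defaults to the dict keys. The columns can be specified explicitly; a column whose key is not in the dict is filled with NaN.
- index comes from the Series indexes. If index is not given, the result uses the union of all the Series indexes, with NaN for missing values; if index is given, only the labels listed in index are extracted.
- The Series may have different lengths.
# Create from a dict of Series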
dic = {"one":pd.Series(np.random.rand(3), index = ["a", "b", "c"]),
"two":pd.Series(np.random.rand(2), index = ["b", "c"])}
pd.DataFrame(dic)
'''
one two
a 0.690421 NaN
b 0.067540 0.876073
c 0.213951 0.149273
'''
pd.DataFrame(dic, index = ["b", "a"])
'''
one two
b 0.067540 0.876073
a 0.690421 NaN
'''
pd.DataFrame(dic, columns = ["one", "three"])
'''
one three
a 0.690421 NaN
b 0.067540 NaN
c 0.213951 NaN
'''
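3) Two-dimensional array
When creating a DataFrame from a two-dimensional array:
- The lengths of index and columns must match the number of rows and the number of columns of the array, respectively.
# Create from a two-dimensional array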
arr = np.random.rand(8).reshape(2, 4)
pd.DataFrame(arr)
'''
0 1 2 3
0 0.821861 0.491297 0.306242 0.416237
1 0.743759 0.409836 0.833850 0.557605
'''
pd.DataFrame(arr, index = list("ab"), columns = ["one", "two", "three", "four"])
'''
one two three four
a 0.821861 0.491297 0.306242 0.416237
b 0.743759 0.409836 0.833850 0.557605
'''
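4) List of dicts
When creating a DataFrame from a list of dicts:
- columns defaults to the dict keys. The columns can be specified explicitly; a column whose key is not in any dict is filled with NaN.
- index defaults to integer labels. If index is passed, its length must match the length of the list.
# Create from a list of dicts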
lst = [{"a": 1, "c": 5, "b": 6},
{"b": 3, "c": 3}]
pd.DataFrame(lst)
'''
a b c
0 1.0 6 5
1 NaN 3 3
'''
pd.DataFrame(lst, index = ["one", "two"], columns = ["b", "c"])
'''
b c
one 6 5
two 3 3
'''
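5) Nested dict
When creating a DataFrame from a dict of dicts:
- columns defaults to the outer dict's keys. The columns can be specified explicitly; a column whose key is not in the outer dict is filled with NaN.
- index defaults to the inner dicts' keys. The index can be specified explicitly; a label that is not a key of the inner dicts is filled with NaN.
# Create from a nested dict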
dic = {'Jack': {'math': 99, 'chinese': 83, 'art': 87},
'Mary': {'math': 90, 'chinese': 90, 'art': 91},
'Bob': {'math': 86, 'chinese': 79}}
pd.DataFrame(dic)
'''
Bob Jack Mary
art NaN 87 91
chinese 79.0 83 90
math 86.0 99 90
'''
pd.DataFrame(dic, index = ["math", "chinese", "english"], columns = ["Mary", "Jack", "Tom"])
'''
Mary Jack Tom
math 90.0 99.0 NaN
chinese 90.0 83.0 NaN
english NaN NaN NaN
'''
2. Indexing and slicing
1) Selecting columns: df[col]
Selecting a column that does not exist raises a KeyError.
# Select columns with df[col]
df = pd.DataFrame(np.random.rand(12).reshape(3, 4),
index = ["one", "two", "three"],
columns = ["a", "b", "c", "d"])
'''
a b c d
one 0.601347 0.617820 0.105009 0.128088
two 0.152353 0.242718 0.325983 0.163464
three 0.063760 0.216591 0.028874 0.729303
'''
df["c"] # <class 'pandas.core.series.Series'>
'''
one 0.105009
two 0.325983
three 0.028874
Name: c, dtype: float64
'''
df[["d", "b"]] # <class 'pandas.core.frame.DataFrame'>
'''
d b
one 0.128088 0.617820
two 0.163464 0.242718
three 0.729303 0.216591
'''
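2) Selecting rows by label: df.loc[label]
In Pandas 1.0 and later, a label that does not exist raises a KeyError; older versions returned a row of NaN instead, which is what the list example here shows. .reindex() gives the NaN-filling behavior on current versions.
# Select rows by label with df.loc[label]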
df = pd.DataFrame(np.random.rand(12).reshape(3,4),
columns = ["a", "b", "c", "d"])
'''
a b c d
0 0.515504 0.615885 0.185691 0.716931
1 0.039046 0.875238 0.145669 0.310193
2 0.243892 0.817281 0.701406 0.830673
'''
df.loc[1] # <class 'pandas.core.series.Series'>
'''
a 0.039046
b 0.875238
c 0.145669
d 0.310193
Name: 1, dtype: float64
'''
df.loc[[3, 0]] # <class 'pandas.core.frame.DataFrame'>; raises KeyError in Pandas >= 1.0, where df.reindex([3, 0]) gives this output
'''
a b c d
3 NaN NaN NaN NaN
0 0.515504 0.615885 0.185691 0.716931
'''
# Slice by label: the end label is included
df.loc[0:1] # <class 'pandas.core.frame.DataFrame'>
'''
a b c d
0 0.515504 0.615885 0.185691 0.716931
1 0.039046 0.875238 0.145669 0.310193
'''
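The behavior for a missing label depends on the Pandas version, so a quick check is useful. The sketch below assumes Pandas 1.0 or later, where a list containing a missing label raises KeyError; the integer data and the names raised and padded are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["a", "b", "c", "d"])

# Label 3 does not exist in the index [0, 1, 2]
try:
    df.loc[[3, 0]]
    raised = False
except KeyError:
    raised = True

# reindex() keeps the requested labels and fills the missing row with NaN
padded = df.reindex([3, 0])

print(raised)
print(padded)
```

reindex() is the explicit way to ask for rows that may not exist, which keeps .loc strict about typos in labels.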
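3) Selecting rows by integer position: df.iloc[index]
A single position that is out of bounds raises an IndexError.
# Select rows by integer position with df.iloc[index]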
df = pd.DataFrame(np.random.rand(12).reshape(4, 3),
index = ['one', 'two', 'three', 'four'],
columns = ['a', 'b', 'c'])
'''
a b c
one 0.925215 0.523920 0.825322
two 0.177753 0.069859 0.113256
three 0.490654 0.690347 0.910460
four 0.880070 0.192099 0.161468
'''
df.iloc[1] # <class 'pandas.core.series.Series'>
'''
a 0.177753
b 0.069859
c 0.113256
Name: two, dtype: float64
'''
df.iloc[[3,-1]] # <class 'pandas.core.frame.DataFrame'>
'''
a b c
four 0.88007 0.192099 0.161468
four 0.88007 0.192099 0.161468
'''
df.iloc[::2] # <class 'pandas.core.frame.DataFrame'>
'''
a b c
one 0.925215 0.523920 0.825322
three 0.490654 0.690347 0.910460
'''
# Slice by position: the end position is excluded
df.iloc[1:3] # <class 'pandas.core.frame.DataFrame'>
'''
a b c
two 0.177753 0.069859 0.113256
three 0.490654 0.690347 0.910460
'''
4) Indexing with a boolean vector: df[bool_vec]
# Index with a boolean vector using df[bool_vec]
df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index = ['one', 'two', 'three', 'four'],
columns = ['a', 'b', 'c', 'd'])
'''
a b c d
one 88.144522 68.527028 47.129355 32.633366
two 78.557709 81.340032 26.315262 13.479440
three 25.819493 63.532810 99.830179 22.410082
four 91.287652 88.188402 9.637995 81.761480
'''
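Testing every value: the result keeps the full shape; where the test is True the original value is returned, and where it is False NaN is returned.
# Test every value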
df > 50
'''
a b c d
one True True False False
two True True False False
three False True True False
four True True False True
'''
df[df > 50] # <class 'pandas.core.frame.DataFrame'>
'''
a b c d
one 88.144522 68.527028 NaN NaN
two 78.557709 81.340032 NaN NaN
three NaN 63.532810 99.830179 NaN
four 91.287652 88.188402 NaN 81.76148
'''
Testing a single column to select rows: the rows where the test is True are returned.
# Test a single column to select rows
df["a"] > 50
'''
one True
two True
three False
four True
Name: a, dtype: bool
'''
df[df["a"] > 50] # <class 'pandas.core.frame.DataFrame'>
'''
a b c d
one 88.144522 68.527028 47.129355 32.633366
two 78.557709 81.340032 26.315262 13.479440
four 91.287652 88.188402 9.637995 81.761480
'''
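Testing several columns: the result keeps the full shape; True returns the original value, False returns NaN, and columns not included in the test are all NaN.
# Test several columns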
df[["a", "b"]] > 50
'''
a b
one True True
two True True
three False True
four True True
'''
df[df[["a", "b"]] > 50] # <class 'pandas.core.frame.DataFrame'>
'''
a b c d
one 88.144522 68.527028 NaN NaN
two 78.557709 81.340032 NaN NaN
three NaN 63.532810 NaN NaN
four 91.287652 88.188402 NaN NaN
'''
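Testing several rows: the result keeps the full shape; True returns the original value, False returns NaN, and rows not included in the test are all NaN.
# Test several rows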
df.loc[["two", "three"]] > 50
'''
a b c d
two True True False False
three False True True False
'''
df[df.loc[["two", "three"]] > 50] # <class 'pandas.core.frame.DataFrame'>
'''
a b c d
one NaN NaN NaN NaN
two 78.557709 81.340032 NaN NaN
three NaN 63.532810 99.830179 NaN
four NaN NaN NaN NaN
'''
5) Combining indexing methods
df[["a", "c"]].loc[["one", "four"]]
'''
a c
one 86.963936 18.116991
four 9.805014 62.993647
'''
df[df["a"] < 50].iloc[0]
'''
a 13.059468
b 85.634085
c 83.900548
d 48.594762
Name: two, dtype: float64
'''
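Chained selections like the ones above can often be written as a single .loc call that takes row labels and column labels together. This is a sketch with fixed integer data so that the results are easy to verify; the variable names and the threshold 8 are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["one", "two", "three", "four"],
                  columns=["a", "b", "c", "d"])

# Rows and columns in one step: df.loc[row_labels, column_labels]
sub = df.loc[["one", "four"], ["a", "c"]]

# A boolean mask selects rows, then iloc[0] takes the first matching row
first_small = df.loc[df["a"] < 8].iloc[0]

print(sub)
print(first_small)
```

The single-call form is also the one to prefer when assigning values, since it operates on the original DataFrame rather than on an intermediate result.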
3. Basic operations
1) Transpose: .T
# Transpose
df = pd.DataFrame(np.random.rand(6).reshape(2, 3),
index = ["one", "two"],
columns = ["a", "b", "c"])
'''
a b c
one 0.583898 0.506374 0.651442
two 0.193543 0.416532 0.912228
'''
df.T
'''
one two
a 0.583898 0.193543
b 0.506374 0.416532
c 0.651442 0.912228
'''
2) Viewing the first / last rows: .head() / .tail()
.head() shows the first rows
.tail() shows the last rows
Both show 5 rows by default
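A brief sketch of both methods on a DataFrame, using a made-up 10-row frame; the names top and bottom are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["a", "b"])

top = df.head()       # first 5 rows by default
bottom = df.tail(2)   # last 2 rows

print(top)
print(bottom)
```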
3) Adding and modifying
Rows and columns are added and modified through indexing.
# Add
df = pd.DataFrame(np.random.rand(9).reshape(3, 3) * 100,
index = ["one", "two", "three"],
columns = list("abc"))
df["d"] = 10 # add a column filled with a scalar value
df.loc["four"] = df.loc["one"] + df.loc["two"] # add a row
df
'''
a b c d
one 42.366201 81.319170 65.621465 10.0
two 23.171599 35.551451 55.851890 10.0
three 64.258966 25.451767 37.622228 10.0
four 65.537800 116.870621 121.473355 20.0
'''
# Modify
df = pd.DataFrame(np.random.rand(9).reshape(3, 3) * 100,
index = ["one", "two", "three"],
columns = list("abc"))
df["a"] = df["b"] + df["c"]
df.loc[["two", "three"], "b"] = 20 # one .loc call instead of chained df["b"].loc[...], which may not write back to df
df
'''
a b c
one 84.388654 59.621808 24.766845
two 129.506297 20.000000 91.568479
three 116.821680 20.000000 89.145148
'''
4) Deleting: del / .drop()
# Delete
df = pd.DataFrame(np.random.rand(25).reshape(5, 5) * 100,
index = ["one", "two", "three", "four", "five"],
columns = list("abcde"))
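# Delete columns
del df["a"] # modifies the original data
c = df.pop("c") # modifies the original data and returns the removed column
df.drop(["d"], axis = 1, inplace = True) # modifies the original data
newDf = df.drop(["b"], axis = 1) # inplace = False: returns a new DataFrame and leaves the original unchanged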
df
'''
b e
one 29.632793 94.193498
two 84.801648 5.697404
three 79.095469 57.280865
four 45.895612 78.464924
five 20.933363 84.786636
'''
newDf
'''
e
one 94.193498
two 5.697404
three 57.280865
four 78.464924
five 84.786636
'''
# Delete rows
df.drop("one", inplace = True)
df.drop(["three", "four"], inplace = True)
df
'''
b e
two 88.776930 58.910481
five 19.339399 6.110319
'''
5) Alignment
Operations between DataFrames align automatically on both index and columns.
# Alignment
df1 = pd.DataFrame(np.floor(np.random.rand(9).reshape(3, 3) * 10),
index = list("abc"))
df2 = pd.DataFrame(np.floor(np.random.rand(25).reshape(5, 5) * 10),
index = ["c", "d", "e", "b", "a"])
df1
'''
0 1 2
a 4.0 6.0 5.0
b 1.0 5.0 6.0
c 4.0 8.0 9.0
'''
df2
'''
0 1 2 3 4
c 9.0 1.0 8.0 7.0 8.0
d 1.0 5.0 2.0 7.0 9.0
e 6.0 6.0 1.0 7.0 7.0
b 9.0 5.0 8.0 5.0 1.0
a 0.0 1.0 3.0 9.0 1.0
'''
df1 + df2
'''
0 1 2 3 4
a 4.0 7.0 8.0 NaN NaN
b 10.0 10.0 14.0 NaN NaN
c 13.0 9.0 17.0 NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN
'''
6) Sorting: .sort_values() / .sort_index()
Both also apply to Series.
# Sorting
df = pd.DataFrame(np.floor(np.random.rand(15).reshape(5, 3) * 10),
index = ["c", "d", "e", "b", "a"],
columns = ["one", "two", "three"])
'''
one two three
c 0.0 3.0 4.0
d 0.0 4.0 0.0
e 5.0 6.0 9.0
b 5.0 1.0 6.0
a 7.0 8.0 0.0
'''
# Sort by values with .sort_values()
df.sort_values("one", ascending = True)
'''
one two three
c 0.0 3.0 4.0
d 0.0 4.0 0.0
e 5.0 6.0 9.0
b 5.0 1.0 6.0
a 7.0 8.0 0.0
'''
df.sort_values("one", ascending = False)
'''
one two three
a 7.0 8.0 0.0
e 5.0 6.0 9.0
b 5.0 1.0 6.0
c 0.0 3.0 4.0
d 0.0 4.0 0.0
'''
df.sort_values(["one", "three"]) # sort by several columns: ascending on the first, then ascending on the second within ties
'''
one two three
d 0.0 4.0 0.0
c 0.0 3.0 4.0
b 5.0 1.0 6.0
e 5.0 6.0 9.0
a 7.0 8.0 0.0
'''
df.sort_values("two", inplace = True) # None
df
'''
one two three
b 5.0 1.0 6.0
c 0.0 3.0 4.0
d 0.0 4.0 0.0
e 5.0 6.0 9.0
a 7.0 8.0 0.0
'''
# Sort by index with .sort_index()
# Defaults: ascending = True, inplace = False
df.sort_index()
'''
one two three
a 7.0 8.0 0.0
b 5.0 1.0 6.0
c 0.0 3.0 4.0
d 0.0 4.0 0.0
e 5.0 6.0 9.0
'''
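Two variations on the sorting calls above, sketched with fixed data modeled on the example: the ascending parameter of sort_values accepts one flag per column, and sort_index accepts ascending=False. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": [0, 0, 5, 5, 7], "three": [4, 0, 9, 6, 0]},
                  index=["c", "d", "e", "b", "a"])

# "one" ascending, then "three" descending within equal values of "one"
mixed = df.sort_values(["one", "three"], ascending=[True, False])

# Index labels in descending order
by_index_desc = df.sort_index(ascending=False)

print(mixed)
print(by_index_desc)
```

Both calls return new DataFrames because inplace defaults to False.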