第5章 pandas入门
引用惯例:import pandas as pd
5.1 pandas的数据结构介绍
5.1.1 Series
Series是⼀种类似于⼀维数组的对象,它由⼀组数据(各种NumPy数据类型)以及⼀组与之相关的数据标签(即索引)组成。Series的字符串表现形式为:索引在左边,值在右边。由于我们没有为数据指定索引,于是会⾃动创建⼀个0到N-1(N为数据的⻓度)的整数型索引。
In [11]: obj = pd.Series([4, 7, -5, 3])
In [12]: obj
Out[12]:
0 4
1 7
2 -5
3 3
dtype: int64
# 可以通过Series的values和index属性获取其数组表示形式和索引对象
In [13]: obj.values
Out[13]: array([ 4, 7, -5, 3])
In [14]: obj.index # like range(4)
Out[14]: RangeIndex(start=0, stop=4, step=1)
# 可以自己标记索引
In [15]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [16]: obj2
Out[16]:
d 4
b 7
a -5
c 3
dtype: int64
In [17]: obj2.index
Out[17]: Index(['d', 'b', 'a', 'c'], dtype='object')
# 可以通过索引的⽅式选取Series中的单个或⼀组值
In [18]: obj2['a']
Out[18]: -5
In [19]: obj2['d'] = 6
In [20]: obj2[['c', 'a', 'd']]
Out[20]:
c 3
a -5
d 6
dtype: int64
# 使⽤NumPy函数或类似NumPy的运算都会保留索引值的链接
In [21]: obj2[obj2 > 0]
Out[21]:
d 6
b 7
c 3
dtype: int64
In [22]: obj2 * 2
Out[22]:
d 12
b 14
a -10
c 6
dtype: int64
In [23]: np.exp(obj2)
Out[23]:
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
# Series可以⽤在许多原本需要字典参数的函数
In [24]: 'b' in obj2
Out[24]: True
In [25]: 'e' in obj2
Out[25]: False
# 也可以直接通过一个字典来创建Series
In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [27]: obj3 = pd.Series(sdata)
In [28]: obj3
Out[28]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
# 如果只传⼊⼀个字典,则结果Series中的索引就是原字典的键(有序排列)。可以传⼊排好序的字典的键以改变顺序
In [29]: states = ['California', 'Ohio', 'Oregon', 'Texas']
In [30]: obj4 = pd.Series(sdata, index=states)
In [31]: obj4
Out[31]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
# andas的isnull和notnull函数可⽤于检测缺失数据
In [32]: pd.isnull(obj4)
Out[32]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
In [33]: pd.notnull(obj4)
Out[33]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
# Series会根据运算的索引标签⾃动对⻬数据
In [35]: obj3
Out[35]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [36]: obj4
Out[36]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In [37]: obj3 + obj4
Out[37]:
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
# Series对象本身及其索引都有⼀个name属性,该属性跟pandas其他的关键功能关系⾮常密切
In [38]: obj4.name = 'population'
In [39]: obj4.index.name = 'state'
In [40]: obj4
Out[40]:
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
# Series的索引可以通过赋值的⽅式就地修改
In [41]: obj
Out[41]:
0 4
1 7
2 -5
3 3
dtype: int64
In [42]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [43]: obj
Out[43]:
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
5.1.2 DataFrame
DataFrame是⼀个表格型的数据结构,它含有⼀组有序的列,每列可以是不同的值类型(数值、字符串、布尔值等)。DataFrame既有⾏索引也有列索引,它可以被看做由Series组成的字典(共⽤同⼀个索引)。DataFrame中的数据是以⼀个或多个⼆维块存放的(⽽不是列表、字典或别的⼀维数据结构)。
# 可以直接传⼊⼀个由等⻓列表或NumPy数组组成的字典来创建DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
# 果DataFrame会⾃动加上索引(跟Series⼀样),且全部列会被有序排列
In [45]: frame
Out[45]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 3.2 Nevada 2003
# 对于特别⼤的DataFrame,head⽅法会选取前五⾏
In [46]: frame.head()
Out[46]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
# 如果指定了列序列,则DataFrame的列就会按照指定顺序进⾏排列
In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[47]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
# 如果传⼊的列在数据中找不到,就会在结果中产⽣缺失值
In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four', 'five', 'six'])
In [49]: frame2
Out[49]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
In [50]: frame2.columns
Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
# 通过类似字典标记的⽅式或属性的⽅式,可以将DataFrame的列获取为⼀个Series
In [51]: frame2['state']
Out[51]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
In [52]: frame2.year
Out[52]:
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
# ⾏也可以通过位置或名称的⽅式进⾏获取,⽐如⽤loc属性
In [53]: frame2.loc['three']
Out[53]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
# 列可以通过赋值的⽅式进⾏修改
In [54]: frame2['debt'] = 16.5
In [55]: frame2
Out[55]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
In [56]: frame2['debt'] = np.arange(6.)
In [57]: frame2
Out[57]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0
# 将列表或数组赋值给某个列时,其⻓度必须跟DataFrame的⻓度相匹配。如果赋值的是⼀个Series,就会精确匹配DataFrame的索引,所有的空位都将被填上缺失值
In [58]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [59]: frame2['debt'] = val
In [60]: frame2
Out[60]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
# 为不存在的列赋值会创建出⼀个新列
# 注意:不能⽤frame2.eastern创建新的列
In [61]: frame2['eastern'] = frame2.state == 'Ohio'
In [62]: frame2
Out[62]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False
# del⽅法可以⽤来删除这列
In [63]: del frame2['eastern']
In [64]: frame2.columns
Out[64]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
# 如果嵌套字典传给DataFrame,pandas就会被解释为:外层字典的键作为列,内层键则作为⾏索引
In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [66]: frame3 = pd.DataFrame(pop)
In [67]: frame3
Out[67]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
# 可以使⽤类似NumPy数组的⽅法,对DataFrame进⾏转置
In [68]: frame3.T
Out[68]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
# 内层字典的键会被合并、排序以形成最终的索引。如果明确指定了索引,则不会这样
In [69]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[69]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
# 由Series组成的字典差不多也是⼀样的⽤法
In [70]: pdata = {'Ohio': frame3['Ohio'][:-1],
....: 'Nevada': frame3['Nevada'][:2]}
In [71]: pd.DataFrame(pdata)
Out[71]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
# 如果设置了DataFrame的index和columns的name属性,则这些信息也会被显示出来
In [72]: frame3.index.name = 'year'; frame3.columns.name = 'state'
In [73]: frame3
Out[73]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
# 跟Series⼀样,values属性也会以⼆维ndarray的形式返回DataFrame中的数据
In [74]: frame3.values
Out[74]:
array([[ nan, 1.5],
[ 2.4, 1.7],
[ 2.9, 3.6]])
# 如果DataFrame各列的数据类型不同,则值数组的dtype就会选⽤能兼容所有列的数据类型
In [75]: frame2.values
Out[75]:
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
5.1.3 索引对象
pandas的索引对象负责管理轴标签和其他元数据(比如轴名称等)。构建Series或DataFrame时,所用到的任何数组或其他序列的标签都会被转换成⼀个Index
In [76]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [77]: index = obj.index
In [78]: index
Out[78]: Index(['a', 'b', 'c'], dtype='object')
In [79]: index[1:]
Out[79]: Index(['b', 'c'], dtype='object')
# Index对象是不可变的,因此⽤户不能对其进⾏修改
index[1] = 'd' # TypeError
# 不可变可以使Index对象在多个数据结构之间安全共享
In [80]: labels = pd.Index(np.arange(3))
In [81]: labels
Out[81]: Int64Index([0, 1, 2], dtype='int64')
In [82]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
In [83]: obj2
Out[83]:
0 1.5
1 -2.5
2 0.0
dtype: float64
In [84]: obj2.index is labels
Out[84]: True
# Index的功能类似⼀个固定⼤⼩的集合
In [85]: frame3
Out[85]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [86]: frame3.columns
Out[86]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [87]: 'Ohio' in frame3.columns
Out[87]: True
In [88]: 2003 in frame3.index
Out[88]: False
# pandas的Index可以包含重复的标签
In [89]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
In [90]: dup_labels
Out[90]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
-
Index的方法和属性
方法 说明 append 连接另一个Index对象,产生一个新的Index difference 计算差集,并得到一个Index intersection 计算交集 union 计算并集 isin 计算一个指示各值是否都包含在参数集合中的布尔型数组 delete 删除索引i处的元素,并得到新的Index drop 删除传入的值,并得到新的Index insert 将元素插入到索引i处,并得到新的Index is_monotonic 当各元素均大于第一个元素时,返回True is_unique 当Index没有重复值时,返回True unique 计算Index中唯一值的数组