Pandas入门
Pandas数据结构介绍
两个主要的数据结构:Series和DataFrame。
Series
- 类似一维数组,由一组数据以及与之相关的标签组成。
In [1]: from pandas import Series,DataFrame
In [2]: import pandas as pd
In [3]: obj = Series([4,7,5,-3])
In [4]: obj
Out[4]:
0 4
1 7
2 5
3 -3
dtype: int64
没有指定索引,所以自动创建0到N-1(N为数据长度)的整数型索引。可以通过Series的values和index属性获取数组表示形式和索引对象。
In [7]: obj.values
Out[7]: array([ 4, 7, 5, -3], dtype=int64)
In [8]: obj.index
Out[8]: RangeIndex(start=0, stop=4, step=1)
那么如何创建带有索引的Series呢?
In [9]: obj2 = Series([4,7,-5,3],index=['d','b','a','c'])
In [10]: obj2
Out[10]:
d 4
b 7
a -5
c 3
dtype: int64
In [11]: obj2.index
Out[11]: Index(['d', 'b', 'a', 'c'], dtype='object')
索引的建立方便我们快速的从Series中选取单个或多个的值
# 单个值
In [18]: obj2['a']
Out[18]: -5
In [19]: obj2['d'] = 6
# 多个值
In [20]: obj2[['c', 'a', 'd']]
Out[20]:
c 3
a -5
d 6
dtype: int64
需要注意的是,上面多个索引选取时,传入的参数是一个list,如果传入‘a’,‘c’,'b’的话会报错:TypeError: ‘tuple’ object cannot be interpreted as an integer
还有一种可以直接创建Series的方法,有些数据直接存放在一个字典中,那么直接传入,字典的key就是索引。
In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [27]: obj3 = pd.Series(sdata)
In [28]: obj3
Out[28]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
DataFrame
- 表格型的数据结构,存在行索引和列索引,可以看成是由Series组成的
那么如何构建DataFrame呢?
# 传入一个字典
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
In [46]: frame.head()
Out[46]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
# 用columns规定列索引,index规定行索引,如果传入的列在数据中找不到,那么就会用NaN值填充
In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
....: index=['one', 'two', 'three', 'four',
....: 'five', 'six'])
In [49]: frame2
Out[49]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
In [50]: frame2.columns
Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
# 嵌套字典,外层字典的键作为列,内层键作为行索引
In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [66]: frame3 = pd.DataFrame(pop)
In [67]: frame3
Out[67]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
从DataFrame中,我们可以从列获得Series,有两种方式:
# 索引的方式获得
In [51]: frame2['state']
Out[51]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
# 方法和属性的方式获得
In [52]: frame2.year
Out[52]:
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
我们也可以从行获得索引,利用loc方法:
In [53]: frame2.loc['three']
Out[53]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
列可以通过赋值的方式修改,将列表或者数组赋给某个列是,长度要和DataFrame相匹配,如果给不存在的列赋值,则会创建一个新列。
警告:通过索引方式返回的是对应数据的视图,所以修改全是在原数据上修改,但可以用copy方法复制列。
基本功能
本节介绍一些pandas的基本手段
重新索引
reindex创建一个新的对象
In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [92]: obj
Out[92]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
# 当前引入的索引不存在时
In [93]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [94]: obj2
Out[94]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
有时候需要连续一点的时间序列数据,我们可以进行插值处理
In [95]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [96]: obj3
Out[96]:
0 blue
2 purple
4 yellow
dtype: object
In [97]: obj3.reindex(range(6), method='ffill')
Out[97]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
利用reindex还可以修改行或者列的索引
In [98]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
....: index=['a', 'c', 'd'],
....: columns=['Ohio', 'Texas', 'California'])
In [99]: frame
Out[99]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [100]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [101]: frame2
Out[101]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
# 用columns关键字重新索引
In [102]: states = ['Texas', 'Utah', 'California']
In [103]: frame.reindex(columns=states)
Out[103]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
丢弃指定轴上的项
顾名思义,指定轴肯定指的是columns和index分别的两个轴
# 丢弃Series中的项,就不存在行还是列了
In [105]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [106]: obj
Out[106]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [107]: new_obj = obj.drop('c')
In [108]: new_obj
Out[108]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [109]: obj.drop(['d', 'c'])
Out[109]:
a 0.0
b 1.0
e 4.0
dtype: float64
# 丢弃DataFrame行的项
In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
.....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])
In [111]: data
Out[111]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [112]: data.drop(['Colorado', 'Ohio'])
Out[112]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
# 通过传递 axis = 1或者 axis = 'columns' 删除列的值
In [113]: data.drop('two', axis=1)
Out[113]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [114]: data.drop(['two', 'four'], axis='columns')
Out[114]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
# drop中传入一个参数inplace可以就地修改,不会生成一个新的对象
In [115]: obj.drop('c', inplace=True)
In [116]: obj
Out[116]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
索引、选取和过滤
对Series的操作
In [117]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [118]: obj
Out[118]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
In [119]: obj['b']
Out[119]: 1.0
In [120]: obj[1]
Out[120]: 1.0
In [121]: obj[2:4] # 注意这里右边是不包含的,下面利用标签进行索引右边是包含的
Out[121]:
c 2.0
d 3.0
dtype: float64
In [122]: obj[['b', 'a', 'd']]
Out[122]:
b 1.0
a 0.0
d 3.0
dtype: float64
In [123]: obj[[1, 3]]
Out[123]:
b 1.0
d 3.0
dtype: float64
In [124]: obj[obj < 2]
Out[124]:
a 0.0
b 1.0
dtype: float64
In [126]: obj['b':'c'] = 5 # 这里是标签的切片,就包括了右边,这是不同点,得记住,而且还可以直接修改
In [127]: obj
Out[127]:
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
对DataFrame的操作
In [128]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
.....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])
In [129]: data
Out[129]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [130]: data['two']
Out[130]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
In [131]: data[['three', 'one']]
Out[131]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
用loc和iloc进行选取
轴标签索引loc;整数索引iloc,从DataFrame选取行和列的子集
In [137]: data.loc['Colorado', ['two', 'three']] # 第一个为行;第二个为列,最后变成了Series
Out[137]:
two 5
three 6
Name: Colorado, dtype: int64
In [138]: data.iloc[2, [3, 0, 1]]
Out[138]:
four 11
one 8
two 9
Name: Utah, dtype: int64
In [139]: data.iloc[2]
Out[139]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
In [140]: data.iloc[[1, 2], [3, 0, 1]]
Out[140]:
four one two
Colorado 7 4 5
Utah 11 8 9
DataFrame和Series的算术运算
pandas一个神奇的功能是,可以对不同索引的对象进行算术运算。在将对象相加时,如果存在不同的索引对,则结果的索引就是该索引对的并集。自动的数据对齐操作在不重叠的索引处引入了NA值。缺失值会在算术运算过程中传播。下面我们来看看在Series和DataFrame中会发生什么。
# Series中
In [150]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [151]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
.....: index=['a', 'c', 'e', 'f', 'g'])
In [152]: s1
Out[152]:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
In [153]: s2
Out[153]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
In [154]: s1 + s2
Out[154]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
# DataFrame中
In [155]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
.....: index=['Ohio', 'Texas', 'Colorado'])
In [156]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
.....: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [157]: df1
Out[157]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
In [158]: df2
Out[158]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [159]: df1 + df2
Out[159]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
如果想让加法没有缺失值,可以这么做:
In [171]: df1.add(df2, fill_value=0)
Out[171]:
In [171]: df1.add(df2, fill_value=0)
Out[171]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
DataFrame和Series之间的算术运算
广播(broadcasting)的机制,可以通过下列的例子进行理解:
In [179]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
.....: columns=list('bde'),
.....: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [180]: series = frame.iloc[0]
In [181]: frame
Out[181]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [182]: series
Out[182]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
In [183]: frame - series
Out[183]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
简而言之,每一行都会执行这个操作。那么如果想在列上进行操作怎么办呢?,使用算术运算方法可以解决这个问题。
In [186]: series3 = frame['d']
In [187]: frame
Out[187]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [188]: series3
Out[188]:
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
In [189]: frame.sub(series3, axis='index') # 利用参数axis='index' or axis='0'进行广播
Out[189]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
函数的应用和映射
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [192]: np.abs(frame) #所有元素求绝对值
Out[192]:
b d e
Utah 0.204708 0.478943 0.519439
Ohio 0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 1.296221
In [193]: f = lambda x: x.max() - x.min()
In [194]: frame.apply(f)
Out[194]:
b 1.802165
d 1.684034
e 2.689627
dtype: float64
In [195]: frame.apply(f, axis='columns')
Out[195]:
Utah 0.998382
Ohio 2.521511
Texas 0.676115
Oregon 2.542656
dtype: float64
排序
In [203]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],columns=['d', 'a', 'b', 'c']
In [206]: frame.sort_index(axis=1, ascending=False) # 降序排序
Out[206]:
d c b a
three 0 3 2 1
one 4 7 6 5
描述性统计
In [230]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
.....: [np.nan, np.nan], [0.75, -1.3]],
.....: index=['a', 'b', 'c', 'd'],
.....: columns=['one', 'two'])
In [231]: df
Out[231]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [232]: df.sum()
Out[232]:
one 9.25
two -5.80
dtype: float64
In [234]: df.mean(axis='columns', skipna=False) #skipna排除缺失值,默认值为True
Out[234]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
©版权归本博客所有
2019年1月3日第一次修改
2019年1月12日第二次修改