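I. Introduction
II. The Two Main Data Types
1. The Series Type
A Series consists of a set of data values together with the index labels associated with them.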
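To make the "values plus index" structure concrete, here is a minimal sketch. The data and the labels `x`, `y`, `z` are made up for illustration and do not come from the examples below.

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate the structure.
s = pd.Series([10, 20, 30], index=["x", "y", "z"])

values = s.values        # the underlying NumPy array of data
labels = list(s.index)   # the index labels attached to that data

# Each label refers to the value stored at the same position.
print(values)
print(labels)
print(s["y"])
```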
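1.1 Creating a Series
# Imports used throughout these notes
import numpy as np
import pandas as pd
# Create from a scalar
s = pd.Series(25,index=['a','b','c','d','e'])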
s
Out[5]:
a 25
b 25
c 25
d 25
e 25
dtype: int64
# Create from a dict
a = pd.Series({'a':1,'b':2})
a
Out[9]:
a 1
b 2
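dtype: int64
# The index argument determines the labels (and therefore the shape) of the Series; labels absent from the dict are filled with NaN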
b = pd.Series({'a':1,'b':2},index=['b','a','b','d'])
b
Out[21]:
b 2.0
a 1.0
b 2.0
d NaN
dtype: float64
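The dtype change from int64 to float64 in the output above is worth spelling out. The following sketch reuses the same dict and index, with the variable names `plain` and `picked` invented for illustration, and shows that a missing label forces promotion to float.

```python
import pandas as pd

data = {"a": 1, "b": 2}

# Without index: labels come from the dict keys and the dtype stays integer.
plain = pd.Series(data)

# With index: values are looked up by label. "d" is not in the dict, so it
# becomes NaN, and the dtype is promoted to float64 because NaN is a float.
picked = pd.Series(data, index=["b", "a", "b", "d"])

print(plain.dtype)
print(picked.dtype)
print(picked.isna().tolist())
```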
# Create from a list
a = pd.Series([1,4,'f',5])
a
Out[11]:
0 1
1 4
2 f
3 5
dtype: object
# Create from a NumPy function
a = pd.Series(np.arange(0,5,0.5))
a
Out[14]:
0 0.0
1 0.5
2 1.0
3 1.5
4 2.0
5 2.5
6 3.0
7 3.5
8 4.0
9 4.5
dtype: float64
a = pd.Series(np.arange(0,5,0.5),index=(np.arange(10,0,-1)))
a
Out[25]:
10 0.0
9 0.5
8 1.0
7 1.5
6 2.0
5 2.5
4 3.0
3 3.5
2 4.0
1 4.5
dtype: float64
1.2 Series Operations
b = pd.Series({'a':1,'b':2},index=['b','a','b','d'])
b
Out[28]:
b 2.0
a 1.0
b 2.0
d NaN
dtype: float64
b.index
Out[30]: Index(['b', 'a', 'b', 'd'], dtype='object')
b.values
Out[31]: array([ 2., 1., 2., nan])
b[['a','b']]
Out[35]:
a 1.0
b 2.0
b 2.0
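dtype: float64
# Two indexing schemes coexist, the custom labels and the default positional index, but they cannot be mixed in a single selection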
b['a']
Out[32]: 1.0
b[1]
Out[33]: 1.0
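Because `b['a']` and `b[1]` rely on pandas guessing whether a key is a label or a position, it can be clearer to say so explicitly. The sketch below rebuilds the same Series and uses `.loc` for labels and `.iloc` for positions. Recent pandas releases may also emit a deprecation warning for integer keys such as `b[1]` on a label-indexed Series, which is an assumption about your installed version rather than something shown in the transcript.

```python
import numpy as np
import pandas as pd

b = pd.Series([2.0, 1.0, 2.0, np.nan], index=["b", "a", "b", "d"])

by_label = b.loc["a"]      # label-based lookup
by_position = b.iloc[1]    # position-based lookup (second element)

# A duplicated label returns every matching entry as a Series.
dup = b.loc["b"]

print(by_label, by_position)
print(dup)
```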
b
Out[37]:
b 2.0
a 1.0
b 2.0
d NaN
dtype: float64
b[:4]
Out[38]:
b 2.0
a 1.0
b 2.0
d NaN
dtype: float64
b[b>b.median()]
Out[39]: Series([], dtype: float64)
b[b>1]
Out[40]:
b 2.0
b 2.0
dtype: float64
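The two filters above work through a boolean mask. As a rough sketch on the same data (the names `mask`, `selected` and `above_median` are illustrative): the comparison produces one True/False per element, NaN compares as False, and the median skips NaN, which is why `b > b.median()` selects nothing.

```python
import numpy as np
import pandas as pd

b = pd.Series([2.0, 1.0, 2.0, np.nan], index=["b", "a", "b", "d"])

mask = b > 1          # element-wise comparison; NaN compares as False
selected = b[mask]    # keeps only the entries where the mask is True

# median() ignores NaN, so it is the median of [1, 2, 2], i.e. 2.0.
# No element is strictly greater than 2.0, hence the empty result.
above_median = b[b > b.median()]

print(mask.tolist())
print(selected)
print(above_median.empty)
```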
np.exp(b)
Out[41]:
b 7.389056
a 2.718282
b 7.389056
d NaN
dtype: float64
b
Out[42]:
b 2.0
a 1.0
b 2.0
d NaN
dtype: float64
'a' in b
Out[43]: True
0 in b
Out[44]: False
b.get('f',"草")
Out[45]: '草'
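The `in` results above follow from the fact that membership is tested against the index labels, not the values. A small sketch, with data made up for illustration:

```python
import pandas as pd

b = pd.Series([2.0, 1.0], index=["b", "a"])

label_hit = "a" in b            # "a" is an index label
value_miss = 2.0 in b           # 2.0 is a value, not a label
value_hit = 2.0 in b.values     # test the values explicitly instead

# get() returns the default when the label does not exist.
fallback = b.get("f", -1)

print(label_hit, value_miss, value_hit, fallback)
```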
Series alignment
b
Out[46]:
b 2.0
a 1.0
b 2.0
d NaN
dtype: float64
a = pd.Series(np.arange(5),index=(['a','b','c','d','e']))
a
Out[48]:
a 0
b 1
c 2
d 3
e 4
dtype: int32
a + b
Out[49]:
a 1.0
b 3.0
b 3.0
c NaN
d NaN
e NaN
dtype: float64
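The result above can be read as: take the union of both indexes, then add values that share a label. A reduced sketch with hypothetical Series `x` and `y` shows the same rule on unique labels.

```python
import pandas as pd

x = pd.Series([0, 1, 2], index=["a", "b", "c"])
y = pd.Series([10, 20], index=["b", "d"])

# The result index is the union of both indexes; only "b" exists on both
# sides, so every other label ends up as NaN.
total = x + y

print(list(total.index))
print(total)
```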
Both the Series object and its index have a name, stored in their .name attributes
b
Out[51]:
b 2.0
a 1.0
b 2.0
d NaN
dtype: float64
b.name = "Series Datum"
b.index.name = 'index list'
b
Out[54]:
index list
b 2.0
a 1.0
b 2.0
d NaN
Name: Series Datum, dtype: float64
Series values can be modified at any time, and the change takes effect immediately
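A minimal sketch of in-place modification, using made-up data and assignment through labels:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

s["a"] = 100                  # assign one value by label
s.loc[["b", "c"]] = [20, 30]  # assign several values at once

# The Series reflects the new values right away; no copy is needed.
print(s)
```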
2. The DataFrame Type
2.1 Overview
2.2 Creating a DataFrame
# Create from a two-dimensional ndarray
a = pd.DataFrame(np.arange(10).reshape(2,5))
a
Out[59]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
# Create from a dict of Series
a = pd.DataFrame({'a':pd.Series(np.arange(3),index=(['a','b','c'])),'b':pd.Series(np.array([5,6,7,8]),index=('e','f','g','h'))})
a
Out[63]:
a b
a 0.0 NaN
b 1.0 NaN
c 2.0 NaN
e NaN 5.0
f NaN 6.0
g NaN 7.0
h NaN 8.0
pd.DataFrame(a,index=['a','b','c'],columns=['a','b','c'])
Out[66]:
a b c
a 0.0 NaN NaN
b 1.0 NaN NaN
c 2.0 NaN NaN
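Passing an existing DataFrame together with `index` and `columns`, as above, selects the requested labels and fills anything missing with NaN. The same selection can be written with `DataFrame.reindex`, which states the intent more directly. The frame `df` below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [5, 6, 7]}, index=["x", "y", "z"])

# Keep rows "x" and "y", keep column "a", and request a column "c" that
# does not exist yet; the new column is filled with NaN.
sub = df.reindex(index=["x", "y"], columns=["a", "c"])

print(sub)
```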
2.3 Data Type Operations
2.4 Arithmetic Operations
a = pd.DataFrame(np.arange(20).reshape(4,5))
a
Out[71]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
b = pd.DataFrame(np.arange(15).reshape(3,5))
b
Out[74]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
a + b
Out[76]:
0 1 2 3 4
0 0.0 2.0 4.0 6.0 8.0
1 10.0 12.0 14.0 16.0 18.0
2 20.0 22.0 24.0 26.0 28.0
3 NaN NaN NaN NaN NaN
a * b
Out[77]:
0 1 2 3 4
0 0.0 1.0 4.0 9.0 16.0
1 25.0 36.0 49.0 64.0 81.0
2 100.0 121.0 144.0 169.0 196.0
3 NaN NaN NaN NaN NaN
a - b
Out[78]:
0 1 2 3 4
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 NaN NaN NaN NaN NaN
a / b
Out[79]:
0 1 2 3 4
0 NaN 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 NaN NaN NaN NaN NaN
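# The method form takes a fill_value argument, which fills in missing entries before the operation is applied
a.add(b,fill_value = 0)
Out[81]:
0 1 2 3 4
0 0.0 2.0 4.0 6.0 8.0
1 10.0 12.0 14.0 16.0 18.0
2 20.0 22.0 24.0 26.0 28.0
3 15.0 16.0 17.0 18.0 19.0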
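To see the effect of `fill_value` side by side with the plain operator, here is a smaller sketch. The frames are deliberately tiny and are not the ones used in the transcript.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(6).reshape(3, 2))   # rows 0, 1, 2
b = pd.DataFrame(np.arange(4).reshape(2, 2))   # rows 0, 1 only

# Operator form: row 2 exists only in a, so the result there is NaN.
plain = a + b

# Method form: the missing row of b is treated as 0 before adding,
# so row 2 of the result equals row 2 of a.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```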
When lower-dimensional data (a scalar, or a Series as below) is combined with a DataFrame, the operation is broadcast
a
Out[82]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
b = pd.Series(np.arange(5))
a + b
Out[84]:
0 1 2 3 4
0 0 2 4 6 8
1 5 7 9 11 13
2 10 12 14 16 18
3 15 17 19 21 23
a * b
Out[85]:
0 1 2 3 4
0 0 1 4 9 16
1 0 6 14 24 36
2 0 11 24 39 56
3 0 16 34 54 76
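# For operands of different dimensions (above 0-D), the operation runs along axis=1 by default, i.e. the Series index is matched against the column labels; pass the axis argument to choose another axis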
b = pd.Series(np.arange(4))
a.add(b,axis = 0)
Out[88]:
0 1 2 3 4
0 0 1 2 3 4
1 6 7 8 9 10
2 12 13 14 15 16
3 18 19 20 21 22
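The difference between the default axis and `axis=0` is easier to see on a small frame. In this sketch the names `per_column` and `per_row` are invented for illustration: the first Series is matched against column labels, the second against row labels.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(6).reshape(2, 3))   # [[0, 1, 2], [3, 4, 5]]

per_column = pd.Series([10, 20, 30])   # one entry per column label 0..2
per_row = pd.Series([100, 200])        # one entry per row label 0..1

# Default: the Series index is aligned with the columns,
# so each column gets its own offset.
by_cols = a.add(per_column)

# axis=0: the Series index is aligned with the row index,
# so each row gets its own offset.
by_rows = a.add(per_row, axis=0)

print(by_cols)
print(by_rows)
```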
a
Out[89]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
b = pd.DataFrame(np.arange(2,22,1).reshape(4,5))
b
Out[91]:
0 1 2 3 4
0 2 3 4 5 6
1 7 8 9 10 11
2 12 13 14 15 16
3 17 18 19 20 21
a > b
Out[92]:
0 1 2 3 4
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
a < b
Out[93]:
0 1 2 3 4
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
# Operands of the same dimension must have the same shape and labels; otherwise an error is raised
c = pd.DataFrame(np.arange(2,22,1).reshape(5,4))
a < c
Traceback (most recent call last):
File "D:\DevelopmentSoftwareAndEnvironment\Anaconda\envs\pytorch\lib\site-packages\IPython\core\interactiveshell.py", line 3526, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-95-52f302c93909>", line 1, in <module>
a < c
File "D:\DevelopmentSoftwareAndEnvironment\Anaconda\envs\pytorch\lib\site-packages\pandas\core\ops\common.py", line 81, in new_method
return method(self, other)
File "D:\DevelopmentSoftwareAndEnvironment\Anaconda\envs\pytorch\lib\site-packages\pandas\core\arraylike.py", line 48, in __lt__
return self._cmp_method(other, operator.lt)
File "D:\DevelopmentSoftwareAndEnvironment\Anaconda\envs\pytorch\lib\site-packages\pandas\core\frame.py", line 7442, in _cmp_method
self, other = ops.align_method_FRAME(self, other, axis, flex=False, level=None)
File "D:\DevelopmentSoftwareAndEnvironment\Anaconda\envs\pytorch\lib\site-packages\pandas\core\ops\__init__.py", line 313, in align_method_FRAME
raise ValueError(
ValueError: Can only compare identically-labeled (both index and columns) DataFrame objects
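The error above comes from the comparison operator, which refuses to compare frames whose labels differ. One possible way around it, sketched here under the assumption that a label-aligned comparison is what you want, is the method form `lt`, which aligns both frames on the union of their labels first. Positions that exist in only one frame compare as False.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(20).reshape(4, 5))
c = pd.DataFrame(np.arange(2, 22).reshape(5, 4))

# The operator form raises because index and columns are not identical.
try:
    a < c
    raised = False
except ValueError:
    raised = True

# The method form aligns on the union of row and column labels instead.
flexible = a.lt(c)

print(raised)
print(flexible.shape)
```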
# Comparisons between different dimensions are broadcast, along axis=1 by default
a
Out[96]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
b = pd.Series(np.arange(5))
a < b
Out[98]:
0 1 2 3 4
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
a == b
Out[99]:
0 1 2 3 4
0 True True True True True
1 False False False False False
2 False False False False False
3 False False False False False
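The operator form always matches the Series against the columns. To compare each row against its own threshold instead, the method forms such as `gt` accept an `axis` argument. A small sketch, with a hypothetical `thresholds` Series holding one value per row:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(6).reshape(2, 3))   # [[0, 1, 2], [3, 4, 5]]
thresholds = pd.Series([1, 4])                 # one threshold per row

# axis=0 aligns the Series index with the row index, so row 0 is compared
# with 1 and row 1 is compared with 4.
above = a.gt(thresholds, axis=0)

print(above)
```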