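pandas is a core library for data analysis that bundles data structures with methods for cleaning and analyzing data. Built on top of NumPy, it originally added three data types: Series, DataFrame, and Panel. Panel has since been deprecated and removed from pandas, so Series and DataFrame are the two in everyday use.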
In [1]:
import numpy as np
import pandas as pd
A Series is an object similar to a one-dimensional array. It consists of two parts:
- values: the data (an ndarray)
- index: the index labels associated with the data
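As a minimal sketch of the two parts just described, the snippet below builds a small Series and inspects both attributes. The variable name `demo` and its data are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative data: three values with string labels.
demo = pd.Series([10, 20, 30], index=["x", "y", "z"])

values = demo.values   # the underlying data as a NumPy ndarray
labels = demo.index    # the index labels, a pandas Index object

print(type(values).__name__)  # ndarray
print(list(labels))           # ['x', 'y', 'z']
```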
In [2]:
# Import Series
from pandas import Series
There are two ways to create a Series:
(1) From a list or a NumPy array
If no index is given, the default is an integer index from 0 to N-1.
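A short sketch of the default index, assuming nothing beyond the constructor call itself; the names `from_list` and `from_array` are illustrative. Both inputs end up with the same 0 to N-1 integer index.

```python
import numpy as np
import pandas as pd

from_list = pd.Series([7, 8, 9])             # built from a Python list
from_array = pd.Series(np.array([7, 8, 9]))  # built from a NumPy array

# Neither call passes index=, so both get the default 0..N-1 integer index.
print(list(from_list.index))   # [0, 1, 2]
print(list(from_array.index))  # [0, 1, 2]
```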
In [3]:
nd = np.array([1,2,3,4])
nd
Out[3]:
array([1, 2, 3, 4])
In [4]:
s = Series(nd) # no index specified, so the default 0 to N-1 index is used
s
Out[4]:
0    1
1    2
2    3
3    4
dtype: int32
In [5]:
s = Series([1,2,3,4,5],index=list("abcde"))
s
Out[5]:
a    1
b    2
c    3
d    4
e    5
dtype: int64
In [6]:
s["a"]
Out[6]:
1
In [7]:
s = Series([1,2,3,4,5,6],index=["A","A","B","B","A","C"])
s
Out[7]:
A    1
A    2
B    3
B    4
A    5
C    6
dtype: int64
In [8]:
s["B"]
Out[8]:
B    3
B    4
dtype: int64
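The two cells above show that index labels need not be unique, and that looking up a repeated label returns a Series rather than a single element. A minimal sketch of how to check for this case follows; the name `dup` and its data are illustrative.

```python
import pandas as pd

dup = pd.Series([1, 2, 3], index=["A", "A", "B"])

print(dup.index.is_unique)   # False: the label "A" appears twice

only_b = dup["B"]            # label that occurs once -> a single element
all_a = dup["A"]             # repeated label -> a Series with every match

print(type(all_a).__name__)  # Series
print(all_a.tolist())        # [1, 2]
```

Checking `index.is_unique` before indexing is one way to know in advance which of the two return types to expect.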
(2) From a dictionary
In [9]:
s = Series({ "a":1,"b":2,"c":3})
s
Out[9]:
a    1
b    2
c    3
dtype: int64
In [10]:
s1=Series({ "a":123,"b":431},index=list("ac"))
s1
Out[10]:
a    123.0
c      NaN
dtype: float64
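The last cell passes both a dictionary and an index. A sketch of what happens in that case: only the requested labels are kept, a label missing from the dictionary becomes NaN, and the dtype becomes float64 because NaN is a floating-point value. The names `data`, `aligned`, and `filled` are illustrative, and filling with 0 is just one possible way to handle the gap.

```python
import pandas as pd

data = {"a": 123, "b": 431}

# Ask for labels "a" and "c": "b" is dropped and "c" has no value in the dict.
aligned = pd.Series(data, index=["a", "c"])

print(aligned.isna().tolist())  # [False, True]
print(aligned.dtype)            # float64

# One option for the missing entry: fill it, then convert back to integers.
filled = aligned.fillna(0).astype("int64")
print(filled.tolist())          # [123, 0]
```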
============================================
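Exercise 1:
Create the following Series in more than one way and name it s1 (the labels are school subjects: 语文 = Chinese, 数学 = Math, 英语 = English, 理综 = Science):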
语文 150
数学 150
英语 150
理综 300
============================================
In [11]:
# From a NumPy array
nd = np.array([150,150,150,300])
s1 = Series(nd,index=["语文","数学","英语","理综"])
s1
Out[11]:
语文    150
数学    150
英语    150
理综    300
dtype: int32
In [12]:
dic = { "语文":150,"数学":150,"英语":150,"理综":300}
s2 = Series(dic)
s2
Out[12]:
数学    150
理综    300
英语    150
语文    150
dtype: int64
In [13]:
nd[0] = 1000
nd
Out[13]:
array([1000, 150, 150, 300])
In [14]:
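# s1 was built from an ndarray without copying, so it shares nd's memory and changes too (a list input is converted to a new array, so it is not shared)
s1
Out[14]:
语文    1000
数学     150
英语     150
理综     300
dtype: int32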
In [15]:
dic["数学"] = 120
dic
Out[15]:
{'数学': 120, '理综': 300, '英语': 150, '语文': 150}
In [16]:
s2 # a Series built from a dict copies the values, so changing dic afterwards does not affect s2
Out[16]:
数学    150
理综    300
英语    150
语文    150
dtype: int64
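Whether a Series shares memory with the array it was built from depends on the pandas version and its settings: the outputs above come from a version that shares by default, and releases that use copy-on-write may behave differently. A minimal sketch that does not rely on the default is to pass `copy=True`, an existing parameter of the Series constructor. The names `source` and `independent` are illustrative.

```python
import numpy as np
import pandas as pd

source = np.array([150, 150, 150, 300])

# copy=True asks pandas for its own copy of the data,
# so later writes to `source` cannot reach the Series.
independent = pd.Series(source, copy=True)

source[0] = 1000
print(independent.iloc[0])                               # 150
print(np.shares_memory(source, independent.to_numpy()))  # False
```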
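Use square brackets with a single label to get one element (the result is a scalar), or square brackets around a list of labels to get several (the result is still a Series). Indexing is either explicit or implicit:
(1) Explicit indexing:
- Uses the labels in index as keys
- Use .loc[] (recommended)
Note that label slices are closed intervals: both endpoints are included.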
In [17]:
s
Out[17]:
a    1
b    2
c    3
dtype: int64
In [18]:
s.values
Out[18]:
array([1, 2, 3], dtype=int64)
In [19]:
s.index # the index holds the explicit labels
Out[19]:
Index(['a', 'b', 'c'], dtype='object')
In [20]:
# Option 1
s["a"]
Out[20]:
1
In [21]:
# Option 2 (recommended)
s.loc["a"]
Out[21]:
1
In [22]:
s.loc["a","b"] # not allowed: a tuple means one indexer per axis, and a Series has only one axis
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-22-1c72e1f53463> in <module>()
----> 1 s.loc["a","b"] # not allowed: a tuple means one indexer per axis, and a Series has only one axis
...
d:\Anaconda\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    186         for i, k in enumerate(key):
    187             if i >= self.obj.ndim:
--> 188                 raise IndexingError('Too many indexers')
IndexingError: Too many indexers
In [ ]:
s.loc[["a","b","a"]] # look up a list of labels; the result is a new Series holding those entries of s
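The cell above was left unexecuted, so here is a self-contained sketch of the same idea with illustrative names (`letters`, `one`, `many`): a single label gives a scalar, while a list of labels gives a Series in the order requested, with repeats kept.

```python
import pandas as pd

letters = pd.Series({"a": 1, "b": 2, "c": 3})

one = letters.loc["a"]               # single label -> scalar
many = letters.loc[["a", "b", "a"]]  # list of labels -> Series, repeats allowed

print(one)               # 1
print(list(many.index))  # ['a', 'b', 'a']
print(many.tolist())     # [1, 2, 1]
```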
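(2) Implicit indexing:
- Uses integer positions as keys
- Use .iloc[] (recommended)
Note that position slices are half-open intervals: the end position is excluded.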
In [ ]:
s.iloc[0]
In [23]:
s2.iloc[0]
Out[23]:
150
In [24]:
s.iloc[0,1]
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-24-61ca01bee2d4> in <module>()
----> 1 s.iloc[0,1]
...
d:\Anaconda\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    186         for i, k in enumerate(key):
    187             if i >= self.obj.ndim:
--> 188                 raise IndexingError('Too many indexers')
IndexingError: Too many indexers
In [25]:
s.iloc[[0,1]]
Out[25]:
a    1
b    2
dtype: int64
(3) Slicing
In [26]:
# Explicit
s.loc["a":"c"] # closed interval: both endpoints included
Out[26]:
a    1
b    2
c    3
dtype: int64
In [27]:
# Implicit
s.iloc[0:2] # half-open: start included, end excluded
Out[27]:
a    1
b    2
dtype: int64
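To make the difference between the two slice conventions concrete, the sketch below slices the same illustrative Series (`letters`) once by label and once by position, using endpoints that refer to the same element.

```python
import pandas as pd

letters = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = letters.loc["a":"b"]   # closed: includes the end label "b"
by_position = letters.iloc[0:1]   # half-open: excludes position 1 (label "b")

print(by_label.tolist())     # [1, 2]
print(by_position.tolist())  # [1]
```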
In [28]:
s = Series([1,2,3,4,5,6],index=["A","A","B","C","B","C"])
s
Out[28]:
A    1
A    2
B    3
C    4
B    5
C    6
dtype: int64
In [29]:
s.loc["A":"C"]
# Label slicing fails here because the bound "C" is repeated in an unsorted index; avoid label slices when the index has duplicates
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-29-dd4f4933695c> in <module>()
----> 1 s.loc["A":"C"]
...
d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   3496             if isinstance(slc, np.ndarray):
   3497                 raise KeyError("Cannot get %s slice bound for non-unique "
-> 3498                                "label: %r" % (side, original_label))
KeyError: "Cannot get right slice bound for non-unique label: 'C'"
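The failure above comes from a repeated label whose occurrences are not next to each other. Two possible workarounds are sketched below under that assumption: sort the index first so equal labels sit together, or select by membership instead of by slice. The names `dup`, `sorted_slice`, and `picked` are illustrative.

```python
import pandas as pd

dup = pd.Series([1, 2, 3, 4, 5, 6], index=["A", "A", "B", "C", "B", "C"])

try:
    dup.loc["A":"C"]
    slice_failed = False
except KeyError:
    slice_failed = True  # "C" is repeated and its occurrences are not contiguous

# Option 1: sort the index so equal labels are adjacent, then slice by label.
sorted_slice = dup.sort_index().loc["A":"C"]

# Option 2: pick rows by label membership; the original order is kept.
picked = dup[dup.index.isin(["A", "B"])]

print(slice_failed)     # True
print(picked.tolist())  # [1, 2, 3, 5]
```

Sorting changes the row order, so the membership test is the better fit when the original order matters.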
============================================
Exercise 2:
Index and slice the Series s1 from Exercise 1 in more than one way:
Index: 数学 150
Slice: 语文 150 数学 150 英语 150
============================================
In [30]:
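s1.iloc[[1]] # by position; the original s1[[1]] relied on integer keys falling back to positions, which newer pandas versions deprecate
s1.loc[["数学"]]
Out[30]:
数学    150
dtype: int32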
In [31]:
s1.loc["语文":"英语"]
Out[31]:
语文    1000
数学     150
英语     150
dtype: int32
In [32]:
s1.iloc[0:3]
Out[32]:
语文    1000
数学     150
英语     150
dtype: int32
A Series can be thought of as a fixed-length, ordered dictionary.
Attributes such as shape, size, index, and values describe the Series.
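A brief sketch of the dictionary analogy and the attributes named above, using an illustrative Series called `scores`. The `in` operator, `keys()`, and `items()` all work on the index labels, much as they do on a dict.

```python
import pandas as pd

scores = pd.Series({"a": 1, "b": 2, "c": 3})

# Dictionary-like behaviour: membership and iteration go through the labels.
print("a" in scores)         # True
print(list(scores.keys()))   # ['a', 'b', 'c']
pairs = dict(scores.items()) # label -> value pairs, in index order

# Array-like attributes.
print(scores.shape)  # (3,)
print(scores.size)   # 3
```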
In [33]:
s.shape
Out[33]:
(6,)
In [34]:
s.values.reshape((3,2)) # Series.reshape is deprecated and unavailable in current pandas, so reshape the underlying array; reshaping is generally avoided because the result is a plain ndarray without the index
Out[34]:
array([[1, 2],
       [3, 4],
       [5, 6]], dtype=int64)
In [35]:
s.size
Out[35]:
6