Python third-party modules, data analysis: other features of the Pandas module

EdVzAs

Article link: https://blog.csdn.net/weixin_46131409/article/details/112633761

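I.Indexing and slicing

Indexes are set through the index parameter of Series and the index/columns parameters of DataFrame

1.Series objects
(1)Square-bracket form:

Get values by index: <S>[<index>]
  #When a single value matches, the result has the element's data type; when several values match, a Series is returned
Get values by slice: <S>[<begin>:<end>[:<step>]]
  #Always returns a Series; do not slice by label when the labels are not unique
  #Indexing and slicing work both for reading and for assignment; the syntax looks like list/ndarray indexing, but the lookup rules differ
  #Parameters:
    S: the Series object
    index: a position or a label; an int, a list of int, or a value (or list of values) of the label's data type
      #A list looks up each of its elements and returns all matches (labels and values) as a Series
      #Values can be fetched by position, as with a list, or by label, as with a dict
      #Note: recent pandas releases deprecate position lookup with an int inside [] on a non-integer index; .iloc is the explicit form
      #Note: when the labels themselves are integers, an int inside [] is treated as a label, not a position, so s[-1] raises KeyError on the default index; use .iloc for negative positions
    begin,end,step: start/stop/step of the slice; <begin>/<end> may be int positions or labels, <step> must be an int
      #Position slices include <begin> and exclude <end>; label slices include both <begin> and <end>
      #<step> defaults to 1; <S>[<begin>:] and <S>[:<end>] are also valid; position slices accept negative values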
#Example:
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series(np.random.random_sample(50))
>>> s1=s[[3,7,33]]
>>> s1
3     0.528045
7     0.449824
33    0.288187
dtype: float64
>>> s2=s[3]
>>> s2
0.5280445237551318
>>> s3=s[1:12:2]
>>> s3
1     0.268212
3     0.528045
5     0.178432
7     0.449824
9     0.336740
11    0.032818
dtype: float64
>>> s=pd.Series([1,2,3,4,5],index=["a","b","c","d","e"])
>>> s["b":"d"]
b    2
c    3
d    4
dtype: int64
>>> s["b":"d":2]
b    2
d    4
dtype: int64
>>> s["e":"b"]
Series([], dtype: int64)
>>> s=pd.Series([1,2,3,4,5,4,3,2,1],index=["a","b","c","d","e","d","c","b","a"])
>>> s[2]
3
>>> s[2:5]
c    3
d    4
e    5
dtype: int64
>>> s["d"]
d    4
d    4
dtype: int64
>>> type(s["e"])
<class 'numpy.int64'>
>>> s["b":"d"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\series.py", line 908, in __getitem__
    return self._get_with(key)
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\series.py", line 915, in _get_with
    slobj = self.index._convert_slice_indexer(key, kind="getitem")
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 3186, in _convert_slice_indexer
    indexer = self.slice_indexer(start, stop, step, kind=kind)
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 4962, in slice_indexer
    start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 5163, in slice_locs
    start_slice = self.get_slice_bound(start, "left", kind)
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 5095, in get_slice_bound
    raise KeyError(
KeyError: "Cannot get left slice bound for non-unique label: 'b'"
>>> s[1:4]
b    2
c    3
d    4
dtype: int64
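
The difference between label slices and position slices shown in the transcript above can be condensed into a short script. This is a minimal sketch; the variable names (by_label, by_position, t) are illustrative and not part of the original examples.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Label slice: both endpoints are included.
by_label = s["b":"d"]

# Position slice through .iloc: the stop position is excluded.
by_position = s.iloc[1:3]

# Assignment through a label slice writes to every selected element.
t = s.copy()
t["b":"d"] = 0

print(by_label.tolist())     # [2, 3, 4]
print(by_position.tolist())  # [2, 3]
print(t.tolist())            # [1, 0, 0, 0, 5]
```

Working on a copy (t) keeps the original Series unchanged, so the read-only results above stay valid.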

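(2).loc form:

Index via .loc: <S>.loc[<lindex>]
Slice via .loc: <S>.loc[<begin>:<end>[:<step>]]
  #Parameters/usage as above, except that <lindex>/<begin>/<end> must be labels, not positions; the slice includes <end>

(3).iloc form:

Index via .iloc: <S>.iloc[<iindex>]
Slice via .iloc: <S>.iloc[<begin>:<end>[:<step>]]
  #Follows the same rules as list/ndarray indexing; only the syntax differs
  #Parameters/usage as above, except that <iindex>/<begin>/<end> must be positions, not labels; negative values are allowed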
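
The distinction between .loc and .iloc is easiest to see when the labels are integers that do not coincide with the positions. The following is a small sketch with made-up data chosen only to make that mismatch visible.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2, 3.
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

print(s.loc[3])               # label 3 -> 10
print(s.iloc[3])              # position 3 -> 40
print(s.iloc[-1])             # last position -> 40
print(s.loc[[3, 1]].tolist()) # labels 3 and 1 -> [10, 30]
print(s.iloc[0:2].tolist())   # positions 0 and 1 -> [10, 20]
```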

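(4).at and .iat forms:

Index via .at: <S>.at[<lindex>]
Index via .iat: <S>.iat[<iindex>]
  #Neither supports slicing, but both are fast for single-value access
  #Parameters:
    lindex: a label (not a position)
    iindex: a position (may be negative; not a label)

#Example: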
>>> s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
>>> s.at['e']
5
>>> s.iat[-3]
3

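2.DataFrame objects
(1)Square-bracket form:

Indexing: <df>[<index>]
Selecting several columns at once: <df>[[<index1>,<index2>...]]
  #Indexing selects a column first; index the result again to select a row
Slicing: <df>[<begin>:<end>[:<step>]]
  #Slicing selects rows only
  #Parameters:
    index: a column label
    begin,end,step: start/stop/step; int positions (row labels also work when the row index is not a RangeIndex)

#Example: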
>>> df=pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,0,1,2],[3,4,5,6],[7,8,9,0]])
>>> df[1]
0    2
1    6
2    0
3    4
4    8
Name: 1, dtype: int64
>>> df[1][0]
2
>>> df[1:3]
   0  1  2  3
1  5  6  7  8
2  9  0  1  2
>>> df[1:3][1:2]
   0  1  2  3
2  9  0  1  2
>>> df=pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]],columns=["a","b","c","d","e"])
>>> df["a"]
0     1
1     6
2    11
Name: a, dtype: int64
>>> df[0]
Traceback (most recent call last):
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 0
>>> df["a":"d"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\frame.py", line 2881, in __getitem__
    indexer = convert_to_index_sliceable(self, key)
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 2134, in convert_to_index_sliceable
    return idx._convert_slice_indexer(key, kind="getitem")
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 3152, in _convert_slice_indexer
    self._validate_indexer("slice", key.start, "getitem")
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 4993, in _validate_indexer
    self._invalid_indexer(form, key)
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 3263, in _invalid_indexer
    raise TypeError(
TypeError: cannot do slice indexing on RangeIndex with these indexers [a] of type str
>>> df[["a","c"]]
    a   c
0   1   3
1   6   8
2  11  13

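(2).loc/.iloc forms:

DataFrame objects also support .loc/.iloc, written as <df>.loc[<rows>,<columns>] and <df>.iloc[<rows>,<columns>]; each selector follows the Series rules
  #A bare : selects every row / every column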
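
Since this part has no example of its own, here is a minimal sketch of .loc/.iloc on a DataFrame. The frame, its labels and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["x", "y", "z"],
    columns=["a", "b", "c"],
)

cell = df.loc["y", "b"]             # one value, by row label and column label
rows = df.loc["x":"y", ["a", "c"]]  # the label slice includes "y"
col = df.loc[:, "b"]                # a bare : keeps every row
block = df.iloc[0:2, -1]            # positions: rows 0 and 1, last column

print(cell)                   # 5
print(rows.values.tolist())   # [[1, 3], [4, 6]]
print(col.tolist())           # [2, 5, 8]
print(block.tolist())         # [3, 6]
```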

3.Hierarchical indexing:

See: https://blog.csdn.net/ceerfuce/article/details/81589913

A "hierarchical index" gives a single axis several index levels, which lets higher-dimensional data be handled in a lower-dimensional form:
>>> s=pd.Series([1,2,3,4,5,6,7,8,9],index=[["a","a","a","b","b","c","c","d","d"],[1,2,3,1,3,1,2,2,3]])
>>> s
a  1    1
   2    2
   3    3
b  1    4
   3    5
c  1    6
   2    7
d  2    8
   3    9
dtype: int64
>>> s=pd.Series([1,2,3,4,5,6,7,8,9],index=[["a","a","a","b","b","c","c","d","a"],[1,2,3,1,3,1,2,2,4]])
>>> s
a  1    1
   2    2
   3    3
b  1    4
   3    5
c  1    6
   2    7
d  2    8
a  4    9
dtype: int64
#With a hierarchical index, the index of the Series/DataFrame is a MultiIndex object:
>>> s.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('a', 4)],
           )
>>> type(s.index)
<class 'pandas.core.indexes.multi.MultiIndex'>
#Ordinary position/label indexing works on objects with a hierarchical index:
>>> s[3]
4
>>> s["a"]
1    1
2    2
3    3
4    9
dtype: int64
>>> type(s["a"])
<class 'pandas.core.series.Series'>
>>> s["a"][1]
1
#Multi-level indexing is also possible:
>>> s["a",1]
1
>>> s[:,2]
a    2
c    7
d    8
dtype: int64
#Ordinary position slicing also works:
>>> s[2:5]
a  3    3
b  1    4
   3    5
dtype: int64
#Label slicing requires a sorted index:
>>> s.sort_index()["a":"c"]
a  1    1
   2    2
   3    3
   4    9
b  1    4
   3    5
c  1    6
   2    7
dtype: int64
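
The same lookups can be written with .loc, which avoids any ambiguity between labels and positions. This is a sketch on a smaller, already sorted Series; the use of .loc and xs here is an addition to the original examples, not something the transcript above shows.

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

print(s.loc[("b", 2)])            # one element, addressed by the full key -> 4
print(s.loc["a"].tolist())        # everything under outer label "a" -> [1, 2]
print(s.xs(2, level=1).tolist())  # inner label 2 across all outer labels -> [2, 4, 6]
print(s.loc["a":"b"].tolist())    # label slice on the sorted outer level -> [1, 2, 3, 4]
```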

II.The index type (Index class)

The pandas.core.indexes.multi.MultiIndex class behaves similarly

1.Overview:

The Index class (pandas.core.indexes.base.Index) stores the index labels and is immutable
>>> s=pd.Series([1,2,3],index=['a','b','c'])
>>> s.index
Index(['a', 'b', 'c'], dtype='object')
>>> df=pd.DataFrame([[1,2],[3,4]],index=['a','b'],columns=['A','B'])
>>> df.index
Index(['a', 'b'], dtype='object')
>>> df.columns
Index(['A', 'B'], dtype='object')

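2.Methods
(1)Set operations:

Concatenate two Index objects: <I>.append(<other>)
  #Parameters:
  	I,other: the Index objects; <other> is an Index or a list/tuple of Index objects
  #Note: append concatenates rather than forming a set, so duplicates are kept

#Example: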
>>> df=pd.DataFrame([1,2,3,4],index=["a","c","d","b"])
>>> i=df.index
>>> i.append(i)
Index(['a', 'c', 'd', 'b', 'a', 'c', 'd', 'b'], dtype='object')

######################################################################################################################

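Set difference of Index objects (elements of <I> that are not in <other>): <I>.difference(<other>[,sort=None])
  #Parameters:
  	I,other: the Index objects; <other> is an Index or array-like
  	sort: whether to sort the result; None (sort if possible)/False (do not sort)

#Example (continued):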
>>> i.difference(["c"])
Index(['a', 'b', 'd'], dtype='object')
>>> i.difference(["c"],sort=False)
Index(['a', 'd', 'b'], dtype='object')

######################################################################################################################

Intersection of Index objects: <I>.intersection(<other>[,sort=False])
  #Parameters: same as <I>.difference()

#Example (continued):
>>> i.intersection(["a","c","b"])
Index(['a', 'c', 'b'], dtype='object')
>>> i.intersection(["a","c","b"],sort=None)
Index(['a', 'b', 'c'], dtype='object')

######################################################################################################################

Union of Index objects: <I>.union(<other>[,sort=None])
  #Parameters: same as <I>.difference()

#Example (continued):
>>> i.union(["z","g"])
Index(['a', 'b', 'c', 'd', 'g', 'z'], dtype='object')

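(2)Adding, deleting, looking up, modifying:

Delete the entries at the given positions: <I>.delete(<loc>)
  #Parameters:
  	loc: position(s) of the entries to delete; int or list of int
  	  #Note: positions must not be out of bounds

#Example (continued):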
>>> i.delete(1)
Index(['a', 'd', 'b'], dtype='object')
>>> i.delete([1,3])
Index(['a', 'd'], dtype='object')

######################################################################################################################

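Delete the entries with the given values: <I>.drop(<labels>[,errors="raise"])
  #Parameters:
	labels: values of the entries to delete; array-like
	  #Note: with errors="raise", every value must be present in <I>
	errors: how to handle missing values; "raise" (raise KeyError)/"ignore" (skip them and drop the values that are present)

#Example (continued):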
>>> i.drop(["a"])
Index(['c', 'd', 'b'], dtype='object')
>>> i.drop(["a","e","b"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexes\base.py", line 5278, in drop
    raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['e'] not found in axis"
>>> i.drop(["a","e","b"],errors="ignore")
Index(['c', 'd'], dtype='object')

######################################################################################################################

Insert a new entry at the given position: <I>.insert(<loc>,<item>)
  #Parameters:
  	loc: position at which to insert; int
  	item: value of the new entry

#Example (continued):
>>> i.insert(3,111)
Index(['a', 'c', 'd', 111, 'b'], dtype='object')
>>> i.insert(3,[111,222])
Index(['a', 'c', 'd', [111, 222], 'b'], dtype='object')
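
Because an Index is immutable, delete/drop/insert never change the object they are called on; each returns a new Index. The sketch below checks this directly. The variable names are illustrative.

```python
import pandas as pd

i = pd.Index(["a", "c", "d", "b"])

# In-place assignment is rejected because Index is immutable.
try:
    i[0] = "z"
    mutated = True
except TypeError:
    mutated = False

# delete/insert leave the original untouched and return a new Index.
shorter = i.delete(1)
longer = i.insert(1, "x")

print(mutated)        # False
print(list(i))        # ['a', 'c', 'd', 'b']
print(list(shorter))  # ['a', 'd', 'b']
print(list(longer))   # ['a', 'x', 'c', 'd', 'b']
```

To change the labels of a Series or DataFrame, assign the new Index back, for example to its index attribute.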

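(3)Other:

Test whether each entry is among the given values: <I>.isin(<values>[,level=None])
  #Returns a bool array: True where the element of <I> occurs in <values>, False otherwise
  #Parameters:
  	values: the values to test against; set or list-like

#Example (continued):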
>>> i.isin(["a",3])
array([ True, False, False, False])
>>> i.isin(["a",3,"c"])
array([ True,  True, False, False])

######################################################################################################################

Remove duplicate values: <I>.unique([level=None])

#Example (continued):
>>> iu=i.insert(0,"a")
>>> iu
Index(['a', 'a', 'c', 'd', 'b'], dtype='object')
>>> iu.unique()
Index(['a', 'c', 'd', 'b'], dtype='object')

######################################################################################################################

Apply a mapping to every element: <I>.map(<mapper>[,na_action=None])
  #Parameters:
	mapper: the mapping; function/dict/Series

#Example:
>>> s=pd.Series([1,2,-999,4,-999,-999,7])
>>> i=s.index
>>> def f(x):
...     return np.nan if x<3 else x
...
>>> i.map(f)
Float64Index([nan, nan, nan, 3.0, 4.0, 5.0, 6.0], dtype='float64')
>>> i.map(pd.Series([2,3,4,5,6,7]))
Float64Index([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, nan], dtype='float64')
>>> i.map({0:-11,1:11})
Float64Index([-11.0, 11.0, nan, nan, nan, nan, nan], dtype='float64')

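3.Attributes:

Whether every element is greater than or equal to the previous one: <I>.is_monotonic
  #An alias of <I>.is_monotonic_increasing; the alias was removed in pandas 2.0

#Example (continued):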
>>> i.is_monotonic
True

######################################################################################################################

Whether there are no duplicate values: <I>.is_unique

#Example (continued):
>>> i.is_unique
True

######################################################################################################################

Name of the index: <I>.name

#Example (continued):
>>> i.name="i1"
>>> i.name
'i1'

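III.Operations
1.Series objects
(1)Arithmetic operations:

Addition: <S>=<S1>+<S2>
Subtraction: <S>=<S1>-<S2>
Multiplication: <S>=<S1>*<S2>
Division: <S>=<S1>/<S2>
  #Values with the same label are added/subtracted/multiplied/divided; a label present in only one operand yields NaN
  #Parameters:
    S1,S2,S: the operand Series objects and the resulting Series

#Example: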
>>> s1=pd.Series([1,2])
>>> s2=pd.Series([3,2])
>>> s1+s2
0    4
1    4
dtype: int64
>>> s3=pd.Series(["aaa","bbb"],index=["a","b"])
>>> s1+s3
0    NaN
1    NaN
a    NaN
b    NaN
dtype: object
>>> s4=pd.Series([4,1,2],index=[1,2,3])
>>> s1+s4
0    NaN
1    6.0
2    NaN
3    NaN
dtype: float64
>>> s1-s2
0   -2
1    0
dtype: int64
>>> s1*s2
0    3
1    4
dtype: int64
>>> s1/s2
0    0.333333
1    1.000000
dtype: float64
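
When the NaN produced by non-shared labels is unwanted, the method form of the operator accepts a fill value. This sketch reuses s1 and s4 from the transcript above; the add method with fill_value is an addition to the original text.

```python
import pandas as pd

s1 = pd.Series([1, 2])                      # labels 0, 1
s4 = pd.Series([4, 1, 2], index=[1, 2, 3])  # labels 1, 2, 3

plain = s1 + s4                    # NaN wherever a label is missing on one side
filled = s1.add(s4, fill_value=0)  # a missing operand counts as 0

print(plain.isna().tolist())  # [True, False, True, True]
print(filled.tolist())        # [1.0, 6.0, 1.0, 2.0]
```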

#################################################################################################

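Addition: <S>+<n>
          <n>+<S>
Subtraction: <S>-<n>
             <n>-<S>
Multiplication: <S>*<n>
                <n>*<S>
Division: <S>/<n>
          <n>/<S>
  #Applies the operation between <n> and every element of the Series
  #Parameters:
    S,n: the Series operand and the numeric operand

#Example: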
>>> s=pd.Series([1,2.22,4,8.88])
>>> s+1
0    2.00
1    3.22
2    5.00
3    9.88
dtype: float64
>>> 1.11-s
0    0.11
1   -1.11
2   -2.89
3   -7.77
dtype: float64
>>> s*1.414
0     1.41400
1     3.13908
2     5.65600
3    12.55632
dtype: float64
>>> 1.414/s
0    1.414000
1    0.636937
2    0.353500
3    0.159234
dtype: float64

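(2)Comparison operations:

<S1> > <S2>
<S1> < <S2>
    ...
  #Values with the same label are compared; both objects must carry exactly the same labels, otherwise ValueError is raised

#Example: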
>>> s1=pd.Series([1,2,3])
>>> s2=pd.Series([5,1,3])
>>> s1>s2
0    False
1     True
2    False
dtype: bool

#################################################################################################

<S> > <n>
<S> < <n>
   ...
  #Compares every element of the Series with <n>
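
A short sketch of a scalar comparison, together with the label requirement for Series-to-Series comparison. Using the resulting bool Series as a filter is an addition here, shown because it is the usual reason for making the comparison.

```python
import pandas as pd

s = pd.Series([1, 5, 3, 8])

mask = s > 3        # element-wise comparison with a scalar
selected = s[mask]  # a bool Series can be used to filter

# Comparing Series whose labels differ raises ValueError.
other = pd.Series([1, 5, 3, 8], index=[1, 2, 3, 4])
try:
    s > other
    raised = False
except ValueError:
    raised = True

print(mask.tolist())      # [False, True, False, True]
print(selected.tolist())  # [5, 8]
print(raised)             # True
```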

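2.DataFrame and Index objects:

①The rules are similar
②Operations between objects of different dimensions follow broadcasting rules (see part IV of the companion article on the NumPy module: byte swapping, copying, indexing, broadcasting, IO)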
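
A minimal sketch of how broadcasting plays out for a DataFrame. The frame and the names by_scalar, by_row and by_col are made up for illustration; the sub method with axis=0 is an addition to the original text.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
row = pd.Series([10, 20], index=["A", "B"])

by_scalar = df * 2  # the scalar is applied to every cell
by_row = df + row   # the Series is aligned on the column labels and repeated for each row
by_col = df.sub(pd.Series([1, 3]), axis=0)  # align on the row labels instead

print(by_scalar.values.tolist())  # [[2, 4], [6, 8]]
print(by_row.values.tolist())     # [[11, 22], [13, 24]]
print(by_col.values.tolist())     # [[0, 1], [0, 1]]
```

Alignment by label happens before the arithmetic, so a Series whose labels do not match the chosen axis would produce NaN rather than a positional match.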

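IV.Moving window functions

Computing statistics over a moving window is common with time series; such functions are called "moving window functions", and they help smooth noisy or gappy data

1.Rolling window
(1)The Rolling object:

Create a Rolling object: [<r>=]<s_or_df>.rolling(<window>[,min_periods=None,center=False,win_type=None,on=None,axis=0,closed=None])
  #Parameters:
  	s_or_df: the Series/DataFrame object
  	  #Note: for a DataFrame, each column is computed separately
  	window: window size (the number of values used in each computation); int/str (a date offset)
  	min_periods: minimum number of values required to produce a result; int<=window/None (equal to window for an int window)
  	  #Once min_periods values are available, a result is computed from them even if there are still fewer than <window>
  	axis: axis along which to compute; int (0 computes down each column)/str
  	r: the resulting Rolling object

#Example: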
>>> s=pd.Series([1,3,4,5,2,5,7])
>>> r1=s.rolling(2)
>>> r1
Rolling [window=2,center=False,axis=0]
>>> r2=s.rolling(4,min_periods=2)

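(2)Calling the statistics functions:

Sum: <r>.sum()
Arithmetic mean: <r>.mean()
Variance: <r>.var()
Standard deviation: <r>.std()
Maximum: <r>.max()
Minimum: <r>.min()
Pearson correlation coefficient matrix: <r>.corr()
Covariance matrix: <r>.cov()
Skewness: <r>.skew()
Kurtosis: <r>.kurt()
  #Each computes the statistic over <window> consecutive values

#Example (continued):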
>>> r1.sum()
0     NaN
1     4.0#=<s>[0]+<s>[1]
2     7.0#=<s>[1]+<s>[2]
3     9.0#=<s>[2]+<s>[3]
4     7.0#=<s>[3]+<s>[4]
5     7.0#=<s>[4]+<s>[5]
6    12.0#=<s>[5]+<s>[6]
dtype: float64
>>> r1.mean()
0    NaN
1    2.0#=(<s>[0]+<s>[1])/2
2    3.5#=(<s>[1]+<s>[2])/2
3    4.5#=(<s>[2]+<s>[3])/2
4    3.5#=(<s>[3]+<s>[4])/2
5    3.5#=(<s>[4]+<s>[5])/2
6    6.0#=(<s>[5]+<s>[6])/2
dtype: float64
>>> r1.var()
0    NaN
1    2.0
2    0.5
3    0.5
4    4.5
5    4.5
6    2.0
dtype: float64
>>> r1.std()
0         NaN
1    1.414214
2    0.707107
3    0.707107
4    2.121320
5    2.121320
6    1.414214
dtype: float64
>>> r1.max()
0    NaN
1    3.0
2    4.0
3    5.0
4    5.0
5    5.0
6    7.0
dtype: float64
>>> r1.min()
0    NaN
1    1.0
2    3.0
3    4.0
4    2.0
5    2.0
6    5.0
dtype: float64
>>> r2.sum()
0     NaN
1     4.0#=<s>[0]+<s>[1]
2     8.0#=<s>[0]+<s>[1]+<s>[2]
3    13.0#=<s>[0]+<s>[1]+<s>[2]+<s>[3]
4    14.0#=<s>[1]+<s>[2]+<s>[3]+<s>[4]
5    16.0#=<s>[2]+<s>[3]+<s>[4]+<s>[5]
6    19.0#=<s>[3]+<s>[4]+<s>[5]+<s>[6]
dtype: float64
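
Two parameters from the signature are not exercised in the transcript above: center and a date-offset window. The following sketch tries both on small inputs; the date range is made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 4, 5, 2, 5, 7])

# center=True labels each result with the middle of its window.
centered = s.rolling(3, center=True).mean()

# A time-based window needs a datetime-like index;
# "2D" covers the current day and the day before it.
ts = pd.Series([1, 3, 4, 5], index=pd.date_range("2000-01-01", periods=4, freq="D"))
by_offset = ts.rolling("2D").sum()

print(centered.round(3).tolist())  # [nan, 2.667, 4.0, 3.667, 4.0, 4.667, nan]
print(by_offset.tolist())          # [1.0, 4.0, 7.0, 9.0]
```

With a date-offset window, min_periods defaults to 1, which is why the first value of by_offset is not NaN.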

2.Expanding window
(1)The Expanding object:

Create an Expanding object: <s_or_df>.expanding([min_periods=1,center=None,axis=0])
  #Every window starts at the first value and grows by one value at each step

#Example:
>>> s=pd.Series([1,3,4,5,2,5,7])
>>> e=s.expanding()
>>> for i in e:
...     print(i)
...
0    1
dtype: int64
0    1
1    3
dtype: int64
0    1
1    3
2    4
dtype: int64
...#output omitted here
0    1
1    3
2    4
3    5
4    2
5    5
6    7
dtype: int64

(2)Calling the statistics functions:

Same as part 1.(2)

#Example (continued):
>>> e.sum()
0     1.0#=<s>[0]
1     4.0#=<s>[0]+<s>[1]
2     8.0#=<s>[0]+<s>[1]+<s>[2]
3    13.0#=<s>[0]+<s>[1]+<s>[2]+<s>[3]
4    15.0#=<s>[0]+<s>[1]+<s>[2]+<s>[3]+<s>[4]
5    20.0#=<s>[0]+<s>[1]+<s>[2]+<s>[3]+<s>[4]+<s>[5]
6    27.0#=<s>[0]+<s>[1]+<s>[2]+<s>[3]+<s>[4]+<s>[5]+<s>[6]
dtype: float64
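
An expanding sum is the same thing as a cumulative sum, which gives a quick way to check the numbers above. This is a small sketch; the comparison against cumsum and the min_periods variant are additions to the original examples.

```python
import pandas as pd

s = pd.Series([1, 3, 4, 5, 2, 5, 7])

# The expanding sum matches the cumulative sum (converted to float).
same_as_cumsum = s.expanding().sum().equals(s.cumsum().astype(float))

# The expanding mean is the running average of everything seen so far.
running_mean = s.expanding().mean()

# min_periods delays the first result until enough values are available.
delayed = s.expanding(min_periods=3).sum()

print(same_as_cumsum)          # True
print(running_mean.iloc[3])    # 3.25
print(delayed.isna().tolist()) # [True, True, False, False, False, False, False]
```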

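3.Exponentially weighted moving window (ewm)
(1)The exponentially weighted object:

Create an ExponentialMovingWindow object: <s_or_df>.ewm([com=None,span=None,halflife=None,alpha=None,min_periods=0,adjust=True,ignore_na=False,axis=0,times=None])
  #A "decay factor" must be specified so that recent observations carry more weight
  #Parameters:
  	com: specifies the decay in terms of center of mass; float≥0
  	  #α=1/(1+com)
  	span: specifies the decay in terms of span; float≥1
  	  #α=2/(1+span)
  	halflife: specifies the decay in terms of half-life; float>0
  	  #α=1-math.exp(-math.log(2)/halflife)
  	alpha: specifies the smoothing factor α directly; 0<float≤1
  	  #Exactly one of the four parameters above should be given
  	adjust: whether to divide by the decaying sum of weights, which corrects for the short history at the start of the series; bool
  	  #adjust=True: y[t]=(x[t]+(1-α)x[t-1]+...+(1-α)^t x[0])/(1+(1-α)+...+(1-α)^t)
  	  #adjust=False: y[0]=x[0]; y[t]=(1-α)y[t-1]+αx[t]
  	ignore_na: whether to ignore NaN when computing the weights; bool

#Example: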
>>> ts=pd.Series(np.random.randn(100),index=pd.date_range('1/1/2000',periods=100,freq='D'))
>>> ts.ewm(alpha=0.5)
ExponentialMovingWindow [com=1.0,min_periods=1,adjust=True,ignore_na=False,axis=0]
>>> e=ts.ewm(alpha=0.5)

(2)Calling the statistics functions:

mean()/var()/std()/corr()/cov() are available

#Example (continued):
>>> import matplotlib.pyplot as plt
>>> e.mean().plot()
<AxesSubplot:>
>>> plt.show()

[Figure: plot of e.mean()]

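The effect of adjust on the exponentially weighted mean can be checked by hand on a short series. The sketch below assumes α=0.5 and a made-up four-element input; it compares pandas against the recursive form (adjust=False) and against the normalized weighted average (adjust=True).

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
alpha = 0.5

# adjust=False: y[0]=x[0], y[t]=(1-alpha)*y[t-1]+alpha*x[t]
recursive = [x.iloc[0]]
for value in x.iloc[1:]:
    recursive.append((1 - alpha) * recursive[-1] + alpha * value)
unadjusted = x.ewm(alpha=alpha, adjust=False).mean()

# adjust=True: weighted average with weights (1-alpha)**k, newest value first
adjusted = x.ewm(alpha=alpha, adjust=True).mean()
weights = [(1 - alpha) ** k for k in range(len(x))]
last_by_hand = sum(w * v for w, v in zip(weights, reversed(x.tolist()))) / sum(weights)

# span=3 corresponds to alpha=2/(1+3)=0.5
from_span = x.ewm(span=3, adjust=False).mean()

print(recursive)  # [1.0, 1.5, 2.25, 3.125]
print(all(abs(a - b) < 1e-12 for a, b in zip(unadjusted, recursive)))  # True
print(abs(adjusted.iloc[-1] - last_by_hand) < 1e-12)                   # True
print(all(abs(a - b) < 1e-12 for a, b in zip(from_span, unadjusted)))  # True
```

The two settings agree on the first value and converge for long series; they differ most in the first few periods, where adjust=True compensates for the missing history.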
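V.Group objects (GroupBy objects)
1.Creation:

Obtained from <S>.groupby() or <df>.groupby()

2.Usage

Note: a GroupBy object has not computed anything yet, but it holds everything needed to apply an operation to each group

(1)Iteration:

#Column names: 性别=sex, 成绩=score, 年龄=age; values: 男=male, 女=female
>>> df=pd.DataFrame({'性别':['男','女','男','女','男','女','男','男'],'成绩':[98,93,70,56,67,64,89,87],'年龄':[15,14,15,12,13,14,15,16]})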
>>> g=df.groupby("性别")
>>> for i,v in g:
...     print(i)
...     print(v)
...
女
  性别  成绩  年龄
1  女  93  14
3  女  56  12
5  女  64  14
男
  性别  成绩  年龄
0  男  98  15
2  男  70  15
4  男  67  13
6  男  89  15
7  男  87  16

(2)Indexing:

>>> g["年龄"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001A577D19820>
>>> g[["成绩","年龄"]]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A577D19820>
>>> gn=g["年龄"]
>>> for i,v in gn:
...     print(i)
...     print(v)
...
女
1    14
3    12
5    14
Name: 年龄, dtype: int64
男
0    15
2    15
4    13
6    15
7    16
Name: 年龄, dtype: int64

Syntactic sugar:

df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
#are equivalent to:
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

(3)Methods:

Mean of each group: <gb>.mean()

#Example:
>>> g.mean()
      成绩         年龄
性别
女   71.0  13.333333
男   82.2  14.800000
>>> g=df.groupby(["性别","年龄"])
>>> g.mean()
              成绩
性别 年龄
女  12  56.000000
   14  78.500000
男  13  67.000000
   15  85.666667
   16  87.000000

######################################################################################################################

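Number of records in each group: <gb>.size()

#Example (continued; g was rebound to a two-key grouping above, so the single-key grouping is rebuilt here):
>>> df.groupby("性别").size()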
性别
女    3
男    5
dtype: int64
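
The pieces of this part (lazy GroupBy object, column selection, per-group statistics) fit together as in the sketch below. It uses the same numbers as the example above with English column names and values ("sex", "score", "age", "M", "F") substituted for readability; agg is an addition to the methods listed in the original.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "M", "F", "M", "M"],
    "score": [98, 93, 70, 56, 67, 64, 89, 87],
    "age": [15, 14, 15, 12, 13, 14, 15, 16],
})

g = df.groupby("sex")                     # nothing is computed yet
sizes = g.size()                          # rows per group
mean_score = g["score"].mean()            # mean of one selected column
summary = g["score"].agg(["min", "max"])  # several statistics at once

print(sizes.to_dict())                # {'F': 3, 'M': 5}
print(mean_score.round(1).to_dict())  # {'F': 71.0, 'M': 82.2}
print(summary.loc["F"].tolist())      # [56, 93]
```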
