pandas Team Study Check-in Diary 4.23

Latest recommended article published on 2023-10-05 06:00:00

Litost.

Article tags: python numpy data analysis

Article link: https://blog.csdn.net/weixin_45792684/article/details/105663978

Chapter 2: Indexing

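Single-level Indexes

I. Common Indexing Methods

1. loc[ ]

df.loc[rows, columns]. A row or column slice may carry a step (start:stop:step); a step of -1 selects in reverse order.
Function-based indexing: (a) an anonymous (lambda) function, (b) a named function.
How it differs from iloc: (a) it selects by index label values; (b) slices include the right endpoint.

2. iloc[ ]

df.iloc[rows, columns]. A slice may carry a step here as well; a step of -1 selects in reverse order.
How it differs from loc: (a) it selects by absolute integer position; (b) slices exclude the right endpoint.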
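The endpoint difference is easy to miss, so here is a minimal sketch of it. The small table is a made-up stand-in for the post's data (the column names and ID values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical stand-in for the post's table: student IDs as the index.
df = pd.DataFrame(
    {"Math": [34.0, 32.5, 87.2, 80.4], "Gender": ["M", "F", "M", "F"]},
    index=pd.Index([1101, 1102, 1103, 1104], name="ID"),
)

# loc slices by label and includes the right endpoint (1103 is kept).
by_label = df.loc[1101:1103, "Math"]

# iloc slices by position and excludes the right endpoint (position 2 is dropped).
by_position = df.iloc[0:2, 0]

# A step of -1 inside the slice reverses the row order.
reversed_rows = df.loc[::-1]

# Function-based indexing: the callable receives the DataFrame itself.
high_math = df.loc[lambda x: x["Math"] > 80]
```

Here by_label holds three rows while by_position holds two, even though both slices look like "first to third".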

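3. The [ ] operator

(1) Series
Single-element selection: first take a Series out of the DataFrame.
You can select by absolute position, or replace the position with an index label and select by label.

s1 = pd.Series(df['Math'])
s1[0:3]

s2 = pd.Series(df['Math'],index=df.index)
s2[1101]

Multi-row selection: must use absolute positions (an integer slice such as s1[0:3] is positional).
Function-based indexing: anonymous functions
Boolean indexing

(2) DataFrame
Single-row selection: use a positional slice, or use get_loc to find the position of an index label. First call df.index.get_loc(label) to get the row's absolute position, then use that position in a [ ] slice.
Multi-row selection: like single-row selection, it is error-prone; prefer loc.
Single- and multi-column selection
Function-based indexing
Boolean indexing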
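The get_loc route for single-row selection can be sketched as follows. The table and its ID labels are invented for illustration, and the loc line is included only as a comparison.

```python
import pandas as pd

# Hypothetical table with student IDs as the index.
df = pd.DataFrame(
    {"Math": [34.0, 32.5, 87.2]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)

# Position of the label 1102 within the index.
pos = df.index.get_loc(1102)

# [ ] with an integer slice selects rows by position, so slice one row at pos.
row = df[pos:pos + 1]

# The label-based equivalent, which is usually less error-prone.
row_loc = df.loc[[1102]]
```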

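II. Boolean Indexing

Selects the rows for which the condition returns True.

1. Boolean operators:

& (and)
Combine two conditions as df[(condition1) & (condition2)]

df[(df['Gender']=='F')&(df['Address']=='street_2')].head()

| (or)
df[(condition1) | (condition2)]

df[(df['Math']>85)|(df['Address']=='street_7')].head()

~ (not)

df[~(condition)]

df[~(df['Address']=='street_1')]

Both loc and [ ] accept a boolean list in the corresponding position.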
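A small sketch of passing boolean lists to both accessors, using a made-up three-row table (the values are assumptions, not the post's data):

```python
import pandas as pd

df = pd.DataFrame(
    {"Gender": ["F", "M", "F"], "Math": [90.0, 60.5, 75.0]},
    index=[1101, 1102, 1103],
)

mask = (df["Gender"] == "F") & (df["Math"] > 80)

# The same boolean Series works in both [ ] and loc.
via_brackets = df[mask]
via_loc = df.loc[mask, ["Math"]]

# A plain Python list of booleans (one per row) is accepted too.
via_list = df.loc[[True, False, True]]
```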

2. The isin method

Syntax: df['column'].isin(['value1', 'value2'])

df[df['Address'].isin(['street_1','street_4'])&df['Physics'].isin(['A','A+'])]

With a dict:

df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],'Physics':['A','A+']}).all(axis=1)]
# all works along the same lines as &; axis=1 checks across the columns whether every value in a row is True

III. Fast Scalar Access

1. When to use it: only a single element is needed

2. Methods:

at: display(df.at[index_label, 'column'])
iat: display(df.iat[row_position, column_position])

3. Timing comparisons show that both are faster than loc and iloc
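A minimal sketch of the two accessors on an invented table; no timings are claimed here, only the calling convention:

```python
import pandas as pd

# Hypothetical table; the column names mirror those used in the post.
df = pd.DataFrame(
    {"School": ["S_1", "S_1", "S_2"], "Math": [34.0, 32.5, 87.2]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)

# at: row label plus column name.
by_label = df.at[1102, "Math"]

# iat: row position plus column position.
by_position = df.iat[1, 1]

# Both also support assigning a single cell.
df.at[1103, "Math"] = 90.0
```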

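IV. Interval Indexes

Intervals are open on the left and closed on the right by default.

1. interval_range

pd.interval_range(start=0,end=5)

pd.interval_range(start=0,periods=8,freq=5)

2. Use cut to turn a numeric column into a categorical variable whose elements are intervals

First define the bins with pd.cut(df['column'], bins=[num1, num2, num3, num4, num5, ...]).
Then join the intervals to the columns you need and make the interval column the index: df.join(intervals, rsuffix='_interval')[[column1, column2]].reset_index().set_index('interval column')

math_interval = pd.cut(df['Math'],bins=[0,40,60,80,100])

df_i = df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]\
            .reset_index().set_index('Math_interval')
df_i.head()

3. Selecting from an interval index

Select the rows whose interval contains a given value: df_new.loc[value]

df_i.loc[65].head()

Select the rows whose intervals contain any of several values: df_new.loc[[value1, value2]]

df_i.loc[[65,90]].head()

To select the rows that fall in a given interval, first convert the categorical index to an interval index and then use the overlaps method; using loc directly raises an error: df_new[df_new.index.astype('interval').overlaps(pd.Interval(lower, upper))].head()

#df_i.loc[pd.Interval(70,75)].head() raises an error
df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head()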
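The selection rules above can be tried on a tiny example. This sketch uses five invented scores and builds the interval index from the cut categories rather than through astype('interval'); that construction is a choice made for this illustration, not the post's method.

```python
import pandas as pd

scores = [35, 55, 65, 72, 90]

# pd.cut labels each score with a right-closed interval such as (60, 80].
cat = pd.cut(scores, bins=[0, 40, 60, 80, 100])

# Build a true IntervalIndex from the categorical codes (no missing values here).
interval_index = cat.categories[cat.codes]
df_iv = pd.DataFrame({"Math": scores}, index=interval_index)

# A scalar selects every row whose interval contains it.
contains_65 = df_iv.loc[65]

# overlaps() flags every interval that intersects (70, 85].
overlapping = df_iv[df_iv.index.overlaps(pd.Interval(70, 85))]
```

Because (70, 85] touches both (60, 80] and (80, 100], the overlaps selection returns rows from two bins.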

Multi-level Indexes

I. Creating a Multi-level Index

1. from_tuples

Create the tuples directly

tuples = [('A','a'),('A','b'),('B','a'),('B','b')]
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

Create the tuples with zip

L1 = list('AABB')
L2 = list('abab')
tuples = list(zip(L1,L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

Create from arrays
Each inner list is converted to a tuple automatically

arrays = [['A','a'],['A','b'],['B','a'],['B','b']]
mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

2. from_product

Builds one tuple for every pairwise combination (the Cartesian product) of the elements of the two lists

L1 = ['A','B']
L2 = ['a','b']
pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
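To make the relationship between the two constructors concrete, the following sketch builds the same index both ways and compares them (variable names are chosen for this example):

```python
import pandas as pd

upper, lower = ["A", "B"], ["a", "b"]

# Cartesian product: every upper label paired with every lower label.
from_product = pd.MultiIndex.from_product([upper, lower], names=("Upper", "Lower"))

# The same index written out tuple by tuple.
from_tuples = pd.MultiIndex.from_tuples(
    [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")], names=("Upper", "Lower")
)

same = from_product.equals(from_tuples)
```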

3. Specifying the index levels through set_index

df_using_mul = df.set_index(['Class','Address'])
df_using_mul.head()

II. Slicing a Multi-level Index

1. Ordinary slicing

(1) A single key

df.loc['index1', 'index2']

This may raise an error when the index is not sorted

df.sort_index().loc['index1', 'index2']

(2) Multi-level slices
With tuples as both endpoints

df.sort_index().loc[('index1.1', 'index2.1'):('index1.2', 'index2.2')]

Multi-level slicing does not work on an unsorted index

With a non-tuple endpoint

df.sort_index().loc[('index1.1', 'index2.1'):'index1.2']

2. Special cases

(1) A list of tuples

df.sort_index().loc[[('index1.1', 'index2.1'), ('index1.2', 'index2.2')]]

(2) A tuple of lists

df.sort_index().loc[(['index1.1', 'index1.2'], ['index2.1', 'index2.2']), :]
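A runnable version of these patterns, on a small invented table with Class and Address levels (the labels follow the post's naming, the numbers are made up):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["C_1", "C_2", "C_3"], ["street_1", "street_2"]], names=("Class", "Address")
)
df_mul = pd.DataFrame({"Math": [60, 70, 80, 90, 50, 40]}, index=index).sort_index()

# Slice between two full keys; both endpoints are included.
tuple_slice = df_mul.loc[("C_1", "street_2"):("C_2", "street_2")]

# A list of tuples picks exactly those (Class, Address) pairs.
pairs = df_mul.loc[[("C_1", "street_1"), ("C_3", "street_2")]]

# A tuple of lists takes the cross product of the listed labels;
# the trailing ", :" makes clear that the tuple addresses rows only.
cross = df_mul.loc[(["C_1", "C_3"], ["street_2"]), :]
```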

III. The slice Object with Multi-level Indexes

Applies to: a two-level row index and two-level columns

L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_s

Very flexible

idx=pd.IndexSlice
df_s.loc[idx['B':,df_s['D']['d']>0.3],idx[df_s.sum()>4]]
# idx['B':,df_s['D']['d']>0.3] selects rows: first restrict to 'B' and 'C', then keep the rows where the D-d column is greater than 0.3.
# idx[df_s.sum()>4] selects columns: those whose column sum is greater than 4
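Since the example above uses random numbers, its output changes on every run. A deterministic sketch with a smaller, invented table shows the same IndexSlice mechanics with per-level slices:

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
df_small = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

idx = pd.IndexSlice

# Rows: every Upper label, but only Lower == 'b'.
# Columns: Big from 'E' onward, every Small label.
block = df_small.loc[idx[:, "b"], idx["E":, :]]
```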

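IV. Swapping Index Levels

1. swaplevel (swaps two levels)

df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index().head()

2. reorder_levels (reorders several levels)

# change the original level order [0,1,2] to [2,0,1]
# (df_muls is not defined in these notes; the names below imply its levels are School, Class, Address)
df_muls.reorder_levels([2,0,1],axis=0).sort_index().head()

# level names can be used directly as well
df_muls.reorder_levels(['Address','School','Class'],axis=0).sort_index().head()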
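A self-contained sketch of both methods follows. It assumes a three-level index in the order School, Class, Address, built from an invented three-row table.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "School": ["S_1", "S_1", "S_2"],
        "Class": ["C_1", "C_2", "C_1"],
        "Address": ["street_1", "street_2", "street_1"],
        "Math": [34.0, 32.5, 87.2],
    }
)
df_muls = df.set_index(["School", "Class", "Address"])

# swaplevel exchanges exactly two levels.
swapped = df_muls.swaplevel(i=0, j=1, axis=0)

# reorder_levels rearranges all levels at once, by position or by name.
by_position = df_muls.reorder_levels([2, 0, 1], axis=0)
by_name = df_muls.reorder_levels(["Address", "School", "Class"], axis=0)
```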

Setting the Index

I. The index_col Parameter

Used when importing a file

pd.read_csv('data/table.csv',index_col=['Address','School']).head()

II. reindex and reindex_like

1. reindex

Re-indexes the data. Its key feature is index alignment, and it is often used for reordering

df.reindex(index=[1101,1203,1206,2402])
df.reindex(columns=['Height','Gender','Average']).head()

Filling missing values: fill_value, or method='bfill'/'ffill'/'nearest'

df.reindex(index=[1101,1203,1206,2402],method='bfill')
# bfill fills index 1206 from the next valid row, ffill from the previous valid row, and nearest from the closest one
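The fill rules can be checked on a three-row table. The IDs and scores here are invented, and 1206 is deliberately absent from the original index:

```python
import pandas as pd

df = pd.DataFrame({"Math": [34.0, 87.2, 61.7]}, index=[1101, 1203, 2402])
new_index = [1101, 1203, 1206, 2402]

# 1206 is not in the index, so its row is NaN unless a fill rule is given.
plain = df.reindex(index=new_index)

# bfill borrows from the next existing label (2402); ffill from the previous one (1203).
back = df.reindex(index=new_index, method="bfill")
forward = df.reindex(index=new_index, method="ffill")

# fill_value substitutes a constant instead.
constant = df.reindex(index=new_index, fill_value=0.0)
```

Note that the method argument requires the original index to be monotonic, which holds for this sorted example.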

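2. reindex_like

Produces a DataFrame whose row and column indexes exactly match those of the argument, filled with data from the table the method is called on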

df_temp.reindex_like(df[0:5][['Weight','Height']])

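III. set_index and reset_index

1. set_index

(1) Using a column of the table as the index

df.set_index('Class').head()

Adding the append=True parameter keeps the current index unchanged and adds the column as one more index level

df.set_index('Class',append=True).head()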
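A short sketch contrasting the default behavior with append=True, and showing reset_index undoing both. The table is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Class": ["C_1", "C_2", "C_1"], "Math": [34.0, 32.5, 87.2]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)

# Default: Class replaces the existing ID index.
replaced = df.set_index("Class")

# append=True keeps ID and adds Class as a second level.
appended = df.set_index("Class", append=True)

# reset_index moves every level back into columns and restores 0, 1, 2, ...
restored = appended.reset_index()
```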

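(2) To use a column that is not in the table as the index, convert it to a Series first

df.set_index(pd.Series(range(df.shape[0]))).head()

Several such columns can be passed directly to build a multi-level index

df.set_index([pd.Series(range(df.shape[0])),pd.Series(np.ones(df.shape[0]))]).head()

2. reset_index

By default it restores the natural-number index (0, 1, 2, ...)

df.reset_index().head()

Use the level parameter to choose which index level is reset, and the col_level parameter to choose which column level it is placed in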

df_temp1 = df_temp.reset_index(level=1,col_level=1)

IV. rename_axis and rename

1. rename_axis

A method aimed at multi-level indexes: it changes the name of an index level, not the index labels

df_temp.rename_axis(index={'Lower':'LowerLower'},columns={'Big':'BigBig'})

2. rename

Changes the row or column index labels, not the index names

df_temp.rename(index={'A':'T'},columns={'e':'changed_e'})
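The difference between level names and labels is easiest to see side by side. This sketch uses a small invented frame with a two-level row index (df_temp here is built for the example, not the post's table):

```python
import pandas as pd

rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
df_temp = pd.DataFrame({"d": [1, 2, 3, 4], "e": [5, 6, 7, 8]}, index=rows)

# rename_axis touches only the level names.
renamed_axis = df_temp.rename_axis(index={"Lower": "LowerLower"})

# rename touches only the labels inside the index and columns.
renamed_labels = df_temp.rename(index={"A": "T"}, columns={"e": "changed_e"})
```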

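Common Indexing Functions

I. The where Function

Entries where the condition is not met (False) are all filled with NaN

df.where(df['Gender']=='M').head()

1. Parameters

The first is the boolean condition, the second is the fill value

df.where(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()

2. The dropna method

df.dropna()  # default axis=0: drop rows that contain missing values
df.dropna(how = 'all')    # drop rows in which every element is missing
df.dropna(axis = 1)       # drop columns that contain missing values
df.dropna(axis=1,how="all")   # drop columns in which every element is missing
df.dropna(axis=0,subset = ["column1", "column2"])   # drop rows with a missing value in 'column1' or 'column2'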

II. The mask Function

Entries where the condition is met (True) are all filled with NaN

df.mask(df['Gender']=='M').dropna().head()
df.mask(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()
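where and mask are mirror images of each other, which the following sketch illustrates on an invented three-row table:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M"], "Math": [34.0, 32.5, 87.2]})
is_male = df["Gender"] == "M"

# where keeps rows where the condition is True and blanks the rest.
kept_male = df.where(is_male)

# mask does the opposite: rows where the condition is True are blanked.
kept_female = df.mask(is_male)

# Dropping the all-NaN rows recovers ordinary boolean selection.
male_rows = kept_male.dropna(how="all")
female_rows = kept_female.dropna(how="all")
```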

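III. The query Function

Inside the boolean expression passed to query, all of the following are valid: row and column index names, strings, and/not/or/&/|/~/not in/in/==/!=, and the four arithmetic operators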

df.query('(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])')
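For comparison, here is the same kind of query next to its boolean-mask equivalent. The four-row table is invented; ID is the index name, which query can refer to directly.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Address": ["street_6", "street_7", "street_1", "street_7"],
        "Weight": [85, 70, 90, 95],
    },
    index=pd.Index([1303, 2304, 1101, 2402], name="ID"),
)

# Column names, the index name (ID), 'in', comparison and arithmetic in one string.
picked = df.query(
    '(Address in ["street_6", "street_7"]) & (Weight > (70 + 10)) & (ID in [1303, 2304, 2402])'
)

# The equivalent boolean-mask form.
mask = (
    df["Address"].isin(["street_6", "street_7"])
    & (df["Weight"] > 80)
    & df.index.isin([1303, 2304, 2402])
)
```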

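Handling Duplicate Elements

I. The duplicated Method

Returns a boolean Series

1. Default behavior

df.duplicated('column') reports whether each element of the column is a repeat
Returns False: the element appears for the first time
Returns True: the element repeats one that appeared earlier

# based on one column
df.duplicated('Class').head()
# based on several columns
df.duplicated(['Class','School'])

2. Parameters

keep parameter: the default is 'first', where the first occurrence counts as not duplicated; keep='last' treats the last occurrence as not duplicated; keep=False marks every repeated item as duplicated (True)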

df.duplicated('Class',keep='last').tail()
df.duplicated('Class',keep=False).head()
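The three keep settings can be compared on one invented column:

```python
import pandas as pd

df = pd.DataFrame({"Class": ["C_1", "C_2", "C_1", "C_3", "C_2"]})

first = df.duplicated("Class")               # first occurrence is not a duplicate
last = df.duplicated("Class", keep="last")   # last occurrence is not a duplicate
every = df.duplicated("Class", keep=False)   # all repeated values are duplicates

# drop_duplicates keeps the rows that duplicated() marks False.
deduped = df.drop_duplicates("Class")
```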

II. The drop_duplicates Method

Removes duplicates; it takes the same parameters as duplicated and returns a DataFrame directly

df.drop_duplicates('Class')
df.drop_duplicates('Class',keep='last')
df.drop_duplicates(['School','Class'])

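The sample Function for Sampling

(a) n is the sample size

(b) frac is the sampling fraction

(c) replace sets whether to sample with replacement

(d) axis is the sampling dimension; the default 0 samples rows, axis=1 samples columns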

df.sample(n=3,axis=1).head()

(e) weights gives the sample weights, which are normalized automatically

df.sample(n=3,weights=np.random.rand(df.shape[0])).head()

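Using a column as the weights
The weights express the importance of each row: rows with larger weights are more likely to be drawn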

df.sample(n=3,weights=df['Math']).head()
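The notes leave n, frac and replace without code, so here is a minimal sketch of all of them on an invented five-row table. The random_state argument is an addition for this example, used only to make the draws repeatable.

```python
import pandas as pd

df = pd.DataFrame({"Math": [34.0, 32.5, 87.2, 80.4, 84.8]})

# n: draw exactly three rows.
by_count = df.sample(n=3, random_state=0)

# frac: 0.4 of 5 rows gives 2 rows.
by_fraction = df.sample(frac=0.4, random_state=0)

# replace=True samples with replacement, so the sample may exceed the table size.
with_replacement = df.sample(n=8, replace=True, random_state=0)

# A weight of 0 means the row can never be drawn.
weighted = df.sample(n=2, weights=[0, 0, 1, 1, 0], random_state=0)
```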

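Questions and Exercises

Questions

Question 1:
1. To change the order of rows or columns: first pull out the data of the column to be moved, then delete that column, and finally insert it back at the desired position.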

df_id = df.id
df = df.drop('id',axis=1)
df.insert(0,'id',df_id)

2. How to swap odd and even rows
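The notes leave this part unanswered. One possible approach, offered only as a sketch, is to build the swapped list of positions and pass it to iloc. The table and the helper names below are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": list("abcde")})

# Build the positions 1, 0, 3, 2, ... so each pair of rows trades places.
positions = np.arange(len(df))
swapped = positions.copy()
pair_end = len(df) - len(df) % 2          # leave a trailing unpaired row in place
swapped[0:pair_end:2] = positions[1:pair_end:2]
swapped[1:pair_end:2] = positions[0:pair_end:2]

df_swapped = df.iloc[swapped]
```

The same position array could be applied to columns with df.iloc[:, swapped] if it is built from the number of columns instead.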

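Question 2:
Methods: loc; iloc; the [ ] operator directly; isin.

Question 3:
The query function is faster than the other methods.

Question 4:
Slice objects can be used with a single-level index.

Question 5:
Use df.dropna(axis=1) to drop the columns that contain missing values.

Question 6:
1. All the ways of setting an index

index_col parameter: passed as an argument when importing a file

reindex and reindex_like: re-indexing, often used for reordering

set_index
Use a column of the table as the index: df.set_index('column')
To use a list that is not in the table as the index, convert it to a Series first
Several columns can be added directly

reset_index: resets the index

rename_axis and rename: rename the index names or the index labels
2.

Question 7:
Suited to situations that demand high efficiency.

Question 8:
Duplicate handling is needed when repeated elements are of no use and have to be removed.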

Exercises

Exercise 1:
（a）

import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/76709/Desktop/4.18组对学习pandas/joyful-pandas-master/data/UFO.csv')
df.rename(columns={'duration (seconds)':'duration'},inplace=True)
df['duration'] = df['duration'].astype('float')  # astype returns a new Series, so assign it back
df.head()
df_new=df.loc[df['duration']>60.0]
df_new.set_index('shape')
df_new['shape'].value_counts().head(1).index

（b）

interval1=pd.cut(df['latitude'],bins=[-90,-72,-54,-36,-18,0,18,36,54,72,90])
interval2=pd.cut(df['longitude'],bins=[-180,-150,-120,-90,-60,-30,0,30,60,90,120,150,180])

df_interval=pd.DataFrame({'col1':list(interval1),'col2':list(interval2),'col3':df['latitude'],'col4':df['longitude']})
df_interval
df2=df_interval.set_index(['col1','col2'])
df2
df2.index.value_counts()

Exercise 2:
（a）

import pandas as pd
import numpy as np
df=pd.read_csv('C:/Users/76709/Desktop/4.18组对学习pandas/joyful-pandas-master/data/Pokemon.csv')
df['Type 2'].count()/df['#'].count()

（b）

df2=df.loc[lambda x:x['Total']>580]
df2.shape[0]
df2['Legendary'].value_counts()/df2.shape[0]

（c）

df[df['Type 1']=='Fighting'].sort_values(by='Attack',ascending=False).iloc[:3]

(d) Somewhat difficult


df['range'] = df.iloc[:,5:11].max(axis=1)-df.iloc[:,5:11].min(axis=1)
attribute = df[['Type 1','range']].set_index('Type 1')
max_range = 0
result = ''
for i in attribute.index.unique():
    temp = attribute.loc[i,:].mean()
    if temp.values[0] > max_range:
        max_range = temp.values[0]
        result = i
result
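The loop above finds the type with the largest mean range. A shorter alternative, shown here only as a sketch on invented data rather than the Pokemon file, is to group by the type and take the label of the largest mean:

```python
import pandas as pd

# Hypothetical miniature of the table: a type column and a precomputed range column.
df_demo = pd.DataFrame(
    {"Type 1": ["Fire", "Fire", "Water", "Water"], "range": [10, 30, 50, 20]}
)

# Mean range per type, then the type whose mean is largest.
mean_range = df_demo.groupby("Type 1")["range"].mean()
result_demo = mean_range.idxmax()
```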

(e) Somewhat difficult

df.query('Legendary == True')['Type 1'].value_counts(normalize=True).index[0]
attribute = df.query('Legendary == True')[['Type 1','Total']].set_index('Type 1')
max_value = 0
result = ''
for i in attribute.index.unique()[:-1]:
    temp = attribute.loc[i,:].mean()
    if temp.iloc[0] > max_value:
        max_value = temp.iloc[0]
        result = i
result