Python Data Analysis: A Summary of pandas Usage

Latest recommended article published 2021-10-04 12:51:11

ZCH_Debby

Tags: big data

Article link: https://blog.csdn.net/ZCH_Debby/article/details/105644468

Part 1: pandas Basics

1.1 Reading and writing files

1.1.1 Reading files

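import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv')
df_txt = pd.read_table('data/table.txt')  # read_table uses tab as the default separator
df_excel = pd.read_excel('data/table.xlsx')  # needs an Excel engine such as openpyxl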

1.1.2 Writing files

df.to_csv('data/new_table.csv', index=False)  # index=False leaves out the row index
df.to_excel('data/new_table2.xlsx', sheet_name='Sheet1')  # requires openpyxl to be installed
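A minimal round-trip sketch of the read and write calls above. It assumes a small made-up table (`df_demo`, not the `data/table.csv` file) and writes to an in-memory buffer so that it runs without any files on disk.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data/table.csv
df_demo = pd.DataFrame({"ID": [1101, 1102], "Math": [34.0, 32.5]})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)  # index=False: the row index is not written
buffer.seek(0)                       # rewind before reading the text back

df_back = pd.read_csv(buffer)
round_trip_ok = df_back.equals(df_demo)
```

With `index=False` omitted, the row index would be written as an extra unnamed column and read back as data.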

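1.2 Basic data structures

1.2.1 Series

s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'],name='this is a Series',dtype='float64')
s.values  # underlying NumPy array
s.index   # index labels
s.dtype   # element type
s['a']    # look up a value by label
s.mean()  # one of the many methods a Series provides

[Figure: methods available on a Series]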

1.2.2 DataFrame

df = pd.DataFrame({'col1':list('abcde'),'col2':range(5,10),'col3':[1.3,2.5,3.6,4.6,5.8]},index=list('一二三四五'))
df['col1']
df.rename(index={'一':'one'},columns={'col1':'new_col1'})  # rename row and column labels; returns a new DataFrame
df.index
df.columns
df.values
df.shape
df.mean(numeric_only=True)  # numeric_only=True skips the string column col1

df1 = pd.DataFrame({'A':[1,2,3]},index=[1,2,3])
df2 = pd.DataFrame({'A':[1,2,3]},index=[3,1,2])
df1-df2  # because of **index alignment**, the result is not all zeros
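The alignment behaviour can be checked on the two frames above. This sketch compares the label-aligned subtraction with a positional one on the underlying NumPy arrays; the names `aligned` and `positional` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'A': [1, 2, 3]}, index=[3, 1, 2])

# Rows are matched by index label, so label 1 computes 1 - 2, label 3 computes 3 - 1
aligned = df1 - df2

# Plain NumPy arrays carry no labels, so rows are matched by position
positional = df1.values - df2.values
```

Dropping to `.values` is one way to opt out of alignment when a purely positional operation is intended.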

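df.drop(index='五',columns='col1')   # drop a row and a column; returns a copy unless inplace=True, which modifies the original DataFrame
del df['col1']  # deletes the column in place
df.pop('col1')  # alternative to del (col1 must still exist): pop works on the original DataFrame and returns the removed column, like pop on Python containers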

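df1['B']=list('abc')  # add a column in place
df1.assign(C=pd.Series(list('def')))  # returns a new DataFrame; the Series is aligned on the index, so row labels with no match get NaN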
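Because `assign` aligns a Series on the index, the new column can end up shifted. A small sketch, reusing `df1` from above with illustrative variable names:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
new_col = pd.Series(list('def'))            # default index 0, 1, 2

# Aligned by label: rows 1 and 2 pick up 'e' and 'f', row 3 has no match
misaligned = df1.assign(C=new_col)

# A bare array has no index, so values are assigned in order
by_position = df1.assign(C=new_col.values)
```

In both cases `df1` itself is left unchanged, since `assign` returns a new DataFrame.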

df.select_dtypes(include=['number']).head()  # select columns by dtype
df.select_dtypes(include=['float']).head()

s.to_frame()  # convert a Series to a DataFrame
s.to_frame().T  # the result can be transposed

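1.3 Common basic functions

df.head()
df.tail()
df['column_name'].unique()   # all distinct values in a column
df['column_name'].nunique()  # number of distinct values
df['column_name'].count()  # count returns the number of non-missing elements
df['column_name'].value_counts()  # value_counts returns how often each value occurs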
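These four calls treat missing values differently, which a tiny made-up Series (an assumption for illustration) makes visible:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
col = pd.Series(['a', 'b', 'a', np.nan, 'a'])

distinct = col.unique()        # includes the missing value as its own entry
n_distinct = col.nunique()     # missing values are ignored by default
n_present = col.count()        # number of non-missing elements
frequencies = col.value_counts()  # missing values are dropped by default
```

Passing `dropna=False` to `nunique` or `value_counts` makes them count the missing value as well.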

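df.info()  # info reports the columns, the non-null count of each, and each column's dtype

df.describe()  # describe summarizes numeric columns by default; it can also be called on a single column, and custom quantiles can be requested with df.describe(percentiles=[.05, .25, .75, .95])

df['Math'].idxmax()  # index label of the maximum value
df['Math'].nlargest(3)   # the three largest values together with their index labels

df['Math'].clip(33,80).head()  # clip truncates values below 33 or above 80 to those bounds
(df['Math']-df['Math'].mean()).abs().mean()  # mean absolute deviation; Series.mad() computed this but was removed in pandas 2.0
df['Address'].replace(['street_1','street_2'],['one','two']).head()  # replace substitutes the listed values
df.replace({'Address':{'street_1':'one','street_2':'two'}}).head()  # the same replacement written as a nested dict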
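A short sketch of clipping, the mean absolute deviation, and replacement on made-up values. The data and variable names are assumptions, not the contents of `table.csv`.

```python
import pandas as pd

# Hypothetical scores and addresses
scores = pd.Series([20.0, 50.0, 90.0])
addresses = pd.Series(['street_1', 'street_2', 'street_3'])

clipped = scores.clip(33, 80)   # values outside [33, 80] are pulled to the bounds

# Mean absolute deviation around the mean, written out by hand
mean_abs_dev = (scores - scores.mean()).abs().mean()

# Values not listed in the first argument are left untouched
renamed = addresses.replace(['street_1', 'street_2'], ['one', 'two'])
```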

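df['data1'] = df['data1'].map(lambda x : "%.3f"%x)  # Series.map applies a function to every element of a Series; DataFrame only gained an element-wise map() in pandas 2.1 (applymap before that)

# apply() applies a function to a Series element by element, or to each row or column of a DataFrame
df['Math'].apply(lambda x:str(x)+'!').head()  # a lambda or a named function both work
df.apply(lambda x:x.apply(lambda x:str(x)+'!')).head()  # a slightly more involved example: the outer apply receives each column, the inner one each element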
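To make the difference between `map` and `apply` concrete, here is a sketch on a two-row frame with made-up values. It contrasts element-wise `map` with `apply` along each axis.

```python
import pandas as pd

# Hypothetical data; column names borrowed from the examples above
df_demo = pd.DataFrame({'Math': [34.0, 32.5], 'Height': [173, 192]})

# Series.map: the function sees one element at a time
formatted = df_demo['Math'].map(lambda x: "%.3f" % x)

# DataFrame.apply, default axis=0: the function sees one column (a Series) at a time
col_ranges = df_demo.apply(lambda col: col.max() - col.min())

# axis=1: the function sees one row at a time
row_sums = df_demo.apply(lambda row: row.sum(), axis=1)
```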

1.4 Sorting

1.4.1 Sorting by index

df.set_index('Math').sort_index().head()  # the ascending parameter defaults to True (ascending order)

1.4.2 Sorting by values

df.sort_values(by=['Address','Height']).head()
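When sorting by several columns, `ascending` can also be a list with one flag per column. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows indexed by student ID
df_demo = pd.DataFrame(
    {'Address': ['street_2', 'street_1', 'street_1'],
     'Height': [173, 192, 186]},
    index=[1101, 1102, 1103],
)

# Address ascending, then Height descending within each address
by_values = df_demo.sort_values(by=['Address', 'Height'], ascending=[True, False])

# Index labels in descending order
by_index_desc = df_demo.sort_index(ascending=False)
```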

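Part 2: Indexing

df = pd.read_csv('data/table.csv',index_col='ID')
df.head()

2.1 Single-level indexes

2.1.1 Using loc, iloc, and [ ]

These three are the most commonly used indexing methods: iloc selects by position, loc selects by label, and [] is a convenient shorthand. Each has its own characteristics.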

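The loc method
loc selects by label. It accepts a single label, a list of labels, a label slice, a boolean array, or a function that returns one of these.

df.loc[1103]    # a single row

df.loc[[1102,2304]]    # several rows

df.loc[:,'Height'].head()   # a single column

df.loc[:,['Height','Math']].head()   # several columns

df.loc[1102:2401:3,'Height':'Math'].head()    # rows and columns together; label slices include both endpoints

df.loc[lambda x:x['Gender']=='M'].head()   # indexing with a function

df.loc[df['Address'].isin(['street_7','street_4'])].head()   # boolean indexing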
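Since `table.csv` is not included here, the following sketch rebuilds a four-row stand-in (all values are assumptions) to show that label slices in `loc` include both endpoints:

```python
import pandas as pd

# Hypothetical stand-in for the table indexed by ID
df_demo = pd.DataFrame(
    {'Gender': ['M', 'F', 'M', 'F'],
     'Height': [173, 192, 186, 167],
     'Math': [34.0, 32.5, 87.2, 80.4]},
    index=pd.Index([1101, 1102, 1103, 1104], name='ID'),
)

single_row = df_demo.loc[1103]                          # a Series for one row
label_slice = df_demo.loc[1102:1103, 'Height':'Math']   # both endpoints included
males = df_demo.loc[lambda x: x['Gender'] == 'M']       # function returning a boolean mask
```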


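The iloc method
iloc selects by integer position. It accepts an integer, a list of integers, an integer slice, or a function returning one of these; a boolean Series cannot be passed directly (pass its .values instead).

df.iloc[3]    # a single row
df.iloc[:,7::-2].head()   # a column slice: from position 7 backwards in steps of 2
df.iloc[lambda x:[3]].head()  # indexing with a function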
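A sketch of positional selection on a made-up frame. It also shows one way to use a boolean condition with `iloc`, by passing the bare array from `.values`.

```python
import pandas as pd

# Hypothetical data; only the positions matter here
df_demo = pd.DataFrame(
    {'Height': [173, 192, 186, 167], 'Math': [34.0, 32.5, 87.2, 80.4]},
    index=[1101, 1102, 1103, 1104],
)

fourth_row = df_demo.iloc[3]            # position 3, i.e. the row labelled 1104
reversed_cols = df_demo.iloc[:, ::-1]   # columns in reverse order

mask = df_demo['Math'] > 60             # boolean Series indexed by the row labels
high_math = df_demo.iloc[mask.values]   # the bare boolean array is accepted
```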

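The [ ] operator
To stay out of trouble, avoid [] when the row index is floating point: on a Series with a float index, [] matches keys against the index values rather than positions, which is a special case.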
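The float-index case can be seen on a three-element Series (made-up values). Being explicit with `loc` and `iloc` avoids the ambiguity.

```python
import pandas as pd

# Hypothetical Series whose index is floating point and not in order
s_float = pd.Series([10, 20, 30], index=[2.0, 0.0, 1.0])

by_bracket = s_float[1.0]      # matched against the index values: the element labelled 1.0
by_label = s_float.loc[1.0]    # explicit label lookup, same element
by_position = s_float.iloc[1]  # explicit position lookup: the second element
```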

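[ ] on a Series

s = pd.Series(df['Math'],index=df.index)
s[1101]  # lookup by label
s[0:4]   # integer slices select by position
s[lambda x: x.index[16::-6]]  # a function returning labels
s[s>80]  # boolean selection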

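[ ] on a DataFrame
In general, [] is used for column selection or boolean selection; avoid using it to select rows.

df[1:2]  # an integer slice selects rows by position

row = df.index.get_loc(1102)  # integer position of the row labelled 1102
df[row:row+1]

df[['School','Math']].head()
df['School'].head()

df[lambda x:['Math','Physics']].head()

df[df['Gender']=='F'].head()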

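2.1.2 Boolean indexing

Boolean operators: '&', '|', and '~' stand for and, or, and not respectively.

df[(df['Gender']=='F')&(df['Address']=='street_2')].head()
df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head()
# Without .values, index alignment raises an error here. Index alignment is an important pandas feature and is often very useful,
# but it becomes a hidden hazard if overlooked.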
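The role of `.values` in a column mask can be reproduced on a small frame. In this sketch (all names and data are assumptions) a boolean Series indexed 0, 1, 2 cannot be aligned with the column labels, while the bare array or a Series indexed by the column labels both work.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

raw_mask = pd.Series([True, False, True])   # index is 0, 1, 2, not 'a', 'b', 'c'

try:
    df_demo.loc[:, raw_mask]                # labels cannot be aligned with the columns
    misaligned_accepted = True
except Exception:
    misaligned_accepted = False

via_values = df_demo.loc[:, raw_mask.values]            # positional mask, no alignment

labelled_mask = pd.Series([True, False, True], index=df_demo.columns)
via_labels = df_demo.loc[:, labelled_mask]              # aligned on column labels
```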

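The isin method

df[df['Address'].isin(['street_1','street_4'])&df['Physics'].isin(['A','A+'])]
# The same selection can be written with a dict:
df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],'Physics':['A','A+']}).all(axis=1)]
# all works along the same lines as &; axis=1 checks across the columns whether every value in a row is True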
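The two spellings can be checked against each other on a few made-up rows:

```python
import pandas as pd

# Hypothetical rows using the column names from the examples above
df_demo = pd.DataFrame({
    'Address': ['street_1', 'street_2', 'street_4', 'street_1'],
    'Physics': ['A+', 'A', 'B', 'C'],
})

# Two isin masks combined with &
with_and = df_demo[
    df_demo['Address'].isin(['street_1', 'street_4'])
    & df_demo['Physics'].isin(['A', 'A+'])
]

# One dict-based isin, then require True in every column of the row
allowed = {'Address': ['street_1', 'street_4'], 'Physics': ['A', 'A+']}
with_dict = df_demo[df_demo[['Address', 'Physics']].isin(allowed).all(axis=1)]

same_result = with_and.equals(with_dict)
```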

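2.1.3 Fast scalar indexing

When only a single element is needed, at and iat provide a faster implementation.

display(df.at[1101,'School'])
display(df.loc[1101,'School'])
display(df.iat[0,0])
display(df.iloc[0,0])
# Uncomment the lines below to compare timings (%timeit requires IPython or Jupyter)
#%timeit df.at[1101,'School']
#%timeit df.loc[1101,'School']
#%timeit df.iat[0,0]
#%timeit df.iloc[0,0]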

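2.1.4 Interval indexing

Interval indexes are not limited to single-level indexes; they are introduced here simply because they are a special kind of index.

Using interval_range

pd.interval_range(start=0,end=5)
# closed can be 'left', 'right', 'both', or 'neither'; the default is 'right' (left-open, right-closed)

pd.interval_range(start=0,periods=8,freq=5)
# periods sets the number of intervals; freq sets the step (the width of each interval)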
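With the output figures missing, the following sketch inspects the two kinds of call directly. The second call adds `closed='left'` (an addition for illustration) to show the effect of that parameter.

```python
import pandas as pd

idx = pd.interval_range(start=0, end=5)   # five unit-width intervals, right-closed by default
third = idx[2]                            # the interval from 2 to 3

right_end_included = 3 in third           # the right endpoint belongs to the interval
left_end_included = 2 in third            # the left endpoint does not

# Eight intervals of width 5, closed on the left instead
idx_left = pd.interval_range(start=0, periods=8, freq=5, closed='left')
last = idx_left[-1]
```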

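cut converts a numeric column into a categorical variable whose elements are intervals, for example to count how Math scores fall into ranges:

math_interval = pd.cut(df['Math'],bins=[0,40,60,80,100])
# Note: without a type conversion the result has category dtype, not interval dtype
math_interval.head()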
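A sketch of what `cut` returns, on four made-up scores. It shows that the result is categorical, that its categories form an interval index, and that a value on a bin edge falls into the bin on its left.

```python
import pandas as pd

# Hypothetical scores, including two that sit exactly on bin edges
scores = pd.Series([34.0, 60.0, 87.2, 100.0])
binned = pd.cut(scores, bins=[0, 40, 60, 80, 100])

dtype_name = binned.dtype.name           # categorical, not interval
categories = binned.cat.categories       # the bins themselves
codes = binned.cat.codes.tolist()        # position of each score's bin

# Count per bin in bin order; bins with no scores are kept
counts = binned.value_counts(sort=False)
```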

Selecting from an interval index

df_i = df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]\
            .reset_index().set_index('Math_interval')
df_i.head()

df_i.loc[65].head()
# every row whose interval contains the value is selected

To select by an interval rather than a single value, first convert the categorical index to an interval index, then use the overlaps method.

# df_i.loc[pd.Interval(70,75)].head()  raises an error
df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head()
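The selection logic can be tried on a four-row stand-in. Note one swap relative to the code above: this sketch builds the `IntervalIndex` directly from the list of intervals rather than calling `astype('interval')` on a categorical index. The data and names such as `df_i_demo` are assumptions.

```python
import pandas as pd

# Hypothetical scores and their bins
scores = [34.0, 65.0, 72.0, 90.0]
bins = pd.cut(pd.Series(scores), bins=[0, 40, 60, 80, 100])

# Build a true interval index from the Interval objects
interval_index = pd.IntervalIndex(bins.tolist(), name='Math_interval')
df_i_demo = pd.DataFrame({'Math': scores}, index=interval_index)

# A scalar key selects every row whose interval contains it
contains_65 = df_i_demo.loc[65]

# overlaps gives a boolean mask of intervals sharing any point with the query interval
overlapping = df_i_demo[df_i_demo.index.overlaps(pd.Interval(70, 85))]
```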

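2.2 Multi-level indexes

2.2.1 Creating a multi-level index

2.2.2 Slicing a multi-level index

2.2.3 slice objects in a multi-level index

2.2.4 Swapping index levels

2.3 Setting the index

2.4 Common index-related functions

2.4.1 The where function

2.4.2 The mask function

2.4.3 The query function

2.5 Handling duplicates

2.5.1 The duplicated method

2.5.2 The drop_duplicates method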

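2.6 Sampling

The sampling function here is sample.
df.sample(n=5)  # n is the sample size
df.sample(frac=0.05)  # frac is the fraction of rows to sample
df.sample(n=df.shape[0],replace=True).head()  # replace controls whether sampling is done with replacement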
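A sketch of the three calls on a made-up 100-row frame. The `random_state` argument is an addition here, used only so that the draws are repeatable.

```python
import pandas as pd

# Hypothetical frame with 100 rows
df_demo = pd.DataFrame({'x': range(100)})

fixed_n = df_demo.sample(n=5, random_state=0)        # exactly 5 rows, no repeats
by_frac = df_demo.sample(frac=0.05, random_state=0)  # 5% of 100 rows

# Sampling with replacement: same size as the original, rows may repeat
bootstrap = df_demo.sample(n=len(df_demo), replace=True, random_state=0)
```

Sampling as many rows as the frame has, with replacement, is the resampling step used in bootstrap estimates.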

To be continued...
