pandas Basics

Author: 来自内蒙古的田园蒙牛

Last modified: 2022-06-21 15:36:30

First published: 2022-06-21 15:22:46

Article link: https://blog.csdn.net/zhan9le/article/details/125391316

I. Reading Files

Import the library:

import pandas as pd

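1. Text files

CSV is a delimited text format. The sep parameter sets the delimiter: read_csv defaults to ',' and read_table defaults to a tab ('\t'). If sep does not match the file, each row is read as a single merged column.

order = pd.read_table(r'filename.csv', encoding='gbk', sep=',')

read_table: reads any delimited text file; encoding: the character encoding of the file

order1 = pd.read_csv(r'filename.csv', encoding='gbk', sep=',')

read_csv: intended for CSV files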
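
To see the effect of sep, here is a minimal sketch that parses a small in-memory sample instead of a real file. The sample text and its ';' delimiter are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data; the delimiter is ';', not ','.
raw = "order_id;dishes_name;counts\n1;noodles;2\n2;rice;1\n"

# Default delimiter (','): each line is read as one field, so there is one column.
wrong = pd.read_csv(io.StringIO(raw))

# Matching delimiter: three separate columns.
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

Checking shape right after reading a file is a quick way to notice a wrong delimiter.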

2. Excel files (.xls: Excel 2003 and earlier; .xlsx: Excel 2007 and later)

Read an Excel file:

detai = pd.read_excel(r'filename.xlsx')

Save to an Excel file:

detai.to_excel(r'filename.xlsx')

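II. Common DataFrame Attributes

1. Header (column index)

print(order.columns)

2. Data

print(order.values)

3. Row index

print(order.index)

4. Element types (one dtype per column)

print(order.dtypes)

5. Number of elements

print(order.size)

6. Number of dimensions

print(order.ndim)

7. Shape (rows, columns)

print(order.shape)

8. Transpose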

print(order.T)
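
As a quick illustration of how these attributes relate to each other, the sketch below builds a tiny DataFrame by hand. The column names and values are made up and are not the article's data.

```python
import pandas as pd

# Hypothetical 3-row, 2-column table used only for illustration.
df = pd.DataFrame({'order_id': [1, 2, 3], 'amounts': [9.5, 12.0, 7.25]})

print(list(df.columns))  # column labels
print(df.shape)          # (rows, columns)
print(df.size)           # rows * columns
print(df.ndim)           # a DataFrame has 2 dimensions
print(df.T.shape)        # transposing swaps rows and columns
```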

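III. Selecting, Modifying, Adding, and Deleting DataFrame Data

1. Select

print(order['column_label']['row_label'])  # one value: pick the column, then the row
print(order.loc[['row_label_1', 'row_label_2'], ['column_label_1', 'column_label_2']])  # several rows and columns at once

(1) Multiple rows

print(order[:][:5])  # the first five rows
print(order.head())  # from the top; defaults to the first five rows
print(order.tail(10))  # from the bottom; the last ten rows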

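(2) Slicing with loc and iloc

loc slices by label and includes both endpoints; iloc slices by integer position and excludes the end position.

order.loc[row_labels, column_labels]
order.iloc[row_positions, column_positions]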
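
The difference between the two slicing rules is easiest to see on a small example. This is a minimal sketch on a made-up DataFrame with string row labels.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame({'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4]},
                  index=['w', 'x', 'y', 'z'])

by_label = df.loc['w':'y', 'a']   # rows 'w', 'x', 'y': the end label is included
by_position = df.iloc[0:2, 0]     # positions 0 and 1: the end position is excluded

print(by_label.tolist())
print(by_position.tolist())
```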

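(3) Advanced usage: boolean masks

mask1 = detai['column_label_1'] == value_1
mask2 = detai['column_label_2'] == value_2
mask = np.any((mask1, mask2), axis=0)  # same as mask = mask1 | mask2 (needs import numpy as np)
print(detai.loc[mask, ['column_label_1', 'column_label_2']])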
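
Here is a runnable sketch of the mask pattern above. The dishes_name and counts columns echo names used later in the article, but the rows themselves are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({'dishes_name': ['noodles', 'rice', 'soup', 'rice'],
                   'counts': [2, 1, 3, 5]})

mask1 = df['dishes_name'] == 'rice'
mask2 = df['counts'] == 3

mask_np = np.any((mask1, mask2), axis=0)  # element-wise OR via NumPy
mask_or = mask1 | mask2                   # the same OR with the | operator

selected = df.loc[mask_or, ['dishes_name', 'counts']]
print(selected)
```

For an AND condition, the analogous forms would be np.all((mask1, mask2), axis=0) or mask1 & mask2.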

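2. Modify

detai.loc[detai['column_to_test'] == value, 'column_to_change'] = new_value

The assignment changes the DataFrame in memory directly and cannot be undone; copy it first (detai.copy()) if the original values are still needed.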
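
A small sketch of a conditional update, working on a copy so the starting values stay available. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows.
original = pd.DataFrame({'dishes_name': ['noodles', 'rice', 'rice'],
                         'amounts': [10, 5, 5]})

working = original.copy()  # changes to working do not touch original

# Set amounts to 6 on every row where dishes_name is 'rice'.
working.loc[working['dishes_name'] == 'rice', 'amounts'] = 6

print(working['amounts'].tolist())
print(original['amounts'].tolist())
```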

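3. Add

detai['new_column_label'] = data  # an array or list with one value per row

4. Delete

detai.drop(labels=range(10), axis=0, inplace=True)  # delete the rows labeled 0 to 9; axis=0 targets rows; inplace=True changes detai itself instead of returning a new DataFrame
detai.drop(labels=['column_to_delete'], axis=1, inplace=True)  # delete the listed columns; axis=1 targets columns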
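
The sketch below contrasts dropping rows, dropping columns, and the inplace flag on a made-up DataFrame.

```python
import pandas as pd

# Hypothetical data with the default integer row labels 0..4.
df = pd.DataFrame({'a': range(5), 'b': range(5, 10), 'c': range(10, 15)})

without_rows = df.drop(labels=[0, 1], axis=0)  # returns a new DataFrame without rows 0 and 1
without_col = df.drop(labels=['c'], axis=1)    # returns a new DataFrame without column 'c'

df.drop(labels=['b'], axis=1, inplace=True)    # modifies df itself and returns None

print(without_rows.index.tolist())
print(without_col.columns.tolist())
print(df.columns.tolist())
```

Without inplace=True the original DataFrame is left untouched, which is often the safer default while exploring data.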

IV. Descriptive Statistics for a DataFrame

1. Numeric data:

import numpy as np
'''
np.min  np.mean   np.var   np.argmin
np.max  np.cumsum  np.std  np.argmax
np.sum  np.cumprod  np.argsort
np.ptp # range (max - min)  np.cov  # covariance
np.median  # median
'''
print(np.mean(detai['column_label']))
print(detai['column_label'].mean())

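2. The describe method

(1) Numeric data

print(detai[['column_label']].describe())  # the column must be numeric
# Returns: count (number of non-null values), mean, std, min, 25%, 50%, 75%, max

(2) Categorical data

print(detai['column_label'].describe())  # the column must be categorical (or text)
# Returns: count, unique (number of distinct categories), top (the mode), freq (how many times the mode occurs)
print(detai['dishes_name'].count())  # number of non-null values
print('Frequency counts:', detai['dishes_name'].value_counts())
print('Mode:', detai['dishes_name'].mode())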
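
To make the two kinds of describe output concrete, here is a minimal sketch on invented data with one numeric column and one text column.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({'dishes_name': ['rice', 'noodles', 'rice', 'soup'],
                   'counts': [1, 2, 1, 3]})

num = df['counts'].describe()       # count, mean, std, min, quartiles, max
cat = df['dishes_name'].describe()  # count, unique, top, freq

print(num['mean'])
print(cat['unique'], cat['top'], cat['freq'])
print(df['dishes_name'].value_counts()['rice'])
print(df['dishes_name'].mode()[0])
```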

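(3) Converting a column to the category dtype

detai['column_to_convert'] = detai['column_to_convert'].astype('category')  # astype returns a new Series, so assign it back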
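
A short sketch of the conversion on a made-up Series, showing that astype leaves the original untouched and returns a converted copy.

```python
import pandas as pd

# Hypothetical text values.
s = pd.Series(['rice', 'noodles', 'rice'])

c = s.astype('category')  # new Series; s keeps its original dtype

print(c.dtype)
print(list(c.cat.categories))  # the distinct categories, sorted
```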

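V. Working with Dates and Times

1. Convert a column to datetime (dates written with / are then displayed with -)

order['time_column'] = pd.to_datetime(order['time_column'])

2. Extract date parts with list comprehensions

year = [i.year for i in order['time_column']]  # year
month = [i.month for i in order['time_column']]  # month
day = [i.day for i in order['time_column']]  # day
# Other parts: hour, minute, second, date() (the calendar date, which can be compared), week (week of the year)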
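
The following sketch runs the conversion and the extraction on two invented timestamps. It also shows the .dt accessor, which is an alternative to the list comprehensions and is not used in the original text.

```python
import pandas as pd

# Hypothetical timestamps written with '/'.
order = pd.DataFrame({'lock_time': ['2016/08/01 11:05:36', '2016/08/02 12:30:00']})
order['lock_time'] = pd.to_datetime(order['lock_time'])

first_as_text = str(order['lock_time'][0])  # displayed with '-'

years_list = [t.year for t in order['lock_time']]  # list comprehension
years_dt = order['lock_time'].dt.year              # vectorized .dt accessor
days_dt = order['lock_time'].dt.day

print(first_as_text)
print(years_list, years_dt.tolist(), days_dt.tolist())
```

The .dt form returns a Series aligned with the original index, which makes it convenient to assign back as a new column.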

3. Date arithmetic (pd.Timedelta accepts days=, hours=, and so on)

time1 = order['lock_time'] + pd.Timedelta(days=1)  # add one day
# print(order['lock_time'] - pd.to_datetime('2016-1-1'))  # time elapsed since 2016-01-01
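
Here is a minimal sketch of both operations, adding an offset and subtracting a fixed date, on two made-up timestamps.

```python
import pandas as pd

# Hypothetical timestamps.
lock_time = pd.Series(pd.to_datetime(['2016-08-01 11:05:36', '2016-08-31 23:30:00']))

next_day = lock_time + pd.Timedelta(days=1)          # shift by one day
two_hours_later = lock_time + pd.Timedelta(hours=2)  # shift by two hours
elapsed = lock_time - pd.to_datetime('2016-01-01')   # a Series of Timedelta values

print(next_day[1])
print(two_hours_later[1])
print(elapsed[0].days)
```

Adding a Timedelta rolls over month boundaries correctly, and subtracting two datetimes yields a Timedelta whose days attribute holds the whole-day part.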

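VI. Grouping and Aggregation

import pandas as pd
import numpy as np
detail = pd.read_excel(r'filename.xlsx')

1. Grouping with groupby, then applying simple statistical functions

detailgroup = detail[['column_1', 'column_2', 'column_3']].groupby(by='column_1')  # by: the column to group on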
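
The groupby call by itself only builds the groups; a statistic has to be applied afterwards. A minimal sketch on invented order data:

```python
import pandas as pd

# Hypothetical order lines: order 1 has two rows, order 2 has three.
detail = pd.DataFrame({'order_id': [1, 1, 2, 2, 2],
                       'counts': [1, 2, 1, 1, 3],
                       'amounts': [10.0, 20.0, 5.0, 5.0, 15.0]})

grouped = detail[['order_id', 'counts', 'amounts']].groupby(by='order_id')

totals = grouped.sum()   # one row per order_id with column sums
means = grouped.mean()   # one row per order_id with column means
sizes = grouped.size()   # number of rows in each group

print(totals)
print(sizes.tolist())
```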

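2. Aggregating with agg and transform

agg: apply the same statistical functions to every column

print(detail[['column_1', 'column_2']].agg([np.mean, np.sum]))

agg: apply different statistical functions to different columns

print(detail.agg({'column_1': np.sum, 'column_2': np.mean}))
print(detai.agg({'counts': np.sum, 'amounts': [np.mean, np.sum, np.median]}))

transform: apply a function (func) to the data and return a result of the same shape

print(detai[['column_1', 'column_2']].transform(lambda x: x * 2).head())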
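
The sketch below runs the three patterns on made-up data. It passes the function names as strings ('sum', 'mean'), a form pandas also accepts, instead of the NumPy functions used above.

```python
import pandas as pd

# Hypothetical sample rows.
detail = pd.DataFrame({'counts': [1, 2, 1, 1, 3],
                       'amounts': [10.0, 20.0, 5.0, 5.0, 15.0]})

same_funcs = detail[['counts', 'amounts']].agg(['sum', 'mean'])   # rows: sum, mean
per_column = detail.agg({'counts': 'sum', 'amounts': 'mean'})     # one value per column
doubled = detail[['counts', 'amounts']].transform(lambda x: x * 2)  # same shape as the input

print(same_funcs)
print(per_column)
print(doubled.head())
```

agg reduces each column to one value per function, while transform keeps one output row for every input row.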

VII. Pivot Tables and Cross-Tabulations

1. Pivot tables

index: the keys that group rows

detailpivot =pd.pivot_table(detai[['counts','order_id','amounts','dishes_name']],index=['order_id','dishes_name'],aggfunc=np.sum)

columns: the keys that group columns

detailpivot = pd.pivot_table(detai[['counts', 'order_id', 'amounts', 'dishes_name']], values='counts',  columns='dishes_name',aggfunc=np.sum)

Using both index and columns:

detailpivot = pd.pivot_table(detai[['counts', 'order_id', 'amounts', 'dishes_name']],values='counts', index='order_id', columns='dishes_name',aggfunc=np.sum, fill_value=0, margins=True)  # margins: whether to add totals (an All row and an All column)
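
To show what fill_value and margins do, here is a minimal sketch with the same column names but invented rows. The aggregation is given as the string 'sum'.

```python
import pandas as pd

# Hypothetical order lines; order 2 has no 'soup' row.
detail = pd.DataFrame({'order_id': [1, 1, 2, 2],
                       'dishes_name': ['rice', 'soup', 'rice', 'rice'],
                       'counts': [1, 2, 1, 3]})

pivot = pd.pivot_table(detail, values='counts', index='order_id',
                       columns='dishes_name', aggfunc='sum',
                       fill_value=0,   # empty cells become 0 instead of NaN
                       margins=True)   # adds an 'All' row and an 'All' column

print(pivot)
```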

2. Cross-tabulation

print(pd.crosstab(index=detai['column_1'], columns=detai['column_2'], values=detai['column_3'], aggfunc=np.sum))
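
A small sketch of crosstab on invented rows, first as a plain frequency table and then with values and aggfunc supplied.

```python
import pandas as pd

# Hypothetical order lines; order 2 has no 'soup' row.
detail = pd.DataFrame({'order_id': [1, 1, 2, 2],
                       'dishes_name': ['rice', 'soup', 'rice', 'rice'],
                       'counts': [1, 2, 1, 3]})

# Without values/aggfunc, crosstab counts how often each pair occurs.
freq = pd.crosstab(index=detail['order_id'], columns=detail['dishes_name'])

# With values and aggfunc, it aggregates the values for each pair.
summed = pd.crosstab(index=detail['order_id'], columns=detail['dishes_name'],
                     values=detail['counts'], aggfunc='sum')

print(freq)
print(summed)  # pairs that never occur show up as NaN
```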
