Pandas系列学习教程(七)

Pandas系列教程(七)


  对于想要入门数据科学的朋友们来说,Python是一个很好的选择,除了因为简单的语法外,Python 生态中提供了很多在数值计算方面非常优秀的库,其中Pandas不可不提,Pandas是很强大是数据集处理工具,往往和numpy, matplotlib 等库搭配使用,我也是刚刚开始学习Pandas, 顺便翻译了一下官方的Pandas教程, 这里使用的是jupyter notebook, 因为博客不支持html直接编辑,所以只能转化为markdown 格式,如果想直接查看html版本可点击每一节下的链接。本文仅供学习和交流使用,欢迎大家交流和指正!


摘要

  • groupby()函数的使用
  • percentile 分位点的使用
  • mean(),std()等统计函数

HTML版本点击此处

import pandas as pd
import sys

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
Python version 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
Pandas version 0.23.0
# 以日期为index创建
States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL'] 
data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]
idx = pd.date_range('1/1/2012', periods=10, freq='MS')
df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
df1['State'] = States

# 创建第二个数据帧
data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
idx2 = pd.date_range('1/1/2013', periods=10, freq='MS')
df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
df2['State'] = States
# 用pd.concat()函数拼接数据帧
df = pd.concat([df1,df2])

df
RevenueState
2012-01-011.0NY
2012-02-012.0NY
2012-03-013.0NY
2012-04-014.0NY
2012-05-015.0FL
2012-06-016.0FL
2012-07-017.0GA
2012-08-018.0GA
2012-09-019.0FL
2012-10-0110.0FL
2013-01-0110.0NY
2013-02-0110.0NY
2013-03-019.0NY
2013-04-019.0NY
2013-05-018.0FL
2013-06-018.0FL
2013-07-017.0GA
2013-08-017.0GA
2013-09-016.0FL
2013-10-016.0FL
# 方法一

# 获得一份原数据帧的拷贝,避免丢失数据
newdf = df.copy()

newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > newdf['1.96*std']

newdf
RevenueStatex-Mean1.96*stdOutlier
2012-01-011.0NY5.755.200273True
2012-02-012.0NY4.755.200273False
2012-03-013.0NY3.755.200273False
2012-04-014.0NY2.755.200273False
2012-05-015.0FL1.755.200273False
2012-06-016.0FL0.755.200273False
2012-07-017.0GA0.255.200273False
2012-08-018.0GA1.255.200273False
2012-09-019.0FL2.255.200273False
2012-10-0110.0FL3.255.200273False
2013-01-0110.0NY3.255.200273False
2013-02-0110.0NY3.255.200273False
2013-03-019.0NY2.255.200273False
2013-04-019.0NY2.255.200273False
2013-05-018.0FL1.255.200273False
2013-06-018.0FL1.255.200273False
2013-07-017.0GA0.255.200273False
2013-08-017.0GA0.255.200273False
2013-09-016.0FL0.755.200273False
2013-10-016.0FL0.755.200273False
# 方法二

newdf1 = df.copy()

State = newdf1.groupby('State')

# 对State进行transform 则得到的结果中state已经排好序了,lambda表达式得到的x
# 是Revenue的值
newdf1['x-Mean'] = State.transform(lambda x: abs(x - x.mean()))
newdf1['1.96*std'] = State.transform(lambda x:1.96*x.std())
newdf1['Outlier'] = State.transform(lambda x: abs(x - x.mean()) > 1.96*x.std())

newdf1
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f99067b2e80>
RevenueStatex-Mean1.96*stdOutlier
2012-01-011.0NY5.007.554813False
2012-02-012.0NY4.007.554813False
2012-03-013.0NY3.007.554813False
2012-04-014.0NY2.007.554813False
2012-05-015.0FL2.253.434996False
2012-06-016.0FL1.253.434996False
2012-07-017.0GA0.250.980000False
2012-08-018.0GA0.750.980000False
2012-09-019.0FL1.753.434996False
2012-10-0110.0FL2.753.434996False
2013-01-0110.0NY4.007.554813False
2013-02-0110.0NY4.007.554813False
2013-03-019.0NY3.007.554813False
2013-04-019.0NY3.007.554813False
2013-05-018.0FL0.753.434996False
2013-06-018.0FL0.753.434996False
2013-07-017.0GA0.250.980000False
2013-08-017.0GA0.250.980000False
2013-09-016.0FL1.253.434996False
2013-10-016.0FL1.253.434996False
"""
方法二对多个关键字分组
这种分组方式统计的revenue就是在计算均值的时候计入每年状态相同的月份
"""


# 拷贝数据帧
newdf = df.copy()

StateMonth = newdf.groupby(['State', lambda x: x.month])

newdf['Outlier'] = StateMonth.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
newdf['x-Mean'] = StateMonth.transform( lambda x: abs(x-x.mean()) )
newdf['1.96*std'] = StateMonth.transform( lambda x: 1.96*x.std() )
newdf
RevenueStateOutlierx-Mean1.96*std
2012-01-011.0NYFalse4.512.473364
2012-02-012.0NYFalse4.011.087434
2012-03-013.0NYFalse3.08.315576
2012-04-014.0NYFalse2.56.929646
2012-05-015.0FLFalse1.54.157788
2012-06-016.0FLFalse1.02.771859
2012-07-017.0GAFalse0.00.000000
2012-08-018.0GAFalse0.51.385929
2012-09-019.0FLFalse1.54.157788
2012-10-0110.0FLFalse2.05.543717
2013-01-0110.0NYFalse4.512.473364
2013-02-0110.0NYFalse4.011.087434
2013-03-019.0NYFalse3.08.315576
2013-04-019.0NYFalse2.56.929646
2013-05-018.0FLFalse1.54.157788
2013-06-018.0FLFalse1.02.771859
2013-07-017.0GAFalse0.00.000000
2013-08-017.0GAFalse0.51.385929
2013-09-016.0FLFalse1.54.157788
2013-10-016.0FLFalse2.05.543717
# 方法三

# 拷贝原始数据帧
newdf = df.copy()

State = newdf.groupby('State')

def s(group):
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()  
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = State.apply(s)
Newdf2
RevenueStatex-Mean1.96*stdOutlier
2012-01-011.0NY5.007.554813False
2012-02-012.0NY4.007.554813False
2012-03-013.0NY3.007.554813False
2012-04-014.0NY2.007.554813False
2012-05-015.0FL2.253.434996False
2012-06-016.0FL1.253.434996False
2012-07-017.0GA0.250.980000False
2012-08-018.0GA0.750.980000False
2012-09-019.0FL1.753.434996False
2012-10-0110.0FL2.753.434996False
2013-01-0110.0NY4.007.554813False
2013-02-0110.0NY4.007.554813False
2013-03-019.0NY3.007.554813False
2013-04-019.0NY3.007.554813False
2013-05-018.0FL0.753.434996False
2013-06-018.0FL0.753.434996False
2013-07-017.0GA0.250.980000False
2013-08-017.0GA0.250.980000False
2013-09-016.0FL1.253.434996False
2013-10-016.0FL1.253.434996False
# 方法三

# 拷贝原始数据帧
newdf = df.copy()

StateMonth = newdf.groupby(['State', lambda x: x.month])

def s(group):
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()  
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = StateMonth.apply(s)
Newdf2
RevenueStatex-Mean1.96*stdOutlier
2012-01-011.0NY4.512.473364False
2012-02-012.0NY4.011.087434False
2012-03-013.0NY3.08.315576False
2012-04-014.0NY2.56.929646False
2012-05-015.0FL1.54.157788False
2012-06-016.0FL1.02.771859False
2012-07-017.0GA0.00.000000False
2012-08-018.0GA0.51.385929False
2012-09-019.0FL1.54.157788False
2012-10-0110.0FL2.05.543717False
2013-01-0110.0NY4.512.473364False
2013-02-0110.0NY4.011.087434False
2013-03-019.0NY3.08.315576False
2013-04-019.0NY2.56.929646False
2013-05-018.0FL1.54.157788False
2013-06-018.0FL1.02.771859False
2013-07-017.0GA0.00.000000False
2013-08-017.0GA0.51.385929False
2013-09-016.0FL1.54.157788False
2013-10-016.0FL2.05.543717False
newdf = df.copy()

State = newdf.groupby('State')

newdf['Lower'] = State['Revenue'].transform( lambda x: x.quantile(q=.25) - (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
newdf['Upper'] = State['Revenue'].transform( lambda x: x.quantile(q=.75) + (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
newdf['Outlier'] = (newdf['Revenue'] < newdf['Lower']) | (newdf['Revenue'] > newdf['Upper']) 
newdf
RevenueStateLowerUpperOutlier
2012-01-011.0NY-7.00019.000False
2012-02-012.0NY-7.00019.000False
2012-03-013.0NY-7.00019.000False
2012-04-014.0NY-7.00019.000False
2012-05-015.0FL2.62511.625False
2012-06-016.0FL2.62511.625False
2012-07-017.0GA6.6257.625False
2012-08-018.0GA6.6257.625True
2012-09-019.0FL2.62511.625False
2012-10-0110.0FL2.62511.625False
2013-01-0110.0NY-7.00019.000False
2013-02-0110.0NY-7.00019.000False
2013-03-019.0NY-7.00019.000False
2013-04-019.0NY-7.00019.000False
2013-05-018.0FL2.62511.625False
2013-06-018.0FL2.62511.625False
2013-07-017.0GA6.6257.625False
2013-08-017.0GA6.6257.625False
2013-09-016.0FL2.62511.625False
2013-10-016.0FL2.62511.625False
x = [1,2,3,4]
y = [2,3,4,5]
dataset = list(zip(x,y))
du = pd.DataFrame(data=dataset,columns=['a','b'])
du
ab
012
123
234
345
du['a']
0    1
1    2
2    3
3    4
Name: a, dtype: int64
du['a'].quantile()
2.5
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值