[pandas study notes] - Performance differences between ways of processing column data


This post builds on the test case from the article "Still complaining that pandas is slow? These methods will change your mind":

https://www.jianshu.com/p/ef690275390c

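Test case:
Split ten years of data into hourly timestamps and build a DataFrame from them.
Divide the 24 hours of a day into three equal parts (0-7, 8-15, 16-23) and assign each row the corresponding tag (1, 2, 3).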
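
To make the expected result concrete before the benchmark, here is a minimal sketch on a single day of data. The names `times` and `sample` are hypothetical, and the integer-division shortcut is used only to illustrate the target tags; it is not one of the methods timed in this post.

```python
import pandas as pd

# One day of hourly timestamps; a Timedelta frequency avoids
# version-specific alias strings such as 'H' or 'h'.
times = pd.date_range("2010-01-01", periods=24, freq=pd.Timedelta(hours=1))
sample = pd.DataFrame({"Time": times, "Hour": times.hour})

# Assumption: hour // 8 + 1 is an illustrative shortcut for the rule
# 0-7 -> 1, 8-15 -> 2, 16-23 -> 3.
sample["Tag"] = sample["Hour"] // 8 + 1

# Show the first and last hour that falls under each tag.
print(sample.groupby("Tag")["Hour"].agg(["min", "max"]))
```

Each tag should cover exactly eight hours, with hours 8 and 16 opening the second and third groups.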

# -*- coding: utf-8 -*-
"""
Created on Tue Feb  4 14:19:45 2020

@author: Administrator
"""

import pandas as pd
import numpy as np
import time

def time_elapse(fn):
    def _wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        fn(*args, **kwargs)
        print(f"{fn.__name__} cost {(time.perf_counter_ns() - start)/1_000_000_000} s")
    return _wrapper


# Build the hourly range once and reuse it for both columns.
# Note: recent pandas versions prefer the lowercase alias 'h' over 'H'.
time_index = pd.date_range('20100101', '20200101', freq='1H')
df = pd.DataFrame({"Time": time_index, "Hour": time_index.hour})

def f(hour):
    c = 0
    if 0 <= hour < 8:
        c = 1
    elif 8 <= hour < 16:
        c = 2
    elif 16 <= hour < 24:
        c = 3
    return (c)

# 266 s
# Explicit loop, reading with iloc and writing with loc
@time_elapse
def f1():
    df["Tag"] = 0
    for i in range(len(df)):
        h = df.iloc[i]["Hour"]
        df.loc[i, "Tag"] = f(h)

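# 35 s (measured with the chained form df.iloc[i]["Tag"] = ..., which
# writes to a temporary row copy rather than to df)
# Explicit loop, reading and writing with iloc
@time_elapse
def f1_1():
    df["Tag"] = 0
    tag_col = df.columns.get_loc("Tag")
    for i in range(len(df)):
        h = df.iloc[i]["Hour"]
        # Single positional assignment so the value is actually stored in df
        df.iloc[i, tag_col] = f(h)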

# 8.5 s
# iterrows, collecting the results in a list
@time_elapse
def f2():
    df["Tag"] = 0
    c = []
    for index, row in df.iterrows():
        h = row["Hour"]
        c.append(f(h))
    df["Tag"] = c
    
# 0.35 s
# itertuples, collecting the results in a list
@time_elapse
def f2_1():
    df["Tag"] = 0
    c = []
    for row in df.itertuples():
        h = row.Hour
        c.append(f(h))
    df["Tag"] = c
    
# 0.035 s
# Series.apply
@time_elapse
def f3():
    df["Tag"] = df.Hour.apply(f)
    
# 0.0129084 s
# Boolean masks with column-wise assignment
@time_elapse
def f4():
    index_1 = df.Hour.isin(range(0, 8))
    index_2 = df.Hour.isin(range(8, 16))
    index_3 = df.Hour.isin(range(16, 24))
    
    df.loc[index_1, "Tag"] = 1
    df.loc[index_2, "Tag"] = 2
    df.loc[index_3, "Tag"] = 3
    
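# 0.0051495 s
# pd.cut
@time_elapse
def f5():
    # right=False makes the bins left-closed: [0, 8), [8, 16), [16, 24),
    # so hours 8 and 16 receive tags 2 and 3 as intended.
    df["Tag"] = pd.cut(x=df.Hour, bins=[0, 8, 16, 24],
                        right=False, labels=[1, 2, 3]).astype(int)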

# 0.001368 s
# NumPy (np.digitize)
@time_elapse
def f6():
    c = np.array([1, 2, 3])
    df["Tag"] = c[np.digitize(df.Hour, bins=[8, 16, 24])]
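
Speed only matters if the faster methods produce the same tags as the plain if/elif rule, and the bin edges are the easy place to go wrong. The following is a small sanity-check sketch, not part of the original benchmark; the variable names (`hours`, `expected`, `cut_right`, `cut_left`, `digitized`) are hypothetical. It compares `pd.cut` with right-closed and left-closed bins against `np.digitize` on the 24 possible hour values.

```python
import numpy as np
import pandas as pd

hours = pd.Series(range(24), name="Hour")

# Reference tags from the plain rule: 0-7 -> 1, 8-15 -> 2, 16-23 -> 3.
expected = hours.map(lambda h: 1 if h < 8 else (2 if h < 16 else 3))

# pd.cut with its default right-closed bins: (.., 8], (8, 16], (16, 24].
cut_right = pd.cut(hours, bins=[0, 8, 16, 24], include_lowest=True,
                   labels=[1, 2, 3]).astype(int)

# pd.cut with left-closed bins: [0, 8), [8, 16), [16, 24).
cut_left = pd.cut(hours, bins=[0, 8, 16, 24], right=False,
                  labels=[1, 2, 3]).astype(int)

# np.digitize is left-closed by default (right=False).
digitized = pd.Series(np.array([1, 2, 3])[np.digitize(hours, bins=[8, 16, 24])])

print("right-closed cut differs at hours:", hours[cut_right != expected].tolist())
print("left-closed cut matches:", bool((cut_left == expected).all()))
print("digitize matches:", bool((digitized == expected).all()))
```

With right-closed bins, the boundary hours 8 and 16 land in the lower group, so `right=False` is the setting that matches the 0-7, 8-15, 16-23 split.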