We have the following three files:
wc -l breakfast_all cheap_all receptions_all
3345271 breakfast_all
955890 cheap_all
505504 receptions_all
4806665 total
head -3 cheap_all
a true
b true
c true
All three files have a similar structure, with the uid in the first column. The goal is to count the total number of distinct uids across the three files. I wrote both a Python and an awk version specifically to compare their text-processing speed.
Python code:
#!/usr/bin/env python
#coding:utf-8
import time

def t1():
    # Deduplicate uids with a dict: one key per distinct uid.
    dic = {}
    filelist = ["breakfast_all", "receptions_all", "cheap_all"]
    start = time.clock()
    for each in filelist:
        f = open(each, 'r')
        for line in f.readlines():
            key = line.strip().split()[0]
            if key not in dic:
                dic[key] = 1
        f.close()
    end = time.clock()
    print len(dic)
    print 'cost time is: %f' % (end - start)

def t2():
    # Deduplicate uids with a set.
    uid_set = set()
    filelist = ["breakfast_all", "receptions_all", "cheap_all"]
    start = time.clock()
    for each in filelist:
        f = open(each, 'r')
        for line in f.readlines():
            key = line.strip().split()[0]
            uid_set.add(key)
        f.close()
    end = time.clock()
    print len(uid_set)
    print 'cost time is: %f' % (end - start)

t1()
t2()
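As an aside, a minimal Python 3 sketch of the same set-based count (assuming the same three whitespace-separated files; time.perf_counter replaces time.clock, which was removed in Python 3.8):

#!/usr/bin/env python3
import time

def count_uids(filelist):
    # The set ends up holding one entry per distinct uid.
    uids = set()
    for each in filelist:
        with open(each) as f:
            for line in f:
                # First whitespace-separated column
                # (assumes no blank lines, as in the original).
                uids.add(line.split(None, 1)[0])
    return len(uids)

start = time.perf_counter()
print(count_uids(["breakfast_all", "receptions_all", "cheap_all"]))
print('cost time is: %f' % (time.perf_counter() - start))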
Processing with awk:
#!/bin/bash
function handle()
{
    # date +%s%N prints 19 digits (10 for seconds, 9 for nanoseconds);
    # keeping the first 16 digits gives microseconds since the epoch.
    start=$(date +%s%N)
    start_ms=${start:0:16}
    # Each distinct uid becomes one array key; length(a) is the key count.
    awk '{a[$1]++} END{print length(a)}' breakfast_all receptions_all cheap_all
    end=$(date +%s%N)
    end_ms=${end:0:16}
    echo "cost time is:"
    echo "scale=6;($end_ms - $start_ms)/1000000" | bc
}
handle
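Note that length() on an array is a gawk extension; a portable sketch that works on any POSIX awk counts first occurrences instead:

# n is incremented only the first time a uid is seen.
awk '!seen[$1]++ { n++ } END { print n }' breakfast_all receptions_all cheap_all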
Run the Python script:
./test.py
3685715
cost time is: 4.890000
3685715
cost time is: 4.480000
Run the shell script:
./zzz.sh
3685715
cost time is:
4.865822
As the results show, Python's set is only marginally faster than the dict. Overall, awk's processing speed is roughly on par with Python's.
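For completeness, the same count can also be had from coreutils alone; a sketch (sort is O(n log n) and may spill to disk on inputs this size, so expect it to be slower than the hash-based versions above):

# Extract the first column, deduplicate, and count the survivors.
awk '{print $1}' breakfast_all receptions_all cheap_all | sort -u | wc -l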