Python: building data-processing pipelines with iteration

Author: 诸葛老刘

Tags: data-processing pipeline, yield from

Article link: https://blog.csdn.net/weixin_39791387/article/details/100034221

Contents

Use case
Solution
Extensions

Use case

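Process data iteratively in a pipeline fashion (similar to pipes on Unix), for example when handling a massive dataset that cannot be loaded into memory in full.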

Solution

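Generator functions are a good way to implement a pipeline mechanism.
Advantages:
- Low memory usage.
- Each generator function is short and self-contained, so it is easy to write and maintain.
- Good generality: the same stages can be reused in other pipelines.
Example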

# -*- coding: utf-8 -*-
'''
# Created on Aug-23-19 11:21
# test2.py
# @author: zhugelaoliu
# @DESC: zhugelaoliu
'''
"""
Scenario: a very large directory tree full of log files that we want to process.
"""
import os
import fnmatch
import gzip
import bz2
import re

def gen_find(filepat, top):
    """
    Find all filenames in a directory tree that match a shell wildcard pattern.
    filepat is the wildcard pattern (e.g. '*.log'); top is the root directory.
    """
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)


def gen_opener(filenames):
    """
    Open a sequence of filenames one at a time, producing a file object.
    The file is closed immediately when proceeding to the next iteration.
    """
    for filename in filenames:
        # Pick the opener from the extension; all files are read in text mode
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

def gen_concatenate(iterators):
    """
    Chain a sequence of iterators together into a single sequence.
    """
    for it in iterators:
        yield from it
    

def gen_grep(pattern, lines):
    """
    Look for a regex pattern in a sequence of lines.
    """
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line
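
The stages above still have to be wired together before any data flows. The following is a minimal sketch of one way to do that, not part of the original listing: the four functions are repeated in compact form so the block runs on its own, and the file names, log contents, the pattern '(?i)python' and the helper demo_pipeline are assumptions made up for illustration.

```python
import bz2
import fnmatch
import gzip
import os
import re
import tempfile


# Compact copies of the four stages, repeated so this block is self-contained.
def gen_find(filepat, top):
    for path, _dirs, files in os.walk(top):
        for name in fnmatch.filter(files, filepat):
            yield os.path.join(path, name)


def gen_opener(filenames):
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()


def gen_concatenate(iterators):
    for it in iterators:
        yield from it


def gen_grep(pattern, lines):
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line


def demo_pipeline():
    """Hypothetical demo: create two small log files and grep them lazily."""
    with tempfile.TemporaryDirectory() as top:
        with open(os.path.join(top, 'access-log-1'), 'w') as f:
            f.write('GET /python/index.html\nGET /ruby/index.html\n')
        with gzip.open(os.path.join(top, 'access-log-2.gz'), 'wt') as f:
            f.write('GET /docs/Python.html\nGET /favicon.ico\n')

        # Each stage consumes the previous one; nothing runs until iteration.
        lognames = gen_find('access-log*', top)
        files = gen_opener(lognames)
        lines = gen_concatenate(files)
        pylines = gen_grep('(?i)python', lines)

        # Consume inside the with block, while the files still exist.
        return sorted(line.strip() for line in pylines)


if __name__ == '__main__':
    for line in demo_pipeline():
        print(line)
```

Only one line is held in memory at a time: the final loop pulls a line through gen_grep, which pulls it from gen_concatenate, which pulls it from whichever file gen_opener currently has open.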

Extensions

Further use cases:
- Parsing, reading real-time data sources, periodic polling, and so on.
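
As an illustration of the real-time and polling use cases, a source stage can poll a growing file, much like tail -f. This is a rough sketch under assumptions: the names follow and demo_follow and the parameters poll_interval and max_polls are hypothetical and do not come from the original article. The max_polls limit exists only so the demo terminates.

```python
import os
import tempfile
import time


def follow(f, poll_interval=0.1, max_polls=None):
    """
    Yield lines from the open text file f as they become available.

    When no new line is available, sleep for poll_interval seconds and retry.
    max_polls (hypothetical) stops the generator after that many consecutive
    empty polls; None means poll forever.
    """
    idle = 0
    while max_polls is None or idle < max_polls:
        line = f.readline()
        if line:
            idle = 0
            yield line
        else:
            idle += 1
            time.sleep(poll_interval)


def demo_follow():
    """Hypothetical demo: read a small file, then give up after two empty polls."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, 'app.log')
        with open(path, 'w') as w:
            w.write('first\nsecond\n')
        with open(path) as r:
            return list(follow(r, poll_interval=0.01, max_polls=2))


if __name__ == '__main__':
    print(demo_follow())
```

Because follow is just another generator of lines, its output can be passed to a filter stage such as gen_grep unchanged. Note that this simple version may yield a partial line if the writer has not finished writing it yet.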
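The key point to understand is the yield from it statement in the gen_concatenate function: it delegates to a sub-generator, so every value produced by it is yielded in turn.
Flattening a nested sequence (using yield from is recommended):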

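from collections.abc import Iterable

def flatten(items, ignore_types=(str, bytes)):
    """
    Recursively flatten a nested sequence. This function is very general-purpose.
    Strings and bytes are treated as single items, not iterated character by character.
    """
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            # Pass ignore_types along so nested levels apply the same rule
            yield from flatten(x, ignore_types)
        else:
            yield x

items = [1, 2, [11, 22, [111, 222, [1111, 2222]]]]

for x in flatten(items):
    print(x)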

yield from plays an even more important role in advanced programs that involve coroutines and generator-based concurrency.
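
To hint at that role: besides re-yielding values, yield from also forwards values sent into the outer generator to the sub-generator and evaluates to the sub-generator's return value. The sketch below is only an illustration; averager, collect and demo_delegation are hypothetical names, not code from the original article.

```python
def averager():
    """Sub-generator: receive numbers via send() until None, then return the mean."""
    total = 0.0
    count = 0
    while True:
        value = yield
        if value is None:
            break
        total += value
        count += 1
    return total / count if count else 0.0


def collect(results):
    """Delegating generator: yield from passes sent values straight through to
    averager() and evaluates to its return value once it finishes."""
    average = yield from averager()
    results.append(average)


def demo_delegation():
    results = []
    gen = collect(results)
    next(gen)            # advance to the first yield inside averager()
    gen.send(10)
    gen.send(20)
    try:
        gen.send(None)   # ends averager(); collect() then finishes as well
    except StopIteration:
        pass
    return results


if __name__ == '__main__':
    print(demo_delegation())
```

A plain for loop over the sub-generator could re-yield its values, but it would neither pass sent values down nor capture the return value, which is why yield from matters for generator-based coroutines.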
