Python text keyword extraction: extracting keywords from multiple .gz files in Python

Question: How to search for keywords across multiple files in Python (including compressed .gz files and uncompressed files)

I have multiple archived logs in a folder. The latest file is "messages", and the older logs are automatically compressed into .gz files:

-rw------- 1 root root 21262610 Nov 4 11:20 messages
-rw------- 1 root root  3047453 Nov 2 15:49 messages-20191102-1572680982.gz
-rw------- 1 root root  3018032 Nov 3 04:43 messages-20191103-1572727394.gz
-rw------- 1 root root  3026617 Nov 3 17:32 messages-20191103-1572773536.gz
-rw------- 1 root root  3044692 Nov 4 06:17 messages-20191104-1572819469.gz

I wrote a function that:

1. stores all the filenames in a list (this part works);
2. opens each file in the list, using gzip.open() for the .gz files;
3. searches for the keywords.

But I don't think this approach is very smart, because the message log is actually very large and split across multiple .gz files, and I have many keywords stored in a keywords file.

So, is there a better solution that concatenates all the files into a single I/O stream and then extracts the keywords from that stream?

import os
import gzip

def open_all_message_files(path):
    files_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.startswith("messages"):
                files_list.append(os.path.join(root, file))
    for x in files_list:
        if x.endswith('gz'):
            # gzip.open() in the default 'r' (binary) mode yields bytes,
            # so the keywords must be bytes literals here
            with gzip.open(x, "r") as f:
                for line in f:
                    if b'keywords_1' in line:
                        print(line)
                    if b'keywords_2' in line:
                        print(line)
        else:
            with open(x, "r") as f:
                for line in f:
                    if 'keywords_1' in line:
                        print(line)
                    if 'keywords_2' in line:
                        print(line)
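To make the "concatenate all files into one I/O stream" idea concrete, here is a minimal sketch using the standard library's itertools.chain.from_iterable; open_text() and search_stream() are illustrative names for this sketch, not part of the original function:

import gzip
import itertools

def open_text(path):
    # Open .gz and plain files uniformly in text mode ('rt'), so every
    # line arrives as str and the same keyword test works for both
    return gzip.open(path, 'rt') if path.endswith('.gz') else open(path, 'rt')

def search_stream(files_list, keywords):
    # Lazily chain the lines of every file into one stream; each file is
    # only opened when the iteration reaches it
    stream = itertools.chain.from_iterable(open_text(x) for x in files_list)
    for line in stream:
        if any(k in line for k in keywords):
            print(line, end='')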

Solution

This is my first answer on Stack Overflow, so please bear with me.

I had a very similar problem, where I needed to analyze several logs, some of which were too huge to fit entirely into memory.

A solution to this problem is to create a data-processing pipeline, similar to a Unix/Linux pipeline. The idea is to break each task into its own individual function and use generators to achieve a more memory-efficient approach.

import os
import gzip
import re
import fnmatch

def find_files(pattern, path):
    """
    Find all the filenames that match a specific pattern
    using shell wildcards; that way you avoid hardcoding
    the file pattern, i.e. 'messages'.
    """
    for root, dirs, files in os.walk(path):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)

def file_opener(filenames):
    """
    Open a sequence of filenames one at a time,
    and make sure to close each file once we are done
    scanning its content.
    """
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

def chain_generators(iterators):
    """
    Chain a sequence of iterators together.
    """
    for it in iterators:
        # Look up 'yield from' if you're unsure what it does
        yield from it

def grep(pattern, lines):
    """
    Look for a pattern in each line.
    """
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

# A simple way to use these functions together
logs = find_files('messages*', 'One/two/three')
files = file_opener(logs)
lines = chain_generators(files)
each_line = grep('keywords_1', lines)
for match in each_line:
    print(match)
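Since the question mentions many keywords stored in a keywords file, the grep stage can be fed a single combined pattern instead of one hardcoded keyword. A minimal sketch that reuses the pipeline functions above, assuming a hypothetical keywords.txt with one keyword per line:

def load_keyword_pattern(keyword_file):
    # Build one alternation regex matching any of the keywords;
    # re.escape() keeps characters like '.' from acting as
    # regex metacharacters
    with open(keyword_file, 'rt') as f:
        keywords = [line.strip() for line in f if line.strip()]
    return '|'.join(re.escape(k) for k in keywords)

logs = find_files('messages*', 'One/two/three')
lines = chain_generators(file_opener(logs))
for match in grep(load_keyword_pattern('keywords.txt'), lines):
    print(match)

This keeps the search to a single pass over all the logs, no matter how many keywords the file contains.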

Let me know if you have any questions regarding my answer.
