Python text keyword extraction: extracting keywords from multiple .gz files in Python

Question: How to search for keywords across multiple files in Python (including compressed .gz files and uncompressed files)

I have multiple archived logs in a folder. The latest file is "messages", and the older logs are automatically compressed into .gz files:

-rw------- 1 root root 21262610 Nov 4 11:20 messages
-rw------- 1 root root  3047453 Nov 2 15:49 messages-20191102-1572680982.gz
-rw------- 1 root root  3018032 Nov 3 04:43 messages-20191103-1572727394.gz
-rw------- 1 root root  3026617 Nov 3 17:32 messages-20191103-1572773536.gz
-rw------- 1 root root  3044692 Nov 4 06:17 messages-20191104-1572819469.gz

I wrote a function that:

1. stores all the filenames in a list (this part works);
2. opens each file in the list, using gzip.open() for the .gz files;
3. searches for the keywords.

But I don't think this approach is very smart, because the message log is actually very large and split across multiple .gz files, and I have many keywords stored in a keywords file.

So, is there a better solution that concatenates all the files into a single I/O stream and then extracts the keywords from that stream?

import os
import gzip

def open_all_message_files(path):
    files_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.startswith("messages"):
                files_list.append(os.path.join(root, file))
    for x in files_list:
        if x.endswith('gz'):
            # gzip.open() in the default 'r' (binary) mode yields bytes,
            # so the keywords must be bytes literals here
            with gzip.open(x, "r") as f:
                for line in f:
                    if b'keywords_1' in line:
                        print(line)
                    if b'keywords_2' in line:
                        print(line)
        else:
            with open(x, "r") as f:
                for line in f:
                    if 'keywords_1' in line:
                        print(line)
                    if 'keywords_2' in line:
                        print(line)
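To make the "concatenate all files into one I/O stream" idea concrete, here is a minimal sketch using the standard library's itertools.chain.from_iterable; open_text() and search_stream() are illustrative names for this sketch, not part of the original function:

import gzip
import itertools

def open_text(path):
    # Open .gz and plain files uniformly in text mode ('rt'), so every
    # line arrives as str and the same keyword test works for both
    return gzip.open(path, 'rt') if path.endswith('.gz') else open(path, 'rt')

def search_stream(files_list, keywords):
    # Lazily chain the lines of every file into one stream; each file is
    # only opened when the iteration reaches it
    stream = itertools.chain.from_iterable(open_text(x) for x in files_list)
    for line in stream:
        if any(k in line for k in keywords):
            print(line, end='')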

Solution

This is my first answer on Stack Overflow, so please bear with me.

I had a very similar problem, where I needed to analyze several logs, some of which were too huge to fit entirely into memory.

A solution to this problem is to create a data-processing pipeline, similar to a Unix/Linux pipeline. The idea is to break each task into its own individual function and use generators to achieve a more memory-efficient approach.

import os
import gzip
import re
import fnmatch

def find_files(pattern, path):
    """
    Find all the filenames that match a specific pattern
    using shell wildcards; that way you avoid hardcoding
    the file pattern, i.e. 'messages'.
    """
    for root, dirs, files in os.walk(path):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)

def file_opener(filenames):
    """
    Open a sequence of filenames one at a time,
    and make sure to close each file once we are done
    scanning its content.
    """
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

def chain_generators(iterators):
    """
    Chain a sequence of iterators together.
    """
    for it in iterators:
        # Look up 'yield from' if you're unsure what it does
        yield from it

def grep(pattern, lines):
    """
    Look for a pattern in each line.
    """
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

# A simple way to use these functions together
logs = find_files('messages*', 'One/two/three')
files = file_opener(logs)
lines = chain_generators(files)
each_line = grep('keywords_1', lines)
for match in each_line:
    print(match)
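Since the question mentions many keywords stored in a keywords file, the grep stage can be fed a single combined pattern instead of one hardcoded keyword. A minimal sketch that reuses the pipeline functions above, assuming a hypothetical keywords.txt with one keyword per line:

def load_keyword_pattern(keyword_file):
    # Build one alternation regex matching any of the keywords;
    # re.escape() keeps characters like '.' from acting as
    # regex metacharacters
    with open(keyword_file, 'rt') as f:
        keywords = [line.strip() for line in f if line.strip()]
    return '|'.join(re.escape(k) for k in keywords)

logs = find_files('messages*', 'One/two/three')
lines = chain_generators(file_opener(logs))
for match in grep(load_keyword_pattern('keywords.txt'), lines):
    print(match)

This keeps the search to a single pass over all the logs, no matter how many keywords the file contains.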

Let me know if you have any questions regarding my answer.
