Parsing EML in Python: how do I read .eml files in Python?

This is how you get the content of an e-mail, i.e. a *.eml file.

It works perfectly on Python 2.5 to 2.7. Try it on Python 3; it should work there too.

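try:
    # Python 3: the usage at the end opens the file in binary mode
    from email import message_from_binary_file as message_from_file
except ImportError:
    # Python 2
    from email import message_from_file
import os

# Path to directory where attachments will be stored:
path = "./msgfiles"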

# To have attachments extracted into memory, change behaviour of 2 following functions:

def file_exists (f):
    """Checks whether extracted file was extracted before."""
    return os.path.exists(os.path.join(path, f))

def save_file (fn, cont):
    """Saves cont to a file fn"""
    out = open(os.path.join(path, fn), "wb")
    out.write(cont)
    out.close()

def construct_name (id, fn):
    """Constructs a file name out of messages ID and packed file name"""
    id = id.split(".")
    id = id[0]+id[1]
    return id+"."+fn

def disqo (s):
    """Removes double or single quotations."""
    s = s.strip()
    if s.startswith("'") and s.endswith("'"): return s[1:-1]
    if s.startswith('"') and s.endswith('"'): return s[1:-1]
    return s

def disgra (s):
    """Removes < and > from HTML-like tag or e-mail address or e-mail ID."""
    s = s.strip()
    if s.startswith("<") and s.endswith(">"): return s[1:-1]
    return s

def pullout (m, key):
    """Extracts content from an e-mail message.
    This works for multipart and nested multipart messages too.
    m   -- email.Message() or mailbox.Message()
    key -- Initial message ID (some string)
    Returns tuple(Text, Html, Files, Parts)
    Text  -- All text from all parts.
    Html  -- All HTMLs from all parts
    Files -- Dictionary mapping extracted file to message ID it belongs to.
    Parts -- Number of parts in original message.
    """
    Html = ""
    Text = ""
    Files = {}
    Parts = 0
    if not m.is_multipart():
        if m.get_filename(): # It's an attachment
            fn = m.get_filename()
            cfn = construct_name(key, fn)
            Files[fn] = (cfn, None)
            if file_exists(cfn): return Text, Html, Files, 1
            save_file(cfn, m.get_payload(decode=True))
            return Text, Html, Files, 1
        # Not an attachment!
        # See where this belongs. Text, Html or some other data:
        cp = m.get_content_type()
        if cp in ("text/plain", "text/html"):
            body = m.get_payload(decode=True)
            if not isinstance(body, str):
                # Python 3 returns bytes here; decode with the part's charset
                body = body.decode(m.get_content_charset() or "utf-8", "replace")
            if cp=="text/plain": Text += body
            else: Html += body
        else:
            # Something else!
            # Extract a message ID and a file name if there is one:
            # This is some packed file and name is contained in content-type header
            # instead of content-disposition header explicitly
            cp = m.get("content-type")
            try: id = disgra(m.get("content-id"))
            except AttributeError: id = None  # no Content-ID header
            # Find file name:
            o = cp.find("name=")
            if o==-1: return Text, Html, Files, 1
            ox = cp.find(";", o)
            if ox==-1: ox = None
            o += 5; fn = cp[o:ox]
            fn = disqo(fn)
            cfn = construct_name(key, fn)
            Files[fn] = (cfn, id)
            if file_exists(cfn): return Text, Html, Files, 1
            save_file(cfn, m.get_payload(decode=True))
        return Text, Html, Files, 1
    # This IS a multipart message.
    # So, we iterate over its parts and call pullout() recursively for each one.
    for pl in m.get_payload():
        # pl is a new Message object which goes back to pullout
        t, h, f, p = pullout(pl, key)
        Text += t; Html += h; Files.update(f); Parts += p
    return Text, Html, Files, Parts

def extract (msgfile, key):
    """Extracts all data from e-mail, including From, To, etc., and returns it as a dictionary.
    msgfile -- A file-like readable object
    key     -- Some ID string for that particular Message. Can be a file name or anything.
    Returns dict()
    Keys: from, to, subject, date, text, html, parts[, files]
    Key files will be present only when message contained binary files.
    For more see __doc__ for pullout() and caption() functions.
    """
    m = message_from_file(msgfile)
    From, To, Subject, Date = caption(m)
    Text, Html, Files, Parts = pullout(m, key)
    Text = Text.strip(); Html = Html.strip()
    msg = {"subject": Subject, "from": From, "to": To, "date": Date,
        "text": Text, "html": Html, "parts": Parts}
    if Files: msg["files"] = Files
    return msg

def caption (origin):
    """Extracts: To, From, Subject and Date from email.Message() or mailbox.Message()
    origin -- Message() object
    Returns tuple(From, To, Subject, Date)
    If message doesn't contain one/more of them, the empty strings will be returned.
    """
    # "in" replaces has_key(), which Message objects lack on Python 3
    Date = ""
    if "date" in origin: Date = origin["date"].strip()
    From = ""
    if "from" in origin: From = origin["from"].strip()
    To = ""
    if "to" in origin: To = origin["to"].strip()
    Subject = ""
    if "subject" in origin: Subject = origin["subject"].strip()
    return From, To, Subject, Date

# Usage:
f = open("message.eml", "rb")
print(extract(f, f.name))
f.close()

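I wrote this for my mail group using the mailbox module, which is why it is so convoluted.

It has never failed me. Never any junk. If the message is multipart, the output dictionary will contain a key "files" (a sub-dictionary) holding the file names of all extracted parts that were neither text nor HTML.

That was my way of extracting attachments and other binary data.

You can change this in pullout(), or just change the behaviour of file_exists() and save_file().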

construct_name() builds a file name out of the message ID and the file name of the multipart part, if there is one.
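
To make the naming rule concrete, here is the same logic repeated as a standalone snippet with one worked call. The key and file name are made-up sample values; note that the key needs at least one dot, otherwise `id[1]` raises an IndexError.

```python
def construct_name(id, fn):
    """Same logic as construct_name() above, repeated so the example runs alone."""
    id = id.split(".")      # "message.eml" -> ["message", "eml"]
    id = id[0] + id[1]      # first two pieces glued together: "messageeml"
    return id + "." + fn

print(construct_name("message.eml", "report.pdf"))  # messageeml.report.pdf
```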

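In pullout() the Text and Html variables are strings. For an online mail group it was fine to collect, in one go, any text or HTML packed into the multipart message that was not an attachment.

If you need something more sophisticated, change Text and Html to lists, append to them, and join them as needed.

Nothing problematic about that.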
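
One way to picture the list-based variant: the sketch below gathers every text/plain and text/html part into separate lists. It uses `Message.walk()` from the standard library rather than the recursive pullout() above, targets Python 3, and simply skips attachments. `collect_bodies` and the sample message are hypothetical names and data made up for illustration.

```python
from email import message_from_string

def collect_bodies(msg):
    """Returns (texts, htmls): lists with one decoded string per body part."""
    texts, htmls = [], []
    for part in msg.walk():
        if part.is_multipart() or part.get_filename():
            continue  # containers and attachments carry no inline body
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        raw = part.get_payload(decode=True)  # bytes on Python 3
        body = raw.decode(part.get_content_charset() or "utf-8", "replace")
        (texts if ctype == "text/plain" else htmls).append(body)
    return texts, htmls

# Made-up two-part message used only to exercise the function.
sample = message_from_string(
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "hello\n"
    "--b\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>hello</p>\n"
    "--b--\n"
)
texts, htmls = collect_bodies(sample)
```

Keeping the parts separate lets the caller decide later whether to join them, show only the first one, or pair each HTML part with its plain-text alternative.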

There may be some errors here, because the code was meant to work with mailbox.Message(), not with email.Message(). I did try it on email.Message() and it ran fine.

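You said you "wish to list them all". From where? If you mean a POP3 mailbox or the mailbox of some nice open-source mailer, then you do it with the mailbox module. If you want to list them from other programs, you have a problem. For example, to get mail out of MS Outlook you have to know how to read OLE2 compound files. Other mailers rarely call them *.eml files, so I think this is exactly what you want to do. In that case, search PyPI for the olefile or compoundfiles module and Google around for how to extract an e-mail from an MS Outlook inbox file. Or save yourself the mess and just export them from there into some directory. Once you have them as eml files, apply this code.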
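
For the mailbox-module route, a rough sketch of the export step might look like the following. It assumes Python 3 and a mailbox in mbox format; `export_mbox` and the numbered file naming are hypothetical choices made for this example, and other formats would use a different `mailbox` class.

```python
import mailbox
import os

def export_mbox(mbox_path, out_dir):
    """Writes every message of an mbox file to out_dir as NNNN.eml; returns the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    box = mailbox.mbox(mbox_path)
    try:
        for i, msg in enumerate(box):
            target = os.path.join(out_dir, "%04d.eml" % i)
            with open(target, "wb") as out:
                out.write(msg.as_bytes())  # headers and body in wire format
            written.append(target)
    finally:
        box.close()
    return written
```

Each exported file can then be opened in binary mode and handed to extract() as in the usage lines above.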

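To read several EML files in one go, you can use the `glob` module to collect the paths of all EML files in a given directory, then parse each file with the `email.parser` module. The following example shows how to read and parse EML files in bulk:

```python
import glob
import os
from email.parser import BytesParser

# Directory that contains the EML files
eml_dir = 'path/to/eml/files'

# Paths of all EML files in that directory
eml_files = glob.glob(os.path.join(eml_dir, '*.eml'))

# Create a parser object
parser = BytesParser()

# Parse each EML file in turn
for eml_file in eml_files:
    # Open in binary mode so the parser deals with the encoding itself
    with open(eml_file, 'rb') as f:
        email_object = parser.parse(f)
    # The individual parts of the parsed e-mail object are now accessible
    print('From:', email_object['From'])
    print('Subject:', email_object['Subject'])
    # For a multipart message get_payload() returns a list of sub-messages
    print('Body:', email_object.get_payload())
    print('---')  # separates the output of different EML files
```

In this example we first use the `glob` module to get the paths of all files ending in `.eml` in the given directory. We then loop over the files, open each one and let a `BytesParser` object parse it, after which the parts of the parsed e-mail object, such as sender, subject and body, can be read. The files are opened in binary mode because reading them as text can fail on messages that are not in the platform's default encoding.

Make sure to replace `eml_dir` with the directory where your EML files are actually stored.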
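
The batch example prints `get_payload()`, which for a multipart message is a list of sub-messages rather than readable text. As a hedged sketch for Python 3.6 or later, parsing with `email.policy.default` gives access to `get_body()`, which picks the preferred body part. `body_text` is a hypothetical helper name introduced here.

```python
from email import message_from_bytes, policy

def body_text(raw_bytes):
    """Returns the preferred body (plain text first, then HTML) as a string, or ""."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    return part.get_content() if part is not None else ""
```

Inside the loop this could be called as `body_text(open(eml_file, 'rb').read())` in place of the `get_payload()` line.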