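I'm a Python beginner trying to extract data from email headers. I have thousands of emails in a single text file, and from each message I want to extract the sender's address, the recipient address(es), and the date, and write them to a new file as one semicolon-separated line per message.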
It's ugly, but this is what I came up with:

import re

emails = open("demo_text.txt", "r")  # opens the file to analyze
results = open("results.txt", "w")  # creates a new file for the search results
resultsList = []

for line in emails:
    if "From - " in line:  # recognizes the beginning of an email message and adds a line break
        newMessage = re.findall(r'\w\w\w\s\w\w\w.*', line)
        if newMessage:
            resultsList.append("\n")
    if "From: " in line:
        address = re.findall(r'[\w.-]+@[\w.-]+', line)
        if address:
            resultsList.append(address)
            resultsList.append(";")
    if "To: " in line:
        if "Delivered-To:" not in line:  # avoids confusion with the 'Delivered-To:' tag
            address = re.findall(r'[\w.-]+@[\w.-]+', line)
            if address:
                for person in address:
                    resultsList.append(person)
                    resultsList.append(";")
    if "Date: " in line:
        date = re.findall(r'\w\w\w\,.*', line)
        resultsList.append(date)
        resultsList.append(";")

for result in resultsList:
    results.writelines(result)

emails.close()
results.close()
Here is my 'demo_text.txt':
^{pr2}$
The output is:
somebody_1@hotmail.com;somebody_2@gmail.com;3_nobodies@yahoo.com.ar;Mon, 15 Jan 2007 18:13:43 +0000;
This output would be fine, except that there is a line break in the "From:" field in my demo_text.txt (line 24), so I miss 'nobody@hotmail.com'.
I don't know how to tell my code to skip over the line break and still find the email address in the From: tag.
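For illustration, here is a minimal sketch of one way the line break might be handled: "unfolding" the headers before searching them. It assumes the continuation line starts with a space or tab, which is how folded headers normally look. The `unfold_headers` helper and the `sample` text are made up for this example and are not taken from demo_text.txt.

```python
import re

def unfold_headers(text):
    """Join continuation lines (those starting with a space or tab)
    onto the line before them. Hypothetical helper for illustration."""
    return re.sub(r'\r?\n[ \t]+', ' ', text)

# Made-up sample with a folded "From:" header (an assumption about the layout)
sample = (
    'From: "Somebody" <somebody_1@hotmail.com>,\n'
    '\t<nobody@hotmail.com>\n'
    'To: somebody_2@gmail.com\n'
)

unfolded = unfold_headers(sample)
# After unfolding, the whole From: header sits on a single line
from_line = next(l for l in unfolded.splitlines() if l.startswith("From: "))
addresses = re.findall(r'[\w.-]+@[\w.-]+', from_line)
print(addresses)
```

Note that this reads the whole text at once rather than line by line, and it would also join indented lines in a message body, so it is probably only safe to apply to the header part of each message.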
More generally, I'm sure there are far more sensible ways to do this task. If anyone could point me in the right direction, I would appreciate it.
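One direction that may be worth exploring is the standard library's `email` package, which parses headers (including folded ones) without hand-written regexes. The sketch below is only an illustration under assumptions: `split_messages` and `summarize` are hypothetical helper names, the `sample` message is invented, and the split assumes every message starts with a "From - " separator line as in my file.

```python
import re
from email import message_from_string
from email.utils import getaddresses

def split_messages(text):
    # Assumption: each message begins with a "From - ..." separator line
    parts = re.split(r'^From - .*\n', text, flags=re.MULTILINE)
    return [p for p in parts if p.strip()]

def summarize(raw_message):
    # Parse the headers, then collect sender(s), recipient(s) and date
    msg = message_from_string(raw_message)
    senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return ";".join(senders + recipients + [msg.get("Date", "")])

# Invented sample message, not the real contents of demo_text.txt
sample = (
    "From - Mon Jan 15 18:13:50 2007\n"
    "Delivered-To: somebody_2@gmail.com\n"
    "From: Somebody <somebody_1@hotmail.com>,\n"
    "\t<nobody@hotmail.com>\n"
    "To: somebody_2@gmail.com, 3_nobodies@yahoo.com.ar\n"
    "Date: Mon, 15 Jan 2007 18:13:43 +0000\n"
    "\n"
    "Body text.\n"
)

lines = [summarize(m) for m in split_messages(sample)]
print("\n".join(lines))
```

Because the parser matches header names exactly, "Delivered-To:" is not confused with "To:". A body line that happens to start with "From - " would break the split, so the standard library's `mailbox` module might be a more robust choice if the file is in mbox format.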