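I am going to collect scattered emails from a larger CSV file, and I am just learning regex. I am trying to extract the emails from this example sentence. However, emails ends up containing only the @ symbol and the single character immediately before it. Can you help me see what's going on?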
import re
String = "'Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com. Edward's is edwardfountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'"
emails = re.findall(r'.[@]', String)
names = re.findall(r'[A-Z][a-z]*',String)
print(emails)
print(names)
Solution
Your regex for the emails cannot work as written: emails = re.findall(r'.[@]', String) matches exactly one arbitrary character followed by @, and nothing more.
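To make the failure concrete, here is a minimal sketch comparing the original pattern with one that consumes word characters on both sides of the @. The shortened sample sentence and the variable names are my own, chosen only for illustration.

```python
import re

# Shortened sample sentence (illustrative, not the full string from the question).
text = "Jessica's email is jessica@gmail.com, and Daniel's is daniel123@gmail.com."

# '.' matches exactly one character and '[@]' matches a literal '@',
# so every match is only two characters long.
short_matches = re.findall(r'.[@]', text)
print(short_matches)  # ['a@', '3@']

# Allowing one or more word characters around '@' captures the whole address.
full_matches = re.findall(r'\w+@\w+\.\w+', text)
print(full_matches)  # ['jessica@gmail.com', 'daniel123@gmail.com']
```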
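I would try a different approach: match the sentences and extract (name, email) pairs, under the following empirical assumptions (if your text varies too much, the logic will break):
All names are followed by 's and, somewhere after that, by " is " (use the non-greedy .*? to match everything in between)
\w matches any alphanumeric character (or underscore), and the domain part matches only one dot (otherwise it would also swallow the final dot of the sentence)
Code: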
import re
String = "'Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com. Edward's is edwardfountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'"
print(re.findall(r"(\w+)'s.*? is (\w+@\w+\.\w+)", String))
Result:
[('Jessica', 'jessica@gmail.com'), ('Daniel', 'daniel123@gmail.com'), ('Edward', 'edwardfountain@gmail.com'), ('Oscar', 'odawg@gmail.com')]
Converting the result to a dict even gives you a name => address dictionary:
{'Oscar': 'odawg@gmail.com', 'Jessica': 'jessica@gmail.com', 'Daniel': 'daniel123@gmail.com', 'Edward': 'edwardfountain@gmail.com'}
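For reference, the conversion can be sketched as below. The names text, pairs and name_to_email are illustrative. On Python 3.7 and later a dict keeps insertion order, so the keys print in the order the names appear in the sentence rather than in the order shown above.

```python
import re

text = "'Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com. Edward's is edwardfountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'"

# findall returns one (name, email) tuple per match because the pattern has two groups.
pairs = re.findall(r"(\w+)'s.*? is (\w+@\w+\.\w+)", text)

# dict() accepts an iterable of 2-tuples and maps each name to its address.
name_to_email = dict(pairs)
print(name_to_email)
print(name_to_email["Oscar"])  # odawg@gmail.com
```

Note that if the same name appears twice, the later address overwrites the earlier one.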
The general case needs to allow more characters (I am not sure this list is exhaustive):
String = "'Jessica's email is jessica_123@gmail.com, and Daniel's email is daniel-123@gmail.com. Edward's is edward.fountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'"
print(re.findall(r"(\w+)'s.*? is ([\w\-.]+@[\w\-.]+\.[\w\-]+)", String))
Result:
[('Jessica', 'jessica_123@gmail.com'), ('Daniel', 'daniel-123@gmail.com'), ('Edward', 'edward.fountain@gmail.com'), ('Oscar', 'odawg@gmail.com')]
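Since the end goal is a CSV file where the name's ... is phrasing may not hold, here is a rough sketch that applies only the email part of the pattern to every cell. The CSV content, the column layout, and the names EMAIL_RE and extract_emails are all hypothetical, and the pattern is the same non-exhaustive one as above, not a full email validator.

```python
import csv
import io
import re

# Hypothetical CSV content standing in for the real file; the columns are made up.
sample_csv = io.StringIO(
    "id,notes\n"
    "1,Contact jessica_123@gmail.com for details\n"
    "2,Daniel (daniel-123@gmail.com) and Edward (edward.fountain@gmail.com)\n"
    "3,no address here\n"
)

# Email part of the pattern from the answer: letters, digits, '_', '-' and '.'
# around '@', with a final dot-separated label that contains no dot itself.
EMAIL_RE = re.compile(r"[\w\-.]+@[\w\-.]+\.[\w\-]+")

def extract_emails(rows):
    """Collect every email-looking substring from every cell of every row."""
    found = []
    for row in rows:
        for cell in row:
            found.extend(EMAIL_RE.findall(cell))
    return found

emails = extract_emails(csv.reader(sample_csv))
print(emails)
```

For a real file, the io.StringIO object would be replaced by a file opened with open(path, newline=""), as the csv module documentation recommends.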