Fetching emails with Python and converting them to PDF

Latest recommended article published on 2024-03-30 18:32:15

尖沙咀段坤

Categories: python, python爬虫 (Python web scraping). Tags: python, web crawler

Article link: https://blog.csdn.net/Micrasoft007/article/details/127280552

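Business scenario

The client asked for a program that logs in to his mailbox automatically, reads the job applicants' resumes contained in the emails, converts those emails to PDF, and extracts the key information and attachments.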

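Implementation approach

1. Use the imaplib library to access the hospital's HR mailbox, fetch the emails, and use the email library to parse each message into HTML.
2. After filtering out the resume emails, use pdfkit to convert the email content to PDF.
3. Use etree to read the HTML, locate the key information with find, and store it.
Main logic:
1. The business method resume_collect calls getMail to obtain the list of emails and parses it.
2. getMail logs in to the mailbox with imaplib, fetches the messages, and passes each one through parseHeader (header parsing) and parseBody (body parsing) to turn the raw message into HTML and text that are easy to parse.
3. parseHeader uses the email library to parse header fields such as the subject, sender, and recipients.
4. parseBody uses the email library to parse the message body and the attachment content.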
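
The data flow described above can be tried without a mail server. The following is only a minimal sketch: the helper name to_record and the in-memory sample message are assumptions made for illustration, and the dict keys simply mirror the ones getMail builds later in the article.

```python
import email
from email.header import decode_header, make_header
from email.message import EmailMessage


def to_record(raw_bytes):
    """Turn raw RFC 822 bytes into the dict shape used by the later steps."""
    message = email.message_from_bytes(raw_bytes)
    subject = str(make_header(decode_header(message.get('subject', ''))))
    body, attach = [], []
    for part in message.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True)
        filename = part.get_filename()
        if filename:
            attach.append((filename, payload))
        else:
            body.append(payload)
    return {'subject_name': subject, 'body': body, 'attach': attach}


# Build a sample message in memory instead of contacting a real server.
sample = EmailMessage()
sample['Subject'] = 'Resume: Zhang San'
sample['From'] = 'candidate@example.com'
sample.set_content('plain-text part')
sample.add_alternative('<html><body><div>html part</div></body></html>', subtype='html')

record = to_record(sample.as_bytes())
print(record['subject_name'])
print(len(record['body']), len(record['attach']))
```

With a text part followed by an HTML part, body[0] holds the text and body[1] the HTML, which is the layout resume_collect relies on.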

The getMail method: fetching emails

import email
import imaplib


def getMail(self, host, username, password, port=993):
    try:
        # Try an SSL connection first
        serv = imaplib.IMAP4_SSL(host, port)
    except Exception:
        # Fall back to an unencrypted connection on the same port
        serv = imaplib.IMAP4(host, port)

    serv.login(username, password)
    # IMAP4 ID extension (RFC 2971): send the client's identification to the server
    imaplib.Commands['ID'] = ('AUTH',)
    args = ("name", "XXXX", "contact", "XXXX@163.com", "version", "1.0.0", "vendor", "myclient")
    typ, dat = serv._simple_command('ID', '("' + '" "'.join(args) + '")')
    # Select the mailbox
    serv.select("INBOX")
    # Search for all messages
    typ, data = serv.search(None, 'ALL')
    email_list = []
    # Walk the message numbers from newest to oldest
    for num in data[0].split()[::-1]:
        typ, msg_data = serv.fetch(num, '(RFC822)')
        text = msg_data[0][1]
        message = email.message_from_bytes(text)  # convert the raw bytes to an email.message.Message object
        subject_name = self.parseHeader(message)  # parse the headers
        body, attach = self.parseBody(message)  # parse the body and attachments
        email_list.append({'subject_name': subject_name, 'body': body, 'attach': attach})

    serv.close()
    serv.logout()
    return email_list
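
The ID command argument in getMail is a parenthesized list of quoted field/value pairs, as defined by RFC 2971. As a rough sketch, the string can be built from a dict; the helper names quote and build_id_argument and the sample values are assumptions for illustration, not part of imaplib.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + str(value).replace('\\', '\\\\').replace('"', '\\"') + '"'


def build_id_argument(fields):
    """Format key/value pairs as the parenthesized list sent with the ID command."""
    quoted = []
    for key, value in fields.items():
        quoted.append(quote(key))
        quoted.append(quote(value))
    return '(' + ' '.join(quoted) + ')'


# Placeholder client information; replace with real values.
client_info = {'name': 'myclient', 'version': '1.0.0'}
id_argument = build_id_argument(client_info)
print(id_argument)
# The result would then be passed on the way the article does it:
# serv._simple_command('ID', id_argument)
```

Escaping the values keeps a stray quote in a field from breaking the command syntax.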

The parseHeader method: parsing the message headers

import email.header
import email.utils


def parseHeader(self, message):
    """Parse the message headers and return the decoded subject."""
    subject = message.get('subject')
    if not subject:
        return 'Unable to parse this subject'
    try:
        # decode_header splits the header into (value, charset) pairs;
        # make_header joins them back into a single decoded string
        subject = str(email.header.make_header(email.header.decode_header(subject)))
    except (LookupError, UnicodeError):
        # Unknown charset, or bytes that do not match the declared charset
        subject = 'Unable to parse this subject'
    # Subject
    return subject
    # Sender
    # print('From:', email.utils.parseaddr(message.get('from'))[1])
    # Recipient
    # print('To:', email.utils.parseaddr(message.get('to'))[1])
    # Cc
    # print('Cc:', email.utils.getaddresses(message.get_all('cc', [])))
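
To see what header decoding does in isolation, the short sketch below encodes a non-ASCII subject and decodes it again using only the standard library. The sample subject and address are made up for illustration.

```python
from email.header import Header, decode_header, make_header
from email.utils import parseaddr

# Encode a non-ASCII subject the way a mail client would, then decode it again.
encoded_subject = Header('应聘简历', 'utf-8').encode()
print(encoded_subject)

parts = decode_header(encoded_subject)  # list of (value, charset) pairs
decoded_subject = str(make_header(parts))
print(decoded_subject)

# parseaddr splits a From/To value into display name and address.
display_name, address = parseaddr('HR Desk <hr@example.com>')
print(display_name, address)
```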

The parseBody method: parsing the message body

import email.header


def parseBody(self, message):
    """Parse the message body and attachments."""
    body = []
    attach = []
    # Walk through every MIME part of the message
    for part in message.walk():
        # A multipart part is only a container holding a list of sub-messages, so skip it
        if not part.is_multipart():
            # For an attachment this returns the file name, otherwise None
            name = part.get_filename()
            if name:
                # Attachment found.
                # Decode encoded file names such as =?gbk?Q?=CF=E0=C6=AC.rar?=
                try:
                    fname = str(email.header.make_header(email.header.decode_header(name)))
                except (LookupError, UnicodeError):
                    fname = name
                # Decode the attachment data so that it can be stored in a file
                attach_data = part.get_payload(decode=True)
                attach.append((fname, attach_data))
            else:
                # Not an attachment: this is text content, kept as decoded bytes
                body.append(part.get_payload(decode=True))
    return body, attach
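
The way walk() separates body parts from attachments can be checked on a message built in memory. This is only an illustrative sketch: the sample HTML, the fake PDF bytes, and the file name resume.pdf are assumptions.

```python
import email
from email.message import EmailMessage

# Build a message with an HTML body and one attachment.
sample = EmailMessage()
sample['Subject'] = 'resume'
sample.set_content('<div>candidate profile</div>', subtype='html')
sample.add_attachment(b'%PDF-1.4 fake bytes', maintype='application',
                      subtype='pdf', filename='resume.pdf')

# Round-trip through bytes, as if the message had been fetched from a server.
parsed = email.message_from_bytes(sample.as_bytes())
summary = []
for part in parsed.walk():
    if part.is_multipart():
        continue  # the multipart/mixed container itself has no payload
    summary.append((part.get_content_type(),
                    part.get_filename(),
                    part.get_payload(decode=True)))

for content_type, filename, payload in summary:
    print(content_type, filename, len(payload))
```

Passing decode=True makes get_payload undo the transfer encoding (base64 for the attachment here) and return bytes.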

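The resume_collect method: main business logic

This example implements fetching resume emails from a mailbox; for other uses, adapt the main business method as needed.
Because the real business code is too long, only the main skeleton is shown here. It does not read as one continuous program, but that should not get in the way of understanding it. To work out the exact parsing rules, you still need to set breakpoints and inspect what the parsed email content actually looks like.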

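import base64

import pdfkit
from lxml import etree


# Fetch the emails from the user's mailbox on the incoming mail server
def resume_collect(self, server_config):
    email_list = self.getMail(server_config.server, server_config.user, server_config.password)
    for rec in email_list:
        # Decode the body parts to text so that the substring checks below work on str
        bodies = [b.decode('utf-8', errors='replace') if isinstance(b, bytes) else b for b in rec['body']]
        first_body = bodies[0] if bodies else ''
        channel = None
        down_link = None
        if '51job' in first_body:
            channel = '前程无忧'
            try:
                html_obj = etree.HTML(bodies[1])
                info_xmls = html_obj.xpath('//td/strong')
                name = info_xmls[0].text.replace('\r\n', '').replace(' ', '')
                # Further etree lookups that extract the relevant fields go here
            except Exception:
                name = rec['subject_name']
                down_link = 'The email layout has changed and the current script can no longer parse it; please contact the developers'
        elif '智联' in first_body:
            channel = '智联招聘'
            try:
                html_obj = etree.HTML(bodies[1])
                name = html_obj.xpath('//table/tbody/tr/td/div[@class="call-tip"]/b')[0].text
                # Further etree lookups that extract the relevant fields go here
            except Exception:
                name = rec['subject_name']
                down_link = 'The email layout has changed and the current script can no longer parse it; please contact the developers'
        else:
            name = rec['subject_name']
        if name:
            if rec.get('attach', []):
                # Collect the attachments
                for attach_rec in rec['attach']:
                    datas = base64.encodebytes(attach_rec[1])
            # Convert the email HTML to PDF and store it as an attachment
            try:
                for html_rec in bodies:
                    if 'html' in html_rec or '<div>' in html_rec:
                        datas = base64.encodebytes(pdfkit.from_string(html_rec, False, options={'encoding': 'UTF-8'}))
                        break
            except Exception:
                # Conversion failures are ignored in this skeleton
                pass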
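
The two source branches in resume_collect differ mainly in the marker string they look for, so the channel lookup could also be written as a small table. This is one possible sketch rather than the article's code: CHANNEL_MARKERS and detect_channel are hypothetical names, and only the two markers shown in the article are included.

```python
# Marker substring -> channel name; the entries mirror the two branches in the article.
CHANNEL_MARKERS = {
    '51job': '前程无忧',
    '智联': '智联招聘',
}


def detect_channel(first_body, default=None):
    """Return the recruitment channel whose marker appears in the first body part."""
    if isinstance(first_body, bytes):
        # Assumes UTF-8, as the article's code does when decoding body parts.
        first_body = first_body.decode('utf-8', errors='replace')
    for marker, channel in CHANNEL_MARKERS.items():
        if marker in first_body:
            return channel
    return default


print(detect_channel('This resume was forwarded by 51job'))
print(detect_channel('智联招聘为您推荐'.encode('utf-8')))
print(detect_channel('unrelated newsletter'))
```

Adding another source then only means adding one entry to the table, while the per-site XPath rules stay in their own parsing code.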