The simplest way to scrape email addresses with Python: extracting email addresses from web pages with Python 3

Latest recommended article published on 2024-02-26 23:56:32

weixin_39975366


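1. Crawl analysis

Analysis of the site showed that the data to scrape is returned by requests of the following form:

http://xxx.com?method=getrequest&gesnum=00000001

http://xxx.com?method=getrequest&gesnum=00000002

http://xxx.com?method=getrequest&gesnum=00000003

The returned JSON contains lone backslash ("\") escape characters that Python 3's JSON parsing could not handle, so calling .json() on the response failed:

req = requests.get(url=url, headers=headers, verify=False, timeout=60).json()

As a workaround, the response is taken as raw bytes (req.content), decoded as UTF-8, and processed as plain text:

req = requests.get(url=url, headers=headers, verify=False, allow_redirects=False, timeout=60)

data = json.dumps(bytes.decode(req.content, 'UTF-8'))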
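The failure described above can be reproduced without any network access. The following is a minimal sketch, not the site's real payload: the `raw` bytes are a made-up response body containing a lone backslash, and `try_parse` and `EMAIL_RE` are hypothetical names introduced for illustration. It shows that strict JSON parsing rejects such text, while a regular expression run on the decoded text still finds the address.

```python
import json
import re

# Hypothetical response body: "\d" is not a valid JSON escape sequence.
raw = b'{"name": "a\\dmin", "email": "user@example.com"}'

# Decode the bytes explicitly instead of relying on automatic decoding.
text = raw.decode("utf-8")


def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Illustrative email pattern; the non-capturing group keeps group(0) simple.
EMAIL_RE = re.compile(r"[-_\w.]{1,64}@(?:[-\w]{1,63}\.)*[-\w]{1,63}")

parsed = try_parse(text)
match = EMAIL_RE.search(text)

print(parsed)
print(match.group(0))
```

Because the goal is only to pull an address out of the text, a regular expression search on the decoded string is enough here, and the payload never has to be valid JSON.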

2. Writing the Python 3 crawler

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Written on Windows 7 x64 with Notepad++ and Python 3.5.0

import json
import re

import requests
import urllib3

# The requests below use verify=False, so suppress urllib3's certificate warnings
urllib3.disable_warnings()

cookie = '''JSESSIONID=1B7407076DE01727BC48DCD56FF9BA70; entsoft=entsoft; JSESSIONID=4877B5AC1DF6307E90CF1641D3863A6C; radId=45991FBF-0BC4-3BA4-08E2-00072022FB2C'''

headers = {
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': cookie,
}

# Write 00000001-00000300 to num.txt, one number per line
def getNum():
    filename = 'C:\\Users\\Administrator\\Desktop\\脚本\\num.txt'
    with open(filename, 'w') as file:
        # The upper bound of range() is exclusive, so 301 is needed to include 300
        for i in range(1, 301):
            file.write(("%08d" % i) + '\n')

def main():
    # url = 'http://xxx.com?method=getrequest&gesnum=00000001'
    getNum()
    filename = 'C:\\Users\\Administrator\\Desktop\\脚本\\num.txt'
    # Regular expression that matches an email address
    emailRegex = r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}"
    with open(filename, 'r') as file:
        for line in file:
            # strip() removes the trailing newline, which would otherwise end up in the URL
            url = 'http://xxx.com?method=getrequest&gesnum={line}'.format(line=line.strip())
            # print(url)
            # req = requests.get(url=url, headers=headers, verify=False, timeout=60).json()
            # Problem: .json() failed on lone "\" escape characters in the returned JSON,
            # so the approach below is used instead
            req = requests.get(url=url, headers=headers, verify=False, allow_redirects=False, timeout=60)
            # print(req.content)
            # response.text returns the body as decoded (unicode) text
            # response.content returns the body as raw bytes
            # Using the text form raised an error, so the bytes are decoded explicitly as UTF-8
            # json.dumps then serializes the decoded text as a single JSON string literal
            data = json.dumps(bytes.decode(req.content, 'UTF-8'))
            # print(data)
            email = re.search(emailRegex, data)
            print(email)

if __name__ == '__main__':
    main()
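The script above writes the record numbers to num.txt and reads them back. Lines read from a file keep their trailing newline, which has to be stripped before the value goes into a URL. As a hedged alternative, the URLs can be built in memory with no intermediate file. This is a minimal sketch: `BASE_URL` and `build_urls` are hypothetical names, and the URL pattern is the placeholder used in this article.

```python
# Placeholder URL pattern from the article; {num} is the 8-digit record number.
BASE_URL = "http://xxx.com?method=getrequest&gesnum={num}"


def build_urls(first=1, last=300):
    """Yield one URL per zero-padded record number, including `last`."""
    # range() excludes its upper bound, hence last + 1.
    for i in range(first, last + 1):
        yield BASE_URL.format(num="%08d" % i)


urls = list(build_urls(1, 3))
for url in urls:
    print(url)
```

Each generated URL can then be passed to requests.get exactly as in the script above.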

3. The output has the following format:

<_sre.SRE_Match object; span=(158, 184), match='xxxx@hotmail.com'>

<_sre.SRE_Match object; span=(145, 170), match='xxxx@nordictelecom.net'>
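These lines are the printed representation of match objects, because re.search returns a match object rather than a string. Calling group(0) on the match yields the matched text itself. The sketch below uses the same regular expression as the crawler; `first_email` is a hypothetical helper and `sample` is made-up input.

```python
import re

# Same pattern as in the crawler script.
EMAIL_RE = re.compile(r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}")


def first_email(data):
    """Return the first email address found in `data`, or None if there is none."""
    m = EMAIL_RE.search(data)
    # group(0) is the full matched text, not the repr of the match object.
    return m.group(0) if m else None


sample = '{"contact": "xxxx@hotmail.com", "id": 1}'
print(first_email(sample))
print(first_email("no address here"))
```

Printing the result of such a helper gives bare addresses directly, so the output needs no further cleanup.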

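4. Post-process the output (saved as email_handle.txt) as follows:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Written on Windows 7 x64 with Notepad++ and Python 3.5.0

def main():
    filename = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle.txt"
    filename1 = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle_handle.txt"
    # Both files are closed automatically when the with block exits
    with open(filename, 'r') as file, open(filename1, 'w') as file1:
        for line in file:
            # Drop the first 48 characters, i.e. the prefix up to and including match='
            # (this offset holds when both span numbers have three digits)
            data = line[48:]
            print(data)
            file1.write(data)

if __name__ == '__main__':
    main()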

5. After processing, the lines look like this; to finish, use find and replace in a text editor to replace '> with nothing:

xxxx@hotmail.com'>

xxxx@nordictelecom.net'>
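The fixed offset of 48 only holds while both span numbers have three digits, and the trailing '> still has to be removed by hand. A hedged alternative is to pull the address out of each saved line with a second regular expression. In this sketch `MATCH_FIELD_RE` and `emails_from_repr_lines` are hypothetical names, and the sample lines imitate the output format shown in step 3.

```python
import re

# Captures whatever sits between match=' and the closing quote.
MATCH_FIELD_RE = re.compile(r"match='([^']*)'")


def emails_from_repr_lines(lines):
    """Extract the match='...' value from each line; skip lines without one."""
    found = []
    for line in lines:
        m = MATCH_FIELD_RE.search(line)
        if m:
            found.append(m.group(1))
    return found


lines = [
    "<_sre.SRE_Match object; span=(158, 184), match='xxxx@hotmail.com'>\n",
    "None\n",  # printed when re.search found no address
    "<_sre.SRE_Match object; span=(1450, 1472), match='xxxx@nordictelecom.net'>\n",
]
print(emails_from_repr_lines(lines))
```

This version does not depend on the width of the span numbers and leaves no trailing characters to clean up.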

