I. Preparation
1. Data source
https://www.89ip.cn/index_1.html
2. Data fields
IP address (IP地址), port (端口)
II. Scraping the data
1. Request setup
from fake_useragent import UserAgent

# Pick a random browser User-Agent string for the request headers
ua = UserAgent()
headers = {
    'User-agent': ua.random
}
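If fake_useragent is not available in a given environment, the same kind of headers dict can be built with the standard library alone. The following is only a minimal sketch under that assumption: the `FALLBACK_USER_AGENTS` list and the `build_headers` helper are hypothetical names that do not appear in the original code, and the strings in the list are illustrative.

```python
import random

# Hypothetical fallback list; the strings are illustrative examples only.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0",
]


def build_headers(user_agents=FALLBACK_USER_AGENTS):
    """Return a headers dict with one randomly chosen User-Agent string."""
    return {"User-Agent": random.choice(user_agents)}


headers = build_headers()
```

The resulting `headers` dict can be passed to `requests.get` in the same way as the one built from `ua.random`.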
2. Inspect the page source (excerpt)
<tbody>
<tr>
<td>
60.177.152.181 </td>
<td>
9000 </td>
<td>
浙江省杭州市 </td>
<td>
电信 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
58.253.157.31 </td>
<td>
9999 </td>
<td>
广东省揭阳市 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
49.89.143.135 </td>
<td>
9999 </td>
<td>
江苏省宿迁市 </td>
<td>
电信 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
49.70.32.177 </td>
<td>
9999 </td>
<td>
江苏省宿迁市 </td>
<td>
电信 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
27.43.187.233 </td>
<td>
9999 </td>
<td>
广东省梅州市 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
39.81.60.119 </td>
<td>
9000 </td>
<td>
山东省 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
60.167.132.215 </td>
<td>
1133 </td>
<td>
安徽省芜湖市 </td>
<td>
电信 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
60.13.42.200 </td>
<td>
9999 </td>
<td>
甘肃省平凉市 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
36.249.49.41 </td>
<td>
9999 </td>
<td>
福建省泉州市 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
36.249.48.38 </td>
<td>
9999 </td>
<td>
福建省泉州市 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
60.167.82.237 </td>
<td>
1133 </td>
<td>
安徽省芜湖市鸠江区 </td>
<td>
电信 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
218.249.45.162 </td>
<td>
52316 </td>
<td>
北京市 </td>
<td>
鹏博士长城宽带 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
27.43.190.119 </td>
<td>
9999 </td>
<td>
广东省梅州市 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
27.43.184.246 </td>
<td>
9999 </td>
<td>
广东省梅州市 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
<tr>
<td>
58.253.159.117 </td>
<td>
9999 </td>
<td>
广东省揭阳市 </td>
<td>
联通 </td>
<td>
2020/10/15 19:15:02 </td>
</tr>
</tbody>
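3. Define the regular expressions
# Match IP addresses. The pattern contains two groups, so findall() returns
# tuples; the first element of each tuple is the full address.
reg = r'(\S\d{1,2}[^\w]([\d.{1,3}\.]{7,14}))'
iplist = re.findall(reg, page_code)
# Match port numbers: digits preceded by whitespace and followed by
# one or two tabs and the closing </td>
reg = r'(?<=\s)\d+(?=\t{1,2}</td>)'
portlist = re.findall(reg, page_code)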
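The two patterns above are matched independently, so the IP list and the port list are only lined up by position. As an alternative sketch, not the author's method, a single pattern can capture the IP cell and the port cell of each row together. It assumes the row layout shown in the page source excerpt (the IP cell immediately followed by the port cell); `ROW_RE`, `extract_proxies`, and the `sample` string are hypothetical names introduced for illustration.

```python
import re

# Assumed row shape: the IP <td> is immediately followed by the port <td>.
ROW_RE = re.compile(
    r"<td>\s*(\d{1,3}(?:\.\d{1,3}){3})\s*</td>\s*"  # IP address cell
    r"<td>\s*(\d{1,5})\s*</td>"                      # port cell
)


def extract_proxies(page_code):
    """Return a list of (ip, port) string tuples found in the page source."""
    return ROW_RE.findall(page_code)


# Hypothetical excerpt shaped like the page source shown earlier.
sample = """
<tr>
<td>
\t\t\t60.177.152.181\t\t</td>
<td>
\t\t\t9000\t\t</td>
<td>
\t\t\t浙江省杭州市\t\t</td>
</tr>
<tr>
<td>
\t\t\t58.253.157.31\t\t</td>
<td>
\t\t\t9999\t\t</td>
<td>
\t\t\t广东省揭阳市\t\t</td>
</tr>
"""

pairs = extract_proxies(sample)
print(pairs)
```

Because each match carries both values, a port can never be paired with the wrong IP if one of the cells is missing or malformed. The pattern does not check that each octet is at most 255.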
III. Storing the data
Save as a JSON file
Pack the field lists together into a single list of (IP, port) tuples:
return list(zip(result, portlist))
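Note that `zip()` stops at the shorter of its inputs, so if the two regular expressions return different numbers of matches, the extra items are dropped silently. A minimal sketch of a guard for that case follows; `pair_fields` is a hypothetical helper, not part of the original code.

```python
def pair_fields(ips, ports):
    """Zip two field lists, refusing to pair them if their lengths differ."""
    if len(ips) != len(ports):
        raise ValueError(
            f"field count mismatch: {len(ips)} IPs vs {len(ports)} ports"
        )
    return list(zip(ips, ports))


pairs = pair_fields(["60.177.152.181", "58.253.157.31"], ["9000", "9999"])
```

On Python 3.10 and later, `zip(ips, ports, strict=True)` raises a `ValueError` for the same situation without a helper.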
Call json.dumps() to convert the data into a JSON-formatted string:
datalist = [{'IP地址':d[0],'端口':d[1]} for d in data]
json.dumps(datalist,ensure_ascii=False)
Use a with ... as statement to write the converted string to a JSON file:
with open(filename,'w',encoding='utf-8') as file_object:
    file_object.write(json.dumps(datalist,ensure_ascii=False))
The final data looks like this:
[
{
"IP地址": "60.13.42.96",
"端口": "9999"
},
{
"IP地址": "36.248.129.112",
"端口": "9999"
},
{
"IP地址": "58.253.155.239",
"端口": "9999"
},
{
"IP地址": "27.43.184.49",
"端口": "9999"
},
{
"IP地址": "49.89.86.136",
"端口": "9999"
},
{
"IP地址": "60.169.240.99",
"端口": "9999"
},
{
"IP地址": "49.87.210.8",
"端口": "9999"
},
{
"IP地址": "59.33.55.19",
"端口": "9999"
},
{
"IP地址": "36.248.133.57",
"端口": "9999"
},
{
"IP地址": "58.253.157.204",
"端口": "9999"
},
{
"IP地址": "58.253.156.7",
"端口": "9999"
},
{
"IP地址": "49.70.99.73",
"端口": "9999"
},
{
"IP地址": "60.179.252.12",
"端口": "3000"
},
{
"IP地址": "27.43.187.110",
"端口": "9999"
},
{
"IP地址": "60.182.19.42",
"端口": "9000"
}
]
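Called without an `indent` argument, `json.dumps()` writes the whole list on a single line. To get a file laid out one field per line, as in the listing above, one option is to pass `indent` to `json.dump()`. This is a sketch under that assumption; `save_json_pretty` and the demo file name are hypothetical.

```python
import json
import os
import tempfile


def save_json_pretty(path, rows):
    """Write (ip, port) pairs to path as indented, UTF-8 encoded JSON."""
    datalist = [{"IP地址": ip, "端口": port} for ip, port in rows]
    with open(path, "w", encoding="utf-8") as file_object:
        json.dump(datalist, file_object, ensure_ascii=False, indent=2)


# Demo: write to a temporary directory and read the file back.
demo_path = os.path.join(tempfile.mkdtemp(), "89ip_demo.json")
save_json_pretty(demo_path, [("60.13.42.96", "9999")])
with open(demo_path, encoding="utf-8") as file_object:
    loaded = json.load(file_object)
```

`json.dump()` writes straight to the open file object, so the intermediate string from `json.dumps()` is not needed.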
IV. Complete code
# -*- coding:utf-8 -*-
# Created by ZhaoWen on 2020/10/15
# 89 free proxy list: https://www.89ip.cn/index_1.html
import requests
from fake_useragent import UserAgent
import json
import re
import time


# Save to a txt file
def save_txt(filename, data):
    filename = filename + '.' + 'txt'
    with open(filename, 'w', encoding='utf-8') as file_object:
        file_object.write(data)


# Save to a JSON file
def save_json(filename, data):
    filename = filename + '.' + 'json'
    datalist = [{'IP地址': d[0], '端口': d[1]} for d in data]
    # print(datalist)
    with open(filename, 'w', encoding='utf-8') as file_object:
        file_object.write(json.dumps(datalist, ensure_ascii=False))


class proxies_89():
    url = ''
    ua = UserAgent()
    headers = {
        'User-agent': ua.random
    }

    @classmethod
    def run(cls, url):
        cls.url = url
        if url is not None:
            page_code = cls.get_page(cls.url, cls.headers)
            # Match IP addresses
            reg = r'(\S\d{1,2}[^\w]([\d.{1,3}\.]{7,14}))'
            iplist = re.findall(reg, page_code)
            result = []
            for ip in iplist:
                result.append(ip[0])
            # Match port numbers
            reg = r'(?<=\s)\d+(?=\t{1,2}</td>)'
            portlist = re.findall(reg, page_code)
            return list(zip(result, portlist))
        else:
            print('No url provided')
            # Return an empty list so that save_json() still receives an iterable
            return []

    # Fetch the page content
    @classmethod
    def get_page(cls, url, headers):
        rep = requests.get(url=url, headers=headers)
        if rep.status_code == 200:
            return rep.text
        return ''


if __name__ == '__main__':
    for i in range(1, 3):
        url = 'https://www.89ip.cn/index_' + str(i) + '.html'
        print('url->' + url)
        result = proxies_89.run(url)
        save_json('89ip_' + str(i), result)
        # Pause between pages to avoid hammering the site
        time.sleep(10)
V. Troubleshooting
When Chinese text is written to a JSON file, it is output as Unicode escape sequences (\uXXXX).
For example:
{
"title": "\u3010Python\u3011\u8bf7\u95ee\u53bb\u54ea\u91cc\u4e0b\u8f7drequests\u7684\u5b89\u88c5\u5e93\uff1f",
"time": "2020-10-15 14:15:02",
"author": "\u541b_GV14Do",
"url": "https://www.lmonkey.com/ask/23002"
},
The fix is:
json.dumps(datalist,ensure_ascii=False)
That is, pass ensure_ascii=False when calling dumps().
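The difference is easy to see on a single record. The snippet below is a small illustration using the 端口 key from this article; the `record`, `escaped`, and `readable` names are chosen here for the example.

```python
import json

record = {"端口": "9999"}

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same data with `json.loads()`, so the setting only affects how the file reads to a person. With `ensure_ascii=False` the file should be opened with an encoding that can represent the characters, such as the `encoding='utf-8'` used in this article.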
VI. References
Scraping page information from the 猿来如此 module of 学习猿地 with Requests and regular expressions, and saving it as JSON