Building Your Own Proxy Pool with a Web Crawler

ʕ ᵔᴥᵔ ʔ

Categories: Python Scripts, Penetration Testing. Tags: python, xpath

Article link: https://blog.csdn.net/qq_45388306/article/details/106083257

Scraping Kuaidaili (快代理)

Tools: Python 3 with the requests and lxml modules

Step 1: Import the modules

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from lxml import etree

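Step 2: Set the request headers and analyze the page

For page analysis I usually use XPath together with a Chrome XPath extension, which works very well.

Set the request headers

# First page of the free proxy list (the same path is used in step 3)
url = 'https://www.kuaidaili.com/free/inha/1/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
# Fetch the target page with requests.get and keep the response object
response = requests.get(url=url, headers=headers, timeout=10)
print(url, response)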


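Inspecting the page shows that the proxies sit in the tr rows matched by the XPath expression //table[@class="table table-bordered table-striped"]/tbody//tr. Collect those rows first, loop over them with a for loop to extract each proxy's IP and port, and save the working ones to a txt file. The code is as follows: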

def scray_ip(url):
    # Request headers: present a browser User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
    # Fetch the target page with requests.get and keep the response object
    response = requests.get(url=url, headers=headers, timeout=10)
    print(url, response)
    # Build an lxml tree; response.text is the full HTML of the Kuaidaili page
    etree_obj = etree.HTML(response.text)
    # Select the table rows that hold the proxy entries
    ip_list = etree_obj.xpath('//table[@class="table table-bordered table-striped"]/tbody//tr')
    item = []
    # Join each row's IP and port as "ip:port" and collect them in item
    for ip in ip_list:
        ip_num = ip.xpath('./td[@data-title="IP"]/text()')[0]
        port_num = ip.xpath('./td[@data-title="PORT"]/text()')[0]
        http = ip_num + ':' + port_num
        item.append(http)
    # Test each proxy; append mode keeps the results from earlier pages
    with open('采集到的IP.txt', 'a', encoding='utf-8') as f:
        for it in item:
            # Not every proxy works, so failures are caught and printed
            try:
                # Map both schemes to the proxy; an 'http'-only mapping
                # would be bypassed for the https:// test URL below
                proxy = {
                    'http': 'http://' + it,
                    'https': 'http://' + it,
                }
                url1 = "https://www.baidu.com"
                # Request Baidu through the proxy; timeout=1 gives up after 1 second without a response
                res = requests.get(url=url1, proxies=proxy, headers=headers, timeout=1)
                # Print the proxy and its response time from elapsed.total_seconds()
                print(it + '--', res.elapsed.total_seconds())
                # Keep the proxy only if the status code is 200
                if res.status_code == 200:
                    f.write(it + '\n')
            except requests.RequestException as e:
                print(e)
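
The row-selection logic can be tried offline before pointing it at the live site. The sketch below runs the same XPath pattern against a hand-written fragment. The markup and the 192.0.2.x addresses are invented for illustration, the helper name extract_proxies is hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml so that nothing needs installing. ElementTree supports only a subset of XPath, so the path starts with .// and the cell text is read through .text instead of text().

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the proxy table described above
SAMPLE_HTML = """
<html><body>
<table class="table table-bordered table-striped">
  <tbody>
    <tr><td data-title="IP">192.0.2.10</td><td data-title="PORT">8080</td></tr>
    <tr><td data-title="IP">192.0.2.11</td><td data-title="PORT">3128</td></tr>
  </tbody>
</table>
</body></html>
"""

ROW_XPATH = './/table[@class="table table-bordered table-striped"]/tbody//tr'


def extract_proxies(markup):
    """Return 'ip:port' strings for every row matched by ROW_XPATH."""
    root = ET.fromstring(markup)
    proxies = []
    for row in root.findall(ROW_XPATH):
        ip_cell = row.find('./td[@data-title="IP"]')
        port_cell = row.find('./td[@data-title="PORT"]')
        # Skip rows that lack either cell instead of raising an error
        if ip_cell is None or port_cell is None:
            continue
        ip = (ip_cell.text or '').strip()
        port = (port_cell.text or '').strip()
        proxies.append(ip + ':' + port)
    return proxies


print(extract_proxies(SAMPLE_HTML))
```

Real pages are rarely well-formed XML, so this stand-in only suits tidy test fragments; for live HTML the article's etree.HTML from lxml remains the right tool.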

Step 3: Scrape multiple pages

The code above only collects the first page, so another for loop can walk through a chosen number of pages:

def ip_page():
    # Page numbers start at 1, so iterate over pages 1 to 10
    for i in range(1, 11):
        # Request path on the Kuaidaili site
        url = 'https://www.kuaidaili.com/free/inha/' + str(i) + '/'
        scray_ip(url)
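
A small variation separates URL building from fetching, which makes the page range easy to check and leaves room for a pause between requests. The helper names page_urls and crawl_pages, the fetch callback, and the one-second default delay are assumptions for illustration, not part of the original code.

```python
import time

BASE_URL = 'https://www.kuaidaili.com/free/inha/'


def page_urls(first_page, last_page, base_url=BASE_URL):
    """Build the list-page URLs for first_page..last_page inclusive."""
    return [base_url + str(n) + '/' for n in range(first_page, last_page + 1)]


def crawl_pages(fetch, first_page=1, last_page=10, delay=1.0):
    """Call fetch(url) for each page, sleeping `delay` seconds in between."""
    urls = page_urls(first_page, last_page)
    for index, url in enumerate(urls):
        fetch(url)
        # Pause between pages (but not after the last one) to go easy on the site
        if index < len(urls) - 1:
            time.sleep(delay)
    return urls


# Dry run with print as the fetch callback; pass scray_ip to crawl for real
crawl_pages(print, first_page=1, last_page=3, delay=0)
```

Passing the fetch function in as an argument keeps the loop testable without any network access.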

Finally, the complete code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from lxml import etree


def scray_ip(url):
    # Request headers: present a browser User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
    # Fetch the target page with requests.get and keep the response object
    response = requests.get(url=url, headers=headers, timeout=10)
    print(url, response)
    # Build an lxml tree; response.text is the full HTML of the Kuaidaili page
    etree_obj = etree.HTML(response.text)
    # Select the table rows that hold the proxy entries
    ip_list = etree_obj.xpath('//table[@class="table table-bordered table-striped"]/tbody//tr')
    item = []
    # Join each row's IP and port as "ip:port" and collect them in item
    for ip in ip_list:
        ip_num = ip.xpath('./td[@data-title="IP"]/text()')[0]
        port_num = ip.xpath('./td[@data-title="PORT"]/text()')[0]
        http = ip_num + ':' + port_num
        item.append(http)
    # Test each proxy; append mode keeps the results from earlier pages
    with open('采集到的IP.txt', 'a', encoding='utf-8') as f:
        for it in item:
            # Not every proxy works, so failures are caught and printed
            try:
                # Map both schemes to the proxy; an 'http'-only mapping
                # would be bypassed for the https:// test URL below
                proxy = {
                    'http': 'http://' + it,
                    'https': 'http://' + it,
                }
                url1 = "https://www.baidu.com"
                # Request Baidu through the proxy; timeout=1 gives up after 1 second without a response
                res = requests.get(url=url1, proxies=proxy, headers=headers, timeout=1)
                # Print the proxy and its response time from elapsed.total_seconds()
                print(it + '--', res.elapsed.total_seconds())
                # Keep the proxy only if the status code is 200
                if res.status_code == 200:
                    f.write(it + '\n')
            except requests.RequestException as e:
                print(e)


def ip_page():
    # Page numbers start at 1, so iterate over pages 1 to 10
    for i in range(1, 11):
        # Request path on the Kuaidaili site
        url = 'https://www.kuaidaili.com/free/inha/' + str(i) + '/'
        scray_ip(url)


def main():
    ip_page()


if __name__ == '__main__':
    main()

Sample run: [screenshot of the console output, not preserved]
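
Once the file holds verified entries, the pool can be used by picking one entry at random per request. This is a minimal sketch: load_proxies, to_proxies and pick_proxy are hypothetical helper names, and it assumes the one-ip:port-per-line format written by scray_ip. The returned mapping has the shape that requests accepts in its proxies argument.

```python
import random


def load_proxies(path):
    """Read 'ip:port' lines from the file, dropping blanks and duplicates."""
    with open(path, encoding='utf-8') as f:
        entries = [line.strip() for line in f if line.strip()]
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(entries))


def to_proxies(address):
    """Map both URL schemes to the same HTTP proxy address."""
    proxy_url = 'http://' + address
    return {'http': proxy_url, 'https': proxy_url}


def pick_proxy(pool, rng=random):
    """Choose one pool entry at random and return its proxies mapping."""
    return to_proxies(rng.choice(pool))
```

For example, requests.get(url, proxies=pick_proxy(load_proxies('采集到的IP.txt')), timeout=5) would route a single request through a randomly chosen entry. Free proxies tend to go stale quickly, so re-checking an entry shortly before use is advisable.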
