Scraping proxy IPs with Python 3 requests and validating them (with a multithreaded version)

Brief introduction:

This uses a Python 3 environment. The packages you need to install yourself are requests (for requesting URLs and fetching page content) and lxml (for parsing pages and extracting data); both are available via pip.

First, decide where to get the IPs. This post scrapes the free proxies listed on Xici (西刺网).

Rough flow:

  1.  Request a page that lists free proxy IPs (here, "http://www.xicidaili.com/nn/").
  2.  Fetch the page content.
  3.  Extract the useful proxy data (IP and port) from the fetched page.
  4.  Filter the scraped IPs and verify that each one actually works (**not all of the free IPs on Xici are usable**); see the sketch after this list.
  5.  Save the working proxies (IP address and port number) by writing them to a file.
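Step 4 relies on one requests feature: passing a proxies dict routes the request through that proxy, so a successful response means the proxy works. A minimal sketch of the idea (the proxy address below is a made-up placeholder):

import requests

# Hypothetical proxy address, for illustration only
proxy = "1.2.3.4:8080"
try:
    # Route the request through the proxy; a short timeout keeps dead proxies cheap
    response = requests.get("http://www.baidu.com/",
                            proxies={"http": proxy}, timeout=1)
    print("usable" if response.status_code == 200 else "unusable")
except requests.RequestException:
    print("invalid")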

 

import requests
from lxml import etree

#Save the validated proxies to a file
def write_proxy(proxies):
    print(proxies)
    #Open the file once and append each working proxy, one "ip:port" per line
    with open("ip_proxy.txt", 'a+') as f:
        for proxy in proxies:
            print("Writing:", proxy)
            f.write(proxy + '\n')
    print("All proxies saved!")



#Parse the page and extract the proxy IPs from it
def get_proxy(html):
    #Build an lxml tree from the raw HTML
    selector = etree.HTML(html)
    proxies = []
    #Each proxy sits in a table row; the site marks alternating rows with class "odd"
    for each in selector.xpath("//tr[@class='odd']"):
        #The 2nd <td> holds the IP address, the 3rd holds the port
        ip = each.xpath("./td[2]/text()")[0]
        port = each.xpath("./td[3]/text()")[0]
        #Join them into the "ip:port" form that requests expects
        proxy = ip + ":" + port
        proxies.append(proxy)
    print("Scraped %d candidate proxies" % len(proxies))
    test_proxies(proxies)



#Validate the scraped IPs: request Baidu through each proxy and judge
#usability by the response status code.
def test_proxies(proxies):
    url = "http://www.baidu.com/"
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    }
    normal_proxies = []
    count = 1
    for proxy in proxies:
        print("Testing proxy #%s..." % count)
        count += 1
        try:
            #Short timeout so dead proxies fail fast
            response = requests.get(url, headers=header, proxies={"http": proxy}, timeout=1)
            if response.status_code == 200:
                print("Proxy works:", proxy)
                normal_proxies.append(proxy)
            else:
                print("Proxy not usable:", proxy)
        except Exception:
            print("Proxy invalid:", proxy)
    write_proxy(normal_proxies)


#Fetch the proxy-list page and hand the HTML to the parser
def get_html(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    }
    response = requests.get(url, headers=header)
    get_proxy(response.text)


if __name__ == "__main__":

    url = "http://www.xicidaili.com/nn/"
    get_html(url)
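Once ip_proxy.txt exists, the saved proxies can be reused by later requests. A minimal sketch of reading one back (the target URL is just an example):

import requests

# Read back the proxies saved by write_proxy(), one "ip:port" per line
with open("ip_proxy.txt") as f:
    saved = [line.strip() for line in f if line.strip()]

if saved:
    # Route a request through the first saved proxy
    response = requests.get("http://www.baidu.com/",
                            proxies={"http": saved[0]}, timeout=3)
    print(response.status_code)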

 

 

Multithreaded version:

The validation above is serial: the program only tests the next proxy after the previous one finishes, which is **extremely slow**. Running each test in its own thread fixes that.

The flow is otherwise the same as above.

import requests
import threading
from lxml import etree



#Parse the page and extract the proxy IPs from it
def get_proxy(html):
    selector = etree.HTML(html)
    proxies = []
    #Each proxy sits in a table row marked with class "odd"
    for each in selector.xpath("//tr[@class='odd']"):
        #The 2nd <td> holds the IP address, the 3rd holds the port
        ip = each.xpath("./td[2]/text()")[0]
        port = each.xpath("./td[3]/text()")[0]
        #Join IP address and port number
        proxy = ip + ":" + port
        proxies.append(proxy)
    print("Scraped %d candidate proxies" % len(proxies))
    test_proxies(proxies)


#Threads append to the same file, so guard writes with a lock
write_lock = threading.Lock()

def thread_write_proxy(proxy):
    with write_lock:
        with open("./ip_proxy.txt", 'a+') as f:
            print("Writing:", proxy)
            f.write(proxy + '\n')
            print("Saved:", proxy)


#Test a single proxy; each call runs in its own thread
def thread_test_proxy(proxy):
    url = "http://www.baidu.com/"
    header = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    }
    try:
        response = requests.get(
            url, headers=header, proxies={"http": proxy}, timeout=1)
        if response.status_code == 200:
            print("Proxy works:", proxy)
            thread_write_proxy(proxy)
        else:
            print("Proxy not usable:", proxy)
    except Exception:
        print("Proxy invalid:", proxy)



#Validate the scraped IPs, one thread per proxy
def test_proxies(proxies):
    threads = []
    for proxy in proxies:
        test = threading.Thread(target=thread_test_proxy, args=(proxy, ))
        test.start()
        threads.append(test)
    #Wait for every validation thread to finish before returning
    for test in threads:
        test.join()



#Fetch the proxy-list page and hand the HTML to the parser
def get_html(url):
    header = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    }
    response = requests.get(url, headers=header)
    get_proxy(response.text)


if __name__ == "__main__":

    url = "http://www.xicidaili.com/nn/"
    get_html(url)
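Spawning one thread per proxy is fine for a single page of results, but a bounded thread pool scales better as the list grows. A sketch of the same validation step using concurrent.futures (the pool size of 20 is an arbitrary choice, not from the original code):

from concurrent.futures import ThreadPoolExecutor

def test_proxies_pooled(proxies, max_workers=20):
    # Reuses thread_test_proxy from above; the pool caps concurrency
    # and waits for all tests to finish when the block exits
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pool.map(thread_test_proxy, proxies)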

 
