[Python 3.6 Crawler Learning Notes] (11) Using Proxy IPs and Testing IP Availability with Multithreading -- Boosting View Counts

子耶

Category: Python. Tags: python, multithreaded crawler testing, ip

Article link: https://blog.csdn.net/qq_36962569/article/details/77417169

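Preface: I originally planned to write a script that posts messages to Qzone message boards, but Tencent (TX) kept serving the Qzone captcha in an endless loop. I foolishly spent the morning studying captcha recognition, only to discover that the message could not be posted at all: even when the captcha was entered correctly, it simply popped up again. Out of options, I took a nap and then started on something new: proxy IPs! I really should have used them earlier, but I had been relying on selenium the whole time and never looked closely at headers, nor used cookies or IPs. I will cover those when I need them, along with recognition of simple captchas.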

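For background on proxy IPs, see: Quickly Increasing Blog View Counts with Python Crawler Proxy IPs
I also found a very good article: Python3 Web Crawler (11): Crawler Black Magic, Making Your Crawler Behave More Like a Human User (Proxy IP Pools, etc.)
As for inflating view counts, the main mechanism is that a page only increments its counter for visits from different IPs. Some pages count by cookies instead, which is the cruder of the two approaches.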
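
To make the idea of "only different IPs count" concrete, here is a toy sketch of how a server might deduplicate views. It is purely illustrative: `DedupCounter` and `record` are hypothetical names, and nothing here is known to match how CSDN or any other site actually counts. A cookie-based counter would have the same shape, with the cookie value used as the key instead of the IP.

```python
class DedupCounter:
    """Toy view counter that counts each distinct key (e.g. an IP) once."""

    def __init__(self):
        self.seen = set()
        self.views = 0

    def record(self, key):
        """Register a visit; only a key not seen before raises the count."""
        if key not in self.seen:
            self.seen.add(key)
            self.views += 1
        return self.views


by_ip = DedupCounter()
for ip in ["1.1.1.1", "1.1.1.1", "2.2.2.2"]:
    by_ip.record(ip)  # the repeated IP is ignored

print(by_ip.views)  # 2
```

Under this model, repeating a request from the same address changes nothing, which is why the rest of the post routes requests through different proxy IPs.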

Part 1: Using proxy IPs with requests, ChromeDriver, and PhantomJS

1-1 Using a proxy IP with requests

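import requests

ip = 'ip:port'  # placeholder: the proxy's IP and port
http = 'http://' + str(ip)
proxies = {
    "http": http
}
try:
    r = requests.get("http://blog.csdn.net/qq_36962569/article/details/77387299", proxies=proxies)
except Exception as e:
    # Print the reason the request through the proxy failed
    print(e)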

Similarly, headers, cookies, and data can be passed to requests directly:

requests.get(url, headers=headers)
requests.get(url, cookies=cookies)
requests.get(url, data=data)

Several arguments can also be passed at once:

requests.get(url, headers=headers, data=data)

Reference links:
Python Notes 7: Requests Crawler Tips (highly recommended, very detailed)
Python Crawler Tips: Setting a Proxy IP
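
As a small extension of the snippets in this section, the sketch below combines a proxy with headers and a timeout in one call. `build_proxies` and `fetch` are hypothetical helper names, and the `User-Agent` value and the 5-second timeout are assumptions chosen for illustration. One detail worth knowing: requests chooses the proxy by the scheme of the target URL, so a dict with only an `"http"` key does not cover `https://` URLs.

```python
def build_proxies(address):
    """Map both URL schemes to the same HTTP proxy given as "ip:port"."""
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def fetch(url, address, timeout=5):
    """Fetch url through the proxy with a browser-like User-Agent."""
    # Third-party import kept local so build_proxies has no dependencies
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}  # assumed value, for illustration
    response = requests.get(url, headers=headers,
                            proxies=build_proxies(address), timeout=timeout)
    response.raise_for_status()  # treat 4xx/5xx answers as failures
    return response


print(build_proxies("127.0.0.1:8080"))
```

Calling `fetch(url, "ip:port")` with a real proxy address would then send the request through that proxy for both HTTP and HTTPS targets.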

1-2 Using a proxy IP with ChromeDriver

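from selenium import webdriver

def ChromeDriverWithIP():
    PROXY = "47.52.108.18"
    chrome_options = webdriver.ChromeOptions()
    # Two ways to add the proxy IP: a literal URL or a formatted string
    # chrome_options.add_argument('--proxy-server=http://35.189.128.127')
    chrome_options.add_argument('--proxy-server={0}'.format(PROXY))
    # Pass the options (and with them the proxy) to the driver;
    # `options=` replaces the older `chrome_options=` keyword
    chrome = webdriver.Chrome(options=chrome_options)
    chrome.get('http://www.cnblogs.com/buzhizhitong/p/5714419.html')
    print('2: ', chrome.page_source)
    chrome.quit()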

1-3 Using a proxy IP with PhantomJS

# How to change the proxy dynamically with phantomjs + selenium
# Note: webdriver.PhantomJS is only available in Selenium 3.x and earlier
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType

def DynamicUsingIP():
    # Create a proxy object
    proxy = Proxy(
        {
            'proxyType': ProxyType.MANUAL,
            'httpProxy': '210.38.1.134'  # proxy IP and port
        }
    )
    desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
    # Add the proxy to the desired capabilities
    proxy.add_to_capabilities(desired_capabilities)
    driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
    # Test it: open a page that reports the IP address in use
    driver.get('http://1212.ip138.com/ic.asp')
    print(driver.page_source)
    # # Now switch to another IP
    # # Create a second proxy
    # proxy = Proxy(
    #     {
    #         'proxyType': ProxyType.MANUAL,
    #         'httpProxy': 'ip:port'  # proxy IP and port
    #     }
    # )
    # # Create a fresh set of desired capabilities
    # desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
    # # Add the new proxy to the capabilities
    # proxy.add_to_capabilities(desired_capabilities)
    # # Start a new session with the updated capabilities
    # driver.start_session(desired_capabilities)
    # driver.get('http://httpbin.org/ip')
    # print(driver.page_source)
    driver.quit()

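Reference links:
A Roundup of selenium PhantomJS Pitfalls (covers things to watch out for with PhantomJS)
Setting a Proxy IP in Selenium (covers several ways to set it)
How to Set a Proxy IP in selenium phantomjs
Setting proxy and headers in phantomjs and selenium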
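
The code above checks the proxy by opening a page that echoes the caller's IP and printing it. A minimal sketch of doing that check programmatically is shown below, assuming the raw JSON body that http://httpbin.org/ip returns (an object with an `origin` field). `origin_ips` and `proxy_in_effect` are hypothetical helper names, and the handling of a comma-separated `origin` value is an assumption about how forwarded addresses may be reported. A browser's `page_source` wraps the JSON in HTML, so it would need to be reduced to the plain text first.

```python
import json


def origin_ips(body):
    """Return the IPs listed in an httpbin.org/ip style JSON body."""
    origin = json.loads(body)["origin"]
    # Assumption: several addresses may be reported as "a, b"
    return [part.strip() for part in origin.split(",")]


def proxy_in_effect(body, proxy_ip):
    """True if the echoed origin contains the proxy's IP."""
    return proxy_ip in origin_ips(body)


sample = '{"origin": "210.38.1.134"}'  # hand-written sample body
print(proxy_in_effect(sample, "210.38.1.134"))  # True
```

If the function returns False for a live response, the request most likely reached the site directly rather than through the proxy.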

Part 2: Testing proxy IP availability

2-1 Testing without threads

import time

import requests

# IP check: write the working IPs back to IP.txt
def IPCheck():
    IP = []
    SuccessIP = []
    # Read the candidate proxies, one per line, skipping blank lines
    with open('IP.txt', 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                IP.append(line)
    # Send a request through each proxy in turn
    for ip in IP:
        http = 'http://' + str(ip)
        proxies = {
            "http": http
        }
        time.sleep(10)
        try:
            r = requests.get("http://blog.csdn.net/qq_36962569/article/details/77387299", proxies=proxies)
        except Exception:
            print(str(ip) + '---connect failed')
        else:
            SuccessIP.append(ip)
            print(str(ip) + '---success')
    # Overwrite the file with the working IPs only
    with open('IP.txt', 'w') as f:
        for ip in SuccessIP:
            f.write(ip + '\n')
    print('Total are ' + str(len(SuccessIP)) + ' successful IP')

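This is very slow: testing 50 IPs takes roughly 3 minutes. With multithreading, testing 70 takes only ten-odd seconds (really fast).
Reference link:
Verifying Whether a Proxy IP Is Usable with Python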
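
The speed gap follows from how the waiting adds up: checked one at a time, the total is roughly the sum of every request's wait, while with one thread per check it is closer to the single longest wait. The sketch below illustrates this with a simulated check. `fake_check` is a hypothetical stand-in that only sleeps, and the 0.05-second delay is an arbitrary assumption, so the timings say nothing about real proxies.

```python
import threading
import time


def fake_check(delay=0.05):
    """Stand-in for one proxy check: waits as a slow network call would."""
    time.sleep(delay)


def run_sequential(n):
    """Run n checks one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_check()
    return time.perf_counter() - start


def run_threaded(n):
    """Run n checks on n threads and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_check) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


sequential = run_sequential(10)
threaded = run_threaded(10)
print("sequential: %.2fs, threaded: %.2fs" % (sequential, threaded))
```

Threads help here because each check spends nearly all of its time waiting on the network rather than computing.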

2-2 Testing with multiple threads

import threading

import requests

# Verify IP availability with multiple threads
def TreadCheckIP():
    # Load the proxies, one per line, skipping blank lines
    proxys = []
    with open('IP.txt', 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                proxys.append(line)
    proxy_ip = open('proxy_ip.txt', 'w')  # new file that stores the working IPs
    lock = threading.Lock()  # guards printing and writes to the shared file

    # Check whether one proxy IP works
    def test(i):
        try:
            http = 'http://' + str(proxys[i])
            proxies = {
                "http": http
            }
            # requests ignores socket.setdefaulttimeout(), so the
            # 5-second timeout is passed explicitly
            r = requests.get("http://blog.csdn.net/qq_36962569/article/details/77387299", proxies=proxies, timeout=5)

            with lock:  # acquired here, released when the block exits
                print(proxys[i], 'is OK')
                proxy_ip.write('%s\n' % str(proxys[i]))  # record the working proxy
        except Exception as e:
            with lock:
                print(proxys[i], e)

    # Single-threaded check
    '''for i in range(len(proxys)):
        test(i)'''
    # Multithreaded check: one thread per proxy
    threads = []
    for i in range(len(proxys)):
        thread = threading.Thread(target=test, args=[i])
        threads.append(thread)
        thread.start()
    # Block the main thread until every worker thread has finished
    for thread in threads:
        thread.join()
    proxy_ip.close()  # close the file
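
The function above starts one thread per proxy, which is fine for a few dozen addresses. For longer lists, a bounded thread pool is a common alternative. The sketch below is one possible shape, not the method used in this post: `check_all` is a hypothetical helper, the checker is passed in as a function, `max_workers=20` is an arbitrary assumption, and `fake_check` is a stand-in that never touches the network.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(proxies, check, max_workers=20):
    """Run check(proxy) on a bounded pool and return the proxies that pass."""
    proxies = list(proxies)

    def safe_check(proxy):
        # Any exception raised by the checker counts as a failed proxy
        try:
            return bool(check(proxy))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so zip() pairs them correctly
        results = list(pool.map(safe_check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]


def fake_check(proxy):
    """Demo checker: fails for addresses that start with "bad"."""
    if proxy.startswith("bad"):
        raise ConnectionError("unreachable")
    return True


print(check_all(["1.1.1.1:80", "bad:1", "2.2.2.2:8080"], fake_check))
# ['1.1.1.1:80', '2.2.2.2:8080']
```

Because the results come back as a list on the calling thread, the working proxies can be written to a file afterwards without a lock.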