Multithreaded Crawler: Scraping Proxy IPs

Latest recommended article published on 2024-08-13 18:10:28

gvliew

Category: python. Tags: proxy IP, crawler, multithreading

Article link: https://blog.csdn.net/dala_da/article/details/79439737

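In my earlier blog post, "Using a crawler to boost CSDN blog page views", the 10 IP addresses I used were all hardcoded in advance. I can't very well go to Xici and paste in 10 fresh IPs by hand every time I run it.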

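So I tried scraping them automatically, and found that because every IP has to be tested to see whether it actually works, the whole run takes a long time.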

I happened to be learning about multithreaded crawlers, so I tried writing one with four threads, and it was much faster.

Along the way I ran into a couple of tricky problems.

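The first was that two lines of output would sometimes end up on the same line. A quick search on Baidu solves this one: put a lock around the statements that write to the file or print to the console, so that only one thread is producing output at any moment.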
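The locking idea can be tried on its own with a small sketch. It is written for Python 3, and everything in it is made up for illustration: the `report` helper, the `io.StringIO` buffer standing in for the console or file, and the placeholder addresses. Each line is written in several pieces on purpose, so without the lock the pieces from different threads could interleave.

```python
import io
import threading

output = io.StringIO()           # stands in for the console or an open file
output_lock = threading.Lock()   # only one thread may write at a time


def report(ip, status):
    """Write one complete line while holding the lock."""
    with output_lock:
        # Several separate writes: without the lock, another thread
        # could slip its own pieces in between these.
        output.write(ip)
        output.write(' ')
        output.write(status)
        output.write('\n')


def worker(worker_id):
    # Placeholder addresses, not real proxies
    for n in range(50):
        report('192.0.2.%d:%d' % (worker_id, 8000 + n), 'failed')


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

lines = output.getvalue().splitlines()
```

Using `with output_lock:` rather than separate acquire and release calls means the lock is released even if a write raises an exception.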

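The other, more annoying problem was that my threads never finished. The only workaround I had was to pass a timeout to join, so that after a delay of a few seconds the main thread stops waiting, exits, and takes those threads down with it. But that is a short-sighted fix: on another machine, a different processor may take a different amount of time, so a fixed delay cannot be relied on.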
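A minimal Python 3 sketch of why the join timeout only hides the problem. It assumes the workers were stuck in a `get` call with no timeout, which the post does not state outright; `stuck_worker` and `empty_queue` are hypothetical names. The point is that join with a timeout merely stops waiting, while the thread itself keeps running.

```python
import queue
import threading

empty_queue = queue.Queue()


def stuck_worker():
    # No timeout: this blocks forever because nothing is ever put on the queue
    empty_queue.get()


# daemon=True lets the interpreter exit even though this thread never returns
t = threading.Thread(target=stuck_worker, daemon=True)
t.start()

# Returns after at most 0.2 seconds whether or not the thread has finished
t.join(timeout=0.2)

still_running = t.is_alive()
print(still_running)  # prints True: join gave up waiting, the thread did not stop
```

Because join returns nothing useful here, `is_alive` is the only way to tell a finished thread from one that was simply abandoned, which is why picking a delay by hand is fragile.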

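Later I looked up the documentation for Queue's get function. When block is true and a timeout is given, get waits at most timeout seconds and raises the Empty exception if no item became available in that time. The script below relies on this: a worker that gets Empty treats the queue as drained and exits its loop.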
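The timeout behaviour of get can be seen in isolation in a short Python 3 sketch. The queue contents, the `worker` function and the 0.1 second timeout are illustrative assumptions, not taken from the script: four threads drain a queue of numbers, and each one leaves its loop as soon as get raises `queue.Empty`.

```python
import queue
import threading

tasks = queue.Queue()
for n in range(10):
    tasks.put(n)

results = []
results_lock = threading.Lock()


def worker():
    while True:
        try:
            # Wait at most 0.1 s for an item; Empty means the queue is drained
            item = tasks.get(True, 0.1)
        except queue.Empty:
            break
        with results_lock:
            results.append(item * item)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # no timeout needed: every worker exits on its own

all_finished = not any(t.is_alive() for t in threads)
```

This pattern only works when the queue is completely filled before the workers start, as it is here. If items were still being produced, a slow producer could be mistaken for an empty queue.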

# -*- coding: utf-8 -*-
"""
Created on Sat Mar 03 19:06:18 2018

@author: Administrator
"""

# Ported to Python 3: print() calls and the queue module replace the Python 2 forms
import re
import requests
from threading import Thread
from threading import Lock
from queue import Queue, Empty

# All proxy rows scraped from Xici
all_find_list = []
# Every scraped proxy is pushed onto this queue; the four threads pull proxies from it
gaoni_queue = Queue()
# Proxies that connected successfully
success_list = []

# Guards console output and success_list so only one thread touches them at a time
lock = Lock()

def get_proxy(checking_ip):
    # Build the proxies dict from the "ip:port" string
    proxy_ip = 'http://' + checking_ip
    proxy_ips = 'https://' + checking_ip
    proxy = {'https': proxy_ips, 'http': proxy_ip}
    return proxy

def checking_ip():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    }

    while True:
        # If no proxy can be fetched within 1 second, get raises Empty:
        # every proxy has been handed out, so this thread can exit.
        # (No task_done() here: nothing was fetched, so there is nothing to mark done.)
        try:
            ip = gaoni_queue.get(True, 1)
        except Empty:
            break

        proxy = get_proxy(ip)
        url = 'https://www.csdn.net/'
        # Request the URL above to test whether the proxy can connect.
        # The timeout keeps a dead proxy from blocking this thread forever.
        try:
            requests.get(url, headers=headers, proxies=proxy, timeout=10)
        except requests.RequestException:
            with lock:
                print(ip, 'failed')
        else:
            with lock:
                print(ip, 'succeeded')
                success_list.append(ip)


def get_all():
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
    for i in range(1, 2):
        # Fetch IPs from the high-anonymity pages of the xici site
        url = 'http://www.xicidaili.com/nn/%d' % i
        r = requests.get(url, headers=headers)
        data = r.text
        # Regular expression that captures the fields we need from each table row
        p = r'<td>(.*?)</td>\s+<td>(.*?)</td>\s+<td>\s+(.*?)\s+</td>\s+<td class="country">(.*?)</td>'
        find_list = re.findall(p, data)
        all_find_list.extend(find_list)
    # Combine each IP address and port into the "ip:port" format
    for row in all_find_list:
        ip = row[0] + ':' + row[1]
        gaoni_queue.put(ip)


if __name__ == '__main__':
    get_all()
    print(gaoni_queue.qsize())
    # Start four worker threads, then wait for all of them to finish
    threads = [Thread(target=checking_ip) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Write every working proxy to ip.txt, one per line
    with open("ip.txt", "w") as f:
        for row in success_list:
            f.write(row + '\n')

At the end, all the working proxies are written to a file.
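To see what the regular expression in `get_all` captures, it can be run against a small hand-written table row. The HTML fragment below is an assumption about the general shape of the page, not a copy of the real Xici markup, and the address is a placeholder. The pattern itself is the one from the script.

```python
import re

# Same pattern as in get_all()
ROW_PATTERN = r'<td>(.*?)</td>\s+<td>(.*?)</td>\s+<td>\s+(.*?)\s+</td>\s+<td class="country">(.*?)</td>'

# Hypothetical row, written by hand to match the pattern's expectations
sample_html = """
<tr>
  <td>192.0.2.10</td>
  <td>8080</td>
  <td>
    Example City
  </td>
  <td class="country">high anonymity</td>
</tr>
"""

rows = re.findall(ROW_PATTERN, sample_html)

# The first two captured groups are the IP address and the port
proxies = [row[0] + ':' + row[1] for row in rows]
print(proxies)  # prints ['192.0.2.10:8080']
```

The pattern depends on the exact whitespace and cell order of the page, so any change to the site's markup would make `re.findall` return an empty list rather than raise an error. Checking `gaoni_queue.qsize()` before starting the threads, as the script does, is a cheap way to notice that.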