Python 2.x: Scraping Free, Working IP Proxies

Code for scraping proxies

Fetching free proxy IPs from a website

  1. Free IP proxies: of the many proxy-listing sites out there, few are even moderately reliable. One recommendation: http://www.xicidaili.com/nn; its proxies are comparatively good.
  2. Paid IP proxies: Abuyun (https://www.abuyun.com/) offers good proxies, but at a steep price. If your crawler's demands on proxies are not very high, the free ones will do.
  3. The scraping code below is provided for reference (it uses bs4, urllib2, and a few other packages):
#!/usr/bin/env python
# encoding: utf-8
"""
Spider ip proxy. Website: http://www.xicidaili.com/nn
Authors: idKevin
Date: 20170717
"""

from bs4 import BeautifulSoup
import urllib2
import time
import socket
import random

def getContent(Url):
    """ Get the web site content.  """


    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0'}
    req = urllib2.Request(Url, headers=header)       # request

    while True:
        try:
            response = urllib2.urlopen(req).read()   # Web site content
            break
        except urllib2.HTTPError as e:     # label each error type so the log is easy to read
            print 'HTTPError:', e
            time.sleep(random.choice(range(5, 20)))
        except urllib2.URLError as e:
            print 'URLError:', e
            time.sleep(random.choice(range(10, 30)))
        except socket.timeout as e:
            print 'Timeout:', e
            time.sleep(random.choice(range(15, 20)))
        except Exception as e:
            print 'Error:', e
            time.sleep(random.choice(range(10, 20)))

    return response                        # The website content

def extractIPAddress(content):
    """ Extract web IP address and port. """
    proxys = []                                   # proxy list
    soup = BeautifulSoup(content, 'html.parser')  # soup object
    trs = soup.find_all('tr')                     # extract tr tag
    for tds in trs[1:]:
        td = tds.find_all('td')                   # extract td tag
        proxys.append(str(td[1].contents[0]) + ":" + str(td[2].contents[0]))

    return proxys

def getProxys():
    """ Fetch the first page of proxy addresses. """
    Url = 'http://www.xicidaili.com/nn/1'   # first page of the proxy list
    content = getContent(Url)               # achieve html content
    proxys = extractIPAddress(content)      # achieve proxys
    # for e in proxys:                        # output proxy on screen 
    #     print e

    return proxys
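The row parsing above relies on BeautifulSoup, but the core idea — pair each IP cell with the port cell that follows it — can be sketched with the standard library alone. The HTML fragment below is a hypothetical reduction of the xicidaili table layout (an assumption, not the site's actual markup), and the regex is a stand-in for the `td[1]`/`td[2]` lookup in `extractIPAddress`:

```python
import re

# Hypothetical fragment of the proxy table: each <tr> holds an IP cell
# followed immediately by a port cell, as extractIPAddress expects.
sample_html = """
<tr>
  <td><img src="flag.png"/></td>
  <td>121.31.79.66</td>
  <td>8123</td>
  <td>Guangxi</td>
</tr>
<tr>
  <td><img src="flag.png"/></td>
  <td>1.197.56.73</td>
  <td>808</td>
  <td>Henan</td>
</tr>
"""

def extract_pairs(html):
    """Pair each IP <td> with the port <td> that follows it."""
    ips = re.findall(r'<td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*<td>(\d+)</td>', html)
    return ['%s:%s' % (ip, port) for ip, port in ips]

print(extract_pairs(sample_html))
```

On the sample fragment this yields the same `ip:port` strings that the BeautifulSoup version appends to its proxy list.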

Filtering out invalid proxy IPs

  1. Not every scraped proxy actually works. Either our network cannot reach the proxy, or the proxy cannot reach the target site, so invalid proxies must be weeded out.
  2. The filter condition depends on the site you want to crawl, so tailor it to your target. The filtering code is below (line 12 of it imports the module above):
#!/usr/bin/env python
# encoding: utf-8
"""
Effective proxys. Website: http://www.xicidaili.com/nn
Authors: idKevin
Date: 20170717
"""

import urllib2
import time
import random
import xiciProxys           # import user-defined package
import socket
import re


def testProxys(proxys):
    """ Test the proxys. """
    validProxys = []
    Url = "http://ip.chinaz.com/getip.aspx"
    for proxy in proxys:
        try:
            # set proxy
            proxy_handler = urllib2.ProxyHandler({'http':proxy, 'https':proxy})
            opener = urllib2.build_opener(proxy_handler)
            urllib2.install_opener(opener)
            # request website
            response = urllib2.urlopen(Url, timeout=5).read()

            # set filtration condition according website
            if re.findall('{ip:.*?,address:..*?}', response) != []: # remove invalid proxy
                validProxys.append(proxy)
                print "%s\t%s" % (proxy, response)
        except Exception as e:
            # print "%s\t%s" % (proxy, "invalid")
            continue

    return validProxys


if __name__ == '__main__':
    proxys = xiciProxys.getProxys()      # scrape candidate proxies with the module above
    validProxys = testProxys(proxys)     # keep only the ones that actually respond
    print "%d valid proxys" % len(validProxys)
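The filtration condition in `testProxys` only checks that the response body matches the `{ip:...,address:...}` shape. A minimal self-contained check of that exact regex, using hypothetical response strings (an assumption about the service's format; the real endpoint's output may differ):

```python
import re

# Hypothetical response bodies: one in the {ip:'...',address:'...'}
# shape the filter expects, one an error page a dead proxy might return.
good = "{ip:'1.197.56.73',address:'Henan Xuchang Telecom'}"
bad = "<html>403 Forbidden</html>"

pattern = '{ip:.*?,address:..*?}'          # same condition used in testProxys

print(re.findall(pattern, good) != [])     # proxy would be kept
print(re.findall(pattern, bad) != [])      # proxy would be dropped
```

Because the match is lazy (`.*?`), the pattern stops at the first closing brace, so a well-formed response matches exactly once.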

Execution results:

49.84.29.115:8118 {ip:'49.84.29.115',address:'江苏省镇江市 电信'}
114.97.238.176:8118 {ip:'114.97.238.176',address:'安徽省合肥市包河区 电信'}
122.72.32.72:80 {ip:'36.102.239.105',address:'浙江省 电信'}
1.197.56.73:808 {ip:'1.197.56.73',address:'河南省许昌市 电信'}
113.206.74.90:8118 {ip:'113.206.74.90',address:'重庆市 联通'}
121.31.79.66:8123 {ip:'121.31.79.66',address:'广西梧州市 联通'}
110.73.1.255:8123 {ip:'110.73.1.255',address:'广西防城港市 联通'}
180.108.189.180:8118 {ip:'180.108.189.180',address:'江苏省苏州市 电信'}
171.39.117.158:8123 {ip:'171.39.117.158',address:'广西百色市 联通'}
112.114.92.175:8118 {ip:'112.114.92.175',address:'云南省临沧市 电信'}
110.73.14.172:8123 {ip:'110.73.14.172',address:'广西防城港市 联通'}
42.53.40.61:80 {ip:'42.53.40.61',address:'辽宁省 联通'}

This yields roughly 13 free proxies per run. For sites that don't place heavy demands on their proxies, that is just barely enough.


