Data Scraping with Proxy IPs (1): Using a while Loop to Guarantee the Data Gets Fetched Even When Proxy IPs Have Only a 1% Success Rate

To get around anti-scraping measures we almost always use proxy IPs, but proxies are unstable and frequently invalid, which makes requests fail. By combining while, try, and except into a retry loop, we can keep retrying until the fetch succeeds.

The following code does the job:

code = 0
while code < 200:
    proxies = {'https': random.choice(proxies_list),   # pick a random proxy IP
               'http': random.choice(proxies_list)}
    headers = {
        'user-agent': random.choice(USER_AGENTS)}      # pick a random User-Agent

    try:
        response = requests.get(url=url, headers=headers, proxies=proxies, timeout=1)
        code = 300
        print(code, 'fetch succeeded')
    except requests.RequestException:                  # dead or slow proxy: try again
        code = 2
        print('invalid proxy IP, code =', code)

Explanation: as long as code is less than 200 the while loop keeps running, drawing a fresh random proxy IP and User-Agent on every pass.

When a request succeeds, code is set to 300 and the loop exits. When it fails, code is set to 2 (still below 200), so the loop picks another random proxy IP and User-Agent and tries again.
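One risk with the loop as written is that it will spin forever if every proxy in the pool is dead. A variant with a retry cap is sketched below; it is not part of the original script, and `fetch` stands in for the `requests.get` call (the `fake_fetch` at the bottom just simulates two bad proxies followed by a good one):

```python
def fetch_with_retries(fetch, max_attempts=100):
    """Call fetch() until it succeeds or max_attempts is exhausted.

    fetch is any zero-argument callable that raises on failure and
    returns a response on success (e.g. a requests.get wrapper that
    picks a random proxy on each call).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            print('attempt', attempt, 'failed, retrying with a new proxy')
    raise RuntimeError('all %d attempts failed' % max_attempts)

# Simulated fetch that fails twice, then succeeds:
calls = {'n': 0}
def fake_fetch():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('bad proxy')
    return 'OK'

print(fetch_with_retries(fake_fetch))  # succeeds on the third attempt
```

Bounding the attempts turns an infinite hang into a clear error you can log and act on.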
The complete code:

import requests
from bs4 import BeautifulSoup
import chardet
import re
import random
import getIPa_from_rds

# load proxy IPs
proxies_list = getIPa_from_rds.get_Ip(20000)  # pull proxy IPs from the database (count is configurable)
print('number of proxy IPs loaded:', len(proxies_list))

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Opera/8.0 (Windows NT 5.1; U; en)',
    'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]

url = 'http://list.secoo.com/bags/30-0-0-0-0-1-0-0-1-10-0-0.shtml#pageTitle'

code = 0
while code < 200:
    proxies = {'https': random.choice(proxies_list),   # pick a random proxy IP
               'http': random.choice(proxies_list)}
    headers = {
        'user-agent': random.choice(USER_AGENTS)}      # pick a random User-Agent

    try:
        response = requests.get(url=url, headers=headers, proxies=proxies, timeout=1)
        code = 300
        print(code, 'fetch succeeded')
    except requests.RequestException:                  # dead or slow proxy: try again
        code = 2
        print('invalid proxy IP, code =', code)

response.encoding = chardet.detect(response.content)['encoding']  # detect the page encoding
text = response.text
soup = BeautifulSoup(text, 'lxml')

# anchors whose href contains 'source=list' and whose id contains 'name'
new_url_list = soup.find_all('a', href=re.compile('source=list'), id=re.compile('name'))
for i in new_url_list:
    print(i.get('href'))
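A note on the proxy format: the code above assumes each entry in `proxies_list` is a string that `requests` accepts directly, typically `scheme://host:port`. A minimal sketch of how such a list maps onto the `proxies` dict (the addresses here are made up):

```python
import random

# Hypothetical proxy strings in the scheme://host:port form requests expects.
proxies_list = ['http://10.0.0.1:8080', 'http://10.0.0.2:3128']

proxy = random.choice(proxies_list)
# The script draws the http and https proxies independently; reusing one
# proxy for both schemes is equally valid:
proxies = {'http': proxy, 'https': proxy}
print(proxies)
```

If your database stores bare `ip:port` pairs, prepend the scheme before handing them to `requests`.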

The scraping output:

database connection succeeded!
number of proxy IPs loaded: 12870
invalid proxy IP, code = 2
invalid proxy IP, code = 2
300 fetch succeeded
http://item.secoo.com/25309631.shtml?source=list
http://item.secoo.com/14599260.shtml?source=list
http://item.secoo.com/27025211.shtml?source=list
http://item.secoo.com/50001837.shtml?source=list
http://item.secoo.com/31570927.shtml?source=list
http://item.secoo.com/49990861.shtml?source=list
http://item.secoo.com/48405606.shtml?source=list
http://item.secoo.com/50037824.shtml?source=list
http://item.secoo.com/48406523.shtml?source=list
http://item.secoo.com/50006856.shtml?source=list
http://item.secoo.com/49997574.shtml?source=list
http://item.secoo.com/49913924.shtml?source=list
http://item.secoo.com/50009607.shtml?source=list
http://item.secoo.com/49893505.shtml?source=list
http://item.secoo.com/49913889.shtml?source=list
http://item.secoo.com/49913966.shtml?source=list
http://item.secoo.com/49891258.shtml?source=list
http://item.secoo.com/49915674.shtml?source=list
http://item.secoo.com/49914106.shtml?source=list
http://item.secoo.com/49892994.shtml?source=list
http://item.secoo.com/50057683.shtml?source=list
http://item.secoo.com/49899448.shtml?source=list
http://item.secoo.com/49915569.shtml?source=list
http://item.secoo.com/50020170.shtml?source=list
http://item.secoo.com/49896368.shtml?source=list
http://item.secoo.com/49899378.shtml?source=list
http://item.secoo.com/49899385.shtml?source=list
http://item.secoo.com/49884804.shtml?source=list
http://item.secoo.com/49899399.shtml?source=list
http://item.secoo.com/49899952.shtml?source=list
http://item.secoo.com/49898986.shtml?source=list
http://item.secoo.com/50040757.shtml?source=list
http://item.secoo.com/49900540.shtml?source=list
http://item.secoo.com/31649572.shtml?source=list
http://item.secoo.com/31552727.shtml?source=list
http://item.secoo.com/31193634.shtml?source=list
http://item.secoo.com/31650902.shtml?source=list
http://item.secoo.com/31642208.shtml?source=list
http://item.secoo.com/31580083.shtml?source=list
http://item.secoo.com/31581266.shtml?source=list
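The link extraction at the end relies on BeautifulSoup accepting compiled regexes as attribute filters. Here is a self-contained illustration on a miniature, made-up stand-in for the listing page (using the stdlib `html.parser` backend instead of lxml so it runs anywhere):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the Secoo listing page.
html = '''
<a id="name_1" href="http://item.secoo.com/111.shtml?source=list">bag A</a>
<a id="name_2" href="http://item.secoo.com/222.shtml?source=list">bag B</a>
<a id="other"  href="http://item.secoo.com/333.shtml">unrelated link</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all keeps only anchors whose href contains 'source=list' AND whose
# id contains 'name', exactly as in the full script above.
links = [a.get('href') for a in
         soup.find_all('a', href=re.compile('source=list'), id=re.compile('name'))]
print(links)
```

The third anchor is dropped because it matches neither filter, which is how the script separates product links from the rest of the page.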