Notes on Problems with a Proxy-Based Coroutine Crawler

I've recently been crawling text through proxies, because Kaggle is sensitive about requests for certain information: even with a random sleep of up to 25 seconds, the crawler died after about 300 records with "Too many requests" and had to be restarted from scratch.
That clearly wasn't going to work (it would take forever), so I decided to route the crawler through proxies.
I use Kuaidaili's private proxies; refer to the Kuaidaili usage documentation (快代理使用文档).
From experience, a few parameters need to be set:

batch = 20  # how many rows are fetched asynchronously per batch
interval = 10  # write a CSV every batch*interval requests; each task set runs batch requests asynchronously
ip_number = 30  # must be larger than batch, otherwise errors occur
orderid = 964177646067726
signature = 'd21sx6qptydqjfu56llxdoiejbmcqxrg'

orderid is the order number, and ip_number is how many IPs to keep in the proxy pool.

After some beginner-level fumbling, a few points seem worth writing down:

  1. The most important thing is tuning the random sleep. Yes, these proxy IPs cost money and time is precious, but if an IP only gets a few requests in before the site bans it, you lose even more. After a lot of fiddling, a random sleep of up to about 5 seconds (np.random.random() * 5) seems right; don't be too greedy with each proxy IP.
  2. You also need a gating mechanism. Since the crawl is asynchronous, one batch follows another, and the next batch should start only after the current one has completely finished. By the bucket principle, the whole batch takes as long as its slowest request: however long the slowest IP needs, that is how long the batch lasts;
    If a batch runs long (also: remember to set a timeout on the request, or you'll still be waiting next year; and note that with aiohttp the timeout must be a plain number, e.g. timeout=7, not a tuple like timeout=(3, 7); this differs from requests.get), the last IP to finish has usually been hitting the site too fast, and the site throttles it for a few seconds to let its server breathe.
    Then, if the next batch draws randomly from the pool right after this batch ends and happens to land on that very IP, the one the site just got annoyed with and wanted to throttle for a few seconds, and it visits the site again, the server will not be pleased and will lock it out even longer. That wastes a lot of time, and if you keep ignoring such a clear hint, Kaggle may ban the IP outright, and then it's dead weight.
    So a quick check on each IP before each batch starts is well worth it: if an IP has only just finished a request, let it take a short break and have a sip of water first.
# Drawing a proxy from the pool for one request.
# [-1] records whether this proxy is fresh from the provider: 0 if it has never
# been used, 1 otherwise; [-2] records the time this proxy was last used, which
# the next batch's check relies on.
# The "> 5.5" below is what gives a just-used IP its sip of water.
while True:
    alive_length = len(alive_proxy_list)
    random_index = int(np.random.random() * alive_length)
    if alive_proxy_list[random_index][-1] == 0 or (time.time() - alive_proxy_list[random_index][-2]) > 5.5:
        print('Make sure an IP that is still mid-request is not sent off to another url')
        # Update this proxy's last-used time and first-use flag
        alive_proxy_list[random_index][-2] = time.time()
        alive_proxy_list[random_index][-1] = 1
        proxy = alive_proxy_list[random_index][0]
        # proxies = {
        #     "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": proxy},
        #     "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": proxy}
        # }
        proxy_1 = "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": proxy}
        break
  3. If you want the crawler to run stably for a long time, you have to top the pool up at some point. Otherwise you keep randomly drawing from the same stale IPs that haven't been cleaned out yet, and requests through them start failing. In my code the pool holds 30 IPs, and once it shrinks to about 21-22 it's time to refill back up to 30 (don't ask how I know; pure trial and error).
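The batch gating and timeout behavior from points 1 and 2 can be sketched without touching the network. In the sketch below, fetch_one fakes the request with asyncio.sleep; a real version would instead call aiohttp's session.get with timeout=7 (or aiohttp.ClientTimeout(total=7)). All names here are illustrative, not from the original code:

```python
import asyncio
import random

async def fetch_one(url, timeout_s=0.05):
    # Simulated request: a real crawler would await session.get(url, ...)
    # here with a plain-number timeout (aiohttp does not accept a (3, 7)
    # tuple the way requests.get does).
    delay = random.uniform(0.0, 0.1)
    try:
        await asyncio.wait_for(asyncio.sleep(delay, result=url), timeout_s)
        return ("ok", url)
    except asyncio.TimeoutError:
        return ("fail", url)  # the caller counts this as a yichang / re-queues it

async def crawl(urls, batch=4):
    results = []
    for start in range(0, len(urls), batch):
        chunk = urls[start:start + batch]
        # Gate: the next batch starts only after the slowest task in this
        # batch returns (the "bucket principle" above), and the timeout
        # bounds how long that slowest task can possibly take.
        results.extend(await asyncio.gather(*(fetch_one(u) for u in chunk)))
    return results

urls = [f"/user{i}" for i in range(10)]
out = asyncio.run(crawl(urls))
print(len(out))  # every URL yields either an "ok" or a "fail" record
```

Because each batch is awaited as a whole, a throttled IP gets at least the duration of the next draw-and-check before it can be picked again.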

ps: You can flag the cases where a request fails. In my code, with batches of 20 rows, if 10 or more rows in a batch fail (yichang), I sleep 10 seconds and print the state of alive_proxy_list. That's how I noticed that whenever the pool was down to 19, 20, 13, 18, 21 entries (all below 22), these yichang > 10 situations kept recurring; hence the 21-22 refill threshold above, which largely keeps the yichang count down.

ps2: Don't hit the proxy provider's page (i.e. refill the pool) too often either; space the refills out. Fetching the Kuaidaili API link takes time, so refilling at every opportunity just wastes it.
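The two refill rules (only refill once the pool drops to the 21-22 floor, and never sooner than some minimum gap after the last refill) can be sketched as below. The class and method names are made up for illustration, and the provider API call is faked by generating placeholder strings:

```python
import time

class PoolRefiller:
    """Top the proxy pool back up only when it is below a threshold AND
    enough time has passed since the last refill, so the provider's API
    is not hammered. Illustrative sketch, not the original code."""

    def __init__(self, target_size=30, refill_floor=22, min_gap_s=60):
        self.target_size = target_size
        self.refill_floor = refill_floor  # the 21-22 threshold from the post
        self.min_gap_s = min_gap_s
        self.last_refill = 0.0
        self.api_calls = 0

    def maybe_refill(self, pool, now=None):
        now = time.time() if now is None else now
        if len(pool) > self.refill_floor:
            return False                  # pool is still healthy
        if now - self.last_refill < self.min_gap_s:
            return False                  # too soon; the API fetch costs time
        self.api_calls += 1
        need = self.target_size - len(pool)
        pool.extend(f"new-ip-{self.api_calls}-{i}" for i in range(need))
        self.last_refill = now
        return True

refiller = PoolRefiller()
pool = [f"ip{i}" for i in range(21)]      # dropped below the floor
print(refiller.maybe_refill(pool, now=100.0), len(pool))   # True 30
print(refiller.maybe_refill(pool[:20], now=130.0))         # False: within 60 s
```

In the real crawler the placeholder extend would be replaced by a call to the Kuaidaili getdps API.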

if yichang > 10:
    print(f'Sleeping 10 s; meanwhile, current length of alive_proxy_list: {len(alive_proxy_list)}')
    print(alive_proxy_list)
    time.sleep(10)
if yichang > 16:
    sys.exit()

Full code:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import numpy as np
import time
import random
import threading
import asyncio
import aiohttp
import requests
import json
import pandas as pd
import re
import sys

from fake_useragent import UserAgent

def Change_UA():
    return str(UserAgent(path="E:/Application/python/fakeuseragent.json").random)



# Parameters
row = 10201
rowend = 20201
batch = 20  # how many rows are fetched asynchronously per batch
interval = 10  # write a CSV every batch*interval requests; each task set runs batch requests asynchronously
ip_number = 30  # must be larger than batch, otherwise errors occur
orderid = 994186530441076
signature = '6y9uquhqnia6nio2z2frc5ulkqqsm97a'
一段时间 = 60  # 30 is too short: many later requests then fail (the try branch stops making progress)
alive_proxy_list = []
member_urls = []  # all urls still to fetch
class ProxyPool():
    def __init__(self, orderid, proxy_count):
        self.orderid = orderid
        self.proxy_count = proxy_count if proxy_count < 50 else 50  # total IPs the pool maintains; keeping it under 50 is recommended

    def _fetch_proxy_list(self, count):
        """Fetch a list of proxy IPs from the Kuaidaili API."""
        try:
            # This request can take a while
            res = requests.get("http://dps.kdlapi.com/api/getdps/?orderid=%s&num=%s&pt=1&sep=1&f_et=1&format=json" % (self.orderid, count), timeout=(3, 7))
            temp_list1 = []
            for proxy_temp in res.json().get('data').get('proxy_list'):
                m = proxy_temp.split(',')
                m.append(time.time())
                m.append(0)  # flag: 0 means this IP has not been used yet
                temp_list1.append(m)
            print(temp_list1)
            return temp_list1
        except Exception:
            print("API failed to return IPs; check the order")
            sys.exit()

    def _init_proxy(self):
        """Initialize the IP pool."""
        # extend the module-level list in place (a plain assignment here would
        # only create a local variable and leave the shared pool empty)
        alive_proxy_list.extend(self._fetch_proxy_list(self.proxy_count))
        print('IP list')
        print(alive_proxy_list)

    def add_alive_proxy(self, add_count):
        """Add fresh IPs; the argument is how many to add."""
        print('IP list')
        alive_proxy_list.extend(self._fetch_proxy_list(add_count))



    def run(self):
        sleep_seconds = 1
        self._init_proxy()
        shijian_first = time.time()
        while True:
            for proxy in list(alive_proxy_list):  # iterate over a copy so removal is safe
                judge_whether_delete = float(proxy[1]) - sleep_seconds  # proxy[1] is this IP's remaining lifetime
                if judge_whether_delete <= 3:
                    alive_proxy_list.remove(proxy)  # drop the IP when it has only ~3 s left
            shijian_second = time.time()
            print('run() sweep: clearing out expired IPs and topping up with fresh ones')
            if shijian_second - shijian_first > 一段时间:
                for proxy in list(alive_proxy_list):  # copy again: we may remove while looping
                    print(proxy[0])
                    print('+++++++++++++++++++++++++++++++++++++++')
                    temp_data = {
                        'orderid': orderid,
                        'proxy': proxy[0],
                        'signature': signature
                    }
                    print('---------------------------------------------')
                    shengyu_time = requests.get('https://dps.kdlapi.com/api/getdpsvalidtime', params=temp_data, timeout=(3, 7))
                    shengyu_time = shengyu_time.json()['data']
                    print('Remaining lifetimes:')
                    print(shengyu_time)
                    for key in shengyu_time:
                        temp_sheng = shengyu_time[key]
                        print(f'{proxy[0]} has {temp_sheng}s left')
                        if temp_sheng < 7:
                            alive_proxy_list.remove(proxy)
                            print('Removed')
                        else:
                            print('More than 6 s left; no rush, keep it')
                            proxy[1] = temp_sheng
                    shijian_first = time.time()

            if (self.proxy_count - len(alive_proxy_list)) > 7:
                print('Refilling the pool')
                self.add_alive_proxy(self.proxy_count - len(alive_proxy_list))
            time.sleep(sleep_seconds * 2)

    def start(self):
        """Start a child thread that keeps the IP pool updated."""
        t = threading.Thread(target=self.run)
        t.daemon = True  # daemon thread: the main thread won't wait for it, and it dies when the main thread exits
        t.start()




list_main = []
async def fetch(session, item_url, i, row):
    print('Running the smallest unit of work =====================')
    # Username/password authentication (private/dedicated proxies)
    username = "1362569083"
    password = "o58f87mw"
    # Draw a proxy from the pool for this request.
    # [-1] records whether this proxy is fresh from the provider: 0 if never used,
    # 1 otherwise; [-2] records the time this proxy was last used, for the next batch's check.
    while True:
        alive_length = len(alive_proxy_list)
        random_index = int(np.random.random() * alive_length)
        # The "> 3" below gives a just-used IP its breather
        try:
            if alive_proxy_list[random_index][-1] == 0 or (time.time() - alive_proxy_list[random_index][-2]) > 3:
                print(f'{alive_proxy_list[random_index][0]} was last used at {alive_proxy_list[random_index][-2]}; it is now {time.time()}')
                print('Make sure an IP that is still mid-request is not sent off to another url')
                # Update this proxy's last-used time and first-use flag
                alive_proxy_list[random_index][-2] = time.time()
                alive_proxy_list[random_index][-1] = 1
                proxy = alive_proxy_list[random_index][0]
                proxy_1 = "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": proxy}
                break
        except IndexError:
            pass  # the pool thread may have shrunk the list under us; just redraw

    # # Whitelist mode (requires setting up a whitelist in advance)
    # proxies = {
    #     "http": "http://%(proxy)s/" % {"proxy": proxy_ip},
    #     "https": "http://%(proxy)s/" % {"proxy": proxy_ip}
    # }
    # The target page to visit
    proxy_auth = aiohttp.BasicAuth(username, password)
#####################################################################################
    url = 'http://www.kaggle.com' + item_url
    print(item_url)
    payload = {}
    headers = {
        'authority': 'www.kaggle.com',
        'cache-control': 'max-age=0',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'upgrade-insecure-requests': '1',
        'user-agent': Change_UA(),
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cookie': 'YOUR_KAGGLE_SESSION_COOKIE'  # paste your own logged-in Kaggle session cookie here
    }
    try:
        response = await session.get(url, headers=headers, data=payload, proxy=proxy_1, timeout=7, proxy_auth=proxy_auth)
        all_text = await response.text()
        await asyncio.sleep(np.random.random() * 3)

        #########################################################################################
        country_obj = re.compile(r'.*?"country":"(?P<country>.*?)","region".*?')
        country = country_obj.search(all_text)
        try:
            user_country = country.group('country')
            print(user_country)
        except AttributeError:
            user_country = None
            print('No country')

        ############################################
        follower_obj = re.compile(r'"followers":\{"type":"following","count":(?P<followers>.*?),"list".*?"following":{"type":"following","count":(?P<following>.*?),"list"')
        follower = follower_obj.search(all_text)
        following = follower_obj.search(all_text)
        try:
            follower = follower.group('followers')
            following = following.group('following')
            print(follower)
            print(following)
        except AttributeError:
            print('No follower data?????????')

        try:
            list_main.append([item_url, user_country, follower, following])
        except Exception:
            print('Nothing extracted')
            if all_text == 'Too many requests':
                print('Got Too many requests')
                await asyncio.sleep(20)

        print(f'Country extracted for user number {row + i}')

    except Exception:
        member_urls.append(item_url)
        print('Probably a timeout? Not sure. Skipping for now')
        await asyncio.sleep(2)
        return 1



async def main(loop):
    flag = 0  # whether all work is done
    i = 0  # starting index
    time_a = time.time()
    epoch = 0
    async with aiohttp.ClientSession() as session:
        while True:
            num_url = len(member_urls)
            print(f'{num_url} urls at the moment')

            print('Preparing this batch of tasks +++++++++++++++++++++++++++')
            yichang = 0
            if num_url < i + batch:
                tasks = [loop.create_task(fetch(session, member_urls[index], index, row)) for index in range(i, num_url)]
                flag = 1
            else:
                tasks = [loop.create_task(fetch(session, member_urls[index], index, row)) for index in range(i, i + batch)]
            done, pending = await asyncio.wait(tasks)  # run and wait for every task in the batch to finish
            results = [r.result() for r in done]  # collect all return values
            if flag == 1:
                country_temp = pd.DataFrame(list_main,
                                            columns=['member_profileUrl', 'country', 'follower', 'following'])
                country_temp.to_csv(f'{row}to{rowend}_country_get.csv')
                break
            yichang = 0
            for iter in results:
                if iter == 1:
                    yichang += 1
            if yichang > 10:
                print(f'Sleeping 10 s; meanwhile, current length of alive_proxy_list: {len(alive_proxy_list)}')
                print(alive_proxy_list)
                time.sleep(10)
            # if yichang > 16:
            #     sys.exit()
            print(f'Finished through item {i + batch - 1}')

            epoch = epoch + 1
            if epoch == interval:
                country_temp = pd.DataFrame(list_main,
                                            columns=['member_profileUrl', 'country', 'follower', 'following'])
                country_temp.to_csv(f'{row}to{row + i + batch}_country_get.csv')
                time_b = time.time()
                print(f'This {batch}*{interval}={batch * interval} epoch took {time_b - time_a}s')
                # start timing the next epoch
                time_a = time.time()
                epoch = 0
            i = i + batch


def run():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
    loop.close()
    print('All done')


if __name__ == '__main__':
    df_base = pd.read_csv(r'D:\AllFileAboutPythonProject\学习尝试\毕业设计\用meta爬的competition\get_country\left_member.csv')[row:rowend]
    member_urls = df_base['member_profileUrl'].tolist()
    print('First three member_urls as a sample:')
    print(member_urls[:3])
    ######################################################
    proxy_pool = ProxyPool(orderid, ip_number)  # order number, number of IPs the pool maintains
    proxy_pool.start()
    time.sleep(1)  # wait for the IP pool to initialize

    # the random-proxy machinery is now ready
    run()



Late addition (after going around in circles this turned out to be the most useful trick; next time I'll do it first): the best way I found to reduce the failure rate is simply to append each failed item_url back onto the end of the list of urls to fetch. That way the run is guaranteed to finish eventually. For example, with a list of 100,000 urls to get, maybe 4,000 of them fail through the proxies; just append those 4,000 to the end of the 100,000 and you're set.
Also, the proxies succeed on this site only about 70% of the time (so out of 10,000 records roughly 3,000 won't go through on a given pass; as above, move them to the end). For long-running crawls, steady, evenly-paced proxies are better value.

To sum up: throughout, roughly 30%-40% of the records fail on any given pass. At first I tried to solve the proxy-ban problem directly (I assumed the proxies were being detected by the server, or visiting too often, or the timeout was too short), but nothing I tried improved it. Only later did I think of appending the failed item_urls back onto the master url list and crawling in passes, which guarantees that every url eventually gets fetched.
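The append-the-failures idea above can be sketched as a simple work queue. Here flaky_fetch is a hypothetical stand-in for the real proxied request that fails each URL once before succeeding; a production version would also want a per-URL retry cap so a permanently dead URL cannot loop forever:

```python
from collections import deque

def crawl_all(urls, fetch):
    """Failed URLs go to the back of the queue and get retried later, so a
    ~70% per-attempt success rate still finishes the whole list.
    `fetch` returns a record on success or None on failure."""
    queue = deque(urls)
    done = {}
    while queue:
        url = queue.popleft()
        rec = fetch(url)
        if rec is None:
            queue.append(url)  # move the miss to the end; try again on a later pass
        else:
            done[url] = rec
    return done

# Demo fetch that fails the first time it sees each URL, succeeds the second.
seen = set()
def flaky_fetch(url):
    if url not in seen:
        seen.add(url)
        return None
    return f"data:{url}"

result = crawl_all(["/a", "/b", "/c"], flaky_fetch)
print(len(result))  # 3: everything eventually succeeds
```

Retried URLs naturally land at the end of the run, by which point the proxy that failed them has long since been rotated or rested.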
