Python Crawler Series, Part 4: Meituan Crawler (Scraping Shop Information)

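Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/xing851483876/article/details/81842329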

Environment: Windows 7 + Python 3.6 + PyCharm 2017

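Goal: scrape shop information for the Shenzhen area from the mobile version of Meituan's food section, including shop name, category, address, phone number, average spend per person, opening hours, rating, number of reviews, and latitude/longitude. About 21,000 records were collected in the end, and the program ran for roughly 1 hour. Tools: requests, selenium, Chrome.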


1. Meituan desktop site

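Open Meituan Shenzhen at https://sz.meituan.com/, click 美食 (Food), and press F12 to open the browser's developer tools. Select Network and then XHR in the upper right, then click any sub-area, for example 香蜜湖 (Xiangmihu). A request named getPoiList?cityName=XXXXXXXX shows up. Clicking it reveals that the request URL carries a parameter called _token. This token appears to be computed by some algorithm, so to imitate the browser's request you first need to know how it is generated. It is most likely produced by JavaScript. When a parameter is encrypted in JS, there are generally two options: work out the algorithm and reimplement it yourself, or call the site's JS code directly. The parameter was probably added only in the last few months; I searched online without finding a solution and could not work it out from the JS files, so I gave up on the desktop site. If anyone knows how to handle this token, please let me know. If you really need the token, selenium + Chrome should also work, since each token is presumably valid for some period of time.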

2. Meituan mobile site

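Since the desktop site was a dead end, I had to find another route. Many websites now have a desktop version, a mobile version, and an app, and the mobile version usually has weaker anti-scraping measures. Open the Meituan mobile site at https://i.meituan.com/, press F12 to open the developer tools, and click the two buttons marked 1 in the original screenshot to emulate a mobile browser.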

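Then click 美食 (Food). On the resulting page, two requests are of interest. The first returns the basic page frame, such as the category information at the top, which is used later. The second, list, is a dynamic request that returns the shop data. It is a POST request; its parameters (highlighted in red in the original screenshot) become clear after clicking through a few shops. Only four parameters change: areaId (area), cateId (food category), offset (pagination), and uuid (an id issued by the site).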

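Simulate the browser's POST request directly and change offset to page through the results. Each page holds 15 records, so offset increases by 15 per page. In testing, paging directly under the current food page reached at most 67 pages (1,005 records); after that the site either showed a CAPTCHA or returned no data. The shops therefore have to be scraped in smaller groups.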
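
The offset arithmetic can be sketched as follows. This is only an illustration: page_offsets and build_payload are hypothetical helper names, the payload keeps just the four changing fields named above plus limit, and the 67-page cap is the figure observed in testing, not a documented limit.

```python
import math

PAGE_SIZE = 15   # rows per page, as observed in the article
MAX_PAGES = 67   # pages reachable before the site stopped returning data


def page_offsets(total_count, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Return the offset of every page needed to cover total_count rows."""
    pages = min(math.ceil(total_count / page_size), max_pages)
    return [page * page_size for page in range(pages)]


def build_payload(area_id, offset, uuid, cate_id=1):
    """Build only the changing fields of the list request (the rest is omitted)."""
    return {"uuid": uuid, "cateId": cate_id, "areaId": area_id,
            "offset": offset, "limit": PAGE_SIZE}


offsets = page_offsets(105)   # e.g. an area that reports 105 shops
print(offsets)                # [0, 15, 30, 45, 60, 75, 90]
```

An area with more than 1,005 shops would be cut off at offset 990, which is why the scrape has to be split into groups that each stay under that size.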

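The information we need is on each shop's detail page. A detail-page URL is usually assembled from a few key parameters, and those parameters can be collected from the list page above. Opening a shop and looking at its URL shows two main parameters: the shop id (6268902 here) and ct_poi. Both can be found in the data returned by the POST request above.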

https://meishi.meituan.com/i/poi/6268902?ct_poi=314286840956592200722254147016600281179_a6268902_c0_e11543712825375195158 
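
Building that URL from the two values is a one-liner. A minimal sketch, where detail_url is a hypothetical helper name and the values are the ones from the example URL above:

```python
from urllib.parse import urlencode

DETAIL_BASE = "https://meishi.meituan.com/i/poi/"


def detail_url(poi_id, ct_poi):
    """Join the shop id and the ct_poi value into a detail-page URL."""
    return f"{DETAIL_BASE}{poi_id}?{urlencode({'ct_poi': ct_poi})}"


url = detail_url(
    6268902,
    "314286840956592200722254147016600281179_a6268902_c0_e11543712825375195158",
)
print(url)
```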

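Opening a detail page triggers many requests, so we need to confirm which one returns the fields we want: shop name, category, address, phone number, average spend, opening hours, rating, number of reviews, and latitude/longitude. It turns out to be the first request, that is, the URL above.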

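Open the HTML returned by that first request and press Ctrl+F to search for the shop's phone number to locate the data. It sits inside a <script crossorigin='anonymous'> tag. There are several such tags, so they need to be told apart: when parsing with XPath, take each tag's text, check whether its first 16 characters are window._appState, and if so process the rest as JSON.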
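
The same check can be sketched with the standard library alone. This is not the parser used later in this article (that one uses lxml and XPath); ScriptCollector and extract_app_state are hypothetical names, and the sample page is made up purely to exercise the function.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data


def extract_app_state(html_text, prefix="window._appState"):
    """Return the parsed JSON assigned to window._appState, or None."""
    collector = ScriptCollector()
    collector.feed(html_text)
    for text in collector.scripts:
        text = text.strip()
        if text.startswith(prefix):
            # Keep what follows the first "=" and drop a trailing ";".
            payload = text.split("=", 1)[1].strip().rstrip(";")
            return json.loads(payload)
    return None


# Made-up page used only to exercise the function.
sample = (
    "<html><body>"
    "<script crossorigin='anonymous'>var other = 1;</script>"
    "<script crossorigin='anonymous'>window._appState = "
    '{"poiInfo": {"name": "Demo", "phone": "000"}};</script>'
    "</body></html>"
)
state = extract_app_state(sample)
print(state["poiInfo"]["name"])   # Demo
```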

3. Basic approach

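That gives the basic approach: collect each shop's id and ct_poi from the list pages, build the detail-page URL, and then fetch the detail page. Because paging stops at 67 pages, the scrape has to be split up. Here I split by area, which should keep every area below 67 pages (1,005 records). The site shows a citywide total of 46,655 shops, but the per-area counts add up to about 24,000, and the "all categories" view also shows about 24,000, so I believe the real total is around 24,000. The remaining problem is to obtain the areaId of every area.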

4. Scraping area ids

Open the food page: https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1

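View the HTML source: the id of each area is again inside a <script crossorigin='anonymous'> tag. The browser does not display the data in full, so download the HTML and open it in an editor. This is again a matter of processing JSON. The only special case is 南澳新区 (Nan'ao New District), which has no sub-areas; I simply added it under 坪山区 (Pingshan District).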
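
Once the areaObj JSON is loaded, flattening it into a list of sub-areas and checking each count against the 1,005-record paging limit could look like the sketch below. flatten_areas is a hypothetical helper; it assumes that a district without sub-areas shows up as a list with a single entry, which is a guess about the raw data rather than something confirmed here. The second district in the sample is made up.

```python
def flatten_areas(area_obj, page_limit=1005):
    """Return (sub_areas, oversized) from an areaObj-style mapping."""
    sub_areas, oversized = [], []
    for entries in area_obj.values():
        if len(entries) == 1:
            # Assumed shape of a district without sub-areas: keep its only entry.
            candidates = entries
        else:
            candidates = entries[1:]   # entries[0] is the district-wide total
        for entry in candidates:
            sub_areas.append(entry)
            count = entry["count"] or 0   # count can be an empty string
            if count > page_limit:
                oversized.append(entry["name"])
    return sub_areas, oversized


area_obj = {
    "31": [{"id": 31, "name": "全部", "regionName": "盐田区", "count": 407},
           {"id": 754, "name": "大小梅沙", "regionName": "大小梅沙", "count": 36},
           {"id": 38055, "name": "溪涌", "regionName": "溪涌", "count": ""}],
    "demo": [{"id": 1, "name": "solo", "regionName": "solo", "count": 2000}],
}
sub_areas, oversized = flatten_areas(area_obj)
print([a["id"] for a in sub_areas], oversized)   # [754, 38055, 1] ['solo']
```

Any name reported in oversized would need a finer split (for example by food category) to stay under the paging limit.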

5. Scraping shop ids and ct_poi

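With every area id in hand, the POST request can be built directly to fetch shop data. The request needs a cookie; a single cookie was enough for the whole run. The response is JSON containing 15 shops, from which the shop id and ct_poi are extracted and saved to a local CSV file. After scraping, de-duplicate the records, treating rows with the same shop id as duplicates. The code also saves each shop's category, cateName, since the detail page does not seem to include it. The code is below; it should run after replacing the cookie. After de-duplication, 21,872 records remained.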

#coding=utf-8
import csv
import time
import requests
import json


#Scrape shop id, ctPoi and cateName for one area; the argument is the area id
def crow_id(areaid):
    id_list=[]
    url='https://meishi.meituan.com/i/api/channel/deal/list'
    head={'Host': 'meishi.meituan.com',
          'Accept': 'application/json',
          'Accept-Encoding': 'gzip, deflate, br',
          'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
          'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
          'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
           'Cookie':'XXXXXXXXXXXXXX'
                    }
    p = {'https': 'https://27.157.76.75:4275'}
    data={"uuid":"09dbb48e-4aed-4683-9ce5-c14b16ae7539","version":"8.3.3","platform":3,"app":"","partner":126,"riskLevel":1,"optimusCode":10,"originUrl":"http://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1","offset":0,"limit":15,"cateId":1,"lineId":0,"stationId":0,"areaId":areaid,"sort":"default","deal_attr_23":"","deal_attr_24":"","deal_attr_25":"","poi_attr_20043":"","poi_attr_20033":""}
    r=requests.post(url,headers=head,data=data,proxies=p)
    result=json.loads(r.text)
    totalcount=result['data']['poiList']['totalCount']  #total number of shops in this area, used to work out how many pages to fetch
    datas=result['data']['poiList']['poiInfos']
    print(len(datas),totalcount)
    for d in datas:
        d_list=['','','','']
        d_list[0]=d['name']
        d_list[1] = d['cateName']
        d_list[2] = d['poiid']
        d_list[3] = d['ctPoi']
        id_list.append(d_list)
    print('Page:1')
    #Save the data to a local csv file
    with open('meituan_id.csv','a', newline='',encoding='gb18030')as f:
        write=csv.writer(f)
        for i in id_list:
            write.writerow(i)

    #Fetch page 2 through the last page
    offset=0
    if totalcount>15:
        totalcount-=15
        while offset<totalcount:
            id_list = []
            offset+=15
            m=offset//15+1
            print('Page:%d'%m)
            #Build the POST parameters; changing offset turns the page
            data2 = {"uuid": "09dbb48e-4aed-4683-9ce5-c14b16ae7539", "version": "8.3.3", "platform": 3, "app": "",
                    "partner": 126, "riskLevel": 1, "optimusCode": 10,
                    "originUrl": "http://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1",
                    "offset": offset, "limit": 15, "cateId": 1, "lineId": 0, "stationId": 0, "areaId": areaid, "sort": "default",
                    "deal_attr_23": "", "deal_attr_24": "", "deal_attr_25": "", "poi_attr_20043": "", "poi_attr_20033": ""}
            try:
                r = requests.post(url, headers=head, data=data2,proxies=p)
                print(r.text)
                result = json.loads(r.text)
                datas = result['data']['poiList']['poiInfos']
                print(len(datas))
                for d in datas:
                    d_list = ['', '', '', '']
                    d_list[0] = d['name']
                    d_list[1] = d['cateName']
                    d_list[2] = d['poiid']
                    d_list[3] = d['ctPoi']
                    id_list.append(d_list)
                #Save to the local file
                with open('meituan_id.csv', 'a', newline='',encoding='gb18030')as f:
                    write = csv.writer(f)
                    for i in id_list:
                        write.writerow(i)
            except Exception as e:
                print(e)


if __name__=='__main__':
    #Area data copied straight from the page's HTML; Nan'ao New District has no sub-areas, so its entry was added by hand
    a = {"areaObj": {"28": [{"id": 28, "name": "全部", "regionName": "福田区", "count": 4022},
                            {"id": 1056, "name": "香蜜湖", "regionName": "香蜜湖", "count": 105},
                            {"id": 744, "name": "梅林", "regionName": "梅林", "count": 421},
                            {"id": 1055, "name": "上沙/下沙", "regionName": "上沙/下沙", "count": 291},
                            {"id": 2008, "name": "华强南", "regionName": "华强南", "count": 263},
                            {"id": 742, "name": "八卦岭/园岭", "regionName": "八卦岭/园岭", "count": 217},
                            {"id": 741, "name": "华强北", "regionName": "华强北", "count": 572},
                            {"id": 743, "name": "皇岗/水围", "regionName": "皇岗/水围", "count": 136},
                            {"id": 756, "name": "新城市广场", "regionName": "新城市广场", "count": 140},
                            {"id": 6595, "name": "车公庙", "regionName": "车公庙", "count": 305},
                            {"id": 6596, "name": "景田", "regionName": "景田", "count": 144},
                            {"id": 6597, "name": "新洲/石厦", "regionName": "新洲/石厦", "count": 374},
                            {"id": 6974, "name": "竹子林", "regionName": "竹子林", "count": 107},
                            {"id": 6975, "name": "市民中心", "regionName": "市民中心", "count": 39},
                            {"id": 7993, "name": "会展中心", "regionName": "会展中心", "count": 461},
                            {"id": 7994, "name": "岗厦", "regionName": "岗厦", "count": 110},
                            {"id": 7996, "name": "福田保税区", "regionName": "福田保税区", "count": 29}],
                     "29": [{"id": 29, "name": "全部", "regionName": "罗湖区", "count": 2191},
                            {"id": 6976, "name": "国贸", "regionName": "国贸", "count": 232},
                            {"id": 758, "name": "莲塘", "regionName": "莲塘", "count": 125},
                            {"id": 2009, "name": "笋岗", "regionName": "笋岗", "count": 159},
                            {"id": 748, "name": "翠竹路沿线", "regionName": "翠竹路沿线", "count": 42},
                            {"id": 745, "name": "东门", "regionName": "东门", "count": 484},
                            {"id": 746, "name": "宝安南路沿线", "regionName": "宝安南路沿线", "count": 67},
                            {"id": 757, "name": "火车站", "regionName": "火车站", "count": 96},
                            {"id": 6598, "name": "万象城", "regionName": "万象城", "count": 127},
                            {"id": 6599, "name": "喜荟城/水库", "regionName": "喜荟城/水库", "count": 99},
                            {"id": 7659, "name": "地王大厦", "regionName": "地王大厦", "count": 85},
                            {"id": 8469, "name": "黄贝岭", "regionName": "黄贝岭", "count": 136},
                            {"id": 8470, "name": "春风万佳/文锦渡", "regionName": "春风万佳/文锦渡", "count": 19},
                            {"id": 8471, "name": "布心/太白路", "regionName": "布心/太白路", "count": 154},
                            {"id": 8790, "name": "田贝/水贝", "regionName": "田贝/水贝", "count": 85},
                            {"id": 8794, "name": "银湖/泥岗", "regionName": "银湖/泥岗", "count": 37},
                            {"id": 8795, "name": "新秀/罗芳", "regionName": "新秀/罗芳", "count": 33},
                            {"id": 13080, "name": "梧桐山", "regionName": "梧桐山", "count": 34},
                            {"id": 14095, "name": "KK mall", "regionName": "KK mall", "count": 74}],
                     "30": [{"id": 30, "name": "全部", "regionName": "南山区", "count": 3905},
                            {"id": 751, "name": "南头", "regionName": "南头", "count": 325},
                            {"id": 750, "name": "华侨城", "regionName": "华侨城", "count": 126},
                            {"id": 749, "name": "蛇口", "regionName": "蛇口", "count": 9},
                            {"id": 1057, "name": "南油", "regionName": "南油", "count": 218},
                            {"id": 1058, "name": "科技园", "regionName": "科技园", "count": 460},
                            {"id": 1059, "name": "西丽", "regionName": "西丽", "count": 586},
                            {"id": 4811, "name": "南山中心区", "regionName": "南山中心区", "count": 635},
                            {"id": 6591, "name": "海岸城/保利", "regionName": "海岸城/保利", "count": 158},
                            {"id": 6592, "name": "前海", "regionName": "前海", "count": 32},
                            {"id": 6593, "name": "白石洲", "regionName": "白石洲", "count": 190},
                            {"id": 6594, "name": "欢乐海岸", "regionName": "欢乐海岸", "count": 22},
                            {"id": 7597, "name": "太古城", "regionName": "太古城", "count": 57},
                            {"id": 7599, "name": "花园城", "regionName": "花园城", "count": 42},
                            {"id": 13109, "name": "海上世界", "regionName": "海上世界", "count": 225},
                            {"id": 23117, "name": "世界之窗", "regionName": "世界之窗", "count": 97},
                            {"id": 25152, "name": "南山京基百纳", "regionName": "南山京基百纳", "count": 22},
                            {"id": 36635, "name": "深圳湾", "regionName": "深圳湾", "count": 17}],
                     "31": [{"id": 31, "name": "全部", "regionName": "盐田区", "count": 407},
                            {"id": 754, "name": "大小梅沙", "regionName": "大小梅沙", "count": 36},
                            {"id": 755, "name": "沙头角", "regionName": "沙头角", "count": 118},
                            {"id": 8789, "name": "东部华侨城", "regionName": "东部华侨城", "count": 11},
                            {"id": 8796, "name": "盐田海鲜食街", "regionName": "盐田海鲜食街", "count": 22},
                            {"id": 15349, "name": "壹海城", "regionName": "壹海城", "count": 51},
                            {"id": 38055, "name": "溪涌", "regionName": "溪涌", "count": ""}],
                     "32": [{"id": 32, "name": "全部", "regionName": "宝安区", "count": 6071},
                            {"id": 6587, "name": "西乡", "regionName": "西乡", "count": 15},
                            {"id": 6586, "name": "新安", "regionName": "新安", "count": 413},
                            {"id": 6585, "name": "石岩", "regionName": "石岩", "count": 466},
                            {"id": 752, "name": "宝安中心区", "regionName": "宝安中心区", "count": 458},
                            {"id": 4653, "name": "港隆城", "regionName": "港隆城", "count": 137},
                            {"id": 6588, "name": "沙井", "regionName": "沙井", "count": 824},
                            {"id": 6589, "name": "福永", "regionName": "福永", "count": 631},
                            {"id": 7684, "name": "松岗", "regionName": "松岗", "count": 435},
                            {"id": 7685, "name": "公明", "regionName": "公明", "count": 433},
                            {"id": 7719, "name": "海雅缤纷城", "regionName": "海雅缤纷城", "count": 125},
                            {"id": 7735, "name": "固戍", "regionName": "固戍", "count": 237},
                            {"id": 8006, "name": "桃源居", "regionName": "桃源居", "count": 25},
                            {"id": 14404, "name": "时代城", "regionName": "时代城", "count": 2},
                            {"id": 17088, "name": "罗田/燕川", "regionName": "罗田/燕川", "count": 45},
                            {"id": 17089, "name": "西田", "regionName": "西田", "count": 29},
                            {"id": 17091, "name": "圳美", "regionName": "圳美", "count": 32},
                            {"id": 17092, "name": "田寮/长圳", "regionName": "田寮/长圳", "count": 3},
                            {"id": 23524, "name": "沙井京基百纳", "regionName": "沙井京基百纳", "count": 98},
                            {"id": 27275, "name": "宝立方", "regionName": "宝立方", "count": 125},
                            {"id": 36634, "name": "宝安机场", "regionName": "宝安机场", "count": 244},
                            {"id": 37084, "name": "光明新区", "regionName": "光明新区", "count": 1}],
                     "33": [{"id": 33, "name": "全部", "regionName": "龙岗区", "count": 5193},
                            {"id": 753, "name": "罗岗/求水山", "regionName": "罗岗/求水山", "count": 145},
                            {"id": 6600, "name": "五和/民营市场", "regionName": "五和/民营市场", "count": 250},
                            {"id": 6601, "name": "平湖", "regionName": "平湖", "count": 356},
                            {"id": 7656, "name": "横岗", "regionName": "横岗", "count": 568},
                            {"id": 7658, "name": "南澳", "regionName": "南澳", "count": 32},
                            {"id": 7663, "name": "南联", "regionName": "南联", "count": 311},
                            {"id": 7664, "name": "坪地", "regionName": "坪地", "count": 131},
                            {"id": 8472, "name": "大运", "regionName": "大运", "count": 186},
                            {"id": 9013, "name": "李朗聚星商城", "regionName": "李朗聚星商城", "count": 63},
                            {"id": 13335, "name": "较场尾/大鹏所城", "regionName": "较场尾/大鹏所城", "count": 152},
                            {"id": 13358, "name": "水头", "regionName": "水头", "count": 20},
                            {"id": 13359, "name": "东涌", "regionName": "东涌", "count": 2},
                            {"id": 13361, "name": "万科广场/世贸", "regionName": "万科广场/世贸", "count": 107},
                            {"id": 13412, "name": "华南城/奥特莱斯", "regionName": "华南城/奥特莱斯", "count": 191},
                            {"id": 18069, "name": "大芬/南岭", "regionName": "大芬/南岭", "count": 359},
                            {"id": 18228, "name": "双龙", "regionName": "双龙", "count": 316},
                            {"id": 19456, "name": "慢城/三联", "regionName": "慢城/三联", "count": 111},
                            {"id": 19457, "name": "布吉街/东站/天虹", "regionName": "布吉街/东站/天虹", "count": 404},
                            {"id": 26297, "name": "天虹/坂田/杨美", "regionName": "天虹/坂田/杨美", "count": 344},
                            {"id": 26298, "name": "岗头/万科/雪象", "regionName": "岗头/万科/雪象", "count": 199},
                            {"id": 35919, "name": "华为坂田基地", "regionName": "华为坂田基地", "count": 9},
                            {"id": 36519, "name": "杨梅坑/桔钓沙", "regionName": "杨梅坑/桔钓沙", "count": 39},
                            {"id": 36520, "name": "葵涌", "regionName": "葵涌", "count": 37},
                            {"id": 36530, "name": "官湖", "regionName": "官湖", "count": 9},
                            {"id": 36531, "name": "西涌", "regionName": "西涌", "count": 49},
                            {"id": 36636, "name": "坪山高铁站", "regionName": "坪山高铁站", "count": 41},
                            {"id": 37501, "name": "龙岗中心城", "regionName": "龙岗中心城", "count": 365}],
                     "9553": [{"id": 9553, "name": "全部", "regionName": "龙华区", "count": 3080},
                              {"id": 1061, "name": "龙华", "regionName": "龙华", "count": 958},
                              {"id": 6584, "name": "民治", "regionName": "民治", "count": 164},
                              {"id": 7721, "name": "观澜", "regionName": "观澜", "count": 433},
                              {"id": 7722, "name": "大浪", "regionName": "大浪", "count": 398},
                              {"id": 9326, "name": "梅林关", "regionName": "梅林关", "count": 125},
                              {"id": 9327, "name": "锦绣江南", "regionName": "锦绣江南", "count": 33},
                              {"id": 36633, "name": "深圳北站", "regionName": "深圳北站", "count": 190},
                              {"id": 37723, "name": "龙华新区", "regionName": "龙华新区", "count": 14}],
                     "23420": [{"id": 23420, "name": "全部", "regionName": "坪山区", "count": 393},
                               {"id": 6602, "name": "坪山", "regionName": "坪山", "count": 232},
                               {"id": 23429, "name": "坑梓/竹坑", "regionName": "坑梓/竹坑", "count": 128},
                               {"id": 9535, "name": "南澳大鹏新区", "regionName": "南澳大鹏新区", "count": 91}]

                     }}

    datas = a['areaObj']
    b = datas.values()
    area_list=[]
    for data in b:
        for d in data[1:]:
            area_list.append(d)  #collect every sub-area in a list; each element is a dict
    l=0
    old=time.time()
    for i in area_list:
        l+=1
        print('Start scraping area %d:'%l,i['regionName'], 'shop count:',i['count'])
        try:
            crow_id(i['id'])
            now=time.time()-old
            print(i['name'],'done.','elapsed: %d s'%now)
        except Exception as e:
            print(e)
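
The de-duplication step mentioned above is not part of the listing. A minimal sketch of it, assuming the column order written by the script (name, cateName, poiid, ctPoi) and using an in-memory stand-in for meituan_id.csv with made-up rows; dedupe_rows is a hypothetical helper name:

```python
import csv
import io


def dedupe_rows(rows, key_index=2):
    """Keep the first row seen for each shop id (column key_index)."""
    seen = set()
    unique = []
    for row in rows:
        if len(row) <= key_index:
            continue   # skip malformed or empty rows
        shop_id = row[key_index]
        if shop_id not in seen:
            seen.add(shop_id)
            unique.append(row)
    return unique


# In-memory stand-in for meituan_id.csv; the rows are made up.
raw = "A,cafe,101,x\nB,bar,102,y\nA,cafe,101,z\n"
rows = list(csv.reader(io.StringIO(raw)))
unique = dedupe_rows(rows)
print(len(rows), len(unique))   # 3 2
```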

   

6. Scraping shop detail pages

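The detail-page URL can now be built, so the next step is simply to request it. It is a plain GET request, but it must carry a complete cookie; with an incomplete cookie a CAPTCHA appears very quickly. One cookie typically lasts about 1,000 requests before a CAPTCHA shows up, though sometimes only a few hundred. A requests Session did not seem to obtain the complete cookie, so this article uses selenium + Chrome: visit Meituan through a proxy IP, read the cookie, and return the cookie together with the IP for use in requests calls. In testing, after a CAPTCHA appeared, changing only the IP while keeping the same cookie was also enough to continue.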

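The code is in two parts: the main program, and a get_cookie file that obtains and processes the cookie and IP and also contains the detail-page parsing function. The cookie/IP function first fetches a proxy IP (I bought proxies), then visits the Meituan Shenzhen home page and sleeps for a few seconds. This pause is critical: it lets the page load completely, otherwise some cookies will be missing. It then visits the food page. Proxy quality varies, so test each IP before use. Here the time needed to load the food page is the test: more than 3 seconds is rejected and a new IP is fetched; less than 3 seconds is accepted. The cookies are then read and checked for completeness. The __utma, __utmc, and __utmz cookies are sometimes missing, and without them a CAPTCHA appears quickly; a complete set normally has 18 cookies. The parsing function is simple: it returns a flag, mark, and the shop information, info; the flag tells whether the fetch succeeded.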
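
The completeness check and the conversion to a header string can be sketched separately from the browser code. The functions below are hypothetical helpers that work on the list of dicts returned by Selenium's driver.get_cookies(); the cookie values are made up. Checking for the named cookies is a slightly stricter variant of the length check used in the listing, not a replacement the author tested.

```python
REQUIRED = ("__utma", "__utmc", "__utmz")


def cookies_to_header(cookies):
    """Turn Selenium-style cookie dicts into a Cookie header string."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def missing_cookies(cookies, required=REQUIRED):
    """Return the required cookie names that are absent."""
    names = {c["name"] for c in cookies}
    return [name for name in required if name not in names]


# Made-up values in the shape returned by driver.get_cookies().
cookies = [{"name": "uuid", "value": "abc"}, {"name": "__utma", "value": "1.2"}]
print(cookies_to_header(cookies))   # uuid=abc; __utma=1.2
print(missing_cookies(cookies))     # ['__utmc', '__utmz']
```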

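The main function uses multiple threads and is fairly simple: obtain an IP and cookie, then start scraping. The part that needs care is exception handling. There are two main kinds of exception. The first is a timeout: sleep briefly (1.5 s in the code) and try once more; if that also fails, count the record as failed, and after three consecutive failed records obtain a new IP and cookie. The second is the error "No connection could be made because the target machine actively refused it" (由于目标计算机积极拒绝,无法连接), which means requests were too frequent and the server has flagged the client; in that case a new IP and cookie are needed.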
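
The retry policy just described can be isolated from the networking code. The sketch below is an illustration only: fetch_with_retry and FailureCounter are hypothetical names, fetch stands in for a call that returns (mark, info) with mark 0 = ok, 1 = connection refused, 2 = timeout or other error, and refresh stands in for obtaining a new IP and cookie.

```python
import time


def fetch_with_retry(fetch, refresh, pause=1.5, sleep=time.sleep):
    """Run fetch() once, then retry according to the two error types."""
    mark, info = fetch()
    if mark == 2:
        sleep(pause)   # wait, then try once more with the same proxy
        mark, info = fetch()
    if mark == 1:
        refresh()      # refused: switch proxy and cookie, then retry
        mark, info = fetch()
    return mark, info


class FailureCounter:
    """Track consecutive failures and signal a refresh after `limit` in a row."""

    def __init__(self, limit=3):
        self.limit = limit
        self.count = 0

    def record(self, mark):
        """Return True when the caller should obtain a new proxy and cookie."""
        if mark == 0:
            self.count = 0
            return False
        self.count += 1
        if self.count >= self.limit:
            self.count = 0
            return True
        return False


# Simulated fetch: a timeout first, then success. No real sleeping or network.
results = iter([(2, []), (0, ["shop"])])
mark, info = fetch_with_retry(lambda: next(results), refresh=lambda: None,
                              sleep=lambda seconds: None)
print(mark, info)   # 0 ['shop']
```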

The get_cookie module code is as follows:

 

from selenium import webdriver
import requests
import time
import json
from lxml import etree
#Return a proxy IP and the matching cookie; the cookie is returned as a string. The IP is speed-tested first
def get_cookie():
    mark=0
    while mark==0:
        #URL of the paid proxy provider's API
        p_url = 'XXXXXXXXXXXXX'
        r = requests.get(p_url)
        html = json.loads(r.text)
        a = html['data'][0]['ip']
        b = html['data'][0]['port']
        val = '--proxy-server=http://' + str(a) + ':' + str(b)
        val2 = 'https://' + str(a) + ':' + str(b)
        p = {'https': val2}
        print('Got IP:',p)
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument(val)
        #Selenium 3 style call; newer Selenium versions expect options= and a Service object instead
        driver = webdriver.Chrome(executable_path=r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe',chrome_options=chrome_options)
        driver.set_page_load_timeout(8) #page-load timeout in seconds
        driver.set_script_timeout(8)
        url='https://i.meituan.com/shenzhen/'   #Meituan Shenzhen home page
        url2='https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1'#food page
        try:
            driver.get(url)
            time.sleep(2.5)
            c1=driver.get_cookies()
            now = time.time()
            driver.get(url2)
            tt=time.time()-now
            print(tt)
            time.sleep(0.5)
            #proxy speed test: reject the IP if the page took 3 s or more to load
            if tt < 3:
                c=driver.get_cookies()
                driver.quit()
                print('*******************')
                print(len(c1),len(c))
                #check that the cookie set is complete; the normal count is 18
                if len(c)>17:
                    mark=1
                    # print(c)
                    x={}
                    for line in c:
                        x[line['name']]=line['value']
                    #join the cookies into one string for the header; it is long, so it is built in two parts
                    co1='__mta='+x['__mta']+'; client-id='+x['client-id']+'; IJSESSIONID='+x['IJSESSIONID']+'; iuuid='+x['iuuid']+'; ci=30; cityname=%E6%B7%B1%E5%9C%B3; latlng=; webp=1; _lxsdk_cuid='+x['_lxsdk_cuid']+'; _lxsdk='+x['_lxsdk']
                    co2='; __utma='+x['__utma']+'; __utmc='+x['__utmc']+'; __utmz='+x['__utmz']+'; __utmb='+x['__utmb']+'; i_extend='+x['i_extend']+'; uuid='+x['uuid']+'; _hc.v='+x['_hc.v']+'; _lxsdk_s='+x['_lxsdk_s']
                    co=co1+co2
                    print(co)
                    return(p,co)
                else:
                    print('Cookies missing, count:',len(c))
            else:
                print('Too slow')
                driver.quit()
                time.sleep(3)
        except:
            driver.quit()
            pass


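#Parse a shop detail page; return the shop information info and a flag mark
#u holds the url and shop category, pc holds the proxy and cookie, m is the number scraped so far, n is the thread number, ll is the number of shops left, ttt is this thread's elapsed time
def parse(u,pc,m,n,ll,ttt):
    mesg='Thread:'+str(n)+' No:'+str(m)+' Time:'+str(ttt)+' left:'+str(ll)#progress message for this thread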
    url = u[0]
    cate = u[1]
    p=pc[0]
    cookie=pc[1]
    mark = 0 #flag: 0 means success, 1 and 2 are the two error types
    head = {'Host': 'meishi.meituan.com',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
            'Cookie':cookie
            }
    info = [] #shop information
    try:
        r = requests.get(url, headers=head, timeout=3, proxies=p)
        r.encoding = 'utf-8'
        html = etree.HTML(r.text)
        datas = html.xpath('body/script[@crossorigin="anonymous"]')
        for data in datas:
            try:
                strs = data.text[:16]
                if strs == 'window._appState':
                    result = data.text[19:-1]
                    result = json.loads(result)
                    name = result['poiInfo']['name']
                    addr = result['poiInfo']['addr']
                    phone = result['poiInfo']['phone']
                    aveprice = result['poiInfo']['avgPrice']
                    opentime = result['poiInfo']['openInfo']
                    opentime = opentime.replace('\n', ' ')
                    avescore = result['poiInfo']['avgScore']
                    marknum = result['poiInfo']['MarkNumbers']
                    lng = result['poiInfo']['lng']
                    lat = result['poiInfo']['lat']
                    info = [name, cate, addr, phone, aveprice, opentime, avescore, marknum, lng, lat]
                    print(url)
                    print(mesg,name, cate, addr, phone, aveprice, opentime, avescore, marknum, lng, lat)
            except:
                pass
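    except Exception  as e:
        print('Error  Thread:',n) #print the number of the failing thread
        print(e)
        s = str(e)[-22:-6]
        #this literal is the Chinese-locale Windows "connection refused" message and has to match the OS text
        if s == '由于目标计算机积极拒绝,无法连接':
            print('由于目标计算机积极拒绝,无法连接',n)
            mark=1   #type 1 error: switch to a new proxy
        else:
            mark=2   #type 2 error: try once more
    return(mark,info) #return the flag and the shop information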


The main module code is as follows:

import csv
import time
import threading
from get_cookie import get_cookie 
from get_cookie import parse

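lock=threading.Lock() #one lock shared by all threads, so that writes to the csv file are serialized

def crow(n,l): #n is the thread number, l is the shared list of urls
    sym=0 #count of consecutive failed records
    pc=get_cookie()  #get a proxy IP and cookie
    m=0 #number of records scraped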
    now=time.time()
    while True:
        if len(l)>0:
            u=l.pop(0)
            ll=len(l)
            m+=1
            ttt=time.time()-now
            result=parse(u,pc,m,n,ll,ttt)
            mark=result[0]
            info=result[1]
            if mark==2:
                time.sleep(1.5)
                result = parse(u, pc,m,n,ll,ttt)
                mark = result[0]
                info = result[1]
                if mark !=0:
                    sym+=1
            if mark==1:
                pc=get_cookie()
                result = parse(u, pc,m,n,ll,ttt)
                mark = result[0]
                info = result[1]
                if mark !=0:
                    sym+=1
            if mark==0: #fetch succeeded
                sym=0
                lock.acquire()
                with open('meituan.csv', 'a', newline='', encoding='gb18030')as f:
                    write = csv.writer(f)
                    write.writerow(info)
                lock.release()
            if sym>2: #three consecutive failures: get a new proxy and cookie
                sym=0
                pc=get_cookie()
        else:
            print('&&&&Thread %d finished'%n)
            break


if __name__=='__main__':
    url_list=[]
    #mt_id.csv: presumably the de-duplicated copy of meituan_id.csv from the previous step
    with open('mt_id.csv','r',encoding='gb18030')as f:
        read=csv.reader(f)
        for line in read:
            d_list=['','']
            url='https://meishi.meituan.com/i/poi/'+str(line[2])+'?ct_poi='+str(line[3])
            d_list[0]=url
            d_list[1]=line[1]
            url_list.append(d_list)
    th_list=[]
    for i in range(1,6):
        t=threading.Thread(target=crow,args=(i,url_list,))
        print('*****Thread %d starting...'%i)
        t.start()
        th_list.append(t)
        time.sleep(30)
    for t in th_list:
        t.join()
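
One caveat about sharing a plain list between threads: checking len(l) and then calling l.pop(0) are two separate steps, so another thread can take the last item in between and the pop raises IndexError. A possible alternative, sketched here with stand-in work instead of real requests, is queue.Queue plus a single lock shared by all workers. The names worker and run are hypothetical and this is not the code used for the results below.

```python
import queue
import threading


def worker(tasks, results, lock, handle):
    """Take items until the queue is empty; append results under a lock."""
    while True:
        try:
            item = tasks.get_nowait()   # atomic: no gap between check and pop
        except queue.Empty:
            return
        value = handle(item)
        with lock:                      # the same lock object for every thread
            results.append(value)


def run(items, handle, threads=5):
    """Process items with a small pool of threads and return the results."""
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)
    results, lock = [], threading.Lock()
    pool = [threading.Thread(target=worker, args=(tasks, results, lock, handle))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return results


out = run(range(100), lambda x: x * 2)
print(len(out), sorted(out)[:3])   # 100 [0, 2, 4]
```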

7. Results

With 5 threads the whole run should take about an hour. In the end 21,828 records were collected, so fewer than 50 records were lost.

My skills are limited, so corrections are welcome. If anyone has a way to scrape the desktop version, please let me know. Thank you.

 
