Ali1688 Crawler Practice (2)

The previous article covered the basics: an experiment pieced together from material found online. Its efficiency turned out to be too poor, so it was dropped. The second approach scrapes the data with Scrapy + XPath + MongoDB + a third-party IP proxy. Below is a quick analysis of the target pages.

 

Entering from the home page and searching item by item triggers a lot of pop-ups, so I considered using Selenium to simulate clicks to close the pop-ups, pass the verification, and so on. The efficiency was nothing to write home about, though; IP bans weren't that severe, but I didn't have time to burn, so in the end I decided to crawl 1688 suppliers directly by province/city.

Go into the supplier search and search for any province, then look at how the page is built. Searching for "甘肃" (Gansu), for example, gives a full URL like this:

https://s.1688.com/company/company_search.htm?keywords=%B8%CA%CB%E0&button_click=top&earseDirect=false&n=y&netType=1%2C11&pageSize=30&offset=3&beginPage=99

But the only parts we actually need are the keyword and the paging parameters:

https://s.1688.com/company/company_search.htm?keywords=%B8%CA%CB%E0&pageSize=30&offset=3&beginPage=99
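
The keywords parameter is the province name URL-encoded with GBK (%B8%CA%CB%E0 decodes to "甘肃"), and beginPage drives the pagination. A small sketch of how the start URL for a province could be built (the function name and defaults below are my own, not taken from the project):

from urllib.parse import urlencode

BASE = 'https://s.1688.com/company/company_search.htm'

def province_search_url(province, page=1, page_size=30):
    # 1688 expects the keyword to be GBK-encoded, not UTF-8
    params = urlencode({'keywords': province, 'pageSize': page_size,
                        'offset': 3, 'beginPage': page}, encoding='gbk')
    return BASE + '?' + params

print(province_search_url('甘肃'))
# https://s.1688.com/company/company_search.htm?keywords=%B8%CA%CB%E0&pageSize=30&offset=3&beginPage=1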

Then inspect the page and locate the elements that hold the data we want, for example the number of employees and the link to the detail page.
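
Before baking the expressions into the spider, they can be checked interactively with scrapy shell against the search page above (whether this returns data depends on 1688 serving the page without a login or anti-bot challenge), for example:

scrapy shell "https://s.1688.com/company/company_search.htm?keywords=%B8%CA%CB%E0&pageSize=30&offset=3&beginPage=1"
>>> lis = response.xpath('//div[@id="sw_mod_searchlist"]/ul[@class="sm-company-list fd-clr"]/li')
>>> len(lis)   # should match pageSize, i.e. 30 entries
>>> lis[0].xpath('.//div[@class="list-item-detail"]//div[@class="detail-left"]//div[3]/a/@href').get()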

After working out the XPath expressions, the full code is as follows:

        companylist = response.xpath('//div[@id="sw_mod_searchlist"]/ul[@class="sm-company-list fd-clr"]/li')
        for item in companylist:
            detailBlock = item.xpath(
                './/div[@class="list-item-left"]//div[@class="wrap"]//div[@class="list-item-detail"]')
            # URL of the detail page
            detailurl = detailBlock.xpath('.//div[@class="detail-left"]//div[3]/a[contains(@href, "")]/@href').extract()
            # number of employees, cleaned up by the utility methods
            comMember = self.getListWithDefault(detailBlock.xpath('string(.//div[@class="detail-left"]//div[3])').extract())[0]
            comMember = self.replaceSpace(comMember)
            print("detailurl>>>>", detailurl)
            yield scrapy.Request(detailurl[0], callback=self.parse_detail,
                                 meta={'detailurl': detailurl[0], 'comMember': comMember})

The data is handed to Scrapy's callback, which opens the detail page and extracts more fields. Element inspection and XPath parsing work the same as before. Note that Alibaba uses several different layouts for the detail page; I only handle two of them. The legal-representative details are rendered through image sprites positioned by pixel offsets, which I did not handle. Finally, all the data, together with what was passed along from the list page, is returned as an item.

    # Parse the detail page and extract more data
    def parse_detail(self, response):
        time.sleep(1.5)
        detailurl = response.meta['detailurl']
        comMember = response.meta['comMember']
        print("detailurl>>>",detailurl)
        companyTag = response.xpath('//h1[@class="company-name"]')
        # company name
        companyName = self.getListWithDefault(companyTag.xpath('./span/text()').extract())[0]
        companyName = self.replaceSpace(companyName)
        # years as a 诚信通 (Chengxintong) member
        loyaltyYears = \
            self.getListWithDefault(companyTag.xpath('./a[@class="icon icon-chengxintong"]/text()').extract())[0]
        # Chengxintong trust level
        loyaltyLevel = self.getListWithDefault(companyTag.xpath('./a[last()]/text()').extract())[0]

        contactTag = response.xpath('//div[@class="text company-contact"]')
        # contact person  J_STRENGTH_CompanyContact
        # (the global statement is not strictly necessary here; plain local variables would work)
        global contactPerson, telephone, mobile
        contactPerson = contactTag.xpath(
            '//div[@id="J_COMMON_CompanyContact"]/span[@class="contact-info"]/text()').extract()  # self.getListWithDefault()[0]
        contactPerson = self.getContactPerson(contactPerson)
        # two layouts with different element IDs; if one yields no data, fall back to the other
        if len(contactPerson) == 0:
            print('second contactPerson')
            contactPerson = \
                self.getListWithDefault(contactTag.xpath('//span[@id="J_STRENGTH_CompanyContact"]/text()').extract())[0]
            contactPerson = self.replaceSpace(contactPerson)
        # landline number
        telephone = \
            self.getListWithDefault(response.xpath('string(//span[@id="J_COMMON_CompanyInfoTelShow"])').extract())[0]
        telephone = self.replaceSpace(telephone)
        if len(telephone) == 0:
            print('second telephone')
            telephone = self.getListWithDefault(
                response.xpath('string(//span[@id="J_STRENGTH_CompanyInfoTelShow"])').extract())[0]
            telephone = self.replaceSpace(telephone)
        # mobile phone number
        mobile = \
            self.getListWithDefault(response.xpath('string(//div[@id="J_COMMON_CompanyInfoPhoneShow"])').extract())[0]
        mobile = self.replaceSpace(mobile)
        if len(mobile) == 0:
            print('second mobile')
            mobile = \
                self.getListWithDefault(
                    response.xpath('string(//span[@id="J_STRENGTH_CompanyInfoPhoneShow"])').extract())[
                    0]
            mobile = self.replaceSpace(mobile)

        # number of completed transactions
        translateNum = self.getListWithDefault(
            response.xpath('string(//div[@id="J_CompanyTradeCreditRecord"]/ul/li[1])').extract())[0]
        translateNum = self.replaceSpace(translateNum)
        # cumulative number of buyers
        buyerNum = self.getListWithDefault(
            response.xpath('string(//div[@id="J_CompanyTradeCreditRecord"]/ul/li[2])').extract())[0]
        buyerNum = self.replaceSpace(buyerNum)
        # registration date
        registerTime = self.getListWithDefault(response.xpath(
            'string(//div[@class="info-bottom"]//div[@class="info-box info-right"]//table/tr[1])').extract())[0]
        registerTime = self.replaceSpace(registerTime)
        # registered capital
        registerMoney = self.getListWithDefault(response.xpath(
            'string(//div[@class="info-bottom"]//div[@class="info-box info-right"]//table/tr[2])').extract())[0]
        registerMoney = self.replaceSpace(registerMoney)
        # business scope
        operateArea = self.getListWithDefault(response.xpath(
            'string(//div[@class="info-bottom"]//div[@class="info-box info-right"]//table/tr[3])').extract())[0]
        operateArea = self.replaceSpace(operateArea)
        # address
        address = self.getListWithDefault(response.xpath(
            'string(//div[@class="info-bottom"]//div[@class="info-box info-right"]//table/tr[4])').extract())[0]
        address = self.replaceSpace(address).replace("查看地图", "")

        companyCode = self.getListWithDefault(
            response.xpath('string(//div[@class="register-data"]//table/tbody/tr[3])').extract())[0]
        companyCode = companyCode.replace("法定代表人:", "")
        companyCode = self.replaceSpace(companyCode).replace("\xa0", "")
        companyArea = self.getListWithDefault(
            response.xpath('string(//li[@id="J_FCA_DepthInspectionTab_product"]/div/ul/li[2])').extract())[0]
        companyArea = self.replaceSpace(companyArea)
        equipmentNum = self.getListWithDefault(
            response.xpath('string(//li[@id="J_FCA_DepthInspectionTab_product"]/div/ul/li[3])').extract())[0]
        equipmentNum = self.replaceSpace(equipmentNum)
        dataitem = Ali1688SpiderItem()
        dataitem['detailurl'] = detailurl
        dataitem['companyName'] = companyName
        dataitem['loyaltyYears'] = loyaltyYears
        dataitem['loyaltyLevel'] = loyaltyLevel
        dataitem['contactPerson'] = contactPerson
        dataitem['telephone'] = telephone
        dataitem['mobile'] = mobile
        dataitem['translateNum'] = translateNum
        dataitem['buyerNum'] = buyerNum
        dataitem['registerTime'] = registerTime
        dataitem['registerMoney'] = registerMoney
        dataitem['operateArea'] = operateArea
        dataitem['address'] = address
        dataitem['companyCode'] = companyCode
        dataitem['starfNum'] = comMember
        dataitem['companyArea'] = companyArea
        dataitem['equipmentNum'] = equipmentNum

        yield dataitem
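
For reference, Ali1688SpiderItem needs one field per key assigned above. The definition below (presumably living in the project's items.py) is a sketch reconstructed from those assignments; only the field names come from the code above:

import scrapy

class Ali1688SpiderItem(scrapy.Item):
    # one field per key used in parse() / parse_detail()
    detailurl = scrapy.Field()
    companyName = scrapy.Field()
    loyaltyYears = scrapy.Field()
    loyaltyLevel = scrapy.Field()
    contactPerson = scrapy.Field()
    telephone = scrapy.Field()
    mobile = scrapy.Field()
    translateNum = scrapy.Field()
    buyerNum = scrapy.Field()
    registerTime = scrapy.Field()
    registerMoney = scrapy.Field()
    operateArea = scrapy.Field()
    address = scrapy.Field()
    companyCode = scrapy.Field()
    starfNum = scrapy.Field()
    companyArea = scrapy.Field()
    equipmentNum = scrapy.Field()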

The above is all the logic for crawling a single page, but every province has many pages of results, so we obviously need pagination.

My approach: read the total page count and the current page number; once all items on the current list page have been crawled, move on to the next page as long as the current page number is less than the total.

The code:

        # after all items on the current list page have been scheduled, turn the page
        total_page = response.xpath('//span[@class="total-page"]/text()').extract()[0]
        totalpage = re.sub(r"\D", "", total_page)  # keep only the digits
        cur_page = response.xpath('//span[@class="page-cur"]/text()').extract()[0]
        next_url = response.xpath('//a[@class="page-next"]/@href').extract_first()
        print("************** current page >>> %s *****************" % cur_page)
        if int(cur_page) < int(totalpage):
            time.sleep(1.5)
            # urljoin also works unchanged if the href is already absolute
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
        else:
            print("********************* finished crawling this province **************************")

That wraps up the spider logic; next comes the middleware.

First, get the rotating proxy working. There are two ways to switch proxy IPs: read them from a txt file, or request them directly from the provider's API. Both variants are shown below.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random
import time

import requests


class MyProxyMiddlleWare(object):

    def process_request(self, request, spider):
        # fetch a proxy address
        proxy = self.get_Random_Proxy()
        # attach it to the request
        request.meta['proxy'] = proxy

    def process_response(self, request, response, spider):
        # if this IP no longer works, switch to the next one
        if response.status != 200:
            proxy = self.get_Random_Proxy()
            print("this is response ip:" + proxy)
            # retry the current request through the new proxy
            request.meta['proxy'] = proxy
            return request
        return response

    def get_Random_Proxy(self):
        # variant 1: fetch an IP from the provider's API
        url = '第三方代理的接口地址'  # the proxy provider's API endpoint
        result = requests.get(url).json()
        if len(result["data"]) > 0:
            ip = result["data"][0]["ip"]
            port = result["data"][0]["port"]
            proxy = "http://" + str(ip) + ":" + str(port)
            print('current IP >>>', proxy)
            return proxy
        else:
            time.sleep(1)
            return self.get_Random_Proxy()

    # variant 2: read a random proxy from usefull_ip.txt
    # (point process_request / process_response at whichever variant you want to use)
    def get_Random_Proxy_FromFile(self):
        '''pick a random proxy from the file'''
        while 1:
            with open('usefull_ip.txt', 'r') as f:
                proxies = f.readlines()
                if proxies:
                    break
                else:
                    time.sleep(1)
        proxy = random.choice(proxies).strip()
        return proxy

Then just enable it in settings.py.
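
A sketch of the corresponding settings.py entry (the module path Ali1688Spider.middlewares and the priority 543 are assumptions based on the project name used in this post):

DOWNLOADER_MIDDLEWARES = {
    # enable the proxy middleware defined above
    'Ali1688Spider.middlewares.MyProxyMiddlleWare': 543,
}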

Finally, saving the data. The scraped data goes into MongoDB; the code follows. First, put the database settings in settings.py:

mongo_host = 'localhost'
mongo_port = 27017
mongo_name = 'guangdong'
mongo_db_collectio = 'shopinfo'

 

Then configure the pipeline in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from Ali1688Spider.settings import mongo_host, mongo_port, mongo_name, mongo_db_collectio

# Method 1: save data to MongoDB (connection details taken from settings.py)
class Ali1688SpiderPipeline(object):
    def __init__(self):
        host = mongo_host
        port = mongo_port
        dbname = mongo_name
        sheetname = mongo_db_collectio
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one() replaces insert(), which was removed in pymongo 4
        self.post.insert_one(data)
        return item

# Method 2: save data to MongoDB (connection settings injected via from_crawler;
# requires MONGO_URL and MONGO_DB in settings.py)
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        # the item class name doubles as the collection name
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self,spider):
        self.client.close()
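
Whichever pipeline you use also has to be registered in settings.py. A sketch, assuming the project layout implied by the import above (module Ali1688Spider.pipelines; the priority value 300 is just a conventional choice):

ITEM_PIPELINES = {
    'Ali1688Spider.pipelines.Ali1688SpiderPipeline': 300,
}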

For other details, such as setting up the MongoDB environment, see the MongoDB tutorial on runoob.com (菜鸟教程).

Saving the data is not the end of it. A decent coder still has to clean up: the crawl will inevitably fetch duplicate records, and an interrupted and restarted spider will leave duplicates in the database. So the last step is deduplication.

Here I create a new database dhdshop and insert a few records; the commands are as follows:

show dbs                      // list all databases
use dhdshop                   // switch to the newly created database
doc = {"companyName": "湖南高新科技有限公司"}
db.dhdshop.insert(doc)        // insert the doc; repeat it three times, then insert some other records the same way

db.dhdshop.find()             // show all documents

// group by companyName and delete records that share the same company name, keeping one of each

db.dhdshop.aggregate([
    {
        $group: { _id: {companyName: '$companyName'},count: {$sum: 1},dups: {$addToSet: '$_id'}}
    },
    {
        $match: {count: {$gt: 1}}
    }
]).forEach(function(doc){
    doc.dups.shift();
    db.dhdshop.remove({_id: {$in: doc.dups}});
})

The end result: all duplicate records of "华中科技大学高兴科技有限公司", "深圳高新科技有限公司" and "湖南高新科技有限公司" are removed, with exactly one record of each left.

Remember to back up the database before running this kind of cleanup.
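
If you prefer to run the deduplication from Python rather than the mongo shell, the same aggregation can be driven with pymongo. A sketch equivalent to the shell script above (database and collection names follow the dhdshop example; this script is not part of the original project):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
coll = client['dhdshop']['dhdshop']

# group by companyName, collect each group's _ids, keep only groups with more than one document
pipeline = [
    {'$group': {'_id': {'companyName': '$companyName'},
                'count': {'$sum': 1},
                'dups': {'$addToSet': '$_id'}}},
    {'$match': {'count': {'$gt': 1}}},
]

for group in coll.aggregate(pipeline):
    dup_ids = group['dups']
    dup_ids.pop(0)                                   # keep one document per group
    coll.delete_many({'_id': {'$in': dup_ids}})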

Finally, a few other utility snippets:

# Check whether the IPs returned by the proxy API can actually reach 1688;
# working proxies are printed to the console.
# ('head' is a request-headers dict defined elsewhere in the script.)
def textIpIsUsefullFromJson():
    shoplist_url = 'https://s.1688.com/company/company_search.htm?keywords=%BA%A3%C4%CF&pageSize=30&offset=3&beginPage=1'
    shopdetail_url = 'https://litree.1688.com/page/creditdetail.htm?spm=b26110380.2178313.result.13.78a3a06dnwTa9o'

    url = '代理接口'  # the proxy provider's API endpoint
    result = requests.get(url).json()
    if len(result["data"]) > 0:
        for i in range(0, len(result["data"])):
            ip = result["data"][i]["ip"]
            port = result["data"][i]["port"]
            proxy = "http://" + str(ip) + ":" + str(port)
            plist = requests.get(shoplist_url, headers=head, proxies={"http": proxy})
            if plist.status_code == 200:
                pdetail = requests.get(shopdetail_url, headers=head, proxies={"http": proxy})
                if pdetail.status_code == 200:
                    print(proxy)
    else:
        print('result is None', result)
        textIpIsUsefullFromJson()
    time.sleep(1)
    textIpIsUsefullFromJson()  # keep polling the API for fresh IPs

The working IPs can also be written straight to a txt file:

# Save the proxies returned by the API to usefull_ip.txt
def writeIpFormJsonToText():
    url = '代理ip接口'  # the proxy provider's API endpoint
    result = requests.get(url).json()
    for i in range(0, len(result["data"])):
        ip = result["data"][i]["ip"]
        port = result["data"][i]["port"]
        proxy = "http://" + str(ip) + ":" + str(port)
        with open('usefull_ip.txt', 'a', newline='') as f:
            f.write(proxy + '\n')
        print(proxy)

    # Utility: strip spaces and control characters; None becomes "--"
    def replaceSpace(self, params):
        if params is None:
            return "--"
        else:
            newstr = params.strip()
            newstr = newstr.replace(" ", "")
            newstr = newstr.replace("\r", "")
            newstr = newstr.replace("\n", "")
            newstr = newstr.replace("\t", "")
            return newstr

    # Utility: return a default value in place of an empty list
    def getListWithDefault(self, mylist):
        if isinstance(mylist, list):
            if mylist:
                return mylist
            else:
                # list.insert() returns None, so return a new list instead
                return ["no_data"]
        else:
            print("mylist is not list>>>", mylist)
            return self.replaceSpace(mylist)
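
A quick, made-up illustration of how the two helpers behave (spider stands for an instance of the spider class; the values are invented for the example):

spider.getListWithDefault([])             # -> ['no_data']
spider.getListWithDefault(['30人'])       # -> ['30人']  (non-empty lists pass through)
spider.replaceSpace('  30 人\r\n\t')      # -> '30人'
spider.replaceSpace(None)                 # -> '--'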

Link to the complete project code (with the MongoDB and Scrapy environments set up and the relevant pieces swapped in, it can be run directly): 爬虫完整代码

Also linked: 崔庆才 (Cui Qingcai)'s review of 10 paid proxy IP providers

Scrapy API documentation

XPath reference
