The previous article covered the basics: an experiment pieced together from material found online. Its efficiency turned out to be too poor, so it was dropped outright. The second approach uses scrapy + xpath + mongodb + a third-party IP proxy to crawl the data. Let's start with a quick analysis of the pages.
Searching item by item from the home page triggers lots of pop-ups, so I considered using selenium to simulate the clicks that dismiss the pop-ups, pass the verification, and so on. But the efficiency was nothing to write home about: IP bans were not that severe, yet I didn't have that much time to burn. In the end I decided to crawl 1688 suppliers directly, province by province.
Go to the supplier search, search for any province, and look at the page structure. For example, searching for 甘肃 (Gansu) produces a full URL like this:
https://s.1688.com/company/company_search.htm?keywords=%B8%CA%CB%E0&button_click=top&earseDirect=false&n=y&netType=1%2C11&pageSize=30&offset=3&beginPage=99
But only this part matters for our purposes: the keyword plus the pagination parameters.
https://s.1688.com/company/company_search.htm?keywords=%B8%CA%CB%E0&pageSize=30&offset=3&beginPage=99
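One detail worth calling out: the keywords parameter is not UTF-8. %B8%CA%CB%E0 is 甘肃 percent-encoded as GBK. A minimal sketch for building the list-page URL from Python (the helper name build_search_url is mine, not from the original project):

from urllib.parse import quote

def build_search_url(province, page=1, page_size=30):
    # 1688 expects the keyword percent-encoded in GBK, not UTF-8
    kw = quote(province, encoding='gbk')
    return ('https://s.1688.com/company/company_search.htm'
            '?keywords={}&pageSize={}&offset=3&beginPage={}').format(kw, page_size, page)

# quote('甘肃', encoding='gbk') -> '%B8%CA%CB%E0', matching the URL above
print(build_search_url('甘肃', page=99))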
Then inspect the page and locate the elements holding the data we want, for example the employee count and the detail-page link.
After working out the xpath expressions, the full code looks like this:
companylist = response.xpath('//div[@id="sw_mod_searchlist"]/ul[@class="sm-company-list fd-clr"]/li')
for item in companylist:
    detailBlock = item.xpath(
        './/div[@class="list-item-left"]//div[@class="wrap"]//div[@class="list-item-detail"]')
    detailurl = detailBlock.xpath('.//div[@class="detail-left"]//div[3]/a/@href').extract()  # detail-page url
    comMember = self.getListWithDefault(detailBlock.xpath('string(.//div[@class="detail-left"]//div[3])').extract())[0]
    comMember = self.replaceSpace(comMember)  # employee count, cleaned up by the helper method
    print("detailurl>>>>", detailurl)
    yield scrapy.Request(detailurl[0], callback=self.parse_detail, meta={'detailurl': detailurl[0], 'comMember': comMember})
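For context, this snippet lives inside the spider's parse method. The original post doesn't show the class header, so here is a minimal sketch of the wrapper (the class name, spider name, and start URL are my assumptions):

import scrapy

class Ali1688Spider(scrapy.Spider):
    # names below are assumptions; the original post never shows the class header
    name = 'ali1688'
    allowed_domains = ['1688.com']
    # one start URL per province keyword (keywords must be GBK percent-encoded, see above)
    start_urls = [
        'https://s.1688.com/company/company_search.htm?keywords=%B8%CA%CB%E0&pageSize=30&offset=3&beginPage=1',
    ]

    def parse(self, response):
        # the list-page parsing snippet above goes here,
        # followed by the pagination snippet shown later
        ...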
The data is handed to scrapy's callback, which enters the detail page to grab more fields. Element inspection and xpath parsing work the same as before. Two things to note: Ali serves the detail page in several different layouts, and I only handle two of them here; and the legal-representative info is rendered via pixel-positioned image slices, which I didn't process. Finally, all the fields, together with the data passed in from the previous page, are packed into the item and yielded.
# parse the detail page and extract the data
def parse_detail(self, response):
    time.sleep(1.5)
    detailurl = response.meta['detailurl']
    comMember = response.meta['comMember']
    print("detailurl>>>", detailurl)
    companyTag = response.xpath('//h1[@class="company-name"]')
    # company name
    companyName = self.getListWithDefault(companyTag.xpath('./span/text()').extract())[0]
    companyName = self.replaceSpace(companyName)
    # years as a 诚信通 (TrustPass) member
    loyaltyYears = self.getListWithDefault(
        companyTag.xpath('./a[@class="icon icon-chengxintong"]/text()').extract())[0]
    # credit level
    loyaltyLevel = self.getListWithDefault(companyTag.xpath('./a[last()]/text()').extract())[0]
    contactTag = response.xpath('//div[@class="text company-contact"]')
    # contact person (J_COMMON_CompanyContact / J_STRENGTH_CompanyContact)
    contactPerson = contactTag.xpath(
        '//div[@id="J_COMMON_CompanyContact"]/span[@class="contact-info"]/text()').extract()
    contactPerson = self.getContactPerson(contactPerson)
    # there are two layouts with different IDs; if one yields nothing, try the other
    if len(contactPerson) == 0:
        print('second contactPerson')
        contactPerson = self.getListWithDefault(
            contactTag.xpath('//span[@id="J_STRENGTH_CompanyContact"]/text()').extract())[0]
        contactPerson = self.replaceSpace(contactPerson)
    # landline
    telephone = self.getListWithDefault(
        response.xpath('string(//span[@id="J_COMMON_CompanyInfoTelShow"])').extract())[0]
    telephone = self.replaceSpace(telephone)
    if len(telephone) == 0:
        print('second telephone')
        telephone = self.getListWithDefault(
            response.xpath('string(//span[@id="J_STRENGTH_CompanyInfoTelShow"])').extract())[0]
        telephone = self.replaceSpace(telephone)
    # mobile number
    mobile = self.getListWithDefault(
        response.xpath('string(//div[@id="J_COMMON_CompanyInfoPhoneShow"])').extract())[0]
    mobile = self.replaceSpace(mobile)
    if len(mobile) == 0:
        print('second mobile')
        mobile = self.getListWithDefault(
            response.xpath('string(//span[@id="J_STRENGTH_CompanyInfoPhoneShow"])').extract())[0]
        mobile = self.replaceSpace(mobile)
    # number of completed transactions
    translateNum = self.getListWithDefault(
        response.xpath('string(//div[@id="J_CompanyTradeCreditRecord"]/ul/li[1])').extract())[0]
    translateNum = self.replaceSpace(translateNum)
    # cumulative number of buyers
    buyerNum = self.getListWithDefault(
        response.xpath('string(//div[@id="J_CompanyTradeCreditRecord"]/ul/li[2])').extract())[0]
    buyerNum = self.replaceSpace(buyerNum)
    # registration date
    registerTime = self.getListWithDefault(response.xpath(
        'string(//div[@class="info-bottom"]//div[@class="info-box info-right"]//table/tr[1])').extract())[0]
    registerTime = self.replaceSpace(registerTime)
    # registered capital
    registerMoney = self.getListWithDefault(response.xpath(
        'string(//div[@class="info-bottom"]//div[@class="info-box info-right"]//table/tr[2])').extract())[0]
    registerMoney = self.replaceSpace(registerMoney)
    # business scope
    operateArea = self.getListWithDefault(response.xpath(
        'string(//div[@class="info-bottom"]//div[@class="info-box info-right"]//table/tr[3])').extract())[0]
    operateArea = self.replaceSpace(operateArea)
    # address
    address = self.getListWithDefault(response.xpath(
        'string(//div[@class="info-bottom"]//div[@class="info-box info-right"]//table/tr[4])').extract())[0]
    address = self.replaceSpace(address).replace("查看地图", "")  # strip the "view map" link text
    # third row of the registration table, with the "legal representative:" label stripped
    companyCode = self.getListWithDefault(
        response.xpath('string(//div[@class="register-data"]//table/tbody/tr[3])').extract())[0]
    companyCode = companyCode.replace("法定代表人:", "")
    companyCode = self.replaceSpace(companyCode).replace("\xa0", "")
    # plant area, from the deep-inspection tab
    companyArea = self.getListWithDefault(
        response.xpath('string(//li[@id="J_FCA_DepthInspectionTab_product"]/div/ul/li[2])').extract())[0]
    companyArea = self.replaceSpace(companyArea)
    # equipment count, from the deep-inspection tab
    equipmentNum = self.getListWithDefault(
        response.xpath('string(//li[@id="J_FCA_DepthInspectionTab_product"]/div/ul/li[3])').extract())[0]
    equipmentNum = self.replaceSpace(equipmentNum)
    dataitem = Ali1688SpiderItem()
    dataitem['detailurl'] = detailurl
    dataitem['companyName'] = companyName
    dataitem['loyaltyYears'] = loyaltyYears
    dataitem['loyaltyLevel'] = loyaltyLevel
    dataitem['contactPerson'] = contactPerson
    dataitem['telephone'] = telephone
    dataitem['mobile'] = mobile
    dataitem['translateNum'] = translateNum
    dataitem['buyerNum'] = buyerNum
    dataitem['registerTime'] = registerTime
    dataitem['registerMoney'] = registerMoney
    dataitem['operateArea'] = operateArea
    dataitem['address'] = address
    dataitem['companyCode'] = companyCode
    dataitem['starfNum'] = comMember
    dataitem['companyArea'] = companyArea
    dataitem['equipmentNum'] = equipmentNum
    yield dataitem
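The Ali1688SpiderItem used above needs one field per key. The original post doesn't show items.py, so this is a reconstruction from the assignments in parse_detail:

import scrapy

class Ali1688SpiderItem(scrapy.Item):
    detailurl = scrapy.Field()
    companyName = scrapy.Field()
    loyaltyYears = scrapy.Field()
    loyaltyLevel = scrapy.Field()
    contactPerson = scrapy.Field()
    telephone = scrapy.Field()
    mobile = scrapy.Field()
    translateNum = scrapy.Field()
    buyerNum = scrapy.Field()
    registerTime = scrapy.Field()
    registerMoney = scrapy.Field()
    operateArea = scrapy.Field()
    address = scrapy.Field()
    companyCode = scrapy.Field()
    starfNum = scrapy.Field()  # employee count; spelling kept to match the spider
    companyArea = scrapy.Field()
    equipmentNum = scrapy.Field()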
That covers crawling a single page, but every province has many pages of data, so we obviously need pagination.
My approach: read the total page count and the current page number; once all the list items on the current page have been crawled, turn the page if the current page number is below the total.
The code:
# after crawling the detail pages of all items on the current page, turn the page
# (requires "import re" at the top of the spider)
total_page = response.xpath('//span[@class="total-page"]/text()').extract()[0]
totalpage = re.sub(r"\D", "", total_page)  # keep only the digits
cur_page = response.xpath('//span[@class="page-cur"]/text()').extract()[0]
next_url = response.xpath('//a[@class="page-next"]/@href').extract_first()
print("************** current page >>> page " + cur_page + " *****************")
if int(cur_page) < int(totalpage):
    time.sleep(1.5)
    yield scrapy.Request(url=next_url, callback=self.parse)
else:
    print("********************* finished crawling this province **************************")
That completes the spider logic. Next, the middleware.
First, sort out the rotating proxy IPs. There are two ways to switch proxies: save the IPs to a txt file, or request them directly from the provider's API.
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random
import time

import requests


class MyProxyMiddlleWare(object):
    def process_request(self, request, spider):
        # get a proxy address
        proxy = self.get_Random_Proxy()
        # attach the proxy to the request
        request.meta['proxy'] = proxy

    def process_response(self, request, response, spider):
        # if this IP doesn't work, switch to the next one
        if response.status != 200:
            proxy = self.get_Random_Proxy()
            print("this is response ip:" + proxy)
            # re-issue the current request through the new proxy
            request.meta['proxy'] = proxy
            return request
        return response

    # Option 1: fetch an IP from the provider's API
    def get_Random_Proxy(self):
        url = '第三方代理的接口地址'  # placeholder: your proxy provider's API endpoint
        result = requests.get(url).json()
        if len(result["data"]) > 0:
            ip = result["data"][0]["ip"]
            port = result["data"][0]["port"]
            proxy = "http://" + str(ip) + ":" + str(port)
            print('current IP >>>', proxy)
            return proxy
        else:
            time.sleep(1)
            return self.get_Random_Proxy()

    # Option 2: read a random proxy from the txt file
    # (rename this to get_Random_Proxy if you use this variant instead of option 1)
    def get_Random_Proxy_from_file(self):
        while 1:
            with open('usefull_ip.txt', 'r') as f:
                proxies = f.readlines()
            if proxies:
                break
            else:
                time.sleep(1)
        proxy = random.choice(proxies).strip()
        return proxy
Then just wire it up in settings and you're done.
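A minimal sketch of that settings entry, assuming the middleware lives in a module named Ali1688Spider.middlewares (the module path and priority value are my assumptions). Note that a proxy switcher implementing process_request/process_response belongs in DOWNLOADER_MIDDLEWARES, not SPIDER_MIDDLEWARES:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # module path is an assumption; point this at wherever MyProxyMiddlleWare lives
    'Ali1688Spider.middlewares.MyProxyMiddlleWare': 543,
}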
Finally, saving the data. Once pulled down, it goes into MongoDB. The code follows; first, configure the database info in settings:
mongo_host = 'localhost'
mongo_port = 27017
mongo_name = 'guangdong'
mongo_db_collectio = 'shopinfo'
Then set it up in the pipeline:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from Ali1688Spider.settings import mongo_host, mongo_port, mongo_name, mongo_db_collectio


# saving to MongoDB, option 1
class Ali1688SpiderPipeline(object):
    def __init__(self):
        host = mongo_host
        port = mongo_port
        dbname = mongo_name
        sheetname = mongo_db_collectio
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.post.insert_one(data)  # insert_one; the bare insert() was removed in pymongo 4
        return item


# saving to MongoDB, option 2
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
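As the header comment says, the pipeline also has to be registered in ITEM_PIPELINES. A sketch, again assuming the project module is named Ali1688Spider (the priority and the MONGO_URL/MONGO_DB values are my assumptions):

# settings.py
ITEM_PIPELINES = {
    'Ali1688Spider.pipelines.Ali1688SpiderPipeline': 300,
}
# option 2 additionally reads these two settings
MONGO_URL = 'mongodb://localhost:27017'
MONGO_DB = 'guangdong'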
For everything else, such as setting up the MongoDB environment itself, see the MongoDB tutorial on 菜鸟教程 (runoob.com).
Saving the data still isn't the end. A decent coder cleans up after themselves: the crawl will inevitably pick up duplicate records, and if the spider crashes and restarts, the database ends up with repeated rows. So the last step is deduplication.
To demonstrate, I create a new database called dhdshop and insert a few documents, using these commands:
show dbs                      // list all databases
use dhdshop                   // switch to the new database
doc = {"companyName": "湖南高新科技有限公司"}
db.dhdshop.insert(doc)        // insert; run it three times, then insert some other companies the same way
db.dhdshop.find()             // show all documents
// group by companyName and delete the documents with duplicate company names, keeping one of each
db.dhdshop.aggregate([
    {
        $group: {_id: {companyName: '$companyName'}, count: {$sum: 1}, dups: {$addToSet: '$_id'}}
    },
    {
        $match: {count: {$gt: 1}}
    }
]).forEach(function(doc){
    doc.dups.shift();         // keep the first _id, remove the rest
    db.dhdshop.remove({_id: {$in: doc.dups}});
})
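If you'd rather run the same deduplication from Python instead of the mongo shell, here is a minimal pymongo sketch of the identical group-and-remove logic (database and collection names as above):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
coll = client['dhdshop']['dhdshop']

# group by companyName and collect the _ids of each duplicate set
pipeline = [
    {'$group': {'_id': '$companyName', 'count': {'$sum': 1}, 'dups': {'$addToSet': '$_id'}}},
    {'$match': {'count': {'$gt': 1}}},
]
for doc in coll.aggregate(pipeline):
    dups = doc['dups']
    dups.pop(0)  # keep one document, delete the rest
    coll.delete_many({'_id': {'$in': dups}})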
The end result: all duplicates of "华中科技大学高兴科技有限公司", "深圳高新科技有限公司", and "湖南高新科技有限公司" are removed, with exactly one document of each kept.
Remember to back up the database before running any of this (mongodump is handy for that).
A few other utility pieces of code:
# check whether the IPs returned by the proxy API can actually reach 1688;
# working proxies are printed to the console
import time

import requests

head = {'User-Agent': 'Mozilla/5.0'}  # request headers; fill in whatever UA you use

def textIpIsUsefullFromJson():
    shoplist_url = 'https://s.1688.com/company/company_search.htm?keywords=%BA%A3%C4%CF&pageSize=30&offset=3&beginPage=1'
    shopdetail_url = 'https://litree.1688.com/page/creditdetail.htm?spm=b26110380.2178313.result.13.78a3a06dnwTa9o'
    url = '代理接口'  # placeholder: your proxy provider's API
    result = requests.get(url).json()
    if len(result["data"]) > 0:
        for i in range(0, len(result["data"])):
            ip = result["data"][i]["ip"]
            port = result["data"][i]["port"]
            proxy = "http://" + str(ip) + ":" + str(port)
            plist = requests.get(shoplist_url, headers=head, proxies={"http": proxy})
            if plist.status_code == 200:
                pdetail = requests.get(shopdetail_url, headers=head, proxies={"http": proxy})
                if pdetail.status_code == 200:
                    print(proxy)
    else:
        print('result is None', result)

textIpIsUsefullFromJson()
time.sleep(1)
textIpIsUsefullFromJson()
You can also write the working proxies straight to a txt file:
# save the proxies to usefull_ip.txt
def writeIpFormJsonToText():
    url = '代理ip接口'  # placeholder: your proxy provider's API
    result = requests.get(url).json()
    for i in range(0, len(result["data"])):
        ip = result["data"][i]["ip"]
        port = result["data"][i]["port"]
        proxy = "http://" + str(ip) + ":" + str(port)
        with open('usefull_ip.txt', 'a', newline='') as f:
            f.write(proxy + '\n')
        print(proxy)
# helper: strip spaces and control characters
def replaceSpace(self, params):
    if params is None:
        return "--"
    newstr = params.strip()
    newstr = newstr.replace(" ", "")
    newstr = newstr.replace("\r", "")
    newstr = newstr.replace("\n", "")
    newstr = newstr.replace("\t", "")
    return newstr

# helper: give an empty list a default value
def getListWithDefault(self, mylist):
    if isinstance(mylist, list):
        if not mylist:
            # list.insert() returns None, so insert first, then return the list
            mylist.insert(0, "no_data")
        return mylist
    else:
        print("mylist is not list>>>", mylist)
        return self.replaceSpace(mylist)
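The spider also calls a getContactPerson helper that the post never shows. Judging from how it's used (it receives the extract() list and must return something whose len() can be checked), a plausible sketch might look like this; the body is my guess, not the original implementation:

# hypothetical reconstruction of the missing helper: return the first
# cleaned-up entry, or "" so that len() == 0 triggers the fallback layout
def getContactPerson(self, mylist):
    if isinstance(mylist, list) and mylist:
        return self.replaceSpace(mylist[0])
    return ""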
Portal to the complete project code (with the MongoDB and scrapy environments configured and the relevant settings swapped in, it runs out of the box): 爬虫完整代码 (the complete crawler source)