After working through the previous case study (CrawlSpider爬虫之爬取17k小说网列表详情及章节并放在一起(CrawlSpider翻页、MongoDB)-CSDN博客), this one was much easier. The video tutorial keeps it simple too; after all, it's a small hands-on demo for getting started with CrawlSpider. The tutorial really is thoughtfully made.
One snag: the Lancôme brand link on Jumei (聚美优品) wouldn't open for me and returned a 404. Maybe it got crawled to death 😄……
So I went with the Estée Lauder brand instead, and the brand dropdown only shows up on pages other than the home page. Look what the crawlers have done to poor Jumei: they don't even dare put a dropdown menu on the home page anymore~~~~
Enough chatter, I'm swamped…… on to the code.
app.py
from typing import Iterable
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import jumei_product
class AppSpider(CrawlSpider):
    name = "app"
    # start_urls is effectively unused here, since start_requests() below overrides it
    start_urls = [
        "http://search.jumei.com/?filter=0-11-1&search=%E9%9B%85%E8%AF%97%E5%85%B0%E9%BB%9B&bid=4&site=sh"]

    rules = (
        # Pull product-detail links out of each list page; process_links trims the
        # link list before the requests are scheduled
        Rule(LinkExtractor(allow=r"http://item.jumeiglobal.com/(.+)\.html",
                           restrict_xpaths=('//div[@class="s_l_pic"]/a')),
             callback="parse_detail", follow=False, process_links="process_detail"),
    )

    def start_requests(self) -> Iterable[Request]:
        # Pages 1-3 of the Estée Lauder search results (the page number sits in filter=0-11-<page>)
        max_page = 4
        for i in range(1, max_page):
            url = "http://search.jumei.com/?filter=0-11-" + str(
                i) + "&search=%E9%9B%85%E8%AF%97%E5%85%B0%E9%BB%9B&bid=4&site=sh"
            yield Request(url)

    def process_detail(self, links):
        # List page: keep only the first 5 product links per page
        for index, link in enumerate(links):
            if index < 5:
                yield link
            else:
                return

    def parse_detail(self, response):
        # Product detail fields, read from the spec table on the detail page
        title = response.xpath('//div[@class="deal_con_content"]//tr[1]/td[2]/span/text()').get()
        category = response.xpath('//div[@class="deal_con_content"]//tr[4]/td[2]/span/text()').get()
        address = response.xpath('//div[@class="deal_con_content"]//tr[6]/td[2]/span/text()').get()
        expired = response.xpath('//div[@class="deal_con_content"]//tr[8]/td[2]/span/text()').get()

        item = jumei_product()
        item["title"] = title
        item["category"] = category
        item["address"] = address
        item["expired"] = expired
        yield item
Each list page contributes 5 products, and the spider loops over 3 list pages.
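The spider is launched with scrapy crawl app from the project root. If you'd rather start it from a plain Python script, a minimal runner could look like the sketch below (the filename run.py and its placement at the project root are my assumptions, not part of the original post).
run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    # Load settings.py so the MongoDB pipeline is picked up
    process = CrawlerProcess(get_project_settings())
    process.crawl("app")  # spider name, as set in AppSpider.name
    process.start()  # blocks until the crawl finishes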
items.py
import scrapy
class jumei_product(scrapy.Item):
    title = scrapy.Field()
    category = scrapy.Field()
    address = scrapy.Field()
    expired = scrapy.Field()
And the pipeline that writes to the database, pipelines.py
import pymongo
class Scrapy02Pipeline:
    def __init__(self):
        print("-" * 10, "start", "-" * 10)
        self.res = None
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["jumei"]
        self.collection = self.db["landai"]
        # Empty the collection so every run starts from a clean slate
        self.collection.delete_many({})

    def process_item(self, item, spider):
        # Insert each scraped item as one MongoDB document
        self.res = self.collection.insert_one(dict(item))
        # print(self.res.inserted_id)
        return item

    def __del__(self):
        print("-" * 10, "end", "-" * 10)
Feels like anyone with a pair of hands could do this, right? 😄
Learning never ends, though. Further down the road it shouldn't just be "anyone with hands can do it"; ideally it runs with no hands at all……