Scraping multiple pages with Scrapy: Maoyan Movies

Requirement analysis

Scrape the Maoyan movie Top 100 ranking:
https://maoyan.com/board/4
Fields to extract: name / player / time

1. Analyze the page: it is a static page (the data is in the raw HTML).
2. Analyze the URL structure: pagination.
Page 1: https://maoyan.com/board/4?offset=0
Page 2: https://maoyan.com/board/4?offset=10
Page 3: https://maoyan.com/board/4?offset=20
...
10 pages in total.

URL template: https://maoyan.com/board/4?offset={}
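
To confirm the page really is static, fetch the raw HTML and check that the movie entries are already in it; a minimal sketch using requests (the browser-like User-Agent is an assumption, since Maoyan tends to reject requests with default client headers):

import requests

url = "https://maoyan.com/board/4?offset=0"
# pretend to be a browser; Maoyan may block default headers
headers = {"User-Agent": "Mozilla/5.0"}
html = requests.get(url, headers=headers).text
# the ranking entries live in <dd> tags when the page is static
print(html.count("<dd>"))  # expect 10 per page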

Create the project

  1. scrapy startproject Maoyan
  2. cd Maoyan
  3. scrapy genspider maoyan maoyan.com
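
These commands generate the standard Scrapy project layout:

Maoyan/
├── scrapy.cfg
└── Maoyan/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── maoyan.py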

Define the item

# items.py
import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()    # movie title
    player = scrapy.Field()  # starring actors
    time = scrapy.Field()    # release date

Define the spider

# spiders/maoyan.py
import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    # start_urls is not used; start_requests() below generates the paginated URLs

    def start_requests(self):
        # build the 10 board URLs: offset = 0, 10, ..., 90
        for i in range(0, 100, 10):
            url = "https://maoyan.com/board/4?offset={}".format(i)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # each movie entry is a <dd> node
        dd_list = response.xpath("//*[@id='app']/div/div/div[1]/dl/dd")
        for dd in dd_list:
            item = MaoyanItem()
            item["name"] = dd.xpath('./div/div/div[1]/p[1]/a/text()').get().strip()
            item["player"] = dd.xpath('./div/div/div[1]/p[2]/text()').get().strip()
            item["time"] = dd.xpath('./div/div/div[1]/p[3]/text()').get().strip()

            # hand the item to the pipeline
            yield item
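
The positional paths above break as soon as the markup shifts; selecting by class is usually more robust. A sketch for the loop body, assuming the name/star/releasetime class names the Maoyan board page used at the time of writing:

# inside the `for dd in dd_list:` loop
item["name"] = dd.xpath('.//p[@class="name"]/a/text()').get(default="").strip()
item["player"] = dd.xpath('.//p[@class="star"]/text()').get(default="").strip()
item["time"] = dd.xpath('.//p[@class="releasetime"]/text()').get(default="").strip()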

When scraping multi-level (detail) pages, the item object must be passed along to the next callback, e.g. via the request's meta dict:

item["link"] = new_url
yield scrapy.Request(url=item["link"], callback=self.detail_page_parse, meta={"item": item})

def detail_page_parse(self, response):
    item = response.meta.get("item")
    item["other_key"] = "value"

    # hand the item to the pipeline
    yield item
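
In Scrapy 1.7 and later, cb_kwargs is the recommended alternative to meta for passing data into a callback; the item then arrives as a regular parameter:

yield scrapy.Request(url=item["link"], callback=self.detail_page_parse, cb_kwargs={"item": item})

def detail_page_parse(self, response, item):
    item["other_key"] = "value"
    yield item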

Define the pipeline

# pipelines.py
from itemadapter import ItemAdapter
import pymongo
from .settings import MONGO_HOST, MONGO_PORT, MONGO_DB, MONGO_SET

class MaoyanPipeline:
    def open_spider(self, spider):
        # connect to MongoDB once, when the spider starts
        self.conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
        self.db = self.conn[MONGO_DB]
        self.set = self.db[MONGO_SET]

    def process_item(self, item, spider):
        # convert the item to a plain dict and insert it
        data = ItemAdapter(item).asdict()
        self.set.insert_one(data)
        return item

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.conn.close()
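
After a crawl you can check that the data landed; a quick sketch with pymongo (host, port, and the maoyandb/maoyanset names are the assumed values from settings.py below):

import pymongo

conn = pymongo.MongoClient("127.0.0.1", 27017)
# should print 100 after a full run of the Top 100 board
print(conn["maoyandb"]["maoyanset"].count_documents({}))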

Configure the pipeline and settings

The pipeline imports MONGO_HOST, MONGO_PORT, MONGO_DB, and MONGO_SET from settings.py, so they must be defined there, and the pipeline itself must be enabled in ITEM_PIPELINES.
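
A minimal sketch of the relevant settings (the concrete host, port, database, and collection names are assumptions):

# settings.py
ROBOTSTXT_OBEY = False
# a browser-like User-Agent; Maoyan may block Scrapy's default one
USER_AGENT = "Mozilla/5.0"

# MongoDB connection settings used by MaoyanPipeline
MONGO_HOST = "127.0.0.1"
MONGO_PORT = 27017
MONGO_DB = "maoyandb"
MONGO_SET = "maoyanset"

# enable the pipeline (lower number = earlier in the chain)
ITEM_PIPELINES = {
    "Maoyan.pipelines.MaoyanPipeline": 300,
}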

Create run.py

# run.py
from scrapy import cmdline
cmdline.execute("scrapy crawl maoyan".split())

Run

From the project root, execute python run.py (or run scrapy crawl maoyan directly); run.py makes it easy to start the spider from an IDE.

Code download (extraction code: wr92)

