Requirements analysis
Scrape the Maoyan movie Top 100 ranking:
https://maoyan.com/board/4
Fields to capture: name / player / time
1. Analyze the page: it is a static page
2. Analyze the URL structure: pagination
Page 1: https://maoyan.com/board/4?offset=0
Page 2: https://maoyan.com/board/4?offset=10
Page 3: https://maoyan.com/board/4?offset=20
...
10 pages in total, so the URL template is:
url: https://maoyan.com/board/4?offset={}
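The pagination pattern above can be checked in plain Python before wiring it into the spider:

```python
# Build the ten page URLs for the Top 100 board (10 movies per page).
BASE = "https://maoyan.com/board/4?offset={}"

urls = [BASE.format(offset) for offset in range(0, 100, 10)]

print(urls[0])   # first page, offset=0
print(urls[-1])  # last page, offset=90
```

The same `range(0, 100, 10)` loop is what `start_requests` uses below.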
Create the project
- scrapy startproject Maoyan
- cd Maoyan
- scrapy genspider maoyan maoyan.com
Define the item
# items.py
import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    player = scrapy.Field()
    time = scrapy.Field()
Define the spider
# spiders/maoyan.py
import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    # start_urls = ['http://maoyan.com/']

    def start_requests(self):
        for i in range(0, 100, 10):
            url = "https://maoyan.com/board/4?offset={}".format(i)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # match the list of dd nodes, one per movie
        dd_list = response.xpath("//*[@id='app']/div/div/div[1]/dl/dd")
        for dd in dd_list:
            item = MaoyanItem()
            item["name"] = dd.xpath('./div/div/div[1]/p[1]/a/text()').get().strip()
            item["player"] = dd.xpath('./div/div/div[1]/p[2]/text()').get().strip()
            item["time"] = dd.xpath('./div/div/div[1]/p[3]/text()').get().strip()
            # hand the item over to the pipeline
            yield item
When scraping across multiple page levels, the item object must be passed along with the request:
    item["link"] = new_url
    yield scrapy.Request(url=item["link"], callback=self.detail_page_parse, meta={"item": item})

def detail_page_parse(self, response):
    item = response.meta.get("item")
    item["other_key"] = "value"
    …
    # hand the item over to the pipeline
    yield item
Define the pipeline
# pipelines.py
import pymongo

from .settings import MONGO_HOST, MONGO_PORT, MONGO_DB, MONGO_SET

class MaoyanPipeline:
    def open_spider(self, spider):
        # connect to MongoDB when the spider starts
        self.conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
        self.db = self.conn[MONGO_DB]
        self.set = self.db[MONGO_SET]

    def process_item(self, item, spider):
        # convert the item to a plain dict and insert it
        data = dict(item)
        self.set.insert_one(data)
        return item

    def close_spider(self, spider):
        self.conn.close()
Configure the pipeline and related settings
# settings.py
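The original does not show the settings themselves; a minimal sketch of what `settings.py` would need, assuming the `MONGO_*` names imported by the pipeline and the usual pipeline priority of 300 (the hostnames, database, and collection names are illustrative):

```python
# settings.py (fragment) -- values below are illustrative assumptions
BOT_NAME = "Maoyan"

# Identify as a regular browser; the default Scrapy UA is easily blocked.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
ROBOTSTXT_OBEY = False

# Enable the pipeline defined in pipelines.py
ITEM_PIPELINES = {
    "Maoyan.pipelines.MaoyanPipeline": 300,
}

# MongoDB connection constants imported by the pipeline
MONGO_HOST = "127.0.0.1"
MONGO_PORT = 27017
MONGO_DB = "maoyandb"
MONGO_SET = "maoyanset"
```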
Create run.py
# run.py
from scrapy import cmdline
cmdline.execute("scrapy crawl maoyan".split())
Run
Code download, extraction code: wr92
Implementing a scraper for Guazi used cars