Scraping the Douban Movie Top 250 with Python and Scrapy

Environment

  • Windows
  • Python 2.7
  • Scrapy
  • PyMongo

Creating the project

scrapy startproject douban_movie

The directory structure is as follows:

|-- douban_movie
|   |-- __init__.py
|   |-- items.py
|   |-- middlewares.py
|   |-- pipelines.py
|   |-- settings.py
|   `-- spiders
|       |-- __init__.py
|       `-- spiders.py
|-- README.md
|-- run.py
`-- scrapy.cfg

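middlewares.py: sets the User-Agent
pipelines.py: processes the scraped items and inserts them into MongoDB
items.py: defines the structure of the data to scrape
spiders.py: contains the actual crawling logic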
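
The article does not show middlewares.py, so the following is only a minimal sketch of how a User-Agent could be set in a Scrapy downloader middleware. The class name RandomUserAgentMiddleware and the strings in USER_AGENTS are assumptions made for illustration; the real project may use different ones.

```python
import random

# Hypothetical pool of browser User-Agent strings; the values used by the
# original project are not shown in the article.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that sets a User-Agent on each outgoing request."""

    def __init__(self, user_agents=None):
        # Fall back to the placeholder pool when no list is supplied.
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        # Scrapy calls this hook for every request before it is downloaded.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

For Scrapy to use such a class, it would have to be listed in settings.py under DOWNLOADER_MIDDLEWARES, for example as 'douban_movie.middlewares.RandomUserAgentMiddleware': 543 (the priority number is likewise an arbitrary choice here).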

spiders.py:

import logging

from bson import ObjectId
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider

from douban_movie.items import DoubanMovieItem

logger = logging.getLogger('doubanspider')


class Spiders(CrawlSpider):
    name = "movie"
    start_urls = [
        'https://movie.douban.com/top250/',
    ]

    # No crawl rules are defined, so parse() is overridden directly.
    def parse(self, response):
        selector = Selector(response)
        # Each movie entry on the page is a <div class="item">.
        ol_li = selector.xpath('//div[@class="item"]')
        for li in ol_li:
            movie = DoubanMovieItem()
            # Generate a fresh MongoDB-style id for every scraped movie.
            movie['_id'] = str(ObjectId())
            movie['rank'] = li.xpath('div[@class="pic"]/em/text()').extract_first()
            movie['link'] = li.xpath('div[@class="pic"]/a/@href').extract_first()
            movie['img'] = li.xpath('div[@class="pic"]/a/img/@src').extract_first()
            movie['title'] = li.xpath('div[@class="pic"]/a/img/@alt').extract_first()
            movie['star'] = li.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            movie['quote'] = li.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            yield movie

        # Follow the "next page" link; it is absent on the last page.
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = 'https://movie.douban.com/top250' + next_page[0].extract()
            yield Request(url=url, callback=self.parse)
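
Each movie yielded by the spider is handed to pipelines.py, which is described above as inserting the scraped content into MongoDB. That file is not reproduced in the article, so the block below is a rough sketch of what such a pipeline might look like. The class name MongoPipeline and the connection URI, database name, and collection name are all placeholders, not values taken from the project.

```python
class MongoPipeline(object):
    """Item pipeline that stores each scraped movie in a MongoDB collection."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="douban", collection_name="movies"):
        # All three defaults are placeholders, not values from the article.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so the module can be loaded without PyMongo installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Release the connection once the crawl has finished.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Scrapy items convert to plain dicts, which PyMongo can store.
        self.collection.insert_one(dict(item))
        # Return the item so that any later pipeline still receives it.
        return item
```

A pipeline like this would be enabled in settings.py through ITEM_PIPELINES, for example with the entry 'douban_movie.pipelines.MongoPipeline': 300. Because the spider assigns a new _id to every item, a plain insert does not collide with documents from earlier runs, but it also means that re-running the crawl stores each movie again.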

For the full details, please download the source code.
