Goal of this post: use the Scrapy framework to crawl the Douban Top 250 (https://movie.douban.com/top250) and store the results in MongoDB.
- Scrapy
- MongoDB
- XPath text extraction
Creating the project
scrapy startproject doubantop
Then create the main spider file spider_douban.py inside the spiders folder.
The file structure looks like this:
├── doubantop
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── spider_douban.py  # newly created
└── scrapy.cfg
items.py
We will scrape each movie's title, link, rating, and number of reviewers, and declare those fields in items.py:
import scrapy


class DoubanItem(scrapy.Item):
    title = scrapy.Field()  # movie title
    link = scrapy.Field()   # detail-page URL
    star = scrapy.Field()   # rating
    num = scrapy.Field()    # number of reviewers
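Scrapy Items support dict-style field access, which the MongoDB pipeline below relies on when it calls dict(item). A quick illustration (not part of the project files; the value is made up):

from doubantop.items import DoubanItem

item = DoubanItem()
item['title'] = '肖申克的救赎'  # made-up value, for illustration only
print(dict(item))              # -> {'title': '肖申克的救赎'}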
The spider
To crawl Douban we first need to mimic a browser's request headers.
Add the following to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
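If you prefer to set headers wholesale, DEFAULT_REQUEST_HEADERS in settings.py is a commonly used alternative (a sketch; the USER_AGENT setting above is all this tutorial needs):

# settings.py -- alternative way to send browser-like headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/53.0.2785.143 Safari/537.36'),
}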
spider_douban.py
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from doubantop.items import DoubanItem


class DoubanSpider(Spider):
    # Unique name that identifies this spider to `scrapy crawl`
    name = 'douban'
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        selector = Selector(response)
        for sel in selector.xpath('//div[@class="info"]'):
            # Build one item per movie entry
            item = DoubanItem()
            item['title'] = sel.xpath('div[@class="hd"]/a/span/text()').extract()[0]
            item['link'] = sel.xpath('div[@class="hd"]/a/@href').extract()[0]
            item['star'] = sel.xpath('div[2]/div/span/text()').extract()[0]
            item['num'] = sel.xpath('div[2]/div/span/text()').extract()[1]
            yield item

        # Follow the "next page" link; extract() returns an empty list
        # on the last page, which ends the crawl cleanly
        next_link = selector.xpath('//span[@class="next"]/a/@href').extract()
        if next_link:
            yield Request(response.urljoin(next_link[0]), callback=self.parse)
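To sanity-check the XPath expressions before a full crawl, you can poke at the page in Scrapy's interactive shell; the selectors below mirror the ones in parse. Run the shell from inside the project directory so the USER_AGENT setting applies:

scrapy shell "https://movie.douban.com/top250"
>>> response.xpath('//div[@class="info"]/div[@class="hd"]/a/span/text()').extract()[:3]
>>> response.xpath('//span[@class="next"]/a/@href').extract()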
Then run the crawl, saving the results to a JSON file:
scrapy crawl douban -o items.json
Check items.json in the project directory to confirm the data was scraped.
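A quick way to confirm the export programmatically (a minimal sketch; assumes the crawl has finished and items.json sits in the current directory):

import json

# Load the exported items and spot-check the contents
with open('items.json', encoding='utf-8') as f:
    movies = json.load(f)

print(len(movies))         # should be 250 after all pages are crawled
print(movies[0]['title'])  # title of the first movie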
Saving to MongoDB
To store the results in a local MongoDB instance, first add the configuration to settings.py:
ITEM_PIPELINES = {
    'doubantop.pipelines.MongoDBPipeline': 300,
}

MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "douban"
MONGODB_COLLECTION = "movie"
pipelines.py:
import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class MongoDBPipeline(object):
    def __init__(self):
        # Connect to the MongoDB instance configured in settings.py
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[settings['MONGODB_DB']]
        self.post = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Convert the item to a plain dict and store it
        self.post.insert_one(dict(item))
        return item
Now run the crawl:
scrapy crawl douban
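If you would rather verify from the command line than a GUI, a short pymongo check works (a minimal sketch, assuming the settings above and pymongo >= 3.7 for count_documents):

import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["douban"]["movie"]

print(collection.count_documents({}))       # expect 250
print(collection.find_one({}, {"_id": 0}))  # one sample document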
Open a MongoDB GUI client and connect to inspect the collection (Robomongo is used here).
All 250 records are there. Done!