Scraping the Douban Movie Top 250 with Scrapy

An entry-level Scrapy exercise.
Key points from learning Scrapy:
(1) Selecting elements with CSS selectors
Scrapy provides both CSS and XPath selectors for extracting HTML elements; since I am more familiar with CSS, that is what is used here.
(2) Understanding Item and Pipeline
An Item is a container that holds the scraped data, and a Pipeline processes the data placed into the Item after scraping (see the sketch after this list).
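A minimal sketch of both points, assuming the USER_AGENT from settings.py below is already configured (without it Douban answers with 403) and using the same selector strings as the spider further down; the exact title returned depends on the live page:

# Interactive check in a scrapy shell session:
#   scrapy shell 'https://movie.douban.com/top250/'
>>> response.css("div.item span.title::text").extract_first()  # CSS selector: first title on the page
'肖申克的救赎'
>>> from douban.items import DoubanItem
>>> item = DoubanItem(title='肖申克的救赎')  # an Item is a dict-like container with a fixed set of fields
>>> dict(item)  # this is what the pipeline later serializes to JSON
{'title': '肖申克的救赎'}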
Environment
Ubuntu 14
Python 3.5
Scrapy 1.3.3
Problems encountered
(1) 403 Forbidden: the site rejects the default Scrapy client, so set USER_AGENT in settings.py to mimic a browser; the site then treats the requests as if they came from a browser.
(2) Unicode escapes in the JSON output file: a pipeline is introduced to control the encoding used when writing the file (see the sketch after this list).
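For point (2), the root cause is that json.dumps escapes non-ASCII characters by default, so Chinese titles end up as \uXXXX sequences in the file; passing ensure_ascii=False (as the pipeline below does) keeps them readable. A quick illustration in a Python 3 shell, with a made-up title string:

>>> import json
>>> print(json.dumps({'title': '电影'}))  # default: non-ASCII is escaped
{"title": "\u7535\u5f71"}
>>> print(json.dumps({'title': '电影'}, ensure_ascii=False))  # keep the characters as-is
{"title": "电影"}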
settings.py

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
# Mimic a browser so Douban does not respond with 403
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
# Register the pipeline; the value (0-1000) sets its order if several pipelines are enabled
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 800,
}

items.py

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()

pipelines.py

import json
import codecs


class DoubanPipeline(object):
    def __init__(self):
        # Open the output file with an explicit UTF-8 encoding
        self.file = codecs.open('item.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False writes the Chinese titles as-is instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Close the output file when the spider finishes
        self.file.close()

douban250.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib import parse
from douban.items import DoubanItem


class Douban250Spider(scrapy.Spider):
    name = "douban250"
    allowed_domains = ["douban.com"]
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        # Each movie on the list page sits in its own div.item block
        movies = response.css("div.item")
        for movie in movies:
            item = DoubanItem()
            # The first span.title inside the block holds the Chinese title
            item['title'] = movie.css("span.title::text").extract()[0]
            yield item

        # Follow the "next page" link; the last page has none, which stops the crawl
        next_url = response.css("span.next a::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
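To run the spider, execute the following from the project root (the directory containing scrapy.cfg); the pipeline above writes the titles to item.json:

scrapy crawl douban250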

Result screenshot (image not included here)

Scraping the Douban Movie Top 250 without a framework
Python 2

#!/usr/bin/env python
# encoding=utf-8

"""
爬取豆瓣电影TOP250 - 完整示例代码
"""

import codecs

import requests
from bs4 import BeautifulSoup

DOWNLOAD_URL = 'http://movie.douban.com/top250/'


def download_page(url):
    return requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
    }).content


def parse_html(html):
    # Name the parser explicitly; bs4 emits a warning when none is given
    soup = BeautifulSoup(html, 'html.parser')
    movie_list_soup = soup.find('ol', attrs={'class': 'grid_view'})

    movie_name_list = []

    for movie_li in movie_list_soup.find_all('li'):
        detail = movie_li.find('div', attrs={'class': 'hd'})
        movie_name = detail.find('span', attrs={'class': 'title'}).getText()

        movie_name_list.append(movie_name)

    # The last page has a span.next block without an <a>, so next_page is None there
    next_page = soup.find('span', attrs={'class': 'next'}).find('a')
    if next_page:
        # The href is a query string such as '?start=25&filter=', appended to the base URL
        return movie_name_list, DOWNLOAD_URL + next_page['href']
    return movie_name_list, None


def main():
    url = DOWNLOAD_URL

    with codecs.open('movies.txt', 'wb', encoding='utf-8') as fp:
        while url:
            html = download_page(url)
            movies, url = parse_html(html)
            fp.write(u'{movies}\n'.format(movies='\n'.join(movies)))


if __name__ == '__main__':
    main()
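To run this non-framework version you only need Python 2 plus the two third-party packages; the script file name below is just an example, use whatever name you saved it under:

pip install requests beautifulsoup4
python douban_top250.py  # the titles are written to movies.txt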