Getting Started with Scrapy
Key points from learning Scrapy:
(1) Selecting elements with CSS selectors
Scrapy provides both CSS and XPath selectors for extracting HTML elements; since I am more familiar with CSS, CSS selectors are used here (a quick shell example follows this list).
(2) Understanding items and pipelines
An item is a container that holds scraped data; a pipeline processes the data placed in items after scraping (a minimal pipeline sketch also follows below).
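To experiment with selectors before writing a spider, the Scrapy shell is handy. A quick sketch against the Top 250 page (the CSS selector below matches the movie titles; the equivalent XPath is shown for comparison — note you may need the USER_AGENT setting described under "Problems encountered" to avoid a 403):

    scrapy shell 'https://movie.douban.com/top250/'

    >>> response.css("span.title::text").extract_first()
    >>> response.xpath("//span[@class='title']/text()").extract_first()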
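For point (2): every item a spider yields is passed through the process_item() method of each enabled pipeline, in priority order. As a minimal sketch (a hypothetical pipeline, not part of this project), a pipeline can also discard bad items by raising DropItem:

    from scrapy.exceptions import DropItem

    class RequireTitlePipeline(object):
        def process_item(self, item, spider):
            # Discard any item that was scraped without a title
            if not item.get('title'):
                raise DropItem("missing title")
            return item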
Environment
Ubuntu 14
Python 3.5
Scrapy 1.3.3
Problems encountered
(1) 403 Forbidden responses: the site rejects the default Scrapy client, so I set USER_AGENT in settings.py to mimic a browser; the site then treats the requests as coming from a browser.
(2) Unicode escapes garbling the JSON output file: fixed by adding a pipeline that controls the encoding used when writing items to the file.
settings.py
BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Mimic a real browser to avoid 403 responses
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

# Enable the JSON-writing pipeline (lower numbers run earlier)
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 800,
}
items.py
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
pipelines.py
import codecs
import json


class DoubanPipeline(object):
    def __init__(self):
        self.file = codecs.open('item.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; release the file handle
        self.file.close()
douban250.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib import parse
from douban.items import DoubanItem
class Douban250Spider(scrapy.Spider):
    name = "douban250"
    allowed_domains = ["douban.com"]
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        # Each movie entry on the list page sits in a div.item
        movies = response.css("div.item")
        for movie in movies:
            item = DoubanItem()
            # The first span.title inside the entry holds the Chinese title
            item['title'] = movie.css("span.title::text").extract_first()
            yield item
        # Follow the "next page" link until it disappears on the last page
        next_url = response.css("span.next a::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
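To run the spider, from the project root (assuming the standard layout generated by scrapy startproject douban):

    scrapy crawl douban250

The pipeline writes one JSON object per movie to item.json.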
Result screenshot
Scraping Douban Movie TOP250 without a framework
Python 2
#!/usr/bin/env python
# encoding=utf-8
"""
Scrape Douban Movie TOP250 - complete example
"""
import codecs

import requests
from bs4 import BeautifulSoup

DOWNLOAD_URL = 'http://movie.douban.com/top250/'


def download_page(url):
    # Send a browser-like User-Agent, for the same 403 reason as above
    return requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
    }).content


def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    movie_list_soup = soup.find('ol', attrs={'class': 'grid_view'})

    movie_name_list = []
    for movie_li in movie_list_soup.find_all('li'):
        detail = movie_li.find('div', attrs={'class': 'hd'})
        movie_name = detail.find('span', attrs={'class': 'title'}).getText()
        movie_name_list.append(movie_name)

    # On the last page the "next" span has no <a>, so next_page is None
    next_page = soup.find('span', attrs={'class': 'next'}).find('a')
    if next_page:
        return movie_name_list, DOWNLOAD_URL + next_page['href']
    return movie_name_list, None


def main():
    url = DOWNLOAD_URL
    with codecs.open('movies.txt', 'wb', encoding='utf-8') as fp:
        while url:
            html = download_page(url)
            movies, url = parse_html(html)
            fp.write(u'{movies}\n'.format(movies='\n'.join(movies)))


if __name__ == '__main__':
    main()
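The script above targets Python 2; under Python 3 (the interpreter used for the Scrapy version) the requests/BeautifulSoup parts run unchanged, and only the file handling needs adjusting. A minimal sketch of the adapted main():

    def main():
        url = DOWNLOAD_URL
        # Python 3's built-in open() handles encoding; codecs is no longer needed
        with open('movies.txt', 'w', encoding='utf-8') as fp:
            while url:
                html = download_page(url)
                movies, url = parse_html(html)
                fp.write('{movies}\n'.format(movies='\n'.join(movies)))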