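I have just started learning web scraping, and this is my first time crawling data with Scrapy.
I. Steps
1. Install Scrapy
pip install scrapy
2. Create the crawler project
scrapy startproject doubanScrapy
3. Create the spider
scrapy genspider doubanmovie movie.douban.com
4. Edit settings.py and set a browser-like USER_AGENT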
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
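5. Fill in items.py, which declares the fields the spider will collect
import scrapy


class DoubanscrapyItem(scrapy.Item):
    # One field per piece of information collected for each movie
    movie_name = scrapy.Field()   # title
    yanyuan = scrapy.Field()      # director and cast line
    time_juqing = scrapy.Field()  # year / country / genre line
    move_star = scrapy.Field()    # rating score
    evaluate = scrapy.Field()     # number of ratings
    introduce = scrapy.Field()    # one-line quote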
6. Write the extraction logic in the spider's parse method
    def parse(self, response):
        # Each movie on the page sits in its own <div class="item">
        movies = response.xpath('//div[@class="item"]')
        for movie in movies:
            # Paths start at the current item ('./'); .get(default='') returns ''
            # rather than raising IndexError when a node is missing
            movie_name = movie.xpath('./div[2]/div[1]/a/span[1]/text()').get(default='').replace(' ', '')
            yanyuan = movie.xpath('./div[2]/div[2]/p[1]/text()').get(default='').replace(' ', '')
            time_juqing = movie.xpath('./div[2]/div[2]/p[1]/text()[2]').get(default='').replace(' ', '')
            move_star = movie.xpath('./div[2]/div[2]/div/span[2]/text()').get(default='').replace(' ', '')
            evaluate = movie.xpath('./div[2]/div[2]/div/span[4]/text()').get(default='').replace(' ', '').replace('人评价', '')
            introduce = movie.xpath('./div[2]/div[2]/p[2]/span/text()').get(default='').replace(' ', '')
            item = DoubanscrapyItem()
            item['movie_name'] = movie_name
            item['yanyuan'] = yanyuan
            item['time_juqing'] = time_juqing
            item['move_star'] = move_star
            item['evaluate'] = evaluate
            item['introduce'] = introduce
            yield item
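The chained replace calls above only strip plain spaces, while text taken from a page may also carry newlines and non-breaking spaces. As a minimal sketch, the clean-up could be factored into two small functions. The names `clean_text` and `parse_vote_count` are hypothetical helpers, not part of Scrapy or of the original spider, and unlike the original they collapse whitespace to a single space instead of removing it.

```python
import re


def clean_text(raw):
    """Collapse every run of whitespace into one space and trim both ends."""
    # In Python 3, \s matches Unicode whitespace, including '\xa0' (non-breaking space)
    return re.sub(r'\s+', ' ', raw or '').strip()


def parse_vote_count(raw):
    """Turn a string such as '12345人评价' into the integer 12345; return None if it has no digits."""
    digits = re.sub(r'\D', '', raw or '')
    return int(digits) if digits else None


print(clean_text('\n    导演: A\xa0\xa0\xa0主演: B\n  '))  # 导演: A 主演: B
print(parse_vote_count('12345人评价'))  # 12345
```

Collapsing rather than deleting whitespace keeps word boundaries intact in values that contain English names.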
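7. Run the spider from the command line (cmd); Scrapy infers the CSV format from the .csv extension, so the -t csv option, deprecated in recent versions, is not needed
scrapy crawl doubanmovie -o tt.csv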
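Once the crawl finishes, the exported file can be checked with Python's standard `csv` module. The sketch below assumes the file is UTF-8 encoded and has a header row matching the item field names. The function names `load_movies` and `top_rated` are hypothetical.

```python
import csv


def load_movies(path):
    """Read an exported CSV file and return its rows as a list of dicts keyed by column name."""
    # newline='' lets the csv module handle line endings inside quoted fields itself
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))


def top_rated(rows, n=5):
    """Return the n rows with the highest 'move_star' score, skipping rows without a numeric score."""
    scored = []
    for row in rows:
        try:
            scored.append((float(row['move_star']), row))
        except (KeyError, TypeError, ValueError):
            continue
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:n]]
```

For example, `top_rated(load_movies('tt.csv'), n=3)` would return the three highest-scored rows from the file written by the command above.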
II. Complete code:
import scrapy
from scrapy import Request

from doubanScrapy.items import DoubanscrapyItem


class DoubanmovieSpider(scrapy.Spider):
    name = 'doubanmovie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/top250']

    def parse(self, response):
        # Each movie on the page sits in its own <div class="item">
        movies = response.xpath('//div[@class="item"]')
        for movie in movies:
            # Paths start at the current item ('./'); .get(default='') returns ''
            # rather than raising IndexError when a node is missing
            movie_name = movie.xpath('./div[2]/div[1]/a/span[1]/text()').get(default='').replace(' ', '')
            yanyuan = movie.xpath('./div[2]/div[2]/p[1]/text()').get(default='').replace(' ', '')
            time_juqing = movie.xpath('./div[2]/div[2]/p[1]/text()[2]').get(default='').replace(' ', '')
            move_star = movie.xpath('./div[2]/div[2]/div/span[2]/text()').get(default='').replace(' ', '')
            evaluate = movie.xpath('./div[2]/div[2]/div/span[4]/text()').get(default='').replace(' ', '').replace('人评价', '')
            introduce = movie.xpath('./div[2]/div[2]/p[2]/span/text()').get(default='').replace(' ', '')
            item = DoubanscrapyItem()
            item['movie_name'] = movie_name
            item['yanyuan'] = yanyuan
            item['time_juqing'] = time_juqing
            item['move_star'] = move_star
            item['evaluate'] = evaluate
            item['introduce'] = introduce
            yield item

        # Follow the "next page" link; .get() returns None when the link is absent
        next_page = response.xpath('//span[@class="next"]/link/@href').get()
        if next_page:
            # The href is relative, so resolve it against the URL of the current page
            yield Request(response.urljoin(next_page), callback=self.parse)
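The pagination step depends on combining the page URL with the relative href taken from the "next" link. As a minimal sketch using only the standard library, `urllib.parse.urljoin` (which Scrapy's `response.urljoin` builds on) resolves a query-only reference against the current page. The href value `?start=25&filter=` is an assumed example, and `next_page_url` is a hypothetical helper.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the href of a 'next page' link against the URL of the page it was found on."""
    # An empty or missing href means there is no further page
    if not href:
        return None
    return urljoin(current_url, href)


first = 'http://movie.douban.com/top250'
second = next_page_url(first, '?start=25&filter=')
print(second)  # http://movie.douban.com/top250?start=25&filter=
# A query-only href replaces the previous query string instead of being appended to it
print(next_page_url(second, '?start=50&filter='))  # http://movie.douban.com/top250?start=50&filter=
print(next_page_url(second, None))  # None
```

Resolving against the current page rather than a fixed start URL also works when a site uses path-relative or absolute links for its pagination.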
Output: