I. A simple Douban crawl
1. The spider file
The key point here is pagination. Looking at consecutive pages, the URLs differ only in the `start` query parameter, which increases by 25 per page, so we can generate the full list of page links with a list comprehension:
start_urls = ['http://movie.douban.com/top250?start='+str(i*25) for i in range(10)]
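As a quick sanity check, the same comprehension can be run outside Scrapy to see exactly which ten URLs it produces:

```python
# The same list comprehension as in the spider, run standalone:
# `start` steps by 25 per page, covering ranks 1-250 across 10 pages.
start_urls = ['http://movie.douban.com/top250?start=' + str(i * 25)
              for i in range(10)]

for url in start_urls:
    print(url)
# First entry ends in start=0, last entry ends in start=225
```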
With the URLs in place, we can write the extraction logic:
```python
import logging

import scrapy

from doubanFilm.items import DoubanfilmItem

logger = logging.getLogger(__name__)


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/top250?start=' + str(i * 25)
                  for i in range(10)]

    def parse(self, response):
        # Each <li> under the ordered list is one film entry
        for row in response.xpath('//*[@id="content"]/div/div[1]/ol/li'):
            item = DoubanfilmItem()
            item["Ranking"] = row.xpath("div/div[1]/em/text()").get()  # ranking number
            item["movieName"] = row.xpath("div/div[2]/div[1]/a/span[1]/text()").get()  # film title
            item["director"] = row.xpath("div/div[2]/div[2]/p[1]/text()[1]").get()  # director line
            item["rating"] = row.xpath("div/div[2]/div[2]/div/span[2]/text()").get()  # rating
            # Poster image URL: the <img>'s @src (a/@href would be the
            # detail-page link, not the image)
            item["imageURL"] = row.xpath("div/div[1]/a/img/@src").get()
            logger.warning(item)
            yield item
```
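The spider imports `DoubanfilmItem` from `doubanFilm/items.py`, which the post does not show. Given the five fields assigned in `parse()`, that file would look roughly like this (a sketch assuming the default Scrapy project layout; it is declaration-only):

```python
import scrapy


class DoubanfilmItem(scrapy.Item):
    # One Field per key assigned in the spider; names must match exactly,
    # or Scrapy raises a KeyError on assignment.
    Ranking = scrapy.Field()
    movieName = scrapy.Field()
    director = scrapy.Field()
    rating = scrapy.Field()
    imageURL = scrapy.Field()
```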
2. The settings.py file
1. ROBOTSTXT_OBEY = False
Disable robots.txt compliance; otherwise many pages cannot be crawled.
2. LOG_LEVEL = "WARNING"
Only log messages at WARNING level and above are shown.
3. Add a request header to mimic a real browser: USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
(Replace the User-Agent string with a recent one.)
4. FEED_EXPORT_FIELDS = ["Ranking","movieName","director","rating","imageURL"]
This fixes which item fields are exported and in what column order.
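Putting the four changes together, the modified part of settings.py would read as follows (a config fragment; the User-Agent string is just the example above and should be swapped for a current one):

```python
# settings.py -- only the entries changed from the Scrapy defaults

ROBOTSTXT_OBEY = False        # ignore robots.txt, or many pages are blocked
LOG_LEVEL = "WARNING"         # show only WARNING and above

# Mimic a real browser; replace with a recent User-Agent string
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36')

# Export these item fields, in this column order
FEED_EXPORT_FIELDS = ["Ranking", "movieName", "director", "rating", "imageURL"]
```

With this in place, running `scrapy crawl douban -o top250.csv` writes the items to a CSV whose columns follow FEED_EXPORT_FIELDS.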