1 File Download
Project goal: crawl the seaborn example source files.
The seaborn website is http://seaborn.pydata.org/
The example gallery is at http://seaborn.pydata.org/examples/index.html
Change into the directory where the project should live and create a Scrapy project named seaborn_file_download:
# dir is the directory in which to create the project
cd dir
scrapy startproject seaborn_file_download
The contents of items.py are:
import scrapy

class SeabornFileDownloadItem(scrapy.Item):
    # file_urls holds the URLs to download; FilesPipeline
    # writes the download results into the files field
    file_urls = scrapy.Field()
    files = scrapy.Field()
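For reference, once FilesPipeline has downloaded an item's file, it records the result in the files field. A minimal sketch of the shape of a finished item (the URL, path, and checksum values below are illustrative, not real output):

```python
# Sketch of an item after FilesPipeline has processed it (all values are
# illustrative). file_urls is what the spider fills in; files is what the
# pipeline adds, one dict per downloaded file.
item = {
    "file_urls": ["http://seaborn.pydata.org/examples/anscombes_quartet.py"],
    "files": [
        {
            "url": "http://seaborn.pydata.org/examples/anscombes_quartet.py",
            "path": "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.py",  # relative to FILES_STORE
            "checksum": "2f8bc9b9...",  # (truncated) checksum of the downloaded body
        }
    ],
}

# The download path is always relative to the FILES_STORE directory.
print(item["files"][0]["path"])
```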
Create a new file file_spider.py in the spiders folder, and define the spider class FileDownloadSpider in it:
from scrapy.spiders import Spider
from scrapy import Request
from seaborn_file_download.items import SeabornFileDownloadItem

class FileDownloadSpider(Spider):
    name = "file"

    def start_requests(self):
        url = "http://seaborn.pydata.org/examples/index.html"
        yield Request(url)

    def parse(self, response):
        # Collect the relative links to the individual example pages
        urls = response.xpath("//div[@class='figure align-center']/a/@href").extract()
        for href in urls:
            url = response.urljoin(href)
            yield Request(url, callback=self.parse_file)

    def parse_file(self, response):
        # Each example page carries a download link to its source file
        href = response.xpath("//a[@class='reference download internal']/@href").extract_first()
        url = response.urljoin(href)
        item = SeabornFileDownloadItem()
        item["file_urls"] = [url]
        yield item
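The response.urljoin calls above resolve the relative hrefs scraped from the page against the page's own URL. A quick standalone sketch of the same resolution using the standard library (the example filename is hypothetical):

```python
from urllib.parse import urljoin

# response.urljoin(href) behaves like urljoin(response.url, href):
# a relative href on the gallery page resolves against the page's URL.
page = "http://seaborn.pydata.org/examples/index.html"
href = "anscombes_quartet.html"  # hypothetical relative link from the gallery

print(urljoin(page, href))
# -> http://seaborn.pydata.org/examples/anscombes_quartet.html
```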
In the project configuration file settings.py, the following settings are required:
1) the robots.txt policy: ROBOTSTXT_OBEY (False means the protocol is not obeyed);
2) the user agent: USER_AGENT;
3) the file download directory: FILES_STORE;
4) the file pipeline: enable ITEM_PIPELINES and register Scrapy's built-in FilesPipeline.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome
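The remaining settings might look like the sketch below (the FILES_STORE path is an arbitrary example; the pipeline class path is Scrapy's built-in FilesPipeline):

```python
# Sketch of the remaining settings.py entries; the FILES_STORE value is
# an arbitrary example, any writable directory works.
ROBOTSTXT_OBEY = False  # do not obey robots.txt

# Directory where FilesPipeline saves the downloaded files
FILES_STORE = "seaborn_examples"

# Enable Scrapy's built-in FilesPipeline
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
```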