Project Directory Overview
Create a new directory, then Shift + right-click inside it and choose "Open command window here".
Run: scrapy startproject <project-name>
The resulting directory layout looks like this:
|- your-project-name
    |- scrapy.cfg
    |- your-project-name
        |- __init__.py
        |- items.py
        |- middlewares.py
        |- pipelines.py
        |- settings.py
        |- spiders
            |- __init__.py
What each file does:
scrapy.cfg: project configuration file
spiders: holds your Spider files, i.e. the .py files that do the crawling
items.py: defines containers for the scraped data; an Item behaves much like a dict
middlewares.py: defines the Downloader Middlewares and Spider Middlewares
pipelines.py: defines the Item Pipelines, which clean, validate, and store the scraped data
settings.py: project-wide settings
Scraping Qidian (起点中文网) with Scrapy
First, from the project directory, run scrapy genspider xs "qidian.com"
to generate a spider (our own crawler file) under the spiders directory. Here is the finished spider:
# -*- coding: utf-8 -*-
# Generated in the project directory with: scrapy genspider xs "qidian.com"
import scrapy
from python16.items import Python16Item


class XsSpider(scrapy.Spider):
    # Name of the spider
    name = 'xs'
    # Domains the spider is allowed to crawl; links pointing to other
    # domains (e.g. external links on the site) are ignored.
    allowed_domains = ['qidian.com']
    # Initial request URL(s) the spider starts from; there can be several.
    start_urls = ['https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1']

    # Called with the response of each start_urls request; this is where we
    # parse the page and extract what we want.
    # response: the downloaded page to parse.
    def parse(self, response):
        # Raw page data, if you want to inspect it:
        # print(response.body.decode("UTF-8"))
        # Parse the page with XPath (response provides this method directly)
        lis = response.xpath("//ul[contains(@class,'all-img-list')]/li")
        for li in lis:
            item = Python16Item()
            # Note: text() in Scrapy comes back as a SelectorList, e.g.
            # [<Selector xpath=".//div[@class='book-mid-info']/h4/a/text()" data='圣墟'>]
            # so we call .extract_first() to get the first result as a string.
            name = li.xpath(".//div[@class='book-mid-info']/h4/a/text()").extract_first()
            author = li.xpath(".//div[@class='book-mid-info']/p[@class='author']/a/text()").extract_first()
            # Strip surrounding whitespace
            content = str(li.xpath("./div[@class='book-mid-info']/p[@class='intro']/text()").extract_first()).strip()
            item['name'] = name
            item['author'] = author
            item['content'] = content
            # Hand each item to the pipeline one at a time. Do NOT collect
            # them into a list and return that: it keeps everything in memory.
            yield item
        # Follow the next-page link
        nextUrl = response.xpath("//a[contains(@class, 'lbf-pagination-next')]/@href").extract_first()
        if nextUrl and nextUrl != "javascript:;":
            yield scrapy.Request(url="http:" + nextUrl, callback=self.parse)
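The spider above leans on the fact that Scrapy's extract_first() returns the first match as a string, or None when nothing matches, which is why the next-page link is checked before use. A dependency-free sketch of that "first match or None" pattern, using only the standard library's ElementTree (the extract_first helper here is our own, for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny well-formed snippet shaped like the listing page we parse above.
html = """<ul class="all-img-list">
  <li><div class="book-mid-info"><h4><a>圣墟</a></h4></div></li>
  <li><div class="book-mid-info"><h4><a>大道朝天</a></h4></div></li>
</ul>"""

def extract_first(nodes):
    # Mimics SelectorList.extract_first(): first result, or None if empty.
    return nodes[0].text if nodes else None

root = ET.fromstring(html)
names = [extract_first(li.findall(".//h4/a")) for li in root.findall("li")]
print(names)      # ['圣墟', '大道朝天']

# No author node in this snippet, so we get None instead of an IndexError.
missing = extract_first(root.findall(".//p[@class='author']/a"))
print(missing)    # None
```

This is the same reason the spider guards `if nextUrl and ...`: on the last page the XPath matches nothing and extract_first() hands back None.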
Defining the Item
An Item is the container that holds the scraped data; you use it much like a dict.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


# Essentially a model/entity class
class Python16Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
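Python16Item is used like a dict, but a scrapy.Item only accepts keys that were declared as Fields and raises KeyError otherwise. A minimal Scrapy-free sketch of that behavior (SketchItem is hypothetical, for illustration only):

```python
# Dict-like access, but only declared fields may be set, mirroring
# how scrapy.Item rejects undeclared keys.
class SketchItem(dict):
    fields = ("name", "author", "content")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)

item = SketchItem()
item["name"] = "圣墟"
item["author"] = "辰东"
print(dict(item))   # {'name': '圣墟', 'author': '辰东'}

# item["price"] = 10 would raise KeyError: price is not a declared field
```

That restriction is the practical difference from a plain dict: a typo in a field name fails loudly instead of silently creating a new key.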
pipelines.py is where we process the data:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import csv


# Stores the scraped data
class Python16Pipeline(object):
    def __init__(self):
        # newline="" prevents a blank line appearing between rows when writing
        self.f = open("起点中文网.csv", "w", newline="")
        self.writer = csv.writer(self.f)
        self.writer.writerow(["书名", "作者", "简介"])

    # Process each item
    def process_item(self, item, spider):
        # Save the item as a CSV row
        name = item["name"]
        author = item["author"]
        content = item["content"]
        self.writer.writerow([name, author, content])
        # If another pipeline follows, the returned value is what its
        # process_item receives as item; return nothing and the next
        # pipeline gets None instead.
        return item
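The newline="" argument matters because csv.writer already terminates each row with \r\n; without newline="", Python's text layer on Windows would translate that into \r\r\n, which shows up as a blank line between rows. A small stand-alone demonstration using an in-memory buffer instead of a file:

```python
import csv
import io

# StringIO never translates newlines, so we see exactly what csv.writer emits.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["书名", "作者", "简介"])
writer.writerow(["圣墟", "辰东", "一念可斩星河"])

# Each row already ends in \r\n; a file opened without newline="" would
# translate those and double them up on Windows.
print(repr(buf.getvalue()))   # '书名,作者,简介\r\n圣墟,辰东,一念可斩星河\r\n'
```

The same reasoning applies to any open() call whose handle is passed to csv.writer, which is why the pipeline's __init__ opens the file with newline="".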
Below is the code in settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for python16 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'python16'
SPIDER_MODULES = ['python16.spiders']
NEWSPIDER_MODULE = 'python16.spiders'
# Log output level
LOG_LEVEL = "WARNING"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# (this is where you set your own UA string)
#USER_AGENT = 'python16 (+http://www.yourdomain.com)'
# Obey robots.txt rules (whether to honor the site's robots.txt when crawling)
# ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers (these defaults can be replaced):
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'python16.middlewares.Python16SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'python16.middlewares.Python16DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Pipeline priority (e.g. the 300 below): the lower the value, the earlier
# the pipeline runs (valid range 0-1000).
# No pipeline is enabled by default, so this setting must be uncommented.
# Several pipelines can be configured here; the priorities decide their order.
# Whatever each pipeline's process_item returns is what the next pipeline
# receives as its item argument.
ITEM_PIPELINES = {
'python16.pipelines.Python16Pipeline': 300,
}
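To illustrate that ordering and hand-off without Scrapy (both pipeline classes below are hypothetical): pipelines run lowest number first, and each process_item return value becomes the next pipeline's item.

```python
# Hypothetical pipelines mimicking how Scrapy chains process_item calls.
class StripPipeline:
    def process_item(self, item, spider=None):
        item["name"] = item["name"].strip()
        return item

class TagPipeline:
    def process_item(self, item, spider=None):
        item["source"] = "qidian"
        return item

# Mirrors ITEM_PIPELINES: lower number runs first (valid range 0-1000).
pipelines = {StripPipeline(): 300, TagPipeline(): 500}

item = {"name": "  圣墟  "}
for pipeline, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
    item = pipeline.process_item(item)
print(item)   # {'name': '圣墟', 'source': 'qidian'}
```

Swapping the two priorities would run TagPipeline first; because each stage returns the item it received, the end result here would be the same, but for pipelines that drop or transform items the order matters.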
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Testing:
Run the spider from the command line with:
scrapy crawl xs
That works, but it is not very convenient. Instead, you can add a start.py file to the project and run it to launch the crawler:
from scrapy import cmdline
# The command string must be split into a list of arguments
cmdline.execute("scrapy crawl xs".split())
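cmdline.execute() expects the command as an argv-style list rather than a single string, which is why the split is required:

```python
import shlex

cmd = "scrapy crawl xs"
print(cmd.split())       # ['scrapy', 'crawl', 'xs']

# shlex.split gives the same result here, and also handles quoted
# arguments correctly if the command ever contains them.
print(shlex.split(cmd))  # ['scrapy', 'crawl', 'xs']
```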