How powerful is the Scrapy framework, really?
First, of course, you need to have it installed.
Installation
You can install it with conda:
conda install -c conda-forge scrapy
Or install it from PyPI with pip:
pip install Scrapy
Scrapy depends on a few related libraries:
lxml
parsel
w3lib
twisted
cryptography and pyOpenSSL
If you find one of these libraries missing while using Scrapy, just install it and move on.
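If you ever want to install them explicitly, something like this should cover them (pip normally pulls them in automatically as Scrapy's dependencies):
pip install lxml parsel w3lib Twisted cryptography pyOpenSSL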
Getting started
To avoid breaking anything, you can create a virtual environment first.
Reference: Three ways to create a Python virtual environment (镰刀韭菜, CSDN blog)
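For example, the standard-library venv module is enough (just an illustration; the environment name is arbitrary):
python -m venv scrapy_env
source scrapy_env/bin/activate   # on Windows: scrapy_env\Scripts\activate
pip install Scrapy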
Say we want to create a qiushibaike crawler project; it goes like this:
Basic commands
scrapy startproject qiushibaike
cd qiushibaike
scrapy crawl qiushibaike
Basic files
Inside the project you will find some configuration files and predefined settings. Our spider code goes under the spiders directory; here let's create one called qiushibaike_spider.
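The layout that scrapy startproject generates looks roughly like this (qiushibaike_spider.py is the file we are about to add):
qiushibaike/
├── scrapy.cfg
└── qiushibaike/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── qiushibaike_spider.py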
In that file we need to subclass scrapy.Spider. To create a Spider you must inherit from scrapy.Spider and define the following three attributes:
name: identifies the Spider. It must be unique; you cannot give two different Spiders the same name.
start_urls: the list of URLs the Spider starts crawling from. The first pages fetched will come from this list, and subsequent URLs are extracted from the data in those initial responses.
parse(): a method of the Spider. When called, the Response object generated for each downloaded initial URL is passed to it as its only argument. The method is responsible for parsing the response data, extracting data (producing items), and generating Request objects for URLs that need further processing.
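A minimal sketch along those lines (the class name and URL are just placeholders for this project) looks like this:

import scrapy

class QiushibaikeSpider(scrapy.Spider):
    name = 'qiushibaike'   # unique spider name
    start_urls = [
        'https://www.qiushibaike.com/text/page/1/',
    ]

    def parse(self, response):
        # parse the response and extract data here
        pass

Instead of start_urls, you can also generate the initial requests yourself by overriding start_requests, which is what we do here: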
def start_requests(self):
    urls = [
        'https://www.qiushibaike.com/text/page/1/',
        'https://www.qiushibaike.com/text/page/2/',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
In start_requests we yield Request objects one by one (the method returns a generator), and the callback=self.parse argument tells Scrapy which method to call with each response once it has been downloaded.
The callback looks like this:
def parse(self, response):
    page = response.url.split("/")[-2]
    filename = 'qiushi-%s.html' % page
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log('Saved file %s' % filename)
Saving
All this does is save the crawled HTML into local files. To run the spider with Scrapy, use the following commands:
cd qiushibaike/
scrapy crawl qiushibaike
Maoyan movies example
Reference: Python Scrapy crawler framework in practice (biancheng.net)
Finally, let's use the Scrapy framework to build a complete example: scraping the Maoyan Top100 movie chart.
1) Create the project
scrapy startproject Maoyan100
# enter the project directory
cd Maoyan100
# create the spider file; note that the url must be the site's domain
scrapy genspider maoyan www.maoyan.com
2) Define the data structure
First, define the data structure to be scraped in items.py, as shown below:
import scrapy

class Maoyan100Item(scrapy.Item):
    name = scrapy.Field()
    star = scrapy.Field()
    time = scrapy.Field()
3) Write the spider file
Next, write the spider file maoyan.py; the code is as follows:
import scrapy
from Maoyan100.items import Maoyan100Item

class Maoyan100Spider(scrapy.Spider):
    # name identifies the spider
    name = 'maoyan'
    allowed_domains = ['maoyan.com']  # site domain
    start_urls = ['https://maoyan.com/board/4?offset=0']  # first url to scrape
    offset = 0  # query-string parameter

    # response is the response object for the URLs in start_urls
    def parse(self, response):
        # base xpath: the list of dd nodes that hold the movie info
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        for dd in dd_list:
            # instantiate the Maoyan100Item class defined in items.py
            item = Maoyan100Item()
            item['name'] = dd.xpath('./a/@title').get().strip()  # .get() since Scrapy 1.6, formerly extract_first()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get().strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get().strip()
            yield item
        if self.offset < 90:  # keep paginating until offset reaches 90
            self.offset += 10
            url = 'https://maoyan.com/board/4?offset=' + str(self.offset)
            # hand the url to the scheduler's queue;
            # scrapy.Request() sends the request and passes the response to the parse() callback automatically
            yield scrapy.Request(url=url, callback=self.parse)
4) Implement data storage
Reference: Operating MySQL from Python with pymysql (吴国娣, CSDN blog)
We implement storage by writing the pipeline file pipelines.py, which saves the scraped data into a MySQL database. The database and table need to exist beforehand; they were created in an earlier chapter, but a quick sketch is included below for completeness.
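If the table does not exist yet, a minimal sketch with pymysql could create it like this (the database name, column names, and sizes are assumptions; the only requirement is three columns in the order the pipeline inserts them: name, star, time). The connection parameters mirror the constants defined in the settings section below.

import pymysql

# connect without selecting a database so it can be created first
conn = pymysql.connect(host='localhost', user='root', password='123456', charset='utf8')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS mydatabase CHARACTER SET utf8")
cur.execute("USE mydatabase")
cur.execute("""
    CREATE TABLE IF NOT EXISTS movieinfo (
        name VARCHAR(200),
        star VARCHAR(300),
        time VARCHAR(100)
    )
""")
conn.commit()
cur.close()
conn.close()

With the table in place, the pipeline code is as follows: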
import pymysql
from Maoyan100.settings import *

class Maoyan100Pipeline(object):
    def process_item(self, item, spider):
        print(item['name'], item['star'], item['time'])
        return item  # return the item so that later pipelines receive it too

# pipeline that stores the scraped data in MySQL
class Maoyan100MysqlPipeline(object):
    def open_spider(self, spider):
        # runs once when the spider starts: open the database connection
        # the constants below must be defined in the settings file
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset=MYSQL_CHARSET
        )
        self.cursor = self.db.cursor()

    # insert each item into the table
    def process_item(self, item, spider):
        ins = 'insert into movieinfo values(%s,%s,%s)'
        L = [
            item['name'], item['star'], item['time']
        ]
        self.cursor.execute(ins, L)
        self.db.commit()
        return item

    # runs once at the very end, after all data has been scraped
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print('close_spider() was called, the project has finished')
A common error
The traceback looks like this:
Traceback (most recent call last):
  File "C:\Users\韩东平\PycharmProjects\Maoyan100\Maoyan100\pipelines.py", line 10, in <module>
    from Maoyan100.settings import *
ModuleNotFoundError: No module named 'Maoyan100.settings'
This error means that the import statement in pipelines.py is failing: Python cannot find the Maoyan100.settings module.
To resolve it, try the following steps:
Make sure the Maoyan100 package is importable. Maoyan100 is the local project package, not something installed from PyPI, so in practice this usually means launching the crawl from the project root with scrapy crawl or run.py rather than executing pipelines.py on its own.
Check that settings.py sits in the correct directory inside the Maoyan100 package; it should live in the inner Maoyan100 folder, next to pipelines.py.
Double-check the import statement in pipelines.py; it should read from Maoyan100.settings import *.
If none of the above works, try adding the Maoyan100 project path to the PYTHONPATH environment variable, replacing /path/to/Maoyan100 with the actual path on your system:
export PYTHONPATH="/path/to/Maoyan100:$PYTHONPATH"
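An alternative that sidesteps the import entirely (a sketch, not part of the original tutorial) is to read the constants from the running crawler's settings inside open_spider, since every spider carries a settings attribute once the crawl starts:

import pymysql

class Maoyan100MysqlPipeline(object):
    def open_spider(self, spider):
        # read the MYSQL_* constants from Scrapy's settings instead of importing settings.py
        s = spider.settings
        self.db = pymysql.connect(
            host=s.get('MYSQL_HOST'),
            user=s.get('MYSQL_USER'),
            password=s.get('MYSQL_PWD'),
            database=s.get('MYSQL_DB'),
            charset=s.get('MYSQL_CHARSET')
        )
        self.cursor = self.db.cursor()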
5) Define a launcher file
Finally, define the project launcher file run.py; the code is as follows:
from scrapy import cmdline
# run the spider; -o specifies the output file format
cmdline.execute('scrapy crawl maoyan -o maoyan.csv'.split())  # run the project and export the data as a CSV file
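Then run it from the project root (the directory that contains scrapy.cfg):
python run.py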
Modify the configuration file
The settings.py that Scrapy generates by default looks like this:
# Scrapy settings for Maoyan100 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "Maoyan100"
SPIDER_MODULES = ["Maoyan100.spiders"]
NEWSPIDER_MODULE = "Maoyan100.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "Maoyan100 (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# "Maoyan100.middlewares.Maoyan100SpiderMiddleware": 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# "Maoyan100.middlewares.Maoyan100DownloaderMiddleware": 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# "Maoyan100.pipelines.Maoyan100Pipeline": 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
The following changes can be made in Scrapy's configuration file settings.py:
Add log output:
LOG_ENABLED = True        # enable logging
LOG_LEVEL = 'INFO'        # log level: DEBUG / INFO / WARNING / ERROR / CRITICAL
LOG_FILE = 'scrapy.log'   # log file
Activate the item pipelines:
ITEM_PIPELINES = {
    'Maoyan100.pipelines.Maoyan100Pipeline': 300,       # the lower the number, the earlier the pipeline runs
    'Maoyan100.pipelines.Maoyan100MysqlPipeline': 400,
}
Define the database constants (using the names the pipeline imports):
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'mydatabase'
MYSQL_CHARSET = 'utf8'   # should match the charset of your table
Other common options:
ROBOTSTXT_OBEY = True   # whether to obey robots.txt rules
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'   # user agent
DOWNLOAD_DELAY = 3   # download delay in seconds