Install the package:
pip install scrapy
Scrapy: a crawler framework. It wraps up the common plumbing for you; you only need to follow the framework's file layout and define what to scrape.
You can create a new crawler project from the command line; the project directory is generated automatically in Scrapy's standard structure.
Core component: the Scrapy engine, the heart of the framework, which coordinates all the other components.
Step 1: the Scrapy engine takes a seed URL; crawling starts from that URL.
Step 2: the engine hands the seed URL to the Scheduler, which decides which request is crawled first and which waits in line; in short, the queuing logic lives here.
Step 3: once the Scheduler decides to download a URL, the download task goes to the Downloader, which fetches the page source for that URL from the website and wraps it in a Response object.
Step 4: the Spider is where you define the crawling behaviour and how to analyse a page (or set of pages). When a download finishes, the Response object is passed to the Spider component, which extracts the text, links, etc. you want from the source (the parse logic lives here); the extracted strings are stored in an Item object (the storage-field object you define in the framework).
Step 5: the Item object holding the scraped data is sent to the Item Pipeline component, which persists it (to a file, a database, and so on).
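A minimal sketch of how those five steps map to code (names here are made up for illustration; the real project files are built step by step below). The spider supplies the seed URL and the parse logic, the framework handles scheduling and downloading, and the yielded item flows on to the pipeline:

# minimal single-file sketch (assumed names), run with: scrapy runspider demo_spider.py
import scrapy

class DemoItem(scrapy.Item):                 # steps 4/5: fields the spider fills and the pipeline stores
    URL = scrapy.Field()
    TITLE = scrapy.Field()

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://www.sohu.com/']   # step 1: seed URL handed to the engine/scheduler

    def parse(self, response):               # steps 3/4: downloader produced the response, spider parses it
        item = DemoItem()
        item['URL'] = response.url
        item['TITLE'] = response.xpath('//title/text()').get()
        yield item                           # step 5: the item goes on to the item pipeline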
1. Create a project directory
Create a directory named scrapy_crawler (here D:\usopp\python\scrapy_crawler). ## Typing cmd in the directory's address bar opens a command prompt at that location
2. Run the command scrapy startproject tutorial
This creates a crawler project named tutorial (you can pick any name) under scrapy_crawler.
## The Scripts directory (\Python\Python38\Scripts) must be added to the PATH environment variable first, or the scrapy command will not be found.
3. Enter tutorial and set the site to crawl
cd tutorial
scrapy genspider sohu www.sohu.com
This generates the spider file automatically under the spiders directory.
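The generated spiders/sohu.py starts out as a near-empty skeleton, roughly like this (the exact header varies with the Scrapy version); step 5 below replaces it with real parse logic:

import scrapy

class SohuSpider(scrapy.Spider):
    name = 'sohu'
    allowed_domains = ['www.sohu.com']
    start_urls = ['http://www.sohu.com/']

    def parse(self, response):
        pass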
Once these three steps are done, the project skeleton is in place; what remains is to implement the parsing and persistence logic yourself.
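For reference, the generated layout looks roughly like this (matching the key files listed in section 4 below):

scrapy_crawler\
└─ tutorial\
   ├─ scrapy.cfg
   └─ tutorial\
      ├─ __init__.py
      ├─ items.py
      ├─ middlewares.py
      ├─ pipelines.py
      ├─ settings.py
      └─ spiders\
         ├─ __init__.py
         └─ sohu.py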
How the work is split between the framework and you:
1. Get the seed URL (set via the command line above).
2. Download the URL and obtain the page source (done by the framework; nothing to write).
3. Extract the data you want from the source (implemented by you, in sohu.py); the data is saved into the designated fields (defined in items.py), and the code that fills the Item also lives in sohu.py.
4. The filled Items are sent to pipelines.py for persistence (the hand-off is done by the framework; you don't do it).
5. pipelines.py receives the Item with the scraped data; saving it to a file (or a database) is up to you.
6. The crawler settings (settings.py) are configured by you.
4. Purpose of the key files
D:\usopp\python\scrapy_crawler\tutorial\scrapy.cfg
The crawler project's configuration file.
D:\usopp\python\scrapy_crawler\tutorial\tutorial\items.py
!Must edit! Defines the field names the scraped data is stored under. For example, when crawling a news page you might want the title and the body, so you would define two fields here: title and body.
D:\usopp\python\scrapy_crawler\tutorial\tutorial\middlewares.py
!Usually unchanged! The crawler's middleware file; edit it only when you need to extend the crawler with custom middleware.
D:\usopp\python\scrapy_crawler\tutorial\tutorial\pipelines.py
!Optional! Where you, the programmer, persist the scraped data (save it to a file, a database, etc.).
D:\usopp\python\scrapy_crawler\tutorial\tutorial\settings.py
!Must edit! The crawler's settings file, with many options, e.g. crawl priorities and the number of concurrent requests.
D:\usopp\python\scrapy_crawler\tutorial\tutorial\spiders\sohu.py
The spider/parser file: it implements extracting the data you want from the page source, stores it into the fields defined in items.py, and hands the result to pipelines.py for saving.
5. Hands-on steps
1. Replace the contents of items.py with the following (what to store)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    URL = scrapy.Field()    # the current page's URL
    TITLE = scrapy.Field()  # the current page's <title>
    H1 = scrapy.Field()     # the first-level heading
    TEXT = scrapy.Field()   # the body text
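scrapy.Item instances behave like dicts, which is what the spider and pipeline below rely on. A quick interactive sketch (not part of the project files):

from tutorial.items import TutorialItem

item = TutorialItem()
item['TITLE'] = 'some title'    # assign a declared field
print('TEXT' in item)           # False until TEXT is assigned (sohu.py relies on this check)
print(dict(item))               # {'TITLE': 'some title'} -- how pipelines.py turns it into a dict
# item['BODY'] = '...'          # KeyError: only fields declared in items.py are allowed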
2. Edit sohu.py (how to crawl)
This file extracts the data you want from the page source, stores it into the fields defined in items.py, and sends it on to pipelines.py for saving.
# -*- coding: utf-8 -*-
import scrapy
import re, os
from tutorial.items import TutorialItem
from scrapy import Request

# Run the crawl from the project root directory: scrapy crawl sohu
class SohuSpider(scrapy.Spider):
    name = 'sohu'  # spider name
    # allowed_domains = ['www.sohu.com']  # if set, restricts crawling to this domain after the first page
    start_urls = ['https://nba.hupu.com/']  # seed URL

    def parse(self, response):  # response is the downloaded page object
        # grab the full page source and pull out every href attribute
        all_urls = re.findall('href="(.*?)"', response.xpath("/html").extract_first())
        for url in all_urls:  # iterate over every extracted link
            item = TutorialItem()
            if re.findall(r"(\.jpg)|(\.jpeg)|(\.gif)|(\.ico)|(\.png)|(\.js)|(\.css)$", url.strip()):
                pass  # drop static-resource links (images, js, css)
            elif url.strip().startswith("http") or url.strip().startswith("//"):  # keep links that start with http or //
                # ternary expression builds the full URL: strip spaces, and prepend 'http:' when the link starts with //
                temp_url = url.strip() if url.strip().startswith('http') else 'http:' + url.strip()
                item = self.get_all(item, response)
                # only yield when the item has a non-empty body and the page title is not empty
                if 'TEXT' in item and item['TEXT'] != '' and item['TITLE'] != '':
                    yield item  # send the item to the pipeline
                print('sending <' + temp_url + '> to the downloader')  # progress hint
                yield Request(temp_url, callback=self.parse)  # recurse so new URLs keep being downloaded

    def get_all(self, item, response):
        # helper: pull the four fields (URL, title, first h1, body text) into the dict-like item
        item['URL'] = response.url.strip()  # URL of the current response
        item['TITLE'] = response.xpath('/html/head/title/text()').extract()[0].strip()
        contain_h1 = response.xpath('//h1/text()').extract()  # all first-level headings on the page
        contain = contain_h1[0] if len(contain_h1) != 0 else ""  # take the first h1, if any
        item["H1"] = contain.strip()
        main_text = []
        # collect the text of every p tag and br tag on the page
        for tag in ['p', 'br']:
            sub_text = self.get_content(response, tag)
            main_text.extend(sub_text)
        # de-duplicate the body text and store it only if non-empty
        main_text = list(set(main_text))
        if len(main_text) != 0:
            item['TEXT'] = '\n'.join(main_text)
        return item

    def get_content(self, response, tag):
        # keep only text fragments longer than 10 characters
        main_text = []
        contexts = response.xpath('//' + tag + '/text()').extract()
        for text in contexts:
            if len(text.strip()) > 10:
                main_text.append(text.strip())
        return main_text
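As a side note, the href regex above misses relative links and patches up protocol-relative ones by hand. A more idiomatic alternative (a sketch, not the tutorial's version) is response.follow, which resolves relative and protocol-relative URLs for you:

# alternative sketch: let Scrapy resolve links itself
import scrapy
from tutorial.items import TutorialItem

class SohuFollowSpider(scrapy.Spider):       # hypothetical variant of SohuSpider
    name = 'sohu_follow'
    start_urls = ['https://nba.hupu.com/']

    def parse(self, response):
        item = TutorialItem()
        item['URL'] = response.url
        item['TITLE'] = (response.xpath('/html/head/title/text()').get() or '').strip()
        if item['TITLE']:
            yield item
        for href in response.css('a::attr(href)').getall():
            # response.follow resolves relative and // URLs automatically
            yield response.follow(href, callback=self.parse)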
3. Edit pipelines.py (how to store)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TutorialPipeline(object):
    def __init__(self):
        self.filename = open("content.txt", 'w', encoding="utf-8")
        self.contain = set()  # set used to de-duplicate URLs

    def process_item(self, item, spider):
        # serialize the item; ensure_ascii=False keeps Chinese text readable in the output
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        text_dict = eval(text)
        if text_dict['URL'] not in self.contain:  # only new, previously unseen pages are written to the file
            for _, targetName in text_dict.items():  # storage logic
                # only save pages whose fields contain the keyword "字母";
                # replace it with whatever keyword you actually care about
                if "字母" in targetName:
                    self.write_to_txt(text_dict)  # write this dict to the file
                    break  # avoid writing the same page more than once
            self.contain.add(text_dict['URL'])  # remember the URL so duplicates are filtered out later
        return item  # signals that the item has been processed

    def close_spider(self, spider):  # close the output file when the spider shuts down
        self.filename.close()

    def write_to_txt(self, text_dict):  # write the dict to the file
        # for each key/value pair, write key + "内容:\n" + value + '\n'
        for key, value in text_dict.items():
            self.filename.write(key + "内容:\n" + value + '\n')
        self.filename.write(50 * '=' + '\n')  # a separator line of 50 '=' characters after each page
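The json.dumps/eval round-trip above works but is roundabout; if you only want one JSON object per line, a simpler alternative (a hypothetical variant, not the tutorial's pipeline) could look like this:

import json

class JsonLinesPipeline(object):               # hypothetical alternative pipeline
    def open_spider(self, spider):
        self.file = open('content.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()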
4. Edit settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}
The number is the priority. This block must be uncommented, otherwise the pipeline is never invoked and nothing gets written.
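Lower numbers run first (the usual range is 0-1000); if, say, a second hypothetical pipeline were registered, the ordering would look like this:

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,     # runs first (lower number = higher priority)
    'tutorial.pipelines.JsonLinesPipeline': 800,    # hypothetical second pipeline, runs after
}

For reference, the complete settings.py after the change: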
# -*- coding: utf-8 -*-

# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
5. Run the crawl from cmd
scrapy crawl sohu, or scrapy crawl sohu -o items.json to also export the items in JSON format
The scraped results are saved to content.txt (written by the pipeline); with -o they are also exported to items.json.