Install the package:
pip install scrapy
Scrapy: a crawler framework. It wraps up the common plumbing for you; you only need to follow the framework's file layout and define what to scrape.
You can create a new crawler project from the command line; the project directory is generated automatically in Scrapy's standard structure.
Core component: the Scrapy engine, the heart of the framework, which coordinates all the other components.
Step 1: the Scrapy engine takes a seed URL; crawling starts from that URL.
Step 2: the engine hands the seed URL to the Scheduler, which decides which request is crawled first and which waits in line; in short, the queuing logic lives here.
Step 3: once the Scheduler decides to download a URL, the download task goes to the Downloader, which fetches the page source for that URL from the website and wraps it in a Response object.
Step 4: the Spider is where you define the crawling behaviour and how to analyse a page (or set of pages). When a download finishes, the Response object is passed to the Spider component, which extracts the text, links, etc. you want from the source (the parse logic lives here); the extracted strings are stored in an Item object (the storage-field object you define in the framework).
Step 5: the Item object holding the scraped data is sent to the Item Pipeline component, which persists it (to a file, a database, and so on).
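A minimal sketch of how those five steps map to code (names here are made up for illustration; the real project files are built step by step below). The spider supplies the seed URL and the parse logic, the framework handles scheduling and downloading, and the yielded item flows on to the pipeline:

# minimal single-file sketch (assumed names), run with: scrapy runspider demo_spider.py
import scrapy

class DemoItem(scrapy.Item):                 # steps 4/5: fields the spider fills and the pipeline stores
    URL = scrapy.Field()
    TITLE = scrapy.Field()

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://www.sohu.com/']   # step 1: seed URL handed to the engine/scheduler

    def parse(self, response):               # steps 3/4: downloader produced the response, spider parses it
        item = DemoItem()
        item['URL'] = response.url
        item['TITLE'] = response.xpath('//title/text()').get()
        yield item                           # step 5: the item goes on to the item pipeline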
1. Create a project directory
Create a directory named scrapy_crawler (here D:\usopp\python\scrapy_crawler). ## Typing cmd in the directory's address bar opens a command prompt at that location
2. Run the command scrapy startproject tutorial
This creates a crawler project named tutorial (you can pick any name) under scrapy_crawler.
## The Scripts directory (\Python\Python38\Scripts) must be added to the PATH environment variable first, or the scrapy command will not be found.
3. Enter tutorial and set the site to crawl
cd tutorial
scrapy genspider sohu www.sohu.com
This generates the spider file automatically under the spiders directory.
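The generated spiders/sohu.py starts out as a near-empty skeleton, roughly like this (the exact header varies with the Scrapy version); step 5 below replaces it with real parse logic:

import scrapy

class SohuSpider(scrapy.Spider):
    name = 'sohu'
    allowed_domains = ['www.sohu.com']
    start_urls = ['http://www.sohu.com/']

    def parse(self, response):
        pass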
Once these three steps are done, the project skeleton is in place; what remains is to implement the parsing and persistence logic yourself.
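For reference, the generated layout looks roughly like this (matching the key files listed in section 4 below):

scrapy_crawler\
└─ tutorial\
   ├─ scrapy.cfg
   └─ tutorial\
      ├─ __init__.py
      ├─ items.py
      ├─ middlewares.py
      ├─ pipelines.py
      ├─ settings.py
      └─ spiders\
         ├─ __init__.py
         └─ sohu.py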
How the work is split between the framework and you:
1. Get the seed URL (set via the command line above).
2. Download the URL and obtain the page source (done by the framework; nothing to write).
3. Extract the data you want from the source (implemented by you, in sohu.py); the data is saved into the designated fields (defined in items.py), and the code that fills the Item also lives in sohu.py.
4. The filled Items are sent to pipelines.py for persistence (the hand-off is done by the framework; you don't do it).
5. pipelines.py receives the Item with the scraped data; saving it to a file (or a database) is up to you.
6. The crawler settings (settings.py) are configured by you.
4. Purpose of the key files
D:\usopp\python\scrapy_crawler\tutorial\scrapy.cfg
The crawler project's configuration file.
D:\usopp\python\scrapy_crawler\tutorial\tutorial\items.py
!Must edit! Defines the field names the scraped data is stored under. For example, when crawling a news page you might want the title and the body, so you would define two fields here: title and body.
D:\usopp\python\scrapy_crawler\tutorial\tutorial\middlewares.py
!Usually unchanged! The crawler's middleware file; edit it only when you need to extend the crawler with custom middleware.
D:\usopp\python\scrapy_crawler\tutorial\tutorial\pipelines.py
!Optional! Where you, the programmer, persist the scraped data (save it to a file, a database, etc.).
D:\usopp\python\scrapy_crawler\tutorial\tutorial\settings.py
!Must edit! The crawler's settings file, with many options, e.g. crawl priorities and the number of concurrent requests.
D:\usopp\python\scrapy_crawler\tutorial\tutorial\spiders\sohu.py
The spider/parser file: it implements extracting the data you want from the page source, stores it into the fields defined in items.py, and hands the result to pipelines.py for saving.
5. Hands-on steps
1. Replace the contents of items.py with the following (what to store)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    URL = scrapy.Field()    # the current page's URL
    TITLE = scrapy.Field()  # the current page's <title>
    H1 = scrapy.Field()     # the first-level heading
    TEXT = scrapy.Field()   # the body text
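scrapy.Item instances behave like dicts, which is what the spider and pipeline below rely on. A quick interactive sketch (not part of the project files):

from tutorial.items import TutorialItem

item = TutorialItem()
item['TITLE'] = 'some title'    # assign a declared field
print('TEXT' in item)           # False until TEXT is assigned (sohu.py relies on this check)
print(dict(item))               # {'TITLE': 'some title'} -- how pipelines.py turns it into a dict
# item['BODY'] = '...'          # KeyError: only fields declared in items.py are allowed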
2. Edit sohu.py (how to crawl)
This file extracts the data you want from the page source, stores it into the fields defined in items.py, and sends it on to pipelines.py for saving.
# -*- coding: utf-8 -*-
import scrapy
import re, os
from tutorial.items import TutorialItem
from scrapy import Request

# Run the crawl from the project root directory: scrapy crawl sohu
class SohuSpider(scrapy.Spider):
    name = 'sohu'  # spider name
    # allowed_domains = ['www.sohu.com']  # if set, restricts crawling to this domain after the first page
    start_urls = ['https://nba.hupu.com/']  # seed URL

    def parse(self, response):  # response is the downloaded page object
        # grab the full page source and pull out every href attribute
        all_urls = re.findall('href="(.*?)"', response.xpath("/html").extract_first())
        for url in all_urls:  # iterate over every extracted link
            item = TutorialItem()
            if re.findall(r"(\.jpg)|(\.jpeg)|(\.gif)|(\.ico)|(\.png)|(\.js)|(\.css)$", url.strip()):
                pass  # drop static-resource links (images, js, css)
            elif url.strip().startswith("http") or url.strip().startswith("//"):  # keep links that start with http or //
                # ternary expression builds the full URL: strip spaces, and prepend 'http:' when the link starts with //
                temp_url = url.strip() if url.strip().startswith('http') else 'http:' + url.strip()
                item = self.get_all(item, response)
                # only yield when the item has a non-empty body and the page title is not empty
                if 'TEXT' in item and item['TEXT'] != '' and item['TITLE'] != '':
                    yield item  # send the item to the pipeline
                print('sending <' + temp_url + '> to the downloader')  # progress hint
                yield Request(temp_url, callback=self.parse)  # recurse so new URLs keep being downloaded

    def get_all(self, item, response):
        # helper: pull the four fields (URL, title, first h1, body text) into the dict-like item
        item['URL'] = response.url.strip()  # URL of the current response
        item['TITLE'] = response.xpath('/html/head/title/text()').extract()[0].strip()
        contain_h1 = response.xpath('//h1/text()').extract()  # all first-level headings on the page
        contain = contain_h1[0] if len(contain_h1) != 0 else ""  # take the first h1, if any
        item["H1"] = contain.strip()
        main_text = []
        # collect the text of every p tag and br tag on the page
        for tag in ['p', 'br']:
            sub_text = self.get_content(response, tag)
            main_text.extend(sub_text)
        # de-duplicate the body text and store it only if non-empty
        main_text = list(set(main_text))
        if len(main_text) != 0:
            item['TEXT'] = '\n'.join(main_text)
        return item

    def get_content(self, response, tag):
        # keep only text fragments longer than 10 characters
        main_text = []
        contexts = response.xpath('//' + tag + '/text()').extract()
        for text in contexts:
            if len(text.strip()) > 10:
                main_text.append(text.strip())
        return main_text
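As a side note, the href regex above misses relative links and patches up protocol-relative ones by hand. A more idiomatic alternative (a sketch, not the tutorial's version) is response.follow, which resolves relative and protocol-relative URLs for you:

# alternative sketch: let Scrapy resolve links itself
import scrapy
from tutorial.items import TutorialItem

class SohuFollowSpider(scrapy.Spider):       # hypothetical variant of SohuSpider
    name = 'sohu_follow'
    start_urls = ['https://nba.hupu.com/']

    def parse(self, response):
        item = TutorialItem()
        item['URL'] = response.url
        item['TITLE'] = (response.xpath('/html/head/title/text()').get() or '').strip()
        if item['TITLE']:
            yield item
        for href in response.css('a::attr(href)').getall():
            # response.follow resolves relative and // URLs automatically
            yield response.follow(href, callback=self.parse)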
3. Edit pipelines.py (how to store)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TutorialPipeline(object):
    def __init__(self):
        self.filename = open("content.txt", 'w', encoding="utf-8")
        self.contain = set()  # set used to de-duplicate URLs

    def process_item(self, item, spider):
        # serialize the item; ensure_ascii=False keeps Chinese text readable in the output
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        text_dict = eval(text)
        if text_dict['URL'] not in self.contain:  # only new, previously unseen pages are written to the file
            for _, targetName in text_dict.items():  # storage logic
                # only save pages whose fields contain the keyword "字母";
                # replace it with whatever keyword you actually care about
                if "字母" in targetName:
                    self.write_to_txt(text_dict)  # write this dict to the file
                    break  # avoid writing the same page more than once
            self.contain.add(text_dict['URL'])  # remember the URL so duplicates are filtered out later
        return item  # signals that the item has been processed

    def close_spider(self, spider):  # close the output file when the spider shuts down
        self.filename.close()

    def write_to_txt(self, text_dict):  # write the dict to the file
        # for each key/value pair, write key + "内容:\n" + value + '\n'
        for key, value in text_dict.items():
            self.filename.write(key + "内容:\n" + value + '\n')
        self.filename.write(50 * '=' + '\n')  # a separator line of 50 '=' characters after each page
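The json.dumps/eval round-trip above works but is roundabout; if you only want one JSON object per line, a simpler alternative (a hypothetical variant, not the tutorial's pipeline) could look like this:

import json

class JsonLinesPipeline(object):               # hypothetical alternative pipeline
    def open_spider(self, spider):
        self.file = open('content.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()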
4. Edit settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}
The number is the priority. This block must be uncommented, otherwise the pipeline is never invoked and nothing gets written.
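Lower numbers run first (the usual range is 0-1000); if, say, a second hypothetical pipeline were registered, the ordering would look like this:

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,     # runs first (lower number = higher priority)
    'tutorial.pipelines.JsonLinesPipeline': 800,    # hypothetical second pipeline, runs after
}

For reference, the complete settings.py after the change: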
# -*- coding: utf-8 -*-

# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
5. Run the crawl from cmd
scrapy crawl sohu, or scrapy crawl sohu -o items.json to also export the items in JSON format
The scraped results are saved to content.txt (written by the pipeline); with -o they are also exported to items.json.