1. A brief description of distributed web-crawler data collection and storage with Scrapy, and distributed deployment with Gerapy

Scrapy framework diagram:
![](https://i-blog.csdnimg.cn/blog_migrate/91e91a51f1b822634cdf34f44de789ad.png)

- Engine: the component in the middle of the diagram. It handles the data flow and events of the whole system and is the core of the framework — effectively its central processor (like a human brain), responsible for routing data and driving the processing logic.
- Item: an abstract data structure, which is why it does not appear in the diagram. It defines the structure of the crawl results; scraped data is assigned to Item objects. Each Item is a class whose fields describe one crawled record, i.e. it specifies the storage format of the scraped data.
- Scheduler: the component at the bottom of the diagram. It accepts Requests from the Engine, puts them into a queue, and hands Requests back to the Engine for the Downloader to execute. It maintains the scheduling policy for Requests, such as FIFO, LIFO, or priority queues.
- Spiders: the component at the top of the diagram. "Spiders" is a collective plural covering any number of Spider classes. Each Spider defines the crawl logic for a site and the parsing rules for its pages; it parses responses and produces Items and new Requests, which are sent back to the Engine for further processing.
- Downloader: the component on the right of the diagram. It sends requests to the target server and receives the responses, which are then passed back to the Engine.
- Item Pipelines: the component on the left of the diagram, also a collective plural covering multiple Item Pipeline classes. An Item Pipeline processes the Items extracted by the Spiders — cleaning, validating, and storing them, for example normalizing certain fields or writing Items to a database.
- Downloader Middlewares: the blocks between the Engine and the Downloader, again a collective plural. They form a hook framework between the Engine and the Downloader and process the requests and responses that pass between them.
- Spider Middlewares: the blocks between the Engine and the Spiders. They form a hook framework between the Engine and the Spiders and process the Items, requests, and responses that pass between them.

A minimal sketch tying these components together is given right after the setup steps below.

(1) Preparation: create a Python 3.8 virtual environment py38, manually create a top-level folder gerapy for the distributed crawler project, activate the environment on the command line, and install the Scrapy framework, the Gerapy web environment for distributed crawling, plus scrapyd and gerapy_auto_extractor:

# create the virtual environment
conda create --name py38 python==3.8
# switch to the project directory, then activate the environment
conda activate py38
# install the third-party dependencies
pip install scrapy
pip install gerapy
pip install scrapyd
pip install gerapy_auto_extractor

(2) Create the project myproject and generate the spider news:
scrapy startproject myproject
scrapy genspider news ""

(3) Configure the environment and open the created folder in PyCharm.
![](https://i-blog.csdnimg.cn/blog_migrate/8a441f71699d731036ee483ccc93fb92.png)
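Before configuring the generated files, here is a minimal, self-contained sketch of how the components described above cooperate. It is not the project's own spider — the spider name and URL are placeholders — but it shows the flow: the spider yields Requests (queued by the Scheduler and fetched by the Downloader via the Engine) and Items (handed by the Engine to the Item Pipelines).

```python
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"                          # placeholder spider name
    start_urls = ["https://example.com"]   # placeholder start URL

    def parse(self, response):
        # The Downloader has fetched the page; the Engine delivers the Response here.
        # Yielding a dict (or an Item) sends it through the Engine to the Item Pipelines.
        yield {"title": response.css("title::text").get()}
        # Yielding a Request sends it back through the Engine to the Scheduler,
        # passing through the Downloader Middlewares before it is fetched.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```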
- items.py: defines the fields under which the scraped information is stored in the database. First import the scrapy package so the class definition below can build on the framework, then define the item class MyprojectItem with fields for the article title, article URL, publication date, article body, news category, site name, channel name, student ID, and student name. This lays the groundwork for the fields scraped later.
- items.py configuration:
import scrapy

class MyprojectItem(scrapy.Item):
    title = scrapy.Field()          # news title
    url = scrapy.Field()            # original article URL
    date = scrapy.Field()           # publication date
    content = scrapy.Field()        # article body
    news_type = scrapy.Field()      # news category
    web_name = scrapy.Field()       # website name
    channel_name = scrapy.Field()   # channel name
    id_student = scrapy.Field()     # student ID
    name_student = scrapy.Field()   # student name

- middlewares.py: rotates the browser header. First take the generated middlewares module and append a header-rotation class to it. Import UserAgentMiddleware from scrapy's useragent module, load the project settings, and define a class based on them. The class reads the user-agent list from the settings, picks one entry at random, and puts it into the request's User-Agent field. Finally the class is enabled in the settings so the engine uses it during crawling.
- middlewares.py configuration:
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class MyprojectSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info("Spider opened: %s" % spider.name)
class MyprojectDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info("Spider opened: %s" % spider.name)
# Custom header-rotation and proxy classes
import random
import sys
import requests
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from scrapy.utils.project import get_project_settings
settings = get_project_settings()

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def process_request(self, request, spider):
        referer = request.url
        if referer:
            request.headers["referer"] = referer
        USER_AGENT_LIST = settings.get('USER_AGENT_LIST')
        user_agent = random.choice(USER_AGENT_LIST)
        if user_agent:
            request.headers.setdefault('user-Agent', user_agent)
            print(f"user-Agent:{user_agent}")

# Random IP-proxy rotation class (adapt the fetching and rotation logic to your actual proxy provider)
sys.path.append('.')

class MyProxyMiddleware(object):
    def process_request(self, request, spider):
        url = "put your purchased proxy API endpoint here; its response is parsed and the proxy is used for access"
        html = requests.get(url).text
        ip_list = html.split("\r\n")[:-1]
        proxy = random.choice(ip_list)
        request.meta['proxy'] = 'http://' + proxy

- pipelines.py: first import the pymongo package and load the host, port, database, and collection settings from the project configuration. Then define a class that connects to MongoDB. After connecting to the target database and collection, convert the scraped item into a dict with dict() and insert it into the collection. If the database or collection does not exist yet, MongoDB creates it automatically before the write; finally the item is returned.
- pipelines.py configuration:
from itemadapter import ItemAdapter
import pymongo
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
class MyprojectPipeline:
    # replace the entire body of the generated class with the following
def __init__(self):
host = settings["MONGODB_HOST"]
port = settings["MONGODB_PORT"]
dbname = settings["MONGODB_DATABASE"]
sheetname = settings["MONGODB_SHEETNAME"]
client = pymongo.MongoClient(host=host, port=port)
mydb = client[dbname]
self.post = mydb[sheetname]
def process_item(self, item, spider):
data = dict(item)
self.post.insert_one(data)
return item

- settings.py configuration:
# Scrapy settings for myproject project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "myproject"
SPIDER_MODULES = ["myproject.spiders"]
NEWSPIDER_MODULE = "myproject.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "myproject (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# "myproject.middlewares.MyprojectSpiderMiddleware": 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
"myproject.middlewares.MyprojectDownloaderMiddleware": 543,
'myproject.middlewares.RotateUserAgentMiddleware': 400,  # priority of the header-rotation middleware
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"myproject.pipelines.MyprojectPipeline": 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = "httpcache"
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
# Added: browser User-Agent pool; extend the list if more entries are needed
USER_AGENT_LIST = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
# Added: MongoDB storage settings
MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
MONGODB_DATABASE = "NewsData"
MONGODB_SHEETNAME = "NewsData"

- The settings name the bot myproject and declare the spider module and the template path for new spiders, so files generated from the template are stored there. ROBOTSTXT_OBEY = False means the crawler does not honor the target sites' robots.txt. The middlewares are then activated by adding them to DOWNLOADER_MIDDLEWARES with priority 543 (and the header-rotation middleware with priority 400), the pipeline is registered in ITEM_PIPELINES with priority 300, and the browser client strings are stored in USER_AGENT_LIST. Finally the MongoDB storage settings are added: host localhost, port 27017, database NewsData, collection NewsData. A quick pymongo check of these storage settings is sketched below.
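To confirm that the MongoDB settings above actually receive data, a short pymongo check can be run after a crawl. This is a minimal sketch, assuming a local mongod on the default port and the database/collection names configured above:

```python
import pymongo

# Use the same values as MONGODB_HOST / MONGODB_PORT / MONGODB_DATABASE / MONGODB_SHEETNAME in settings.py.
client = pymongo.MongoClient(host="localhost", port=27017)
collection = client["NewsData"]["NewsData"]

# Each scraped Item is inserted as one document by MyprojectPipeline.
print("documents stored:", collection.count_documents({}))
print("sample document:", collection.find_one())
```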
(4) Writing the spider: edit the generated news.py. The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from myproject.items import MyprojectItem
from urllib import parse
from gerapy_auto_extractor.extractors import *
from bs4 import BeautifulSoup
class NewsSpider(scrapy.Spider):
name = "news"
allowed_domains = []
def start_requests(self):
data_list = [
# ["能源行业信息", "中国煤炭市场", "煤炭资讯-新闻资讯", "https://www.cctd.com.cn/list-10-1.html"],
#["能源行业信息", "中国煤炭市场", "煤炭资讯-资讯中心", "https://www.cctd.com.cn/list-9-1.html"],
# ["蓟州新闻", "天津市蓟州区人民政府", "新闻中心-蓟州新闻", "https://www.tjjz.gov.cn/xwzx/spzx/"],
#["公示公告","天津市蓟州区人民政府","新闻中心-公式公告","https://www.tjjz.gov.cn/xwzx/GSGG20201207/"]
#["今日关注","天津市西青区人民政府","新闻中心-今日关注","https://www.tjxq.gov.cn/xwzx/jrgz/"]
#["西青要闻","天津市西青区人民政府","新闻中心-西青要闻","https://www.tjxq.gov.cn/xwzx/xqyw/"]
#["公示公告","天津市西青区人民政府","新闻中心-公式公告","https://www.tjxq.gov.cn/xwzx/gsgg/"]
#["区内要闻","常州国家高新区管委会(新北区人民政府)","新视界-区内要闻","https://www.cznd.gov.cn/class/EFBJILJD"]
#["部门动态信息","常州国家高新区管委会(新北区人民政府)","新视界-部门动态","https://www.cznd.gov.cn/class/EFBLILJE"]
#["基层动态信息","常州国家高新区管委会(新北区人民政府)","新视界-基层动态","https://www.cznd.gov.cn/class/EFBOILJF"]
#["新北区政府信息","常州国家高新区管委会(新北区人民政府)","新视界-省政府信息","https://www.cznd.gov.cn/class/QLQPQMQE"]
#["今日重庆","重庆市人民政府","要闻动态-今日重庆","http://www.cq.gov.cn/ywdt/jrcq/"]
#["重庆政务活动信息","重庆市人民政府","要闻动态-政务活动","http://www.cq.gov.cn/ywdt/zwhd/"]
#["重庆重大信息转载","重庆市人民政府","要闻动态-重大信息转载","http://www.cq.gov.cn/ywdt/zdzz/"]
#["回应信息","重庆市人民政府","政务公开-政府信息公开目录-回应关切","http://www.cq.gov.cn/zwgk/zfxxgkml/hygq/"]
#["党中央精神","北京市人民政府门户网站","要闻动态-党中央精神","https://www.beijing.gov.cn/ywdt/dzyjs/index.html"]
#["北京要闻","北京市人民政府门户网站","要闻动态-北京要闻","https://www.beijing.gov.cn/ywdt/yaowen/"]
#["中央部委动态","北京市人民政府门户网站","要闻动态-中央部委动态","https://www.beijing.gov.cn/ywdt/zybwdt/"]
#["北京各区热点","北京市人民政府门户网站","要闻动态-各区热点","https://www.beijing.gov.cn/ywdt/gqrd/"]
#["北京回应关切","北京市人民政府门户网站","政务公开-回应关切","https://www.beijing.gov.cn/gongkai/hygq/"]
#["上海要闻","上海市人民政府","要闻动态-上海要闻","https://www.shanghai.gov.cn/nw4411/index.html"]
["今日关注","天津市西青区人民政府","新闻中心-今日关注","https://www.tjxq.gov.cn/xwzx/jrgz/"]
]
for data in data_list:
# url = data[3]
yield scrapy.Request(
url=data[3], meta={'data': data}, callback=self.parse_static
)
def parse_static(self, response):
title_list = response.xpath(
'//div[@class="mainbg fontyh"]/div[@class="w1200 auto maincon"]/div[@class="common_list"]/ul/li/a/text()').extract()
url_list = response.xpath(
'//div[@class="mainbg fontyh"]/div[@class="w1200 auto maincon"]/div[@class="common_list"]/ul/li/a/@href').extract()
date_list = response.xpath(
'//div[@class="mainbg fontyh"]/div[@class="w1200 auto maincon"]/div[@class="common_list"]/ul/li/span[@class="date"]/text()').extract()
#title_list = response.xpath('//td[@style="padding-left: 10px"]/li/a/text()').extract()
#url_list = response.xpath('//td[@style="padding-left: 10px"]/li/a/@href').extract()
# date_list = response.xpath('//td[@align="center"]/text()').extract()
# print(title_list)
# print(url_list)
# print(date_list)
for n in range(len(title_list)):
item = MyprojectItem()
#item['title'] = title_list[n]
#item['date'] = ''
item['id_student'] = '20215680'
item['name_student'] = '张岚'
item['web_name'] = response.meta["data"][1]
item["news_type"] = response.meta["data"][0]
item['channel_name'] = response.meta["data"][2]
item['url'] = parse.urljoin(response.url, url_list[n])
# print(item)
yield scrapy.Request(item['url'], callback=self.parse_detail, meta={'item': item})
# data_list = extract_list(response.text)
# # print(data_list)
# for data_dict in data_list:
# """
# {
# "title":"xxxxx", "url":"xxxxxx"
# }
# """
# item = MyprojectItem()
# item['title'] = data_dict["title"]
# item['url'] = parse.urljoin(response.url, data_dict["url"])
# item['date'] = data_dict.get("date", "")
# item["news_type"] = response.meta["data"][0]
# item['web_name'] = response.meta["data"][1]
# item['channel_name'] = response.meta["data"][2]
# item['id_student'] = '123456'
# item['name_student'] = '张三'
# yield scrapy.Request(item['url'], callback=self.parse_detail, meta={'item': item})
# the article body of each detail page is extracted in parse_detail below
def parse_detail(self, response):
item = response.meta['item']
soup = BeautifulSoup(response.text, 'lxml')
#item["date"] = soup.find("div", class_="time").find_all("li")[1].text.replace("<li>发布时间:", "")
item["content"] = soup.find("div", class_="mainbg").text
print(item["content"])
# detail = extract_detail(response.text)
# # print(detail)
# """
# {
# "content":"xxxxx","date":"xxxxx"
# }
# """
# item["date"] = detail["datetime"]
yield item
(5) Start the crawler: run scrapy crawl news from the command line.

(6) Gerapy distributed deployment (open a cmd window in a new folder):

1) Create the external management environment: initialize the workspace and enter the gerapy folder.
   gerapy init: initialize the gerapy workspace folder;
   gerapy migrate: run the database migrations;
   gerapy initadmin: create the admin account and show the username and password.
2) Start the service with gerapy runserver 0.0.0.0:8000, open 127.0.0.1:8000 in a browser, log in with the username and password, and enter the management page. With scrapyd running, go to host management and create a host with IP 127.0.0.1 and port 6800 (a minimal sketch for checking that the scrapyd node is reachable follows this list).
3) Start scrapyd from the command line.
4) Put the crawler project into gerapy's projects folder, then use project management to pack and deploy it. In task management, enter the project name and scheduling method, and confirm the crawl time and interval.
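Before adding the host in Gerapy, it can help to confirm that the scrapyd node is actually reachable. A minimal sketch, assuming scrapyd is running locally on its default port 6800:

```python
import requests

SCRAPYD = "http://127.0.0.1:6800"  # same IP/port used when creating the host in Gerapy

# scrapyd exposes a small JSON API: daemonstatus.json reports whether the node is up,
# and listprojects.json lists the projects that have been deployed to it.
print(requests.get(f"{SCRAPYD}/daemonstatus.json").json())
print(requests.get(f"{SCRAPYD}/listprojects.json").json())
```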
- Running the task shows that the crawl succeeded.
2. Full Spider code in Scrapy, with explanations

Build the crawler skeleton and import the required packages:

# -*- coding: utf-8 -*-
import scrapy
from myproject.items import MyprojectItem  # import the configured items.py to get the field definitions
from urllib import parse
from gerapy_auto_extractor.extractors import *
from bs4 import BeautifulSoup

Name the spider and define the function that feeds in the target URLs; a for loop fills the item dictionary with the fields we want to scrape:

class NewsSpider(scrapy.Spider):
    name = "news"
    allowed_domains = []

    def start_requests(self):
        data_list = [
            ["能源行业信息", "中国煤炭市场", "煤炭资讯-新闻资讯", "https://www.cctd.com.cn/list-10-1.html"],
        ]
        for data in data_list:
            yield scrapy.Request(
                url=data[3], meta={'data': data}, callback=self.parse_static
            )

Use the Request callback with XPath pattern matching to extract the url, date, title and other elements from the target page:

    def parse_static(self, response):
        title_list = response.xpath(
            '//div[@class="viewport"]/div[@class="wrap"]/div[@class="containner clearfix"]/div[@class="right_content fr"]/ul[@class="news_list news_list4"]/li/a/text()').extract()
        url_list = response.xpath(
            '//div[@class="viewport"]/div[@class="wrap"]/div[@class="containner clearfix"]/div[@class="right_content fr"]/ul[@class="news_list news_list4"]/li/a/@href').extract()
        date_list = response.xpath(
            '//div[@class="viewport"]/div[@class="wrap"]/div[@class="containner clearfix"]/div[@class="right_content fr"]/ul[@class="news_list news_list4"]/li/span[@class="time"]/text()').extract()

Parse the list and fill in the student ID and name:

        for n in range(len(title_list)):
            item = MyprojectItem()
            #item['title'] = title_list[n]
            #item['date'] = ''
            item['id_student'] = '20215680'
            item['name_student'] = '张岚'
            item['web_name'] = response.meta["data"][1]
            item["news_type"] = response.meta["data"][0]
            item['channel_name'] = response.meta["data"][2]
            item['url'] = parse.urljoin(response.url, url_list[n])
            # print(item)
            yield scrapy.Request(item['url'], callback=self.parse_detail, meta={'item': item})

Follow each link in a second loop to fetch the article body, write it into the item, and yield the scraped record:

    def parse_detail(self, response):
        item = response.meta['item']
        soup = BeautifulSoup(response.text, 'lxml')
        #item["date"] = soup.find("div", class_="time").find_all("li")[1].text.replace("<li>发布时间:", "")
        item["content"] = soup.find("div", class_="containner").text
        print(item["content"])
        yield item

When crawling other sites, update the entries in data_list, the XPath expressions used for matching, and the path used to extract the article body.

3. MongoDB storage results (screenshot) and explanation

![](https://i-blog.csdnimg.cn/blog_migrate/b58861b9a100820a0a8631d6298124a2.png)

Seven sites were selected — 中国煤炭市场网, 天津市蓟州区人民政府, 天津市西青区人民政府, 常州国家高新区管委会(新北区人民政府), 重庆市人民政府, 北京市人民政府门户网站, and 上海市人民政府 — for a total of 405 records, each containing the student ID, name, website name, news category, channel name, URL, and article body.

4. Gerapy deployment: project deployment, task list, and the result of a single task run (screenshots)

Host management: click Create.
![](https://i-blog.csdnimg.cn/blog_migrate/b9048c2ea3252343c79c1299e4e30471.png)

Project deployment: copy the myproject folder into the projects folder under gerapy, refresh the page in the browser, then pack and deploy the project (what the scheduler then does when a task fires is sketched after the screenshots below).
![](https://i-blog.csdnimg.cn/blog_migrate/6c0c045654e47a1b54d9094cac401769.png)

Task list:
![](https://i-blog.csdnimg.cn/blog_migrate/5af2bfb895b4d1e8e7f2761a000ea1ee.png)

Task execution result:
![](https://i-blog.csdnimg.cn/blog_migrate/896a68f7d33e7232158b049eba65199e.png)
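For reference, when a Gerapy task fires it drives the scrapyd node through scrapyd's HTTP API, which is roughly equivalent to calling the schedule endpoint directly. A minimal sketch, assuming the local scrapyd node and the project/spider names used above (myproject / news):

```python
import requests

# Trigger one run of the deployed spider on the scrapyd node.
resp = requests.post(
    "http://127.0.0.1:6800/schedule.json",
    data={"project": "myproject", "spider": "news"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."} when the job has been queued
```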