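Crawler Diary - scrapy - day 1

Fetching web page content with scrapy

Contents

Creating the crawler project

Creating the template file

Running the scrapy spider

Running the scrapy spider from a .py file

Limiting scrapy's debug output

Getting the chapter URLs

Joining the URLs

Upload


Using the Biquge (笔趣阁) site as an example, this post collects the link to every chapter of a novel.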

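Creating the crawler project

First, create a new folder named scrapy.

Change into that folder and run

scrapy startproject biqugeSpider

to create the crawler project.

Once it finishes, the new biqugeSpider folder contains scrapy.cfg and an inner biqugeSpider package with items.py, middlewares.py, pipelines.py, settings.py and a spiders folder.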

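Creating the template file

From inside the biqugeSpider folder, run

scrapy genspider biquge <domain of the site to crawl>

to create a template spider file.

Edit start_urls in this template file; scrapy will fetch those URLs automatically and pass each response to parse.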

import scrapy


class BiqugeSpider(scrapy.Spider):
    name = 'biquge'
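    # Note: allowed_domains expects bare domains such as 'www.biquge.info'.
    # The URL form below is what triggers the URLWarning in the run log further down.
    allowed_domains = ['http://www.biquge.info/']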
    start_urls = ['http://www.biquge.info/10_10582/']

    def parse(self, response):
        print(response.text)
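
allowed_domains is meant to hold bare domain names rather than full URLs. As a minimal sketch, assuming you want to derive the domain from the start URL instead of typing it by hand, the standard library's urlparse can pull out the host part. The set comprehension and the sorting are illustrative choices, not something scrapy requires.

```python
from urllib.parse import urlparse

start_urls = ['http://www.biquge.info/10_10582/']

# urlparse(...).netloc is the host part of the URL, without scheme or path.
# A set removes duplicates when several start URLs share one host.
allowed_domains = sorted({urlparse(url).netloc for url in start_urls})

print(allowed_domains)
```

The resulting list can be pasted into the spider as the value of allowed_domains.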

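Running the scrapy spider

To test it, run the following from a terminal:

scrapy crawl biquge

This starts the spider. You can also start it from a .py file, which I personally find a little more convenient.

Running the scrapy spider from a .py file

In the folder that contains the spider template file, create run.py with the following content: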

from scrapy import cmdline

cmdline.execute('scrapy crawl biquge'.split())
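
cmdline.execute receives an argv-style list, which is why the command string is split before being passed in. The sketch below only illustrates the splitting step with the standard library; the -a title=... spider argument is a hypothetical example value, used to show where str.split stops being enough.

```python
import shlex

command = 'scrapy crawl biquge'
argv = command.split()
print(argv)

# Hypothetical spider argument containing spaces: str.split cuts it apart,
# while shlex.split honours the quotes the way a shell would.
command_with_arg = 'scrapy crawl biquge -a title="san cun ren jian"'
naive = command_with_arg.split()
safe = shlex.split(command_with_arg)
print(naive)
print(safe)
```

For the plain command used in run.py, split() is sufficient; shlex.split only matters once an argument contains spaces.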

Right-click run.py and run it to start the spider. Partial output:

D:\py\pythonw.exe "D:/PyCharm 2019.3.3/cx/scrapy-di/biqugeSpider/biqugeSpider/spiders/run.py"
2020-10-27 10:07:10 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: biqugeSpider)
2020-10-27 10:07:10 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-27 10:07:10 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-27 10:07:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'biqugeSpider',
 'NEWSPIDER_MODULE': 'biqugeSpider.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['biqugeSpider.spiders']}
2020-10-27 10:07:10 [scrapy.extensions.telnet] INFO: Telnet Password: 3453e37f2d419385
2020-10-27 10:07:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-10-27 10:07:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-27 10:07:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-27 10:07:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-27 10:07:11 [scrapy.core.engine] INFO: Spider opened
2020-10-27 10:07:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-27 10:07:11 [py.warnings] WARNING: D:\py\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://www.biquge.info/ in allowed_domains.
  warnings.warn(message, URLWarning)

2020-10-27 10:07:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-27 10:07:11 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.biquge.info/robots.txt> (referer: None)
2020-10-27 10:07:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.biquge.info/10_10582/> (referer: None)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<title>三寸人间最新章节列表_三寸人间最新章节目录_笔趣阁</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="keywords" content="三寸人间,三寸人间最新章节"/>
<meta name="description" content="三寸人间最新章节由网友提供,《三寸人间》情节跌宕起伏、扣人心弦,是一本情节与文笔俱佳的修真小说小说,笔趣阁免费提供耳根最新清爽干净的文字章节在线阅读."/>
<meta name="mobile-agent" content="format=html5;url=http://m.biquge.info/10_10582/">
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<meta property="og:type" content="novel"/>
<meta property="og:title" content="三寸人间"/>
<meta property="og:description" content="举头三尺无神明,掌心三寸是人间。这是耳根继《仙逆》《求魔》《我欲封天》《一念永恒》后,创作的第五部长篇小说《三寸人间》。"/>
<meta property="og:image" content="http://www.biquge.info/files/article/image/10/10582/10582s.jpg"/>
<meta property="og:novel:category" content="修真小说"/>
<meta property="og:novel:author" content="耳根"/>
<meta property="og:novel:book_name" content="三寸人间"/>
<meta property="og:novel:read_url" content="http://www.biquge.info/10_10582/"/>
<meta property="og:novel:status" content="连载中"/>
<meta property="og:novel:update_time" content="2020-10-26 19:30"/>
<meta property="og:novel:latest_chapter_name" content="第1165章 道,不同!"/>
<meta property="og:novel:latest_chapter_url" content="http://www.biquge.info/10_10582/21667214.html"/>
<link rel="stylesheet" type="text/css" href="/heibing/css/common.css"/>
<link rel="stylesheet" type="text/css" href="/heibing/css/list.css"/>
<script type="text/javascript" src="https://libs.baidu.com/jquery/1.4.2/jquery.min.js"></script>
<script language="javascript" type="text/javascript">var bookid = "10582"; var booktitle = "三寸人间";</script>
<script type="text/javascript" src="/images/bqg.js"></script>
<script type="text/javascript" src="/js/btn.js"></script>
</head>
<body>
<div id="wrapper">
<div class="ywtop">
    <div class="ywtop_con">
        <div class="ywtop_sethome"><a onclick="this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.biquge.info/');" href="#">将笔趣阁设为首页</a></div>
        <div class="ywtop_addfavorite"><a href="javascript:window.external.addFavorite('http://www.biquge.info/','笔趣阁_书友最值得收藏的网络小说阅读网')">收藏笔趣阁</a></div>
        <script language="javascript" src="/heibing/js/denglu.js"></script>
    </div>
</div>

<div class="header">
<div class="header_logo"><a href="http://www.biquge.info/">笔趣阁</a></div>
<script>list_panel();</script>
</div>
<div class="nav">
<ul>
<li><a href="http://www.biquge.info/">首页</a></li>
<li><a href="http://www.biquge.info/modules/article/bookcase.php">我的书架</a></li>
<li><a href="http://www.biquge.info/list/1_1.html">玄幻小说</a></li>
<li><a href="http://www.biquge.info/list/2_1.html">修真小说</a></li>
<li><a href="http://www.biquge.info/list/3_1.html">都市小说</a></li>
<li><a href="http://www.biquge.info/list/4_1.html">穿越小说</a></li>
<li><a href="http://www.biquge.info/list/5_1.html">网游小说</a></li>
<li><a href="http://www.biquge.info/list/6_1.html">科幻小说</a></li>
<li><a href="http://www.biquge.info/paihangbang_allvisit/1.html">排行榜单</a></li>
<li><a href="http://www.biquge.info/wanjiexiaoshuo/">全部小说</a></li>
</ul>
</div>
<div class="dahengfu"><script type="text/javascript">list1();</script></div>
<!--顶部广告-->
<div class="box_con">
<div class="con_top">
<div id="bdshare" class="bdshare_b" style="line-height: 12px;"></div>
<a href="http://www.biquge.info/">笔趣阁</a> &gt; <a href="http://www.biquge.info/list/2_1.html">修真小说</a> &gt; 三寸人间最新章节列表
</div>
  
<div id="maininfo">
<div id="info">
<h1>三寸人间</h1>
<p>作&nbsp;&nbsp;&nbsp;&nbsp;者:耳根</p>
<p>类&nbsp;&nbsp;&nbsp;&nbsp;别:修真小说</p>
<p>最后更新&nbsp;&nbsp;:2020-10-26 19:30:00</p>
<p>最&nbsp;&nbsp;&nbsp;&nbsp;新:<a href="21667214.html">第1165章 道,不同!</a></p>
<p>动&nbsp;&nbsp;&nbsp;&nbsp;作:<a href="Javascript:void(0);" onclick="javascript:addbookcase(10582);">加入书架</a>, <a href="Javascript:void(0);" onclick="javascript:vote(10582);">投推荐票</a>, <a href="#footer">直达底部</a></p>

</div>
<div id="intro">
<p>    举头三尺无神明,掌心三寸是人间。这是耳根继《仙逆》《求魔》《我欲封天》《一念永恒》后,创作的第五部长篇小说《三寸人间》。</p>
<p>本站提示:各位书友要是觉得《三寸人间》还不错的话请不要忘记向您QQ群和微博里的朋友推荐哦!</p>
</div>
</div>

<div id="sidebar">
<div id="fmimg">
<img width="120" height="150" src="http://www.biquge.info/files/article/image/10/10582/10582s.jpg" onerror="src='/modules/article/images/nocover.jpg'" alt="三寸人间"/>
<span class="b"></span>
</div>
</div>

<div id="listtj">&nbsp;推荐阅读:<a href="/3_3918/" target="_blank">沧元图</a>、<a href="/0_383/" target="_blank">元尊</a>、<a href="/10_10582/" target="_blank">三寸人间</a>、<a href="/10_10240/" target="_blank">凡人修仙传</a>、<a href="/22_22533/" target="_blank">凡人修仙传仙界篇</a>、<a href="/1_1055/" target="_blank">百炼成仙</a>、<a href="/31_31413/" target="_blank">最强反套路系统</a>、<a href="/74_74132/" target="_blank">我师兄实在太稳健了</a>、<a href="/63_63010/" target="_blank">仙子请自重</a>、<a href="/39_39024/" target="_blank">独步成仙</a>、<a href="/62_62245/" target="_blank">大奉打更人</a>、<a href="/10_10233/" target="_blank">遮天</a>、<a href="/0_329/" target="_blank">莽荒纪</a>、<a href="/3_3787/" target="_blank">仙路至尊</a>    </div>
</div>
<!--中部广告--> 
<script type="text/javascript">list2();</script>
<div class="box_con">
<div id="list">
<dl>


<dd><a href="5103237.html" title="写在连载前">写在连载前</a></dd>
<dd><a href="5103238.html" title="第一章 我要减肥!">第一章 我要减肥!</a></dd>
<dd><a href="4404954.html" title="第二章 王宝乐,你干了什么!">第二章 王宝乐,你干了什么!</a></dd>
<dd><a href="4410172.html" title="第三章 好同学,一切有我!">第三章 好同学,一切有我!</a></dd>
<dd><a href="4411056.html" title="第四章 飘渺道院">第四章 飘渺道院</a></dd>
<dd><a href="4411442.html" title="第五章 特招学子">第五章 特招学子</a></dd>
<dd><a href="4414780.html" title="第六章 麻烦大了">第六章 麻烦大了</a></dd>
<dd><a href="4415687.html" title="第七章 全民矿工">第七章 全民矿工</a></dd>
<dd><a href="4419569.html" title="第八章 才智与反击!">第八章 才智与反击!</a></dd>


(The output is very long, so part of it is omitted here.)

<dd><a href="21569594.html" title="第1155章 逆转裂月!">第1155章 逆转裂月!</a></dd>
<dd><a href="21586989.html" title="第1156章 尘青子的计划!">第1156章 尘青子的计划!</a></dd>
<dd><a href="21589074.html" title="第1157章 不对劲!">第1157章 不对劲!</a></dd>
<dd><a href="21590610.html" title="第1158章 谁是天道!">第1158章 谁是天道!</a></dd>
<dd><a href="21603992.html" title="第1159章 接人!">第1159章 接人!</a></dd>
<dd><a href="21606169.html" title="第1160章 幽冥星系!">第1160章 幽冥星系!</a></dd>
<dd><a href="21621213.html" title="第1161章 师兄的沉默!">第1161章 师兄的沉默!</a></dd>
<dd><a href="21623069.html" title="第1162章 归属感!">第1162章 归属感!</a></dd>
<dd><a href="21648412.html" title="第1163章 再看看吧!">第1163章 再看看吧!</a></dd>
<dd><a href="21665885.html" title="第1164章 逆流!">第1164章 逆流!</a></dd>
<dd><a href="21667214.html" title="第1165章 道,不同!">第1165章 道,不同!</a></dd>

</dl>
</div></div>
<!--底部广告-->
<div class="dahengfu"><script type="text/javascript">list3();</script></div>
<div id="footer" name="footer">
<div class="footer_link">
     新书推荐:
    <a href="/0_383/" target="_blank"><b>元尊</b></a>、
    <a href="/3_3918/" target="_blank"><b>沧元图</b></a>、
    <a href="/97_97216/" target="_blank"><b>枪定山河</b></a>、
    <a href="/97_97214/" target="_blank"><b>这个和尚会化缘</b></a>、
    <a href="/97_97195/" target="_blank"><b>神本凡欲</b></a>、
    <a href="/97_97188/" target="_blank"><b>重莲劫</b></a>、
    <a href="/97_97187/" target="_blank"><b>道为观止</b></a>、
    <a href="/97_97186/" target="_blank"><b>一蓑烟雨孤燕归</b></a>、
    <a href="/97_97185/" target="_blank"><b>剑透江湖</b></a>、
    <a href="/97_97180/" target="_blank"><b>道长有点仙</b></a>、
    <a href="/97_97179/" target="_blank"><b>道斩仙路</b></a>、
    <a href="/97_97173/" target="_blank"><b>清怜记</b></a>、
</div>
<div class="footer_cont">
<p>《三寸人间》情节跌宕起伏、扣人心弦,是一本情节与文笔俱佳的修真小说,笔趣阁转载收集三寸人间最新章节。</p>
<script type="text/javascript">footer();</script>
<div style="display:none"><script type="text/javascript">tongji();</script></div>
</div>
</div>
<script>recordedclick(10582);</script>
</body>
</html>

2020-10-27 10:07:13 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-27 10:07:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 457,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 29175,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 2.161882,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 27, 2, 7, 13, 296059),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 10, 27, 2, 7, 11, 134177)}
2020-10-27 10:07:13 [scrapy.core.engine] INFO: Spider closed (finished)

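As you can see, the spider prints its log messages together with the page content. This makes the output very long, and the result we actually want is buried in it.

Limiting scrapy's debug output

Open settings.py in the crawler project folder

and add

LOG_LEVEL = 'ERROR'  # only log messages at ERROR level or above
LOG_FILE = 'spider.log'  # write the log to this file

With these settings, error messages are saved to spider.log. If you leave out the second line, the log output is shown in the terminal instead.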
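
scrapy's logging is built on Python's standard logging module, so the filtering effect of LOG_LEVEL = 'ERROR' can be illustrated without scrapy. This is a minimal sketch under that assumption: the logger name loglevel_demo, the ListHandler class and the messages are all made up for the illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger('loglevel_demo')  # hypothetical logger name
logger.propagate = False                     # keep the demo isolated from the root logger
handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.ERROR)               # same threshold as LOG_LEVEL = 'ERROR'

logger.debug('Crawled (200) <GET ...>')      # dropped
logger.info('Spider opened')                 # dropped
logger.warning('allowed_domains warning')    # dropped
logger.error('download failed')              # kept

kept = handler.messages
print(kept)
```

Only the ERROR record survives. Output produced with print, such as print(response.text) in parse, does not go through logging and is not affected by these settings.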

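At this point we have the source of the page.

Getting the chapter URLs

scrapy supports XPath, so the content can be extracted directly with an XPath expression.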

import scrapy


class BiqugeSpider(scrapy.Spider):
    name = 'biquge'
    allowed_domains = ['www.biquge.info']  # bare domain, not a URL
    start_urls = ['http://www.biquge.info/10_10582/']

    def parse(self, response):
        url = response.xpath('//*[@id="list"]/dl/dd/a/@href')
        print(url)

Partial output

D:\py\pythonw.exe "D:/PyCharm 2019.3.3/cx/scrapy-di/biqugeSpider/biqugeSpider/spiders/run.py"
[<Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='5103237.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='5103238.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4404954.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4410172.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4411056.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4411442.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4414780.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4415687.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4419569.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4420520.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='4424625.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18725232.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18740568.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18743783.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18760437.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18764035.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18826021.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18829053.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18841392.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18844556.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18858608.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18861769.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18874143.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18876736.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18889707.html'>, <Selector xpath='//*

(part of the output omitted)

[@id="list"]/dl/dd/a/@href' data='18893195.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18957473.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18962641.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18980000.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='18984691.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19003254.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19007427.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19026936.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19032856.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19053495.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19057252.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19124370.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19129976.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19150095.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19152642.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19166865.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19170915.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19193264.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19199169.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19217600.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19220928.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19275589.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19280374.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19295135.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19298378.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19314268.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19317299.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19381230.html'>, <Selector 
xpath='//*[@id="list"]/dl/dd/a/@href' data='19384504.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19399790.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19403631.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19418639.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19421996.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19436078.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19439628.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19452873.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='19456871.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='21606169.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='21621213.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='21623069.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='21648412.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='21665885.html'>, <Selector xpath='//*[@id="list"]/dl/dd/a/@href' data='21667214.html'>]

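Process finished with exit code 0

As you can see, this is a list.

Looking at the output, the values we want are wrapped in Selector objects: each entry also prints its xpath, and the data shown in this printed form may be shortened. Adding getall() returns the matched values as plain strings.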

url = response.xpath('//*[@id="list"]/dl/dd/a/@href').getall()

Partial output

['5103237.html', '5103238.html', '4404954.html', '4410172.html', '4411056.html', '4411442.html', '4414780.html', '4415687.html', '4419569.html', '4420520.html', '4424625.html', '4426521.html', '4430493.html', '4431390.html', '4435310.html', '4436160.html', '4440062.html', '4440924.html', '4444653.html', '4445499.html', '4449209.html', '4450078.html', '4453700.html', '4454604.html', '4458295.html', '4459171.html', '4462967.html', '4463834.html', '4469919.html', '4470754.html', '4474809.html', '4475785.html', '4479271.html', '4480126.html', '4483634.html', '4484730.html', '4488235.html', '4489066.html', '4492562.html', '4493423.html', '4496935.html', '4497711.html', '4501293.html', '4501961.html', '4506353.html', '4507228.html', '4510805.html', '4511617.html', '4515268.html', '4515963.html', '4519468.html', '4520282.html', '4523829.html', '4524665.html', '4528177.html', '4528977.html', '4532706.html', '4533512.html', '4537068.html', '4537884.html', '4541483.html', '4542324.html', '4545901.html', '4546973.html', '4550480.html', '4551252.html', '4555239.html', '4555240.html', '4558975.html', '4559793.html', '4563283.html', '4564085.html', '4567544.html', '4568320.html', '4571602.html', '4572368.html', '4575689.html', '4576488.html', '4579746.html', '4580487.html', '4583806.html', '4584541.html', '4588022.html', '4589790.html', '4593146.html', '4593866.html', '4597111.html', '4597836.html', '4601214.html', '4601948.html', '4605134.html', '4607031.html', '4617162.html', '4619342.html', '4627413.html', '4630185.html', '4639777.html', '4641890.html', '4649823.html', '4651838.html', '4658543.html', '4661208.html', '4671736.html', '4673810.html', '4681148.html', '4683551.html', '4688482.html', '4689388.html', '4696514.html', '4697583.html', '4701669.html', '4702623.html', '4706551.html', '4707453.html', '4711655.html', '4712984.html', '4717542.html', '4718933.html', '4723615.html', '4724815.html', '4729720.html', '4730930.html', '4735906.html', '4737400.html', 
'4742619.html', '4743925.html', '4747089.html', '4746975.html', '4747090.html', '4747091.html', '4749069.html', '4750343.html', '4750347.html', '4756136.html', '4757538.html', '4762490.html', '4763826.html', '4769017.html', '4770393.html', '4770396.html', '4771382.html', '4775489.html', '4776667.html', '4781947.html', '4783247.html', '4788787.html', '4790155.html', '4795148.html', '4796454.html', '4801386.html', '4802803.html', '4808388.html', '4809658.html', '4814772.html', '4815986.html', '4821395.html', '4822618.html', '4827593.html', '4828864.html', '4829903.html', '4833681.html', '4834849.html', '4840105.html', '4841379.html', '4848211.html', '4855923.html', '4880406.html', '4888018.html', '4906587.html', '4909342.html', '4914463.html', '4915575.html', '4920662.html', '4921950.html', '4927516.html', '4929181.html', '4947255.html', '4953583.html', '4968295.html', '4974075.html', '4991578.html', '5003805.html', '5042249.html', '5052438.html', '5077905.html', '5084527.html', '5088662.html', '5110490.html', '5116655.html', '5138551.html', '5145745.html', '5157570.html', '5161794.html', '5180584.html', '5183794.html', '5199862.html', '5203531.html', '5209026.html', '5216561.html', '5220703.html', '5227721.html', '5239959.html', '5244902.html', '5260725.html', '5264214.html', '5279084.html', '5284052.html', '5297648.html', '5301383.html', '5319139.html', '5323393.html', '5350918.html', '5359954.html', '5376352.html', '5380771.html', '5396800.html', '5401792.html', '5413623.html', '5417873.html', '5429099.html', '5431507.html', '5454811.html', '5459288.html', '5479885.html', '5487694.html', '5508112.html', '5516250.html', '5532833.html', '5543693.html', '5564337.html', '5570220.html', '5585284.html', '5588888.html', '5599397.html', '5604608.html', '5636738.html', '5644345.html', '5670634.html', '5676381.html', '5710975.html', '5717433.html', '5739948.html', '5748098.html', '5762761.html', '5769123.html', '5792114.html', '5800725.html', '5820992.html', '5833448.html', 
'5862985.html', '5871998.html', '5902480.html', '5908242.html', '5936521.html', '5949642.html', '5972459.html', '5978190.html', '6002777.html', '6007512.html', '6026277.html', '6032156.html', '6053271.html', '6057436.html', '6073844.html', '6078897.html', '6102496.html', '6109417.html', '6130930.html', '6137663.html', '6161613.html', '6167034.html', '6185168.html', '6189578.html', '6206239.html', '6212650.html', '6228878.html', '6234109.html', '6256356.html', '6266786.html', '6289156.html', '6294180.html', '6320922.html', '6340582.html', '6363063.html', '6370024.html', '6397706.html', '6411648.html', '6442243.html', '6449406.html', '6467793.html', '6472387.html', '6492943.html', '6499533.html', '6528870.html', '6536922.html', '6567858.html', '6582656.html', '6636798.html', '6643250.html', '6658123.html', '6673257.html', '6706661.html', '6709562.html', '6721910.html', '6728795.html', '6757165.html', '6767427.html', '6816325.html', '6833806.html', '6887572.html', '6896767.html', '6927610.html', '6939541.html', '6981738.html', '6993066.html', '7017935.html', '7021805.html', '7043503.html', '7052413.html', '7087354.html', '7095216.html', '7120235.html', '7127697.html', '7149921.html', '7157590.html', '7176747.html', '7188415.html', '7206968.html', '7216493.html', '7232871.html', '7236483.html', '7251974.html', '7256696.html', '7273715.html', '7283662.html', '7308824.html', '7315267.html', '7338036.html', '7343323.html', '7361485.html', '7366189.html', '7384207.html', '7391453.html', '7408580.html', '7415907.html', '7437798.html', '7443771.html', '7466022.html', '7473125.html', '7492010.html', '7497296.html', '7515626.html', '7520486.html', '7538617.html', '7543702.html', '7567222.html', '7572093.html', '7598329.html', '7603690.html', '7621670.html', '7625406.html', '7639004.html', '7645806.html', '7663340.html', '7670127.html', '7692322.html', '7794726.html', '7825673.html', '7835436.html', '7860093.html', '7864682.html', '7884203.html', '7889978.html', '7907177.html', 
'7911488.html', '7936011.html', '7942582.html', '7956169.html', '7966378.html', '7989358.html', '7994149.html', '8010840.html', '8014903.html', '8030957.html', '8035472.html', '8056231.html', '8066657.html', '8082315.html', '8086161.html', '8103075.html', '8127440.html', '8159258.html', '8180070.html', '8199807.html', '8217202.html', '8240023.html', '8263523.html', '8292493.html', '8316456.html', '8340377.html', '8357387.html', '8385766.html', '8407737.html', '8435024.html', '8468277.html', '8498890.html', '8528417.html', '8562490.html', '8595420.html', '8618418.html', '8642598.html', '8668893.html', '8704790.html', '8733745.html', '8759621.html', '8783973.html', '8804645.html', '8824348.html', '8827298.html', '8841143.html', '8846732.html', '8865486.html', '8873446.html', '8892894.html', '8900353.html', '8918179.html', '8925605.html', '8941721.html', '8947196.html', '8967317.html', '8975309.html', '8995549.html', '9000929.html', '9018551.html', '9022624.html', '9045667.html', '9050602.html', '9075844.html', '9096995.html', '9118882.html', '9145651.html', '9173914.html', '9200491.html', '9222998.html', '9244134.html', '9250971.html', '9264704.html', '9269329.html', '9287420.html', '9297388.html', '9311370.html', '9319077.html', '9339940.html', ]

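Joining the URLs

The hrefs are relative, so prepend the address of the chapter list page using a list comprehension:

url_new = ['http://www.biquge.info/10_10582/'+ i for i in url]

Partial output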

['http://www.biquge.info/10_10582/5103237.html', 'http://www.biquge.info/10_10582/5103238.html', 'http://www.biquge.info/10_10582/4404954.html', 'http://www.biquge.info/10_10582/4410172.html', 'http://www.biquge.info/10_10582/4411056.html', 'http://www.biquge.info/10_10582/4411442.html', 'http://www.biquge.info/10_10582/4414780.html', 'http://www.biquge.info/10_10582/4415687.html', 'http://www.biquge.info/10_10582/4419569.html', 'http://www.biquge.info/10_10582/4420520.html', 'http://www.biquge.info/10_10582/4424625.html', 'http://www.biquge.info/10_10582/4426521.html', 'http://www.biquge.info/10_10582/4430493.html', 'http://www.biquge.info/10_10582/4431390.html', 'http://www.biquge.info/10_10582/4435310.html', 'http://www.biquge.info/10_10582/4436160.html', 'http://www.biquge.info/10_10582/4440062.html', 'http://www.biquge.info/10_10582/4440924.html', 'http://www.biquge.info/10_10582/4444653.html', 'http://www.biquge.info/10_10582/4445499.html', 'http://www.biquge.info/10_10582/4449209.html', 'http://www.biquge.info/10_10582/4450078.html', 'http://www.biquge.info/10_10582/4453700.html', 'http://www.biquge.info/10_10582/4454604.html', 'http://www.biquge.info/10_10582/4458295.html', 'http://www.biquge.info/10_10582/4459171.html', 'http://www.biquge.info/10_10582/4462967.html', 'http://www.biquge.info/10_10582/4463834.html', ]
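
String concatenation works here because every href is a bare file name. As a hedged alternative, the standard library's urljoin resolves relative links the way a browser does, which also covers root-relative links such as the /0_383/ entries in the page's recommendation list. Inside a spider, response.urljoin(href) does the same using the response's own URL as the base. The variable names below are illustrative.

```python
from urllib.parse import urljoin

base_url = 'http://www.biquge.info/10_10582/'
hrefs = ['5103237.html', '5103238.html', '4404954.html']

# Relative file names are resolved against the directory of base_url.
url_new = [urljoin(base_url, href) for href in hrefs]
print(url_new)

# A root-relative link replaces the whole path instead of being appended.
other_book = urljoin(base_url, '/0_383/')
print(other_book)
```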

Chapter titles can be extracted the same way.
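
In the spider, the titles would come from the same XPath with text() in place of @href, for example response.xpath('//*[@id="list"]/dl/dd/a/text()').getall(). To show the pairing of titles and links without needing scrapy installed, the sketch below swaps in the standard library's xml.etree.ElementTree on a small fragment copied from the page source above. This is a stand-in for illustration, not scrapy's selector API, and it only works because the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Two entries taken from the chapter list in the page source.
fragment = '''
<div id="list">
<dl>
<dd><a href="5103237.html" title="写在连载前">写在连载前</a></dd>
<dd><a href="5103238.html" title="第一章 我要减肥!">第一章 我要减肥!</a></dd>
</dl>
</div>
'''.strip()

root = ET.fromstring(fragment)    # root is the <div id="list"> element
links = root.findall('./dl/dd/a')  # same dl/dd/a path as the XPath in the spider

hrefs = [a.get('href') for a in links]
titles = [a.text for a in links]
chapters = list(zip(titles, hrefs))
print(chapters)
```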

Once the crawl is done, the data can be saved to a MySQL database or to local files.
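
As a minimal sketch of the local-file option, assuming the chapters have already been collected as a list of dicts, the standard json module can write them out and read them back. The file name chapters.json, the temp-directory location and the dict keys are hypothetical choices for this example.

```python
import json
import os
import tempfile

chapters = [
    {'title': '写在连载前', 'url': 'http://www.biquge.info/10_10582/5103237.html'},
    {'title': '第一章 我要减肥!', 'url': 'http://www.biquge.info/10_10582/5103238.html'},
]

# Hypothetical output location; any writable path works.
output_path = os.path.join(tempfile.gettempdir(), 'chapters.json')

# ensure_ascii=False keeps the Chinese titles readable in the file.
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(chapters, f, ensure_ascii=False, indent=2)

# Read the file back to confirm the round trip.
with open(output_path, encoding='utf-8') as f:
    loaded = json.load(f)

print(len(loaded))
```

If parse yields dicts instead of printing them, scrapy's built-in feed export (scrapy crawl biquge -o chapters.json) can produce a similar file without any extra code.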

Upload

The next post will cover saving the data to a database.
