Scrapy Crawler Beginner Tutorial: Installation and Basic Usage

<div class="markdown_views">

<p><a href="http://blog.csdn.net/inke88/article/details/59761696" target="_blank">Python version management: pyenv and pyenv-virtualenv</a></p>




<p><strong>Development environment:</strong> <br>
<code>Python 3.6.0</code> (latest at the time of writing) <br>
<code>Scrapy 1.3.2</code> (latest at the time of writing)</p>


<p></p><div class="toc">
<ul>
<li><ul>
<li><ul>
<li><a href="#scrapy安装" target="">Installing Scrapy</a></li>
<li><a href="#创建项目" target="">Creating a Project</a></li>
<li><a href="#如何运行我们爬虫" target="">How to Run Our Spider</a></li>
<li><a href="#提取数据" target="">Extracting Data</a><ul>
<li><a href="#css选择元素" target="">Selecting Elements with CSS</a></li>
<li><a href="#提取标题" target="">Extracting the Title</a></li>
<li><a href="#xpath选择元素" target="">Selecting Elements with XPath</a></li>
<li><a href="#提取引号和作者" target="">Extracting Quotes and Authors</a></li>
</ul>
</li>
<li><a href="#存取数据" target="">Storing the Scraped Data</a></li>
<li><a href="#链接界面包含的链接" target="">Following Links</a></li>
<li><a href="#更多示例和模式" target="">More Examples and Patterns</a></li>
<li><a href="#使用爬虫参数" target="">Using Spider Arguments</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
<p></p>






<h3 id="scrapy安装"><a name="t0" target="_blank"></a>Installing Scrapy</h3>


<p>Scrapy runs on Python 2.7 and on Python 3.3 or later (except on Windows, where Python 3 is not yet supported).</p>


<p>The general approach is to install Scrapy and its dependencies with pip: <br>
<code>pip install Scrapy</code></p>






<h3 id="创建项目"><a name="t1" target="_blank"></a>Creating a Project</h3>


<p><code>scrapy startproject tutorial</code> <br>
<img src="http://om2o4m4w0.bkt.clouddn.com/14912226418048.gif" alt="-w200" title=""></p>


<p>Project structure:</p>






<pre class="prettyprint" name="code"><code class="hljs avrasm has-numbering">tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # the project's Python module; your code goes here
        __init__.py

        items.py          # project item definitions

        pipelines.py      # project pipelines

        settings.py       # project settings

        spiders/          # directory for our spiders
            __init__.py
</code></pre>


<p>Our first spider <br>
Create the first spider class in tutorial/spiders/quotes_spider.py:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)</code></pre>


<ul>
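<li><p>The class must subclass scrapy.Spider.</p></li>
<li><p>name: identifies the spider. It must be unique within a project; that is, you cannot give two different spiders the same name.</p></li>
<li><p>start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) from which the spider will begin to crawl. Subsequent requests are generated successively from these initial requests.</p></li>
<li><p>parse(): the method called to handle the response downloaded for each request. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods for handling it.</p>


<p>The parse() method usually parses the response, extracts the scraped data as dicts, and also finds new URLs to follow and creates new requests (Request) from them.</p></li>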
</ul>
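<p>The filename logic in parse() depends on how the URL splits on slashes. As a minimal sketch using only plain string operations (the helper name page_filename is hypothetical and not part of Scrapy), the following shows what the spider above computes for each of its two URLs:</p>

```python
def page_filename(url):
    """Build the output filename the same way parse() does above."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


for url in (
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
):
    print(url, "->", page_filename(url))
```

<p>Note that the indexing relies on the trailing slash: without it, index -2 would pick up the path segment "page" instead of the page number, so this approach only suits URLs shaped like the two above.</p>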






<h3 id="如何运行我们爬虫"><a name="t2" target="_blank"></a>How to Run Our Spider</h3>


<p>Go to the project's top-level directory, that is, the tutorial directory shown above: <br>
<code>cd tutorial</code> <br>
Run the spider: <br>
<code>scrapy crawl quotes</code></p>


<blockquote>
  <p>quotes is the name of the spider defined above.</p>
</blockquote>






<pre class="prettyprint" name="code"><code class="hljs avrasm has-numbering">... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) &lt;GET http://quotes.toscrape.com/robots.txt&gt; (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://quotes.toscrape.com/page/1/&gt; (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://quotes.toscrape.com/page/2/&gt; (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...</code></pre>


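<p>Now check the files in the current directory. You should notice that two new files have been created, quotes-1.html and quotes-2.html, each holding the content of the corresponding URL as written out by our parse method.</p>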


<p><img src="http://om2o4m4w0.bkt.clouddn.com/14885332051166.jpg" alt="-w300" title=""> <br>
The screenshot above was taken in the PyCharm IDE.</p>






<h3 id="提取数据"><a name="t3" target="_blank"></a>Extracting Data</h3>


<p>The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell.</p>


<p><code>scrapy shell 'http://quotes.toscrape.com/page/1/'</code></p>


<blockquote>
  <p>Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs that contain arguments (i.e. the &amp; character) will not work. <br>
  On Windows, use double quotes instead: <br>
  <code>scrapy shell "http://quotes.toscrape.com/page/1/"</code></p>
</blockquote>


<p>You will see something like:</p>






<pre class="prettyprint" name="code"><code class="hljs r has-numbering">[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://quotes.toscrape.com/page/1/&gt; (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    &lt;scrapy.crawler.Crawler object at 0x7fa91d888c90&gt;
[s]   item       {}
[s]   request    &lt;GET http://quotes.toscrape.com/page/1/&gt;
[s]   response   &lt;200 http://quotes.toscrape.com/page/1/&gt;
[s]   settings   &lt;scrapy.settings.Settings object at 0x7fa91d888c10&gt;
[s]   spider     &lt;DefaultSpider 'default' at 0x7fa91c8af990&gt;
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
&gt;&gt;&gt;</code></pre>






<h4 id="css选择元素"><a name="t4" target="_blank"></a>Selecting Elements with CSS</h4>






<h4 id="提取标题"><a name="t5" target="_blank"></a>Extracting the Title</h4>


<p>Try selecting elements using CSS with the response object:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title')
[&lt;Selector xpath='descendant-or-self::title' data='&lt;title&gt;Quotes to Scrape&lt;/title&gt;'&gt;]</code></pre>


<p>This returns a collection of Selector objects.</p>


<p>To extract the text from the title above, you can do:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title::text').extract()
['Quotes to Scrape']</code></pre>


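<p>There are two things to note here. The first is that we added ::text to the CSS query, which means we want to select only the text directly inside the &lt;title&gt; element. If we don't specify ::text, we get the full title element, including its tags:</p>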






<pre class="prettyprint" name="code"><code class="hljs vbnet has-numbering">&gt;&gt;&gt; response.css('title').extract()
['&lt;title&gt;Quotes to Scrape&lt;/title&gt;']</code></pre>


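<p>The other thing is that the result of calling .extract() is a list, because we are dealing with an instance of SelectorList. When you know you only want the first result, as in this case, you can do:</p>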






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title::text').extract_first()
'Quotes to Scrape'</code></pre>


<p>Alternatively, you could write:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title::text')[0].extract()
'Quotes to Scrape'</code></pre>


<p>However, using .extract_first() avoids an IndexError and returns None when no element matches the selection.</p>
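<p>The difference between the two styles only shows up when nothing matches. The sketch below imitates it on a plain Python list; first_or_default is a hypothetical stand-in written for illustration, not Scrapy's implementation:</p>

```python
def first_or_default(results, default=None):
    """Return the first item, or a default when the list is empty.

    Hypothetical stand-in that mimics how .extract_first() behaves.
    """
    return results[0] if results else default


no_matches = []  # what an extraction yields when the selector matches nothing

try:
    no_matches[0]  # the [0] style fails on an empty result
except IndexError as exc:
    print("indexing failed:", exc)

print(first_or_default(no_matches))            # falls back to None
print(first_or_default(["Quotes to Scrape"]))  # first match when present
```

<p>Falling back to None lets a spider keep running and emit a partially filled item when a page is missing one field, instead of aborting the whole callback.</p>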


<p>Besides the extract() and extract_first() methods, you can also use the re() method to extract using regular expressions:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
&gt;&gt;&gt; response.css('title::text').re(r'Q\w+')
['Quotes']
&gt;&gt;&gt; response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']</code></pre>
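<p>To experiment with patterns before running them against a live response, the same three expressions can be tried on the title string with Python's standard re module. This is a rough sketch: regex_extract is a hypothetical helper that flattens capture groups so the output has the same shape as the shell session above, and it is not how Scrapy implements re():</p>

```python
import re

title = "Quotes to Scrape"


def regex_extract(pattern, text):
    """Return all matches as a flat list of strings.

    Illustrative helper: re.findall returns tuples when a pattern has
    several groups, so those tuples are flattened here.
    """
    flat = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            flat.extend(match)
        else:
            flat.append(match)
    return flat


print(regex_extract(r"Quotes.*", title))        # ['Quotes to Scrape']
print(regex_extract(r"Q\w+", title))            # ['Quotes']
print(regex_extract(r"(\w+) to (\w+)", title))  # ['Quotes', 'Scrape']
```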


<p>To find the right CSS selectors to use, you can inspect the page with the developer tools in Chrome or Firefox.</p>






<h4 id="xpath选择元素"><a name="t6" target="_blank"></a>Selecting Elements with XPath</h4>


<p>Besides CSS, Scrapy selectors also support XPath expressions:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; response.xpath('//title')
[&lt;Selector xpath='//title' data='&lt;title&gt;Quotes to Scrape&lt;/title&gt;'&gt;]
&gt;&gt;&gt; response.xpath('//title/text()').extract_first()
'Quotes to Scrape'</code></pre>


<p>XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood.</p>


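<p>While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. With XPath you can, for example, select the link that contains the text "Next Page". This makes XPath very well suited to scraping, and we encourage you to learn it even if you already know how to build CSS selectors; it will make scraping much easier.</p>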
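<p>As a small offline illustration of selecting by content, the sketch below uses the limited XPath subset in Python's standard xml.etree.ElementTree (the <code>[.='text']</code> predicate needs Python 3.7 or later). The markup is hypothetical and made up for this example, and ElementTree is swapped in here only so the snippet runs without Scrapy; Scrapy selectors support full XPath 1.0, which is considerably richer:</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed pager markup used only for this sketch.
markup = """
<div class="pager">
    <a href="/page/1/">Previous Page</a>
    <a href="/page/3/">Next Page</a>
</div>
"""

root = ET.fromstring(markup)

# The predicate filters on the link's text content, not on its
# position or its class attribute.
next_links = root.findall(".//a[.='Next Page']")
hrefs = [a.get("href") for a in next_links]
print(hrefs)  # ['/page/3/']
```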


<p><strong>Don't worry about taking in everything at once; the details will all be covered later.</strong></p>


<ul>
<li>XPath resources: <br>
<ul><li>Using XPath with Scrapy selectors: <a href="http://scrapy.readthedocs.io/en/latest/topics/selectors.html#topics-selectors" target="_blank">http://scrapy.readthedocs.io/en/latest/topics/selectors.html#topics-selectors</a></li></ul></li>
</ul>






<h4 id="提取引号和作者"><a name="t7" target="_blank"></a>Extracting Quotes and Authors</h4>


<p>Each quote on <a href="http://quotes.toscrape.com" target="_blank">http://quotes.toscrape.com</a> is represented by HTML elements that look like this:</p>






<pre class="prettyprint" name="code"><code class="hljs livecodeserver has-numbering">&lt;div class="quote"&gt;
    &lt;span class="text"&gt;“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”&lt;/span&gt;
    &lt;span&gt;
        by &lt;small class="author"&gt;Albert Einstein&lt;/small&gt;
        &lt;a href="/author/Albert-Einstein"&gt;(about)&lt;/a&gt;
    &lt;/span&gt;
    &lt;div class="tags"&gt;
        Tags:
        &lt;a class="tag" href="/tag/change/page/1/"&gt;change&lt;/a&gt;
        &lt;a class="tag" href="/tag/deep-thoughts/page/1/"&gt;deep-thoughts&lt;/a&gt;
        &lt;a class="tag" href="/tag/thinking/page/1/"&gt;thinking&lt;/a&gt;
        &lt;a class="tag" href="/tag/world/page/1/"&gt;world&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;</code></pre>


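<p>Open the Scrapy shell: <br>
<code>$ scrapy shell 'http://quotes.toscrape.com'</code> <br>
The site may not be reachable from every network, so you may need a proxy; a screenshot of its content is shown below: <br>
<img src="http://om2o4m4w0.bkt.clouddn.com/14912244561352.jpg" alt="" title=""></p>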


<p>Get a list of selectors for the quote elements: <br>
<code>&gt;&gt;&gt; response.css("div.quote")</code></p>


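<p>Each of these selectors lets us run further queries over its sub-elements. <br>
Assign the first selector to a variable so that we can run our CSS selectors directly on a particular quote: <br>
<code>&gt;&gt;&gt; quote = response.css("div.quote")[0]</code></p>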


<p>Now extract the title, author, and tags from the quote object we just created:</p>






<pre class="prettyprint" name="code"><code class="hljs applescript has-numbering">&gt;&gt;&gt; title = quote.css("span.text::text").extract_first()
&gt;&gt;&gt; title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
&gt;&gt;&gt; author = quote.css("small.author::text").extract_first()
&gt;&gt;&gt; author
'Albert Einstein'</code></pre>


<p>Given that the tags are a list of strings, we can use the .extract() method to get all of them:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; tags = quote.css("div.tags a.tag::text").extract()
&gt;&gt;&gt; tags
['change', 'deep-thoughts', 'thinking', 'world']</code></pre>


<p>We can now iterate over all the quote elements and put the fields of each one together into a Python dictionary:</p>






<pre class="prettyprint" name="code"><code class="hljs r has-numbering">&gt;&gt;&gt; for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
&gt;&gt;&gt;</code></pre>
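<p>If the site is unreachable, the mapping from markup to dictionary can still be tried offline. The sketch below runs against the sample quote element shown earlier, using Python's standard xml.etree.ElementTree in place of Scrapy selectors. That substitution is an assumption made so the snippet is self-contained; it only works because the sample happens to be well-formed XML, and it is not how Scrapy parses pages:</p>

```python
import xml.etree.ElementTree as ET

# The sample quote element from this tutorial, as a string.
quote_html = """
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
"""

quote = ET.fromstring(quote_html)

item = {
    # Collapse the line break inside the sample text into single spaces.
    "text": " ".join(quote.find("span[@class='text']").text.split()),
    "author": quote.find(".//small[@class='author']").text,
    "tags": [a.text for a in quote.findall("div[@class='tags']/a[@class='tag']")],
}
print(item)
```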


<p>With the demo above we have learned some basic ways to extract data; now let's integrate them into the spider we created earlier.</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering">import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
</code></pre>


<p>If you run this spider, it will output the extracted data along with the log:</p>






<pre class="prettyprint" name="code"><code class="hljs cs has-numbering">2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from &lt;200 http://quotes.toscrape.com/page/1/&gt;
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from &lt;200 http://quotes.toscrape.com/page/1/&gt;
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
</code></pre>






<h3 id="存取数据"><a name="t8" target="_blank"></a>Storing the Scraped Data</h3>


<p>The simplest way is to specify an export file directly: <br>
<code>scrapy crawl quotes -o quotes.json</code></p>


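<p>This generates a quotes.json file containing all the scraped items, serialized as JSON.</p>


<p>For historical reasons, <strong>Scrapy appends to the given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you will end up with a broken JSON file</strong>.</p>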


<p>You can also use other formats, such as JSON Lines: <br>
<code>scrapy crawl quotes -o quotes.jl</code></p>
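<p>Why appending breaks a .json file but not a .jl file can be checked with the standard json module. This is a minimal simulation on in-memory strings with made-up items, not Scrapy's exporter code: two runs appended to a JSON file leave two arrays back to back, whereas JSON Lines stores one independent object per line:</p>

```python
import json

# Hypothetical scraped items, used only for this simulation.
items = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]

# Two runs appended to the same .json file: two arrays back to back.
json_twice = json.dumps(items) + json.dumps(items)
try:
    json.loads(json_twice)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# Two runs appended to the same .jl file: one object per line.
one_run = "".join(json.dumps(item) + "\n" for item in items)
jl_twice = one_run + one_run
records = [json.loads(line) for line in jl_twice.splitlines()]

print(json_ok)       # False: the combined text is no longer valid JSON
print(len(records))  # 4: every line still parses on its own
```

<p>Because each line parses independently, a JSON Lines file can also be processed record by record without loading the whole file into memory.</p>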








<h3 id="链接界面包含的链接"><a name="t9" target="_blank"></a>Following Links</h3>


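<p>Let's say that, instead of just scraping the first two pages of <a href="http://quotes.toscrape.com" target="_blank">http://quotes.toscrape.com</a>, you want the quotes from every page of the website.</p>


<p>Now that you know how to extract data from pages, let's see how to follow links from them.</p>


<p>The first step is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:</p>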






<pre class="prettyprint" name="code"><code class="hljs xml has-numbering"><span class="hljs-tag">&lt;<span class="hljs-title">ul</span> <span class="hljs-attribute">class</span>=<span class="hljs-value">"pager"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-title">li</span> <span class="hljs-attribute">class</span>=<span class="hljs-value">"next"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-title">a</span> <span class="hljs-attribute">href</span>=<span class="hljs-value">"/page/2/"</span>&gt;</span>Next <span class="hljs-tag">&lt;<span class="hljs-title">span</span> <span class="hljs-attribute">aria-hidden</span>=<span class="hljs-value">"true"</span>&gt;</span>&amp;rarr;<span class="hljs-tag">&lt;/<span class="hljs-title">span</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-title">a</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-title">li</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-title">ul</span>&gt;</span>
</code></pre>


<p>We can try extracting it in the shell:</p>






<pre class="prettyprint" name="code"><code class="hljs xml has-numbering">&gt;&gt;&gt; response.css('li.next a').extract_first()
'<span class="hljs-tag">&lt;<span class="hljs-title">a</span> <span class="hljs-attribute">href</span>=<span class="hljs-value">"/page/2/"</span>&gt;</span>Next <span class="hljs-tag">&lt;<span class="hljs-title">span</span> <span class="hljs-attribute">aria-hidden</span>=<span class="hljs-value">"true"</span>&gt;</span>→<span class="hljs-tag">&lt;/<span class="hljs-title">span</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-title">a</span>&gt;</span>'</code><ul class="pre-numbering" style="opacity: 0;"><li>1</li><li>2</li></ul><div class="save_code tracking-ad" data-mod="popu_249"><a href="javascript:;" target="_blank"><img src="http://static.blog.csdn.net/images/save_snippets.png"></a></div><ul class="pre-numbering"><li>1</li><li>2</li></ul></pre>


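<p>This returns the anchor element, but what we want is its href attribute. For that, Scrapy supports a CSS extension that lets you select attribute contents, like this:</p>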






<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-prompt">&gt;&gt;&gt; </span>response.css(<span class="hljs-string">'li.next a::attr(href)'</span>).extract_first()
<span class="hljs-string">'/page/2/'</span></code></pre>


<p>Here is our spider, modified to recursively follow the link to the next page and extract data from it:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">import</span> scrapy




<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">QuotesSpider</span><span class="hljs-params">(scrapy.Spider)</span>:</span>
    name = <span class="hljs-string">"quotes"</span>
    start_urls = [
        <span class="hljs-string">'http://quotes.toscrape.com/page/1/'</span>,
    ]


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse</span><span class="hljs-params">(self, response)</span>:</span>
        <span class="hljs-keyword">for</span> quote <span class="hljs-keyword">in</span> response.css(<span class="hljs-string">'div.quote'</span>):
            <span class="hljs-keyword">yield</span> {
                <span class="hljs-string">'text'</span>: quote.css(<span class="hljs-string">'span.text::text'</span>).extract_first(),
                <span class="hljs-string">'author'</span>: quote.css(<span class="hljs-string">'small.author::text'</span>).extract_first(),
                <span class="hljs-string">'tags'</span>: quote.css(<span class="hljs-string">'div.tags a.tag::text'</span>).extract(),
            }


        next_page = response.css(<span class="hljs-string">'li.next a::attr(href)'</span>).extract_first()
        <span class="hljs-keyword">if</span> next_page <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            next_page = response.urljoin(next_page)
            <span class="hljs-keyword">yield</span> scrapy.Request(next_page, callback=self.parse)</code></pre>


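<p>Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL with the urljoin() method (because the link can be relative), and yields a new request for the next page. It registers itself as the callback for that request, so data extraction continues on the next page and the crawl keeps going through all the pages.</p>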
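<p>To see what the URL-joining step does to a relative link, here is a small sketch using the standard library's <code>urllib.parse.urljoin</code>. As far as I know, <code>response.urljoin()</code> resolves the link against <code>response.url</code> in the same way. The helper name <code>next_page_url</code> is made up for this illustration.</p>

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve an extracted href against the URL of the current page."""
    return urljoin(page_url, href)


# URL of the page being parsed (what response.url would hold).
current = "http://quotes.toscrape.com/page/1/"

# A relative href, like the one extracted from the "next" link.
print(next_page_url(current, "/page/2/"))
# An href that is already absolute comes back unchanged.
print(next_page_url(current, "http://quotes.toscrape.com/page/3/"))
```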


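<p>What you see here is Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy schedules that request to be sent and registers the callback method to be executed when that request finishes.</p>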
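<p>The schedule-then-callback idea can be sketched without Scrapy at all. The toy loop below is not how Scrapy is implemented (there are no downloads and no concurrency); <code>FAKE_PAGES</code>, <code>crawl</code>, and the tuple used in place of a request are all made up for illustration. It only shows the pattern: a callback yields items and follow-up requests, and the loop keeps running callbacks until nothing is left to schedule.</p>

```python
from collections import deque

# Fake site: URL -> (quotes on that page, URL of the next page or None).
FAKE_PAGES = {
    "/page/1/": (["q1", "q2"], "/page/2/"),
    "/page/2/": (["q3"], "/page/3/"),
    "/page/3/": (["q4"], None),
}


def parse(url):
    """Callback: yield items (dicts), then a follow-up request (a tuple)."""
    quotes, next_page = FAKE_PAGES[url]
    for text in quotes:
        yield {"text": text}
    if next_page is not None:
        # Stands in for: yield scrapy.Request(next_page, callback=self.parse)
        yield (next_page, parse)


def crawl(start_url, callback):
    """Run callbacks until no scheduled requests remain; return all items."""
    queue = deque([(start_url, callback)])
    items = []
    while queue:
        url, cb = queue.popleft()      # take the next scheduled request
        for result in cb(url):         # run its callback on the "response"
            if isinstance(result, dict):
                items.append(result)   # a scraped item
            else:
                queue.append(result)   # a new request: schedule it
    return items


items = crawl("/page/1/", parse)
print([item["text"] for item in items])
```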






<h3 id="更多示例和模式"><a name="t10" target="_blank"></a>More examples and patterns</h3>


<p>Here is another spider that illustrates callbacks and following links, this time scraping author information:</p>






<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">import</span> scrapy




<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AuthorSpider</span><span class="hljs-params">(scrapy.Spider)</span>:</span>
    name = <span class="hljs-string">'author'</span>


    start_urls = [<span class="hljs-string">'http://quotes.toscrape.com/'</span>]


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse</span><span class="hljs-params">(self, response)</span>:</span>
        <span class="hljs-comment"># follow links to author pages</span>
        <span class="hljs-keyword">for</span> href <span class="hljs-keyword">in</span> response.css(<span class="hljs-string">'.author + a::attr(href)'</span>).extract():
            <span class="hljs-keyword">yield</span> scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)


        <span class="hljs-comment"># follow pagination links</span>
        next_page = response.css(<span class="hljs-string">'li.next a::attr(href)'</span>).extract_first()
        <span class="hljs-keyword">if</span> next_page <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            next_page = response.urljoin(next_page)
            <span class="hljs-keyword">yield</span> scrapy.Request(next_page, callback=self.parse)


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_author</span><span class="hljs-params">(self, response)</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_with_css</span><span class="hljs-params">(query)</span>:</span>
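            <span class="hljs-comment"># default='' avoids calling .strip() on None when nothing matches</span>
            <span class="hljs-keyword">return</span> response.css(query).extract_first(default=<span class="hljs-string">''</span>).strip()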


        <span class="hljs-keyword">yield</span> {
            <span class="hljs-string">'name'</span>: extract_with_css(<span class="hljs-string">'h3.author-title::text'</span>),
            <span class="hljs-string">'birthdate'</span>: extract_with_css(<span class="hljs-string">'.author-born-date::text'</span>),
            <span class="hljs-string">'bio'</span>: extract_with_css(<span class="hljs-string">'.author-description::text'</span>),
        }</code></pre>


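<p>This spider starts from the main page. It follows every link to an author page and calls the parse_author callback for each one, and it also follows the pagination links with the parse callback, as we saw before.</p>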


<p>The parse_author callback defines a helper function that extracts and cleans up the data from a CSS query, and it yields a Python dict with the author data.</p>


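<p>Even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hitting the server too often because of a programming mistake. This behavior can be configured with the DUPEFILTER_CLASS setting.</p>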
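<p>As a rough picture of what such a filter does, here is a minimal sketch that remembers which URLs have been requested. The <code>SeenUrlFilter</code> class and the sample links are hypothetical; Scrapy's default filter is more involved (to my knowledge it compares request fingerprints rather than raw URL strings).</p>

```python
class SeenUrlFilter:
    """Hypothetical, simplified stand-in for a duplicate-request filter."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        """Return True if url was already requested; otherwise remember it."""
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


# Made-up author links, with one author appearing twice on the page.
author_links = [
    "/author/Albert-Einstein",
    "/author/J-K-Rowling",
    "/author/Albert-Einstein",
]

dupefilter = SeenUrlFilter()
to_fetch = [url for url in author_links if not dupefilter.request_seen(url)]
print(to_fetch)
```

<p>If you really do need to request the same URL again, a request can be created with <code>dont_filter=True</code> to bypass the filter.</p>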


<p>Another common pattern is to build an item from data on more than one page, using a trick to pass additional data to the callbacks.</p>


<p><strong>Don't worry about covering everything at once; the details will be explained in later posts.</strong></p>


<p><br></p>






<h3 id="使用爬虫参数"><a name="t11" target="_blank"></a>Using spider arguments</h3>


<p>You can pass command-line arguments to your spiders with the -a option when running them: <br>
<code>scrapy crawl quotes -o quotes-humor.json -a tag=humor</code></p>


<p>These arguments are passed to the Spider's <code>__init__</code> method and become spider attributes by default.</p>


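<p>In this example, the value provided for the tag argument is available as self.tag. You can use it to make your spider fetch only quotes with a specific tag, building the URL from the argument:</p>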






<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">import</span> scrapy




<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">QuotesSpider</span><span class="hljs-params">(scrapy.Spider)</span>:</span>
    name = <span class="hljs-string">"quotes"</span>


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">start_requests</span><span class="hljs-params">(self)</span>:</span>
        url = <span class="hljs-string">'http://quotes.toscrape.com/'</span>
        tag = getattr(self, <span class="hljs-string">'tag'</span>, <span class="hljs-keyword">None</span>)
        <span class="hljs-keyword">if</span> tag <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            url = url + <span class="hljs-string">'tag/'</span> + tag
        <span class="hljs-keyword">yield</span> scrapy.Request(url, self.parse)


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse</span><span class="hljs-params">(self, response)</span>:</span>
        <span class="hljs-keyword">for</span> quote <span class="hljs-keyword">in</span> response.css(<span class="hljs-string">'div.quote'</span>):
            <span class="hljs-keyword">yield</span> {
                <span class="hljs-string">'text'</span>: quote.css(<span class="hljs-string">'span.text::text'</span>).extract_first(),
                <span class="hljs-string">'author'</span>: quote.css(<span class="hljs-string">'small.author::text'</span>).extract_first(),
            }


        next_page = response.css(<span class="hljs-string">'li.next a::attr(href)'</span>).extract_first()
        <span class="hljs-keyword">if</span> next_page <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            next_page = response.urljoin(next_page)
            <span class="hljs-keyword">yield</span> scrapy.Request(next_page, self.parse)</code></pre>
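<p>To make the argument handling concrete, the sketch below imitates it with a plain class that needs no Scrapy install. <code>MiniSpider</code> and its <code>start_url</code> method are hypothetical names; the class only mimics the idea that keyword arguments become attributes and that <code>getattr</code> with a default covers the case where no tag was given.</p>

```python
class MiniSpider:
    """Hypothetical stand-in for how a spider receives -a arguments."""

    def __init__(self, **kwargs):
        # Every keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        # None when no tag argument was passed.
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(MiniSpider().start_url())
print(MiniSpider(tag="humor").start_url())
```

<p>Keep in mind that values passed with -a arrive as strings, so convert them yourself if you need a number or a boolean.</p>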


<p>If you pass the tag=humor argument to this spider, you'll notice that it only visits URLs for the humor tag, such as <a href="http://quotes.toscrape.com/tag/humor" target="_blank">http://quotes.toscrape.com/tag/humor</a>.</p></div>