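Crawling My Own CSDN Articles with Scrapy

Preface

As a means of collecting data, web crawlers are seeing ever wider use in the age of big data. This post only records a simple example from my own learning. Experts can fetch pages directly with Python's urllib2 module and pick them apart with regular expressions; as a mere mortal I have to lean on a framework, and here that framework is Scrapy.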

Installation

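First install Python and pip:
sudo apt-get install python python-pip python-dev
Then install Scrapy:
sudo pip install Scrapy
Scrapy depends on cryptography. When pip built cryptography automatically, the libffi library was missing, so the Scrapy installation failed. Missing libraries are the main problem during installation; just install whatever the error message says is missing.
sudo apt-get install libffi-dev
We use MongoDB to store the crawled data:
sudo apt-get install mongodb
together with pymongo 2.7.2:
sudo pip install pymongo==2.7.2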

Process

Creating the project

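Create a project with the following command:
scrapy startproject csdn
csdn is the project name. The command generates a directory with the following structure:

csdn
├── csdn
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 6 files
Next, inside the csdn directory, we use the crawl template to generate a spider named csdn_crawler:
scrapy genspider -t crawl csdn_crawler blog.csdn.net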

Writing the spider

Scrapy uses an Item to hold one scraped record. Our Item is CsdnItem, which has two fields (Field): title and url.

<code class="language-python">from scrapy.item import Item, Field


class CsdnItem(Item):
    # One Field per piece of data we want to keep for an article.
    title = Field()
    url = Field()</code>

Our spider:

<code class="language-python"># In Scrapy 1.0 and later these two modules are scrapy.linkextractors and scrapy.spiders.
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from csdn.items import CsdnItem


class CsdnCrawlerSpider(CrawlSpider):
    name = 'csdn_crawler'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['http://blog.csdn.net/zhx6044']

    # Follow every pagination link and hand each listing page to parse_item.
    rules = (
        Rule(LinkExtractor(allow=r'article/list/[0-9]{1,20}'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = CsdnItem()
        i['title'] = response.xpath('//*[@id="article_list"]/div[@class="list_item article_item"]/div[@class="article_title"]/h1/span/a/text()').extract()
        i['url'] = response.xpath('//*[@id="article_list"]/div[@class="list_item article_item"]/div[@class="article_title"]/h1/span/a/@href').extract()
        return i</code>
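Note that parse_item stores two parallel lists in a single item per listing page, and the titles keep the whitespace that surrounds the link text (the run output further down shows this). As a minimal sketch, and only one possible approach, the lists could be paired into one clean record per article. The helper name pair_articles and the href values below are made up for illustration; the title strings are copied from the run output.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2


def pair_articles(page_url, titles, hrefs):
    """Pair each title with its link, strip padding, make the link absolute."""
    return [
        {'title': title.strip(), 'url': urljoin(page_url, href)}
        for title, href in zip(titles, hrefs)
    ]


# Titles as they appear in the run output; hrefs are hypothetical.
titles = ['\r\n        Docker\r\n        ', '\r\n        php curl\r\n        ']
hrefs = ['/zhx6044/article/details/1', '/zhx6044/article/details/2']

records = pair_articles('http://blog.csdn.net/zhx6044/article/list/2', titles, hrefs)
for record in records:
    print(record)
```

Inside parse_item the same pairing could be applied to the two extract() results before building items, at the cost of returning several items per page instead of one.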

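name is the spider's name; it is what you pass to scrapy crawl to start the spider.
allowed_domains lists the domains the spider is allowed to crawl.
start_urls holds the URLs where crawling starts; all later URLs are discovered from there.
rules defines which links to keep following. The rule here comes from the blog's pagination: the next-page link has the form article/list/xxx, where xxx is the page number. callback names the method to call for each page whose link matches the rule.
parse_item is that callback. The crawl template generates a stub for it and we only need to fill it in. (CrawlSpider uses the parse method internally, so parse itself should not be overridden.)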
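To see which links the rule's pattern would pick up, the regular expression can be tried on its own with the standard re module. This is only a sketch, assuming the pattern is searched anywhere in the absolute URL; the helper name is_listing_page is made up, and the sample URLs are taken from this post.

```python
import re

# Same pattern as in the spider's Rule.
PAGINATION = re.compile(r'article/list/[0-9]{1,20}')


def is_listing_page(url):
    """True if the URL contains a pagination segment such as article/list/2."""
    return PAGINATION.search(url) is not None


candidates = [
    'http://blog.csdn.net/zhx6044/article/list/2',
    'http://blog.csdn.net/zhx6044/article/details/45649045',
    'http://blog.csdn.net/zhx6044',
]
followed = [url for url in candidates if is_listing_page(url)]
print(followed)
```

Only the listing-page URL passes the filter; article detail pages and the blog's front page do not match.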
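The biggest remaining question is how to extract each article's title and URL. Here we use XPath, which, put simply, selects the content we want by matching a path expression against the page.
So next we need to analyze the page.
First, find the part we need with Chromium's developer tools.
Then right-click the following element and choose Copy XPath: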

<code class="language-html"><a href="/zhx6044/article/details/45649045">
        qml+opencv(三)人脸检测与识别
        </a></code>

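This gives //*[@id="article_list"]/div[1]/div[1]/h1/span/a for the first article entry and //*[@id="article_list"]/div[2]/div[1]/h1/span/a for the second. Clearly that will not do: we cannot change the number in div[x] to index every entry. Instead we can match on the class names, which are the same for every entry, which gives: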

<code class="language-python">        i['title'] = response.xpath('//*[@id="article_list"]/div[@class="list_item article_item"]/div[@class="article_title"]/h1/span/a/text()').extract()
        i['url'] = response.xpath('//*[@id="article_list"]/div[@class="list_item article_item"]/div[@class="article_title"]/h1/span/a/@href').extract()</code>
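The difference between the positional path and the class-based path can be checked offline. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (Scrapy's selectors support more), on a made-up, well-formed fragment that imitates the article list. The markup, hrefs, titles, and the helper name select_links are all assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the article list markup (not the real page).
SAMPLE = """
<div id="article_list">
  <div class="list_item article_item">
    <div class="article_title">
      <h1><span><a href="/example/article/details/1">
        First post
      </a></span></h1>
    </div>
  </div>
  <div class="list_item article_item">
    <div class="article_title">
      <h1><span><a href="/example/article/details/2">
        Second post
      </a></span></h1>
    </div>
  </div>
</div>
"""

# Paths are relative to the root element, the div with id="article_list".
POSITIONAL = "./div[1]/div[1]/h1/span/a"
BY_CLASS = ("./div[@class='list_item article_item']"
            "/div[@class='article_title']/h1/span/a")


def select_links(markup, path):
    """Return (stripped link text, href) for every <a> matched by path."""
    root = ET.fromstring(markup)
    return [(a.text.strip(), a.get("href")) for a in root.findall(path)]


print(select_links(SAMPLE, POSITIONAL))  # only the first entry
print(select_links(SAMPLE, BY_CLASS))    # every entry
```

The positional path matches a single entry, while the class-based path matches every entry in one query.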

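text() yields the link text, which is the title, and @href yields the link target. extract() returns the matches as a list of unicode strings.
Now that we have a CsdnItem, we need to save it, and that is the job of a Pipeline: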

<code class="language-python">import pymongo

# These import paths match the Scrapy release used in this post;
# later releases deprecate scrapy.conf and scrapy.log.
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class CsdnPipeline(object):
    def __init__(self):
        # MongoClient is available in pymongo 2.7.2 and replaces the older Connection class.
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Iterating over an Item yields field names, so check the values.
        for field in item:
            if not item[field]:
                raise DropItem("Missing data!")
        # To update an existing record instead of inserting a duplicate:
        # self.collection.update({'url': item['url']}, dict(item), upsert=True)
        self.collection.insert(dict(item))
        log.msg("Article add to mongodb database!", level=log.DEBUG, spider=spider)
        return item</code>
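The validation step in process_item can be tried without Scrapy or MongoDB. In the sketch below a plain dict stands in for the Item (both yield field names when iterated); the helper name empty_fields and the sample values are made up for illustration.

```python
def empty_fields(item):
    """Return the names of fields whose value is empty (None, '', [] and so on)."""
    return [name for name in item if not item[name]]


# Hypothetical items: one complete, one whose XPath matched no title.
complete = {'title': ['Docker'], 'url': ['/example/article/details/1']}
missing_title = {'title': [], 'url': ['/example/article/details/1']}

print(empty_fields(complete))
print(empty_fields(missing_title))

# Testing the field names themselves never flags anything,
# because a name such as 'title' is always a non-empty string.
print([name for name in missing_title if not name])
```

A pipeline could raise DropItem whenever this helper returns a non-empty list.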

The Item handled by process_item is the one produced by csdn_crawler.
There is also settings.py, where we keep the project-wide parameters:

<code class="language-python">BOT_NAME = 'csdn'
DOWNLOAD_DELAY = 5
SPIDER_MODULES = ['csdn.spiders']
NEWSPIDER_MODULE = 'csdn.spiders'
ITEM_PIPELINES = {'csdn.pipelines.CsdnPipeline': 1000}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "csdn"
MONGODB_COLLECTION = "article"</code>

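DOWNLOAD_DELAY keeps the crawler from hitting the domain continuously and putting a heavy load on it, although for a small crawl like ours it makes little difference.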
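The idea behind such a delay can be sketched in a few lines of plain Python. This is not how Scrapy implements DOWNLOAD_DELAY (Scrapy schedules requests inside its own downloader); it only illustrates pausing between consecutive requests. The function fetch_politely and its parameters are hypothetical, and the sleep function is injectable so the example runs instantly.

```python
import time


def fetch_politely(urls, fetch, delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause is needed before the first request
        results.append(fetch(url))
    return results


# Record the pauses instead of really sleeping; fetch is a stand-in function.
pauses = []
pages = fetch_politely(['a', 'b', 'c'], fetch=lambda url: url.upper(),
                       delay=5, sleep=pauses.append)
print(pages)
print(pauses)
```

With three URLs there are two pauses, so a delay of 5 seconds adds about 10 seconds to the whole run.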

Running

Run the spider with scrapy crawl csdn_crawler.
Here is part of the output:

<code class="language-log">2015-05-13 21:05:01+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-13 21:05:02+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-13 21:05:02+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-13 21:05:02+0800 [scrapy] INFO: Enabled item pipelines: CsdnPipeline
2015-05-13 21:05:02+0800 [csdn_crawler] INFO: Spider opened
2015-05-13 21:05:02+0800 [csdn_crawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-13 21:05:02+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-13 21:05:02+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-13 21:05:04+0800 [csdn_crawler] DEBUG: Crawled (200) <GET http://blog.csdn.net/zhx6044> (referer: None)
2015-05-13 21:05:09+0800 [csdn_crawler] DEBUG: Crawled (200) <GET http://blog.csdn.net/zhx6044/article/list/2> (referer: http://blog.csdn.net/zhx6044)
2015-05-13 21:05:09+0800 [csdn_crawler] DEBUG: Article add to mongodb database!
2015-05-13 21:05:09+0800 [csdn_crawler] DEBUG: Scraped from <200 http://blog.csdn.net/zhx6044/article/list/2>
    {'title': [u'\r\n        PHP\u4f7f\u7528CURL\u8fdb\u884cPOST\u64cd\u4f5c\u65f6\r\n        ',
               u'\r\n        \u4ecestoryboard\u52a0\u8f7d\u89c6\u56fe\u63a7\u5236\u5668\r\n        ',
               u'\r\n        PHP\u5f97\u5230POST\u4e0a\u6765\u7684JSON\u6570\u636e\r\n        ',
               u'\r\n        Docker\r\n        ',
               u'\r\n        Haproxy+nginx+php\r\n        ',
               u'\r\n        php curl\r\n        ',
               u'\r\n        \u5de5\u4f5c\u534a\u5e74\r\n        ',
               u'\r\n        \u57fa\u4e8enginx_http_push_module\u6a21\u5757\u8ba9nginx\u53d8\u6210Comet Server\r\n        ',
               u'\r\n        \u5728\u5185\u7f51\u67b6\u8bbe\u4e00\u4e2a\u53ef\u4f9b\u5916\u7f51\u767b\u5f55\u7684ftp\u670d\u52a1\u5668\r\n        ',
               u'\r\n        REST,http,\u670d\u52a1\u5668\u5f00\u53d1\r\n        ',
               u'\r\n        \u9634\u5dee\u9633\u9519\u53c8\u505a\u8d77linux\u6765\r\n        ',
               u'\r\n        \u57fa\u4e8eTCP\u534f\u8bae\u7684\u89c6\u9891\u4f20\u8f93\r\n        ',
               u'\r\n        \u5bb6\u4eba\u91cd\u75c5\u4ec0\u4e48\u5fc3\u60c5\u90fd\u6ca1\u4e86\r\n        ',
               u'\r\n        opencv\u5b9e\u73b0\u8fb9\u7f18\u68c0\u6d4b\r\n        ',
               u'\r\n        opencv\u5b9e\u73b0\u591a\u8def\u64ad\u653e\r\n        '],
     'url': [u'/zhx6044/article/details/43418115',
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/43418011'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/43417923'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/43417701'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/43240585'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/42804333'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/42646439'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/42040943'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/41051061'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/40789909'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/40592187'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/40016929'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/39970089'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/39256473'</span>,
             <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">u'/zhx6044/article/details/39033141'</span>]}
<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2015</span>-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">05</span>-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">13</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">05</span>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">09</span>+<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0800</span> [csdn_crawler] DEBUG: Filtered duplicate request: <GET http://blog.csdn.net/zhx6044/article/list/<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2015</span>-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">05</span>-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">13</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">05</span>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">15</span>+<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0800</span> [csdn_crawler] DEBUG: Crawled (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">200</span>) <GET http://blog.csdn.net/zhx6044/article/list/<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">14</span>> (referer: http://blog.csdn.net/zhx6044)
<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2015</span>-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">05</span>-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">13</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">05</span>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">15</span>+<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0800</span> [csdn_crawler] DEBUG: Article add to mongodb database!
2015-05-13 21:05:15+0800 [csdn_crawler] DEBUG: Scraped from <200 http://blog.csdn.net/zhx6044/article/list/14></code>
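
The item in the log above stores each title padded with `\r\n` and spaces, and each URL as a site-relative path. The following is a minimal sketch of how such an item could be tidied up before further use. It is not part of the original project: the `clean_item` helper and the `BASE_URL` constant are hypothetical names, the sketch assumes the `title` and `url` lists are in matching order, and it is written for Python 3 even though the log was produced under Python 2.

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the crawled pages shown in the log.
BASE_URL = "http://blog.csdn.net"


def clean_item(item):
    """Pair each title with its URL (hypothetical helper).

    Strips the surrounding whitespace from every title and resolves
    every relative URL against BASE_URL. Assumes both lists are parallel.
    """
    titles = [t.strip() for t in item["title"]]
    urls = [urljoin(BASE_URL, u) for u in item["url"]]
    return [{"title": t, "url": u} for t, u in zip(titles, urls)]


# Two entries copied from the logged item, for illustration only.
raw = {
    "title": ["\r\n        Docker\r\n        ", "\r\n        php curl\r\n        "],
    "url": ["/zhx6044/article/details/43417701", "/zhx6044/article/details/42804333"],
}

articles = clean_item(raw)
for article in articles:
    print(article["title"], article["url"])
```

Splitting the page-level lists into one record per article would also make each MongoDB document describe a single post, which may be easier to query than one document per list page.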

The data stored in MongoDB 
[Image: the data in MongoDB]

The complete code is available here.
