Scrapy Chinese tutorial (crawling cnbeta as an example)

Latest recommended article published on 2024-11-08 13:43:44

weixin_34223655

Tags: python, crawler, json

Original link: http://blog.51cto.com/wsky09/1352271


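There is not much Scrapy material available in Chinese, so I am writing this article as a record. It uses cnbeta news as the example and
scrapes the title and link of each cnbeta news article.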

1. Create a new Scrapy project

      scrapy startproject cnbeta 
    

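Directory structure (the standard layout that scrapy startproject generates):

    cnbeta/
        scrapy.cfg
        cnbeta/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py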

2. Define the data structure
Edit cnbeta/items.py

 
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    from scrapy.item import Item, Field

    class CnbetaItem(Item):
        # define the fields for your item here like:
        # name = Field()
        title = Field()
        url = Field()

This defines two fields, which store the title and the link respectively.

3. Write the spider
Edit cnbeta/spiders/cb.py

 
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from cnbeta.items import CnbetaItem

    class CBSpider(CrawlSpider):
        name = 'cnbeta'
        allowed_domains = ['cnbeta.com']
        start_urls = ['http://www.cnbeta.com']

        rules = (
            Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm', )),
                 callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            item = CnbetaItem()
            sel = Selector(response)
            item['title'] = sel.xpath('//title/text()').extract()
            item['url'] = response.url
            return item
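To make the parse_page step more concrete, here is a rough standard-library sketch of what the XPath //title/text() selects: the text inside the page's title element, returned as a list. This is not how Scrapy's Selector works internally; html.parser is swapped in purely for illustration, and TitleTextParser and extract_titles are hypothetical names that do not appear in the project.

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collects the text found inside <title> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Return a list of title strings, similar in shape to .extract()."""
    parser = TitleTextParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical page content, used only for illustration.
page = "<html><head><title>Example_cnBeta.COM</title></head><body></body></html>"
print(extract_titles(page))
```

Because the result is a list rather than a single string, each title in the exported data is wrapped in a one-element list.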

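The rules attribute specifies that every link matching the regular expression /articles/.*\.htm is extracted; each matched page is passed to parse_page.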
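The pattern can be tried on its own with Python's re module. The sketch below assumes the link extractor searches for the pattern anywhere in the URL rather than requiring a full match; is_article_link and the sample URLs are hypothetical and only serve the illustration.

```python
import re

# Same pattern as the allow= argument in the spider's Rule.
ARTICLE_PATTERN = re.compile(r'/articles/.*\.htm')


def is_article_link(url):
    """Return True if the pattern is found anywhere in the URL."""
    return ARTICLE_PATTERN.search(url) is not None


# Hypothetical sample URLs, used only for illustration.
sample_urls = [
    'http://www.cnbeta.com/articles/268661.htm',
    'http://www.cnbeta.com/articles/268661.html',
    'http://www.cnbeta.com/topics/9.htm',
    'http://www.cnbeta.com/',
]

for url in sample_urls:
    print(url, is_article_link(url))
```

Since the pattern is not anchored at the end, a URL ending in .html is also accepted under this assumption; appending $ to the pattern would restrict it to .htm.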

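4. Run the spider

    scrapy crawl cnbeta -o result.json -t json

This writes the results to result.json; -t json sets the output file format to JSON.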
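The exported file is ordinary JSON, so it can be read back with the standard json module. The sketch below parses one record copied from the results in this article; load_items is a hypothetical helper, and flattening the one-element title list into a string is a choice made here, not something the spider does.

```python
import json

# One record copied from the results in this article, kept as raw JSON text.
SAMPLE_JSON = r'''[
{"url": "http://www.cnbeta.com/articles/268872.htm", "title": ["\u8c37\u6b4c\u667a\u80fd\u5bb6\u5c45\u8ba1\u5212\u6216\u72af\u7684\u9519\uff1a\u5c01\u95edNest_Google / \u8c37\u6b4c_cnBeta.COM"]}
]'''


def load_items(json_text):
    """Parse exported JSON and flatten each title list to a single string."""
    items = []
    for record in json.loads(json_text):
        # The title is a list; fall back to an empty string if it is empty.
        titles = record.get("title") or [""]
        items.append({"url": record["url"], "title": titles[0].strip()})
    return items


for item in load_items(SAMPLE_JSON):
    # json.loads has already decoded the \uXXXX escapes into Chinese text.
    print(item["url"], item["title"])
```

For a real run, the same function could be given the contents of result.json, for example load_items(open('result.json').read()).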

5. Results

 
    [{"url": "http://www.cnbeta.com/articles/268661.htm", "title": ["\u53ea\u97001\u5143\u4e0d\u62a2\u624d\u75af\uff01\u5c0f\u5ea6Wi-Fi\u5957\u88c5\u9707\u64bc\u4ef716\u65e5\u5f00\u62a2_Baidu \u767e\u5ea6_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268872.htm", "title": ["\u8c37\u6b4c\u667a\u80fd\u5bb6\u5c45\u8ba1\u5212\u6216\u72af\u7684\u9519\uff1a\u5c01\u95edNest_Google / \u8c37\u6b4c_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268865.htm", "title": ["\u4e2d\u56fd\u624b\u673a\u7f51\u6c11\u89c4\u6a21\u8fbe5\u4ebf \u5e74\u589e\u957f8009\u4e07\u4eba_cnBeta \u89c6\u70b9\u89c2\u5bdf_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268869.htm", "title": ["\u524d\u82f9\u679c\u9ad8\u7ea7\u526f\u603b\u88c1\u4e3a\u5927\u5b66\u4ee3\u8a00\uff1f\u539f\u662f\u56fe\u7247\u88ab\u76d7\u7528_cnBeta \u4eba\u7269_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268866.htm", "title": ["\u6bd4\u7279\u5e01\u4ea4\u6613\u5e73\u53f0\u906d\u9047\u751f\u5b58\u5371\u673a_cnBeta \u89c6\u70b9\u89c2\u5bdf_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268870.htm", "title": ["\u76db\u5927\u6e38\u620f\u4f20\u5947\u88ab\u4fb5\u6743\u6848\u65b0\u8fdb\u5c55\uff1a\u8ffd\u52a0\u56db\u540d\u88ab\u544a_cnBeta \u89c6\u70b9\u89c2\u5bdf_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268867.htm", "title": ["\u91d1\u878d\u65f6\u62a5\uff1a\u4e2d\u56fd\u8d2b\u56f0\u5730\u533a\u7f51\u8d2d\u589e\u901f\u8d85\u53d1\u8fbe\u5730\u533a_\u7535\u5b50\u5546\u52a1 - B2C / B2B_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268868.htm", "title": ["\u79fb\u52a84G\u7248iPhone 5s/5c\u4e0a\u624b\uff1a\u901f\u5ea6\u6539\u53d8\u4f53\u9a8c_Apple iPhone_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268871.htm", "title": ["[\u7ec4\u56fe]\u7247\u573a\u63a2\u79d8\u544a\u8bc9\u4f60\u7d22\u5c3c4K\u5f71\u7247\u662f\u600e\u4e48\u70bc\u6210\u7684_SONY \u7d22\u5c3c_cnBeta.COM"]},
    {"url": "http://www.cnbeta.com/articles/268875.htm", "title": [
    "USB 3.0\u548c\u534