Web scraping with Python

刘星石

http://cxy.liuzhihengseo.com/462.html

Original source: 磁针石

This article is a summary of Web Scraping with Python (2015).

Book download: https://bitbucket.org/xurongzhong/python-chinese-library/downloads

Source code: https://bitbucket.org/wswp/code

Demo site: http://example.webscraping.com/

Demo site code: http://bitbucket.org/wswp/places

Recommended introductory Python tutorial: http://www.diveintopython.net

HTML and JavaScript basics:

http://www.w3schools.com

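Introduction to web scraping

Why scrape the web?

When shopping online you may want to compare prices across several sites, which is what the 惠惠购物助手 (Huihui shopping assistant) does. An API makes this easy, but usually none is available, and that is when web scraping is needed.

Is web scraping legal?

Scraping data for personal use is not illegal. Commercial use or republication requires you to consider authorization, and a crawler should also behave politely toward the sites it visits. Judging from cases already decided abroad, factual data such as locations and phone numbers can generally be republished, while original content cannot.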

http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html

http://caselaw.findlaw.com/us-supreme-court/499/340.html

Background research

robots.txt and the Sitemap help you understand the size and structure of a site. Tools such as Google Search and WHOIS are also useful.

For example, http://example.webscraping.com/robots.txt contains:

    # section 1
    User-agent: BadCrawler
    Disallow: /

    # section 2
    User-agent: *
    Crawl-delay: 5
    Disallow: /trap

    # section 3
    Sitemap: http://example.webscraping.com/sitemap.xml
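As a rough illustration, the rules in this robots.txt can be checked programmatically before crawling. The sketch below assumes Python 3 and its standard `urllib.robotparser` module (the article's own listings target Python 2), and it parses the file contents from an inline string so that it runs without network access. The user agent name `GoodCrawler` is made up for the example.

```python
from urllib import robotparser

# Contents of the robots.txt shown above, inlined so the sketch runs offline.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

base = 'http://example.webscraping.com'
# BadCrawler is banned from the whole site.
bad_allowed = rp.can_fetch('BadCrawler', base + '/')
# Any other agent may fetch normal pages but not /trap.
good_allowed = rp.can_fetch('GoodCrawler', base + '/')
trap_allowed = rp.can_fetch('GoodCrawler', base + '/trap')
# Seconds to wait between requests, taken from the "*" section.
delay = rp.crawl_delay('GoodCrawler')

print(bad_allowed, good_allowed, trap_allowed, delay)
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to fetch the live file, then consult `can_fetch` before each download.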

For more about web robots, see http://www.robotstxt.org.
The Sitemap protocol is described at http://www.sitemaps.org/protocol.html. For example:

    <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
    <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
    <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
    ...

Sitemaps are often incomplete.

Estimating the size of a site:
Use Google's site: query, for example: site:automationtesting.sinaapp.com

Identifying the technology a site is built with:

    # pip install builtwith
    # ipython
    In [1]: import builtwith
    In [2]: builtwith.parse('http://automationtesting.sinaapp.com/')
    Out[2]:
    {u'issue-trackers': [u'Trac'],
     u'javascript-frameworks': [u'jQuery'],
     u'programming-languages': [u'Python'],
     u'web-servers': [u'Nginx']}

Finding the owner of a site:

    # pip install python-whois
    # ipython
    In [1]: import whois
    In [2]: print whois.whois('http://automationtesting.sinaapp.com')
    {
      "updated_date": "2016-01-07 00:00:00",
      "status": [
        "serverDeleteProhibited https://www.icann.org/epp#serverDeleteProhibited",
        "serverTransferProhibited https://www.icann.org/epp#serverTransferProhibited",
        "serverUpdateProhibited https://www.icann.org/epp#serverUpdateProhibited"
      ],
      "name": null,
      "dnssec": null,
      "city": null,
      "expiration_date": "2021-06-29 00:00:00",
      "zipcode": null,
      "domain_name": "SINAAPP.COM",
      "country": null,
      "whois_server": "whois.paycenter.com.cn",
      "state": null,
      "registrar": "XIN NET TECHNOLOGY CORPORATION",
      "referral_url": "http://www.xinnet.com",
      "address": null,
      "name_servers": [
        "NS1.SINAAPP.COM",
        "NS2.SINAAPP.COM",
        "NS3.SINAAPP.COM",
        "NS4.SINAAPP.COM"
      ],
      "org": null,
      "creation_date": "2009-06-29 00:00:00",
      "emails": null
    }

Crawling your first website

A simple crawler looks like this:

    import urllib2

    def download(url):
        print 'Downloading:', url
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            print 'Download error:', e.reason
            html = None
        return html

The download can be retried depending on the error code. HTTP status codes are defined at https://tools.ietf.org/html/rfc7231#section-6. 4xx errors (the request itself is at fault) are not worth retrying, but 5xx errors (the server is at fault) are.

    import urllib2

    def download(url, num_retries=2):
        print 'Downloading:', url
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            print 'Download error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # retry 5xx HTTP errors
                    return download(url, num_retries - 1)
        return html
       
http://httpstat.us/500 returns a 500 status code, so it can be used to test the retry logic:

    >>> download('http://httpstat.us/500')
    Downloading: http://httpstat.us/500
    Download error: Internal Server Error
    Downloading: http://httpstat.us/500
    Download error: Internal Server Error
    Downloading: http://httpstat.us/500
    Download error: Internal Server Error
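The retry behaviour can also be exercised without a network connection. The following is a minimal sketch, assuming Python 3 (`urllib.request` and `urllib.error` in place of `urllib2`). The `opener` parameter and the `make_failing_opener` helper are hypothetical additions for this illustration and are not part of the article's code.

```python
import io
import urllib.error
import urllib.request


def download(url, num_retries=2, opener=urllib.request.urlopen):
    """Return the body of `url` as bytes, or None if the download fails.

    `opener` is a hypothetical hook so the retry logic can be tested
    without touching the network.
    """
    print('Downloading:', url)
    try:
        html = opener(url).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        # Only HTTPError carries a status code; retry 5xx responses only.
        if num_retries > 0 and 500 <= getattr(e, 'code', 0) < 600:
            return download(url, num_retries - 1, opener)
    return html


def make_failing_opener(code, reason, log):
    """Build a stand-in for urlopen that always fails with `code`."""
    def opener(url):
        log.append(url)
        raise urllib.error.HTTPError(url, code, reason, {}, io.BytesIO())
    return opener


server_error_log, not_found_log = [], []
download('http://httpstat.us/500',
         opener=make_failing_opener(500, 'Internal Server Error', server_error_log))
download('http://httpstat.us/404',
         opener=make_failing_opener(404, 'Not Found', not_found_log))
print(len(server_error_log), len(not_found_log))
```

With the default `num_retries=2`, the simulated 500 response is attempted three times in total, while the 404 is attempted once and then abandoned.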
       
Setting the user agent:

The default user agent of urllib2 is "Python-urllib/2.7", which many websites block. A realistic agent string is recommended instead, for example:

    Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0

So we add a user agent parameter:
       
    import urllib2

    def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
        print 'Downloading:', url
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print 'Download error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # retry 5xx HTTP errors
                    return download(url, user_agent, num_retries - 1)
        return html
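To see what the user agent setting changes, it helps to inspect the headers before anything is sent. This is a small sketch assuming Python 3, where `urllib.request` replaces `urllib2` and the default agent is `Python-urllib/3.x`. The helper name `build_request` is hypothetical, and nothing here opens a network connection.

```python
import urllib.request

DEFAULT_AGENT = ('Mozilla/5.0 (X11; Linux x86_64; rv:38.0) '
                 'Gecko/20100101 Firefox/38.0')


def build_request(url, user_agent=DEFAULT_AGENT):
    """Return a Request object that carries a custom User-agent header."""
    return urllib.request.Request(url, headers={'User-agent': user_agent})


# The agent that urllib would send if we did nothing.
library_agent = dict(urllib.request.build_opener().addheaders)['User-agent']

# The agent attached to our own request.
request = build_request('http://example.webscraping.com/')
sent_agent = request.get_header('User-agent')

print(library_agent)
print(sent_agent)
```

Passing `request` to `urllib.request.urlopen` would then send the browser-like agent instead of the library default.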
       
Crawling the sitemap:

    import re

    def crawl_sitemap(url):
        # download the sitemap file
        sitemap = download(url)
        # extract the sitemap links
        links = re.findall('<loc>(.*?)</loc>', sitemap)
        # download each link
        for link in links:
            html = download(link)
            # scrape html here
            # ...
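The link extraction step can be tried on its own. The sketch below assumes Python 3 and an inline sample sitemap, so it runs offline. It compares a regular expression (assuming the links sit in `<loc>` elements, as the Sitemap protocol specifies) with the standard library's XML parser, which is offered here only as an alternative and is not something the article uses.

```python
import re
import xml.etree.ElementTree as ET

# A tiny sitemap in the sitemaps.org format, inlined so the sketch runs
# offline; a real crawler would download it first.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
</urlset>
"""

# Approach 1: a regular expression over the raw text.
regex_links = re.findall('<loc>(.*?)</loc>', SITEMAP)

# Approach 2: an XML parser, which must be told the sitemap namespace.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(SITEMAP.encode('utf-8'))
xml_links = [loc.text for loc in root.findall('sm:url/sm:loc', NS)]

print(regex_links == xml_links, len(xml_links))
```

The regular expression is shorter, while the parser also copes with details such as XML entities or extra whitespace inside the tags.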
       
Iterating over IDs:

• http://example.webscraping.com/view/Afghanistan-1
• http://example.webscraping.com/view/Australia-2
• http://example.webscraping.com/view/Brazil-3

The URLs above differ only in their final part. Developers usually key pages on a database ID, for example http://example.webscraping.com/view/1, so the pages can be crawled by iterating over the database IDs.

    import itertools

    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)
        if html is None:
            break
        else:
            # success - can scrape the result
            pass
       
Of course, a record may have been deleted from the database, which leaves a gap in the IDs, so we improve the loop as follows:

    import itertools

    # maximum number of consecutive download errors allowed
    max_errors = 5
    # current number of consecutive download errors
    num_errors = 0
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)
        if html is None:
            # received an error trying to download this webpage
            num_errors += 1
            if num_errors == max_errors:
                # reached maximum number of
                # consecutive errors so exit
                break
        else:
            # success - can scrape the result
            # ...
            num_errors = 0
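The effect of the consecutive-error counter is easier to see against a fake site with known gaps. This is a sketch assuming Python 3. The generator `crawl_ids` and the `FAKE_PAGES` dictionary are hypothetical stand-ins for the real loop and the real `download` function.

```python
import itertools


def crawl_ids(download, url_template, max_errors=5):
    """Yield (page_id, html) for IDs 1, 2, 3, ... until `max_errors`
    consecutive downloads have failed.

    `download` is any callable that returns the page body or None.
    """
    num_errors = 0
    for page in itertools.count(1):
        html = download(url_template % page)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break  # too many misses in a row: assume we ran off the end
        else:
            num_errors = 0  # a hit resets the streak
            yield page, html


# Fake site: IDs 1-3 and 6 exist, 4 and 5 have been deleted.
FAKE_PAGES = {'/view/-%d' % i: '<html>%d</html>' % i for i in (1, 2, 3, 6)}

tolerant = [page for page, _ in crawl_ids(FAKE_PAGES.get, '/view/-%d', max_errors=3)]
strict = [page for page, _ in crawl_ids(FAKE_PAGES.get, '/view/-%d', max_errors=2)]
print(tolerant, strict)
```

With `max_errors=3` the crawl survives the two-record gap and reaches ID 6. With `max_errors=2` it gives up at the gap, which shows why the threshold should exceed the longest run of deleted records you expect.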
       
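Some sites return 404 when a page does not exist, and some sites do not use such regular IDs; Amazon, for example, uses ISBNs.

Analyzing a web page

Most browsers offer a "View page source" feature, and Firebug makes this especially convenient in Firefox. Both can be opened by right-clicking the page. There are three main ways to extract data from a page: regular expressions, BeautifulSoup and lxml. A regular expression example: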
       
    In [1]: import re
    In [2]: import common
    In [3]: url = 'http://example.webscraping.com/view/UnitedKingdom-239'
    In [4]: html = common.download(url)
    Downloading: http://example.webscraping.com/view/UnitedKingdom-239
    In [5]: re.findall('<td class="w2p_fw">(.*?)</td>', html)
    Out[5]:
    ['',
     '244,820 square kilometres',
     '62,348,447',
     'GB',
     'United Kingdom',
     'London',
     'EU',
     '.uk',
     'GBP',
     'Pound',
     '44',
     '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
     '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2}[A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$',
     'en-GB,cy-GB,gd',
     'IE ']
    In [6]: re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
    Out[6]: '244,820 square kilometres'
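The regular expression approach can be reproduced offline on a small fragment. The sketch below assumes Python 3. The `HTML` string is a made-up fragment modelled on the demo page's table markup (rows with ids such as `places_area__row` and value cells with class `w2p_fw`), not the real page.

```python
import re

# Hypothetical fragment modelled on the demo page's table markup.
HTML = ('<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
        '<td class="w2p_fw">244,820 square kilometres</td></tr>'
        '<tr id="places_population__row"><td class="w2p_fl">Population: </td>'
        '<td class="w2p_fw">62,348,447</td></tr>')

# Grab every value cell, then pick the area by its position in the list.
values = re.findall(r'<td class="w2p_fw">(.*?)</td>', HTML)
area_by_index = values[0]

# Anchoring on the row id is somewhat more robust than relying on position,
# and the character classes accept either quote style.
match = re.search(
    r'<tr id="places_area__row">.*?<td\s+class=["\']w2p_fw["\']>(.*?)</td>',
    HTML)
area_by_row = match.group(1)

# A harmless markup change (single quotes) already defeats the first pattern.
changed = HTML.replace('"', "'")
values_after_change = re.findall(r'<td class="w2p_fw">(.*?)</td>', changed)

print(area_by_index, area_by_row, values_after_change)
```

Both patterns find the area in the original fragment, but the first one returns an empty list as soon as the quoting style changes.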

Regular expressions are costly to maintain, because small changes to the page layout break them.
Beautiful Soup:

    In [7]: from bs4 import BeautifulSoup
    In [8]: broken_html = '<ul class=country><li>Area<li>Population</ul>'
    In [9]: # parse the HTML
    In [10]: soup = BeautifulSoup(broken_html, 'html.parser')
    In [11]: fixed_html = soup.prettify()
    In [12]: print fixed_html
    <ul class="country">
     <li>
      Area
      <li>
       Population
      </li>
     </li>
    </ul>
    In [13]: ul = soup.find('ul', attrs={'class':'country'})
    In [14]: ul.find('li') # returns just the first match
    Out[14]: <li>Area<li>Population</li></li>
    In [15]: ul.find_all('li') # returns all matches
    Out[15]: [<li>Area<li>Population</li></li>, <li>Population</li>]

A complete example:

    In [1]: from bs4 import BeautifulSoup

    In [2]: url = 'http://example.webscraping.com/places/view/United-Kingdom-239'

    In [3]: import common

    In [5]: html = common.download(url)
    Downloading: http://example.webscraping.com/places/view/United-Kingdom-239

    In [6]: soup = BeautifulSoup(html)
    /usr/lib/python2.7/site-packages/bs4/__init__.py:166:
    UserWarning: No parser was explicitly specified, so I'm using the best
    available HTML parser for this system ("lxml"). This usually isn't a
    problem, but if you run this code on another system, or in a different
    virtual environment, it may use a different parser and behave
    differently.

    To get rid of this warning, change this:

     BeautifulSoup([your markup])

    to this:

     BeautifulSoup([your markup], "lxml")

      markup_type=markup_type))

    In [7]: # locate the area row

    In [8]: tr = soup.find(attrs={'id':'places_area__row'})

    In [9]: td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag

    In [10]: area = td.text # extract the text from this tag

    In [11]: print area
    244,820 square kilometres
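For readers who cannot install a third-party parser, the same "find the row, then the value cell" idea can be sketched with the standard library alone. Note that this swaps in Python 3's `html.parser` in place of BeautifulSoup. The `CellExtractor` class and the `HTML` fragment are hypothetical and written only for this illustration.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of the <td class="w2p_fw"> cell inside one row.

    Hypothetical helper for this sketch; not part of any library.
    """

    def __init__(self, row_id):
        super().__init__()
        self.row_id = row_id
        self.in_row = False
        self.in_cell = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr':
            # Track whether we are inside the row we are looking for.
            self.in_row = attrs.get('id') == self.row_id
        elif tag == 'td' and self.in_row:
            self.in_cell = attrs.get('class') == 'w2p_fw'

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        elif tag == 'tr':
            self.in_row = False

    def handle_data(self, data):
        if self.in_cell:
            self.text.append(data)


# Hypothetical fragment modelled on the demo page's markup.
HTML = ('<table><tr id="places_area__row"><td class="w2p_fl">Area: </td>'
        '<td class="w2p_fw">244,820 square kilometres</td></tr>'
        '<tr id="places_population__row"><td class="w2p_fl">Population: </td>'
        '<td class="w2p_fw">62,348,447</td></tr></table>')

parser = CellExtractor('places_area__row')
parser.feed(HTML)
area = ''.join(parser.text)
print(area)
```

This is far more code than the two `find` calls in the BeautifulSoup session, which is a fair illustration of what the library saves you.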
       
 
 

lxml is built on libxml2 (written in C), so it is faster, but it is sometimes harder to install. Installation instructions: http://lxml.de/installation.html.

    In [1]: import lxml.html
    In [2]: broken_html = '<ul class=country><li>Area<li>Population</ul>'
    In [3]: tree = lxml.html.fromstring(broken_html) # parse the HTML
    In [4]: fixed_html = lxml.html.tostring(tree, pretty_print=True)
    In [5]: print fixed_html
    <ul class="country">
    <li>Area</li>
    <li>Population</li>
    </ul>

lxml is also tolerant of malformed HTML; a missing closing tag is usually not a problem.

The next example uses CSS selectors; note that the cssselect package must be installed.

    In [1]: import common
    In [2]: import lxml.html
    In [3]: url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
    In [4]: html = common.download(url)
    Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
    In [5]: tree = lxml.html.fromstring(html)
    In [6]: td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
    In [7]: area = td.text_content()
    In [8]: print area
    244,820 square kilometres

In CSS, a selector is a pattern used to select the elements you want to style.

The "CSS" column indicates which CSS version defines the selector (CSS1, CSS2 or CSS3).

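Selector	Example	Description	CSS
.class	.intro	Selects all elements with class="intro".	1
#id	#firstname	Selects the element with id="firstname".	1
*	*	Selects all elements.	2
element	p	Selects all <p> elements.	1
element,element	div,p	Selects all <div> elements and all <p> elements.	1
element element	div p	Selects all <p> elements inside <div> elements.	1
element>element	div>p	Selects all <p> elements whose parent is a <div> element.	2
element+element	div+p	Selects all <p> elements placed immediately after a <div> element.	2
[attribute]	[target]	Selects all elements with a target attribute.	2
[attribute=value]	[target=_blank]	Selects all elements with target="_blank".	2
[attribute~=value]	[title~=flower]	Selects all elements whose title attribute contains the word "flower".	2
[attribute|=value]	[lang|=en]	Selects all elements whose lang attribute value starts with "en".	2
:link	a:link	Selects all unvisited links.	1
:visited	a:visited	Selects all visited links.	1
:active	a:active	Selects the active link.	1
:hover	a:hover	Selects the link under the mouse pointer.	1
:focus	input:focus	Selects the input element that has focus.	2
:first-letter	p:first-letter	Selects the first letter of every <p> element.	1
:first-line	p:first-line	Selects the first line of every <p> element.	1
:first-child	p:first-child	Selects every <p> element that is the first child of its parent.	2
:before	p:before	Inserts content before the content of every <p> element.	2
:after	p:after	Inserts content after the content of every <p> element.	2
:lang(language)	p:lang(it)	Selects every <p> element whose lang attribute value starts with "it".	2
element1~element2	p~ul	Selects every <ul> element that is preceded by a <p> element.	3
[attribute^=value]	a[src^="https"]	Selects every <a> element whose src attribute value begins with "https".	3
[attribute$=value]	a[src$=".pdf"]	Selects every <a> element whose src attribute value ends with ".pdf".	3
[attribute*=value]	a[src*="abc"]	Selects every <a> element whose src attribute value contains the substring "abc".	3
:first-of-type	p:first-of-type	Selects every <p> element that is the first <p> element of its parent.	3
:last-of-type	p:last-of-type	Selects every <p> element that is the last <p> element of its parent.	3
:only-of-type	p:only-of-type	Selects every <p> element that is the only <p> element of its parent.	3
:only-child	p:only-child	Selects every <p> element that is the only child of its parent.	3
:nth-child(n)	p:nth-child(2)	Selects every <p> element that is the second child of its parent.	3
:nth-last-child(n)	p:nth-last-child(2)	Same as above, counting from the last child.	3
:nth-of-type(n)	p:nth-of-type(2)	Selects every <p> element that is the second <p> element of its parent.	3
:nth-last-of-type(n)	p:nth-last-of-type(2)	Same as above, counting from the last child.	3
:last-child	p:last-child	Selects every <p> element that is the last child of its parent.	3
:root	:root	Selects the document's root element.	3
:empty	p:empty	Selects every <p> element that has no children (including text nodes).	3
:target	#news:target	Selects the currently active #news element.	3
:enabled	input:enabled	Selects every enabled <input> element.	3
:disabled	input:disabled	Selects every disabled <input> element.	3
:checked	input:checked	Selects every checked <input> element.	3
:not(selector)	:not(p)	Selects every element that is not a <p> element.	3
::selection	::selection	Selects the portion of an element that is selected by the user.	3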

For more on CSS selectors, see http://www.w3school.com.cn/cssref/css_selectors.ASP and https://pythonhosted.org/cssselect/#supported-selectors.
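To make the selector `tr#places_area__row > td.w2p_fw` from the session above more concrete, here is a minimal sketch that reads it as an XPath expression. cssselect works by translating CSS selectors into XPath. The path below is a simplified, hand-written equivalent and not the exact expression cssselect generates. The sketch uses Python 3 and only the standard library's `xml.etree.ElementTree`. The HTML fragment and the `select_field` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the country page's table.
SNIPPET = """
<table>
  <tr id="places_area__row">
    <td class="w2p_fl">Area: </td>
    <td class="w2p_fw">244,820 square kilometres</td>
  </tr>
  <tr id="places_iso__row">
    <td class="w2p_fl">ISO: </td>
    <td class="w2p_fw">GB</td>
  </tr>
</table>
""".strip()


def select_field(root, field):
    """Return the text of the value cell for one field, or None if absent.

    CSS:   tr#places_<field>__row > td.w2p_fw
    XPath: .//tr[@id='places_<field>__row']/td[@class='w2p_fw']

    Note: [@class='w2p_fw'] compares the whole attribute value, while the
    CSS class selector matches any one of several space-separated classes.
    """
    path = ".//tr[@id='places_%s__row']/td[@class='w2p_fw']" % field
    cell = root.find(path)
    return cell.text if cell is not None else None


root = ET.fromstring(SNIPPET)
area = select_field(root, 'area')
print(area)
print(select_field(root, 'iso'))
```

The `#id` part becomes an attribute test on the row, `>` becomes a single `/` step (direct child), and `.w2p_fw` becomes a test on the class attribute. ElementTree needs well-formed markup, so for real pages lxml remains the practical choice.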

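Next, we compare the performance of the three approaches (regular expressions, BeautifulSoup, and lxml) by extracting the country data from the page below.

Comparison code: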

 
import urllib2
import re
from bs4 import BeautifulSoup
import lxml.html
import time

FIELDS = ('area', 'population', 'iso', 'country', 'capital',
          'continent', 'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages',
          'neighbours')

def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
    """Download a page, retrying on 5XX server errors."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                return download(url, user_agent, num_retries - 1)
    return html

def re_scraper(html):
    """Extract every field with a regular expression."""
    results = {}
    for field in FIELDS:
        results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html.replace('\n', '')).groups()[0]
    return results

def bs_scraper(html):
    """Extract every field with BeautifulSoup."""
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text
    return results

def lxml_scraper(html):
    """Extract every field with lxml and a CSS selector."""
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
    return results

NUM_ITERATIONS = 1000  # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
for name, scraper in [('Regular expressions', re_scraper),
                      ('BeautifulSoup', bs_scraper),
                      ('Lxml', lxml_scraper)]:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            # clear the regex cache so compiled patterns are not reused
            re.purge()
        result = scraper(html)
        # check scraped result is as expected
        assert(result['area'] == '244,820 square kilometres')
    # record end time of scrape and output the total
    end = time.time()
    print '%s: %.2f seconds' % (name, end - start)

Results on Windows:

 
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
Regular expressions: 11.63 seconds
BeautifulSoup: 92.80 seconds
Lxml: 7.25 seconds

Results on Linux:

 
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
Regular expressions: 3.09 seconds
BeautifulSoup: 29.40 seconds
Lxml: 4.25 seconds
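The timing loop can also be tried offline. The following is a minimal Python 3 sketch of the same harness that needs no network access and no third-party packages. It is not the code that produced the figures above. The inline `HTML` fragment, the reduced `FIELDS` tuple, and the `parser_scraper` function (built on the standard library's `html.parser`, standing in for BeautifulSoup and lxml) are all assumptions made for illustration.

```python
import re
import time
from html.parser import HTMLParser

# Hypothetical fragment standing in for the downloaded country page.
HTML = """<table>
<tr id="places_area__row"><td class="w2p_fl">Area: </td>
<td class="w2p_fw">244,820 square kilometres</td></tr>
<tr id="places_iso__row"><td class="w2p_fl">ISO: </td>
<td class="w2p_fw">GB</td></tr>
</table>"""

FIELDS = ('area', 'iso')


def re_scraper(html):
    """Extract each field with a regular expression."""
    results = {}
    for field in FIELDS:
        pattern = ('<tr id="places_%s__row">.*?'
                   '<td class="w2p_fw">(.*?)</td>' % field)
        # re.DOTALL lets '.' match newlines, so '\n' need not be stripped
        results[field] = re.search(pattern, html, re.DOTALL).group(1)
    return results


class RowParser(HTMLParser):
    """Collect the text of td.w2p_fw cells, keyed by the row's field name."""

    def __init__(self):
        super().__init__()
        self.results = {}
        self._field = None       # field name of the current <tr>, if any
        self._in_value = False   # True while inside a td.w2p_fw cell

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr':
            row_id = attrs.get('id') or ''
            if row_id.startswith('places_') and row_id.endswith('__row'):
                self._field = row_id[len('places_'):-len('__row')]
        elif tag == 'td' and attrs.get('class') == 'w2p_fw':
            self._in_value = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_value = False
        elif tag == 'tr':
            self._field = None

    def handle_data(self, data):
        if self._in_value and self._field in FIELDS:
            self.results[self._field] = self.results.get(self._field, '') + data


def parser_scraper(html):
    """Extract each field with the standard library's event-based parser."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.results


def benchmark(scrapers, html, iterations=1000):
    """Run every scraper `iterations` times and return seconds per scraper."""
    timings = {}
    for name, scraper in scrapers:
        start = time.perf_counter()
        for _ in range(iterations):
            if scraper is re_scraper:
                re.purge()  # drop cached patterns so each run recompiles
            result = scraper(html)
            assert result['area'] == '244,820 square kilometres'
        timings[name] = time.perf_counter() - start
    return timings


timings = benchmark([('Regular expressions', re_scraper),
                     ('html.parser', parser_scraper)], HTML, iterations=200)
for name, seconds in timings.items():
    print('%s: %.2f seconds' % (name, seconds))
```

Calling `re.purge()` before each regex run matters because the `re` module caches compiled patterns. Without it, every iteration after the first would skip compilation and the regex scraper would look faster than it is on a cold start. Absolute timings depend on the machine and on the page size. In the runs reported above, BeautifulSoup was several times slower than both regular expressions and lxml on either platform.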