A First Look at the urllib Library, and a Discussion of Chinese Encoding Issues


Problem: how can we fetch the source of a web page in the simplest way?
  • Solution: use the urllib library to fetch the page's source code.

    ------------------------------------------------------------------------------------

    • Code example
    #python3.4
    import urllib.request
    
    response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
    print(response.read())
    • Output
    b'\n<!DOCTYPE html>\n<html>\n<head>\n    <meta charset="utf-8"/>\n    <title>\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title>    \n    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>\n    <meta content="\xe6\x8a\x80\xe6\x9c\xaf\xe6\x90\x9c\xe7\xb4\xa2,IT\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe6\x90\x9c\xe7\xb4\xa2,\xe4\xbb\xa3\xe7\xa0\x81\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e" name="keywords" />\n    <meta content="\xe9\x9d\xa2\xe5\x90\x91\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe7\x9a\x84\xe4\xb8\x93\xe4\xb8\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x82\xe9\x81\x87\xe5\x88\xb0\xe6\x8a\x80\xe6\x9c\xaf\xe9\x97\xae\xe9\xa2\x98\xe6\x80\x8e\xe4\xb9\x88\xe5\x8a\x9e\xef\xbc\x8c\xe5\x88\xb0\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b..." name="description" />\n    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />\n    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>\n    <script src="/Scripts/Common.js" type="text/javascript"></script>\n    <script src="/Scripts/Home.js" type="text/javascript"></script>\n</head>\n<body>\n    <div class="top">\n        \n        <div class="top_tabs">\n            <a href="http://www.cnblogs.com">\xc2\xab \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe9\xa6\x96\xe9\xa1\xb5 </a>\n        </div>\n        <div id="span_userinfo" class="top_links">\n        </div>\n    </div>\n    <div style="clear: both">\n    </div>\n    <center>\n        <div id="main">\n            <div class="logo_index">\n                <a href="http://zzk.cnblogs.com">\n                    <img alt="\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8blogo" src="/images/logo.gif" /></a>\n            </div>\n            <div class="index_sozone">\n                <div class="index_tab">\n                    <a href="/n" onclick="return  channelSwitch(&#39;n&#39;);">\xe6\x96\xb0\xe9\x97\xbb</a>\n<a class="tab_selected" href="/b" onclick="return  channelSwitch(&#39;b&#39;);">\xe5\x8d\x9a\xe5\xae\xa2</a>                    <a href="/k" onclick="return  channelSwitch(&#39;k&#39;);">\xe7\x9f\xa5\xe8\xaf\x86\xe5\xba\x93</a>\n                    <a href="/q" onclick="return  channelSwitch(&#39;q&#39;);">\xe5\x8d\x9a\xe9\x97\xae</a>\n                </div>\n                <div class="search_block">\n                    <div class="index_btn">\n                        <input type="button" class="btn_so_index" onclick="Search();" value="&nbsp;\xe6\x89\xbe\xe4\xb8\x80\xe4\xb8\x8b&nbsp;" />\n                        <span class="help_link"><a target="_blank" href="/help">\xe5\xb8\xae\xe5\x8a\xa9</a></span>\n                    </div>\n                    <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />\n                </div>\n            </div>\n        </div>\n        <div class="footer">\n            &copy;2004-2016 <a href="http://www.cnblogs.com">\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</a>\n        </div>\n    </center>\n</body>\n</html>\n'
    • For comparison, the Python 2.7 implementation:
    #python2.7
    import urllib2
     
    response = urllib2.urlopen("http://zzk.cnblogs.com/b")
    print response.read()
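    • As you can see, the code differs between Python 3.4 and Python 2.7: the Python 2 module urllib2 became urllib.request in Python 3, and print changed from a statement to a function.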
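
The import difference can be bridged with a guarded import. The following is only a minimal sketch, assuming a script that has to run under both interpreters; the helper name fetch_bytes is hypothetical and does not come from the original post.

```python
# Sketch: pick whichever urlopen the running interpreter provides.
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2


def fetch_bytes(url):
    """Return the raw, undecoded body of `url` (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        # Release the connection even if read() fails.
        response.close()
```

Calling fetch_bytes("http://zzk.cnblogs.com/b") would then return the same bytes object under either version.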

     

    ----------@_@? A problem appears!----------------------------------------------------------------------

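    • The problem: in the output above, the Chinese text is not displayed properly. read() returns a bytes object, so each Chinese character shows up as UTF-8 escape sequences such as \xe6\x89\xbe.
    • The fix: decode the bytes to handle the Chinese encoding.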
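
To see what is going on, it may help to decode a small piece of that output by hand. The sketch below copies the bytes found inside the <title> element of the result above; the variable names are illustrative only.

```python
# Bytes copied from the <title> element in the raw output above.
raw_title = b"\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad"

# read() hands back bytes; decode() turns them into a str.
title = raw_title.decode("utf-8")

print(type(raw_title).__name__)  # bytes
print(title)                     # 找找看 - 博客园
```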

    --------------------------------------------------------------------------------------------------

     

    • Handling the Chinese text in the page source
    • Modify the code as follows:
    #python3.4
    import urllib.request
    
    response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
    print(response.read().decode('UTF-8'))
    • Run it; the output is:
    C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
    
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="utf-8"/>
        <title>找找看 - 博客园</title>    
        <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
        <meta content="技术搜索,IT搜索,程序搜索,代码搜索,程序员搜索引擎" name="keywords" />
        <meta content="面向程序员的专业搜索引擎。遇到技术问题怎么办,到博客园找找看..." name="description" />
        <link type="text/css" href="/Content/Style.css" rel="stylesheet" />
        <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>
        <script src="/Scripts/Common.js" type="text/javascript"></script>
        <script src="/Scripts/Home.js" type="text/javascript"></script>
    </head>
    <body>
        <div class="top">
            
            <div class="top_tabs">
                <a href="http://www.cnblogs.com">« 博客园首页 </a>
            </div>
            <div id="span_userinfo" class="top_links">
            </div>
        </div>
        <div style="clear: both">
        </div>
        <center>
            <div id="main">
                <div class="logo_index">
                    <a href="http://zzk.cnblogs.com">
                        <img alt="找找看logo" src="/images/logo.gif" /></a>
                </div>
                <div class="index_sozone">
                    <div class="index_tab">
                        <a href="/n" onclick="return  channelSwitch(&#39;n&#39;);">新闻</a>
    <a class="tab_selected" href="/b" onclick="return  channelSwitch(&#39;b&#39;);">博客</a>                    <a href="/k" onclick="return  channelSwitch(&#39;k&#39;);">知识库</a>
                        <a href="/q" onclick="return  channelSwitch(&#39;q&#39;);">博问</a>
                    </div>
                    <div class="search_block">
                        <div class="index_btn">
                            <input type="button" class="btn_so_index" onclick="Search();" value="&nbsp;找一下&nbsp;" />
                            <span class="help_link"><a target="_blank" href="/help">帮助</a></span>
                        </div>
                        <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />
                    </div>
                </div>
            </div>
            <div class="footer">
                &copy;2004-2016 <a href="http://www.cnblogs.com">博客园</a>
            </div>
        </center>
    </body>
    </html>
    
    
    Process finished with exit code 0
    • Result: once the bytes are decoded, the Chinese text in the page source displays correctly.
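
Hard-coding 'UTF-8' works here because this page declares charset="utf-8". As a hedged sketch for other pages, one could prefer the charset announced in the response headers and fall back to UTF-8. decode_body is a hypothetical helper, and the fallback policy is an assumption rather than part of the original code.

```python
def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes, preferring the declared charset (hypothetical helper)."""
    encoding = charset or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: decode leniently instead of crashing.
        return raw.decode(fallback, errors="replace")


# Usage with urllib (needs network access, so it is left as a comment):
#   response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
#   charset = response.headers.get_content_charset()  # None if no charset is declared
#   print(decode_body(response.read(), charset))

print(decode_body(b"\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad"))  # 博客园
```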

     

     

    -----------@_@! Another Chinese encoding problem ----------------------------------------------------------

       Question: "If the URL itself contains Chinese characters, how should we handle it?"

       For example: url = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"

      

    -----------------------------------------------------------------------------------------------------

     

    • Next, let's solve the problem of Chinese characters in the URL.

    (1) Test 1: keep the URL as it is and request it directly, with no processing

    • Code example:
    #python3.4
    import urllib.request
    
    url="http://zzk.cnblogs.com/s?w=python爬虫&t=b"
    resp = urllib.request.urlopen(url)
    print(resp.read().decode('UTF-8'))
    • Output:
    C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
    Traceback (most recent call last):
      File "E:/pythone_workspace/mydemo/spider/demo.py", line 9, in <module>
        response = urllib.request.urlopen(url)
      File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Python34\lib\urllib\request.py", line 463, in open
        response = self._open(req, data)
      File "C:\Python34\lib\urllib\request.py", line 481, in _open
        '_open', req)
      File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
        result = func(*args)
      File "C:\Python34\lib\urllib\request.py", line 1210, in http_open
        return self.do_open(http.client.HTTPConnection, req)
      File "C:\Python34\lib\urllib\request.py", line 1182, in do_open
        h.request(req.get_method(), req.selector, req.data, headers)
      File "C:\Python34\lib\http\client.py", line 1088, in request
        self._send_request(method, url, body, headers)
      File "C:\Python34\lib\http\client.py", line 1116, in _send_request
        self.putrequest(method, url, **skips)
      File "C:\Python34\lib\http\client.py", line 973, in putrequest
        self._output(request.encode('ascii'))
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
    
    Process finished with exit code 1
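
      As expected, it fails! The last frame shows that the HTTP request line is encoded as ASCII, and the two Chinese characters cannot be encoded that way.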
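
The last frame of the traceback, request.encode('ascii'), can be reproduced without any network access. The sketch below assumes the request line has the shape "GET <path> HTTP/1.1"; that string is reconstructed here for illustration and is not taken from the library source.

```python
# Assumed shape of the HTTP request line built for the URL above.
request_line = "GET /s?w=python爬虫&t=b HTTP/1.1"

try:
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    # The two Chinese characters sit at indexes 15 and 16.
    print(exc.start, exc.end - 1)                  # 15 16
    print(repr(request_line[exc.start:exc.end]))   # '爬虫'
```

The positions match the "position 15-16" reported in the traceback.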

     

    (2) Test 2: percent-encode the Chinese part separately

    • Code example:
    import urllib.request
    import urllib.parse
    
    url = "http://zzk.cnblogs.com/s?w=python"+ urllib.parse.quote("爬虫")+"&t=b"
    resp = urllib.request.urlopen(url)
    print(resp.read().decode('utf-8'))
    • Result: with the Chinese part of the URL encoded separately, the page at that URL is fetched successfully.
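
What quote() produces for the Chinese part can be checked directly. As an alternative sketch (not used in the original post), urllib.parse.urlencode builds the whole query string from key/value pairs and percent-encodes each value; the variable names are illustrative.

```python
import urllib.parse

# Percent-encode only the Chinese part, as in Test 2.
encoded = urllib.parse.quote("爬虫")
print(encoded)  # %E7%88%AC%E8%99%AB

# Alternative: let urlencode build and escape the query string.
query = urllib.parse.urlencode([("w", "python爬虫"), ("t", "b")])
url = "http://zzk.cnblogs.com/s?" + query
print(url)  # http://zzk.cnblogs.com/s?w=python%E7%88%AC%E8%99%AB&t=b
```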

     

    ------@_@! Yet another question-----------------------------------------------------------

    • Question: what if we process the whole URL, Chinese and ASCII parts together? Will the fetch still succeed?

    ----------------------------------------------------------------------------------------

    (3) Hence Test 3: pass the entire URL, Chinese and ASCII parts together, through quote()

    • Code example:
    #python3.4
    import urllib.request
    import urllib.parse
    
    url = urllib.parse.quote("http://zzk.cnblogs.com/s?w=python爬虫&t=b")
    resp = urllib.request.urlopen(url)
    print(resp.read().decode('utf-8'))
    • Output:
    C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
    Traceback (most recent call last):
      File "E:/pythone_workspace/mydemo/spider/demo.py", line 21, in <module>
        resp = urllib.request.urlopen(url)
      File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Python34\lib\urllib\request.py", line 448, in open
        req = Request(fullurl, data)
      File "C:\Python34\lib\urllib\request.py", line 266, in __init__
        self.full_url = url
      File "C:\Python34\lib\urllib\request.py", line 292, in full_url
        self._parse()
      File "C:\Python34\lib\urllib\request.py", line 321, in _parse
        raise ValueError("unknown url type: %r" % self.full_url)
    ValueError: unknown url type: 'http%3A//zzk.cnblogs.com/s%3Fw%3Dpython%E7%88%AC%E8%99%AB%26t%3Db'
    
    Process finished with exit code 1
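    • Result: ValueError! The page cannot be fetched. quote() also escaped ':', '?', '=' and '&', so urlopen no longer recognizes the "http:" scheme.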

     

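    • Putting Tests 1, 2 and 3 together, we can conclude:

    (1) In Python 3.4, Chinese characters in a URL can be handled with urllib.parse.quote("爬虫").

    (2) With its default arguments, quote() should be applied only to the Chinese part and not to the whole URL, because it also escapes reserved characters such as ':' and '?'.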
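
Point (2) applies to quote() with its default arguments, which treat only '/' as safe. A possible workaround, sketched here as an assumption rather than the author's method, is to list the URL delimiters in the safe parameter so that only the non-ASCII characters are escaped.

```python
import urllib.parse

url = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"

# Default: ':' '?' '=' '&' are escaped too, which breaks the scheme (Test 3).
broken = urllib.parse.quote(url)
print(broken)  # http%3A//zzk.cnblogs.com/s%3Fw%3Dpython%E7%88%AC%E8%99%AB%26t%3Db

# Keep the delimiters this URL uses; only the Chinese characters are escaped.
fixed = urllib.parse.quote(url, safe=":/?&=")
print(fixed)   # http://zzk.cnblogs.com/s?w=python%E7%88%AC%E8%99%AB&t=b
```

The safe string here covers only the delimiters that appear in this particular URL; other URLs may need more characters (for example '#' or '%').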

     

    • Tip: to check which parameters a function accepts, call help() on it
    #python3.4
    import urllib.request
    
    help(urllib.request.urlopen)
    • Running the code above prints the following to the console:
    C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
    Help on function urlopen in module urllib.request:
    
    urlopen(url, data=None, timeout=<object object at 0x00A50490>, *, cafile=None, capath=None, cadefault=False, context=None)
    
    Process finished with exit code 0
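
help() is meant for interactive reading. If the parameter names are needed inside a program, the standard inspect module can list them. This is a small sketch, and the exact parameter list depends on the Python version.

```python
import inspect
import urllib.request

signature = inspect.signature(urllib.request.urlopen)
param_names = list(signature.parameters)

# Expect at least url, data and timeout; later entries vary by Python version.
print(param_names)
```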