爬虫学习笔记【1】使用 urllib 获取 www 资源

最新推荐文章于 2023-01-17 19:49:52 发布

机灵鹤

最新推荐文章于 2023-01-17 19:49:52 发布

阅读量6.7k

点赞数 2

分类专栏：网络爬虫笔记文章标签： Crawler Python

本文链接：https://blog.csdn.net/wenxuhonghe/article/details/83033252

版权

网络爬虫笔记专栏收录该内容

10 篇文章 26 订阅

订阅专栏

1. 掌握普通网页的获取方法

查看 urllib.request 的基本信息

urllib.request 中最常用的方法是 urlopen() ,它也是我们使用 urllib 获取普通网页的基本方法。在应用之前，我们先看一下 urllib 的源代码，这是从事IT软件类技术工作要养成的职业习惯。由于 urllib 是 python3 内置库，所以无需安装。源代码的路径可以在 import urllib 或 import.request 后，使用 file属性查看。

import urllib.request

#可以使用语句查看摘要信息
print(urllib.request.__all__)

#可以使用语句查看urllib的本地位置
print(urllib.request.__file__)

urllib.request.urlopen() 方法的应用

从头部注释中可以了解 urlopen 方法需要传入一个字符串参数：页面的URL ，然后它会打开这个URL，返回类文件对象的响应对象。

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,*, cafile=None, capath=None, cadefault=False, context=None)

查看上面 urlopen 方法原型，了解它的功能和调用方法。可以看到， url 是必须给定的参数，其他参数可以默认。下面我们尝试使用 urlopen 打开百度网页。

我们使用了 with...as... 语句调用，这样会更有利于在不使用时正常关闭连接。返回的结果是 HTTPResponse 对象。调用这个对象的 read() 方法，可以访问具体的文件内容。

"""简单网页获取"""
import urllib.request

url = 'http://www.smartcrane.club'

#使用urlopen打开网页
#使用 with ... as ... 语句调用，有利于在不适用的时候正常关闭连接
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode('utf-8'))

urllib.request.urlretrieve() 方法的应用

urllib.request.urlretrieve() 方法能够以另一种形式获取页面内容，它会将页面内容存为临时文件并获取 response 头。

可以查看 urlretrieve 方法的原型：

def urlretrieve(url, filename=None, reporthook=None, data=None)

"""urlretrieve()方法试用"""
import urllib.request

tempfile,header = urllib.request.urlretrieve(url)

print(header)
print('--'*10)
print(tempfile)

理解 HTTPResponse 对象

HTTPResponse 对象是一种类文件对象，除了可以文件的 read() 方法读取它的内容外，还有别的属性和方法。

例如： r.code 与 r.status 属性存放本次请求的响应码; r.headers 属性存放响应头； r.url 属性存放了发出响应的服务器URL；还可以尝试 info() 和 geturl() 方法。

(使用 response 的 geturl() 和 info 方法来验证请求与响应是否如我们希望的一样。有时会出现请求发往的服务器与应答服务器不是同一台主机的情况。)

"""理解 HTTPResponse 对象"""

import urllib.request

url = 'http://www.smartcrane.club'

with urllib.request.urlopen(url) as resp:
    print(resp)
    print(resp.code)
    print(resp.status)
    print(resp.headers)
    print(resp.url)
    print(resp.info())
    print(resp.geturl())

2. 掌握使爬虫更像浏览器的方法

默认情况下，urllib发出的请求头大致如下所示：

GET / HTTP/1.1
Accept-Encoding: identity 
Host: 10.10.10.135 
User-Agent: Python-urllib/3.6 
Connection: close

大多数网站的服务器端会进行内容审查，检查客户端类型，一方面是为了满足多样化的需求；另一方面也可以限制一些网络爬虫程序的访问。

上面内容中的 User-Agent 就是一个内容审查重点，一般的浏览器发出的请求头如下所示：

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Cookie: _ga=GA1.2.1750953090.1536588381; _gid=GA1.2.14861119.1539324601; _gat=1
Host: www.smartcrane.club
If-Modified-Since: Tue, 09 Oct 2018 06:29:24 GMT
If-None-Match: W/"5bbc4ac4-82e1"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36

服务器发现不是正常浏览器可以拒绝提供服务，例如访问 www.z.cn 时，使用下面代码会报出 HTTP Error 503: Service Unavailable:()

with urllib.request.urlopen(url) as response: 
    print(response.status)

这时我们可以定制请求对象 HTTPRequest ，是指更像是浏览器发出的。

"""定制 request 对象，使爬虫更像浏览器"""

import urllib.request

url = 'http://www.smartcrane.club'

header = {
    'Accept': 'text/html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

request = urllib.request.Request(url = url, headers = header)

with urllib.request.urlopen(request) as resp:
    print(resp.status)

3. 掌握向服务器传递参数的方法

许多 HTTP 方法都可以用来向服务器提供数据，最常见的 GET 和 POST 方法都可以，但方式不同。

使用 GET 方法向服务器提供数据

"""向服务器提交参数"""
import urllib.parse
import urllib.request

#1 利用 Get 方法通过 URL 参数提交信息

url = 'http://www.baidu.com/s?'

wd = {'wd':'北航'}

wdcoded = urllib.parse.urlencode({'wd':'北航'})

url1 =  url + wdcoded


header = {
    'Accept': 'text/html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

request = urllib.request.Request(url = url, headers = header)

with urllib.request.urlopen(request) as resp:
    print(resp.status)
    print(resp.headers)
    print(resp.read().decode('utf-8'))

使用 POST 方法向服务器提供数据

"""
利用 POST 方法，向 http://httpbin.org 提交
事先应在该网站进行设置，启动试用链接
"""

import urllib.request
import urllib.parse

url = 'http://www.httpbin.org/post'
payload = {'key1':'value1','key2':'value'}
header = {
    'Accept': 'text/html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}

request = urllib.request.Request(url=url,data=urllib.parse.urlencode(payload).encode('utf-8'),headers=header)

with urllib.request.urlopen(request) as resp:
    print(resp.read().decode('utf-8'))

4. 掌握设置超时访问限制和处理异常的方法

urllib.error 处理异常,两个常用异常类：urllib.error.URLError 和 HTTPError

设置 Time-Out 超时访问限制

"""设置 Time-Out"""
import urllib.request
import socket

# tineout in seconds
timeout = 3
socket.setdefaulttimeout(timeout)

# this call to urllib.request.urloopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request("http://www.python.org/")
a = urllib.request.urlopen(req).read()
print(a)

使用 urllib.error 处理异常

URLError 继承自 OSError ，是 urllib 的异常的基础类。
HTTPError 是验证 HTTP Response 实例的一个异常类。

HTTP protocol errors 是有效的 response ，有 状态码 、 header 、 body 。

一个成熟的程序需要管理所有的输出，不仅有预期的输出，还有有意料之外的异常。
logging 的使用可以参考 https://docs.python.org/3.5/howto/logging.html

"""使用urlllib.error处理异常"""

import urllib.request
import urllib.error
import urllib.parse
import logging

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                   datefmt='%Y-%m-%d %H:%M:%S',
                   filename='C:\\Users\\Wen Xuhonghe\\Documents\\Crawler\\CrawlerLesson1_crawler.log',
                   level=logging.DEBUG)
try:
    url = 'http://www.baidu11.com'   # A wrong website url
    headers = {
        'Accept': 'text/html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    request = urllib.request.Request(url,headers=headers)
    with urllib.request.urlopen(request) as response:
        print(response.status)
        print(response.read().decode('utf-8'))

except urllib.error.HTTPError as e:
    import http.server
    #print(http.server.BaseHTTPRequestHandler.response[e.code])
    logging.error('HTTPError code: %s and Messages: %s' %(str(e.code),http.server.BaseHTTPRequestHandler.response[e.code]))
    logging.info('HTTPError headers: ' + str(e.headers))
    logging.info(e.read().decode('utf-8'))
    print('Error : urllib.error.HTTPError')
except urllib.error.URLError as e:
    logging.error(e.reason)
    print('Error : urllib.error.URLError')

5. 实例：从百度贴吧下载多页话题内容

先了解一下百度贴吧（ http://tieba.baidu.com/f?）我们定义几个函数：

loadPage(url) 用于获取网页
writePage(html,filename) 用于将已获得的网页存储为本地文件
tiebaCrawler(url,beginpPage,endPage,keyword)用于调度，提供需要抓取的页面URLs
main：程序主控模块，完成基本命令行交互接口

import urllib.request
import urllib.error
import urllib.parse
import logging

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                   datefmt='%Y-%m-%d %H:%M:%S',
                   filename='C:\\Users\\Wen Xuhonghe\\Documents\\Crawler\\CrawlerLesson1_crawler.log',
                   level=logging.DEBUG)

    
def loadPage(url):
    try:
        headers = {
            'Accept': 'text/html',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        }
        request = urllib.request.Request(url,headers=headers)
        with urllib.request.urlopen(request) as response:
            print(response.status)
            #print(response.read().decode('utf-8'))
            return response.read()
            
    except urllib.error.HTTPError as e:
        import http.server
        #print(http.server.BaseHTTPRequestHandler.response[e.code])
        logging.error('HTTPError code: %s and Messages: %s' %(str(e.code),http.server.BaseHTTPRequestHandler.response[e.code]))
        logging.info('HTTPError headers: ' + str(e.headers))
        logging.info(e.read().decode('utf-8'))
        print('Error : urllib.error.HTTPError')
    except urllib.error.URLError as e:
        logging.error(e.reason)
        print('Error : urllib.error.URLError')

def WritePage(html, filename):
    fp=open(filename,'w+b')
    fp.write(html)
    fp.close()

if __name__ == "__main__":   
    for i in range(3):
        url = 'http://tieba.baidu.com/p/5872199831?pn=' + str(i+1)
        response = loadPage(url)
        filename = 'C:\\Users\\Wen Xuhonghe\\Documents\\Crawler\\CrawlerLesson1_Tieba' + str(i+1) + '.html'
        WritePage(response,filename)