Python 3 Web Crawler (1): The urllib Module

Latest recommended article published on 2023-10-19 22:13:43

Jaichg

Category: Data Retrieval and Web Crawlers. Tags: urllib, web crawler, python3

Article link: https://blog.csdn.net/jiaach/article/details/81672064

With the built-in urllib package, a program can issue HTTP requests itself and fetch a page's HTML directly.

urllib is made up of four modules:
urllib.request: sends requests
urllib.error: exception handling, for example a 404 response
urllib.parse: URL parsing
urllib.robotparser: robots.txt parsing
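
The two helper modules in the list above can be tried without touching the network. The sketch below is only an illustration: the example.com URLs, the query parameters, the robots.txt rules, and the function name build_search_url are all made up for the demonstration and do not come from the original post.

```python
from urllib import parse, robotparser


def build_search_url(base, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return base + "?" + parse.urlencode(params)


# urllib.parse: build a URL, then split it back into its components.
url = build_search_url("https://example.com/search", {"q": "urllib crawler", "page": 2})
parts = parse.urlparse(url)
query = parse.parse_qs(parts.query)

# urllib.robotparser: feed robots.txt rules in as lines instead of downloading them.
rules = robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
allowed = rules.can_fetch("*", "https://example.com/index.html")
blocked = rules.can_fetch("*", "https://example.com/private/data.html")

print(parts.netloc, query, allowed, blocked)
```

In a real crawler the rules would normally be loaded with set_url() and read() against the target site's robots.txt before any page is requested.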

A simple example:

from urllib import request
from urllib import error

# Placeholder header values; replace them with real ones before running.
user_agent = 'M*533.1'
cookie = '***'
headers = {
    'User-Agent': user_agent,
    'Cookie': cookie
}

url = "https://***.com/"
# Attach the custom headers to the request.
req = request.Request(url, headers=headers)
try:
    response = request.urlopen(req)
    data = response.read()
    page = data.decode('utf-8')
    print(page)
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first.
    print('HTTPError', e.code)
except error.URLError as e:
    print('URLError', e.reason)
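
The example above hard-codes utf-8 when decoding. As a possible variation, the sketch below wraps the same steps in a function, adds a timeout, and reads the charset from the response headers when the server declares one. The function name fetch_page, its parameters, and the data: URL used for the demonstration are assumptions of this sketch, not part of the original example.

```python
from urllib import request


def fetch_page(url, headers=None, timeout=10, fallback_charset="utf-8"):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    req = request.Request(url, headers=headers or {})
    with request.urlopen(req, timeout=timeout) as response:
        data = response.read()
        # Prefer the charset declared in Content-Type; fall back if it is absent.
        charset = response.headers.get_content_charset() or fallback_charset
    return data.decode(charset, errors="replace")


# A data: URL keeps the demonstration offline; a real call would pass an http(s) URL.
page = fetch_page("data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E")
print(page)
```

Errors are deliberately not caught inside the function, so the caller can still handle HTTPError and URLError as in the example above.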

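Common issues:
1. Simulating a browser: set the User-Agent header.
2. Some pages need a Cookie header to keep the user logged in.
3. Some sites need a Referer header to get past hotlink protection.
4. When catching both HTTPError and URLError, put the HTTPError clause before the URLError clause, because HTTPError is a subclass of URLError. If URLError comes first, it also catches every HTTPError, so the HTTPError clause never runs and the HTTP-specific handling is skipped.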
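
The points above can be checked without sending a request. The following is a minimal sketch: the helper names build_request and describe, the example.com URLs, and all header values are placeholders invented for illustration.

```python
from urllib import error, request


def build_request(url, user_agent, cookie=None, referer=None):
    """Build a Request carrying browser-like headers (hypothetical helper)."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie      # keeps the logged-in session
    if referer:
        headers["Referer"] = referer    # for sites with hotlink protection
    return request.Request(url, headers=headers)


def describe(exc):
    """Show which except clause handles a given exception."""
    try:
        raise exc
    except error.HTTPError as e:        # subclass first
        return "HTTPError %d" % e.code
    except error.URLError as e:         # base class second
        return "URLError %s" % e.reason


req = build_request(
    "https://example.com/page",
    "Mozilla/5.0 (placeholder)",
    cookie="session=placeholder",
    referer="https://example.com/",
)
print(req.header_items())
print(issubclass(error.HTTPError, error.URLError))
print(describe(error.HTTPError("https://example.com/missing", 404, "Not Found", None, None)))
print(describe(error.URLError("connection refused")))
```

Swapping the two except clauses in describe would make the 404 case fall into the URLError branch, which is the situation item 4 warns about.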

Reference:
http://blog.51cto.com/shangdc/2090763