Python Web Scraping Basics: A Detailed Guide to the urllib Library (Simulating a CSDN Login)

philos3

Category: Python. Tags: python, web scraping

Article link: https://blog.csdn.net/philos3/article/details/76762639

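This article covers how Python's urllib library is used for web scraping: constructing a Request, sending data, adding headers, handling HTTP errors and exceptions, HTTP authentication, using proxies, and setting timeouts. By simulating the CSDN login process, it shows how to send a POST request with urllib and handle the parameters involved in logging in.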

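urllib is a high-level library built on top of HTTP. It offers three main features:
(1) request: builds and sends the client's requests (urllib.request)
(2) response: wraps the server's reply (the object returned by urlopen())
(3) parse: parses URLs (urllib.parse)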
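
The listing later in this section exercises the request and response parts, so here is a minimal sketch of the parse part. The query string on the Baidu URL is made up for illustration and is not taken from the article.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Illustrative URL: the query parameters are an assumption for this sketch.
parts = urlparse("http://www.baidu.com/s?wd=python&pn=10")
scheme, host, path = parts.scheme, parts.netloc, parts.path

# parse_qs turns the query string into a dict of lists.
params = parse_qs(parts.query)

# urlencode goes the other way and escapes values (a space becomes '+').
encoded = urlencode({"wd": "python urllib", "pn": 10})

print(scheme, host, path, params, encoded)
```

urlencode is also what you would normally use to build the body of a POST request; call .encode() on its result to turn it into bytes before passing it as data.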

1. Fetching Web Page Content

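As we know, the polished pages a browser displays are, underneath, just chunks of HTML code plus JS, CSS, and so on. I have only just started learning Python myself, so this article is fairly beginner-level; seasoned veterans can skip it.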

All code in this article targets Python 3; take note if you are using Python 2.

In Python 3.x, the urllib and urllib2 libraries were merged into a single urllib package:
urllib2.urlopen() became urllib.request.urlopen()
urllib2.Request() became urllib.request.Request()
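
If the same script has to run under both versions, one common pattern is a guarded import. This is only a sketch of that pattern, not something the original article does; the rest of the article assumes Python 3.

```python
try:
    # Python 3: the merged package
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2 fallback: the old urllib2 module
    from urllib2 import Request, urlopen

# Either way, the rest of the script can use the same two names.
print(urlopen.__module__, Request.__module__)
```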

So what ways are there to fetch a web page? Three approaches are listed here; see the code for details.

import urllib.request
import http.cookiejar

url = 'http://www.baidu.com'

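# Method 1: fetch the page data directly from the URL
print('Method 1: fetch the page data directly from the URL')
response = urllib.request.urlopen(url)
html = response.read()         # raw bytes
mystr = html.decode("utf8")    # decode the bytes into a str
response.close()
print(mystr)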

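# Method 2: build a Request object, then fetch the page data
print('Method 2: build a Request object, then fetch the page data')
request = urllib.request.Request(url)
# Send a browser-like User-Agent instead of the default Python one
request.add_header('user-agent', 'Mozilla/5.0')
response2 = urllib.request.urlopen(request)
html2 = response2.read()
mystr2 = html2.decode("utf8")
response2.close()
print(mystr2)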


# Method 3: fetch with cookie support (requires import http.cookiejar)
print('Method 3: fetch with cookie support')
cj = http.cookiejar.LWPCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# install_opener makes this opener the default for all later urlopen() calls
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(cj)  # cookies the server set on this response
html3 = response3.read()
mystr3 = html3.decode("utf8")
response3.close()
print(mystr3)
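
The three approaches above can be folded into one small helper. The following is a minimal sketch rather than the article's own code: fetch_text and its parameters are hypothetical names, the 10-second timeout and the UTF-8 fallback are assumptions, and it uses a private opener instead of install_opener() so the cookie handling stays local to the call.

```python
import http.cookiejar
import urllib.request


def fetch_text(url, user_agent="Mozilla/5.0", timeout=10, fallback_charset="utf-8"):
    """Fetch url and return (decoded_text, cookie_jar).

    fetch_text is a hypothetical helper name used only in this sketch.
    """
    cookie_jar = http.cookiejar.CookieJar()
    # A private opener keeps cookies local to this call instead of
    # changing the process-wide default the way install_opener() does.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # The with statement closes the response even if read() raises.
    with opener.open(request, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or fallback_charset
    # errors="replace" avoids a crash on bytes that do not match the charset.
    return raw.decode(charset, errors="replace"), cookie_jar
```

A call such as text, jar = fetch_text('http://www.baidu.com') would then return the page text and the cookies the server set; looping over jar yields the individual cookie objects. Reading the charset from the response headers matters because not every site serves UTF-8, and hard-coding "utf8" raises UnicodeDecodeError on pages that use another encoding.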