Python 3 Web Scraping (Part 1): The urllib Request Library

Song_Lynn

Category: python. Tags: python, web scraping, urllib

Article link: https://blog.csdn.net/Song_Lynn/article/details/82926573

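urllib is the standard-library package in Python 3 for working with URLs. In Python 2 this functionality was split between urllib and urllib2.

Fetching a web page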

urllib.request.urlopen(url, data, timeout)

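url: the address to request, in the form http://host[:port][path]
data: the data to upload in the request body (optional; supplying it turns the request into a POST)
- Encoding: urllib.parse.urlencode(dict_name).encode('utf8')
timeout: how long to wait, in seconds (optional). A poor network, a server-side fault, or a slow or failing request could otherwise leave the program waiting indefinitely.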
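
As a rough illustration of the data conversion described above, the sketch below turns a dict into the bytes that the data parameter expects. The field names are made up for the example.

```python
import urllib.parse

# Hypothetical form fields, used only for illustration.
form = {'username': 'alice', 'remember': 'yes'}

query = urllib.parse.urlencode(form)   # str: 'username=alice&remember=yes'
body = query.encode('utf8')            # bytes, the type the data parameter requires

print(query)
print(body)
```

Passing the str directly as data raises a TypeError when the request is sent, which is why the .encode('utf8') step is needed.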

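Steps:

After importing the module, call urllib.request.urlopen('…') to open and fetch a page
The call returns a file-like object that supports:
- read(), readline(), readlines(), fileno(), close(), etc.: the usual file-object operations
- info(): returns an http.client.HTTPMessage object holding the headers sent back by the remote server
- getcode(): returns the HTTP status code
- geturl(): returns the URL actually retrieved (which may differ from the requested one after a redirect)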

import urllib.request		# import the module
res = urllib.request.urlopen('http://www.baidu.com')
data = res.read()		# read the response body (bytes)
code = res.getcode()		# 200
url = res.geturl()		# http://www.baidu.com

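Imitating a browser request with Headers

Requests made the previous way are easily identified as coming from a crawler, so when scraping a page protected by anti-crawler measures, you can set some Headers to make the request look like it came from a browser
Method: set the Headers information (User-Agent)
Finding your User-Agent: open any web page -> open the developer tools -> Network tab -> click any link on the page -> click any request -> Headers tab -> find User-Agent

My User-Agent: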

Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER

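Steps:

Set the URL to fetch
Call urllib.request.Request to create a request object
- Parameter 1: url
- Further parameters:
  - data: the data to send; defaults to None (no data)
  - headers: the request headers; defaults to no extra headers; format: a dict
Open the request object with urlopen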

import urllib.request

url = 'http://www.baidu.com'
header = {
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER'
}
req = urllib.request.Request(url, headers=header)		# create a Request object
res = urllib.request.urlopen(req)		# returns the response for the fetched page
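
Before sending a request it can help to check which headers the Request object actually holds. The sketch below does that without touching the network; the Referer value is a hypothetical extra header added only for illustration.

```python
import urllib.request

url = 'http://www.baidu.com'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
req.add_header('Referer', 'http://www.baidu.com/')   # hypothetical extra header

# urllib stores header names via str.capitalize(), so 'User-Agent' becomes 'User-agent'.
ua = req.get_header('User-agent')
missing = req.get_header('User-Agent')   # lookup is by the stored key, so this is None
all_headers = dict(req.header_items())

print(ua, missing, all_headers)
```

The capitalization only affects lookups on the Request object; HTTP header names are case-insensitive on the wire, so the server still sees a normal User-Agent header.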

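Using a proxy server

Reason: crawling pages on one site from the same IP for a long time can get that IP blocked by the site's server
Solution: use a proxy server (the site then sees the proxy server's IP address instead of our real one)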

import urllib.request

def use_proxy(proxy_address, url):
	# Set the proxy server's address
	proxy = urllib.request.ProxyHandler({'http': proxy_address})
	opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
	urllib.request.install_opener(opener)		# install the opener globally
	
	data = urllib.request.urlopen(url)
	
	# Alternative: use the opener without installing it globally
	#proxy = urllib.request.ProxyHandler({'http': proxy_address})
	#opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
	#data = opener.open(url)
	return data

proxy_address = '61.163.39.70:9999'
data = use_proxy(proxy_address, 'http://www.baidu.com')
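
Free proxies often stop responding, and a ProxyHandler that maps only 'http' leaves https URLs going out directly. A possible variant that covers both schemes, avoids the global install, and handles failures is sketched below. The helper names are hypothetical, and it assumes the same proxy accepts https traffic.

```python
import urllib.error
import urllib.request


def make_proxy_opener(proxy_address):
    """Build an opener that routes http and https requests through proxy_address."""
    handler = urllib.request.ProxyHandler({
        'http': proxy_address,
        'https': proxy_address,   # assumption: the same proxy also accepts https traffic
    })
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_address, timeout=10):
    """Return the response body as bytes, or None if the proxy or site cannot be reached."""
    opener = make_proxy_opener(proxy_address)
    try:
        with opener.open(url, timeout=timeout) as res:
            return res.read()
    except (urllib.error.URLError, OSError) as err:
        # URLError covers refused connections and HTTP errors; OSError covers socket timeouts.
        print('request through proxy failed:', err)
        return None
```

Usage would be along the lines of fetch_via_proxy('http://www.baidu.com', '61.163.39.70:9999'). Because the opener is not installed globally, other urlopen calls in the program keep their default behaviour.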

Using cookies

Reason: the page depends on login information

import urllib.request
import urllib.parse
import http.cookiejar

url = 'http://xxxxxxxxx.com'
data = {
	'username': '123456',
	'password': '123456'
}
postdata = urllib.parse.urlencode(data).encode('utf8')
header = {'User-Agent': 'xxxxxxxxxx'}

req = urllib.request.Request(url, postdata, headers=header)
cookie = http.cookiejar.CookieJar()		# create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)		# create the cookie handler
opener = urllib.request.build_opener(handler)		# build the opener object
res = opener.open(req)
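
To see what the cookie handler actually does, the sketch below runs a tiny local server that sets a cookie on login and echoes back whatever Cookie header it receives. The server, its paths, and the cookie value are all invented for this demonstration; only the opener setup mirrors the code above.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical login server used only for this demonstration."""

    def _reply(self, body, extra_headers=()):
        payload = body.encode('utf8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(payload)))
        for name, value in extra_headers:
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(payload)

    def do_POST(self):
        # Consume the form body, then hand out a session cookie.
        length = int(self.headers.get('Content-Length', 0))
        self.rfile.read(length)
        self._reply('logged in', [('Set-Cookie', 'session=abc123; Path=/')])

    def do_GET(self):
        # Echo back whatever Cookie header the client sent.
        self._reply(self.headers.get('Cookie', ''))

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), DemoHandler)   # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),            # ignore any system proxy for the loopback demo
    urllib.request.HTTPCookieProcessor(jar),    # store and resend cookies automatically
)

postdata = urllib.parse.urlencode({'username': '123456', 'password': '123456'}).encode('utf8')
with opener.open(base + '/login', postdata, timeout=5) as res:
    login_body = res.read().decode('utf8')

cookie_names = [c.name for c in jar]            # cookies captured from the login response

with opener.open(base + '/profile', timeout=5) as res:
    echoed = res.read().decode('utf8')          # the Cookie header the server received

server.shutdown()
server.server_close()
print(login_body, cookie_names, echoed)
```

The second request carries the session cookie only because it goes through the same opener; a plain urllib.request.urlopen call would not send it.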

GET request example

A GET request passes its information in the URL
Structure: url?key1=value1&key2=value2…

import urllib.parse
import urllib.request

# http://www.xxx.com?key1=value1&key2=value2
url = 'http://www.xxx.com?'
data = {
	'key1': 'value1',
	'key2': 'value2'
}
params = urllib.parse.urlencode(data)		# key1=value1&key2=value2, already percent-encoded
header = {'User-Agent': 'xxxxxx'}
req = urllib.request.Request(url+params, headers=header)
res = urllib.request.urlopen(req)
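
The value of urlencode shows up once the parameters contain spaces or reserved characters. The sketch below builds such a URL and decodes it again without sending anything; build_get_url and the /search path are hypothetical names chosen for the example.

```python
import urllib.parse
import urllib.request


def build_get_url(base, params):
    """Append params to base as a percent-encoded query string."""
    separator = '&' if '?' in base else '?'
    return base + separator + urllib.parse.urlencode(params)


# Hypothetical endpoint and parameters, used only for illustration.
full_url = build_get_url('http://www.xxx.com/search', {'q': 'urllib & requests', 'page': 1})
print(full_url)   # http://www.xxx.com/search?q=urllib+%26+requests&page=1

# parse_qs reverses the encoding, which is a quick way to check the result.
query = urllib.parse.urlsplit(full_url).query
decoded = urllib.parse.parse_qs(query)
print(decoded)    # {'q': ['urllib & requests'], 'page': ['1']}

req = urllib.request.Request(full_url)
print(req.get_method())   # GET, because no data was supplied
```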

POST request example

A POST request carries its data in the request body, typically as form fields

import urllib.request
import urllib.parse

url = 'http://www.xxx.com?'
header = {'User-Agent': 'xxxxxx'}
data = {
	'name': '123456',
	'password': '123456'
}
postdata = urllib.parse.urlencode(data).encode('utf8')
req = urllib.request.Request(url, postdata, headers=header)		# pass the data together with the headers
res = urllib.request.urlopen(req)
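
Whether urllib sends a GET or a POST depends only on whether data is supplied. The sketch below checks this on two Request objects without sending them; the /login path is a hypothetical endpoint.

```python
import urllib.parse
import urllib.request

url = 'http://www.xxx.com/login'   # hypothetical endpoint
header = {'User-Agent': 'xxxxxx'}
postdata = urllib.parse.urlencode({'name': '123456', 'password': '123456'}).encode('utf8')

get_req = urllib.request.Request(url, headers=header)
post_req = urllib.request.Request(url, data=postdata, headers=header)

print(get_req.get_method())    # GET
print(post_req.get_method())   # POST
print(post_req.data)           # b'name=123456&password=123456'

# The form fields can be recovered from the body to double-check the encoding.
fields = urllib.parse.parse_qs(post_req.data.decode('utf8'))
print(fields)                  # {'name': ['123456'], 'password': ['123456']}
```

If no Content-Type header is set, urllib adds application/x-www-form-urlencoded when the request is sent, which matches the body produced by urlencode.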