urllib, urllib3, and the general crawler development workflow


xiaogeldx


Article link: https://blog.csdn.net/xiaogeldx/article/details/86106132

urllib

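urllib is the Python standard-library package for handling network requests. It contains four modules:
- urllib.request: the request module, used to send network requests
- urllib.parse: the parsing module, used to parse URLs
- urllib.error: the exception module, which holds the exceptions raised by urllib.request
- urllib.robotparser: used to parse robots.txt files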

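The urllib.request module

The request module is mainly responsible for building and sending network requests, with support for adding headers, proxies, and so on.
It can be used to simulate the way a browser issues requests.
What it is used for:
1. Sending network requests
2. Adding headers
3. Handling cookies
4. Using proxies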

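The urlopen method

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

urlopen is a simple way to send a network request. It accepts a URL string (or a Request object), sends a request to that URL, and returns the result. For HTTP and HTTPS URLs the result is an http.client.HTTPResponse object, which is a file-like object.
from urllib import request, parse  # urllib is a package
# Send a GET request
response = request.urlopen(url="http://httpbin.org/get")  # httpbin.org is a test endpoint
print(response.getcode())  # status code
print(response.info())  # response headers
print(response.read())  # read the whole body, returned as bytes
print(response.readline())  # read one line (empty here, because read() already consumed the body)
print(response.readlines())  # read all remaining lines (also empty here)
urlopen sends a GET request by default. When the data argument is passed, it sends a POST request instead; data must be bytes, a file-like object, or an iterable of bytes.
# Send a POST request
response2 = request.urlopen(
    url="http://httpbin.org/post",
    data=b"username=xiaoge&password=123456"
)
A timeout can also be set; if the request takes longer than that, an exception is raised.
If timeout is not given, the global default setting is used. timeout only applies to HTTP, HTTPS, and FTP connections.
It is measured in seconds; for example, timeout=0.1 sets a limit of 0.1 seconds.
response = request.urlopen(url="http://httpbin.org/get", timeout=0.1)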
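The rule that data must be bytes, and that its presence switches the method to POST, can be checked without touching the network. The sketch below only builds a Request and inspects it; the helper name build_post_request is an illustrative assumption, not part of urllib.

```python
from urllib import parse, request


def build_post_request(url, form):
    """Return a Request whose body is the URL-encoded form, as bytes."""
    body = parse.urlencode(form).encode("utf-8")  # str -> bytes
    return request.Request(url, data=body)


req = build_post_request(
    "http://httpbin.org/post",
    {"username": "xiaoge", "password": "123456"},
)
print(req.get_method())  # POST, because data is not None
print(req.data)          # the encoded form body as bytes

# Without data, the same class reports GET.
print(request.Request("http://httpbin.org/get").get_method())
```

Passing req to request.urlopen() would then send it; nothing is sent by the code above.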

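The urlretrieve method

Download an HTML page to a local file

from urllib import request
request.urlretrieve('http://www.baidu.com', 'baidu.html')  # first argument: the URL; second argument: the destination path and file name (a relative path is resolved against the current working directory)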


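The Request object

urlopen can send the most basic requests, but its few simple parameters are not enough to build a complete request. The more powerful Request object lets you build a fuller one.

headers = {
    'User-Agent': 'your User-Agent string here'
}
req = request.Request('http://www.baidu.com', headers=headers)  # pass headers as a keyword argument, not positionally, because the second positional parameter is data
response = request.urlopen(req)
print(response.read())
#b'<!DOCTYPE html>\n<!--STATUS OK--....\n}\n</script>\n\n\n\n</body>\n</html>\n\n\r\n\n\n\r\n'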


Adding request headers

- Requests sent through urllib carry a default header such as "User-Agent": "Python-urllib/3.6" (the number follows the Python version), which identifies urllib as the sender. For sites that check the User-Agent, we therefore need custom headers to disguise the request.

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
print(headers)
url = 'http://baidu.com'
req = request.Request(url, headers=headers)
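Both halves of that claim can be inspected locally: the default User-Agent lives on the opener, and a header set on a Request takes its place for that request. This is a small sketch that sends nothing; the shortened browser_ua value is a placeholder, not a recommended string.

```python
from urllib import request

# A fresh opener carries the library's default User-Agent.
default_opener = request.build_opener()
print(default_opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.x')]

# Headers set on a Request are stored with the name capitalized,
# so "User-Agent" is looked up as "User-agent".
browser_ua = "Mozilla/5.0 (X11; Linux x86_64)"  # shortened placeholder value
req = request.Request("http://baidu.com", headers={"User-Agent": browser_ua})
print(req.has_header("User-agent"))
print(req.get_header("User-agent"))
```

When the opener sends a request, it adds its own default headers only if the Request does not already have a header of that name, which is why the custom value wins.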

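Handling cookies

What is a cookie
- HTTP itself is stateless, so the server cannot tell who a user is. A cookie is a small piece of text (in key-value form). When a client sends a request and the server needs to remember that user's state, it issues a cookie to the browser in the response. The browser stores the cookie, and on later requests to the same site it submits the cookie along with the URL. The server checks the cookie to recognize the user's state.
When a user first visits and logs in to a site, setting and sending the cookie goes through these 4 steps:
1. The client sends a request to the server
2. The server sends an HTTP response to the client that contains a Set-Cookie header
3. The client saves the cookie; its later requests to the server include a Cookie header
4. The server returns the response data
Cookie format:
Set-Cookie: NAME=VALUE; Expires=DATE; Max-Age=SECONDS; Path=PATH; Domain=DOMAIN_NAME; Secure
By default a cookie applies only to the host that set it, not to its subdomains; to cover subdomains as well, set Domain.

Author: mcrwayfun
Link: https://www.jianshu.com/p/6fc9cea6daa2
Source: Jianshu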
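The Set-Cookie format described above can be explored offline with the standard library's http.cookies module. The header value below is made up for illustration (sessionid, example.com, and the attribute values are all assumptions), and the "Set-Cookie:" prefix itself is not part of what gets parsed.

```python
from http.cookies import SimpleCookie

# Hypothetical header value, in the NAME=VALUE; attribute; ... shape shown above.
raw = "sessionid=abc123; Max-Age=3600; Path=/; Domain=example.com; Secure"

parsed = SimpleCookie()
parsed.load(raw)

morsel = parsed["sessionid"]
print(morsel.value)       # the VALUE part
print(morsel["path"])     # Path attribute
print(morsel["domain"])   # Domain attribute
print(morsel["max-age"])  # lifetime in seconds, kept as a string
print(morsel["secure"])   # flag attributes are stored as True
```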
Cookie handling matters a great deal when developing crawlers. urllib handles cookies as in the following example:

from urllib import request
from http import cookiejar
# Create a CookieJar object
cookie = cookiejar.CookieJar()
# Create a cookie processor (a handler)
cookies = request.HTTPCookieProcessor(cookie)
# Build an opener with the handler as its argument
opener = request.build_opener(cookies)
# Use this opener to send the request
res = opener.open('http://www.baidu.com')
print(cookies.cookiejar)

Saving cookies locally

from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')  # the file name is given here, so cookiejar.save() does not need it again; supplying it in one of the two places is enough
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}
req = request.Request('http://httpbin.org/cookies/set?handsome=xiaoge',headers=headers)
response = opener.open(req)
# response = opener.open('http://httpbin.org/cookies/set?handsome=xiaoge')
cookiejar.save(ignore_discard=True)  # ignore_discard=True also saves cookies that are marked to be discarded (session cookies with no expiry)


Loading cookies that were saved locally

from urllib import request
from http.cookiejar import MozillaCookieJar

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}

cookiejar = MozillaCookieJar('cookie.txt')
cookiejar.load(ignore_discard=True)  # same meaning as in save(): cookies marked to be discarded are loaded too (already-expired cookies would need ignore_expires=True)
handle = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handle)
req = request.Request('http://httpbin.org/cookies/set?name=xiaoge',headers=headers)
opener.open(req)
for cookie in cookiejar:
    print(cookie)
#<Cookie handsome=xiaoge for httpbin.org/>
#<Cookie name=xiaoge for httpbin.org/>

Of course, you can also just load the local cookies on their own

from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')
cookiejar.load(ignore_discard=True)
for cookie in cookiejar:
    print(cookie)
#<Cookie handsome=xiaoge for httpbin.org/>
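To see what ignore_discard actually changes, the sketch below saves and reloads a single session cookie with and without the flag, entirely offline. The cookie is built by hand with http.cookiejar.Cookie rather than received from a server, and the helper names are illustrative assumptions.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_session_cookie(name, value, domain):
    """Build a cookie with no expiry, i.e. one marked to be discarded."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def count_after_round_trip(ignore_discard):
    """Save one session cookie, load the file back, return how many survive."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cookie.txt")
        jar = MozillaCookieJar(path)
        jar.set_cookie(make_session_cookie("handsome", "xiaoge", "httpbin.org"))
        jar.save(ignore_discard=ignore_discard)

        loaded = MozillaCookieJar(path)
        loaded.load(ignore_discard=ignore_discard)
        return len(loaded)


print(count_after_round_trip(ignore_discard=False))  # session cookie is skipped
print(count_after_round_trip(ignore_discard=True))   # session cookie is kept
```

Cookies set by httpbin.org's /cookies/set endpoint carry no expiry, which is why the examples above need ignore_discard=True on both save() and load().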

Setting a proxy

Crawlers often get their IP banned while running, and an IP proxy is the usual way to deal with that. urllib proxies are set up as follows:
- Build a handler with ProxyHandler({'scheme': 'xxx.xxx.xxx.xxx:xx'})
- Build an opener with build_opener(handler)
- Get the response with opener.open(url)

from urllib import request
url = 'http://httpbin.org/ip'
handle = request.ProxyHandler({'http':'47.104.172.108:8118'})
opener = request.build_opener(handle)
req = request.Request(url)
response = opener.open(req)

[Figure: source code of request.urlopen()]

As the source code of request.urlopen() shows, urlopen() also builds handlers first and then an opener; it simply uses the local IP rather than a proxy.
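The three steps above can be wrapped in a small helper and checked without sending anything. In this sketch build_proxy_opener is a hypothetical name, and 203.0.113.10 is a documentation-only address rather than a working proxy.

```python
from urllib import request


def build_proxy_opener(proxy_address):
    """Return an opener that sends http:// requests through proxy_address."""
    handler = request.ProxyHandler({"http": proxy_address})
    return request.build_opener(handler)


# 203.0.113.10 is reserved for documentation; replace it with a real proxy.
opener = build_proxy_opener("203.0.113.10:8080")

# The opener holds exactly one ProxyHandler: ours replaces the default one.
proxy_handlers = [h for h in opener.handlers if isinstance(h, request.ProxyHandler)]
print(len(proxy_handlers))
print(proxy_handlers[0].proxies)

# Optional: make plain request.urlopen() use this opener from now on.
# request.install_opener(opener)
```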

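The Response object

- After sending a request, the classes and functions in urllib return a response object (http.client.HTTPResponse for HTTP URLs). It holds the data that came back, plus some attributes and methods for working with the result
1. read() returns the response body; it can only be used once
print(response.read())
2. readline() reads one line
while True:
    data = response.readline()
    if not data:  # an empty bytes object means the body is exhausted
        break
    print(data)
3. info() returns the response headers
print(response.info())
4. geturl() returns the URL that was fetched
print(response.geturl())
5. getcode() returns the status code
print(response.getcode())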
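The "read once" behaviour is easy to demonstrate without a server, because urlopen() also accepts data: URLs. This is only a sketch of the file-like interface: for a data: URL the returned object is not an HTTPResponse and has no HTTP status code, so getcode() is left out.

```python
from urllib import request

# "%0A" is a URL-encoded newline, so the body is two lines of text.
response = request.urlopen("data:text/plain;charset=utf-8,line1%0Aline2%0A")

first_line = response.readline()  # consumes only the first line
remainder = response.read()       # consumes everything that is left
second_read = response.read()     # nothing left, so this is empty

print(first_line)
print(remainder)
print(second_read)
print(response.geturl())
```

If the body is needed more than once, store the result of the first read() in a variable and reuse that.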

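The urllib.parse module

The parse module is a utility module that provides functions for processing and parsing URLs

parse.quote()

A URL may contain only ASCII characters. In practice, the parameters passed through the URL of a GET request often contain special characters such as Chinese characters, so they have to be URL-encoded.
For example, in http://baike.baidu.com/item/URL编码/3703727?fr=aladdin the Chinese part has to be URL-encoded.
from urllib import parse
url = 'http://httpbin.org/get?aaa={}'
safe_url = url.format(parse.quote('小哥'))
print(safe_url)  #http://httpbin.org/get?aaa=%E5%B0%8F%E5%93%A5
# parse.unquote() decodes it back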
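A short round trip makes the encoding rules concrete, including the safe parameter and the query-string variant quote_plus(). The sample strings are arbitrary.

```python
from urllib import parse

encoded = parse.quote("小哥")
print(encoded)                 # %E5%B0%8F%E5%93%A5
print(parse.unquote(encoded))  # 小哥

# "/" is left alone by default because safe="/"; pass safe="" to encode it too.
print(parse.quote("a b/c"))           # a%20b/c
print(parse.quote("a b/c", safe=""))  # a%20b%2Fc

# quote_plus() targets query strings: spaces become "+", and "/" is encoded.
print(parse.quote_plus("a b/c"))      # a+b%2Fc
```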

parse.urlencode() & parse.parse_qs()

Requests often need to carry many parameters, and joining them with string operations is tedious. parse.urlencode() exists to assemble URL parameters.
parse.urlencode() converts a dict into URL-encoded data, and parse.parse_qs() converts it back into a dict

from urllib import request, parse
params = {'name' : '小哥', 'age' : 16,'hi' : 'hello world'}  # avoid naming a variable dict, which shadows the built-in
res = parse.urlencode(params)
print(res)
ps = parse.parse_qs(res)
print(ps)
#name=%E5%B0%8F%E5%93%A5&age=16&hi=hello+world	(spaces are encoded as '+')
#{'name': ['小哥'], 'age': ['16'], 'hi': ['hello world']}
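parse_qs() returns lists because a key may repeat in a query string. The sketch below shows the matching option on the encoding side, doseq=True, together with parse_qsl(), which keeps the pairs in order. The helper name build_url and the tag parameter are illustrative assumptions.

```python
from urllib import parse


def build_url(base, params):
    """Append params to base as a query string; list values repeat the key."""
    return base + "?" + parse.urlencode(params, doseq=True)


url = build_url("http://httpbin.org/get", {"name": "小哥", "tag": ["a", "b"]})
print(url)

query = parse.urlsplit(url).query
print(parse.parse_qsl(query))  # list of (key, value) pairs, order preserved
print(parse.parse_qs(query))   # dict mapping each key to a list of values
```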

urlparse() & urlsplit()

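Parse a URL and split it into its components
The difference between the two is that urlparse() has a separate params component, while urlsplit() leaves params inside path; otherwise they are nearly interchangeable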

from urllib import request, parse
url = 'http://www.baidu.com/p;hello?wd=python&username=xiaoge#1'
res1 = parse.urlparse(url)
res2 = parse.urlsplit(url)
print(res1)
print(res2)
dict1 = {
    'scheme' : res1.scheme,
    'netloc' : res1.netloc,
    'path' : res1.path,
    'params' : res1.params,
    'query' : res1.query,
    'fragment' : res1.fragment,
}
dict2 = {
    'scheme' : res2.scheme,
    'netloc' : res2.netloc,
    'path' : res2.path,
    'query' : res2.query,
    'fragment' : res2.fragment,
}
print(dict1)
print(dict2)
#ParseResult(scheme='http', netloc='www.baidu.com', path='/p', params='hello', query='wd=python&username=xiaoge', fragment='1')
#SplitResult(scheme='http', netloc='www.baidu.com', path='/p;hello', query='wd=python&username=xiaoge', fragment='1')
#{'scheme': 'http', 'netloc': 'www.baidu.com', 'path': '/p', 'params': 'hello', 'query': 'wd=python&username=xiaoge', 'fragment': '1'}
#{'scheme': 'http', 'netloc': 'www.baidu.com', 'path': '/p;hello', 'query': 'wd=python&username=xiaoge', 'fragment': '1'}
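Splitting is usually a means to an end: change one component, then put the URL back together. A minimal sketch of that, using urlunsplit() and urljoin(); the helper name with_query and the sample values are assumptions made for illustration.

```python
from urllib import parse


def with_query(url, **params):
    """Return url with its query string replaced by params."""
    parts = parse.urlsplit(url)
    new_parts = parts._replace(query=parse.urlencode(params))
    return parse.urlunsplit(new_parts)


original = "http://www.baidu.com/p;hello?wd=python&username=xiaoge#1"
rebuilt = with_query(original, wd="urllib", page=2)
print(rebuilt)

# urljoin() resolves a relative link against the page it was found on,
# which is handy when a crawler follows href or src attributes.
absolute = parse.urljoin("http://www.baidu.com/a/b.html", "../c.png")
print(absolute)
```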

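The urllib.error module

The error module defines the exceptions raised by urllib.request. When a request fails, they can be caught and handled; the main ones are URLError and HTTPError
- URLError: the base class of the error module's exceptions; any exception raised by the request module can be handled with it
- HTTPError: a subclass of URLError with three main attributes
  - code: the HTTP status code
  - reason: the cause of the error
  - headers: the response headers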
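Because HTTPError is a subclass of URLError, the more specific class has to be checked first. The sketch below shows one way to arrange that; safe_open and describe are hypothetical helper names, the foo:// URL is chosen because an unsupported scheme fails before any network access, and the HTTPError is constructed by hand purely to show its attributes.

```python
from email.message import Message
from urllib import error, request


def describe(exc):
    """Summarise a urllib exception; HTTPError is checked first (subclass)."""
    if isinstance(exc, error.HTTPError):
        return "HTTPError %s: %s" % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return "URLError: %s" % exc.reason
    raise exc


def safe_open(url, timeout=5):
    """Return the response, or a short description of what went wrong."""
    try:
        return request.urlopen(url, timeout=timeout)
    except error.URLError as exc:  # also catches HTTPError
        return describe(exc)


# An unsupported scheme fails locally, without any network access.
unknown_scheme = safe_open("foo://bar")
print(unknown_scheme)

# A hand-built HTTPError, only to show the attributes without a server.
not_found = error.HTTPError(
    "http://httpbin.org/status/404", 404, "Not Found", Message(), None
)
print(describe(not_found))
```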

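The urllib.robotparser module

The robotparser module parses robots.txt, the crawler-protocol file
https://www.taobao.com/robots.txt
The Robots protocol (also called the crawler protocol or robot protocol), formally the Robots Exclusion Protocol, lets a site tell search engines which pages may be crawled and which may not
robots.txt is just a text file; any common text editor, such as Notepad on Windows, can create and edit it
robots.txt is a convention, not a command. It is the first file a search engine looks at when it visits a site, and it tells the spider which files on the server may be viewed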
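A crawler can apply these rules before fetching a page. The sketch below feeds RobotFileParser two rule lines directly through parse(), so it runs offline; they are the same two rules that httpbin.org's robots.txt is shown returning in the urllib3 section of this article. Against a live site you would normally call set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Rules in the shape of a robots.txt file, one line per list item.
rules = [
    "User-agent: *",
    "Disallow: /deny",
]

parser = RobotFileParser()
parser.parse(rules)  # parse lines directly instead of set_url() + read()

denied = parser.can_fetch("*", "http://httpbin.org/deny")  # path is disallowed
allowed = parser.can_fetch("*", "http://httpbin.org/get")  # no rule covers it
print(denied, allowed)
```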

urllib3

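urllib3 is a powerful, user-friendly HTTP client for Python. More and more Python applications are adopting it, and it provides many important features that the standard library lacks
Install urllib3 with pip: pip install urllib3
urllib3 is powerful and simple to use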

Making requests

# Import the urllib3 library
import urllib3
# Instantiate a PoolManager to make requests; it takes care of all the details of connection pooling and thread safety, so we don't have to
http = urllib3.PoolManager()
# Send a request with the request() method
r = http.request('GET','http://httpbin.org/robots.txt')
print(r.data)
#b'User-agent: *\nDisallow: /deny\n'
# request() can send any HTTP method; here is a POST request
r = http.request(
    'POST','http://httpbin.org/post',
    fields={'hello':'world'}
)

Response content

The HTTP response object provides attributes such as status, data, and headers

  import urllib3
  http = urllib3.PoolManager()
  r = http.request('GET','http://httpbin.org/ip')
  print(r.status)
  print(r.data)
  print(r.headers)
  # Output:
  # 200
  # b'{\n  "origin": "117.179.251.236"\n}\n'
  # HTTPHeaderDict({'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Wed, 09 Jan 2019 14:34:28 GMT', 'Content-Type': 'application/json', 'Content-Length': '34', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'})

JSON content

JSON data in the response can be converted to a dict with json.loads() from the json module

  import urllib3
  import json
  http = urllib3.PoolManager()
  r = http.request('GET','http://httpbin.org/ip')
  print(json.loads(r.data.decode('utf-8')))
  #{'origin': '117.179.251.236'}
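Hard-coding 'utf-8' works for httpbin.org, but a response may declare a different charset in its Content-Type header. A minimal stdlib-only sketch of honouring that header follows; decode_json_body is a hypothetical helper, and the gbk example is made up to show the non-default path.

```python
import json
from email.message import Message


def decode_json_body(body, content_type="application/json"):
    """Decode a bytes body to Python data, honouring any charset parameter."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset("utf-8")  # fall back to UTF-8
    return json.loads(body.decode(charset))


# Sample body in the same shape as the httpbin.org/ip output shown earlier.
sample = b'{\n  "origin": "117.179.251.236"\n}\n'
print(decode_json_body(sample))

# A made-up body in another encoding, with the charset declared in the header.
gbk_body = '{"name": "小哥"}'.encode("gbk")
print(decode_json_body(gbk_body, "application/json; charset=gbk"))
```

With a urllib3 response r, the call would look like decode_json_body(r.data, r.headers.get('Content-Type', 'application/json')).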

Binary content

Response data is always bytes; for large amounts of data it is better to process it as a stream:

  import urllib3
  http = urllib3.PoolManager()
  r = http.request('GET','http://httpbin.org/bytes/1024',preload_content=False)
  for chunk in r.stream(32):
      print(chunk)
  r.release_conn()  # return the connection to the pool

It can also be treated as a file object

  import urllib3
  http = urllib3.PoolManager()
  r = http.request('GET','http://httpbin.org/bytes/1024',preload_content=False)
  for line in r:
      print(line)
  r.release_conn()  # return the connection to the pool
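In a crawler the chunks are usually written to disk rather than printed. The sketch below shows that step in isolation: save_chunks is a hypothetical helper that accepts any iterable of bytes, and a plain list stands in for r.stream(32) so the example runs without a network connection.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
    return written


# Stand-in for r.stream(32): any iterable of bytes objects works.
fake_stream = [b"a" * 32, b"b" * 32, b"c" * 8]

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "download.bin")
    total = save_chunks(fake_stream, target)
    size_on_disk = os.path.getsize(target)

print(total, size_on_disk)
```

Only one chunk is held in memory at a time, which is the point of passing preload_content=False.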

Proxies

ProxyManager can be used to send requests through an HTTP proxy

  import urllib3
  proxy = urllib3.ProxyManager('http://180.76.111.69:3128')
  res = proxy.request('GET','http://httpbin.org/ip')
  print(res.data)

Request data

Headers

Pass a dict as the headers argument of request() to specify request headers

  import json
  import urllib3
  http = urllib3.PoolManager()
  r = http.request('GET','http://httpbin.org/headers',headers={'key':'value'})
  print(json.loads(r.data.decode('utf-8')))

Query parameters

For GET, HEAD, and DELETE requests, query parameters can be added by passing a dict as the fields argument

  http = urllib3.PoolManager()
  r = http.request('GET','http://httpbin.org/get',fields={'arg':'value'})
  print(json.loads(r.data.decode('utf-8'))['args'])

For POST and PUT requests, query parameters have to be URL-encoded into the correct format and then appended to the URL

  import urllib3
  import json
  from urllib.parse import urlencode
  http = urllib3.PoolManager()
  encoded_args = urlencode({'arg':'value'})
  url = 'http://httpbin.org/post?' + encoded_args
  r = http.request('POST',url)
  print(json.loads(r.data.decode('utf-8'))['args'])

Form data

For PUT and POST requests, form data is passed as a dict in the fields argument
r = http.request('POST','http://httpbin.org/post',fields={'field':'value'})
print(json.loads(r.data.decode('utf-8'))['form'])

JSON

To send JSON data, pass the encoded bytes as the body argument of request() and set the Content-Type request header

  http = urllib3.PoolManager()
  data = {'attribute':'value'}
  encoded_data = json.dumps(data).encode('utf-8')
  r = http.request('POST','http://httpbin.org/post',body=encoded_data,headers={'Content-Type':'application/json'})
  print(json.loads(r.data.decode('utf-8'))['json'])

Files & binary data

For file uploads, we can imitate the way a browser form submits a file

  with open('example.txt') as fp:
      file_data = fp.read()
  r = http.request(
      'POST',
      'http://httpbin.org/post',
      fields={
          'filefield':("example.txt",file_data),
      }
  )
  print(json.loads(r.data.decode('utf-8'))['files'])

To upload binary data, pass it as the body and set the Content-Type request header

  http = urllib3.PoolManager()
  with open('example.jpg','rb') as fb:
      binary_data = fb.read()
  r = http.request(
      'post',
      'http://httpbin.org/post',
      body=binary_data,
      headers={'Content-Type':'image/jpeg'}
  )
  print(json.loads(r.data.decode('utf-8')))

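The general crawler development workflow

(Mainly about acquiring the data):

1. Find the target data
2. Analyze the request flow
3. Build the HTTP request
4. Extract and clean the data
5. Persist the data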

import urllib3
import re
# 1. Find the target data
page_url = 'https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E5%9B%BE%E7%89%87'
# The images are downloaded by the browser,
# and each image's URL arrives before the image itself
# 2. Analyze the request flow
# Download the HTML
http = urllib3.PoolManager()
res = http.request('GET',page_url)
html = res.data.decode('utf-8')		# check the page's charset to pick the encoding
# Extract and clean the data: the image URLs
img_urls = re.findall(r'"thumbURL":"(.*?)"',html)
# Build request headers so the download is not refused (hotlink protection checks the Referer)
headers = {
    'Referer': page_url
}
# Loop over the URLs and download each image
for index,img_url in enumerate(img_urls):
    img_res = http.request('GET',img_url,headers=headers)
    # Build the file name dynamically
    img_file_name = '%s.%s' % (index,img_url.split('.')[-1])
    with open(img_file_name,"wb") as f:
        f.write(img_res.data)
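The extraction and file-naming steps can be tested on their own, without hitting the site. The sketch below is one possible arrangement, not the author's method: the helper names and the sample HTML are made up, and the file extension is taken from the URL path only, so a query string or a URL without an extension does not end up in the file name.

```python
import posixpath
import re
from urllib.parse import urlsplit

THUMB_PATTERN = re.compile(r'"thumbURL":"(.*?)"')


def extract_image_urls(html):
    """Return every thumbURL value found in the page source."""
    return THUMB_PATTERN.findall(html)


def file_name_for(index, img_url, default_ext=".jpg"):
    """Name the file by index, taking the extension from the URL path only."""
    ext = posixpath.splitext(urlsplit(img_url).path)[1]
    return "%s%s" % (index, ext or default_ext)


# Made-up page fragment in the same shape the regex expects.
sample_html = (
    '{"thumbURL":"https://img.example.com/a/cat.png?size=small"},'
    '{"thumbURL":"https://img.example.com/b/dog"}'
)

found = extract_image_urls(sample_html)
names = [file_name_for(i, u) for i, u in enumerate(found)]
print(found)
print(names)
```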

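Example

On Renren, you have to be logged in to view another user's profile page; cookies solve this

1. The method that is not recommended

Put the cookie from a logged-in session into the request headers. This cookie is static, and after a while it expires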

from urllib import request

url = 'http://www.renren.com/880151247/profile'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Cookie': 'anonymid=k3jhsye6-ob9l77; depovince=TJ; _r01_=1; JSESSIONID=abcafhmFj52plc6Ik306w; ick_login=abbbf7d8-2840-4980-93a5-b5c3a3f53181; _de=A13F803502C0B12180550585A43513A1; ick=71ef5069-76ff-4391-af3b-1edeb7844dfa; __utma=151146938.1901049612.1574993220.1574993220.1574993220.1; __utmc=151146938; __utmz=151146938.1574993220.1.1.utmcsr=renren.com|utmccn=(referral)|utmcmd=referral|utmcct=/SysHome.do; __utmt=1; __utmb=151146938.4.10.1574993220; jebecookies=0a948e94-e2be-48b5-a37c-e9c6d91f707a|||||; p=8f0f0cb63a3345aef65f0a411fa514178; first_login_flag=1; ln_uact=15940631363; ln_hurl=http://hdn.xnimg.cn/photos/hdn521/20150203/2150/h_main_48s5_c0000003841d195a.jpg; t=f03294e6e7e1230e2a54b753fd4dc13f8; societyguester=f03294e6e7e1230e2a54b753fd4dc13f8; id=578994168; xnsid=4b1350b9; ver=7.0; loginfrom=null; jebe_key=262dcff4-cee3-4e2c-a69c-2b6779655a2f%7C569404b7d77d3ffe21120de4f64ed968%7C1574993445102%7C1%7C1574993445248; jebe_key=262dcff4-cee3-4e2c-a69c-2b6779655a2f%7C569404b7d77d3ffe21120de4f64ed968%7C1574993445102%7C1%7C1574993445254; wp_fold=0'
}
req = request.Request(url,headers=headers)
response = request.urlopen(req)
with open('dapeng.html','w',encoding='utf8') as f:
    f.write(response.read().decode('utf8'))

2. The recommended method

from urllib import request, parse
from http import cookiejar

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}

def get_opener():
    # Log in
    # Create a CookieJar object
    cookie = cookiejar.CookieJar()
    # Use the CookieJar to create an HTTPCookieProcessor object
    handler = request.HTTPCookieProcessor(cookie)
    # Use the handler from the previous step to create an opener
    opener = request.build_opener(handler)
    return opener

def login_renren(opener):
    login_url = 'http://www.renren.com/Login.do'
    data = {
        'email': 'xxx',	# account
        'password': 'xxxx'	# password
    }
    req = request.Request(login_url,data=parse.urlencode(data).encode('utf8'),headers=headers)  # a dict has no encode(); urlencode() it to a str first, then encode that to bytes
    # Send the login request with the opener
    opener.open(req)    # before logging in or visiting with the opener, wrap the URL in request.Request so it carries the headers

def visit_renren(opener):
    # Visit the profile page
    dapeng_url = 'http://www.renren.com/880151247/profile'
    req = request.Request(dapeng_url,headers=headers)
    response = opener.open(req) # the opener carries the cookie
    with open('dapeng.html','w',encoding='utf8') as f:
        f.write(response.read().decode('utf8')) # write() expects a str here

if __name__ == '__main__':
    opener = get_opener()
    login_renren(opener)
    visit_renren(opener)

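Notes

Some sites have anti-crawling measures. For example, job listings may not appear in the page source at all because they are added through ajax (the original post illustrated this with a screenshot)
Paste the returned JSON into json.cn to see its structure clearly
url is the Request URL shown for that ajax request in the browser's developer tools
data is the Form Data shown for the same request
If you get the error TypeError: can't concat str to bytes, it is because strings in Python 3 are Unicode (str); convert them to bytes
The site here returns fake content to mislead crawlers. The headers need to include User-Agent, Referer, and Accept; if that is still not enough, use cookies

import requests  # third-party library: pip install requests

# url and headers are the Request URL and request headers described above
c = requests.Session()
c.get(url,headers=headers,timeout=3)
cookie = c.cookies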
