Python Web Scraping: Basic Usage of the urllib Library

rjbp40ht

Article link: https://blog.csdn.net/qiao39gs/article/details/86546228

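What is urllib

urllib is the most basic request library.

It is Python's built-in HTTP request library and consists of four modules:

Module	Description
urllib.request	Request module
urllib.error	Exception-handling module
urllib.parse	URL-parsing module
urllib.robotparser	robots.txt-parsing module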

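urlopen

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the address to request

data: the request body; when it is provided, the request is sent as a POST

timeout: the timeout, in seconds

Note: cafile, capath, and cadefault were removed in Python 3.13; use context instead.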

Example 1

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.read().decode('utf-8'))  # read() returns the body as bytes

This sends a GET request and prints the response body.

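Example 2

import urllib.parse
import urllib.request

# Encode the dict as form data; urlopen() expects the request body as bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Because data is supplied, the request is sent as a POST.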
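The effect of data on the request method can be checked without touching the network. The following is a minimal sketch that builds two Request objects for the same httpbin.org URL and asks each which HTTP method it would use; nothing is actually sent.

```python
import urllib.parse
import urllib.request

URL = 'http://httpbin.org/post'  # same endpoint as Example 2; never contacted here

# Form fields are URL-encoded first and then converted to bytes.
body = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')

get_request = urllib.request.Request(URL)              # no body supplied
post_request = urllib.request.Request(URL, data=body)  # body supplied

print(body)
print(get_request.get_method())
print(post_request.get_method())
```

The method switches from GET to POST purely because a body is present, which is the same rule urlopen applies when it receives data.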

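Example 3

import socket
import urllib.error
import urllib.request

try:
    # timeout: give up if no response arrives within 0.1 seconds
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:  # catch the exception
    if isinstance(e.reason, socket.timeout):  # was the error caused by a timeout?
        print('TIME OUT')

The request gives up after 0.1 seconds, and the except block checks whether the error was caused by a timeout.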
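The timeout check can be factored into a small helper. This is a sketch only: the name is_timeout is hypothetical, and it assumes that a timeout may surface either wrapped in a URLError or, for example while the body is being read, as a bare socket.timeout.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    if isinstance(exc, urllib.error.URLError):
        # urlopen() wraps connection-phase errors; the cause is in .reason
        return isinstance(exc.reason, socket.timeout)
    # A timeout may also be raised directly, without a URLError wrapper
    return isinstance(exc, socket.timeout)


# Exceptions built by hand for illustration; no request is made.
print(is_timeout(urllib.error.URLError(socket.timeout('timed out'))))
print(is_timeout(urllib.error.URLError('connection refused')))
print(is_timeout(socket.timeout('timed out')))
```

To use it with a real request, catch OSError (the common base class of URLError and socket.timeout) around urlopen and pass the exception to the helper.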

Response

Response type

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Output:

<class 'http.client.HTTPResponse'>

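The response is an http.client.HTTPResponse object.

An HTTPResponse carries the status code and the response headers.

Status code and response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)               # status code
print(response.getheaders())         # all response headers, as a list of (name, value) tuples
print(response.getheader('Server'))  # a single header, looked up by name

Output: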

200

[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48990'), ('Accept-Ranges', 'bytes'), ('Date', 'Fri, 18 Jan 2019 02:47:58 GMT'), ('Via', '1.1 varnish'), ('Age', '2096'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-lax8625-LAX'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '37, 179'), ('X-Timer', 'S1547779678.133551,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]

nginx
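The header list above contains Via twice, which is one reason getheaders() returns a list of tuples rather than a dict. The headers attribute of a response is an http.client.HTTPMessage. The sketch below builds one offline from a hand-written header block (the header text is illustrative, not a captured response) to show how repeated names behave.

```python
from email.parser import Parser
from http.client import HTTPMessage

# Hand-written header block for illustration only.
RAW_HEADERS = (
    "Server: nginx\n"
    "Via: 1.1 varnish\n"
    "Via: 1.1 varnish\n"
    "\n"
)

headers = Parser(_class=HTTPMessage).parsestr(RAW_HEADERS)

print(headers.items())         # every header, duplicates preserved
print(headers['server'])       # lookup by name is case-insensitive
print(headers.get_all('Via'))  # all values for a repeated name
```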

Request (passing headers)

Example 4

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

The result is the same as in Example 1.

A Request object makes it easier to specify the request method, add headers, and attach extra data.

Method 1

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilia/4.0(compatible;MSIE 5.5;Windows NT)',
    'Host': 'httpbin.org'
}
form = {  # form fields to send; avoids shadowing the built-in dict
    'name': 'Germey'
}
# Form data must be URL-encoded and converted to bytes
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilia/4.0(compatible;MSIE 5.5;Windows NT)"
  },
  "json": null,
  "origin": "183.200.46.48",
  "url": "http://httpbin.org/post"
}

Purpose: pass headers with the request

Advantage: the logic is clearly structured

The form data must be encoded before it is sent

Method 2

from urllib import request, parse

url = 'http://httpbin.org/post'
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
# Add the header after the Request object has been created
req.add_header('User-Agent', 'Mozilia/4.0(compatible;MSIE 5.5;Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

The output is the same as for Method 1.

Use case: when there are several header key-value pairs, add them in a for loop.
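The loop-based use case might look like the sketch below. The header names and values in extra_headers are made up for illustration, and the request is only constructed, not sent.

```python
from urllib import request

# Hypothetical headers, chosen only for illustration.
extra_headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; example-crawler)',
    'Accept-Language': 'en-US',
    'Referer': 'http://httpbin.org/',
}

req = request.Request('http://httpbin.org/post', data=b'name=Germey', method='POST')
for name, value in extra_headers.items():
    req.add_header(name, value)

# Request stores header names in capitalized form, e.g. 'User-agent'.
print(req.header_items())
```

Note that get_header expects the capitalized form of the name, so the User-Agent value is retrieved with req.get_header('User-agent').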

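Handler

Purpose: route requests through a proxy IP so that the crawler's own address is less likely to be blocked

Example

import urllib.request

# Placeholder proxy addresses; replace them with a proxy you can actually reach
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.2',
    'https': 'https://127.0.0.2'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://baidu.com')
print(response.read())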
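Switching between proxies could be sketched as follows. PROXY_POOL and build_proxy_opener are hypothetical names, and the addresses are placeholders rather than working proxies; the code only builds the opener and makes no request.

```python
import random
import urllib.request

# Hypothetical proxy pool; these addresses are placeholders, not working proxies.
PROXY_POOL = ['http://127.0.0.1:8080', 'http://127.0.0.1:8081']


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(random.choice(PROXY_POOL))
# opener.open(url) would now go through the chosen proxy.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

Building a fresh opener per proxy keeps each configuration isolated, at the cost of creating a new handler chain each time.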

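Cookie

A cookie is a text file saved on the client that records the user's identity.

In a crawler, cookies are a mechanism for maintaining login state.

Getting cookies

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# The jar now holds the cookies set by the response
for item in cookie:
    print(item.name + '=' + item.value)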

Output:

BAIDUID=79AFFAB6229EB9D1A62E3B973D8BB693:FG=1

BIDUPSID=79AFFAB6229EB9D1A62E3B973D8BB693

H_PS_PSSID=1444_21103_28328_28131_26350_28267_27244

PSTM=1547805103

delPer=0

BDSVRTM=0

BD_HOME=0

Saving cookies

Save the cookies to a file in the current directory.

Format 1

import http.cookiejar, urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)  # Mozilla (Firefox) cookies.txt format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Also save session cookies and cookies that have already expired
cookie.save(ignore_discard=True, ignore_expires=True)

Format 2

import http.cookiejar, urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.LWPCookieJar(filename)  # LWP (libwww-perl) format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Reading cookies

Load the cookies with the same format that was used to save them.

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
# The file must have been written by LWPCookieJar (Format 2 above)
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
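The save and load steps can be tried without a network connection. The sketch below is an offline round trip under stated assumptions: the cookie is built by hand with made-up values (name session, domain example.com) and stored in a temporary directory instead of the current one.

```python
import http.cookiejar
import os
import tempfile

# Hand-built cookie with made-up values, standing in for one set by a server.
demo_cookie = http.cookiejar.Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='example.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie.txt')

    saved = http.cookiejar.LWPCookieJar(path)
    saved.set_cookie(demo_cookie)
    # discard=True marks a session cookie, so ignore_discard is needed to save it
    saved.save(ignore_discard=True, ignore_expires=True)

    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)

restored = {c.name: c.value for c in loaded}
print(restored)
```

Without ignore_discard=True on both calls, a session cookie like this one would be skipped when saving or loading.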

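Exception handling

Handling exceptions keeps the program running when a request fails.

from urllib import request, error

try:
    response = request.urlopen('https://blog.csdn.net/qiao39gs/1234')  # a page that does not exist
except error.URLError as e:
    print(e.reason)

Output: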

Not Found
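The "Not Found" reason comes from an HTTPError, which is a subclass of URLError and additionally carries the status code. The following sketch shows one way to tell the two apart; the helper name describe is hypothetical, and the exceptions are constructed by hand so that no request is made.

```python
from email.message import Message
from urllib import error


def describe(exc):
    """Summarize a urllib exception, checking the most specific type first."""
    if isinstance(exc, error.HTTPError):
        # The server answered, but with an error status
        return 'HTTP %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        # No valid HTTP response (DNS failure, refused connection, ...)
        return 'URL error: %s' % exc.reason
    return 'other error: %r' % exc


# Exceptions constructed by hand for illustration; urlopen() raises the same types.
not_found = error.HTTPError('http://example.com/missing', 404, 'Not Found', Message(), None)
unreachable = error.URLError('connection refused')

print(describe(not_found))
print(describe(unreachable))
```

In a try block, the same ordering applies: put except error.HTTPError before except error.URLError, otherwise the more general clause catches both.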

URL parsing

urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Purpose: split a URL into its components

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

Output:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

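scheme: the protocol

netloc: the network location (domain name)

path: the path

params: the parameters of the last path segment (after ';')

query: the query string (after '?')

fragment: the fragment identifier (after '#')

When the URL contains no scheme, the scheme argument supplies a default; when the URL already includes a scheme, the argument is ignored.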
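A short sketch of the scheme argument in action, using the same baidu.com address as above. It also shows a detail that is easy to miss: without a leading //, the host name is parsed as part of the path rather than as the netloc.

```python
from urllib.parse import urlparse

# No scheme in the URL: the default passed via scheme= is used.
# The leading '//' is what marks www.baidu.com as the netloc.
no_scheme = urlparse('//www.baidu.com/index.html', scheme='https')

# The URL already has a scheme: the default is ignored.
has_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')

# Without '//' the host name is parsed as part of the path.
bare = urlparse('www.baidu.com/index.html', scheme='https')

print(no_scheme.scheme, no_scheme.netloc)
print(has_scheme.scheme)
print(repr(bare.netloc), bare.path)
```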

allow_fragments: when False, the fragment is not split off and stays attached to the preceding component

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)

Output:

ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

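urlunparse

Purpose: assemble a URL from its six components

from urllib.parse import urlunparse

# Order: scheme, netloc, path, params, query, fragment
data = ['http', 'ww.baidu.com', 'index.html', 'user', 'a=5', 'comment']
print(urlunparse(data))

Output: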

http://ww.baidu.com/index.html;user?a=5#comment
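Because urlunparse accepts any six-item sequence, it also accepts the ParseResult returned by urlparse. The sketch below round-trips the URL from the urlparse example and then rebuilds it with a different query; the new query value id=6 is made up for illustration.

```python
from urllib.parse import urlparse, urlunparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Parsing and reassembling gives back the original URL.
rebuilt = urlunparse(parts)

# ParseResult is a named tuple, so _replace returns a modified copy.
changed = urlunparse(parts._replace(query='id=6'))

print(rebuilt)
print(changed)
```

This is a convenient way to change one component of a URL without string slicing.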

urlencode

Purpose: convert a dict into GET request parameters (a query string)

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output: