Web Scraping Basics: Urllib
Preface
This post takes a quick look at the basic use of the Urllib library in web scraping.
1. What is Urllib?
Urllib is Python's built-in HTTP request library. When scraping, we mainly use the following modules:
Module | Purpose |
---|---|
urllib.request | sends requests |
urllib.error | handles exceptions |
urllib.parse | parses URLs |
urllib.robotparser | parses robots.txt |
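The last module, urllib.robotparser, is not covered again below, so here is a minimal sketch of what it does: it reads a site's robots.txt and answers whether a given user agent may fetch a given path (the Baidu URL is just an arbitrary example).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")  # point at the site's robots.txt
rp.read()                                      # download and parse it
# Ask whether a generic crawler ("*") may fetch this path
print(rp.can_fetch("*", "http://www.baidu.com/s?wd=python"))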
2. Urllib in Detail
2.1 urlopen
The full signature:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Basic usage:
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ #
import urllib.parse
import urllib.request
# POST the keyword to the test site http://httpbin.org/post
data = bytes(urllib.parse.urlencode({"word": "hello"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read())
In the returned JSON you can find "form": {"word": "hello"}, i.e. the keyword we just posted.
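To check this programmatically rather than by eyeballing the raw bytes, one rough sketch is to decode the JSON body that httpbin returns (assuming the test site is reachable):
import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({"word": "hello"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
payload = json.loads(response.read().decode("utf-8"))  # httpbin echoes the request back as JSON
print(payload["form"])  # {'word': 'hello'}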
# Unlike above, this sends a GET request and sets a timeout
response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
print(response.read())
Another variant, catching the timeout explicitly:
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT!")
2.2 Response
# Response type
response = urllib.request.urlopen("http://www.python.org")
print(type(response))
# Status code and response headers
response = urllib.request.urlopen("https://www.python.org")
print(response.status)
print(response.getheaders())
print(response.getheader("Server"))
print(response.read().decode("utf-8"))
With a Request object we can attach headers and a POST body to the request:
from urllib import request, parse
url = "http://httpbin.org/post"
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict_info = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict_info), encoding="utf-8")
req = request.Request(url=url, data=data, headers=headers, method="POST")
response = request.urlopen(req)
print(response.read().decode("utf-8"))
url = 'http://httpbin.org/post'
dict_info = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict_info), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
2.3 Handler
# Proxy: send requests through a proxy so pages are fetched from the proxy's IP
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743"
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://httpbin.org/get")
print(response.read())
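build_opener accepts any number of handlers, so a proxy can be combined with, say, cookie handling in a single opener. A minimal sketch (the proxy address is a placeholder and only works if you actually run a proxy there):
import http.cookiejar
import urllib.request

proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:9743"})
cookie_handler = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = urllib.request.build_opener(proxy_handler, cookie_handler)
# opener.open("http://httpbin.org/get") would now go through the proxy and keep cookies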
# Cookie
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie: print(item.name + "=" + item.value)
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
Compare these snippets carefully: they save and load cookies in different file formats (MozillaCookieJar vs. LWPCookieJar), but all follow the same CookieJar, handler, opener pattern.
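One more trick worth knowing: urllib.request.install_opener makes an opener the global default, so plain urlopen calls also carry the cookies. A sketch, assuming cookie.txt was produced by the LWPCookieJar example above:
import http.cookiejar
import urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load("cookie.txt", ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)   # from now on, urlopen() uses this opener
response = urllib.request.urlopen("http://www.baidu.com")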
2.4 Exception handling
from urllib import request, error

try:
    response = request.urlopen("https://mp.csdn.ne")
except error.URLError as e:
    print(e.reason)
try:
    response = request.urlopen("http://csdn.ne")
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep="\n")
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully!")
try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
Different failures surface differently: HTTPError carries a status code and headers, while a plain URLError only has a reason (here a socket.timeout), so checking which one you caught tells you how to adjust the code.
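As a rough sketch of how these branches can be folded into one helper (the fetch name, the retry count, and the test URL are all made up for illustration):
import socket
from urllib import request, error

def fetch(url, timeout=5, retries=2):
    for attempt in range(retries + 1):
        try:
            return request.urlopen(url, timeout=timeout).read()
        except error.HTTPError as e:       # the server answered, but with an error status
            print("HTTP error:", e.code, e.reason)
            return None
        except error.URLError as e:        # network-level failure, possibly a timeout
            if isinstance(e.reason, socket.timeout) and attempt < retries:
                continue                   # retry on timeout
            print("URL error:", e.reason)
            return None

print(fetch("http://httpbin.org/get"))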
2.5 URL parsing
The urlparse signature:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
from urllib.parse import urlparse
result = urlparse("http://www.baidu.com.index.html;user?id=5#comment")
print(type(result), result)
# The result has six fields: scheme, netloc, path, params, query, fragment.
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# Compared with above: when the URL itself has no scheme, the scheme argument supplies one.
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# When the URL already contains a scheme, it takes precedence over the scheme argument.
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)
# With allow_fragments=False the fragment field is empty; '#comment' is folded into the query.
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
The parse result (a ParseResult) has six fields: scheme, netloc, path, params, query, fragment. Which delimiters appear in the URL (://, /, ;, ?, #) determines how each field is filled; in the last example there is no query, so '#comment' ends up appended to the path instead.
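A ParseResult is a named tuple, so its fields can be read by name or by index, and the query string can be further decoded with parse_qs; a small sketch:
from urllib.parse import urlparse, parse_qs

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result.scheme, result.netloc, result.path)  # access fields by name
print(result[0], result[4])                       # or by position
print(parse_qs(result.query))                     # {'id': ['5']}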
# urlunparse
from urllib.parse import urlunparse
data = ["http", "www.baidu.com", "index.html", "user", "a=6", "comment"]
print(urlunparse(data))
# urlunparse builds (joins) a URL from its six components.
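Closely related are urlsplit and urlunsplit, the five-component counterparts that fold params into the path; a quick sketch for comparison:
from urllib.parse import urlsplit, urlunsplit

parts = urlsplit("http://www.baidu.com/index.html;user?id=5#comment")
print(parts)               # SplitResult has no separate params field
print(urlunsplit(parts))   # rebuilds the original URL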
# urljoin
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
# urljoin merges two URLs: parts missing from the second argument are taken from the base, and when the second argument already has its own scheme/netloc/path, it overrides the base.
# urlencode: serialize a dict of parameters into a query string.
from urllib.parse import urlencode
params = {
    "name": "germey",
    "age": 22
}
base_url = "http://www.baidu.com?"
url = base_url + urlencode(params)
print(url)
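urlencode works on whole dicts; for a single value, especially Chinese or other non-ASCII text, quote and unquote do the percent-encoding and decoding. A small sketch (the keyword is arbitrary):
from urllib.parse import quote, unquote

keyword = "壁纸"
url = "https://www.baidu.com/s?wd=" + quote(keyword)  # percent-encode the keyword
print(url)           # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
print(unquote(url))  # decodes it back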
Summary
Today we covered the common uses of the Urllib library and looked at its four main modules in more detail, with example code along the way; interested readers can dig deeper from here.
Gotta run, my head hurts. Loading (29/100)…