Python Web Scraping Basics 1: Basic Usage of the urllib Library

北纬40度~

Tags: web scraping, https

Article link: https://blog.csdn.net/weixin_46160781/article/details/115556490

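Keyboard shortcuts

When you run cells, you may notice that their border turns blue, whereas it is green while you are editing. There is always one "active" cell, and its highlight shows the current mode: green means "edit mode" and blue means "command mode".

So far we have seen how to run a cell with Ctrl + Enter, but there are many more shortcuts. Keyboard shortcuts are a popular part of the Jupyter environment because they enable a fast, cell-based workflow. Many of them are actions that you perform on the active cell while in command mode.

Below is a list of some of Jupyter's keyboard shortcuts. You may not memorize them right away, but the list should give you an idea of what is possible.

Press Esc to switch to command mode and Enter to switch to edit mode.
In command mode:
Use the Up and Down keys to move up and down through your cells.
Press A or B to insert a new cell above or below the active cell.
M converts the active cell to a Markdown cell.
Y converts the active cell to a code cell.
D + D (press D twice) deletes the active cell.
Z undoes a cell deletion.
Hold Shift and press Up or Down to select multiple cells at once.
With multiple cells selected, Shift + M merges your selection.
Ctrl + Shift + -, in edit mode, splits the active cell at the cursor.
You can also select cells with Shift + Click in the margin to the left of your cells.
Try these out in your own notebook. Once you have experimented with them, create a new Markdown cell and we will learn how to format text in our notebooks.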

urllib

urlopen

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

# GET request
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# read() returns the response body as bytes; decode() converts it to str using the given encoding
print(response.read().decode('utf-8'))

# POST request
import urllib.parse  # URL parsing module
import urllib.request

# The data argument must be a bytes object
data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
print(data)
print("\n")

# http://httpbin.org is a service for testing HTTP requests
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

# Timeout: give up if the server does not respond within the given number of seconds
import urllib.request

response = urllib.request.urlopen('http://httpbin.org',timeout=3)
print(response.read())

# Handling a timeout exception
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/post',timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print("TIME OUT")
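The handler above only catches timeouts that urllib wraps in URLError, which is what happens when the connection attempt itself times out. A timeout while waiting for the response can surface as a bare socket.timeout instead. The following is a minimal sketch that covers both cases; the helper name fetch_or_none and its "return None on timeout" convention are assumptions made for illustration, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=3.0):
    """Return the response body as bytes, or None if the request timed out.

    fetch_or_none is a hypothetical helper name used only in this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeout wrapped by urllib, for example while connecting
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL error is not a timeout, so re-raise it
    except socket.timeout:
        # Timeout raised directly, for example while reading the response
        return None
```

A call such as fetch_or_none('http://httpbin.org/get', timeout=0.1) would then return either the body or None, and still raise for errors that are not timeouts.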

Response

Response type

import urllib.request

response = urllib.request.urlopen('http://httpbin.org')
print(type(response))  # print the type of the response object

Status code and response headers

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.status)  # status code
print(response.getheaders())  # all response headers
print("\n")
print(response.getheader('Date'))  # getheader() takes a header name and returns that header's value
print("\n")
print(response.read().decode())  # read() returns the body as bytes; decode() converts it to str
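Calling decode() without arguments assumes UTF-8, which does not hold for every page. Below is a minimal sketch that takes the charset from the Content-Type header when one is declared. The helper name read_text is hypothetical, and the data: URLs stand in for a real server so that the example runs offline.

```python
import urllib.request


def read_text(url, default_charset='utf-8'):
    """Fetch url and decode the body with the charset declared in its headers.

    read_text is a hypothetical helper name used only in this sketch.
    """
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when no charset is declared
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# data: URLs carry their own Content-Type, so no network access is needed
print(read_text('data:text/plain;charset=utf-8,hello%20urllib'))
print(read_text('data:text/plain;charset=iso-8859-1,caf%E9'))
```

With an HTTP URL the same helper would read the charset that the server sends, falling back to default_charset when none is given.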

Request

import urllib.request
 
# A URL wrapped in a Request object can also be passed to urlopen; Request lets you set the HTTP method, headers, and extra data
request = urllib.request.Request('http://httpbin.org/')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

from urllib import request,parse

url = 'http://httpbin.org/post'  # target URL for a POST request
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict to avoid shadowing the built-in dict type
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

from urllib import request, parse

url = 'http://httpbin.org/post'
# Named form_data rather than dict to avoid shadowing the built-in dict type
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
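A Request object can be inspected before it is sent, which helps when checking what urlopen is about to transmit. The sketch below reuses the URL and header from the examples above and sends nothing over the network.

```python
from urllib import parse, request

form_data = {'name': 'Germey'}
data = parse.urlencode(form_data).encode('utf-8')

# No method argument here: it is inferred from the presence of data
req = request.Request('http://httpbin.org/post', data=data)
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(req.get_method())              # POST, because data is not None
print(req.full_url)                  # http://httpbin.org/post
print(req.data)                      # b'name=Germey'
print(req.get_header('User-agent'))  # header names are stored capitalized
print(req.header_items())
```

Note that add_header stores the name as 'User-agent', so that is the spelling get_header expects.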

Handler

Proxy

import urllib.request

"""
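With a proxy, the target server sees the proxy's IP address instead of our own.
By switching proxies while the crawler runs, requests appear to come from different
locations, which makes it less likely that the server will ban our IP.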
"""


# Build the handler
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
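The idea of switching proxies while the crawler runs can be sketched as a generator that builds one opener per proxy in turn. The proxy addresses and the name opener_factory are assumptions for illustration; nothing is sent until an opener's open() method is called.

```python
import itertools
import urllib.request

# Hypothetical local proxy addresses, used only for illustration
PROXIES = ['http://127.0.0.1:9743', 'http://127.0.0.1:9744']


def opener_factory(proxies):
    """Yield (proxy, opener) pairs forever, cycling through the proxies in order."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
        yield proxy, urllib.request.build_opener(handler)


openers = opener_factory(PROXIES)
for _ in range(3):
    proxy, opener = next(openers)
    print(proxy)  # opener.open(url) would send the request through this proxy
```

Cycling in a fixed order is the simplest choice; picking a proxy at random or dropping proxies that fail are equally valid variations.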

Cookie: a text file stored on the client that records the user's identity

In web scraping, cookies are the mechanism used to maintain our login state.

import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()  # create a CookieJar object to hold the cookies
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

Cookie storage formats

import http.cookiejar,urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True,ignore_expires=True)

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
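The save and load steps can be tried offline by building a cookie by hand. This sketch assumes a made-up session cookie (sessionid=abc123 for example.com) and a temporary file; make_session_cookie is a hypothetical helper. It also shows why ignore_discard=True matters: session cookies are marked as discardable and are skipped on save without it.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a session cookie (no expiry) by hand, for illustration only."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Write to a temporary directory so that no existing cookie.txt is overwritten
path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

jar = http.cookiejar.LWPCookieJar(path)
jar.set_cookie(make_session_cookie('sessionid', 'abc123', 'example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Load the file into a fresh jar, using the same format that wrote it
restored = http.cookiejar.LWPCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
names = [(c.name, c.value) for c in restored]
print(names)  # [('sessionid', 'abc123')]
```

A file written by LWPCookieJar should be loaded with LWPCookieJar, and one written by MozillaCookieJar with MozillaCookieJar, since the two formats differ.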

Exception handling

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
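The order of the except clauses above matters because HTTPError is a subclass of URLError: if URLError came first, it would also catch every HTTPError. A small offline sketch of that relationship follows; describe is a hypothetical helper, and the exceptions are constructed by hand instead of coming from a real request.

```python
from urllib import error


def describe(exc):
    """Summarize a urllib exception; describe is a hypothetical helper name."""
    # Check the subclass first: every HTTPError is also a URLError
    if isinstance(exc, error.HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, error.URLError):
        return f'URL error: {exc.reason}'
    return f'other error: {exc!r}'


print(issubclass(error.HTTPError, error.URLError))  # True

# Exceptions built by hand so that the example runs without network access
not_found = error.HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
unreachable = error.URLError('connection refused')
print(describe(not_found))    # HTTP error 404: Not Found
print(describe(unreachable))  # URL error: connection refused
```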

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

URL parsing

urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
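The results printed above are ParseResult named tuples, so each component can be read by name or by index. The sketch below summarizes what the scheme and allow_fragments arguments do, using the same kind of URLs as the examples.

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme)    # http
print(result.netloc)    # www.baidu.com
print(result.path)      # /index.html
print(result.params)    # user
print(result.query)     # id=5
print(result.fragment)  # comment
print(result[1])        # www.baidu.com (index access also works)

# scheme= is only a default: without a leading //, the host is parsed as part of the path
no_scheme = urlparse('www.baidu.com/index.html', scheme='https')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

# allow_fragments=False leaves '#comment' attached to the preceding component
no_fragment = urlparse('http://www.baidu.com/index.html?id=5#comment', allow_fragments=False)
print(no_fragment.query, repr(no_fragment.fragment))
```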

urlunparse

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
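urlunparse is the inverse of urlparse and expects exactly six components. One common use, sketched here, is to parse a URL, change a single component, and put it back together.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Parsing and then unparsing gives back the original URL
print(urlunparse(parts) == url)  # True

# ParseResult is a named tuple, so _replace returns a modified copy
updated = parts._replace(query='id=6', fragment='')
print(urlunparse(updated))  # http://www.baidu.com/index.html;user?id=6
```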

urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
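In a crawler, urljoin is mostly used to turn the links found on a page into absolute URLs. A short sketch follows; the page URL and the links are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and links, as they might appear in href attributes
page = 'http://example.com/docs/guide/index.html'
links = [
    'intro.html',                       # relative to the current directory
    '../api.html',                      # one directory up
    '/about',                           # relative to the site root
    '#top',                             # fragment on the same page
    '//cdn.example.com/lib.js',         # inherits only the scheme
    'https://cuiqingcai.com/FAQ.html',  # already absolute, returned unchanged
]

absolute = [urljoin(page, link) for link in links]
for link, resolved in zip(links, absolute):
    print(link, '->', resolved)
```

The second argument takes precedence: whatever it specifies (scheme, host, path) replaces the corresponding part of the base URL, and only the missing parts are filled in from the base.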

urlencode

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
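urlencode also percent-encodes spaces and non-ASCII characters, and parse_qs reverses the process. The sketch below uses a made-up search query and path to show the round trip, along with the doseq option for list values.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical query parameters containing a space and non-ASCII text
params = {'wd': 'python 爬虫', 'page': 2}
query = urlencode(params)
url = 'http://www.baidu.com/s?' + query
print(url)  # http://www.baidu.com/s?wd=python+%E7%88%AC%E8%99%AB&page=2

# parse_qs decodes the query string; every value comes back as a list of strings
decoded = parse_qs(urlparse(url).query)
print(decoded)  # {'wd': ['python 爬虫'], 'page': ['2']}

# doseq=True expands a list value into repeated keys
print(urlencode({'tag': ['a', 'b']}, doseq=True))  # tag=a&tag=b
```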

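This post records what I learned while studying; if it infringes on anyone's rights, it will be removed on request.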
