Python web scraping: urllib

velpro_!

Article link: https://blog.csdn.net/weixin_52053631/article/details/135052718

urllib is a package in Python's standard library for working with URLs and making network requests. It contains four modules: request, error, parse, and robotparser.

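# 1. Imports
# Use urllib to fetch the source of the Baidu home page
import urllib.request
# quote: percent-encodes a single string (non-ASCII characters become %XX sequences of their UTF-8 bytes) so it can be used in a URL. Use case: a GET request with one parameter.
# urlencode: turns a dict of parameters into a percent-encoded "key=value&key=value" query string. Use case: a GET request with several parameters.
import urllib.parse


# 2. Define the URL
url = "URL goes here"

# If there are parameters, encode them separately and append them to the URL (this works for GET requests; POST parameters cannot be appended), for example:
# data = urllib.parse.quote("周杰伦")
data = {
    "from": "en",
    "to": "zh"
}
data = urllib.parse.urlencode(data)   # GET: keep the query string as str, no encode() call
# data = urllib.parse.urlencode(data).encode("utf-8")  # POST: the body must be bytes, so call encode()

url = url + "?" + data   # GET only, assuming url has no query string yet; for POST, pass data to the Request object instead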

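# 3. Getting past anti-scraping checks
headers = {
    # User agent
    "User-Agent": "copy from the browser's F12 developer tools",  # ---- anti-scraping check 1
    # The cookie carries your login information; with a cookie obtained after logging in, you can reach pages that require a login
    "Cookie": "copy from the browser's F12 developer tools",  # ---- anti-scraping check 2: handles login
    # Tells the server which page the request came from; commonly checked to prevent image hotlinking
    "Referer": "https://weibo.com/7473392681"
}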

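# 4. urlopen() has no headers parameter, so the headers dict cannot be passed to it directly
# Build a custom Request object instead ---- this is how the User-Agent check is handled ---- for a POST request, data is also passed here
request = urllib.request.Request(url=url, headers=headers)   # GET request
# request = urllib.request.Request(url=url, data=data, headers=headers)   # POST request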

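# 5. Send the request to the server, imitating a browser
response = urllib.request.urlopen(request)

# 6. Read the response body; read() returns bytes, so decode() it
content = response.read().decode("utf-8")
print(content)

# 7. Save the data
with open("filename.ext", "w", encoding="utf-8") as fp:  # check the page's actual encoding in the F12 developer tools
    fp.write(content)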
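
The quote and urlencode helpers described in steps 1 and 2 can be tried on their own, without sending a request. The following is a minimal sketch; the base URL `https://example.com/search` and the parameter name `wd` are assumptions chosen only for illustration.

```python
import urllib.parse

# quote: percent-encode one value, e.g. a single search keyword
keyword = urllib.parse.quote("周杰伦")
print(keyword)  # %E5%91%A8%E6%9D%B0%E4%BC%A6

# urlencode: encode a whole dict of parameters into one query string
params = {"wd": "周杰伦", "from": "en", "to": "zh"}
query = urllib.parse.urlencode(params)
print(query)  # wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&from=en&to=zh

# Hypothetical base URL, used only to show how the pieces are joined
base_url = "https://example.com/search"
full_url = base_url + "?" + query
print(full_url)

# unquote reverses the percent-encoding
decoded = urllib.parse.unquote(keyword)
print(decoded)  # 周杰伦
```

Note that the "?" separator has to be added by hand, because urlencode only produces the part after it.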


# Download a web page
# urllib.request.urlretrieve(url=url_page, filename="filename.html")
# Download an image
# urllib.request.urlretrieve(url=url_image, filename="lisa.jpg")
# Download a video
# urllib.request.urlretrieve(url=url_video, filename="test.mp4")
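
To see what urlretrieve does without depending on a real website, the sketch below points it at a local file through a file:// URL. The temporary directory and the file names are assumptions for the demo; with a real page, image, or video you would pass its http(s) URL instead.

```python
import tempfile
import urllib.request
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())

# Create a small local file that plays the role of the remote resource
source = work_dir / "source.html"
source.write_text("<html><body>hello</body></html>", encoding="utf-8")

# A file:// URL stands in for url_page / url_image / url_video
url_page = source.as_uri()
target = work_dir / "filename.html"

# urlretrieve copies the resource to `filename` and returns (filename, headers)
saved_path, response_headers = urllib.request.urlretrieve(url=url_page, filename=str(target))

print(saved_path)
print(target.read_text(encoding="utf-8"))
```

The same call works for binary content such as images and videos, because urlretrieve writes the raw bytes and never decodes them.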


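# response is an http.client.HTTPResponse object
# read()                 reads the whole body as bytes; read(n) reads at most n bytes
# readline()             reads one line
# readlines()            reads all remaining lines into a list
# response.getcode()     returns the HTTP status code
# response.geturl()      returns the URL that was actually fetched
# response.getheaders()  returns the response headers as a list of (name, value) tuples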
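
The methods listed above can be exercised against a throwaway local server, so the results do not depend on any external site. This is only a sketch: `DemoHandler` and its three-line response body are made up for the demo, and the empty ProxyHandler is there to keep the request away from any system proxy.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that always answers with three lines of text."""

    def do_GET(self):
        body = b"line one\nline two\nline three\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

with opener.open(url, timeout=5) as response:
    status = response.getcode()          # HTTP status code
    final_url = response.geturl()        # URL that was fetched
    header_names = [name for name, _ in response.getheaders()]
    first_line = response.readline()     # one line, as bytes
    remaining = response.read()          # everything that is left, as bytes

server.shutdown()
server.server_close()

print(status, final_url)
print(header_names)
print(first_line, remaining)
```

Each read call consumes the body, so read() after readline() returns only what has not been read yet.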


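# Exceptions:
# HTTPError is a subclass of URLError
# Both live in urllib.error: urllib.error.HTTPError and urllib.error.URLError
# HTTPError: raised when the server was reached but answered with an error status code (such as 404 or 500); the code tells you what went wrong with the page.
# URLError: raised when the request could not be completed at all, for example when the host name cannot be resolved or the connection fails.
# A request sent with urllib can fail. To make the code more robust, wrap it in try/except and catch the two exception types, HTTPError and URLError. Because HTTPError is the subclass, catch it first.
# For example:
import urllib.error

try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    print(content)
except urllib.error.HTTPError:
    print("HTTPError")
except urllib.error.URLError:
    print("URLError")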
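
Both exception types can be triggered locally, which makes the difference between them easier to see. The sketch below is one possible setup: `NotFoundHandler` and `classify` are hypothetical names, the local server always answers 404, and the second request targets the same port after the server has been closed, on the assumption that nothing else starts listening there in the meantime.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# An empty ProxyHandler keeps the requests away from any system proxy
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def classify(url):
    """Return a short label describing how the request ended."""
    try:
        with opener.open(url, timeout=5) as response:
            return f"OK {response.getcode()}"
    except urllib.error.HTTPError as exc:   # subclass first
        return f"HTTPError {exc.code}"
    except urllib.error.URLError:
        return "URLError"


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_port

# The server is reachable but returns an error status
http_result = classify(f"http://127.0.0.1:{port}/missing")

server.shutdown()
server.server_close()

# Nothing listens on the port any more, so the connection itself fails
url_result = classify(f"http://127.0.0.1:{port}/")

print(http_result)
print(url_result)
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
```

If the two except clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.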



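# A possible error:
# A profile page may be UTF-8 and still raise a decoding error. The request never reached the profile page: it was redirected to the login page, which is not UTF-8 (check the encoding in the F12 developer tools). To skip the login and reach the target page, you may need to send a cookie.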

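Differences between GET and POST requests:

1. Parameters: GET parameters are appended to the URL as a URL-encoded string and do not need the encode() call; POST parameters cannot be appended to the URL and must be converted to bytes with encode().

2. Request object: GET parameters are already part of the URL; POST parameters are passed separately through the data argument of Request.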
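
The two differences can be checked by building both kinds of Request object and inspecting them, without sending anything. In this sketch the URL `https://example.com/translate` is a hypothetical endpoint; the parameters are the ones used earlier in the article.

```python
import urllib.parse
import urllib.request

params = {"from": "en", "to": "zh"}
base_url = "https://example.com/translate"  # hypothetical endpoint

# GET: parameters are URL-encoded and appended to the URL as a str
query = urllib.parse.urlencode(params)
get_request = urllib.request.Request(url=base_url + "?" + query)

# POST: parameters are URL-encoded, converted to bytes, and passed as data
post_body = urllib.parse.urlencode(params).encode("utf-8")
post_request = urllib.request.Request(url=base_url, data=post_body)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

urllib picks the HTTP method from the presence of data: a Request without data is sent as GET, and one with data is sent as POST.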

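Using a Handler:

Handler: builds more advanced requests. As the logic gets more complex, a custom Request object is no longer enough (it cannot handle dynamic cookies or proxies). For those cases, HTTPHandler, build_opener, and open are used together.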

# Imports
import urllib.request
# URL
url = "http://www.baidu.com"
# Request headers
headers = {
    # User agent
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
# Build the custom Request object
request = urllib.request.Request(url=url, headers=headers)

# Three steps: handler, build_opener, open
# 1. Create the handler object
handler = urllib.request.HTTPHandler()
# 2. Create the opener object
opener = urllib.request.build_opener(handler)
# 3. Call open() on the opener to send the HTTP request
response = opener.open(request)
# Read the response
content = response.read().decode("utf-8")
print(content)
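
It can help to look at what build_opener actually returns before sending anything through it. The sketch below only inspects the opener; the User-Agent string is a placeholder, not a real browser value.

```python
import urllib.request

handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)

# build_opener keeps our handler and adds urllib's default handlers around it
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)

# Headers stored on the opener are added to each request sent through it,
# unless the request already sets that header itself
opener.addheaders = [("User-Agent", "Mozilla/5.0 (example)")]
print(opener.addheaders)
```

This is why swapping HTTPHandler for another handler, such as ProxyHandler, changes how requests are sent while the build_opener and open steps stay the same.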

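Proxy servers: what to do after your IP has been blocked. Typical uses: 1) Many requests in a short time can get an IP blocked, and a proxy server works around that. 2) Reaching sites hosted abroad. 3) Reaching an internal network. 4) Speeding up access. 5) Hiding your real IP.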

import urllib.request
import random

url = "https://ip.900cha.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

request = urllib.request.Request(url=url, headers=headers)
# response = urllib.request.urlopen(request)  # replaced by the opener code below

# A single proxy: address and port
# proxies = {
#     "http": "114.99.232.174:21753"
# }

# Proxy pool (sample entries that have probably expired; replace them with working proxies)
# The "http" key only applies to http:// URLs; an https:// URL needs an "https" entry as well
proxies_pool = [
    {"http": "106.122.240.76:18781"},
    {"http": "115.219.4.132:21306"},
]
proxies = random.choice(proxies_pool)   # raises IndexError if the pool is empty
# print(proxies)

# Pass the proxy address and port to ProxyHandler as a dict
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)

content = response.read().decode("utf-8")
with open("daili.html", "w", encoding="utf-8") as fp:
    fp.write(content)
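
Picking a proxy from the pool can be wrapped in a small helper that also copes with an empty pool. This is a sketch under stated assumptions: `build_proxy_opener` is a hypothetical helper, and the addresses come from the 203.0.113.0/24 documentation range, so they are placeholders and not working proxies.

```python
import random
import urllib.request


def build_proxy_opener(proxies_pool):
    """Return (opener, chosen_proxy); chosen_proxy is None when the pool is empty."""
    if not proxies_pool:
        # An empty dict tells ProxyHandler to use no proxy at all
        handler = urllib.request.ProxyHandler({})
        return urllib.request.build_opener(handler), None
    proxies = random.choice(proxies_pool)
    handler = urllib.request.ProxyHandler(proxies=proxies)
    return urllib.request.build_opener(handler), proxies


# Placeholder entries; each one covers both http:// and https:// targets
example_pool = [
    {"http": "203.0.113.10:8080", "https": "203.0.113.10:8080"},
    {"http": "203.0.113.11:8080", "https": "203.0.113.11:8080"},
]

opener, chosen = build_proxy_opener(example_pool)
print(chosen)

fallback_opener, fallback_choice = build_proxy_opener([])
print(fallback_choice)  # None
```

The returned opener is used exactly like the one in the listing above: `opener.open(request)`.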

Cookie library: handling cookies dynamically

To be updated...
