Python Day 5: Common Libraries

Article link: https://blog.csdn.net/qq_45169795/article/details/125516751

This post covers the urllib, bs4, re, xlwt, and sqlite3 libraries.

I. urllib

1. Using urlopen()

(1) urlopen() can send both GET and POST requests.

# GET request
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))

(2) For a POST request, the parameters must first be converted to bytes.

# POST request
import urllib.parse
import urllib.request
# URL-encode the form fields, then encode the resulting string as UTF-8 bytes
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
# Form data is passed to the server through the data argument
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode("utf-8"))
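To see what the data argument actually contains, the encoding step can be run on its own without sending anything. This is a minimal sketch that reuses the {"hello": "world"} form from above; the variable names form, query, and data are only illustrative.

```python
import urllib.parse

form = {"hello": "world"}

# urlencode() turns the dict into a URL-encoded string
query = urllib.parse.urlencode(form)

# urlopen(data=...) expects bytes, so the string is encoded as UTF-8
data = query.encode("utf-8")

print(query)
print(data)
```

Calling query.encode("utf-8") is equivalent to the bytes(..., encoding="utf-8") form used above.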

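(3) Crawlers are often detected and blocked, so no data comes back. When that happens, open the browser developer tools (F12), copy the User-Agent value from a request, and send it so that the crawler looks like a browser.

# A GET request does not need any data
import urllib.request
response = urllib.request.urlopen("http://httpbin.org/get")
print(response.read().decode("utf-8"))
# The output contains "User-Agent": "Python-urllib/3.8", which openly identifies the client as a script; some sites may refuse such requests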

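(4) Individual values can be read from the returned response object.

import urllib.request
# douban.com answered with status 418 here: the site detected a crawler.
# urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so in that case read the status from e.code
# response = urllib.request.urlopen("http://douban.com")
# print(response.status)
response = urllib.request.urlopen("http://www.baidu.com")
print(response.status)               # HTTP status code
print(response.getheaders())         # all response headers
print(response.getheader("Server"))  # a single response header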

2. Using Request()

Calling urlopen() with a bare URL cannot attach much information to the request; for that you need the Request class.

# A bare urlopen call carries very little information; a Request object can also hold request headers and more
import urllib.parse
import urllib.request
url = "http://httpbin.org/post"
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}     # masquerade as a browser
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")  # build the request object
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))
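Before sending a request, it can help to check what the Request object holds. The sketch below builds the same kind of request and only inspects it, so nothing is sent over the network. The shortened User-Agent string is an assumption made for readability, not a value copied from a real browser.

```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # shortened, illustrative value
data = urllib.parse.urlencode({"hello": "world"}).encode("utf-8")

req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")

# Nothing has been sent yet; these calls only read the request object
print(req.full_url)
print(req.get_method())
print(req.get_header("User-agent"))  # Request stores header names capitalized, e.g. "User-agent"
print(req.data)
```

Passing req to urllib.request.urlopen() is what actually performs the request.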

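3. Exception handling

The relevant exceptions are URLError and HTTPError.
Use e.reason to see why a request failed; HTTPError additionally provides e.code, the HTTP status code.
Timeouts: some URLs are dead links or do not allow crawling. Skip them first and deal with them individually later.

import urllib.error
import urllib.request
try:
    # A very short timeout, used here to force a failure
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
    print(response.read().decode("utf-8"))
except urllib.error.URLError as e:
    print("time out!")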
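The difference between the two exception types can be sketched as follows. Because HTTPError is a subclass of URLError, it has to be checked first. The helpers describe_error and fetch are hypothetical names introduced for this illustration, and the two exceptions are constructed by hand so the example runs without a network connection.

```python
import urllib.error
import urllib.request


def describe_error(e):
    """Return a short description of a urllib exception."""
    # HTTPError is a subclass of URLError, so it must be checked first
    if isinstance(e, urllib.error.HTTPError):
        return f"HTTP error {e.code}: {e.reason}"
    if isinstance(e, urllib.error.URLError):
        return f"URL error: {e.reason}"
    return f"other error: {e}"


def fetch(url, timeout=5):
    """Hypothetical helper: return the page text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as e:  # also catches HTTPError
        print(describe_error(e))
        return None


# Build the exceptions by hand to show both cases without touching the network
http_err = urllib.error.HTTPError("http://douban.com", 418, "I'm a teapot", {}, None)
url_err = urllib.error.URLError("timed out")
print(describe_error(http_err))
print(describe_error(url_err))
```

In a crawler loop, a helper like fetch lets a failing URL be skipped (it returns None) so the remaining pages can still be processed.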

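4. Parsing URLs with parse

urlparse()	Recognizes a URL and splits it into its components
urlsplit()	Like urlparse(), but does not separate params from the path
urlencode()	Serializes a dict of parameters into a query string for a GET request
quote()	Percent-encodes Chinese (and other non-ASCII) characters as %XX sequences
unquote()	Decodes what quote() produced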
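The functions listed above can be tried without any network access. This is a small sketch; the sample URL with its #top fragment and the page parameter are made up for illustration.

```python
from urllib.parse import urlparse, urlsplit, urlencode, quote, unquote

sample = "http://httpbin.org/get?hello=world#top"

# urlparse() splits the URL into scheme, netloc, path, params, query, fragment
parts = urlparse(sample)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urlsplit() returns the same pieces except for the separate params field
split = urlsplit(sample)
print(split.netloc, split.path)

# urlencode() builds the query string of a GET request from a dict
query = urlencode({"hello": "world", "page": 2})
print("http://httpbin.org/get?" + query)

# quote() percent-encodes the UTF-8 bytes of non-ASCII text; unquote() reverses it
encoded = quote("豆瓣")
print(encoded)
print(unquote(encoded))
```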

5. Hands-on: crawling Douban

import urllib.request
url = "http://www.douban.com"
# Send a browser User-Agent so the site does not reject the request as a crawler
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
req = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))