Web-Scraping Notes: the urllib Library and Inspecting Headers
A web scraper pretends to be a browser and fetches the information you want from web pages.
A scraping program generally has three steps: fetch the page, parse the data, save the data.
A URL is a web address. Python ships a library for working with URLs: urllib.
It can open web pages, encode request data to bytes, read specific pieces of the response, and more.
import urllib.request
import urllib.parse
# GET request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
# POST request
data = bytes(urllib.parse.urlencode({"hello":"world"}),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read().decode('utf-8'))
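As a side note, the `data` argument to `urlopen` must be bytes, not a dict. A minimal offline sketch of how `urlencode` prepares the POST body (the extra `page` key is just for illustration):

```python
import urllib.parse

# urlencode turns a dict into an application/x-www-form-urlencoded string,
# and bytes(...) converts that string into the raw payload urlopen expects.
payload = urllib.parse.urlencode({"hello": "world", "page": 1})
data = bytes(payload, encoding='utf-8')
print(payload)  # hello=world&page=1
print(data)     # b'hello=world&page=1'
```

Passing a non-None `data` is also what switches the request method from GET to POST.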
Timeout handling
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(response.read().decode('utf-8'))
    response = urllib.request.urlopen('http://www.douban.com')
    print(response.status)
except urllib.error.HTTPError as e:
    print('Detected as a scraper')
except urllib.error.URLError as e:
    print("Time out!")
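The try/except pattern above can be wrapped into a small reusable helper (`fetch` is a name I made up for this sketch, not part of urllib):

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=3.0):
    """Return the decoded response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except (urllib.error.URLError, socket.timeout):
        return None

# An unrealistically small timeout makes the call fail fast and return None:
print(fetch('http://httpbin.org/get', timeout=0.001))
```

Using `with` also guarantees the connection is closed even when reading the body raises.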
# Read different pieces of response information
response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)               # HTTP status code, e.g. 200
print(response.getheaders())         # all response headers as (name, value) pairs
print(response.getheader("Server"))  # a single header by name
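Why do sites notice we are a scraper in the first place? Unless we override it, urllib announces itself honestly. A quick offline check of the default User-Agent via the opener's `addheaders` list:

```python
import urllib.request

# Every plain urlopen call sends this default header, which makes the
# script trivially identifiable as a Python scraper rather than a browser.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]
```

This is exactly what the headers trick below works around: we replace that value with a real browser's User-Agent string.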
# Disguise ourselves as a browser -- the key is the headers
# url = 'http://httpbin.org/post'
# headers = {
# "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Edg/90.0.818.66"
# }
# data = bytes(urllib.parse.urlencode({"name":"aaa"}),encoding='utf-8')
# req = urllib.request.Request(url=url,data=data,headers=headers,method="POST")
# response = urllib.request.urlopen(req)
# print(response.read().decode('utf-8'))
# Disguise ourselves as a browser (so douban will not detect that we are a scraper)
url = 'http://www.douban.com'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Edg/90.0.818.66"
}
req = urllib.request.Request(url=url,headers=headers,method="GET")
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
How to find your browser's headers
Open the developer tools with F12 –> Network –> refresh the page –> stop recording –> move to the very start of the timeline –> click a request name –> the bottom of the Headers tab shows the headers your browser sends.
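Once you have copied the User-Agent out of the developer tools, you can check what a Request object will actually send without touching the network (a small sketch using the same UA string as above):

```python
import urllib.request

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Edg/90.0.818.66")
req = urllib.request.Request('http://www.douban.com',
                             headers={"User-Agent": ua}, method="GET")

# Request stores header names capitalized ('User-agent'), so look it up that way:
print(req.get_header('User-agent'))
print(req.get_method())  # GET
```

Inspecting the Request this way is a cheap sanity check before sending it with urlopen.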