Web-Scraping Notes: the urllib Library and Inspecting Headers
A web scraper pretends to be a browser and fetches the information you want from web pages.
A scraping program generally has three steps: fetch the page, parse the data, save the data.
A URL is a web address. Python ships a library for working with URLs: urllib.
It can open web pages, encode request data to bytes, read specific pieces of the response, and more.
import urllib.request
import urllib.parse
# GET request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
# POST request
data = bytes(urllib.parse.urlencode({"hello":"world"}),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read().decode('utf-8'))
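As a side note, the `data` argument to `urlopen` must be bytes, not a dict. A minimal offline sketch of how `urlencode` prepares the POST body (the extra `page` key is just for illustration):

```python
import urllib.parse

# urlencode turns a dict into an application/x-www-form-urlencoded string,
# and bytes(...) converts that string into the raw payload urlopen expects.
payload = urllib.parse.urlencode({"hello": "world", "page": 1})
data = bytes(payload, encoding='utf-8')
print(payload)  # hello=world&page=1
print(data)     # b'hello=world&page=1'
```

Passing a non-None `data` is also what switches the request method from GET to POST.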
Timeout handling
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(response.read().decode('utf-8'))
    response = urllib.request.urlopen('http://www.douban.com')
    print(response.status)
except urllib.error.HTTPError as e:
    print('Detected as a scraper')
except urllib.error.URLError as e:
    print("Time out!")
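The try/except pattern above can be wrapped into a small reusable helper (`fetch` is a name I made up for this sketch, not part of urllib):

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=3.0):
    """Return the decoded response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except (urllib.error.URLError, socket.timeout):
        return None

# An unrealistically small timeout makes the call fail fast and return None:
print(fetch('http://httpbin.org/get', timeout=0.001))
```

Using `with` also guarantees the connection is closed even when reading the body raises.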
# Read different pieces of response information
response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)               # HTTP status code, e.g. 200
print(response.getheaders())         # all response headers as (name, value) pairs
print(response.getheader("Server"))  # a single header by name
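Why do sites notice we are a scraper in the first place? Unless we override it, urllib announces itself honestly. A quick offline check of the default User-Agent via the opener's `addheaders` list:

```python
import urllib.request

# Every plain urlopen call sends this default header, which makes the
# script trivially identifiable as a Python scraper rather than a browser.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]
```

This is exactly what the headers trick below works around: we replace that value with a real browser's User-Agent string.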
# Disguise ourselves as a browser -- the key is the headers
# url = 'http://httpbin.org/post'
# headers = {
# "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Edg/90.0.818.66"
# }
# data = bytes(urllib.parse.urlencode({"name":"aaa"}),encoding='utf-8')
# req = urllib.request.Request(url=url,data=data,headers=headers,method="POST")
# response = urllib.request.urlopen(req)
# print(response.read().decode('utf-8'))
# Disguise ourselves as a browser (so douban will not detect that we are a scraper)
url = 'http://www.douban.com'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Edg/90.0.818.66"
}
req = urllib.request.Request(url=url,headers=headers,method="GET")
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
How to find your browser's headers
Open the developer tools with F12 –> Network –> refresh the page –> stop recording –> move to the very start of the timeline –> click a request name –> the bottom of the Headers tab shows the headers your browser sends.
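Once you have copied the User-Agent out of the developer tools, you can check what a Request object will actually send without touching the network (a small sketch using the same UA string as above):

```python
import urllib.request

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Edg/90.0.818.66")
req = urllib.request.Request('http://www.douban.com',
                             headers={"User-Agent": ua}, method="GET")

# Request stores header names capitalized ('User-agent'), so look it up that way:
print(req.get_header('User-agent'))
print(req.get_method())  # GET
```

Inspecting the Request this way is a cheap sanity check before sending it with urlopen.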