python requests 爬虫_Python爬虫学习（一）使用requests库和robots协议

最新推荐文章于 2023-02-04 15:30:18 发布

weixin_39615219

最新推荐文章于 2023-02-04 15:30:18 发布

阅读量188

点赞数

文章标签： python requests 爬虫

(一)爬虫需要的库和框架：

(二)爬虫的限制：

1，Robots协议概述：

网站拥有者可以在网站根目录下建立robots.txt文件，User-agent：定义不能访问者；Disallow定义不可以爬取的目录

例如：http://www.baidu.com/robots.txt的部分内容：

//不允许Baiduspider访问如下目录

User-agent: Baiduspider

Disallow:/baidu

Disallow:/s?

Disallow:/ulink?

Disallow:/link?

Disallow:/home/news/data/Disallow:/bh//不允许Googlebot访问如下目录

User-agent: Googlebot

Disallow:/baidu

Disallow:/s?

Disallow:/shifen/Disallow:/homepage/Disallow:/cpro

Disallow:/ulink?

Disallow:/link?

Disallow:/home/news/data/Disallow:/bh

2，Robots协议的使用：爬虫要求，类人行为爬虫可以不用遵守robots协议

(三)使用Requests库：

1，安装命令：pip install requests

2，测试

importrequests;

r=requests.get("http://www.baidu.com")

print(r.status_code)

#如果requests库安装成功，则打印200

r.encoding='utf-8'

r.text #打印网页内容

常见方法：

1，requests.request()方法：

1，第三个参数使用params=xx，将数据添加到url后面

2，第三个参数使用data=xxx，将数据赋值到内容

3，第三个参数使用json=xx，将内容赋值到json

4，第三个参数使用headers=字典类型数据；可以修改网页headers的属性值

5，第三个参数使用timeout=n，设置访问时间，当访问超时则报异常

6，第三个参数使用proxies=xx，设置访问的代理服务器(xx为字典数据，可以存放http或https代理)，可以有效隐藏爬虫原IP，可以有效防止爬虫逆追踪

例一：

例二：第三参数使用data=xxx，将数据赋值到内容

importrequests;from _socket importtimeout#from aifc import data

defgetHTMLText(url):try:

r=requests.get(url,timeout=30)

r.raise_for_status()#如果连接状态不是200，则引发HTTPError异常

r.encoding=r.apparent_encoding #使返回的编码正常

print("连接成功")returnr.status_codeexcept:print("连接异常")returnr.status_code

url="http://httpbin.org/"

if getHTMLText(url)==200:

payData={'name;':'zs',"sex":"man"};

r=requests.request("POST",url+"post",data=payData) #第一个参数赋什么值，第二个参数在url后就添加什么参数

print(r.text)

运行截图：

2，requests.get()方法：**kwargs是request方法中除了params外的其他12个参数(最常用的方法)

r.encoding：当连接的网页的

标签没有charset属性，则认为编码为ISO-8859-1

当连接失败时：

抛出异常方法：

#判断url是否可以正常连接的通用方法：

defgetHTMLText(url):try:

r=requests.get(url,timeout=30)

r.raise_for_status()#如果连接状态不是200，则引发HTTPError异常

r.encoding=r.apparent_encoding #使返回的编码正常

returnr.textexcept:return "连接异常"

3，requests.head()方法：

4，requests.post()方法：

5，requests.put()方法：

6，requests.patch()方法：

7，requests.delete()方法：

8，HTTP协议：

(四)爬取网站数据案例；

例一：爬取起点某小说某章节

importrequests;from _socket importtimeout#from aifc import data

defgetHTMLText(url):try:

r=requests.get(url,timeout=30)

r.raise_for_status()#如果连接状态不是200，则引发HTTPError异常

r.encoding=r.apparent_encoding #使返回的编码正常

print("连接成功")returnr.status_codeexcept:print("连接异常")returnr.status_code

url="https://read.qidian.com/chapter/z4HnwebpkxdqqtWmhQLkJA2/pYlct39tiiVOBDFlr9quQA2"access={"user-agent":"Mozilla/5.0"} #设置访问网站为浏览器Mozilla5.0

if getHTMLText(url)==200:#pst={"http":"http://user:pass@10.10.10.1:1234","https":"https://10.10.10.1:4321"} #设置代理

r=requests.get(url,headers=access)#访问网站设置代理服务器

#r.encoding="utf-8"

print(r.request.headers) #获取头部信息

print(r.text[2050:3000])

截图：

例二：保存网上各种格式文件到本地：

importosimportrequests

url="http://a3.att.hudong.com/53/88/01200000023787136831885695277.jpg" #可以是图片，音乐，视频等的网上地址access={"user-agent":"Mozilla/5.0"}

file="D://picture//" #定义图片保存的本地文件

path=file+url.split("/")[-1]print(path)try:if not os.path.exists(file): #如果不存在目录D://picture就新建

os.mkdir(file)if notos.path.exists(path):

r=requests.get(url,headers=access)

with open(path,"wb") as f: #打开文件path作为f；

f.write(r.content) #r.content：爬取的文件的二进制形式

f.close()print("图片保存成功")else:print("文件已保存")except:print("爬取失败")

例三：

weixin_39615219

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
python requests 爬虫_Python爬虫学习（一）使用requests库和robots协议

(一)爬虫需要的库和框架：(二)爬虫的限制：1，Robots协议概述：网站拥有者可以在网站根目录下建立robots.txt文件，User-agent：定义不能访问者；Disallow定义不可以爬取的目录例如：http://www.baidu.com/robots.txt的部分内容：//不允许Baiduspider访问如下目录User-agent: BaiduspiderDisallow:/baid...
复制链接

扫一扫