Web Crawlers

生-信

Article link: https://blog.csdn.net/weixin_44549759/article/details/103757553

Crawlers

2019-12-29

1. A general-purpose code framework for fetching a web page:

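import requests  # import the requests library

def getHTMLText(url):  # this function is the general-purpose framework
    try:
        r = requests.get(url, timeout=30)  # request the URL, giving up after 30 seconds
        r.raise_for_status()  # raise HTTPError if the status is a 4xx or 5xx error code
        r.encoding = r.apparent_encoding  # decode using the encoding inferred from the content
        return r.text  # return the page content at the URL
    except requests.RequestException:  # covers connection errors, timeouts, and HTTPError
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.sougou.com"
    print(getHTMLText(url))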

2019-12-30

2. The five most commonly used attributes of the Response (the returned object):

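r.status_code  # status of the HTTP response; 200 means success, 404 and other 4xx/5xx codes mean the request failed
r.text  # the HTTP response body as a string, i.e. the page content at the URL
r.encoding  # the response encoding guessed from the HTTP header
r.apparent_encoding  # the response encoding inferred from the content itself
r.content  # the HTTP response body in binary form
Note: items 3, 4, and 5 are the attributes commonly used when decoding the content.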
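As a small illustration of how status codes are usually grouped, here is a sketch that uses only the standard library's http.HTTPStatus. The helper name describe_status is hypothetical and not part of requests; the 400 to 599 range it checks is the range for which r.raise_for_status() raises HTTPError.

```python
from http import HTTPStatus


def describe_status(code):
    """Hypothetical helper: return (reason phrase, is_error) for a status code."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    is_error = 400 <= code < 600      # client errors (4xx) and server errors (5xx)
    return phrase, is_error


if __name__ == "__main__":
    for code in (200, 301, 404, 500):
        print(code, describe_status(code))
```

Note that a code such as 301 is neither 200 nor an error, which is why checking the 4xx/5xx range is more precise than checking for "not 200".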

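3. Why does r.encoding return "ISO-8859-1"?

If the HTTP header contains no charset field, r.encoding falls back to ISO-8859-1 and returns that value. Note that this encoding cannot decode Chinese text, so r.apparent_encoding, which infers the encoding from the content itself, can be used to decode the page instead.

r.encoding = r.apparent_encoding  # decode using the encoding inferred from the content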
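To see why the ISO-8859-1 fallback garbles Chinese pages, the sketch below decodes the same bytes two ways using only built-in str/bytes methods. The variable raw stands in for r.content, and the page is assumed to be UTF-8 encoded; working out that fact is the step r.apparent_encoding is meant to automate.

```python
# Bytes as they would arrive over the network (stand-in for r.content),
# assuming the server encoded the page as UTF-8.
raw = "网络爬虫".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never fails, but it maps every
# single byte to one character, so multi-byte Chinese characters are garbled.
wrong = raw.decode("ISO-8859-1")

# Decoding with the encoding the content actually uses recovers the text.
right = raw.decode("utf-8")

print(len(raw), len(wrong), len(right))
print(wrong == "网络爬虫", right == "网络爬虫")

# The garbled string still holds the original bytes, so it can be repaired.
repaired = wrong.encode("ISO-8859-1").decode("utf-8")
print(repaired)
```

This is also why assigning r.encoding before reading r.text matters: the text is produced by decoding the bytes with whatever encoding is set at that moment.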

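4. Five commonly used HTTP methods

Retrieving information

GET  # request the resource at the URL;
HEAD  # request only the header information of the resource at the URL.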

>>> r = requests.head("http://www.baidu.com")
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Mon, 30 Dec 2019 12:54:40 GMT', 'Last-Modified': 'Mon, 13 Jun 2016 02:50:08 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18'}
>>> r.text  # the response body is empty
''
>>>
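The difference between GET and HEAD can also be checked without touching a real site. The sketch below is an assumption-laden stand-in: instead of requests it uses the standard library's http.server and http.client, and DemoHandler and fetch are hypothetical names. It starts a throwaway local server and shows that HEAD returns the same headers as GET but no body.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one fixed HTML page."""

    BODY = b"<html><body>hello</body></html>"

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(self.BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(self.BODY)  # GET: headers followed by the body

    def do_HEAD(self):
        self._send_headers()  # HEAD: headers only, no body

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(method):
    """Send one request to a temporary local server; return (content type, body)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request(method, "/")
        resp = conn.getresponse()
        result = (resp.getheader("Content-Type"), resp.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch("GET"))
    print(fetch("HEAD"))
```

HEAD is therefore a cheap way to inspect a resource's metadata when the body itself is not needed.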

Adding information to the resource at a URL, or deleting part of it

POST  # request that new data be appended to the resource at the URL;

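# POST a dict to the URL; it is automatically encoded as form data
# Note: the JSON shown below is the kind of echo a request-inspection endpoint returns
# (assumed here to be something like http://httpbin.org/post); www.baidu.com does not echo the request back
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://www.baidu.com", data=payload)
>>> print(r.text)
{...
    "form":{
        "key2":"value2",
        "key1":"value1"
        },
}
# POST a string to the URL; it is sent as raw data
>>> r = requests.post("http://www.baidu.com", data="ABC")
>>> print(r.text)
{...
    "data":"ABC",
    "form":{},
}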
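The two cases above differ in how the request body is built before it is sent. The following is a minimal sketch of that difference using only urllib.parse from the standard library; encode_body is a hypothetical helper written for illustration, not a function of requests, although requests performs an equivalent form encoding when data is a dict.

```python
from urllib.parse import urlencode


def encode_body(data):
    """Hypothetical helper: return (body, content type) for a POST payload."""
    if isinstance(data, dict):
        # A dict becomes key=value pairs joined by "&" (form encoding).
        return urlencode(data), "application/x-www-form-urlencoded"
    # A plain string is sent unchanged, with no form content type attached.
    return data, None


if __name__ == "__main__":
    print(encode_body({'key1': 'value1', 'key2': 'value2'}))
    print(encode_body("ABC"))
```

A server that receives the first body can parse it back into form fields, while the second body arrives only as raw data, which matches the "form" and "data" entries in the output above.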