Python Web Crawler Notes (1): A Basic Introduction to Fetching Web Pages

Latest recommended article published on 2023-08-04 11:58:12

weixin_39909859

Web scraping means fetching a specified resource from the network. These notes introduce the most basic modules used by Python crawlers.

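I. The requests module

r = requests.get(url) builds a Request object that asks the server for a resource, and returns a Response object containing the server's resource.

requests.get(url, params=None, **kwargs) takes three parameters:

url: the URL of the resource to fetch;

params: extra parameters appended to the URL, given as a dict or a byte stream (optional);

**kwargs: 12 optional parameters that control the request.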
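To see what params does to the URL, the request can be prepared without being sent. This is only a minimal sketch: the "/s" path and the "wd" parameter are illustrative assumptions, not part of the original notes.

```python
import requests

# Build (but do not send) a GET request; no network access is needed.
# The path "/s" and the key "wd" are hypothetical, chosen only for illustration.
req = requests.Request("GET", "http://www.baidu.com/s", params={"wd": "python"})
prepared = req.prepare()

# The dict passed as params is encoded into the query string of the URL.
print(prepared.url)
```

Calling requests.get with the same url and params would send this prepared URL to the server.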

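The two most important objects in the Requests library are Response and Request.

Attributes of the Response object returned by r = requests.get(url):

(1) r.status_code: the status code of the HTTP response; 200 means success, while codes such as 404 mean failure;

(2) r.text: the response body as a string, i.e. the content of the page at url;

(3) r.encoding: the encoding of the response body guessed from the HTTP headers;

(4) r.apparent_encoding: the encoding inferred from the response body itself;

(5) r.content: the response body in binary form.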
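The relationship between these attributes can be sketched offline. The Response below is built by hand so the snippet runs without network access; that construction is an assumption made for illustration, since in real use the object comes from requests.get(url).

```python
import io
import requests

# Hand-built Response (assumption: stands in for the result of requests.get).
r = requests.Response()
r.status_code = 200
r.raw = io.BytesIO("<title>百度一下</title>".encode("utf-8"))
r.encoding = "utf-8"

status = r.status_code   # HTTP status code of the response
body_bytes = r.content   # raw bytes of the response body
body_text = r.text       # the same bytes decoded using r.encoding

print(status, type(body_bytes).__name__, body_text)
```

In other words, r.text is r.content decoded with r.encoding, which is why a wrong r.encoding produces garbled text.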

Here is an example (written for Python 3.5+; a supplement for Python 2.7 follows later):

import requests

r = requests.get("http://www.baidu.com")
print(r.text)

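The printed output turns out to be garbled. Checking r.apparent_encoding shows 'utf-8'.

So the code above needs a small change. (If the header has no charset field, the encoding is assumed to be ISO-8859-1. When r.encoding does not give the real encoding this way, r.apparent_encoding can work it out by analysing the page content.)

import requests

r = requests.get("http://www.baidu.com")
r.encoding = 'utf-8'
print(r.text)

This prints the content correctly.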
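Why the wrong encoding yields garbled text can be reproduced with the standard library alone. The sample string below is an assumption chosen for illustration, not the actual page content.

```python
# UTF-8 bytes, as a server might send them (sample text is hypothetical).
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with ISO-8859-1 never fails, because every byte maps to some
# character, but multi-byte UTF-8 sequences turn into unrelated symbols.
garbled = raw.decode("iso-8859-1")

# Decoding with the real encoding restores the original text.
fixed = raw.decode("utf-8")

print(garbled)
print(fixed)
```

Setting r.encoding = 'utf-8' has the same effect on r.text as switching from the first decode call to the second.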

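Next comes a general-purpose basic framework for fetching a page. Here is an example:

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx and 5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "ERROR"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))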
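The role of raise_for_status() in this framework can be checked without touching the network. In this sketch the helper name status_outcome and the hand-built Response are assumptions for illustration only.

```python
import requests

def status_outcome(code):
    """Return 'ok' if raise_for_status() accepts the code, else 'ERROR'."""
    r = requests.Response()  # hand-built response, no request is sent
    r.status_code = code
    try:
        r.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx
        return "ok"
    except requests.HTTPError:
        return "ERROR"

results = {code: status_outcome(code) for code in (200, 404, 500)}
print(results)
```

Without the raise_for_status() call, a 404 page would be returned as if it were normal content, so the try/except would only catch connection-level failures.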

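II. Methods

The operations that HTTP defines on resources map onto requests.request(method, url, **kwargs).

Methods:

(1) GET: request the resource at the URL;

import requests

r = requests.get("http://www.baidu.com")

(2) HEAD: request the response headers for the resource at the URL, i.e. fetch only its header information;

(3) POST: request that new data be appended to the resource at the URL;

(4) PUT: request that a resource be stored at the URL, overwriting the resource currently there;

(5) PATCH: request a partial update of the resource at the URL, i.e. change part of its content;

(6) DELETE: request that the resource stored at the URL be deleted.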
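Each method name can be passed as the first argument of a request. The sketch below only prepares the requests and never sends them, so it shows how the method is recorded rather than how a server responds.

```python
import requests

url = "http://www.baidu.com"
methods = ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]

# Prepare one request per method; nothing is sent over the network.
prepared = [requests.Request(m, url).prepare() for m in methods]
names = [p.method for p in prepared]

# Method names are normalised to upper case during preparation.
lower = requests.Request("head", url).prepare().method

print(names, lower)
```

The shortcuts requests.get, requests.head, requests.post, requests.put, requests.patch and requests.delete correspond to requests.request called with the matching method name.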

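III. **kwargs

requests.request(method, url, **kwargs)

**kwargs: parameters that control the request (all optional)

1) params, data, json, headers. params: a dict or bytes added to the URL as query parameters; data: a dict, bytes, or file object sent as the request body; json: JSON data sent as the request body; headers: a dict of custom HTTP headers;

>>> hd = {'user-agent': 'Chrome/10'}
>>> r = requests.request('POST', 'http://ython/ws', headers=hd)

2) cookies: a dict or CookieJar, the cookies sent with the request;

3) auth: a tuple, enables HTTP authentication;

4) files: a dict used to upload files; open the file with open() and pass it in to submit it to the URL;

>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)

5) timeout: the timeout in seconds; an exception is raised if the request does not complete within this time;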

>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)

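6) proxies: a dict that sets the proxy servers to use, optionally with login credentials. The request goes out from the proxy server's IP address, which hides the user's original IP and guards against the crawler being traced back;

>>> pxs = {'http': 'http://user:pass@10.10.10.1:1234',
...        'https': 'http://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

7) allow_redirects: True/False, default True; switch for following redirects;

8) stream: True/False, default False; with False the response body is downloaded immediately, with True the download is deferred until the content is accessed;

9) verify: True/False, default True; switch for verifying the SSL certificate;

10) cert: path to a local SSL certificate.

Examples of the data parameter from item 1):

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('POST', 'http://python/ws', data=kv)
>>> body = 'main content'
>>> r = requests.request('POST', 'http://python/ws', data=body)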
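How headers, data and json end up in the outgoing request can be inspected by preparing the request instead of sending it. This is a sketch under assumptions: the URL is reused from the files example only as a placeholder, and no server is contacted.

```python
import json
import requests

url = "http://python123.io/ws"  # placeholder target, never contacted here
hd = {"user-agent": "Chrome/10"}
kv = {"key1": "value1", "key2": "value2"}

# data=dict is sent as a URL-encoded form body.
form_req = requests.Request("POST", url, headers=hd, data=kv).prepare()

# json=dict is serialised to JSON and sent with a JSON content type.
json_req = requests.Request("POST", url, headers=hd, json=kv).prepare()

print(form_req.headers["User-Agent"])
print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

Header lookups are case-insensitive, which is why the key 'user-agent' can be read back as "User-Agent".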

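Supplement (Python 2.7):

Let's start with an example:

1. Using urlopen from the urllib2 library

# Code

import urllib2  # import the urllib2 library

response = urllib2.urlopen("http://www.baidu.com")  # send a request to the given URL and get back a file-like object for the server's response

html = response.read()  # read the content with the response's read method

print html  # print the string

The page content then appears in the terminal.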

2. The Request class

For example, to add HTTP headers, go through Request:

import urllib2

request = urllib2.Request("http://www.baidu.com")

response = urllib2.urlopen(request)  # pass request as the argument to urlopen

html = response.read()

print html
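In Python 3, urllib2 no longer exists and its Request and urlopen live in urllib.request. A rough Python 3 counterpart of the code above might look like the sketch below; the helper names build_request and fetch and the "Mozilla/5.0" header value are assumptions for illustration.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request carrying a custom User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url):
    """Send the request and return the raw response bytes (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()

# Building the request does not touch the network, so it can be inspected safely.
req = build_request("http://www.baidu.com")

# urllib stores header names capitalised as "User-agent".
print(req.get_method(), req.full_url, req.get_header("User-agent"))
```

Calling fetch("http://www.baidu.com") would perform the actual download and return bytes, which can then be decoded to text.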

3. User-Agent

Access the website while identifying as a regular user:

import urllib2

url = "http://www.baidu.com"
