Web Scraping: Getting Started with the Requests Library

Latest recommended article published on 2023-10-30 09:15:54

韩明宇

Course link: https://www.icourse163.org/learn/BIT-1001870001?tid=1003245012#/

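The seven main methods of the Requests library

The requests.get() method

r = requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch
params: extra query parameters appended to the URL, as a dictionary or byte sequence; optional
**kwargs: 12 optional parameters that control the request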
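
To make the role of params concrete, the sketch below prepares a GET request without sending it and prints the URL that Requests would use. The endpoint and parameter names are assumptions chosen for illustration; they do not come from the course.

```python
import requests

# Hypothetical endpoint and parameters, used only to show how the URL is assembled.
url = "http://www.example.com/search"
params = {"q": "python", "page": 2}

# Prepare the request without sending it, so no network access is needed.
prepared = requests.Request("GET", url, params=params).prepare()
final_url = prepared.url
print(final_url)
```

Calling requests.get(url, params=params) would send a request to this same final URL.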

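The Response object

Attributes of the Response object

r.encoding: the encoding guessed from the HTTP response headers; if a text response's Content-Type header carries no charset, it defaults to ISO-8859-1

r.apparent_encoding: the encoding inferred by analyzing the response body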
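
The difference between the two attributes is easiest to see on a response whose headers carry no charset. The sketch below works offline: it builds a Response by hand and sets its private _content attribute, which is an illustration trick rather than normal usage, and the sample text is an assumption.

```python
import requests
from requests.utils import get_encoding_from_headers

# What Requests derives from the headers alone.
no_charset = get_encoding_from_headers({"content-type": "text/html"})
with_charset = get_encoding_from_headers({"content-type": "text/html; charset=utf-8"})
print(no_charset, with_charset)

# Build a Response by hand (no network). Setting the private _content
# attribute is for illustration only; real responses fill it in themselves.
original = "网络爬虫 web scraping"
r = requests.Response()
r._content = original.encode("utf-8")

r.encoding = no_charset   # the header-based default
garbled = r.text          # UTF-8 bytes decoded as ISO-8859-1
r.encoding = "utf-8"      # the kind of value r.apparent_encoding is meant to supply
decoded = r.text
print(garbled == original, decoded == original)
```

In real code the usual fix is r.encoding = r.apparent_encoding, as in the examples later in this post; the detection result depends on the installed detection library and on the content, so it is hard-coded here.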

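A general code framework for fetching web pages

Exceptions in the Requests library

Handling exceptions

r.raise_for_status(): raises requests.HTTPError if the status code indicates an error (4xx or 5xx); it does nothing for successful responses such as 200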
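
A small offline check of which status codes trigger the exception. The Response objects are built by hand here, which is an assumption made for the demonstration; in practice they come back from requests.get() and friends.

```python
import requests

def is_error(status_code):
    """Return True if raise_for_status() raises for this status code."""
    r = requests.Response()  # built by hand, no network involved
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

results = {code: is_error(code) for code in (200, 301, 404, 500)}
print(results)
```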

The general code framework
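
The framework itself is not reproduced in the text. A minimal sketch, consistent with the pattern used in the hands-on examples later in this post, might look like the following; the function name get_html_text, the 30-second timeout, and the choice to return None on failure are assumptions.

```python
import requests

def get_html_text(url, timeout=30):
    """Fetch a page and return its text, or None if anything goes wrong."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # turn 4xx/5xx responses into exceptions
        r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
        return r.text
    except requests.RequestException:     # base class of the Requests exceptions
        return None

# A URL without a scheme fails before any network access (MissingSchema),
# which makes the error path easy to see.
result = get_html_text("www.example.com/no-scheme")
print(result)
```

Catching requests.RequestException rather than using a bare except keeps unrelated errors, such as KeyboardInterrupt, from being silently swallowed.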

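The HTTP protocol and the Requests library methods

HTTP (Hypertext Transfer Protocol) is a stateless application-layer protocol based on a request/response model. It uses URLs to identify network resources.

A URL is the Internet path for accessing a resource over HTTP, and each URL corresponds to one data resource. Format: http://host[:port][path]

host: a valid Internet host domain name or IP address
port: the port number; defaults to 80
path: the path of the requested resource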
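
The three parts can be pulled out of a URL with the standard library. This is a small sketch; the example URLs and the helper name describe are assumptions.

```python
from urllib.parse import urlsplit

def describe(url):
    """Split an http URL into (host, port, path), applying the default port 80."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else 80
    return parts.hostname, port, parts.path

explicit = describe("http://www.example.com:8080/photos/picture.jpg")
implicit = describe("http://www.example.com/picture.jpg")
print(explicit)
print(implicit)
```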

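HTTP operations on resources

The difference between PATCH and PUT:

Suppose the URL holds a record UserInfo with 20 fields, including UserID and UserName.

Requirement: the user changes UserName and nothing else.

With PATCH, only a partial update containing UserName is submitted to the URL, which saves network bandwidth.
With PUT, all 20 fields must be submitted to the URL together; any field left out is deleted.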
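
The semantics can be imitated with plain dictionaries. This is only a local simulation of what a server would do, not a call to any real service; the field values and function names are made up, and the record is shortened to three fields.

```python
def put(resource, payload):
    """PUT semantics: the payload becomes the entire resource."""
    return dict(payload)

def patch(resource, payload):
    """PATCH semantics: only the submitted fields change."""
    updated = dict(resource)
    updated.update(payload)
    return updated

# A shortened stand-in for the 20-field UserInfo record.
user_info = {"UserID": 1, "UserName": "old_name", "Email": "user@example.com"}
change = {"UserName": "new_name"}

after_patch = patch(user_info, change)
after_put = put(user_info, change)
print(after_patch)
print(after_put)
```

After the PATCH the other fields survive; after the PUT only UserName is left, which is why a PUT has to carry every field.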

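The HTTP methods and the Requests library functions correspond one to one.

The head() method of the Requests library

The post() method of the Requests library (appends new data)

The put() method of the Requests library (overwrites existing data)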
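
For post() and put(), the type of the data argument decides how the request body is encoded. The sketch below prepares two POST requests without sending them; the httpbin.org URL is an assumed test endpoint and is never contacted.

```python
import requests

url = "http://httpbin.org/post"  # assumed test endpoint; never contacted here

# A dictionary is encoded as a form; a string is sent as the raw body.
form = requests.Request("POST", url, data={"key1": "value1", "key2": "value2"}).prepare()
raw = requests.Request("POST", url, data="ABC").prepare()

print(form.body, form.headers.get("Content-Type"))
print(raw.body, raw.headers.get("Content-Type"))
```

The same rule applies to requests.put(), which takes the same data argument.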

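The main Requests methods in detail

requests.request(method, url, **kwargs)

**kwargs: parameters that control the request, all optional:

params: dictionary or byte sequence, added to the URL as query parameters

data: dictionary, byte sequence, or file object, sent as the body of the Request

json: data in JSON format, sent as the body of the Request

headers: dictionary of custom HTTP headers

cookies: dictionary or CookieJar, the cookies sent with the Request
auth: tuple, enables HTTP authentication
files: dictionary, used to upload files

timeout: the timeout, in seconds

proxies: dictionary, sets the proxy servers to use; login credentials can be included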
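
Several of these parameters can be combined in one call. The sketch below prepares a request instead of sending it so that the effect of params, headers, and json can be inspected offline; the helper name build_request, the URL, and the payload are assumptions. timeout and proxies apply when a request is actually sent, so they do not appear here.

```python
import json
import requests

def build_request(method, url, **kwargs):
    """Prepare a request without sending it, to inspect what each kwarg does."""
    return requests.Request(method, url, **kwargs).prepare()

prepared = build_request(
    "POST",
    "http://httpbin.org/post",             # assumed test endpoint; never contacted
    params={"page": 1},                    # ends up in the query string
    headers={"User-Agent": "Mozilla/5.0"}, # custom HTTP header
    json={"name": "demo"},                 # serialized into the request body
)
body = json.loads(prepared.body)
print(prepared.url)
print(prepared.headers["User-Agent"], prepared.headers["Content-Type"])
print(body)
```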

The other six main methods:

requests.get(url,params=None,**kwargs)
requests.head(url,**kwargs)
requests.post(url,data=None,json=None,**kwargs)
requests.put(url,data=None,**kwargs)
requests.patch(url,data=None,**kwargs)
requests.delete(url,**kwargs)

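Hands-on web scraping with the Requests library

1. Fetching a JD.com product page

import requests
url = 'https://item.jd.com/100004404944.html'
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # raise HTTPError on a 4xx/5xx status
    r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
    print(r.text[:1000])
except requests.RequestException:
    print("Fetch failed")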

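2. Fetching an Amazon product page

import requests
url = 'https://www.amazon.cn/gp/product/B01M8L5Z3Y'
try:
    kv = {'User-Agent': 'Mozilla/5.0'}  # present the request as coming from a browser
    r = requests.get(url, headers=kv, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Fetch failed")

Here the headers field is used to make the HTTP request to the Amazon server look like it comes from a browser.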
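
The reason the header matters is that Requests identifies itself by default. The sketch below compares the default User-Agent with the overridden one by preparing both requests on a Session without sending them.

```python
import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
session = requests.Session()

# prepare_request merges the session's default headers into the request.
default = session.prepare_request(requests.Request("GET", url))
custom = session.prepare_request(
    requests.Request("GET", url, headers={"User-Agent": "Mozilla/5.0"})
)

default_ua = default.headers["User-Agent"]
custom_ua = custom.headers["User-Agent"]
print(default_ua)  # a string of the form python-requests/<version>
print(custom_ua)
```

A server that filters on this header can tell the default value apart from a browser, which is presumably why the override is needed here.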

3. Submitting search keywords to Baidu and 360

Baidu keyword interface: http://www.baidu.com/s?wd=keyword

360 keyword interface: http://www.so.com/s?q=keyword

import requests
keyword = 'python'
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv, timeout=30)
    print(r.request.url)  # the URL that was actually requested
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Fetch failed")
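
The listing above covers Baidu only. The 360 variant differs just in the base URL and the parameter name, which the following sketch makes explicit by building both URLs offline; the helper name build_search_url and the ENGINES table are assumptions.

```python
import requests

# Base URL and query parameter name for each interface quoted above.
ENGINES = {
    "baidu": ("http://www.baidu.com/s", "wd"),
    "360": ("http://www.so.com/s", "q"),
}

def build_search_url(engine, keyword):
    """Return the search URL Requests would use, without sending anything."""
    base, key = ENGINES[engine]
    return requests.Request("GET", base, params={key: keyword}).prepare().url

baidu_url = build_search_url("baidu", "python")
so_url = build_search_url("360", "python")
print(baidu_url)
print(so_url)
```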

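4. Fetching and saving an image from the web

Format of an image link: http://www.example.com/picture.jpg

National Geographic (China): http://www.ngchina.com.cn/

Pick a web page that shows an image: http://www.ngchina.com.cn/photography/photo_of_the_day/5946.html

Image link: http://image.ngchina.com.cn/2019/0629/20190629042347347.jpg

import requests
import os
url = 'http://image.ngchina.com.cn/2019/0629/20190629042347347.jpg'
root = 'D://pics//'
path = root + url.split('/')[-1]  # use the image's file name as the local file name
try:
    if not os.path.exists(root):
        os.mkdir(root)  # create the root folder if it does not exist
    if not os.path.exists(path):
        r = requests.get(url, timeout=30)  # fetch the image only if the file does not exist yet
        r.raise_for_status()  # avoid saving an error page as a .jpg
        with open(path, 'wb') as f:
            f.write(r.content)  # r.content holds the raw bytes of the response
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Fetch failed")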
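
Taking the last segment of the URL works for the link above, but it would keep a query string if the link had one. A slightly more defensive way to derive the local path is sketched below; the helper name local_path, the pics folder, and the query-string URL are assumptions.

```python
import os
from urllib.parse import urlsplit

def local_path(url, root):
    """Build the local file path from the last segment of the URL's path."""
    name = os.path.basename(urlsplit(url).path)  # drops any ?query part
    return os.path.join(root, name)

plain = local_path("http://www.example.com/picture.jpg", "pics")
with_query = local_path("http://www.example.com/picture.jpg?size=large", "pics")
print(plain)
print(with_query)
```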

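5. Automatic lookup of an IP address's location

Interface format: http://m.ip138.com/ip.asp?ip=ipaddress

import requests
url = 'http://m.ip138.com/ip.asp?ip='
try:
    r = requests.get(url + '106.39.41.16', timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # the result sits near the end of the page
except requests.RequestException:
    print("Fetch failed")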