Peking University, Chen Bin - Python Language Basics and Applications, D15 Basic Extension Modules: Web Crawlers

Article link: https://blog.csdn.net/m0_37877850/article/details/109164605

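The requests library

Supports HTTP persistent connections and connection pooling, SSL certificate verification, cookie handling, and streaming uploads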

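HTTP request types
requests.request(): constructs a request; the functions below are built on it
requests.get(): fetches an HTML page (HTTP GET)
requests.head(): fetches only the response headers of a page (HTTP HEAD)
requests.post(): submits a POST request
requests.put(): submits a PUT request
requests.patch(): submits a partial-modification (PATCH) request
requests.delete(): submits a DELETE request
requests.options(): submits an OPTIONS request, which asks the server which communication options it supports for the URL
Each of these returns a Response object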
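
To make the mapping between these functions and HTTP methods concrete, here is a minimal sketch that prepares a few requests without sending them. The URL and the parameter values are made up for illustration; requests.Request(...).prepare() is used only so that nothing touches the network.

```python
import requests

# Hypothetical URL used only for illustration; nothing is sent over the network.
URL = "http://example.com/items/1"


def build(method, **kwargs):
    """Prepare (but do not send) a request so its final form can be inspected."""
    return requests.Request(method, URL, **kwargs).prepare()


prepared = [
    build("GET", params={"page": 2}),      # query string is appended to the URL
    build("POST", data={"name": "demo"}),  # form data becomes the request body
    build("DELETE"),                       # no body at all
]

for p in prepared:
    print(p.method, p.url, p.body)
```

Against a real site, requests.get(url), requests.post(url, data=...) and so on send the equivalent request and return the Response.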

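The Response object
Contains all the information the server returned
.status_code: the HTTP status code of the response (for example, 200 means success)
.text: the response body as a string
.content: the response body as bytes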
……
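
The difference between .text and .content is easiest to see side by side. The sketch below builds a stand-in Response by hand so that it runs offline; the helper name summarize is hypothetical, and assigning to the private _content attribute is a testing shortcut, not something normal crawler code should do.

```python
import requests


def summarize(response):
    """Collect the three attributes listed above into a dict."""
    return {
        "status_code": response.status_code,
        "text": response.text,        # str, decoded using response.encoding
        "content": response.content,  # raw bytes
    }


# Stand-in response built by hand so the sketch runs without a network.
# Assigning to _content is a testing shortcut, not normal usage.
fake = requests.Response()
fake.status_code = 200
fake.encoding = "utf-8"
fake._content = "你好, crawler".encode("utf-8")

info = summarize(fake)
print(info["status_code"])
print(info["text"])
print(len(info["content"]), "bytes")
```

With a real request, the same three attributes are read from the object returned by requests.get().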

Customizing request headers
The request functions in requests accept a parameter named headers; pass it a dictionary to customize the request headers
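
As a rough illustration, the sketch below passes a headers dictionary and then inspects what would be sent. The header values and the URL are assumptions chosen for the example, and the request is only prepared, not sent.

```python
import requests

# Hypothetical header values and URL, for illustration only
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler/0.1)",
    "Referer": "http://example.com/",
}

# Prepare the request without sending it, then inspect the outgoing headers
prepared = requests.Request("GET", "http://example.com/", headers=headers).prepare()
for name, value in prepared.headers.items():
    print(name + ": " + value)

# With a real site the same dict is passed directly:
# r = requests.get(url, headers=headers, timeout=30)
```

Header names are matched case-insensitively, so prepared.headers["user-agent"] returns the same value.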

Setting a proxy
Pass the proxies parameter when sending a request to route it through a proxy, for example:

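import requests

# Map each URL scheme to the proxy server that should handle it
proxies = {
    "http": "http://10.10.10.10:1010",
    "https": "http://10.11.10.14:1011",
}
url = "http://news.qq.com/"
r = requests.get(url, proxies=proxies)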
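
The following sketch shows how the scheme of the target URL decides which proxy entry is used. It attaches the same dictionary to a Session and calls requests.utils.select_proxy, a helper from the library's utils module, purely to display the lookup; the example.com URLs are placeholders and no request is sent.

```python
import requests
from requests.utils import select_proxy

proxies = {
    "http": "http://10.10.10.10:1010",
    "https": "http://10.11.10.14:1011",
}

# A Session applies the same proxies to every request made through it
session = requests.Session()
session.proxies.update(proxies)

# select_proxy shows which entry applies to a given URL (placeholder URLs)
for target in ("http://example.com/", "https://example.com/"):
    print(target, "->", select_proxy(target, session.proxies))
```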

Beautiful Soup

A web page parser
It processes the page string downloaded with the requests library and extracts the useful information
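Search methods
Once the page is parsed, use the find methods to locate the information of interest
find_all(name, attrs, recursive, string, **kwargs)
Returns all tags in the document that match the conditions (as a list)
name: filter on the tag name
attrs: filter on tag attribute values
recursive: whether to search all descendants rather than only direct children; defaults to True
string: filter on the text between <>…</>
**kwargs: keyword arguments, treated as filters on attributes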
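
The parameters above can be tried on a small inline document. This is a minimal sketch: the HTML snippet is invented for the example, and it uses Python's built-in "html.parser" backend instead of lxml so that it runs without an extra install.

```python
from bs4 import BeautifulSoup

# Invented HTML snippet, for illustration only
html = (
    '<div id="news">'
    '<div class="text"><a href="/a1">First story</a></div>'
    '<div class="text"><a href="/a2">Second story</a></div>'
    '<p class="note">First story</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                              # name: every <a> tag
by_attrs = soup.find_all("div", attrs={"class": "text"})  # attrs: attribute filter
by_kwarg = soup.find_all("a", href="/a2")                 # **kwargs: attribute as keyword
by_string = soup.find_all(string="First story")           # string: match on text
top_level = soup.find_all("div", recursive=False)         # direct children only

print(len(by_name), len(by_attrs), len(by_kwarg), len(by_string), len(top_level))
```

Note that string= without a tag name returns the matching text nodes, not the tags that contain them.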

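Basic crawler workflow

Analyze the page structure
Find the id of the tag that holds the target content
Fetch the page
Use the requests library to send a request to the target site; a normal reply comes back as a Response object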

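import requests

url = "http://news.qq.com/"
# timeout: stop waiting if the server sends no data for 30 seconds
r = requests.get(url, timeout=30)
# r.text is the response body decoded to a string
print(r.text)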

The output is the HTML source of the page.
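
In practice the fetch step usually needs some error handling. The sketch below is one possible wrapper, not the course's own code: the function name fetch_html is hypothetical, and re-guessing the encoding with apparent_encoding is an optional choice that can help with pages whose declared charset is wrong.

```python
import requests


def fetch_html(url, timeout=30):
    """Return the page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.RequestException as exc:
        print("request failed:", exc)
        return None
    # Let requests guess the encoding from the body itself
    r.encoding = r.apparent_encoding
    return r.text


# Example call (needs network access):
# html = fetch_html("http://news.qq.com/")
```

Returning None on failure keeps the caller simple; a larger crawler might log the error and retry.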
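Parse the page
HTML code: use a web page parser
JSON data: use the json module
Binary data: write it to a file in wb mode, then process it further
The example below parses HTML with bs4, using the lxml parser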
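
The three cases above can be sketched as a small dispatch on the Content-Type of the response. The function name handle_body, the UTF-8 assumption, and the sample payloads are all hypothetical; a real crawler would take the bytes from r.content and the type from r.headers.

```python
import json
import os
import tempfile


def handle_body(content_type, body, save_path=None):
    """Pick a handling strategy from the Content-Type (assumes UTF-8 text)."""
    main_type = content_type.split(";")[0].strip().lower()
    if main_type == "application/json":
        # JSON data: decode with the json module
        return json.loads(body.decode("utf-8"))
    if main_type.startswith("text/"):
        # HTML code: return a string ready to hand to a page parser
        return body.decode("utf-8")
    # Binary data: write the bytes to a file in wb mode
    with open(save_path, "wb") as f:
        f.write(body)
    return save_path


print(handle_body("application/json; charset=utf-8", b'{"title": "demo"}'))
print(handle_body("text/html", b"<p>hello</p>"))
demo_path = os.path.join(tempfile.mkdtemp(), "image.bin")
print(handle_body("image/png", b"\x89PNG", demo_path))
```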

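from bs4 import BeautifulSoup

# r is the response returned by requests.get() in the fetch step
soup = BeautifulSoup(r.text, 'lxml')
# Each news entry sits in a <div class="text"> block
for news in soup.find_all('div', class_='text'):
    info = news.find('a')
    # find() returns None when the block contains no <a> tag
    if info is not None:
        title = info.get_text()
        link = str(info.get('href'))
        print('Title: ' + title)
        print('Link: ' + link + '\n')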
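
The same extraction can be wrapped in a function and tried on inline HTML, which makes it testable without hitting the site. This is a sketch under assumptions: extract_links is a hypothetical name, the sample markup is invented, "html.parser" stands in for lxml, and resolving relative links with urljoin is an addition not present in the course code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (title, absolute_link) pairs from <div class="text"> blocks."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", class_="text"):
        anchor = block.find("a")
        # Skip blocks with no link at all
        if anchor is None or not anchor.get("href"):
            continue
        title = anchor.get_text(strip=True)
        link = urljoin(base_url, anchor["href"])  # resolve relative hrefs
        results.append((title, link))
    return results


# Invented sample markup, for illustration only
sample = (
    '<div class="text"><a href="/a/1.htm"> Story one </a></div>'
    '<div class="text">no link here</div>'
    '<div class="text"><a href="http://example.com/2">Story two</a></div>'
)
for title, link in extract_links(sample, "http://news.qq.com/"):
    print(title, link)
```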