A simple Python crawler based on the requests, re, and BeautifulSoup libraries

曹博Blog

Published on 2024-07-31 20:14:59

Article link: https://blog.csdn.net/qq_46062641/article/details/140831614

Response types (status codes)

1XX: informational
2XX: success
3XX: redirection
4XX: client error
5XX: server error
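
The first digit of a status code is enough to tell which of the classes above a response belongs to. A minimal sketch, where the function name `status_class` is made up for illustration:

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to the class named by its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit of a 3-digit code.
    return classes.get(code // 100, "unknown")


for code in (200, 302, 404, 500):
    print(code, status_class(code))
```

With requests, the same check would normally be applied to `resp.status_code`.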

Session and Cookie

HTTP is a stateless protocol, so Session and Cookie are used to keep state (such as a login) across requests.
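
Because the protocol itself remembers nothing, the client has to send the cookie back with every request. The sketch below shows this with a `requests.Session` and no network access: the cookie is set by hand to stand in for a server's `Set-Cookie` reply, and the request is only prepared, not sent. The cookie name `sessionid`, its value, and the `example.com` URL are assumptions made up for the demo.

```python
import requests

session = requests.Session()
# Stand-in for a login response that carried "Set-Cookie: sessionid=abc123".
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# Prepare (but do not send) a later request made through the same session.
request = requests.Request("GET", "http://example.com/profile")
prepared = session.prepare_request(request)

# The session has attached the stored cookie to the outgoing headers.
print(prepared.headers.get("Cookie"))
```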

2. How the requests library handles HTTP requests

httpx.py

import requests
# The URL needs a scheme (http:// or https://); without one requests raises MissingSchema
resp=requests.get("http://www.baidu.com")
resp.encoding='utf-8'
print(resp.text)

# Send a POST request
data={'username':'admin','password':'admin234','verifycode':'0000'}
resp=requests.post(url='http://localhost:8080/xxxx/user/login',data=data)
# Handle encoding problems in the response
resp.encoding=resp.apparent_encoding
# Response body
print(resp.text)
# Print the response headers
print(resp.headers)
# Download an image
# Information gathering, asset collection
url=""  # image URL (left blank in the original)
resp=requests.get(url)
with open('./banner.jpg',mode='wb') as file:
    # resp.text is the response decoded as text
    # resp.content is the raw bytes
    file.write(resp.content)
# File upload
upload_url=""  # upload endpoint (left blank in the original)
file_path=""   # local file to upload (not defined in the original)
file={'batchfile':open(file_path,'rb')}
data={'batchname':'GB20211009'}
resp=requests.post(url=upload_url,data=data,files=file)
print(resp.text)

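# A problem appears:
# you have to log in first; after a successful login, take the cookie from the response
login_url='http://localhost:8080/xxxx/user/login'
login_data={'username':'admin','password':'admin234','verifycode':'0000'}
resp=requests.post(url=login_url,data=login_data)
cookie=resp.cookies
print(cookie)
# Reuse the cookie in later requests: build a header from it, or pass cookies= directly
upload_data={'batchname':'GB20211009'}
file['batchfile'].seek(0)  # rewind: the earlier upload already read the file to the end
resp=requests.post(url=upload_url,data=upload_data,files=file,cookies=cookie)
print(resp.text)
# Second way to keep the session alive (recommended)
# First create a Session object
session=requests.Session()
# Requests sent through this Session object keep the login state
resp=session.post(url=login_url,data=login_data)
file['batchfile'].seek(0)
resp=session.post(url=upload_url,data=upload_data,files=file)
print(resp.text)
print(type(resp.text))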


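# Handle JSON directly in Python (this assumes the response body is a JSON list)
import json
# dumps: serialize a Python object into a JSON string
# load: read JSON from a file object
# loads: deserialize a JSON string into a Python object
list_json=json.loads(resp.text)
print(list_json)
print(type(list_json))
print(list_json[1]['goodsname'])
# dumps goes the other way: Python object -> JSON string
json_str=json.dumps(list_json)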

# For HTTPS, one extra parameter is enough: verify=False skips certificate verification
resp=requests.get(url="https://www.xxxx.com",verify=False)
print(resp.text)
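
To see what `data=` actually puts on the wire for the login POST above, the request can be prepared without being sent. This is only an inspection sketch; it reuses the placeholder login URL and form fields from the code above and contacts no server.

```python
import requests

data = {'username': 'admin', 'password': 'admin234', 'verifycode': '0000'}
req = requests.Request("POST", "http://localhost:8080/xxxx/user/login", data=data)
prepared = req.prepare()  # builds headers and body, sends nothing

print(prepared.method, prepared.url)
# A dict passed as data= is form-encoded, and the Content-Type is set to match.
print(prepared.headers["Content-Type"])
print(prepared.body)
```

Passing `json=` instead of `data=` would send a JSON body, which is what many APIs expect.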

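A first taste of crawling

Basic principles of a crawler:

HTML can be viewed as one large string.
HTML is itself a markup language: structured text that shares its origins with XML, so it can be processed through the DOM.
At its core, every crawler follows hyperlinks to move between sites and pages.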
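
These points can be sketched on a hard-coded HTML string, with no network involved: the page is treated as plain text, the hyperlinks are pulled out with a regular expression, and relative links are resolved against the page URL. The function name `extract_links` and the `site.example` URLs are assumptions made up for the demo; `urljoin` comes from the standard library.

```python
import re
from urllib.parse import urljoin


def extract_links(base_url: str, html: str) -> list[str]:
    """Return absolute URLs for the <a href="..."> links found in html."""
    hrefs = re.findall(r'<a href="(.+?)"', html)
    # Skip in-page anchors and javascript: pseudo-links, resolve the rest.
    return [
        urljoin(base_url, href)
        for href in hrefs
        if not href.startswith(("#", "javascript"))
    ]


sample_html = '''
<a href="/news/1.html">News</a>
<a href="about.html">About</a>
<a href="#top">Top</a>
<a href="https://other.example/x.html">External</a>
'''
links = extract_links("https://site.example/index.html", sample_html)
for link in links:
    print(link)
```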

Regular expressions

spider_re.py

# "crawler" is another word for a spider
import requests
import re
import random
import time
import os

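def download_page(url):
    resp=requests.get(url,verify=False)
    # Parse all hyperlinks in the page
    # Greedy mode:      .*   matches as much as possible (up to a newline, which . does not match)
    # Non-greedy mode:  .+?  stops at the nearest possible match
    # findall(pattern, string)
    links=re.findall('<a href="(.+?)"',resp.text)
    # Make sure the output directory exists before writing into it
    page_dir=os.path.join(os.getcwd(),"xxxxnote","page")
    os.makedirs(page_dir,exist_ok=True)
    for link in links:
        print(link)
        # Skip article links, javascript: pseudo-links, and in-page anchors
        if 'articleid' in link or link.startswith('javascript') or link.startswith("#"):
            continue
        if "html" not in link:
            continue
        # Relative link: prepend the site URL ("https" also starts with "http")
        if not link.startswith("http"):
            link=f"{url}/{link}"
        print(link)
        resp=requests.get(link,verify=False)
        resp.encoding=resp.apparent_encoding
        # The link may contain characters that are not allowed in file names, so take the
        # last path segment, cut it at ".html", and add the suffix back when opening the file
        # file_name=link.split("/")[-1]+time.strftime("_%Y%m%d_%H%M%S")+".html"
        file_name=link.split("/")[-1].split(".html")[0]
        print(file_name)
        with open(os.path.join(page_dir,f"{file_name}.html"),"w",encoding=resp.encoding) as file:
            file.write(resp.text)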
# Crawl the images on the 蜗牛笔记 home page
def download_images(url):
    resp=requests.get(url,verify=False)
    images=re.findall('<img src="(.+?)"',resp.text)
    for image in images:
        # Turn a relative src into a full URL; leave absolute URLs alone
        if not image.startswith("http"):
            image=url+"/"+image
        resp=requests.get(image,verify=False)
        file_name=image.split("/")[-1]
        with open(f"./wxxxxote/image/{file_name}",mode="wb") as file:
            # content: the image as raw bytes
            # text: the same bytes decoded as text, which is useless for an image
            file.write(resp.content)

if __name__=="__main__":
    url="https://www.woxxxxte.com"
    # download_page(url)
    download_images(url)
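
The difference between the greedy and non-greedy patterns mentioned in the comments of `download_page` is easiest to see on a single line that contains two links. A small sketch on a made-up HTML string:

```python
import re

html = '<a href="a.html">A</a> <a href="b.html">B</a>'

# Greedy: .+ runs to the LAST double quote on the line, swallowing both links.
greedy = re.findall(r'<a href="(.+)"', html)
# Non-greedy: .+? stops at the FIRST closing quote, giving one match per link.
lazy = re.findall(r'<a href="(.+?)"', html)

print(greedy)
print(lazy)
```

This is why the crawler code uses `(.+?)` to capture the href values.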

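To crawl an entire site, first collect every URL inside the site and remove the duplicates, then crawl the content and save it locally or in a database for whatever comes next.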
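
One possible shape for the collect-and-deduplicate step is a breadth-first walk that remembers visited URLs in a set. The sketch below is not the author's implementation: the names `crawl_site` and `fetch`, and the in-memory `pages` dictionary that stands in for a real site, are all assumptions so the example runs without a network. In a real crawler `fetch` would wrap `requests.get`.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl_site(start_url, fetch):
    """Visit every page reachable from start_url once; return the visit order."""
    seen = {start_url}            # URLs already queued, used for de-duplication
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)
        for href in re.findall(r'<a href="(.+?)"', html):
            # Resolve relative links and drop "#fragment" parts before comparing.
            link = urldefrag(urljoin(url, href)).url
            # Stay inside the site and skip anything already seen.
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Fake three-page site whose pages link to each other in a loop.
pages = {
    "http://example.test/": '<a href="a.html">a</a> <a href="b.html">b</a>',
    "http://example.test/a.html": '<a href="b.html">b</a> <a href="/">home</a>',
    "http://example.test/b.html": '<a href="a.html#top">a</a>',
}
order = crawl_site("http://example.test/", lambda u: pages.get(u, ""))
print(order)
```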

BeautifulSoup

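BeautifulSoup parses page content based on the DOM tree structure. When parsing starts, the DOM tree of the whole page is held in memory, and lookups run against that tree.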
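
A small sketch of what working against that in-memory tree looks like, using a hard-coded page and the built-in `html.parser` so that nothing has to be downloaded. The HTML content is made up for the demo.

```python
from bs4 import BeautifulSoup

sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="main"><p class="intro">Hello</p><p>World</p></div>
  </body>
</html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Walk down the tree by tag name.
title_text = soup.head.title.string
# Find a node, then walk back up to its parent.
intro = soup.find("p", class_="intro")
parent_id = intro.parent["id"]
# Search only inside one subtree.
paragraphs = [p.string for p in soup.div.find_all("p")]

print(title_text, parent_id, paragraphs)
```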

Parsers

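Python standard library: BeautifulSoup(markup, "html.parser"). Built into Python, moderate speed, tolerant of malformed documents.
lxml HTML parser: BeautifulSoup(markup, "lxml"). Fast and tolerant of malformed documents; requires a C library to be installed (the commonly used choice).
1. pip install lxml
2. Build an XPath expression -> get the matching nodes -> read their attributes and values
lxml XML parser: BeautifulSoup(markup, "xml"). Fast, and the only parser that supports XML; requires a C library to be installed.
html5lib: BeautifulSoup(markup, "html5lib"). The best fault tolerance; parses the document the way a browser does and produces HTML5-format output; slow, but does not depend on external extensions.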
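
Since lxml is an optional install, one way to write code that runs either way is to try it first and fall back to the built-in parser. A sketch, assuming `bs4` is installed; the helper name `make_soup` is made up, and `FeatureNotFound` is the exception bs4 raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not available in this environment; use the standard library parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><body><p>hello</p></body></html>")
print(soup.p.string)
```

On badly broken markup the parsers can build slightly different trees, so results may differ depending on which branch was taken.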

Trying it in code

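from bs4 import BeautifulSoup
import requests
url="https://www.xxxxnote.com"
resp=requests.get(url,verify=False)
# apparent_encoding guesses the encoding from the page content itself, so it is
# usually more accurate than encoding, which is taken from the response headers.
# When a page comes out garbled, assign apparent_encoding to encoding.
resp.encoding=resp.apparent_encoding
# Initialize the parser
html=BeautifulSoup(resp.text,'lxml')
# Look up page elements by walking the tag hierarchy
print(html.head.title)          # <title>成都|重庆|西安|上海软件培训_IT培训_Java开发_软件测试-蜗牛学院</title>
# Get the text of the page title
print(html.head.title.string)   # 成都|重庆|西安|上海软件培训_IT培训_Java开发_软件测试-蜗牛学院
print(html.div)                 # the first div tag; this form only ever finds one
# General-purpose lookup methods:
# find_all(): search by tag name, attributes, text, etc. (BeautifulSoup itself has no XPath support)
# select(): CSS selectors such as div, #id, .class
# Find every hyperlink in the page; the result is a list [<a href="....">......</a>, <a href="....">......</a>,...]
links=html.find_all('a')
print(links)
for link in links:
    # .get() returns None instead of raising KeyError when a tag has no href
    print(link.get('href'))
# Find the images in the page
images=html.find_all("img")
for image in images:
    print(image.get('src'))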

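# Look up by attributes such as id or class; find() returns a single object
keyword=html.find(id="course-Window")
# Print an attribute of this tag
print(keyword["class"])
# Find every tag whose class is title; class_ is used to avoid clashing with the keyword class
titles=html.find_all(class_='title')
print(len(titles))
for title in titles:
    print(title.find(class_='title-con').string)
# Look up by text content (string= replaces the older text= argument)
title=html.find(string='开班计划')
# Get the parent element
print(title.parent)
print("="*20)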

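XPath-style lookup

# XPath-style lookup: this is the equivalent of //div[@class="title"].
# Adding more keys to the dict (for example 'id':'course-Window') matches several attributes at once.
titles=html.find_all('div',{'class':'title'})
for title in titles:
    print(title.find(class_='title-con').string)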

CSS selectors

# CSS selectors
title=html.select('div.title')
for t in title:
    print(t.select('.title-con')[0].string)
# select() always returns a list, even for a single match
keywords=html.select('#java')
print(keywords[0]["class"])
lis=html.select('ul li')
print(lis)
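
The attribute-dict lookup and the CSS-selector lookup can be compared side by side on a small hard-coded page. This is a sketch with made-up HTML that borrows the class and id names used above; `select_one` is the single-result counterpart of `select`.

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="title" id="java"><span class="title-con">Java</span></div>
<div class="title" id="python"><span class="title-con">Python</span></div>
<ul><li>one</li><li>two</li></ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Attribute-dict style: find the divs, then search inside each one.
by_find = [d.find(class_="title-con").string
           for d in soup.find_all("div", {"class": "title"})]
# CSS-selector style: one selector expresses the same nesting.
by_select = [s.string for s in soup.select("div.title .title-con")]

# select_one returns the first match (or None) instead of a list.
java_classes = soup.select_one("#java")["class"]
items = [li.string for li in soup.select("ul li")]

print(by_find, by_select, java_classes, items)
```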