Using the bs4 Package for Python Web Scraping

爱笑的蛐蛐

Article link: https://blog.csdn.net/weixin_62859191/article/details/126237398

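Introduction: bs4 is not part of the Python standard library. It is a third-party package that must be installed first by running pip install bs4 in a terminal (on PyPI, bs4 is a thin wrapper that pulls in the actual distribution, beautifulsoup4). What gets installed is a package, not a single module. This article uses the BeautifulSoup class from that package, which parses an HTML page so that the contents of its tags can be extracted.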

Creating a BeautifulSoup object: object_name = BeautifulSoup(page_source, 'parser_name')
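
As a minimal sketch of object creation, the snippet below parses a small inline HTML string instead of a downloaded page. The sample markup and the names `sample_html`, `soup`, `page_title`, and `first_paragraph` are made up for illustration; `'html.parser'` is the parser bundled with Python's standard library.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page source
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; 'html.parser' needs no extra install
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string        # text inside the <title> tag
first_paragraph = soup.p.get_text()   # text inside the first <p> tag
print(page_title, first_paragraph)
```

Other parsers such as lxml can be named in the same position, but they have to be installed separately.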

To look up content in a BeautifulSoup object, use two methods: find and find_all

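find(tag, attribute=value): returns the first matching tag as a Tag object, that is, the tag itself together with everything inside it, or None if nothing matches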

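find_all(tag, attribute=value): returns a list of all matching tags, each one a Tag object of the same kind that find returns; the list is empty if nothing matches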
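
The difference between the two methods can be seen on a small example. This is only a sketch: the markup and variable names below are invented for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
sample_html = """
<ul id="menu">
  <li class="item"><a href="/a.html">A</a></li>
  <li class="item"><a href="/b.html">B</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

first_link = soup.find("a")       # first matching Tag, or None if nothing matches
all_links = soup.find_all("a")    # list-like result holding every matching Tag
missing = soup.find("table")      # the sample has no <table>, so this is None

print(first_link)                                # the tag itself plus its contents
print([a.get("href") for a in all_links])        # href attribute of each link
print(missing)
```

Because find can return None, it is worth checking the result before chaining another call onto it.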

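# bs4 is a third-party package and must be installed
# Install it from a terminal: pip install bs4

import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 package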

url = "http://www.xinfadi.com.cn/priceDetail.html"
resp = requests.get(url)
html1 = resp.text
# print(html1)

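# Parse the data
# Hand the page source to BeautifulSoup to build a bs object
page = BeautifulSoup(html1, 'html.parser')  # name the parser explicitly, otherwise bs4 emits a warning
# Look up data in the bs object
# find(tag, attribute=value): the first argument is a tag name, the second filters on an attribute; returns the first matching tag (including the tag itself), or None
# find_all(tag, attribute=value): returns a list of all matching tags
# table = page.find("table", border="0")  # keyword form; to filter on class, write class_ because class is a Python keyword
table = page.find("table", attrs={'border': "0"})  # attrs takes a dict and has the same effect as the line above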
print(table)
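
The three ways of filtering by attribute mentioned in the comments (a keyword argument, the `attrs` dict, and `class_`) can be compared side by side. The sketch below runs on invented markup, so the class names and cell values are assumptions and not taken from the real price page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables that differ in class and border
sample_html = """
<table class="price" border="0"><tr><td>1.5</td></tr></table>
<table class="other" border="1"><tr><td>2.5</td></tr></table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

by_keyword = soup.find("table", border="0")            # attribute as keyword argument
by_attrs = soup.find("table", attrs={"border": "0"})   # same filter as a dict
by_class = soup.find("table", class_="price")          # class needs the trailing underscore

# All three lookups match the first table in the sample
same_tag = by_keyword == by_attrs == by_class
cell_text = by_class.find("td").get_text()
print(same_tag, cell_text)
```

The `attrs` dict is also the form to reach for when an attribute name is not a valid Python identifier, for example one containing a hyphen.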

A wallpaper scraper written with bs4:

# First, fetch the source of the main page

import requests
from bs4 import BeautifulSoup
import time

url = "http://pic.netbian.com/"

resp = requests.get(url)
resp.encoding = "gbk"  # set the encoding to avoid garbled text
html1 = resp.text
# print(html1)

# Hand the source to bs
main_page = BeautifulSoup(html1, 'html.parser')
alist = main_page.find('ul', class_="clearfix").find_all("a")  # collect every <a> tag inside that list
# print(alist)
for a in alist:
    href = url + a.get('href').strip('/')
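    # Fetch the source of the child page
    child_page_resp = requests.get(href)
    child_page_resp.encoding = "gbk"
    html2 = child_page_resp.text
    # Extract the download path from the child page
    child_page = BeautifulSoup(html2, 'html.parser')
    child_alist = child_page.find('div', class_="photo-pic")  # the tag that contains the download address
    # print(child_alist)
    # Get the download address
    img = child_alist.find('img')  # the <img> tag whose src is the download address
    src = url + img.get('src').lstrip('/')  # url already ends with "/", so drop any leading one
    # print(src)
    # Download the image
    img_src = requests.get(src)
    # img_src.content  # the response body as bytes
    img_name = src.split("/")[-1]  # everything after the last "/" in src
    with open(img_name, 'wb') as f:
        f.write(img_src.content)  # write the image to a file
    time.sleep(1)  # pause between downloads, after the file has been closed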
    print("over")
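
The scraper above builds absolute URLs by concatenating strings, which depends on where the slashes happen to fall. One possible alternative, sketched here with the standard library only, is `urllib.parse.urljoin`. The helper names `build_absolute_url` and `file_name_from_url` and the example paths are hypothetical and were not part of the original script.

```python
import posixpath
from urllib.parse import urljoin, urlparse

base_url = "http://pic.netbian.com/"


def build_absolute_url(base, link):
    """Resolve a relative or root-relative link against the base URL."""
    return urljoin(base, link)


def file_name_from_url(file_url):
    """Take the last path segment, ignoring any query string."""
    return posixpath.basename(urlparse(file_url).path)


# Hypothetical href/src values of the kind found in <a> and <img> tags
page_url = build_absolute_url(base_url, "/tupian/12345.html")
image_url = build_absolute_url(base_url, "/uploads/allimg/demo.jpg?v=1")
image_name = file_name_from_url(image_url)
print(page_url, image_url, image_name)
```

With helpers like these, the `href` and `src` lines in the loop would no longer need `strip('/')`, and the file name would stay clean even if the image URL carried a query string.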