Python Web Scraping Basics Notes: Focused Crawling with bs4

Latest recommended article published on 2024-10-08 20:27:53

wlrobot

Category: Web Scraping Basics Notes. Tags: python

Article link: https://blog.csdn.net/weixin_51382726/article/details/117212546

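This article describes how to parse HTML and XML data with Python's BeautifulSoup library: instantiating the object, locating tags, extracting data, and choosing between the available methods. It covers loading both local files and page source fetched from the web, the key operations find(), find_all(), select(), and hierarchical selectors, and techniques such as retrieving text and attributes.

Summary generated automatically by CSDN.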

bs4 (Python only)

How bs4 parses data:

1. Instantiate a BeautifulSoup object and load the page source into it.
2. Locate tags and extract data by calling the object's attributes and methods.
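
The two steps above can be sketched as follows. The markup and variable names are made up for illustration, and the sketch uses Python's built-in "html.parser" so that it runs without lxml; with lxml installed, 'lxml' can be passed instead, as in the rest of these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical page source used only for this illustration.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div class="item"><a href="/first">First link</a></div>
    <div class="item"><a href="/second">Second link</a></div>
  </body>
</html>
"""

# Step 1: instantiate a BeautifulSoup object and load the page source into it.
# "html.parser" ships with Python; "lxml" must be installed separately.
soup = BeautifulSoup(html, "html.parser")

# Step 2: locate a tag, then extract its text and one of its attributes.
first_link = soup.find("a")
link_text = first_link.get_text()
link_href = first_link["href"]

print(link_text, link_href)
```

Everything after step 1 works on the soup object, so the same step 2 code applies whether the source came from a file or from the network.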

Installing the environment

Install bs4 and the lxml parser, for example with pip install bs4 and pip install lxml.

Instantiating BeautifulSoup

1. Import the class: from bs4 import BeautifulSoup

2. Instantiate the object:

1. Load the data from a local HTML document into the object

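from bs4 import BeautifulSoup
# Open the local file and pass the file object directly to BeautifulSoup.
with open('./sogou.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')  # 'lxml' selects the lxml parser
    print(soup)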

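2. Load page source fetched from the internet into the object

# response is assumed to be the result of an earlier HTTP request, e.g. requests.get(...)
page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')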
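
As a small sketch of the two loading styles side by side: BeautifulSoup accepts either an open file object or a plain string. Here io.StringIO stands in for the local file and a literal string stands in for response.text, so nothing is read from disk or the network; the markup is hypothetical, and "html.parser" is used only to avoid the lxml dependency.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical markup standing in for both a saved file and a downloaded page.
markup = "<html><head><title>Sogou</title></head><body><p>hello</p></body></html>"

# Case 1: a file-like object, as returned by open(). io.StringIO plays that role here.
fp = io.StringIO(markup)
soup_from_file = BeautifulSoup(fp, "html.parser")

# Case 2: a plain string, such as response.text from an HTTP client.
page_text = markup
soup_from_text = BeautifulSoup(page_text, "html.parser")

# Both objects expose the same parsed document.
same_title = soup_from_file.title.string == soup_from_text.title.string
print(same_title)
```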

Methods and attributes for data parsing

soup.tagName returns the first tagName tag that appears in the HTML.

from bs4 import BeautifulSoup
with open('./江西理工大学.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.a)  # the first <a> tag in the document

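A minimal sketch of the soup.tagName behaviour, using invented markup rather than the page from these notes. It also shows what comes back when the tag does not exist at all. "html.parser" is used here only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <a> tags and no <table>.
html = """
<div id="nav">
  <a href="/home">Home</a>
  <a href="/news">News</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a        # only the first <a> in document order
missing = soup.table    # there is no <table>, so this is None

print(first_a)
print(missing)
```

Because a missing tag gives None rather than raising an error, it is worth checking the result before reading attributes from it.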

soup.find()

soup.find('tagName') is equivalent to soup.tagName: it returns the first matching tag.

from bs4 import BeautifulSoup
with open('./江西理工大学.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.find('div'))  # the first <div> tag in the document

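soup.find('div') is equivalent to soup.div.

2. Locating by attribute

from bs4 import BeautifulSoup
with open('./江西理工大学.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    # class_ has a trailing underscore because class is a Python keyword
    print(soup.find('div', class_='tab-item'))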

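Attribute-based lookup with find() can be sketched on a small invented document as below. The class name tab-item comes from the notes; the rest of the markup, the id value, and the variable names are assumptions for illustration. The attrs dictionary and id keyword forms are shown as alternatives to class_.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one unrelated <div> followed by two "tab-item" <div> tags.
html = """
<div class="header">Top</div>
<div class="tab-item" id="first">Tab one</div>
<div class="tab-item">Tab two</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ is used because class is a reserved word in Python.
by_class = soup.find("div", class_="tab-item")
# The same filter written as an attrs dictionary.
by_attrs = soup.find("div", attrs={"class": "tab-item"})
# Other attributes such as id can be passed as keyword arguments.
by_id = soup.find("div", id="first")
# find() returns None when nothing matches.
not_found = soup.find("div", class_="no-such-class")

print(by_class.get_text())
```

All three successful lookups return the first matching tag only, even though two tags carry the tab-item class.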

soup.find_all()

Finds every tag that matches the criteria and returns them as a list.

from bs4 import BeautifulSoup
with open('./江西理工大学.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.find_all('a'))  # all <a> tags, in document order

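A short sketch of working with the list that find_all() returns, again on invented markup. It collects the href of every link, shows the limit argument, and shows that an unmatched search yields an empty list rather than None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links.
html = """
<ul>
  <li><a href="/a">A</a></li>
  <li><a href="/b">B</a></li>
  <li><a href="/c">C</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                 # every <a> tag, in document order
hrefs = [a["href"] for a in links]         # pull one attribute out of each tag
first_two = soup.find_all("a", limit=2)    # stop after two matches
none_found = soup.find_all("table")        # empty list when nothing matches

print(hrefs)
```

Since the result behaves like a list, it can be indexed, sliced, and iterated in the usual way.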

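soup.select()

select('some selector') accepts a CSS selector (for example an id, class, or tag selector) and returns a list.

from bs4 import BeautifulSoup
with open('./江西理工大学.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.select('.share-pop'))  # all tags with class "share-pop"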
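
The selector kinds mentioned above can be compared on a small invented document. The class name share-pop is taken from the notes; the id main and the rest of the markup are assumptions. select_one() is included as the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three <p> tags, two of them with class "share-pop".
html = """
<div id="main">
  <p class="share-pop">one</p>
  <p class="share-pop">two</p>
  <p>three</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".share-pop")   # class selector
by_id = soup.select("#main")           # id selector
by_tag = soup.select("p")              # tag selector
first_match = soup.select_one(".share-pop")  # first match only, or None

print(len(by_class), len(by_id), len(by_tag))
```

Note that select() always returns a list, so even the id selector gives a one-element list that has to be indexed.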

2. Hierarchical selectors

from bs4 import BeautifulSoup
with open('./江西理工大学.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    # '>' matches direct children only; [0] takes the first match from the list
    print(soup.select('.share-pop > a')[0])
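
To make the hierarchy concrete, the sketch below contrasts the child combinator '>' with the descendant combinator (a space) on invented markup, then reads the text and the href attribute from the selected tag. The markup and variable names are assumptions; only the .share-pop > a selector comes from the notes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <a> directly inside .share-pop, one nested a level deeper.
html = """
<div class="share-pop">
  <a href="/direct">direct child</a>
  <span><a href="/nested">nested link</a></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

children = soup.select(".share-pop > a")    # '>' means one level: direct children only
descendants = soup.select(".share-pop a")   # a space means any depth below .share-pop

first_child = children[0]
child_text = first_child.get_text()         # text inside the tag
child_href = first_child["href"]            # value of the href attribute

print(child_text, child_href)
```

With this markup the child selector finds one link and the descendant selector finds two, which is the practical difference between the two combinators.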