Beautiful Soup 4: A Third-Party Library for Python Web Scraping

Latest recommended article published on 2024-10-01 05:04:32

儒雅的曹曹曹

Category: python. Tags: python, web scraping, programming language

Article link: https://blog.csdn.net/CYL_2021/article/details/127040838

1. Installation

pip install bs4

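2. Introduction

Beautiful Soup 4, bs4 for short, is an HTML/XML parsing library. Its main job is to parse HTML/XML documents and extract data from them. It supports CSS selectors, and it can work with the HTML parser in the Python standard library as well as with lxml's XML parser.
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html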

3. Basic usage

1. Building a BeautifulSoup object
Method 1: from a page fetched over the network

import urllib.request
from bs4 import BeautifulSoup

# Fetch the page and decode the response bytes into a str
url = "https://news.hist.edu.cn/kyyw/378.htm"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")

# Build the BeautifulSoup object; html is already a str, so from_encoding is not needed
bs = BeautifulSoup(html, "html.parser")

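Method 2: from a local file

from bs4 import BeautifulSoup

# open() only reads local files, so this assumes the page was saved as 378.htm beforehand
with open("378.htm", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # bytes input: the encoding is detected automatically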

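Note: "html.parser" selects the HTML parser that ships with the Python standard library. Beautiful Soup also accepts third-party parsers, which must be installed separately: "lxml" (HTML), "lxml-xml" or "xml" (XML), and "html5lib".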
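Both ways of building the object can be tried without network access. The sketch below uses a small made-up document (`sample_html` is a placeholder, not the page used above) to show the difference between passing a `str` and passing `bytes`: only the bytes case needs an encoding hint.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# From a str: the text is already decoded, so no encoding hint is needed.
soup_from_str = BeautifulSoup(sample_html, "html.parser")

# From bytes: from_encoding tells Beautiful Soup how to decode the input.
soup_from_bytes = BeautifulSoup(
    sample_html.encode("utf-8"), "html.parser", from_encoding="utf-8"
)

print(soup_from_str.title.string)    # Demo page
print(soup_from_bytes.title.string)  # Demo page
```

Passing `from_encoding` together with a `str` has no effect on parsing, since there is nothing left to decode.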
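2. Navigating and searching with attributes and methods

print(bs.prettify())    # print the document as an indented, formatted string
print(bs.title)         # the whole <title> tag
print(bs.title.name)    # the tag name, i.e. "title"
print(bs.title.string)  # the text inside <title>
print(bs.head)          # the <head> tag and everything in it
print(bs.div)           # the first <div> tag and everything in it
print(bs.div["id"])     # the id attribute of the first <div> (KeyError if it has none)
print(bs.a)             # the first <a> tag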
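To see what these attributes return, here is a minimal sketch on a made-up document (`sample_html` and its contents are assumptions for illustration). It also shows the difference between subscript access and `.get()` when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
sample_html = """
<html><head><title>Demo page</title></head>
<body>
<div id="main"><a href="/a.htm">First</a></div>
<div><a href="/b.htm">Second</a></div>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title.name)    # title
print(soup.title.string)  # Demo page
print(soup.div["id"])     # main  (dotted access returns the first matching tag)
print(soup.a["href"])     # /a.htm

# tag["attr"] raises KeyError when the attribute is missing; tag.get("attr") returns None.
second_div = soup.find_all("div")[1]
print(second_div.get("id"))  # None

# Dotted access to a tag that does not exist returns None as well.
print(soup.table)  # None
```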

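find(): returns the first tag that matches the search criteria, or None if nothing matches.
find_all(): returns all tags that match the search criteria as a list.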

print(bs.find_all("a"))  # all <a> tags
for item in bs.find_all("a"):
    print(item.get("href"))  # the href value of each <a> tag
for item in bs.find_all("a"):
    print(item.get_text())  # the text content of each <a> tag
# Filtering by attribute
print(bs.find_all(id="u1"))  # all tags with id="u1"
print(bs.find_all("a", class_="app"))  # all <a> tags whose class includes "app"
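The difference between find() and find_all() is easiest to see on a small input. The following sketch uses a made-up list of links (`sample_html` is an assumption, not the page used above) and collects the results into plain Python lists.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<ul id="u1">
  <li><a class="app" href="/one.htm">One</a></li>
  <li><a class="app extra" href="/two.htm">Two</a></li>
  <li><a href="/three.htm">Three</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

first_link = soup.find("a")                   # a single Tag (the first match)
all_links = soup.find_all("a")                # a list of every match
app_links = soup.find_all("a", class_="app")  # matches any tag whose class list contains "app"

hrefs = [a.get("href") for a in all_links]
app_texts = [a.get_text() for a in app_links]

print(first_link.get_text())   # One
print(hrefs)                   # ['/one.htm', '/two.htm', '/three.htm']
print(app_texts)               # ['One', 'Two']

# When nothing matches, find() returns None and find_all() returns an empty list.
print(soup.find("table"))      # None
print(soup.find_all("table"))  # []
```

Because find() can return None, checking its result before chaining further calls avoids an AttributeError on pages that lack the expected tag.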

3. Searching with CSS selectors

bs.select("p")  # by tag name
bs.select(".app")  # by class name
bs.select("#link")  # by id
bs.select("p #link")  # combined: the element with id="link" inside a <p>
bs.select("a[href='http://baidu.com']")  # by attribute value
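Each of the selectors above can be checked against a small made-up document. In this sketch `sample_html` is an assumption chosen so that every selector matches something; select() always returns a list, while select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<div>
  <p class="app">Intro <a id="link" href="http://baidu.com">Baidu</a></p>
  <p>Other <a href="http://example.com">Example</a></p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = soup.select("p")                        # by tag name
by_class = soup.select(".app")                       # by class name
by_id = soup.select("#link")                         # by id
nested = soup.select("p #link")                      # descendant combinator
by_attr = soup.select("a[href='http://baidu.com']")  # by attribute value
first_link = soup.select_one("p a")                  # first match only

print(len(paragraphs))        # 2
print(by_class[0].get_text()) # Intro Baidu
print(by_id[0]["href"])       # http://baidu.com
print(first_link.get_text())  # Baidu
```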

4. Example

import urllib.request
from bs4 import BeautifulSoup

url = "https://news.hist.edu.cn/kyyw/378.htm"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
bs = BeautifulSoup(html, "html.parser")
print(bs.prettify())  # print the formatted HTML structure
# print(bs.find_all("a"))
divs = bs.find_all("div", {"class": "sec-a"})
lis = divs[0].find_all("li")
# Write each news link and title to xinwen.txt, one "href,title" pair per line
with open("xinwen.txt", "w", encoding="utf-8") as fp:
    for li in lis:
        link = li.find("a")
        fp.write(link.get("href") + "," + link.get("title") + "\n")
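
The example above assumes that every li contains an a tag with both href and title attributes, and it writes the links exactly as they appear in the page. A slightly more defensive variant might look like the sketch below. The helper name `extract_news` and the markup in `sample_html` are hypothetical; the markup only imitates the structure the example relies on (a div with class sec-a holding li items), so the code runs without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://news.hist.edu.cn/kyyw/378.htm"

# Hypothetical markup imitating the assumed page structure; not the real page content.
sample_html = """
<div class="sec-a">
  <ul>
    <li><a href="../info/1001/1.htm" title="First headline">First headline</a></li>
    <li><a href="https://example.com/2.htm" title="Second headline">Second</a></li>
    <li>No link here</li>
  </ul>
</div>
"""

def extract_news(html, base):
    """Return (absolute_url, title) pairs found in the div.sec-a list."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="sec-a")
    if container is None:
        return []
    rows = []
    for li in container.find_all("li"):
        link = li.find("a")
        if link is None or not link.get("href"):
            continue  # skip list items without a usable link
        # Fall back to the link text when the title attribute is missing.
        title = link.get("title") or link.get_text(strip=True)
        # Resolve relative links against the page URL.
        rows.append((urljoin(base, link["href"]), title))
    return rows

rows = extract_news(sample_html, base_url)
for href, title in rows:
    print(href + "," + title)
```

The returned pairs can be written to xinwen.txt in the same way as in the example above.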