Common Python methods for parsing HTML: BeautifulSoup, lxml, and XPath

山水阳泉曲

Tags: html python beautifulsoup

Article link: https://blog.csdn.net/zhangdonghuirjdd/article/details/141461493

Contents

Introduction
XPath comparison
- XPath syntax
- Test code
Usage example

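Introduction

BeautifulSoup is a Python library for extracting data from HTML or XML files. It builds a parse tree from the document, which makes it easy to pull out data such as tag names, attributes, and string content.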
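To make the idea of a parse tree concrete, here is a minimal sketch. The one-line HTML snippet is made up for illustration; the calls used (`BeautifulSoup(...)`, `.name`, `.attrs`, `.string`) are standard BeautifulSoup attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration
html = '<p class="title" id="intro"><b>Hello</b></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                 # first <p> tag in the tree
tag_name = p.name          # the tag name
attrs = p.attrs            # attributes as a dict; class is multi-valued, so it is a list
text = p.b.string          # string content of the nested <b> tag

print(tag_name, attrs, text)
```

Note that `class` comes back as a list because an HTML element can carry several classes, while `id` is a plain string.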

Installation

pip install beautifulsoup4 lxml

Test

from bs4 import BeautifulSoup

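Common methods

Finding tags and their full content

Use the .find() or .find_all() method to look up tags.
find() returns the first matching tag, or None if nothing matches; find_all() returns a list of every match.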

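# soup is a BeautifulSoup object; the test code further down builds one from html_doc

# Find the first <a> tag
a_tag = soup.find('a')

# Find all <a> tags
a_tags = soup.find_all('a')

# Find all <a> tags whose class is 'sister'
sister_tags = soup.find_all('a', class_='sister')

# Find the tag whose id is 'link1'
link1 = soup.find(id='link1')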
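The difference between the two methods matters most when nothing matches. The following self-contained sketch (the HTML string is an assumption made up for the example) shows that `find()` gives back `None` while `find_all()` gives back an empty list, so only the former needs a `None` check before use.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = (
    '<div>'
    '<a class="sister" id="link1" href="/a">A</a>'
    '<a class="other" id="link2" href="/b">B</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                          # first <a> only
all_links = soup.find_all("a")                  # every <a>
sisters = soup.find_all("a", class_="sister")   # filtered by class

missing_one = soup.find("img")       # no <img> in the document
missing_all = soup.find_all("img")   # empty result, safe to iterate

print(first["id"], len(all_links), [t["id"] for t in sisters])
print(missing_one, len(missing_all))
```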

Getting tag attributes

# Get the href attribute of the <a> tag
href = a_tag.get('href')

# Get the text of the <a> tag
text = a_tag.text

# Get the text of the <b> tag (nested inside <p>)
title = soup.find('p', class_='title').b.text
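As a small supplement, the sketch below contrasts `.get()` with subscript access for attributes, and shows `get_text(strip=True)` for trimming whitespace. The HTML string is hypothetical; the behavior shown is standard for BeautifulSoup tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<a href="http://test.com/a" class="sister">  Link A  </a>'
a_tag = BeautifulSoup(html, "html.parser").find("a")

href = a_tag.get("href")                  # attribute value
no_title = a_tag.get("title")             # missing attribute: None
with_default = a_tag.get("title", "n/a")  # missing attribute with a fallback

# Subscript access raises KeyError when the attribute is absent
try:
    a_tag["title"]
    raised = False
except KeyError:
    raised = True

clean_text = a_tag.get_text(strip=True)   # text with surrounding whitespace removed

print(href, no_title, with_default, raised, clean_text)
```

Using `.get()` is usually the safer choice when scraping pages where an attribute may be absent.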

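XPath comparison

BeautifulSoup does not support XPath expressions directly. XPath is a query language for locating nodes in XML documents, whereas BeautifulSoup, although it parses both HTML and XML, searches the document tree with its own methods such as .find() and .find_all(). If you need XPath in a project that already uses BeautifulSoup, consider the lxml library: it can serve as BeautifulSoup's underlying parser, it can parse HTML on its own (for example through lxml.html or etree.HTML), and it supports XPath expressions.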
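The two roles of lxml can be sketched side by side: as the parser behind BeautifulSoup, and as a standalone parser queried with XPath. The HTML string is made up for the example; `BeautifulSoup(doc, "lxml")` and `lxml.html.fromstring` are existing entry points.

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html

# Hypothetical document for illustration
doc = (
    '<html><body>'
    '<a href="/a" class="sister">A</a>'
    '<a href="/b" class="other">B</a>'
    '</body></html>'
)

# 1) lxml as BeautifulSoup's parser, searched with BeautifulSoup methods
soup = BeautifulSoup(doc, "lxml")
bs_hrefs = [a.get("href") for a in soup.find_all("a", class_="sister")]

# 2) lxml on its own, searched with an XPath expression
tree = lxml_html.fromstring(doc)
xp_hrefs = tree.xpath('//a[@class="sister"]/@href')

print(bs_hrefs, xp_hrefs)
```

Both queries select the same link here; which style to use is mostly a matter of preference and of how complex the selection logic is.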

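XPath syntax

Basic selection

Selecting elements: use an element name to select elements with that name. For example, book selects all <book> child elements of the current node.
Selecting by attribute: use @ followed by an attribute name inside square brackets to select elements by attribute. For example, book[@category='cooking'] selects all <book> elements whose category attribute equals cooking.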
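A minimal sketch of these two selections with lxml follows. The bookstore XML is sample data invented for the example, not something taken from a real source.

```python
from lxml import etree

# Hypothetical sample data
xml = """<bookstore>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title>Harry Potter</title><price>29.99</price></book>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# Element name: all <book> children of the current node (the root here)
all_books = root.xpath("book")

# Attribute test: only books whose category is 'cooking'
cooking = root.xpath("book[@category='cooking']")
cooking_titles = [b.findtext("title") for b in cooking]

print(len(all_books), cooking_titles)
```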

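Path expressions

Child elements: a leading / starts the path at the document root, and // matches nodes at any depth. parent/child selects all child elements that are direct children of parent.
Descendant elements: //tag selects every tag element in the document, wherever it is located.
Parent element: .. selects the parent of the current node. It is shorthand for parent::node() and is already available in XPath 1.0, the version that lxml implements.
Sibling elements: use the following-sibling:: and preceding-sibling:: axes. For example, a/following-sibling::b[1] selects the nearest <b> sibling that follows <a>.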
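The path forms can be tried out with lxml, which implements XPath 1.0. The sketch below uses invented bookstore data and shows absolute paths, `//`, the parent step `..`, and the two sibling axes.

```python
from lxml import etree

# Hypothetical sample data
xml = """<bookstore>
  <book category="cooking"><title>Everyday Italian</title></book>
  <book category="children"><title>Harry Potter</title></book>
  <book category="web"><title>Learning XML</title></book>
</bookstore>"""
root = etree.fromstring(xml)

# Absolute path from the document root
titles_abs = root.xpath("/bookstore/book/title/text()")

# Any depth
titles_any = root.xpath("//title/text()")

# Parent step: from a <title> up to its <book>
parent_cat = root.xpath("//title[text()='Learning XML']/../@category")

# Sibling axes: nearest following and nearest preceding <book>
next_cat = root.xpath(
    "//book[@category='cooking']/following-sibling::book[1]/@category")
prev_cat = root.xpath(
    "//book[@category='web']/preceding-sibling::book[1]/@category")

print(titles_abs, parent_cat, next_cat, prev_cat)
```

On a reverse axis such as `preceding-sibling::`, position 1 is the node closest to the context node, which is why both sibling queries land on the middle book.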

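Predicates

Position-based selection: square brackets [] can select elements by position. For example, book[1] selects the first <book> child of its parent (XPath positions start at 1, not 0), and book[last()] selects the last <book> child.
Condition-based filtering: an expression inside the brackets filters elements. For example, book[price>35] selects all <book> elements that have a <price> child greater than 35.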
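Here is a short sketch of both kinds of predicate, again on invented bookstore data.

```python
from lxml import etree

# Hypothetical sample data
xml = """<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

first_title = root.xpath("book[1]/title/text()")        # position 1 is the first book
last_title = root.xpath("book[last()]/title/text()")    # last book
expensive = root.xpath("book[price>35]/title/text()")   # numeric condition on <price>

print(first_title, last_title, expensive)
```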

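Wildcards

* matches any element. For example, //*/title selects all title elements regardless of which element is their parent.
@* matches any attribute. For example, book[@*] selects all <book> elements that have at least one attribute.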
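The wildcards can be illustrated with a small made-up document that mixes element names and includes one element without attributes.

```python
from lxml import etree

# Hypothetical sample data
xml = """<bookstore>
  <book category="cooking"><title>A</title></book>
  <book><title>B</title></book>
  <magazine issue="7"><title>C</title></magazine>
</bookstore>"""
root = etree.fromstring(xml)

# Any parent element: every <title>, whether under <book> or <magazine>
titles = root.xpath("//*/title/text()")

# <book> elements with at least one attribute
books_with_attr = root.xpath("book[@*]/title/text()")

# Any child element with at least one attribute
tags_with_attr = [el.tag for el in root.xpath("*[@*]")]

# Every attribute value in the document
attr_values = root.xpath("//@*")

print(titles, books_with_attr, tags_with_attr, attr_values)
```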

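Comparing text and attribute values

Equal: = tests whether two values are equal. For example, book[@category='cooking'].
Not equal: != tests whether two values differ.
Other comparison operators: <, <=, >, and >= are also available. In XPath 1.0, which lxml implements, they convert their operands to numbers, so they suit numeric values rather than string ordering.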
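One subtlety worth seeing in code: `@category!='cooking'` only matches elements that actually have a `category` attribute, while `not(@category='cooking')` also matches elements that lack it. The sketch below uses invented data with one uncategorized book to show this, along with a numeric comparison.

```python
from lxml import etree

# Hypothetical sample data; the last book has no category attribute
xml = """<bookstore>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title>Harry Potter</title><price>29.99</price></book>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
  <book><title>No Category</title><price>10.00</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# != requires the attribute to exist
ne_titles = root.xpath("book[@category!='cooking']/title/text()")

# not(...) also keeps books without the attribute
not_titles = root.xpath("book[not(@category='cooking')]/title/text()")

# Numeric comparison on the <price> child
cheap = root.xpath("book[price<=30]/title/text()")

print(ne_titles, not_titles, cheap)
```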

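Compound expressions

Logical and: the and keyword combines several conditions that must all hold. For example, book[price>35 and @category='cooking'].
Logical or: the or keyword selects elements that satisfy at least one of the conditions.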
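A brief sketch of `and` and `or` inside a predicate, using invented bookstore data:

```python
from lxml import etree

# Hypothetical sample data
xml = """<bookstore>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title>Harry Potter</title><price>29.99</price></book>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# Both conditions must hold
pricey_web = root.xpath("book[price>35 and @category='web']/title/text()")

# No cooking book costs more than 35 in this data, so the result is empty
pricey_cooking = root.xpath("book[price>35 and @category='cooking']/title/text()")

# Either condition is enough
cooking_or_children = root.xpath(
    "book[@category='cooking' or @category='children']/title/text()")

print(pricey_web, pricey_cooking, cooking_or_children)
```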

Test code

from bs4 import BeautifulSoup
from lxml import etree 

html_doc = """  
<html><head><title>The Test</title></head>  
<body>  
<p class="title"><b>The Dormouse's story</b></p>  
<a href="http://test.com/a" class="sister" id="link1">A链接</a>  
<a href="http://test.com/b" class="sister2" id="link2">B链接</a>  
<a href="http://test.com/c" class="sister2" id="link3">C链接</a> 
<a href="http://test.com/d" class="sister" id="link4">D链接</a> 
</body>  
</html>  
"""  
  
soup = BeautifulSoup(html_doc, 'html.parser') 
# Find the first <a> tag
a_tag = soup.find('a')
print(a_tag)

# Find all <a> tags; the result is a list
a_tags = soup.find_all('a')
print(a_tags)
print(a_tags[2])


# Find all <a> tags whose class is 'sister'
sister_tags = soup.find_all('a', class_='sister')
print("class", sister_tags)

# Find the tag whose id is 'link1'
link1 = soup.find(id='link1')
print("id=link1", link1)

print("Getting tag attributes")
# Get the href attribute of the <a> tag
href = a_tag.get('href')

# Get the text of the <a> tag
text = a_tag.text
print("href-txt", href, text)
# Get the text of the <b> tag (nested inside <p>)
title = soup.find('p', class_='title').b.text
print(title)

print("==========")
# Parse the HTML with lxml
tree = etree.HTML(html_doc)

# Use XPath to find all <a> tags whose class attribute is exactly 'sister'
sister_links = tree.xpath('//a[@class="sister"]')

# Iterate over the matched tags
for link in sister_links:
    print(link.get('href'))  # print each link's href attribute

first_sister_link = tree.xpath('//a[@class="sister"]')[0]
print(first_sister_link)
print(first_sister_link.get('href'))  # href of the first sister link

second_sister_link = tree.xpath('//a[@class="sister"]')[1]
print(second_sister_link)
print(second_sister_link.get('href'))  # href of the second sister link
print("===")
# Note: even with a positional predicate, xpath() still returns a list
first_sister_link = tree.xpath('//a[@class="sister"][1]')
print(first_sister_link)
print(first_sister_link[0].get('href'))  # href of the first sister link
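Two details of the XPath queries above are easy to miss, and the following sketch tries to make them visible. First, `//a[...][1]` picks the first match under each parent, whereas `(//a[...])[1]` picks the first match in the whole document. Second, `@class="sister"` is an exact string comparison, while BeautifulSoup's `class_='sister'` also matches tags carrying additional classes. The HTML is made up for the example, and the `contains(concat(...))` expression is a common XPath 1.0 idiom for class-token matching rather than anything specific to lxml.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup: two parents, and one link with an extra class
doc = (
    '<div><a class="sister" href="/a">A</a></div>'
    '<div>'
    '<a class="sister" href="/b">B</a>'
    '<a class="sister" href="/c">C</a>'
    '<a class="sister big" href="/d">D</a>'
    '</div>'
)
tree = etree.HTML(doc)

exact = tree.xpath('//a[@class="sister"]/@href')              # exact class string
first_per_parent = tree.xpath('//a[@class="sister"][1]/@href')  # first match in each <div>
first_overall = tree.xpath('(//a[@class="sister"])[1]/@href')   # first match in the document

# Class-token match, similar in spirit to BeautifulSoup's class_ filter
token_match = tree.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " sister ")]/@href')

soup = BeautifulSoup(doc, "html.parser")
bs_match = [a["href"] for a in soup.find_all("a", class_="sister")]

print(exact, first_per_parent, first_overall, token_match, bs_match)
```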

Usage example

## Functional test: download an HTML page and parse data from it
import requests
from bs4 import BeautifulSoup

def fetch_movie_data(url):
    # Send a browser-like User-Agent; some sites reject the default one
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    # A timeout keeps the call from hanging indefinitely on a stalled connection
    response = requests.get(url, headers=headers, timeout=10)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Get the list of movies
        movie_list = soup.find_all('div', class_='item')
        
        for movie in movie_list:
            title = movie.find('span', class_='title').text
            # The first line of this <p> holds the director and cast
            director_and_cast = movie.find('p', class_='').text.strip()
            director = director_and_cast.split('\n')[0].strip()
            # The category is the last '/'-separated field of the same <p>
            category = movie.find('p', class_='').text.split('/')[-1].strip()
            # The one-line summary is optional
            summary = movie.find('span', class_='inq')
            summary = summary.text if summary else "No summary"
            
            print(f"Title: {title}")
            print(f"Category: {category}")
            print(f"Director: {director}")
            print(f"Summary: {summary}\n")
    else:
        print("Request failed, status code:", response.status_code)

# URL of the Douban Movie Top 250 page
url = 'https://movie.douban.com/top250'

fetch_movie_data(url)
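To exercise the parsing logic without a network request, the extraction step can be pulled into its own function and run on a static string. The sketch below is only that: `SAMPLE_HTML` is a simplified, invented structure (not the real page markup), `parse_movies` is a hypothetical helper name, and it looks up the info paragraph with a plain `find('p')` instead of the `class_=''` filter used above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may differ
SAMPLE_HTML = """
<div class="item">
  <span class="title">Example Movie</span>
  <p>
    Director: Jane Doe
    1999 / Country / Drama
  </p>
  <span class="inq">A one-line summary.</span>
</div>
<div class="item">
  <span class="title">Another Movie</span>
  <p>
    Director: John Roe
    2005 / Country / Comedy
  </p>
</div>
"""

def parse_movies(html):
    """Return a list of dicts, one per <div class="item"> block."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.find_all("div", class_="item"):
        title = item.find("span", class_="title").get_text(strip=True)
        # Split the info paragraph into trimmed lines
        lines = [ln.strip() for ln in item.find("p").get_text().strip().split("\n")]
        director = lines[0]
        category = lines[-1].split("/")[-1].strip()
        # The summary span is optional
        inq = item.find("span", class_="inq")
        summary = inq.get_text(strip=True) if inq else "No summary"
        movies.append({"title": title, "director": director,
                       "category": category, "summary": summary})
    return movies

movies = parse_movies(SAMPLE_HTML)
for m in movies:
    print(m)
```

Separating download from parsing in this way makes the parsing step testable offline and easier to adjust when the page layout changes.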