Datawhale task2 - CSDN Blog

Article link: https://blog.csdn.net/qq_45697900/article/details/105719967

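XPath
XPath (XML Path Language) is a language for addressing nodes in an XML document. Put simply, it finds a tag element by following the element's path through the document.
Using XPath
1. XPath can locate elements by ID, class, and name
1) Locate by ID        //*[@id='kw']
2) Locate by class     //*[@class='class_name']
3) Locate by name      //*[@name='name']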
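The three locator forms above can be tried out with lxml, the same library this post later uses as the Beautiful Soup parser. This is a minimal sketch: the form markup and the attribute values other than 'kw' are made up for illustration and do not come from a real page.

```python
from lxml import etree

# Hypothetical markup, used only to exercise the three locators.
doc = etree.fromstring(
    "<form>"
    "<input id='kw' class='s_ipt' name='wd'/>"
    "<input id='su' class='btn' name='submit'/>"
    "</form>"
)

# Each query returns a list of matching elements.
by_id = doc.xpath("//*[@id='kw']")
by_class = doc.xpath("//*[@class='btn']")
by_name = doc.xpath("//*[@name='wd']")

print(by_id[0].get("name"))     # wd
print(by_class[0].get("id"))    # su
print(by_name[0].get("class"))  # s_ipt
```

Note that [@class='btn'] compares the whole attribute string, so an element with class="btn large" would not match this expression.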
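XPath syntax

Expression	Description
nodename	Selects the child elements named nodename of the current node.
/	Selects from the root node.
//	Selects matching nodes anywhere below the current node, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.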
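As a rough illustration of the expressions in the table, the sketch below evaluates each one with lxml. The bookstore document, its titles, prices, and lang attributes are hypothetical sample data chosen to match the element names used in this post.

```python
from lxml import etree

# Hypothetical sample document for illustration.
xml = """
<bookstore>
  <book lang="en"><title>Book A</title><price>29.99</price></book>
  <book lang="zh"><title>Book B</title><price>39.95</price></book>
</bookstore>
""".strip()
root = etree.fromstring(xml)

books = root.xpath("/bookstore/book")   # "/" : absolute path from the root
titles = root.xpath("//title")          # "//": matches anywhere in the document
children = root.xpath("book")           # nodename: child elements of the context node
current = books[0].xpath(".")           # "." : the context node itself
parent = books[0].xpath("..")           # "..": parent of the context node
langs = root.xpath("//book/@lang")      # "@" : attribute values

print(len(books))                # 2
print([t.text for t in titles])  # ['Book A', 'Book B']
print(len(children))             # 2
print(current[0].tag)            # book
print(parent[0].tag)             # bookstore
print(list(langs))               # ['en', 'zh']
```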

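Selecting several paths
Using the "|" operator in a path expression lets you select several paths at once.
//book/title | //book/price selects all title and price elements of the book elements.
//title | //price selects all title and price elements in the document.
/bookstore/book/title | //price selects all title elements of the book elements that belong to the bookstore element, plus all price elements in the document.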
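The three union expressions above can be compared side by side with lxml. This is a sketch on a hypothetical bookstore document; the magazine element is an assumption added so that the three expressions return different node counts.

```python
from lxml import etree

# Hypothetical sample document; <magazine> is included so the
# three expressions do not all select the same nodes.
xml = """
<bookstore>
  <book><title>Book A</title><price>29.99</price></book>
  <book><title>Book B</title><price>39.95</price></book>
  <magazine><title>Mag C</title><price>5.00</price></magazine>
</bookstore>
""".strip()
root = etree.fromstring(xml)

# title and price elements that are children of book elements
book_fields = root.xpath("//book/title | //book/price")
# every title and every price in the document
all_fields = root.xpath("//title | //price")
# titles of bookstore/book only, plus every price in the document
mixed = root.xpath("/bookstore/book/title | //price")

print(len(book_fields))  # 4
print(len(all_fields))   # 6
print(len(mixed))        # 5
```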
BeautifulSoup
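Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape, and because it is simple, a complete application takes very little code. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case, you only need to state the original encoding.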
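The encoding behaviour described above might look like the following in practice. This is a small sketch under assumptions: the Latin-1 byte string is invented sample input, and the built-in "html.parser" is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes in Latin-1 with no encoding declared in the markup.
raw = "<p>caf\u00e9</p>".encode("latin-1")

# State the original encoding explicitly with from_encoding.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string       # held internally as Unicode text
out = soup.encode("utf-8")  # serialized output as UTF-8 bytes

print(text == "caf\u00e9")                  # True
print("caf\u00e9".encode("utf-8") in out)   # True
```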
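Four kinds of objects
Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: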
Tag
NavigableString
BeautifulSoup
Comment
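To see the four kinds of objects side by side, the sketch below parses a tiny fragment and prints the type of each piece. The fragment is made up for illustration, and "html.parser" is used here only so the example runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment containing a tag, a text node, and a comment.
soup = BeautifulSoup(
    "<p class='story'><b>bold</b><!-- a note --></p>", "html.parser"
)

tag = soup.b                  # Tag: an element such as <b>
text = soup.b.string          # NavigableString: the text inside a tag
comment = soup.p.contents[1]  # Comment: a special kind of NavigableString

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```

Because Comment is a subclass of NavigableString, checking the exact type (or using isinstance with Comment first) is the way to tell a comment apart from ordinary text.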
Basic usage of Beautiful Soup

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
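# Parse with the lxml parser (the lxml package must be installed)
soup = BeautifulSoup(html, 'lxml')
# Pretty-print the parsed tree; the parser fills in the missing closing tags
print(soup.prettify())
# Text of the <title> tag: The Dormouse's story
print(soup.title.string)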
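Beyond reading a single tag, a typical next step is to search the tree. The sketch below, which reuses a shortened copy of the sample markup, collects every link with find_all and looks one up by id with find. The choice of "html.parser" is an assumption made to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup, keeping only the links.
html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; class_ filters on the class attribute.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", class_="sister")]

# find returns the first match only; here the lookup is by id.
second = soup.find(id="link2")

print(links[0])           # ('Elsie', 'http://example.com/elsie')
print(len(links))         # 3
print(second.get_text())  # Lacie
```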