Web Scraping: Getting Started with XPath Parsing (1)

Latest recommended article published on 2024-07-18 15:53:23

qq_57346203

Tags: python, web scraping

Article link: https://blog.csdn.net/qq_57346203/article/details/139883664

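# XPath is a language for locating content in XML documents.
# HTML is not strictly XML, but it parses into the same kind of element tree,
# so XPath works on HTML as well (lxml provides etree.HTML for that).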


# book is the parent of id, name, price and author.
# id, name, price and author are siblings.
# author is the parent of nick.
# The path /book/price selects the price element.
"""
<book>
    <id>1</id>
    <name>gugugu</name>
    <price>1.23</price>
    <author>
        <nick>咕咕</nick>
        <nick>冉冉</nick>
    </author>
</book>
"""


# Install the lxml module:
# pip install lxml
# Parsing with XPath

from lxml import etree

xml = """
<book>
    <id>1</id>
    <name>gugugu</name>
    <price>1.23</price>
    <author>
        <nick>咕咕</nick>
        <nick>冉冉</nick>
        <span>
            <nick>阿拉斯孤1</nick>
            <div>
                <nick>阿拉斯孤3</nick>
            </div>
        </span>
        <div>
            <nick>阿拉斯孤2</nick>
        </div>
    </author>
</book>
"""

tree = etree.XML(xml)
# result = tree.xpath("/book")  # / separates levels of the hierarchy; the leading / starts at the root
# result = tree.xpath("/book/name")  # [<Element name at 0x215fa6f15c0>]: an Element object, not the text inside name
# result = tree.xpath("/book/name/text()")  # text() returns the text inside the node
# result = tree.xpath("/book/author//nick/text()")  # // matches descendants at any depth, so //nick is every nick under author. Output: ['咕咕', '冉冉', '阿拉斯孤1', '阿拉斯孤3', '阿拉斯孤2']
# result = tree.xpath("/book/author/*/nick/text()")  # * is a wildcard for any single element at that level. Output: ['阿拉斯孤1', '阿拉斯孤2']
result = tree.xpath("/book//nick/text()")  # every nick anywhere under book


print(result)
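
To compare the path steps side by side, here is a minimal sketch that runs several of the expressions in one go. It assumes lxml is installed and uses a trimmed copy of the sample document. The helper `query` and the extra expression `/book/author/nick/text()` (direct children only) are illustrative additions, not part of the original script.

```python
from lxml import etree

# Trimmed copy of the sample document: only <name> and the <author> subtree.
xml = """
<book>
    <name>gugugu</name>
    <author>
        <nick>咕咕</nick>
        <nick>冉冉</nick>
        <span>
            <nick>阿拉斯孤1</nick>
            <div>
                <nick>阿拉斯孤3</nick>
            </div>
        </span>
        <div>
            <nick>阿拉斯孤2</nick>
        </div>
    </author>
</book>
"""

tree = etree.XML(xml)


def query(expr):
    """Hypothetical helper (not an lxml API): evaluate one XPath expression on the tree."""
    return tree.xpath(expr)


name_elements = query("/book/name")                     # Element objects
name_text = query("/book/name/text()")                  # text inside <name>
direct_nicks = query("/book/author/nick/text()")        # /  : direct children of author only
wildcard_nicks = query("/book/author/*/nick/text()")    # *  : exactly one element in between
all_nicks = query("/book/author//nick/text()")          # // : descendants at any depth

print(name_elements[0].tag)
print(name_text)
print(direct_nicks)
print(wildcard_nicks)
print(all_nicks)
```

The three `nick` queries differ only in the step between `author` and `nick`, which makes it easy to see that `/` stops at direct children, `*` skips exactly one level, and `//` reaches every depth.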