Parsing libraries for Python web scraping (XPath, Beautiful Soup, JSONPath)

Latest recommended article published on 2024-04-26 16:48:51

莫问收获，但问耕耘

Category: python. Tags: beautiful soup, web scraping, Xpath, Jsonpath

Article link: https://blog.csdn.net/sqsltr/article/details/97545877

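1. HTML parsing

When HTML content is returned to the browser, the browser parses it and renders it. HTML (HyperText Markup Language) was designed to go beyond plain text and make text more expressive. XML (eXtensible Markup Language) was not meant to replace HTML; rather, HTML was felt to contain too much formatting and to take on tasks beyond describing data, so XML was designed to describe data only. Both HTML and XML are structured: their markup forms a nested tree. The DOM (Document Object Model) is used to parse this nested tree structure, and browsers generally provide an API for manipulating the DOM in an object-oriented way.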

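2. XPath

XPath is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document. Tool: XMLQuire (on Windows 7 and later it requires .NET Framework 4.0-4.5) can be used to test XML and XPath.

2.1 Nodes

XPath has seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes.

/ is the root node
<bookstore> is an element node
<author>Corets, Eva</author> is an element node
id="bk104" is an attribute node; id is an attribute of the element node book
Nesting between nodes forms parent/children relationships.
Different nodes that share the same parent are siblings.

Node selection:

Predicates: a predicate finds a specific node, or a node that contains a specific value. Predicates are written inside square brackets. A predicate is the query condition: while selecting a path, you specify the condition inside the brackets.

XPath axes: an axis defines a node set relative to the current node.

Steps: the syntax of a step is axisname::nodetest[predicate]

XPath examples: a path that starts with a slash is an absolute path and starts from the root. A path that does not start with a slash is a relative path and is generally evaluated against the current node; within a given context the current node is often no longer the root. For convenience, when the XML is deeply nested, // is commonly used to find nodes.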
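As a minimal sketch of the ideas above (absolute and relative paths, predicates, and the axisname::nodetest[predicate] step syntax), the following runs a few queries with lxml, which section 2.2 introduces. The inline XML is a hypothetical sample: it only borrows the bookstore/book/title/price element names and the bk10x ids used in this article, and the titles and prices are made up.

```python
from lxml import etree

# Hypothetical sample document (not the books.xml used elsewhere in this article).
XML = b"""<bookstore>
  <book id="bk101"><title>XML Guide</title><price>44.95</price></book>
  <book id="bk102"><title>Midnight Rain</title><price>5.95</price></book>
  <book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)  # root is the <bookstore> element

# Absolute path: starts with "/" and is evaluated from the document root.
absolute_ids = root.xpath('/bookstore/book/@id')

# Predicate: the condition inside [] filters the selected book nodes.
cheap_ids = root.xpath('//book[price < 10]/@id')

# Relative path: evaluated against the context node, here the second book.
second = root.xpath('//book[@id="bk102"]')[0]
relative_title = second.xpath('title/text()')

# Full step syntax: axisname::nodetest[predicate]
previous_ids = second.xpath('preceding-sibling::book/@id')
next_ids = second.xpath('following-sibling::book[1]/@id')
parent_tag = second.xpath('parent::*')[0].tag

print(absolute_ids, cheap_ids, relative_title, previous_ids, next_ids, parent_tag)
```

The abbreviated forms used in the rest of this article are shorthand for such steps: for example, title is short for child::title and @id is short for attribute::id.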

2.2 lxml

Installing lxml: $ pip install lxml

from lxml import etree


with open('./books.xml') as f:
    # print(f.read())
    text = f.read()
    html = etree.HTML(text.encode())
    # print(html)
    print(html.tag)  # 'html': the HTML parser wraps the content in <html><body>

    print(html.xpath('//title'))  # title nodes at any depth below the root
    print(html.xpath('//book//title'))
    print(html.xpath('//book[@id="bk102"]'))
    print(html.xpath('//book[@id]'))
    print(html.xpath('//@id'))  # returns the attribute values
    print(html.xpath('//*[@id]'))
    print(html.xpath('//bookstore/book[1]'))
    print(html.xpath('//bookstore/book[1]/@id'))  # ['bk101']
    print(html.xpath('//bookstore/book[last()]/@id'))  # last() selects the last node
    print(html.xpath('//*[contains(local-name(), "store")]'))  # [<Element bookstore at 0x2ce5648>]
    # local-name() is the tag name of the current node
    print(html.xpath('//bookstore/*'))  # all direct children of bookstore, not recursive
    print(html.xpath('//*[@*]'))  # all nodes that have at least one attribute
    print(html.xpath('//@*'))  # all attributes
    print(html.xpath('//book/title|//book/price'))  # title or price tags under book nodes
    print(html.xpath('//book[position()=2]/@id'))  # ['bk102']
    print(html.xpath('//book[price > 40]/@id'))
    print(html.xpath('//book[1]/text()'))  # text nodes that are direct children of the first book
    print(html.xpath('//book[1]//text()'))  # all text nodes anywhere under the first book
    print(html.xpath('//*[contains(@class,"even")]'))  # nodes whose class attribute contains the string even

Extracting the "weekly word-of-mouth chart" (本周口碑榜) from Douban Movies:

import requests
from lxml import etree  # lxml is backed by C libraries, so it is very fast

url = 'http://movie.douban.com'
# Adjacent string literals are concatenated, so no stray whitespace ends up in the header
headers = {'User-agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/55.0.2883.75 Safari/537.36"}
response = requests.get(url, headers=headers)

with response:
    if response.status_code == 200:
        text = response.text
        html = etree.HTML(text)
        print(html.tag)

        # Text of every link inside the div whose class is billboard-bd
        titles = html.xpath('//div[@class="billboard-bd"]//a/text()')
        for title in titles:
            print(title)

        print("*********************")

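2.3 Beautiful Soup 4

BeautifulSoup extracts data from HTML and XML. BS4 is under active development.

Official Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Installation: $ pip install beautifulsoup4

BeautifulSoup(markup, "html.parser") uses the parser from the Python standard library; its fault tolerance is weaker (notably on old Python versions) and its speed is moderate. BeautifulSoup(markup, "lxml") is fault tolerant and fast, but requires the lxml package, which is built on C libraries. lxml is the recommended parser because of its speed. Specify the parser explicitly so that the same parser is used in every environment where the code runs.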

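Four kinds of objects

BeautifulSoup parses an HTML document into a complex tree in which every node is a Python object. These objects fall into 4 kinds:

BeautifulSoup, Tag, NavigableString, Comment

BeautifulSoup: represents the whole document.

Tag: corresponds to a tag in the HTML. It has 2 commonly used attributes:

name: the name of the Tag object, i.e. the tag name
attrs: a dictionary of the tag's attributes

Multi-valued attributes: a class attribute may look like <h3 class="title highlight">python高级班</h3>; such an attribute is multi-valued ({'class': ['title', 'highlight']}). Attributes can be modified and deleted.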
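The behaviour of name, attrs and a multi-valued class attribute can be checked on the h3 tag from the sentence above. This is only a small sketch; it uses "html.parser" so that it runs without lxml installed, which is a choice made for this example rather than the article's recommendation.

```python
from bs4 import BeautifulSoup

# The markup is the h3 example quoted above; the parser choice is an assumption of this sketch.
soup = BeautifulSoup('<h3 class="title highlight">python高级班</h3>', 'html.parser')
h3 = soup.h3

tag_name = h3.name              # the tag name
attrs_before = dict(h3.attrs)   # copy of the attribute dictionary

h3['class'] = 'new_class'       # modify the attribute
class_after_set = h3['class']

del h3['class']                 # delete the attribute
class_after_del = h3.get('class')   # get() returns None for a missing attribute

print(tag_name, attrs_before, class_after_set, class_after_del)
```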

from lxml import etree  # lxml is backed by C libraries, so it is very fast
from bs4 import BeautifulSoup
# from bs4 import Tag


# Stating features (the parser) explicitly is recommended
with open('E:/马哥教育培训资料/slides/chapter16爬虫/test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')  # first argument: an HTML str or an open file-like object; 'lxml' is the parser
    print(0, soup.builder)
    print(1, soup.name)
    print(2, soup.img)  # returns the first img tag
    print(3, soup.p)  # depth-first traversal, returns the first p
    print(4, soup.p.attrs)  # the result is a dict
    print(5, soup.h3.attrs)  # dict
    print(6, soup.h3['class'])
    print(7, soup.h3.get('class'))
    soup.h3['class'] = 'new_class'  # modify the attribute
    print(8, soup.h3.get('class'))
    print(000, soup.div.name, soup.div.attrs)
    print(9, soup.p.string)  # text content of the p tag
    print(10, soup.div.contents)  # list of direct children, including text nodes
    print(11, list(soup.div.children))  # children is an iterator over the direct children

    print(12, list(soup.div.descendants))  # descendants is an iterator over all descendant nodes
    print(list(map(lambda x: x.name if x.name else x, soup.div.descendants)))  # descendants: tag names, or the text itself

    print('************************')
    print(13, "".join(soup.div.strings))  # joined, but the newlines are still there
    print(14, "".join(soup.div.stripped_strings))  # whitespace stripped, everything runs together

    print(15, soup.p.next_sibling)
    print(16, soup.img.get('src'))
    print(17, soup.img['src'])
    print(18, soup.a)  # None if there is no such tag
    del soup.h3['class']  # delete the attribute
    print(19, soup.h3.get('class'))

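Note: we do not normally manipulate HTML this way; the code above is only meant to get familiar with the object types.

NavigableString: if you only want the text inside a tag and do not care about the markup, use NavigableString.

print(soup.div.p.string)  # string of the first p under the first div
print(soup.p.string)  # same as above

Comment: the comment object. An HTML comment parsed by Beautiful Soup becomes a Comment object.

Traversing strings: in the earlier example soup.div.string returns None, because string requires that soup.div have exactly one child node of type NavigableString, as in <div>only string</div>. How do you extract the strings when the div has many descendants?

print(soup.div.string)  # None, because there is more than one child node
print("".join(soup.div.strings))  # strings is an iterator; extra whitespace is kept
print("".join(soup.div.stripped_strings))  # stripped_strings is an iterator; extra whitespace is removed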
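To make the string rule and the Comment type concrete, here is a small sketch on inline markup. The HTML is a made-up sample (it is not test.html), and "html.parser" is used only so the example has no lxml dependency.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical sample: a div with several children, plus a tag holding only a comment.
html = '<div><p>only string</p>\n<p> second </p></div><b><!--hidden note--></b>'
soup = BeautifulSoup(html, 'html.parser')

single = soup.p.string             # the p has exactly one string child
none_for_many = soup.div.string    # the div has three children, so this is None
all_strings = list(soup.div.strings)             # whitespace kept
stripped = list(soup.div.stripped_strings)       # whitespace-only strings dropped, others stripped

comment = soup.b.string            # the only child of <b> is the comment
is_comment = isinstance(comment, Comment)
is_navigable = isinstance(comment, NavigableString)   # Comment is a NavigableString subclass

print(repr(single), none_for_many, all_strings, stripped, repr(comment))
```

Because a Comment is also a NavigableString, code that must ignore comments needs an explicit isinstance(x, Comment) check.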

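Traversing ancestor nodes:

print(soup.parent)  # None: the root node has no parent
print(soup.div.parent.name)  # body, the parent of the first div
print(soup.p.parent.parent.get('id'))  # reads the id attribute: main
print(list(map(lambda x: x.name, soup.p.parents)))  # iterator over the ancestors, nearest first

Traversing sibling nodes:

print('{} [{}]'.format(1, soup.p.next_sibling))  # next sibling of the first p element; note that it may be a text node
print('{} [{}]'.format(2, soup.p.previous_sibling))
print(list(soup.p.next_siblings))  # previous_siblings works the same way in the other direction

Traversing other elements: next_element is the next object that was parsed (a string or a tag), which is not the same as the next sibling, next_sibling.

print(soup.p.next_element)  # the text "字典" (the content of the first p in test.html)
print(soup.p.next_element.next_element.next_element)
print(list(soup.p.next_elements))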

from lxml import etree  # lxml is backed by C libraries, so it is very fast
from bs4 import BeautifulSoup
# from bs4 import Tag


# Stating features (the parser) explicitly is recommended
with open('E:/马哥教育培训资料/slides/chapter16爬虫/test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')  # first argument: an HTML str or an open file-like object; 'lxml' is the parser
    print(soup.p.next_element)  # the text "字典"
    print(soup.p.next_element.next_element)
    print(soup.p.next_element.next_element.next_element)
    print(list(soup.p.next_elements))
    print(list(soup.p.next_siblings))
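The difference between parents, next_sibling and next_element is easier to see on a tiny document. The sketch below uses made-up inline HTML and "html.parser" (an assumption of this example; with 'lxml' the ancestor list would also contain body and html).

```python
from bs4 import BeautifulSoup

# Hypothetical sample with a newline between the two p tags.
html = '<div id="main"><p>first</p>\n<p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

root_parent = soup.parent                     # the document object has no parent
ancestors = [t.name for t in p.parents]       # nearest ancestor first
parent_id = p.parent.get('id')

next_sib = p.next_sibling                     # the newline text node, not the second p
next_p = p.find_next_sibling('p')             # skips text nodes and returns the next p tag

next_el = p.next_element                      # the string inside the first p
after_text = p.next_element.next_element      # then the newline that follows the first p

print(ancestors, repr(next_sib), next_p, repr(next_el), repr(after_text))
```

When only tags are wanted, find_next_sibling() avoids having to step over whitespace text nodes by hand.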

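Searching the document tree:

name: officially called a filter. This parameter can be any of the following types:

i. String: a tag name; tags are matched against the whole string

print(soup.find_all('p'))

ii. Regular expression object: tag names are matched against the compiled pattern

print(soup.find_all(re.compile(r'^h\d')))  # tag names that start with h followed by a digit

iii. List: a tag matches if it matches any item in the list

print(soup.find_all(['p', 'h1', 'h3']))
print(soup.find_all(['p', re.compile(r'h\d')]))

iv. True or None: with True or None, find_all returns every node that is neither a string nor a comment, that is, every node of type Tag.

print(list(map(lambda x: x.name, soup.find_all(True))))
print(list(map(lambda x: x.name, soup.find_all(None))))
print(list(map(lambda x: x.name, soup.find_all())))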

from lxml import etree  # lxml is backed by C libraries, so it is very fast
from bs4 import BeautifulSoup
import re
from bs4 import Tag


# Stating features (the parser) explicitly is recommended
with open('E:/马哥教育培训资料/slides/chapter16爬虫/test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')  # first argument: an HTML str or an open file-like object; 'lxml' is the parser

    values = [True, None, False]
    for value in values:
        found = soup.find_all(value)  # renamed from all to avoid shadowing the builtin
        print(type(found[0]))
        print(len(found))

    # Cross-check: count the Tag objects among all descendants
    count = 0
    for i, t in enumerate(soup.descendants):
        print(i, type(t), t.name)
        if isinstance(t, Tag):
            count += 1
    print(count)

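v. Function: if the filters above still cannot extract the nodes you want, you can pass a function. The function must accept exactly one argument. If it returns True the current node matches; if it returns False it does not.

from lxml import etree  # lxml is backed by C libraries, so it is very fast
from bs4 import BeautifulSoup
import re
import bs4


def many_class(tag: bs4.Tag):
    # Matches tags that carry more than one CSS class
    # print(type(tag))
    # print(tag.attrs)
    return len(tag.attrs.get('class', [])) > 1


# Stating features (the parser) explicitly is recommended
with open('E:/马哥教育培训资料/slides/chapter16爬虫/test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')  # first argument: an HTML str or an open file-like object; 'lxml' is the parser

    print(soup.find_all(many_class))
    # [<h3 class="title highlight">python高级班</h3>]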
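The five filter types for the name parameter can be compared side by side on one small document. This is a sketch on made-up inline HTML, parsed with "html.parser" so that it runs without lxml; the names helper is introduced here only for readable output.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document.
html = ('<div><h1>Title</h1><h3 class="title highlight">Course</h3>'
        '<p id="first">a</p><p id="second">b</p></div>')
soup = BeautifulSoup(html, 'html.parser')


def names(tags):
    """Helper for this sketch: reduce a result list to tag names."""
    return [t.name for t in tags]


def many_class(tag):
    """Function filter: True for tags with more than one CSS class."""
    return len(tag.attrs.get('class', [])) > 1


by_string = names(soup.find_all('p'))                    # exact tag name
by_regex = names(soup.find_all(re.compile(r'^h\d')))     # h followed by a digit
by_list = names(soup.find_all(['p', 'h1']))              # any item matches, document order
by_true = names(soup.find_all(True))                     # every Tag
by_func = names(soup.find_all(many_class))               # custom predicate

print(by_string, by_regex, by_list, by_true, by_func)
```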

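Keyword arguments: when an argument is passed by keyword and its name is not one of the parameter names already defined by the find family of functions, it is collected into kwargs and treated as a tag attribute to search on. The value can be a string, a regular expression object, True, or a list.

print(soup.find_all(id='first'))  # list of all nodes whose id is first
print(soup.find_all(id=re.compile(r'\w+')))  # effectively all nodes that have an id
print(soup.find_all(id=True))  # all nodes that have an id
print(list(map(lambda x:x['id'], soup.find_all(id=True))))
print(soup.find_all(id=['first', re.compile(r'^sec')]))  # a list of acceptable id values
print(soup.find_all(id=True, src=True))  # conditions are ANDed: nodes that have both id and src

Special handling of the CSS class: class is a Python keyword, so class_ is used instead. class is a multi-valued attribute; you can match any one of its values, or match the full attribute string exactly.

print(soup.find_all(class_="content"))
print(soup.find_all(class_="title"))  # any single CSS class can be used
print(soup.find_all(class_="highlight"))  # any single CSS class can be used
print(soup.find_all(class_="highlight title"))  # wrong order: nothing found
print(soup.find_all(class_="title highlight"))  # same order: found, because this is an exact string match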
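The following sketch reproduces the keyword and class_ behaviour on inline markup so the results can be read off directly. The HTML and the ids helper are made up for this example, and "html.parser" is used only to avoid the lxml dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: one multi-class tag, two p tags with ids, one img with id and src.
html = ('<h3 class="title highlight">Course</h3>'
        '<p class="content" id="first">a</p>'
        '<p id="second">b</p>'
        '<img id="bg1" src="/a.png">')
soup = BeautifulSoup(html, 'html.parser')


def ids(tags):
    """Helper for this sketch: reduce a result list to id values."""
    return [t.get('id') for t in tags]


exact_id = ids(soup.find_all(id='first'))
any_id = ids(soup.find_all(id=True))
listed_ids = ids(soup.find_all(id=['first', re.compile(r'^sec')]))
id_and_src = ids(soup.find_all(id=True, src=True))     # both attributes required

one_class = soup.find_all(class_='highlight')           # any single class matches
exact_order = soup.find_all(class_='title highlight')   # full string, same order
swapped = soup.find_all(class_='highlight title')       # full string, different order

print(exact_id, any_id, listed_ids, id_and_src)
print(len(one_class), len(exact_order), len(swapped))
```

If the order of classes should not matter, a CSS selector such as soup.select('.title.highlight') is the more robust choice.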

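attrs parameter: attrs takes a dictionary whose keys are attribute names and whose values can be a string, a regular expression object, True, or a list. Several attributes can be given at once.

print(soup.find_all(attrs={'class':'title'}))
print(soup.find_all(attrs={'class':'highlight'}))
print(soup.find_all(attrs={'class':'title highlight'}))
print(soup.find_all(attrs={'id':True}))
print(soup.find_all(attrs={'id':re.compile(r'\d$')}))
print(list(map(lambda x:x.name, soup.find_all(attrs={
'id':True, 'src':True
}))))

text parameter: the text parameter searches the string content of the document; it accepts a string, a regular expression object, True, or a list. (Since Beautiful Soup 4.4.0 this parameter is named string; text is the older name.)

print(list(map(lambda x: (type(x), x), soup.find_all(text=re.compile(r'\w+')))))  # returns text nodes
print(list(map(lambda x: (type(x), x), soup.find_all(text=re.compile('[a-z]+')))))
# Combined with a name filter, find_all first selects Tag objects, then checks whether
# each tag's string satisfies the text argument, and returns Tag objects
print(soup.find_all(re.compile(r'^(h|p)'), text=re.compile('[a-z]+')))

limit parameter: limits the number of results returned

print(soup.find_all(id=True, limit=3))  # the returned list has 3 results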
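A short sketch of attrs, string search and limit on inline markup follows. It assumes Beautiful Soup 4.4.0 or later and therefore uses the newer keyword string instead of text; the HTML is a made-up sample and "html.parser" is used only to avoid the lxml dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document.
html = ('<div><h1>Python course</h1><p id="p1">alpha</p>'
        '<p id="p2">beta 2</p><p id="p3">42</p></div>')
soup = BeautifulSoup(html, 'html.parser')

# attrs: dictionary of attribute name to filter
by_attrs = [t['id'] for t in soup.find_all(attrs={'id': re.compile(r'[12]$')})]

# string alone: the results are the matching text nodes themselves
text_nodes = [str(s) for s in soup.find_all(string=re.compile('[a-z]+'))]

# name filter + string: the results are Tag objects whose .string matches
tags_with_text = [t.name for t in soup.find_all(re.compile(r'^(h|p)'),
                                                string=re.compile('[a-z]+'))]

# limit: stop after the first two matches
limited = [t['id'] for t in soup.find_all(id=True, limit=2)]

print(by_attrs, text_nodes, tags_with_text, limited)
```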

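find_all() is used so often that it has a shorthand: calling a BeautifulSoup or Tag object directly is the same as calling find_all() on it:

print(soup('img'))  # list of all img tag objects; not equivalent to soup.img
print(soup.img)  # the first img, depth first
print(soup.a.find_all(text=True))  # returns the text nodes
print(soup.a(text=True))  # returns the text nodes, equivalent to the line above
print(soup('a', text=True))  # returns a tag objects
print(soup.find_all('img', attrs={'id':'bg1'}))
print(soup('img', attrs={'id':'bg1'}))  # shorthand for find_all
print(soup('img', attrs={'id':re.compile('1')}))

The find method: find( name , attrs , recursive , text , **kwargs ) takes almost the same parameters as find_all. When something is found, find_all returns a list, whereas find returns a single element object. When nothing is found, find_all returns an empty list, whereas find returns None.

print(soup.find('img', attrs={'id':'bg1'}).attrs.get('src', 'magedu'))
print(soup.find('img', attrs={'id':'bg1'}).get('src'))  # shorter: get() directly on the tag instead of on attrs
print(soup.find('img', attrs={'id':'bg1'})['src'])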
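The return-value difference between find and find_all matters in practice, because chaining onto a find() that returned None raises an error. A minimal sketch on made-up inline HTML (parsed with "html.parser" only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two img tags and no a tag.
html = '<div><img id="bg1" src="/one.png"><img id="bg2" src="/two.png"></div>'
soup = BeautifulSoup(html, 'html.parser')

all_imgs = soup.find_all('img')
shorthand = soup('img')             # calling the object is find_all()
first_img = soup.find('img')        # a single Tag
attr_first = soup.img               # attribute access also returns the first match

bg2_src = soup.find('img', attrs={'id': 'bg2'})['src']

missing_one = soup.find('a')        # None when nothing matches
missing_all = soup.find_all('a')    # empty list when nothing matches

# Guard before chaining, since None has no attributes to read
link = soup.find('a')
href = link.get('href') if link is not None else None

print(len(all_imgs), first_img['id'], bg2_src, missing_one, missing_all, href)
```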

CSS selectors ***

As with jQuery, nodes can be found with CSS selectors through the soup.select() method. select supports most CSS selectors and returns a list. In CSS, a tag name is used as is, a class name is prefixed with a dot (.), and an id is prefixed with a hash sign (#).


from bs4 import BeautifulSoup


# Stating features (the parser) explicitly is recommended
with open('E:/马哥教育培训资料/slides/chapter16爬虫/test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')  # first argument: an HTML str or an open file-like object; 'lxml' is the parser
    
    # Element selector
    print(1, soup.select('p'))  # all p tags
    # Class selector
    print(2, soup.select('.title'))
    # Pseudo-class
    # Among the p tags that are direct children, the second one of its type
    # (same type = same tag name p). Older Beautiful Soup versions implemented only
    # nth-of-type, and required a number
    print(3, soup.select('div.content > p:nth-of-type(2)'))
    # id selector
    print(4, soup.select('p#second'))
    print(5, soup.select('#bg1'))
    # Descendant selector
    print(6, soup.select('div p'))  # p at any depth under a div
    print(7, soup.select('div div p'))  # p at any depth under a div that is itself under a div
    # Child selector, direct descendants only
    print(8, soup.select('div > p'))  # p tags that are direct children of a div; 2 in test.html
    # Adjacent sibling selector
    print(9, soup.select('div p:nth-of-type(1) + [src]'))  # returns [] for test.html
    print(9, soup.select('div p:nth-of-type(1) + p'))  # returns [] for test.html
    print(9, soup.select('div > p:nth-of-type(2) + input'))  # returns the input Tag
    print(9, soup.select('div > p:nth-of-type(2) + [type]'))  # same as above
    # General sibling selector
    print(10, soup.select('div p:nth-of-type(1) ~ [src]'))  # returns 2 img tags for test.html
    # Attribute selectors
    print(11, soup.select('[src]'))  # has a src attribute
    print(12, soup.select('[src="/"]'))  # src equals /
    print(13, soup.select('[src="http://www.magedu.com/"]'))  # exact match
    print(14, soup.select('[src^="http://www"]'))  # starts with http://www
    print(15, soup.select('[src$="com/"]'))  # ends with com/
    print(16, soup.select('img[src*="magedu"]'))  # contains magedu
    print(17, soup.select('img[src*=".com"]'))  # contains .com
    print(18, soup.select('[class="title highlight"]'))
    print(19, soup.select('[class~=title]'))  # one of the values of the multi-valued attribute is title
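Since the results above depend on test.html, here is a self-contained sketch of the main selector kinds on inline markup. The HTML and the ids helper are made up for this example; it assumes a Beautiful Soup version whose select() is available (recent versions delegate to the soupsieve package) and uses "html.parser" only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document.
html = ('<div class="content">'
        '<p id="first">one</p>'
        '<p id="second" class="title highlight">two</p>'
        '<input id="box" type="text">'
        '<img id="logo" src="http://www.magedu.com/a.png">'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')


def ids(tags):
    """Helper for this sketch: reduce a result list to id values."""
    return [t.get('id') for t in tags]


by_element = ids(soup.select('p'))                                   # element selector
by_class = ids(soup.select('.title'))                                # class selector
by_id = ids(soup.select('p#second'))                                 # id selector
by_child = ids(soup.select('div.content > p:nth-of-type(2)'))        # child + pseudo-class
adjacent = ids(soup.select('p + input'))                             # adjacent sibling
general = ids(soup.select('p ~ [src]'))                              # general sibling
starts_with = ids(soup.select('[src^="http://www"]'))                # attribute prefix
contains = ids(soup.select('img[src*="magedu"]'))                    # attribute substring
one_of_values = ids(soup.select('[class~=title]'))                   # one value of a multi-valued attribute
first_match = soup.select_one('p')                                   # single Tag or None

print(by_element, by_class, by_child, adjacent, general, one_of_values)
```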

Getting text content:

from bs4 import BeautifulSoup


# Stating features (the parser) explicitly is recommended
with open('E:/马哥教育培训资料/slides/chapter16爬虫/test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')  # first argument: an HTML str or an open file-like object; 'lxml' is the parser
    ele = soup.select('div')  # all div tags
    print(ele[0].string, end='\n------------\n')  # only works when the content is a single text node, otherwise None
    print(list(ele[0].strings), end='\n------------\n')  # iterates, keeping whitespace
    print(list(ele[0].stripped_strings), end='\n------------\n')  # iterates, dropping whitespace
    print(ele[0], end='\n------------\n')
    print(ele[0].text, end='\n------------\n')  # essentially get_text(): the strings joined, whitespace kept
    print(ele[0].get_text(), end='\n------------\n')  # iterates and joins, keeping whitespace; strip defaults to False
    print(ele[0].get_text(strip=True))  # iterates and joins, dropping whitespace
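One detail the script above does not show is that get_text() also accepts a separator, which is useful because strip=True alone glues the pieces together. A small sketch on made-up inline HTML (parsed with "html.parser" only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical sample with newlines between the children of the div.
html = '<div>\n<p>first</p>\n<p>second <b>bold</b></p>\n</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

raw = div.get_text()                         # strings joined as they are
same_as_text = div.text == raw               # .text is get_text() with defaults
glued = div.get_text(strip=True)             # each string stripped, joined with ''
spaced = div.get_text(' ', strip=True)       # each string stripped, joined with ' '

print(repr(raw), same_as_text, glued, spaced)
```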

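2.4 JSON parsing

Given a JSON string, extracting part of its content normally means traversing the parsed data and testing each item along the way. Another approach, similar to XPath, is JsonPath. Installation: $ pip install jsonpath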

import requests
import simplejson
from jsonpath import jsonpath


url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&page_limit=50&page_start=0"

# Adjacent string literals are concatenated, so no stray whitespace ends up in the header
headers = {'User-agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/55.0.2883.75 Safari/537.36"}
response = requests.get(url, headers=headers)

with response:
    text = response.text
    print(type(text), text)

    data = simplejson.loads(text)
    print(data)

    # XPath-style: //subjects
    # subjects = jsonpath(data, '$..subjects')  # returns the bool False when nothing is found
    # if isinstance(subjects, list) and len(subjects) == 1:
    #     print(subjects)
    #     for subject in subjects[0]:
    #         print(subject.get('title'))

    # XPath-style: //subjects[rate > 8]/title; JsonPath: $.subjects[?(@.rate > 8)]
    # rate is compared with the string "8" here, so this is a string comparison
    subjects = jsonpath(data, '$.subjects[?(@.rate > "8")].title')  # returns the bool False when nothing is found
    # if isinstance(subjects, list) and len(subjects) == 1:
    print(subjects)
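For comparison, the manual traversal mentioned at the start of this section can be sketched with only the standard library. The payload below is hypothetical: it assumes, as the string comparison in the script above suggests, that each subject has a title and a rate stored as a string. The function name titles_rated_above is introduced here for illustration.

```python
import json

# Hypothetical payload shaped like the response assumed above (rate is a string).
text = ('{"subjects": ['
        '{"title": "A", "rate": "8.7"}, '
        '{"title": "B", "rate": "7.9"}, '
        '{"title": "C", "rate": "10.0"}]}')
data = json.loads(text)


def titles_rated_above(data, threshold):
    """Walk the subjects list and keep titles whose numeric rate exceeds threshold."""
    return [s["title"] for s in data.get("subjects", [])
            if float(s["rate"]) > threshold]


numeric_titles = titles_rated_above(data, 8)

# Comparing the raw strings instead is lexicographic: "10.0" sorts before "8".
string_titles = [s["title"] for s in data["subjects"] if s["rate"] > "8"]

print(numeric_titles, string_titles)
```

The two results differ for a rating of "10.0", so converting to float before comparing is the safer choice whenever ratings can reach two digits.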