Parsing libraries: XPath, Beautiful Soup, pyquery

Article link: https://blog.csdn.net/ljq1998/article/details/99296292

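1. XPath

Node and attribute-value queries always return lists.

Basic usage

from lxml import etree
text = '<li>abc刘嘉强</li>'
html = etree.HTML(text)
# Build an element tree from a string
print(etree.tostring(html, encoding='utf-8').decode('utf-8'))
# Serialize the tree to UTF-8 bytes, then decode them to str
result = html.xpath('//li/text()')
# Select the text value; result is a list
print(result)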

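1. Node selection

'//' selects descendant nodes

html.xpath('//li/a')  '/' selects direct child nodes

html.xpath('//li/a/../a')  '..' selects the parent node

html.xpath('//li/a/parent::*/a')  'parent::' also selects the parent node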
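A minimal sketch of these path steps, assuming a small hypothetical list of links (the markup is made up for illustration):

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
text = '<ul><li><a href="a.html">one</a></li><li><a href="b.html">two</a></li></ul>'
html = etree.HTML(text)

links = html.xpath('//li/a')                    # every <a> that is a direct child of an <li>
parents = html.xpath('//li/a/..')               # '..' steps up to each parent <li>
same_parents = html.xpath('//li/a/parent::*')   # the parent axis written out in full

print([a.get('href') for a in links])
print([p.tag for p in parents])
print([p.tag for p in same_parents])
```

Both parent forms reach the same li nodes; 'parent::' needs a node test after it, such as *.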

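2. Getting text and attributes

html.xpath('//li/text()')  text() returns the text directly inside the node, not the text of its child nodes

html.xpath('//li/@class')  @name returns that attribute's value for every matching node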
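To see the difference between a node's own text and its children's text, here is a short sketch on hypothetical markup:

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
text = ('<ul><li class="item">first <a href="x.html">link</a></li>'
        '<li class="item active">second</li></ul>')
html = etree.HTML(text)

own_text = html.xpath('//li/text()')    # text directly inside each <li> only
all_text = html.xpath('//li//text()')   # '//' also descends into child nodes
classes = html.xpath('//li/@class')     # one attribute value per matching <li>

print(own_text)
print(all_text)
print(classes)
```

The text inside the a node shows up only in the second query.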

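3. Attribute matching

(1) Simple attribute matching

html.xpath('//a[@class="x"]')   # @name="value" selects the nodes whose attribute equals the value

(2) Matching an attribute with multiple values

When an attribute holds several values (for example class="x y"), the simple match above fails; use the contains() function instead

html.xpath('//a[contains(@class,"x")]')

(3) Matching on several attributes

When more than one attribute is needed to pin down a node, join the conditions with and; or is also available

html.xpath('//a[@name="item" and contains(@class,"x")]/li')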
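The three matching styles can be compared side by side. This is a sketch on hypothetical markup; the attribute values are invented for the example:

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
text = '''
<div>
  <a class="x" name="item" href="1.html">one</a>
  <a class="x y" name="item" href="2.html">two</a>
  <a class="x y" name="other" href="3.html">three</a>
</div>
'''
html = etree.HTML(text)

# Exact match: only class="x" qualifies, class="x y" does not.
exact = html.xpath('//a[@class="x"]/@href')
# contains() is a substring test, so it matches "x" and "x y".
partial = html.xpath('//a[contains(@class, "x")]/@href')
# 'and' requires both conditions to hold on the same node.
both = html.xpath('//a[@name="item" and contains(@class, "y")]/@href')

print(exact, partial, both)
```

Because contains() tests substrings, contains(@class, "x") would also match a class such as "box".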

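4. Selecting by position

html.xpath('//a/li[1]')  # the first node (XPath positions start at 1)
html.xpath('//a/li[last()]')  # the last node
html.xpath('//a/li[position() < 3]')  # nodes at positions below 3, that is 1 and 2
html.xpath('//a/li[last() - 2]')  # the third node from the end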
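A quick check of the positional predicates, assuming a hypothetical five-item list (ul is used as the parent here instead of a):

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
text = '<ul><li>1</li><li>2</li><li>3</li><li>4</li><li>5</li></ul>'
html = etree.HTML(text)

first = html.xpath('//ul/li[1]/text()')
last = html.xpath('//ul/li[last()]/text()')
first_two = html.xpath('//ul/li[position() < 3]/text()')
third_from_last = html.xpath('//ul/li[last() - 2]/text()')

print(first, last, first_two, third_from_last)
```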

5. Node axis selection
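A minimal sketch of axis selection, assuming hypothetical markup. An axis is written as name::node-test; the ones shown here (ancestor, attribute, child, descendant, following-sibling) are standard XPath 1.0 axes:

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
text = ('<div id="box"><ul>'
        '<li class="a"><a href="1.html">one</a></li>'
        '<li class="b">two</li>'
        '<li class="c">three</li>'
        '</ul></div>')
html = etree.HTML(text)

ancestors = [e.tag for e in html.xpath('//li[1]/ancestor::*')]   # every ancestor, outermost first
box_id = html.xpath('//li[1]/ancestor::div/@id')                 # only ancestors named div
attrs = html.xpath('//li[1]/attribute::*')                       # all attribute values of the node
child_href = html.xpath('//li[1]/child::a/@href')                # direct children named a
inner_text = html.xpath('//li[1]/descendant::a/text()')          # descendants named a
later = html.xpath('//li[1]/following-sibling::*/@class')        # siblings after the node

print(ancestors, box_id, attrs, child_href, inner_text, later)
```

etree.HTML() wraps the fragment in html and body, which is why they appear among the ancestors.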

2. Beautiful Soup

Node lookups return Tag objects.

Basic usage

from bs4 import BeautifulSoup
text = '<li class="xy">abc</li>'
soup = BeautifulSoup(text,'lxml')
print(soup.prettify())
# Print the document with standard indentation
print(soup.li.string)
# Get the text of the li node

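1. Node selectors

soup.a.ul.li

(1) Child nodes

Direct children: soup.a.contents (a list) or soup.a.children (an iterator)

Descendants: soup.a.descendants

(2) Parent nodes

Direct parent: soup.a.parent

Ancestors: soup.a.parents, which includes the direct parent

(3) Sibling nodes

Next sibling: soup.a.next_sibling

Previous sibling: soup.a.previous_sibling

All following siblings: soup.a.next_siblings

All preceding siblings: soup.a.previous_siblings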
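The navigation attributes above can be tried on a small tree. This sketch assumes hypothetical markup with no whitespace between tags, so that every sibling is a Tag rather than a text node:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
text = ('<div><p id="first">one</p>'
        '<p id="second">two <b>bold</b></p>'
        '<p id="third">three</p></div>')
soup = BeautifulSoup(text, 'lxml')
second = soup.find(id='second')

children = second.contents                      # direct children, as a list
descendants = list(second.descendants)          # children plus everything nested in them
parent_name = second.parent.name                # the direct parent
ancestor_names = [p.name for p in second.parents]
next_id = second.next_sibling['id']
prev_id = second.previous_sibling['id']

print(children, descendants, parent_name, ancestor_names, next_id, prev_id)
```

With the lxml parser the ancestors include the body and html wrappers and, last, the BeautifulSoup object itself, whose name is '[document]'.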

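2. Getting attributes

soup.a.li.string  'string' returns the node's text

soup.a.name  'name' returns the node's name, here a

soup.a.attrs['class']  'attrs' is a dictionary of the node's attributes; index it to get a value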
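A short sketch of reading a node's name, attributes, and text, assuming hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
text = '<ul id="menu" class="nav main"><li>item</li></ul>'
soup = BeautifulSoup(text, 'lxml')

name = soup.ul.name               # the tag name
attrs = soup.ul.attrs             # all attributes as a dict
menu_id = soup.ul['id']           # shorthand for soup.ul.attrs['id']
classes = soup.ul.attrs['class']  # class is multi-valued, so it comes back as a list
item_text = soup.ul.li.string     # the text of the nested li

print(name, attrs, menu_id, classes, item_text)
```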

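3. Method selectors

find_all

Finds every matching node

find_all(name,attrs,recursive,text,**kwargs)

find returns only the first matching node

(1) Search by node name

soup.find_all(name='ul')

(2) Search by node attributes

soup.find_all(attrs={'id':'list1'})
soup.find_all(class_='list1')
# class is a Python keyword, so the argument takes a trailing underscore

(3) Search by node text

import re
soup.find_all(text=re.compile('link'))
# text can be a compiled regular expression; Beautiful Soup 4.4.0 and later also accept this argument under the name string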
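The three search styles, sketched on hypothetical markup. The example uses string= for the text search, which assumes Beautiful Soup 4.4.0 or later; on older versions the same argument is spelled text=:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
text = '''
<ul id="list1" class="list"><li>first link</li><li>second item</li></ul>
<ul id="list2" class="list small"><li>third link</li></ul>
'''
soup = BeautifulSoup(text, 'lxml')

by_name = soup.find_all(name='ul')
by_id = soup.find_all(attrs={'id': 'list1'})
by_class = soup.find_all(class_='list')               # matches any one of a node's classes
by_text = soup.find_all(string=re.compile('link'))    # returns the matching strings, not tags
first_ul = soup.find('ul')                            # find() stops at the first match

print(len(by_name), len(by_id), len(by_class), by_text, first_ul['id'])
```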

4. CSS selectors

soup.select('#id .class li')  # select nodes with a CSS selector
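A sketch of select() on hypothetical markup; the id and class names are invented for the example. select_one() is the single-result counterpart:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
text = ('<div id="panel"><ul class="list"><li>a</li><li>b</li></ul>'
        '<ul class="other"><li>c</li></ul></div>')
soup = BeautifulSoup(text, 'lxml')

items = soup.select('#panel .list li')        # a list of Tag objects
item_texts = [li.get_text() for li in items]
other = soup.select_one('#panel .other li')   # the first match, or None

print(item_texts, other.get_text())
```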

3. pyquery

Node lookups return PyQuery objects.

Basic usage

from pyquery import PyQuery as pq
# Initialize from a string
html = '<li></li>'
doc = pq(html)
# Initialize from a URL
url = 'https://www.baidu.com'
doc = pq(url=url)
# Initialize from a file
filename = 'demo.html'
doc = pq(filename=filename)

1. Basic CSS selectors

doc('#x .y li')

This selects the node whose id is x, then the nodes with class y inside it, then every li node inside those

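2. Finding nodes

(1) find searches all descendants of a node; children searches only its direct children

(2) parent returns the direct parent; parents returns all ancestors

(3) siblings returns the sibling nodes

doc.find('li')
doc.children()
doc.parent()
doc.parents()
doc.siblings()
# find() needs a CSS selector; the other calls accept one optionally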

3. Iteration

Calling items() returns a generator that yields each matched node as a PyQuery object

for li in doc('li').items():
    pass

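4. Getting information

(1) Getting attributes

doc.attr('class')

When the selection contains several nodes, attr() returns only the first node's value

(2) Getting text

doc.text() strips the HTML tags and returns plain text, joining the text of all selected nodes into one string

doc.html() returns the HTML inside the node; for a selection of several nodes it returns only the first node's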

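5. Node operations

pyquery provides a set of methods for modifying nodes dynamically

(1) addClass, removeClass

doc.addClass('active')  # add active to the class attribute

doc.removeClass('active')  # remove active from the class attribute

(2) attr, text, html

doc.attr('name','link')  the first argument is the attribute name, the second is the value to set

doc.text('刘嘉强')  replaces the node's contents with plain text: <tag>刘嘉强</tag>

doc.html('<li>xy</li>')  replaces the node's contents with HTML: <tag><li>xy</li></tag>

Called without a value these methods read; called with a value they assign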

(3) remove

<li>

刘嘉强<p>xy</p>

</li>

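Here, calling text() directly also picks up the text of the nested <p>, so the result contains both '刘嘉强' and 'xy'; to get only the li's own text, remove the <p> node first

doc('li').find('p').remove()

This selects the <p> node and then calls remove() to delete it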

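6. Pseudo-class selectors

doc('li:first-child')  # the li that is the first child of its parent
doc('li:last-child')  # the li that is the last child of its parent
doc('li:nth-child(2)')  # the li that is the second child
doc('li:gt(2)')  # the li nodes after the third (index greater than 2, counting from 0)
doc('li:nth-child(2n)')  # the li nodes at even positions
doc('li:contains(second)')  # the li nodes whose text contains "second"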