PyQuery: selecting a given number and kind of nodes with selector rules

shadowsland

Tags: python html css

Article link: https://blog.csdn.net/u011888840/article/details/105915786

PyQuery

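There is not much to add about PyQuery itself; plenty of introductions exist online.
It is a Python library modeled on jQuery and, like BeautifulSoup, is used to parse XML and HTML documents quickly.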

Getting the nodes you want

Here is a small example built on a slightly modified passage from Alice in Wonderland:

from pyquery import PyQuery as pq

html = '''
<html>
 <head><title>The Dormouse's story</title></head>
 <body>
  <p class="title"> <b>The Dormouse's story</b> </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link0"> start </a>
   <a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a>，
   <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and
   <a class="sister" href="http://example.com/tillie" id="link3"> Tillie </a>；
   <a class="sister" href="http://example.com/tillie" id="link4"> and they lived at the bottom of a well </a>
   <a class="sister" href="http://example.com/elsie" id="link5"> end </a>
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''
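doc = pq(html, parser='html')
# CSS structural pseudo-classes: positions are counted from 1 among siblings
print('First a node:', doc('a:first-child'))
print('Last a node:', doc('a:last-child'))
print('Second a node:', doc('a:nth-child(2)'))
print('a nodes at even positions (1-based):', doc('a:nth-child(2n)'))
# jQuery-style pseudo-classes provided by PyQuery: indices are counted from 0
print('All a nodes after the third:', doc('a:gt(2)'))
print('First three a nodes:', doc('a:lt(3)'))
print('Node at index 0 (the first):', doc('a:eq(0)'))
print('a nodes at even indices (1st, 3rd, 5th):', doc('a:even'))
print('a nodes at odd indices (2nd, 4th, 6th):', doc('a:odd'))
# Text match
print('Nodes containing the given text:', doc('a:contains(Elsie)'))  # the text "Elsie"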

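Note: the index-based pseudo-classes (:eq, :gt, :lt, :even, :odd) count from 0, whereas the CSS :nth-child pseudo-class counts from 1.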
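To make the zero-based indexing concrete, here is a rough analogy using plain Python list slicing. This is not PyQuery code: the list `ids` is only a stand-in I made up for the six `a` nodes (link0 to link5) in document order, and the comments name the selector each slice roughly corresponds to.

```python
# Stand-ins for the six <a> nodes of the example, in document order (hypothetical list).
ids = [f"link{i}" for i in range(6)]

analogy = {
    "a:eq(0)": [ids[0]],   # the node at index 0
    "a:lt(3)": ids[:3],    # indices 0, 1, 2
    "a:gt(2)": ids[3:],    # indices 3, 4, 5
    "a:even": ids[0::2],   # indices 0, 2, 4
    "a:odd": ids[1::2],    # indices 1, 3, 5
}

for selector, matched in analogy.items():
    print(selector, "->", matched)
```

So `a:lt(3)` and `a:gt(2)` together cover all six links without overlap, just as `ids[:3]` and `ids[3:]` do.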

Fetching all post information for a CSDN blogger

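First, look at the blogger's CSDN home page:
[Image: screenshot of the CSDN home page]
Find the key selectors for the nodes that hold the information we need and extract them directly. The code is as follows.
Note that to get the post title we have to skip the span node, and :lt is used to limit the number of items.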

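from pyquery import PyQuery as pq


def get_info_from_url(url, limit):
    """Yield {"url", "name"} dicts for the posts listed on a CSDN blogger's home page."""
    try:
        print("-" * 100)
        doc = pq(url)  # given a URL, PyQuery fetches the page itself
        if limit:  # 0 yields nothing; a negative limit means "all posts"
            # :lt(n) keeps only the first n matched boxes
            item_box = doc(".article-list .article-item-box" + (f":lt({limit})" if limit > 0 else ""))
            # find("span").remove() strips the span from each link before text() is read;
            # links without a span are skipped because the empty result is falsy
            yield from [{"url": info.attr.href, "name": info.text()} for info in item_box("a").items()
                        if info.find("span").remove()]
    except Exception as e:
        print("Error:", e)


url_input = input("Enter URL: ").strip()  # CSDN blogger home page, e.g. https://blog.csdn.net/xxx
try:
    num = int(input("Enter the number of posts to fetch: "))  # a negative number fetches all
except ValueError:
    num = -1  # fall back to fetching all when the input is not a number
for i in get_info_from_url(url_input, num):
    print(i)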
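To see why the span has to be skipped, here is a minimal offline sketch. It deliberately uses the standard library's `xml.etree.ElementTree` instead of PyQuery so that it runs without network access, and the markup in `snippet` is a made-up approximation of a post link, not the real CSDN page source. The helper `text_skipping` is a hypothetical name of my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the span holds a label such as "Original" that is not part of the title.
snippet = (
    '<a href="https://blog.csdn.net/xxx/article/details/1">'
    '<h4><span class="article-type">Original</span> PyQuery notes </h4>'
    '</a>'
)


def text_skipping(elem, skip_tag="span"):
    """Collect the text under elem, leaving out the contents of skip_tag elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip_tag:
            parts.append(text_skipping(child, skip_tag))
        parts.append(child.tail or "")  # text that follows the child is kept
    return "".join(parts)


a = ET.fromstring(snippet)
full_text = " ".join("".join(a.itertext()).split())
title = " ".join(text_skipping(a).split())
info = {"url": a.get("href"), "name": title}

print(full_text)  # label and title run together
print(info)       # title only
```

Without the skip, the label text ends up glued to the front of every post title, which is what removing the span in the PyQuery version avoids.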

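The get_info_from_url code above fetches the URL and title of each post by the given blogger.
You enter the blogger's home page link by hand, for example: https://blog.csdn.net/u011888840
Then you enter the number of posts to fetch. It must be a number; a negative number returns all posts.
The rule is just a string, so you can build whatever filter you want with an f-string or format, which is very convenient.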
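As a small illustration of building rules as strings, the sketch below assembles selectors with an f-string and with `str.format`. The helper names `limited_selector` and `nth_selector` are hypothetical, introduced only for this example; the base selector is the one used in the code above.

```python
BASE = ".article-list .article-item-box"


def limited_selector(base: str, limit: int) -> str:
    """Append :lt(limit) when limit is positive; otherwise leave the selector unrestricted."""
    return f"{base}:lt({limit})" if limit > 0 else base


def nth_selector(base: str, index: int) -> str:
    """Same idea with str.format: select only the item at a zero-based index."""
    return "{}:eq({})".format(base, index)


print(limited_selector(BASE, 5))   # first five boxes
print(limited_selector(BASE, -1))  # no limit
print(nth_selector(BASE, 0))       # first box only
```

The resulting string can be passed straight to `doc(...)`, so changing the filter is only a matter of changing how the string is built.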

Here is the printed output when fetching the first five posts from https://blog.csdn.net/u011888840:
Entering https://blog.csdn.net/u011888840 and 5 gives