3. Using Parsing Libraries

Z_Coding

Category: python  Tags: web crawler

Article link: https://blog.csdn.net/z1360408752/article/details/112712860

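Using XPath

XPath stands for XML Path Language.
It is a language for locating information in XML documents, and it works on HTML as well.

Preparation
Install the lxml library (pip install lxml)

Basic usage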

from lxml import etree

selector=etree.HTML(source) # parse the HTML source into an element tree that XPath can query

selector.xpath(expression) # returns a list of matches

Worked example

from lxml import etree
html="""
<!DOCTYPE html>
<html>
<head lang="en">
<title>Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<div id="content">
<ul id="ul">
<li>NO.1</li>
<li>NO.2</li>
<li>NO.3</li>
</ul>
<ul id="ul2">
<li>one</li>
<li>two</li>
</ul>
</div>
<div id="url">
<a href="http:www.58.com" title="58">58</a>
<a href="http:www.csdn.net" title="CSDN">CSDN</a>
</div>
</body>
</html>
"""
selector=etree.HTML(html)
content=selector.xpath('//div[@id="content"]/ul[@id="ul"]/li/text()') # the id attributes pick out which div and ul to match; text() returns the text content
for i in content:
    print(i)
# Output:
# NO.1
# NO.2
# NO.3

con=selector.xpath('//a/@href') # // searches the whole document for matching a tags; "@attribute" returns the value of each a tag's href attribute
for each in con:
    print(each)
# Output:
# http:www.58.com
# http:www.csdn.net

con=selector.xpath('/html/body/div/a/@title') # locate with an absolute path
# con=selector.xpath('//a/@title') # locate with a relative path; both give the same result
print(len(con))
print(con[0],con[1])

# Output:
# 2
# 58 CSDN
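
The queries above pull out one attribute at a time. When the text and several attributes of the same tag are needed together, one option is to select the elements themselves and read each field from the element. The following is a minimal sketch under that assumption; the markup is a hypothetical cut-down version of the links above, with well-formed URLs.

```python
from lxml import etree

# Hypothetical sample markup, modeled on the link block in the example above.
html = """
<div id="url">
  <a href="http://www.58.com" title="58">58</a>
  <a href="http://www.csdn.net" title="CSDN">CSDN</a>
</div>
"""

selector = etree.HTML(html)

links = []
# Select the <a> elements, not a single attribute of them.
for a in selector.xpath('//div[@id="url"]/a'):
    # A relative XPath (starting with ".") is evaluated from this element.
    text = a.xpath('./text()')[0]
    # Element.get() reads one attribute value.
    links.append((text, a.get('href'), a.get('title')))

for text, href, title in links:
    print(text, href, title)
```

Keeping the fields of one tag in one tuple avoids having to line up several separate result lists by index.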

Special XPath usage

starts-with handles the case where attribute values begin with the same string

from lxml import etree
html="""
    <body>
        <div id="aa">aa</div>
        <div id="ab">ab</div>
        <div id="ac">ac</div>
    </body>
    """
selector=etree.HTML(html)
content=selector.xpath('//div[starts-with(@id,"a")]/text()') # starts-with selects the div tags whose id attribute value begins with "a"
for each in content:
    print(each)
# Output:
# aa
# ab
# ac

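Attribute values that contain the same string
Use contains in place of starts-with

For example, to extract the text of the tags whose attribute value contains key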

selector.xpath('//div[contains(@id,"key")]/text()')
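
To make the difference between the two functions concrete, here is a small runnable sketch. The markup and the id values are made up for illustration; only the two XPath functions come from the text above.

```python
from lxml import etree

# Hypothetical markup: "key" appears at the start of one id and in the middle of another.
html = """
<body>
  <div id="key-1">first</div>
  <div id="my-key">second</div>
  <div id="other">third</div>
</body>
"""

selector = etree.HTML(html)

# contains() matches "key" anywhere inside the id value.
matched = selector.xpath('//div[contains(@id,"key")]/text()')
# starts-with() only matches when the id value begins with "key".
prefixed = selector.xpath('//div[starts-with(@id,"key")]/text()')

print(matched)
print(prefixed)
```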

Using BeautifulSoup

Installing BS4

pip install beautifulsoup4

Importing BeautifulSoup

from bs4 import BeautifulSoup

Basic usage of BeautifulSoup

from bs4 import BeautifulSoup
soup=BeautifulSoup('123','html.parser') # first argument: the page source; second argument: the parser
# soup=BeautifulSoup('123','lxml') # if lxml is installed, it can be used as the parser
print(soup.prettify())

BeautifulSoup parsers
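
The parser is chosen by the second argument of BeautifulSoup. As a rough sketch, assuming the parsers in question are the ones the Beautiful Soup documentation lists (the built-in html.parser, plus lxml and html5lib, which are separate installs), the loop below tries each one and skips any that is not installed. The sample markup is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, well-formed sample markup.
markup = "<ul><li>one</li><li>two</li></ul>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        print(parser, "-> not installed")
        continue
    results[parser] = [li.get_text() for li in soup.find_all("li")]
    print(parser, "->", results[parser])
```

On well-formed markup the parsers agree; on broken markup they may repair the tree differently, so it is worth fixing the parser explicitly rather than relying on whichever one happens to be installed.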

Basic elements of the BeautifulSoup library

Getting the basic elements of the tag tree

from bs4 import BeautifulSoup
data='''

<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <div class="useful">
      <ul>
        <li class="info">Information I need 1</li>
        <li class="test">Information I need 2</li>
        <li class="iamstrange">Information I need 3</li>
      </ul>
     </div>
     <div class="useless">
       <ul>
         <li class="info">Junk 1</li>
         <li class="info">Junk 2</li>
       </ul>
     </div>
  </body>
</html>'''
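soup=BeautifulSoup(data,'lxml') # first argument: the page source; second argument: the parser
tag=soup.li # get the first li tag
print(soup.li.name) # get the tag's name
print(soup.li.parent.name) # get the name of the tag's parent
print(tag.attrs) # get the tag's attributes, returned as a dictionary
print(tag.string) # get the tag's non-attribute string (its text)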
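
Two details of these elements are easy to trip over: .string only returns a value when a tag has exactly one child, and a comment inside a tag comes back as a Comment object. The sketch below illustrates both on a hypothetical snippet; get_text() is shown as one way to collect text from a tag with several children.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet: a tag with text, a tag holding a comment, a tag with two children.
html = '<p class="a b">plain text</p><div><!--a comment--></div><ul><li>x</li><li>y</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p
print(p.name)          # the tag name
print(p.attrs)         # attribute dictionary; class is multi-valued, so its value is a list
print(type(p.string))  # the text is wrapped in a NavigableString

# A comment is returned by .string too, but as a Comment (a NavigableString subclass).
comment = soup.div.string
print(isinstance(comment, Comment))

ul = soup.ul
print(ul.string)       # None, because <ul> has more than one child
print(ul.get_text())   # concatenated text of all descendants
```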

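Traversing the tag tree

Downward traversal
Upward traversal
Sibling (parallel) traversal

Methods used for traversal
Note:
In upward traversal, first check whether the tag's parent was actually found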

soup=BeautifulSoup(data,'lxml') # first argument: the page source; second argument: the parser
for parent in soup.li.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
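
The three directions of traversal can be sketched side by side with the attributes bs4 provides for them (.children and .descendants downward, .parents upward, .next_siblings and .previous_sibling sideways). The markup is hypothetical and written without whitespace between tags, so that the siblings are tags only and not newline strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with no whitespace between tags.
html = '<div><ul><li>a</li><li>b</li><li>c</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

# Downward: direct children, then every descendant (tags and strings).
children = [child.name for child in ul.children]
descendants = list(ul.descendants)

# Upward: every ancestor of the first <li>, ending at the document object.
ancestors = [parent.name for parent in ul.li.parents]

# Sideways: siblings after the first <li>, and the sibling before the last one.
first = ul.li
following = [s.get_text() for s in first.next_siblings if isinstance(s, Tag)]
last = ul.find_all('li')[-1]
previous = last.previous_sibling.get_text()

print(children)
print(len(descendants))
print(ancestors)
print(following, previous)
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'. In real pages the sibling attributes often return whitespace strings between tags, which is why the sketch filters on Tag.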

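HTML formatting and encoding in bs4

The prettify method formats (pretty-prints) HTML code
bs4 converts documents to Unicode internally and outputs UTF-8 by default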

import requests
from bs4 import BeautifulSoup
demo=requests.get('https://python123.io/ws/demo.html').text
soup=BeautifulSoup(demo,'html.parser')
print(soup.prettify()) # prettify gives a nicely indented printout of the tag tree
# bs4 outputs UTF-8 by default
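
The encoding behavior can be checked without any network access. The sketch below is an illustration under assumptions: the input bytes are a made-up GBK-encoded fragment, and the encoding is passed explicitly through the from_encoding argument instead of being guessed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the same fragment a GBK-encoded page might contain.
raw = '<p>中文</p>'.encode('gbk')

# from_encoding tells bs4 how to decode the incoming bytes.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

text = soup.p.string              # already decoded to a Unicode string
pretty = soup.prettify()          # prettify() returns a str
utf8_bytes = soup.encode('utf-8') # explicit byte output, here as UTF-8

print(text)
print(pretty)
print(utf8_bytes)
```

Whatever the encoding of the source page, the strings inside the tree are ordinary Python str values, and bytes only appear again when the tree is encoded for output.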

The find method in bs4

from bs4 import BeautifulSoup
import requests
data='''

<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <div class="useful">
      <ul>
        <li class="info">Information I need 1</li>
        <li class="test">Information I need 2</li>
        <li class="iamstrange">Information I need 3</li>
      </ul>
     </div>
     <div class="useful">
       <ul>
         <li class="info">Junk 1</li>
         <li class="info">Junk 2</li>
       </ul>
     </div>
  </body>
</html>'''
# html=requests.get('https://movie.douban.com/top250?start=0').text
# print(html)
# soup=BeautifulSoup(html,'lxml') # first argument: the page source; second argument: the parser
# all_connect=soup.find_all('li')
# for connect in all_connect:
#     name=connect.find(class_='title').string
#     print(name)
soup=BeautifulSoup(data,'lxml')
all_connect=soup.find_all('div')
for connect in all_connect:
    info=connect.find('li').string
    print(info)
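
The loop above prints only the first li of each div, because find returns the first match while find_all returns every match. A minimal sketch of narrowing the search with the class_ keyword follows; the markup is a hypothetical shortened version of the sample data, with the second div given the class useless so the two blocks can be told apart.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample data above.
data = '''
<div class="useful">
  <ul>
    <li class="info">item 1</li>
    <li class="test">item 2</li>
  </ul>
</div>
<div class="useless">
  <ul>
    <li class="info">junk 1</li>
  </ul>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')

# find returns the first matching tag; class_ filters on the class attribute.
useful = soup.find('div', class_='useful')
# find_all called on a tag searches only inside that tag.
wanted = [li.string for li in useful.find_all('li')]

# find_all on the whole document returns every match, from both divs.
all_info = [li.string for li in soup.find_all('li', class_='info')]

# find returns None when nothing matches, so check before using the result.
missing = soup.find('table')

print(wanted)
print(all_info)
print(missing)
```

Searching inside the useful div first is what keeps the junk entries out of the result, even though they share the info class.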