Day 5: Using XML and XPath

Rorschach379

Category: python. Tags: xml, python, programming languages

Article link: https://blog.csdn.net/weixin_63123211/article/details/122499194

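Review

# 1. Font-based anti-scraping
# Font file formats: woff, woff2
# Online preview: https://kekee000.github.io/fonteditor/
# int('4e00', base=16)

# 2. Login-based anti-scraping
# 1) requests: add the 'cookie' key and its value to headers
# 2) selenium:
# a. Get the cookies and save them locally: b.get_cookies()
# b. Use the cookies: open the page  -> add the cookies  -> reopen the page

# 3. Proxy IPs
# 1) requests   -   pass the proxies parameter: {'http': 'ip:port', 'https': 'ip:port'}
# 2) selenium  -  add it to the browser options

# Note: http is compatible with https, but https is not compatible with http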
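
The `int('4e00', base=16)` call in the review can be illustrated with a small sketch. It assumes the obfuscated font names its glyphs in the common `uniXXXX` style; that naming convention and the helper `glyph_name_to_char` are illustrative assumptions, not something the notes above specify.

```python
# Convert a hexadecimal code point into the character it stands for.
code_point = int('4e00', base=16)   # hexadecimal string -> integer
char = chr(code_point)              # integer code point -> character
print(code_point, char)


def glyph_name_to_char(name):
    """Hypothetical helper: map a glyph name such as 'uni4E8C' to its character.

    Assumes the font uses the 'uni' + hex code point naming style.
    """
    return chr(int(name[3:], base=16))


print(glyph_name_to_char('uni4E8C'))
```

Real font files may name glyphs differently, so the mapping usually has to be checked against the font in the online previewer first.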

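Trees and nodes

XPath is a syntax for parsing content that has a tree structure.
Common tree-structured content: HTML, XML

1. Terminology

Tree - the whole HTML document or the whole XML document is one tree
Node - each tag (element) in the HTML or XML
Root node - the topmost node of the tree (the first node)
Child nodes and parent nodes - a node nested directly inside another node is that node's child, and the enclosing node is its parent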
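
These terms map directly onto lxml objects. The following is a minimal sketch on a tiny made-up document (the tag names are borrowed from the supermarket example used later; the document itself is an assumption for illustration).

```python
from lxml import etree

# A tiny tree: <supermarket> is the root, with two child nodes.
root = etree.XML(
    '<supermarket><all_goods><goods/><goods/></all_goods><all_waiter/></supermarket>'
)

print(root.tag)                              # tag name of the root node
child_tags = [child.tag for child in root]   # iterating a node yields its children
first_goods = root[0][0]                     # children can also be indexed
parent_tag = first_goods.getparent().tag     # walk back up to the parent node
root_parent = root.getparent()               # the root node has no parent

print(child_tags, parent_tag, root_parent)
```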

2. XML data format

XML, like JSON, is a general-purpose data format.

Example: the data of a supermarket
In XML format:

Yonghui Supermarket

Xiaojiahe Building

7:00~22:00
<all_goods>

</all_goods>
<all_waiter>

</all_waiter>
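
To make the comparison with JSON concrete, here is a sketch that stores the same small record in both formats and reads it back with the standard library. The tag and key names (`name`, `address`, `price`) and the values are assumptions chosen for illustration, not the exact contents of the original file.

```python
import json
import xml.etree.ElementTree as ET

# The same record written as XML (nested tags) ...
xml_text = """<supermarket>
    <name>Yonghui Supermarket</name>
    <address>Xiaojiahe Building</address>
    <all_goods>
        <goods><name>bread</name><price>7.5</price></goods>
    </all_goods>
</supermarket>"""

# ... and as JSON (nested objects and arrays).
json_text = """{
    "name": "Yonghui Supermarket",
    "address": "Xiaojiahe Building",
    "all_goods": [{"name": "bread", "price": 7.5}]
}"""

root = ET.fromstring(xml_text)
data = json.loads(json_text)

xml_name = root.findtext('name')
json_name = data['name']
# XML content is always text, so numbers must be converted explicitly.
xml_price = float(root.findtext('all_goods/goods/price'))
json_price = data['all_goods'][0]['price']

print(xml_name, json_name, xml_price, json_price)
```

One practical difference shown here is that JSON carries basic types such as numbers, while XML content arrives as strings.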

To parse HTML or XML data with XPath syntax in Python, first install the third-party library lxml (a library implemented in C).

from lxml import etree

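# 1. Build the tree and get its root node
# etree.XML(xml_data)   -   parse the given XML data into a tree and return the root node
# etree.HTML(html_data)    -   parse the given HTML data into a tree and return the root node
# Read the file as bytes so that lxml handles any encoding declaration itself
with open('test.xml', 'rb') as f:
    supermarket = etree.XML(f.read())
print(supermarket)      # <Element supermarket at 0x103a5ee80>

# 2. Get nodes (tags)
# node.xpath(path)  -  return all nodes matching the path, as a list
# 1) Absolute path:  /absolute/path
# Whichever node calls xpath, an absolute path must be written level by level starting from the root node
# Form: /root/node1/node2/...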
goods_list = supermarket.xpath('/supermarket/all_goods/goods')
print(goods_list)

names = supermarket.xpath('/supermarket/all_goods/goods/name')
print(names)
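
Since `test.xml` itself is not shown, the following self-contained sketch writes a small stand-in file and queries it with an absolute path. The file contents are an assumption that only mirrors the tag names used in the code above; `etree.parse(...).getroot()` is shown as an alternative to reading the file manually.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for test.xml (bytes, with an encoding declaration).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<supermarket>
    <all_goods>
        <goods><name>bread</name><price>7.5</price></goods>
        <goods><name>milk</name><price>2.5</price></goods>
    </all_goods>
</supermarket>"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.xml')
    with open(path, 'wb') as f:
        f.write(SAMPLE)
    # etree.parse reads the file itself; getroot() returns the root node.
    root = etree.parse(path).getroot()

# xpath returns a list of element objects; .text gives each element's content.
names = root.xpath('/supermarket/all_goods/goods/name')
name_texts = [node.text for node in names]
print(root.tag, name_texts)
```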

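2) Relative paths: ./relative/path, ../relative/path

. is the current node, that is, the node that calls .xpath. The leading ./ can be omitted when writing a relative path.
.. is the parent of the current node.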

all_goods = supermarket.xpath('/supermarket/all_goods')[0]
print(all_goods)

# Absolute path
result = supermarket.xpath('/supermarket/all_goods/goods/name')
print(result)

# Relative path
result = supermarket.xpath('./all_goods/goods/name')
print(result)

# Relative path (leading ./ omitted)
result = supermarket.xpath('all_goods/goods/name')
print(result)

# Absolute path (still starts from the root, even when called on all_goods)
result = all_goods.xpath('/supermarket/all_goods/goods/name')
print(result)

# Relative path
result = all_goods.xpath('./goods/name')
print(result)

# Relative path (leading ./ omitted)
result = all_goods.xpath('goods/name')
print(result)

# Relative path through the parent node (..)
result = all_goods.xpath('../all_goods/goods/name')
print(result)
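
The difference between the path styles is easiest to see when the same queries run from a non-root node. This is a minimal sketch on an inline document; the XML content is a made-up stand-in for `test.xml`.

```python
from lxml import etree

# Hypothetical stand-in for test.xml.
XML = """
<supermarket>
    <all_goods>
        <goods><name>bread</name></goods>
        <goods><name>milk</name></goods>
    </all_goods>
    <all_waiter>
        <waiter><name>Xiao Ming</name></waiter>
    </all_waiter>
</supermarket>
"""

root = etree.XML(XML.strip())
all_goods = root.xpath('/supermarket/all_goods')[0]


def texts(nodes):
    """Return the text content of each element in a result list."""
    return [node.text for node in nodes]


# An absolute path ignores the calling node and starts at the root.
absolute = texts(all_goods.xpath('/supermarket/all_goods/goods/name'))
# Relative paths start at the calling node; './' is optional.
relative = texts(all_goods.xpath('./goods/name'))
relative_short = texts(all_goods.xpath('goods/name'))
# '..' climbs to the parent, so a sibling subtree becomes reachable.
via_parent = texts(all_goods.xpath('../all_waiter/waiter/name'))

print(absolute, relative, relative_short, via_parent)
```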

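3. Getting tag content and tag attributes

path/text() - get the text content of the selected tags

path/@attribute - get the value of the given attribute

# 3) //path  -  search the whole tree for nodes matching the path
# A path starting with // does not depend on which node calls xpath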
result = supermarket.xpath('//name/text()')
print(result)

result = supermarket.xpath('//goods/name/text()')
print(result)

result = supermarket.xpath('//waiter/name/text()')
print(result)


result = supermarket.xpath('//waiter/@tag')
print(result)
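
The following sketch shows `text()`, `@attribute` and `//` together on an inline document. The XML content, including the `tag` values, is assumed for illustration.

```python
from lxml import etree

# Hypothetical stand-in for test.xml.
XML = """
<supermarket>
    <all_goods>
        <goods><name>bread</name></goods>
        <goods><name>milk</name></goods>
    </all_goods>
    <all_waiter>
        <waiter tag="w01"><name>Xiao Ming</name></waiter>
        <waiter tag="w02"><name>Xiao Hua</name></waiter>
    </all_waiter>
</supermarket>
"""

root = etree.XML(XML.strip())
all_goods = root.xpath('/supermarket/all_goods')[0]

# text() returns strings instead of element objects.
all_names = root.xpath('//name/text()')
goods_names = root.xpath('//goods/name/text()')
# @tag returns the attribute values as strings.
waiter_tags = root.xpath('//waiter/@tag')
# '//' searches the whole tree, even when called on the all_goods node.
tags_from_child = all_goods.xpath('//waiter/@tag')

print(all_names, goods_names, waiter_tags, tags_from_child)
```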

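1. Predicates - conditions

1) Position predicates (positions count from 1):

[N] - the Nth node
[last()] - the last node
[last()-N] - the (N+1)th node from the end
[position()<N] - nodes whose position is less than N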

result = supermarket.xpath('/supermarket/all_goods/goods[1]/name[1]/text()')
print(result)

result = supermarket.xpath('./all_goods/goods[last()]/name/text()')
print(result)

result = supermarket.xpath('./all_goods/goods[1]/name[last()-2]/text()')
print(result)

result = supermarket.xpath('./all_goods/goods[1]/name[position()<=2]/text()')
print(result)
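
A minimal sketch of the position predicates, run against an inline document in which the first `goods` has three `name` children. The XML content is a made-up stand-in for `test.xml`.

```python
from lxml import etree

# Hypothetical stand-in for test.xml.
XML = """
<supermarket>
    <all_goods>
        <goods>
            <name>bread</name>
            <name>toast</name>
            <name>baguette</name>
        </goods>
        <goods>
            <name>milk</name>
        </goods>
    </all_goods>
</supermarket>
"""

root = etree.XML(XML.strip())

# [1] selects the first node: XPath positions start at 1, not 0.
first = root.xpath('/supermarket/all_goods/goods[1]/name[1]/text()')
# [last()] selects the last goods node.
last_goods = root.xpath('./all_goods/goods[last()]/name/text()')
# [last()-2] selects the third name from the end.
third_from_end = root.xpath('./all_goods/goods[1]/name[last()-2]/text()')
# [position()<=2] selects the first two names.
first_two = root.xpath('./all_goods/goods[1]/name[position()<=2]/text()')

print(first, last_goods, third_from_end, first_two)
```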

2) Attribute predicates

[@attribute] - tags that have the given attribute
[@attribute=value] - tags whose given attribute equals the given value

result = supermarket.xpath('./all_goods/goods[1]/name[@class]/text()')
print(result)

result = supermarket.xpath('./all_goods/goods[1]/name[@class="c2"]/text()')
print(result)
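
The two attribute predicates can be compared on an inline document where only some `name` tags carry a `class` attribute. The XML content is assumed for illustration.

```python
from lxml import etree

# Hypothetical stand-in for test.xml.
XML = """
<supermarket>
    <all_goods>
        <goods>
            <name class="c1">bread</name>
            <name class="c2">toast</name>
            <name>baguette</name>
        </goods>
    </all_goods>
</supermarket>
"""

root = etree.XML(XML.strip())

# [@class] keeps every name that has a class attribute, whatever its value.
with_class = root.xpath('./all_goods/goods[1]/name[@class]/text()')
# [@class="c2"] keeps only the names whose class equals "c2".
class_c2 = root.xpath('./all_goods/goods[1]/name[@class="c2"]/text()')

print(with_class, class_c2)
```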

3) Content predicates

[child>value] - tags whose named child tag has content greater than the given value

result = supermarket.xpath('./all_goods/goods[price>3]/count/text()')
print(result)

result = supermarket.xpath('./all_goods/goods[price=7.5]/name/text()')
print(result)
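
A short sketch of content predicates follows. XPath converts the child's text to a number before comparing, so `price>3` works on text such as `7.5`. The prices and counts are made-up values.

```python
from lxml import etree

# Hypothetical stand-in for test.xml.
XML = """
<supermarket>
    <all_goods>
        <goods><name>bread</name><price>7.5</price><count>10</count></goods>
        <goods><name>milk</name><price>2.5</price><count>30</count></goods>
    </all_goods>
</supermarket>
"""

root = etree.XML(XML.strip())

# Keep only goods whose <price> child is numerically greater than 3.
counts_over_3 = root.xpath('./all_goods/goods[price>3]/count/text()')
# Keep only goods whose <price> child equals 7.5.
names_at_7_5 = root.xpath('./all_goods/goods[price=7.5]/name/text()')

print(counts_over_3, names_at_7_5)
```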

1. Wildcard: * stands for any tag (node) or any attribute

1) * as any tag

result = supermarket.xpath('//*[@class="c1"]/text()')     # text of every tag whose class is c1
print(result)

result = supermarket.xpath('//goods[2]/*/text()')
print(result)

2) * as any attribute

result = supermarket.xpath('//goods[1]/name[@*]/text()')
print(result)

result = supermarket.xpath('//goods[1]/name[2]/@*')
print(result)
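
Both uses of the wildcard can be checked on an inline document. The XML content is a made-up stand-in for `test.xml`.

```python
from lxml import etree

# Hypothetical stand-in for test.xml.
XML = """
<supermarket>
    <all_goods>
        <goods>
            <name class="c1">bread</name>
            <name class="c2">toast</name>
            <name>baguette</name>
        </goods>
        <goods>
            <name>milk</name>
            <price>2.5</price>
            <count>30</count>
        </goods>
    </all_goods>
</supermarket>
"""

root = etree.XML(XML.strip())

# '*' as a tag: any element whose class is "c1".
class_c1 = root.xpath('//*[@class="c1"]/text()')
# '*' as a tag: every child element of the second goods.
second_goods = root.xpath('//goods[2]/*/text()')
# '@*' in a predicate: names that have at least one attribute.
has_any_attr = root.xpath('//goods[1]/name[@*]/text()')
# '@*' as the last step: all attribute values of the second name.
attr_values = root.xpath('//goods[1]/name[2]/@*')

print(class_c1, second_goods, has_any_attr, attr_values)
```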

2. Union - |

Join several independent paths with | in one path expression to fetch the results of all of them at once.

# result = supermarket.xpath('//goods[2]/name/text()')
result = supermarket.xpath('//goods[2]/name/text()|//goods[2]/price/text()|//goods[2]/count/text()')
print(result)
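
The union operator can be tried on an inline document as follows. The XML content is assumed for illustration. The second query lists the paths in a different order; with lxml the combined results typically come back in document order rather than in the order the paths were written, so the sketch only relies on both queries returning the same items.

```python
from lxml import etree

# Hypothetical stand-in for test.xml.
XML = """
<supermarket>
    <all_goods>
        <goods><name>bread</name><price>7.5</price><count>10</count></goods>
        <goods><name>milk</name><price>2.5</price><count>30</count></goods>
    </all_goods>
</supermarket>
"""

root = etree.XML(XML.strip())

# Three independent paths combined into one query.
combined = root.xpath(
    '//goods[2]/name/text()|//goods[2]/price/text()|//goods[2]/count/text()'
)
# The same paths written in another order select the same items.
reordered = root.xpath(
    '//goods[2]/count/text()|//goods[2]/name/text()|//goods[2]/price/text()'
)

print(combined, reordered)
```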

Scraping rental listings

import requests
from lxml import etree
from re import sub


def get_html(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    r.encoding = r.apparent_encoding
    return r.text


def analysis_data():
    html = etree.HTML(get_html('https://cd.zu.ke.com/zufang'))
    # Each direct child <div> of the listing container is one house
    all_houses = html.xpath('//div[@class="content__list"]/div')
    houses = []
    for house in all_houses:
        h = {}
        # Listing title; fall back to '' when the attribute is missing
        name = house.xpath('./a/@title')
        h['name'] = name[0] if name else ''
        # Price: the number is in <em>, the unit in the surrounding <span>
        price = house.xpath('div/span/em/text()|div/span/text()')
        h['price'] = ''.join(price)
        # Location names, joined with '-'
        address = house.xpath('div/p[2]/a/text()')
        h['address'] = '-'.join(address)
        # Text nodes of the description line; area, orientation and
        # layout are read from the end of the list
        message = house.xpath('div/p[2]/text()')
        if len(message) < 4:
            # Skip entries that lack the expected fields
            continue
        h['area'] = sub(r'\s+', '', message[-4])
        h['orientation'] = message[-3].strip()
        h['house_type'] = sub(r'\s+', '', message[-2])
        houses.append(h)
    print(houses)


if __name__ == '__main__':
    analysis_data()
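
The per-item pattern used above (select the item nodes first, then run relative paths on each one) can be tested offline. The sketch below uses a small hand-written HTML fragment; its structure is a simplified assumption and does not reproduce the real page, and `first_or_default` is a hypothetical helper.

```python
from lxml import etree

# Simplified, made-up fragment: the second listing has no <a title="...">.
HTML = """
<div class="content__list">
  <div><a title="Flat A"></a><div><span><em>2300</em> yuan/month</span></div></div>
  <div><div><span><em>1800</em> yuan/month</span></div></div>
</div>
"""


def first_or_default(values, default=''):
    """Hypothetical helper: first xpath result, or a default if none matched."""
    return values[0] if values else default


page = etree.HTML(HTML)
houses = []
for house in page.xpath('//div[@class="content__list"]/div'):
    # Relative paths are evaluated against the current listing only.
    name = first_or_default(house.xpath('./a/@title'))
    price = ''.join(house.xpath('div/span/em/text()|div/span/text()'))
    houses.append({'name': name, 'price': price})

print(houses)
```

Guarding every lookup this way keeps one incomplete listing from raising an IndexError and stopping the whole run.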

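Scraping product information

import time

from lxml import etree
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


def get_net_data():
    options = ChromeOptions()
    # Do not load images, so pages open faster
    options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
    b = Chrome(options=options)
    b.get('https://www.jd.com')
    b.implicitly_wait(5)
    # find_element_by_id was removed in newer Selenium releases; use find_element(By.ID, ...)
    search = b.find_element(By.ID, 'key')
    search.send_keys('零食')
    search.send_keys(Keys.ENTER)
    # Scroll down step by step so that lazily loaded items are rendered
    height = 500
    while True:
        time.sleep(1)
        b.execute_script(f'window.scrollTo(0, {height})')
        height += 800
        if height > 7000:
            break

    result = b.page_source
    b.close()
    return result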


def analysis_data():
    html = etree.HTML(get_net_data())
    # all_li = html.xpath('//div[@id="J_goodsList"]/ul/li')
    # all_goods = []
    # for li in all_li:
    #     name = li.xpath('./div/div[3]/a/em/text()')[0]
    #     price = li.xpath('./div/div[2]/strong/i/text()')[0]
    #     comment = li.xpath('./div/div[@class="p-commit"]/strong/a/text()')[0]
    #     goods = [name, price, comment]
    #     all_goods.append(goods)
    # print(all_goods, len(all_goods))

    names = html.xpath('//div[@class="p-name p-name-type-2"]/a/em/text()')
    prices = html.xpath('//div[@class="p-price"]/strong/i/text()')
    comments = html.xpath('//div[@class="p-commit"]/strong/a/text()')
    shop = html.xpath('//div[@class="p-shop"]/span/a/text()')
    goods = list(map(lambda n, p, c, s: [n, p, c, s], names, prices, comments, shop))
    print(goods)


if __name__ == '__main__':
    analysis_data()
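
Collecting each field as a separate column and combining the columns with `map` works only when every product has every field. The sketch below, on a made-up HTML fragment with simplified class names, shows how a missing field shifts the pairing, and how the per-item approach from the commented-out code avoids that. The helper `first` is hypothetical.

```python
from lxml import etree

# Made-up fragment: the first product has no shop element.
HTML = """
<ul>
  <li><div class="p-name"><a><em>Chips</em></a></div></li>
  <li><div class="p-name"><a><em>Cookies</em></a></div>
      <div class="p-shop"><span><a>Shop B</a></span></div></li>
</ul>
"""

page = etree.HTML(HTML)

# Column-wise: one list per field, paired up by position.
names = page.xpath('//div[@class="p-name"]/a/em/text()')
shops = page.xpath('//div[@class="p-shop"]/span/a/text()')
# map stops at the shortest list and pairs 'Chips' with the wrong shop.
column_wise = list(map(lambda n, s: [n, s], names, shops))


def first(values, default=''):
    """Hypothetical helper: first xpath result, or a default if none matched."""
    return values[0] if values else default


# Row-wise: query each <li> separately so fields stay with their product.
row_wise = [
    [
        first(li.xpath('./div[@class="p-name"]/a/em/text()')),
        first(li.xpath('./div[@class="p-shop"]/span/a/text()')),
    ]
    for li in page.xpath('//li')
]

print(column_wise)
print(row_wise)
```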
