Working with lxml in Python

panda-star


Article link: https://blog.csdn.net/chinabestchina/article/details/104874450

1. Introduction

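Python crawlers commonly use BeautifulSoup to parse HTML, but it is prone to running out of memory. This article shows how to use another tool, lxml, to extract HTML elements, and compares it with the BeautifulSoup approach.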

2. Usage

The code below shows each approach; see the comments for details.

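#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
from lxml import etree, html
from lxml.cssselect import CSSSelector
from lxml.html import soupparser

# CSS selector for the first <li> of the first <ul> under #headLineSichuan
SELECTOR = '#headLineSichuan > ul:nth-child(1) > li:nth-child(1)'


def get_html():
    # Return the raw bytes so that each parser can detect the encoding itself
    res = requests.get('http://www.ifeng.com/', timeout=10)
    res.raise_for_status()
    return res.content


def main():
    # Fetch the page once and reuse it for all four approaches
    html_bytes = get_html()

    # Approach 1: parse with lxml.html, which uses lxml's own (etree-based) HTML parser.
    # cssselect() requires the separate cssselect package.
    page = html.fromstring(html_bytes)
    eles = page.cssselect(SELECTOR)
    content = eles[0].text_content()
    print(content)

    # Approach 2: parse with lxml.html, using BeautifulSoup as the parser; handles encodings well
    page = soupparser.fromstring(html_bytes)
    eles = page.cssselect(SELECTOR)
    content = eles[0].text_content()
    print(content)

    # Approach 3: parse the HTML the XML way, with lxml.etree.
    # Plain etree elements have no cssselect() method, so apply a CSSSelector instead.
    page = etree.HTML(html_bytes)
    eles = CSSSelector(SELECTOR)(page)
    content = eles[0].xpath('string(.)')
    print(content)

    # Approach 4: parse with BeautifulSoup itself. Note that the child selector is written
    # differently here (nth-of-type); older BeautifulSoup versions do not support nth-child.
    page = BeautifulSoup(html_bytes, 'html.parser')
    eles = page.select('#headLineSichuan > ul:nth-of-type(1) > li:nth-of-type(1)')
    content = eles[0].text
    print(content)


if __name__ == '__main__':
    main()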
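
The script above depends on the live page at ifeng.com, whose markup may have changed since this was written, in which case `eles[0]` raises an IndexError. As a minimal offline sketch, the same extraction can be tried on an inline HTML string. The sample markup and the helper names `first_headline_html` and `first_headline_etree` are assumptions made for illustration, not part of the original code. XPath is used instead of `cssselect` so that only lxml itself is needed.

```python
from lxml import etree, html

# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<html><body>
  <div id="headLineSichuan">
    <ul>
      <li><a href="#">First <b>headline</b></a></li>
      <li><a href="#">Second headline</a></li>
    </ul>
  </div>
</body></html>
"""

# XPath for the first <li> of the first <ul> under the element with this id.
XPATH = '//*[@id="headLineSichuan"]/ul[1]/li[1]'


def first_headline_html(source):
    """Parse with lxml.html; return the matched element's text, or None."""
    page = html.fromstring(source)
    eles = page.xpath(XPATH)
    return eles[0].text_content().strip() if eles else None


def first_headline_etree(source):
    """Parse with lxml.etree; return the matched element's text, or None."""
    page = etree.HTML(source)
    eles = page.xpath(XPATH)
    return eles[0].xpath('string(.)').strip() if eles else None


if __name__ == '__main__':
    print(first_headline_html(SAMPLE))
    print(first_headline_etree(SAMPLE))
```

Both helpers return None when nothing matches, rather than failing on an empty result list. `text_content()` and `xpath('string(.)')` both join the text of all descendant nodes, so nested tags such as `<b>` do not split the result.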
