An Introduction to Python's lxml Library: Introducing and Using the lxml HTML Parser Library

Latest recommended article published on 2023-05-23 22:08:32

weixin_39609483


Article tags: introduction to Python's lxml library

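High-performance XML parsing with lxml in Python: http://blog.csdn.net/yatere/article/details/6667043

Parsing HTML with lxml: http://www.cnblogs.com/descusr/archive/2012/06/20/2557075.html

Step-by-step traversal: first locate a particular div under body, then use that div as the starting point and keep traversing downward from it.

    import requests
    from lxml import etree
    import lxml.html.soupparser as soupparser

    def scanningHotArticle(url):
        print(url)
        request = requests.get(url)
        dom = soupparser.fromstring(request.content)
        body = dom[1]  # second child of <html>, taken to be <body>
        articleList = body.xpath("//div[@class='block untagged mb15 bs2']")
        for article in articleList:
            # Serialize the div and re-parse it as a separate tree,
            # so that "//" only searches this div's content.
            articleStr = etree.tostring(article)
            articleBody = soupparser.fromstring(articleStr)
            print(len(articleBody.xpath("//div[@class='detail']")))

The structure is body -*-> div[class='block untagged mb15 bs2']. That div in turn contains many more divs, so each one is treated as a root node and the XPath search continues on the elements beneath it.

Introduction to and usage of lxml, a Python HTML parser library (quick start): http://blog.csdn.net/marising/article/details/5821090

lxml is a Python library that parses HTML/XML and builds a DOM. It is powerful and performs well. It incorporates interfaces to libraries such as ElementTree, html5lib, and BeautifulSoup, yet it also has its own counterparts to them, which makes lxml fairly complex; first-time users find it hard to see how the pieces relate.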
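The step-by-step traversal shown earlier serializes each div and parses it again so that an absolute `//` query stays inside it. A possible alternative, sketched here under the assumption that a relative XPath is acceptable, is to start the expression with `.` so that it is evaluated from the current element. The sample markup and the helper name `count_details` are hypothetical; only the class names come from the example above, and `lxml.html.fromstring` is used instead of soupparser so that BeautifulSoup is not required.

```python
import lxml.html

# Hypothetical sample page; the class names are copied from the example above.
SAMPLE = """
<html><body>
  <div class="block untagged mb15 bs2">
    <div class="detail">first</div>
    <div class="detail">second</div>
  </div>
  <div class="block untagged mb15 bs2">
    <div class="detail">third</div>
  </div>
</body></html>
"""


def count_details(page_source):
    """Return the number of 'detail' divs inside each article div."""
    dom = lxml.html.fromstring(page_source)
    counts = []
    for article in dom.xpath("//div[@class='block untagged mb15 bs2']"):
        # The leading "." makes the search relative to this article div,
        # so no serialize-and-reparse round trip is needed.
        counts.append(len(article.xpath(".//div[@class='detail']")))
    return counts


print(count_details(SAMPLE))
```

Without the leading dot, `article.xpath("//div[@class='detail']")` would search the whole document and match all three detail divs for every article.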

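1. Parsing HTML and building a DOM

    >>> import lxml.etree as etree
    >>> html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
    >>> dom = etree.fromstring(html)
    >>> etree.tostring(dom)
    b'<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'

With the BeautifulSoup-based parser instead:

    >>> import lxml.html.soupparser as soupparser
    >>> dom = soupparser.fromstring(html)
    >>> etree.tostring(dom)
    b'<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'

However, I strongly recommend soupparser, because it handles malformed HTML far better than etree.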
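To make the point about malformed HTML concrete, here is a minimal sketch comparing the strict XML parser with a lenient HTML parser. It swaps in `lxml.html.fromstring` (lxml's built-in HTML parser) for soupparser so that BeautifulSoup does not need to be installed. The broken input string and the helper name `parses_as_xml` are made up for illustration.

```python
from lxml import etree
import lxml.html

# Hypothetical malformed input: neither <div> is ever closed.
BROKEN = "<html><body id='1'>abc<div>123<div>456</body></html>"


def parses_as_xml(source):
    """Return True if the strict XML parser accepts the source."""
    try:
        etree.fromstring(source)
        return True
    except etree.XMLSyntaxError:
        return False


strict_ok = parses_as_xml(BROKEN)

# lxml.html relies on libxml2's HTML parser, which repairs the markup.
dom = lxml.html.fromstring(BROKEN)
div_count = len(dom.xpath("//div"))

print(strict_ok, dom.tag, div_count)
```

`etree.fromstring` is an XML parser, so mismatched tags raise `XMLSyntaxError`, while the HTML parser closes the open divs on its own and still yields a usable tree.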

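2. Accessing Elements as a DOM

Number of child elements:

    >>> len(dom)
    1

Accessing a child element:

    >>> dom[0].tag
    'body'

Looping over the children:

    >>> for child in dom:
    ...     print(child.tag)
    ...
    body

Finding a node's index:

    >>> body = dom[0]
    >>> dom.index(body)
    0

Getting the parent from a child node:

    >>> body.getparent().tag
    'html'

Iterating over the element and all of its descendants:

    >>> for ele in dom.iter():
    ...     print(ele.tag)
    ...
    html
    body
    div
    div

Listing and printing all children (here root stands for any element that has children):

    >>> children = list(root)
    >>> for child in root:
    ...     print(child.tag)

The siblings (or neighbours) of an element are accessed with the getnext() and getprevious() methods:

    >>> root[0] is root[1].getprevious()  # lxml.etree only!
    True
    >>> root[1] is root[0].getnext()  # lxml.etree only!
    True

3. Accessing node attributes

    >>> body.get('id')
    '1'

This also works:

    >>> attrs = body.attrib
    >>> attrs.get('id')
    '1'

Elements with attributes: XML elements support attributes, which can be created directly in the Element factory.

    >>> root = etree.Element("root", interesting="totally")
    >>> etree.tostring(root)
    b'<root interesting="totally"/>'

The set and get methods access these attributes:

    >>> print(root.get("interesting"))
    totally
    >>> root.set("interesting", "somewhat")
    >>> print(root.get("interesting"))
    somewhat

The attrib property also offers a dictionary interface:

    >>> attributes = root.attrib
    >>> print(attributes["interesting"])
    somewhat
    >>> print(attributes.get("hello"))
    None
    >>> attributes["hello"] = "Guten Tag"
    >>> print(attributes.get("hello"))
    Guten Tag
    >>> print(root.get("hello"))
    Guten Tag

4. Accessing an Element's content

    >>> body.text
    'abc'
    >>> body.tail

text covers only the text from the start of the node up to its first child node. tail is the text that follows the node's closing tag, up to the next sibling or the end of the parent.

The direct text nodes of this node:

    >>> body.xpath('text()')
    ['abc', 'def', 'ghi']

The text of this node and its child nodes:

    >>> body.xpath('//text()')
    ['abc', '123', 'def', '456', 'ghi']

Because the expression starts with //, this actually returns all the text in the whole document; .//text() restricts the search to the current node. body.text_content() returns all the text under this node (the method is available on elements created by lxml.html).

5. XPath support

All div elements:

    >>> for ele in dom.xpath('//div'):
    ...     print(ele.tag)
    ...
    div
    div

The element with id="1":

    >>> dom.xpath('//*[@id="1"]')[0].tag
    'body'

The first div under body:

    >>> dom.xpath('body/div[1]')[0].tag
    'div'

References:

Official lxml documentation: http://codespeak.net/lxml/
HTML parser performance: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/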
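As a recap of the text and tail discussion in section 4, the following sketch prints where each piece of text is stored for the small document used throughout this article. The helper name `describe` is hypothetical; the calls (`text`, `tail`, `xpath`, `itertext`) are standard lxml.etree API.

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)
body = dom[0]


def describe(element):
    """Return (tag, text, tail) so the storage location of text is visible."""
    return (element.tag, element.text, element.tail)


# body itself, followed by each of its direct children
rows = [describe(body)] + [describe(child) for child in body]
for row in rows:
    print(row)

# Relative query: text nodes under body only, in document order.
relative_text = body.xpath(".//text()")

# itertext() walks the same text nodes without XPath.
joined = "".join(body.itertext())
print(relative_text, joined)
```

The text that sits between the two divs ("def") belongs to the first div's tail rather than to body.text, which is why body.text alone only yields 'abc'.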