Python lxml package: study notes

探索者v

Category: Technical Documentation. Tags: python, xml, html, lxml

Article link: https://blog.csdn.net/tanzuozhev/article/details/50442243

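The Python lxml package parses XML and HTML files and can locate elements with XPath and CSS selectors. In my opinion it is more powerful and more flexible than BeautifulSoup. This article lists the commonly used functions, based on the official lxml documentation and my own understanding. The code was written for Python 3.4 and lxml 2.0.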

lxml: http://lxml.de/

Supports: Python 2 and Python 3

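Parsing XML, using records from the PubMed literature database as an example

Loading an XML string

There are several ways to load an XML string. The one I use most often is lxml.etree.XML(xml_string); etree.fromstring(xml_string) works as well.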

import lxml.etree 
import urllib.request
from lxml.etree import *
str_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=26693255&retmode=text&rettype=xml'
request = urllib.request.Request(str_url)
xml_text = urllib.request.urlopen(request).read()
root = lxml.etree.XML(xml_text) # xml_text holds the raw XML document (bytes returned by read())

root is an lxml.etree._Element object, which has many methods.

Among them are find, findall, xpath, get, and getchildren; for the full details, run help(root).
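
To try the two loading functions without a network connection, here is a minimal sketch. The SAMPLE_XML string is a hypothetical, heavily trimmed record shaped like a PubMed response, not the real efetch output.

```python
from lxml import etree

# Hypothetical, heavily trimmed record shaped like a PubMed efetch response.
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal>
          <ISSN IssnType="Print">1866-9956</ISSN>
          <Title>Cognitive computation</Title>
        </Journal>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Both calls parse the same input and return the root element.
root_a = etree.XML(SAMPLE_XML)
root_b = etree.fromstring(SAMPLE_XML)

print(type(root_a).__name__)  # _Element
print(root_a.tag)             # PubmedArticleSet

# Serializing both trees gives identical bytes, so the two loaders agree here.
same = etree.tostring(root_a) == etree.tostring(root_b)
print(same)                   # True
```

Working from an inline string like this also makes it easy to experiment with the methods below without hitting the NCBI server on every run.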

### findall, find
findall(…)
| findall(self, path, namespaces=None)
|
| Finds all matching subelements, by tag name or path.
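| (Takes the tag name of a child element, or a path expression that must be relative, for example one starting with .//, and returns every matching element as a list.)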
| The optional namespaces argument accepts a
| prefix-to-namespace mapping that allows the usage of XPath
| prefixes in the path expression.

# Example: get the journal name and ISSN
# With tag names as input, you have to descend one level at a time
journal_name = root.find('PubmedArticle').find('MedlineCitation').find('Article').find('Journal').find('Title').text
print('tag:', journal_name)

tag: Cognitive computation

# A path expression also works (it must be relative, e.g. starting with .//; for absolute paths use the xpath function)
journal_name = root.find('.//Title').text
print('xpath:', journal_name)

xpath: Cognitive computation

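# text is an attribute of the element object and gives the content inside the tag.
# To read an XML attribute of the tag, use the get function.
# For example, to read the IssnType attribute of <ISSN IssnType="Print">1866-9956</ISSN>, call get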
issn_attr = root.find('.//ISSN').get('IssnType')
print('issn attr:', issn_attr)

issn attr: Print
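
As a small offline sketch of attribute access, the snippet below parses only the ISSN element quoted above. The attribute name Missing is hypothetical and is there only to show what happens when an attribute is absent.

```python
from lxml import etree

# The single ISSN element quoted above, parsed on its own.
issn = etree.XML(b'<ISSN IssnType="Print">1866-9956</ISSN>')

print(issn.text)                    # 1866-9956
print(issn.get('IssnType'))         # Print
print(issn.get('Missing'))          # None: the attribute does not exist
print(issn.get('Missing', 'n/a'))   # n/a: second argument is the fallback
print(dict(issn.attrib))            # {'IssnType': 'Print'}
```

Because get returns None for a missing attribute instead of raising, it is worth passing a default whenever the result is used as a string afterwards.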

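# Use the tostring function
# It serializes the element together with everything under it. tostring is a module-level function of lxml.etree, so import it first (from lxml.etree import *) or call it as lxml.etree.tostring
tostring(root.find('.//JournalIssue')) # serialize the whole JournalIssue element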

b'<JournalIssue CitedMedium="Print">\n                    <Volume>7</Volume>\n                    <Issue>6</Issue>\n                    <PubDate>\n                        <MedlineDate>2015</MedlineDate>\n                    </PubDate>\n                </JournalIssue>\n                '

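findall works like find. find is roughly equivalent to findall('tag')[0], except that find returns None when nothing matches instead of raising an IndexError.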
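
The difference between the two is easiest to see on an element that occurs more than once. This is a minimal sketch; the author names and the Editor tag are hypothetical sample data, not taken from the real record.

```python
from lxml import etree

# Hypothetical sample with a repeated element.
SAMPLE_XML = b"""<PubmedArticle>
  <AuthorList>
    <Author><LastName>Li</LastName></Author>
    <Author><LastName>Wang</LastName></Author>
  </AuthorList>
</PubmedArticle>"""

root = etree.XML(SAMPLE_XML)

authors = root.findall('.//Author')       # every match, as a list
first = root.find('.//Author')            # only the first match
missing_one = root.find('.//Editor')      # None: nothing matched
missing_all = root.findall('.//Editor')   # []: nothing matched

last_names = [a.find('LastName').text for a in authors]
print(last_names)                   # ['Li', 'Wang']
print(first.find('LastName').text)  # Li
print(missing_one, missing_all)     # None []
```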

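The xpath function

For learning XPath itself, see http://www.w3school.com.cn/xpath/xpath_syntax.asp
Like findall, xpath returns a list when the expression selects nodes. The differences are that it accepts only XPath expressions, and that those expressions may be either relative or absolute.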

journal_name = root.xpath('//Title')[0].text
print(journal_name)

Cognitive computation
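
Beyond selecting elements, an XPath expression can return text, attribute values, or numbers directly. A minimal offline sketch, assuming a hypothetical trimmed record with the same Journal structure as the PubMed example:

```python
from lxml import etree

# Hypothetical, trimmed record with the same Journal structure as above.
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <Journal>
      <ISSN IssnType="Print">1866-9956</ISSN>
      <Title>Cognitive computation</Title>
    </Journal>
  </PubmedArticle>
</PubmedArticleSet>"""

root = etree.XML(SAMPLE_XML)

titles = root.xpath('//Title/text()')                  # text nodes as strings
issn_types = root.xpath('//ISSN/@IssnType')            # attribute values as strings
print_issn = root.xpath('//ISSN[@IssnType="Print"]')   # elements filtered by a predicate
absolute = root.xpath('/PubmedArticleSet/PubmedArticle/Journal/Title')
relative = root.xpath('.//Journal/Title')
issn_count = root.xpath('count(//ISSN)')               # XPath numbers come back as float

print(titles)              # ['Cognitive computation']
print(issn_types)          # ['Print']
print(print_issn[0].text)  # 1866-9956
print(issn_count)          # 1.0
```

Selecting text() or @attribute in the expression itself removes one step compared with fetching the element and then reading .text or calling get.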

The getchildren function

Returns all direct child elements.
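
As far as I know, the lxml documentation marks getchildren as deprecated and suggests list(element) or plain iteration instead, so this sketch uses those forms. The Journal fragment is a hypothetical, trimmed sample.

```python
from lxml import etree

# Hypothetical, trimmed Journal fragment with three direct children.
root = etree.XML(
    b'<Journal><ISSN IssnType="Print">1866-9956</ISSN>'
    b'<JournalIssue><Volume>7</Volume></JournalIssue>'
    b'<Title>Cognitive computation</Title></Journal>'
)

children = list(root)                  # direct children only, in document order
tags = [child.tag for child in root]   # iterating the element yields the same children

print(tags)       # ['ISSN', 'JournalIssue', 'Title']
print(len(root))  # 3: Volume is a grandchild and is not counted
```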

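Note: when using findall, find, or xpath, always confirm that the element exists (for example with an if check) before reading its text attribute. Otherwise you will run into errors such as "Type 'NoneType' cannot be serialized.", "list index out of range", or "'NoneType' object has no attribute 'text'".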
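
One way to apply this check is sketched below. The helper text_or_default is a hypothetical name introduced for illustration, and the AbstractText tag is deliberately absent from the sample.

```python
from lxml import etree

# Hypothetical sample that has a Title but no AbstractText.
root = etree.XML(
    b'<Article><Journal><Title>Cognitive computation</Title></Journal></Article>'
)

def text_or_default(element, path, default=None):
    """Hypothetical helper: text of the first match for path, or default."""
    found = element.find(path)
    # Compare with None explicitly: the truth value of an element
    # reflects whether it has children, not whether it was found.
    if found is None:
        return default
    return found.text

title = text_or_default(root, './/Title')
abstract = text_or_default(root, './/AbstractText', default='')

# xpath returns a list, so check that it is non-empty before indexing.
hits = root.xpath('//AbstractText')
first_hit = hits[0] if hits else None

# findtext performs a similar lookup in a single call.
builtin = root.findtext('.//AbstractText', default='')

print(title)      # Cognitive computation
print(first_hit)  # None
```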

Besides the read functions above, lxml also offers many functions for modifying a document. See the official lxml documentation for details.

Parsing HTML with lxml, using the weekly word-of-mouth chart on the Douban Movie homepage as an example

http://movie.douban.com/

To load an HTML string, use lxml.html.fromstring(html_text)

import lxml.html
str_url = 'http://movie.douban.com/'
request = urllib.request.Request(str_url)
html_text = urllib.request.urlopen(request).read()
root = lxml.html.fromstring(html_text)

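find and findall are still available and work exactly as in the XML section: they take a child tag name or a relative path as input, so they are not covered again here.

cssselect() takes a CSS selector, much like jQuery, and returns a list of all matching elements.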

# Get the names of all entries on the page
movies_list = [a.text for a in root.cssselect('div.billboard-bd tr td a')]
print(movies_list)

['老炮儿', '八恶人', '卡罗尔', '海街日记', '荒野猎人', '寻龙诀', '丹麦女孩', '龙虾', '边境杀手', '实习生']

# Get the hyperlinks of all movies
movies_href = [a.get('href') for a in root.cssselect('div.billboard-bd tr td a')]
print(movies_href)

['http://movie.douban.com/subject/24751756/', 'http://movie.douban.com/subject/25787888/', 'http://movie.douban.com/subject/10757577/', 'http://movie.douban.com/subject/25895901/', 'http://movie.douban.com/subject/5327268/', 'http://movie.douban.com/subject/3077412/', 'http://movie.douban.com/subject/3071604/', 'http://movie.douban.com/subject/20514947/', 'http://movie.douban.com/subject/25881247/', 'http://movie.douban.com/subject/10594965/']
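
The same selection can be tried offline and written as XPath, which may be useful because recent lxml releases need the separate cssselect package for cssselect(). The markup in SAMPLE_HTML is a hypothetical guess at the chart's structure, reusing the first two titles and links from the output above; the real Douban page is likely different.

```python
import lxml.html

# Hypothetical markup shaped like the billboard table; not the real page source.
SAMPLE_HTML = """
<html><body>
<div class="billboard-bd">
  <table>
    <tr><td class="order">1</td>
        <td class="title"><a href="http://movie.douban.com/subject/24751756/">老炮儿</a></td></tr>
    <tr><td class="order">2</td>
        <td class="title"><a href="http://movie.douban.com/subject/25787888/">八恶人</a></td></tr>
  </table>
</div>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE_HTML)

# XPath counterpart of the CSS selector 'div.billboard-bd tr td a':
# the class test matches a whole class token, and // mirrors the
# descendant combinator (the space) in CSS.
CLASS_TEST = 'contains(concat(" ", normalize-space(@class), " "), " billboard-bd ")'
links = root.xpath('//div[' + CLASS_TEST + ']//tr//td//a')

movies_list = [a.text for a in links]
movies_href = [a.get('href') for a in links]

print(movies_list)  # ['老炮儿', '八恶人']
print(movies_href)
```

With cssselect installed, root.cssselect('div.billboard-bd tr td a') should pick the same two anchors from this sample.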