Beautiful Soup
- Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise parsing and extracting HTML/XML data
- BeautifulSoup is a Python library for extracting data from HTML or XML files; compared with regular expressions it is much simpler to use and often saves a lot of time
- lxml only traverses the document locally, while Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory overhead are much higher and its performance is lower than lxml's
- Official Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
- Beautiful Soup is easy to install with pip:
- pip install beautifulsoup4
Specifying a parser
- BeautifulSoup needs an available parser to parse a page. The main parsers are:
- Python standard library
- Usage: BeautifulSoup(markup, "html.parser")
- Pros: built into Python, moderate speed, decent tolerance of malformed documents
- Cons: versions before Python 2.7.3 / 3.2.2 tolerate malformed documents poorly
- lxml HTML parser
- Usage: BeautifulSoup(markup, "lxml")
- Pros: fast, very tolerant of malformed documents
- Cons: requires the C library to be installed
- lxml XML parser
- Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
- Pros: fast, the only parser that supports XML
- Cons: requires the C library to be installed
- The WeChat Official Account API, for example, is XML-based
- html5lib
- Usage: BeautifulSoup(markup, "html5lib")
- Pros: the best fault tolerance; parses documents the way a browser does and generates valid HTML5
- Cons: very slow; requires an external Python dependency
- Because parsing speed affects the whole system in large-scale crawls, lxml is recommended: it is much faster, but must be installed separately: pip install lxml
soup = BeautifulSoup(html_doc,"lxml") # specify the parser
- Note: if an HTML or XML document is malformed, different parsers may return different results, so always specify one parser explicitly
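The effect of the parser choice can be seen directly: different parsers repair the same malformed fragment differently, which is why pinning one down matters. A minimal sketch (the fragment is made up; html.parser ships with Python, lxml must be installed separately):

```python
from bs4 import BeautifulSoup

broken = "<p>one<p>two"  # unclosed p tags

# the stdlib parser is always available
print(BeautifulSoup(broken, "html.parser"))

# lxml (if installed) normalizes harder: it wraps the fragment
# in a full <html><body>...</body></html> skeleton
try:
    print(BeautifulSoup(broken, "lxml"))
except Exception:
    print("lxml is not installed: pip install lxml")
```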
prettify()
- Completes non-well-formed HTML according to the spec and pretty-prints it, which also makes it easier to read
import requests
from lxml import etree
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}
# url = 'http://httpbin.org'
# res = requests.get(url,headers=headers)
# html_str = res.content.decode('utf8')
# html = etree.HTML(html_str)
html_str = """
<div class="bulletin">
<svg class="icon icon-bulletin">
<use xlink:href="#bulletin"></use>
</svg>
<span class="text">公告:</span>
<span class="bulletin-content"><a href="https://blog.csdn.net//xufive/columnpay/9506563?utm_source=payColumnMp"><span style="color: #CA0C16">跟“风云卫星”数据工程师学Python~</span></a> </span>
<svg class="icon icon-bulletin_close">
<use xlink:href="#bulletin_close"></use>
</svg>
</div>
<script type="text/javascript">
$(document).on('click','.bulletin svg.icon-bulletin_close',function() {
$('.bulletin').fadeOut(500);
})
</script>
"""
bs = BeautifulSoup(html_str,'lxml')
html = bs.prettify()
print(html)
Usage
- The Tag object is arguably the most important object in BeautifulSoup; extracting data with BeautifulSoup revolves mostly around it
- Tag simply means an HTML tag, and Tag objects have many methods and attributes
- name
- every Tag object has a name attribute, the name of the tag
- Attributes
- in HTML a tag may carry several attributes, so tag attributes are accessed the same way as a dictionary
get_text()
- get_text() returns all the text content under a tag
- A node can contain several child nodes and strings; for example, the html node contains the head and body nodes, so BeautifulSoup can represent an HTML page as nested layers of nodes like this
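A minimal sketch of these Tag basics on a made-up fragment (name, dict-style attribute access, and get_text()):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro">Hello <b>world</b></p>', "html.parser")
tag = soup.p

print(tag.name)        # the tag's name: 'p'
print(tag["class"])    # multi-valued attributes like class come back as a list: ['intro']
print(tag.attrs)       # all attributes as a dict
print(tag.get_text())  # all text under the tag: 'Hello world'
```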
Tag and traversing the document tree
find_all
- Accessing tags directly as attributes, as above, only works in fairly simple cases, so BeautifulSoup also provides find_all() to search the entire document tree
- find_all() returns a list whose elements are of type bs4.element.Tag (the same type find() returns); they print as strings because __repr__ is called automatically
- Search by name: find_all('b') finds every b tag in the document tree and returns them as a list
- Search by attribute: the tag name alone is often not enough, because many tags share a name; in that case search by the tag's attributes by passing a dict to the attrs parameter
soup.find_all(attrs={'class':'sister'})
- Search by text: find_all() can also search by text content
soup.find_all(text="Elsie")
- Restrict the search to direct children: find_all() searches all descendants by default; set recursive=False to limit the search to direct children
soup.html.find_all("title",recursive=False)
- Filter results with regular expressions: BeautifulSoup cooperates with the re module; pass an object compiled with re.compile to find_all() to search by regex
tags = soup.find_all(re.compile("^b"))
- CSS selectors
- BeautifulSoup also supports searching with CSS selectors: call select() with a selector string to find tags using CSS selector syntax
- Example
import requests
from lxml import etree
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}
url = 'https://mkt.51job.com/tg/sem/pz_2018.html?from=baidupz'
res = requests.get(url,headers=headers)
html_str = res.content.decode('gbk')
# html = etree.HTML(html_str)
soup = BeautifulSoup(html_str,'lxml')
# print(soup.title,soup.title.name,soup.title.get_text()) #<title>招聘网_人才网_找工作_求职_上前程无忧</title> title 招聘网_人才网_找工作_求职_上前程无忧
# print(type(soup.title),type(soup.title.name),type(soup.title.get_text()),type(soup.title.string)) #<class 'bs4.element.Tag'> <class 'str'> <class 'str'> <class 'bs4.element.NavigableString'>
# print(soup.title.string,soup.title.parent) # the title text (as a string), and title's parent tag: <head>...</head>
ches = soup.head.children # direct children of head (an iterator)
# for ch in ches:
# print(ch)
# html = bs.prettify()
#get all the span tags
span = soup.find_all('span')
#get the first two span tags
span = soup.find_all('span',limit=2)
#get the second span tag
span = soup.find_all('span',limit=2)[1]
# 1. plain keyword argument: all p tags with class=ipt
p = soup.find_all('p',class_ = 'ipt')
# 2. attrs dict keyword argument
p = soup.find_all('p',attrs={'class':'ipt'})
# pin down the tag with several attributes
input = soup.find_all('input',id='kwdselectid',class_='mytxt')
input = soup.find_all('input',attrs={'id':'kwdselectid','class':'mytxt'})
inputs = soup.find_all('input')[1:] # skip the first input tag, start from the second (index 1)
# for input in inputs:
# reading an attribute can raise KeyError: not every input tag has a value attribute, and missing keys raise, just as with a dict
# 1.
# id = input['value'] # dict-style access: look the value up by key
#2.
# a = input.attrs['value']
# print(a)
# get the first input tag (also a Tag)
input = soup.input
# print(type(input))
# for sp in span:
# print(type(sp))
# print('='*20)
a = soup.find('span') # get the first span tag
# find all tags whose class attribute is 'title'
ls = soup.find_all(class_='title')
divs = soup.find_all('div')
for div in divs:
print(list(div.strings)) # all the strings under the div
print('='*20)
print(list(div.stripped_strings)) # all the non-blank, whitespace-stripped strings under the div
print('*'*20)
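As noted above, tag['value'] raises KeyError when the attribute is missing, exactly like a dict; tag.get() sidesteps this. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input id="kw"><input value="search">', "html.parser")

for inp in soup.find_all("input"):
    # .get() returns None instead of raising KeyError when the attribute is absent
    print(inp.get("value"))  # None, then 'search'
```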
select
- select() works much like find_all(), except that select can use CSS selector syntax, which is often more convenient
#get all the input tags
span = soup.select('input') #also returns a list
span = soup.select('input#showguide') #the input tag with id=showguide
span = soup.select('input[id="showguide"]') #the input tag with id=showguide
spans = soup.select('input',limit=3) #the first three input tags
NavigableString
- NavigableString means a traversable string; the text wrapped inside a tag is normally of NavigableString type
BeautifulSoup
- The BeautifulSoup object is the object obtained by parsing the page
Comment
- Comment refers to comments and other special strings inside a page
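A quick sketch (made-up markup) showing how these object types appear in practice:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<b><!-- a comment -->text</b>", "html.parser")

# the children of <b> are a Comment and a NavigableString
kinds = [type(node).__name__ for node in soup.b.contents]
print(kinds)                 # ['Comment', 'NavigableString']
print(type(soup).__name__)   # 'BeautifulSoup'
```

Comment is a subclass of NavigableString, so comment text also behaves like a string.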
Usage
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')
#BeautifulSoup parses html_doc into a BeautifulSoup object and assigns it to soup
#the following was run in IPython
In [4]: soup.title
Out[4]: <title>The Dormouse's story</title>
In [5]: type(soup.title)
Out[5]: bs4.element.Tag
In [6]: soup.title.name
Out[6]: 'title'
In [7]: soup.title.string # the text inside the title tag
Out[7]: u"The Dormouse's story"
In [8]: soup.title.parent # the parent tag of title
Out[8]: <head><title>The Dormouse's story</title></head>
In [9]: soup.p # the p tag; picks one, the first by default
Out[9]: <p class="title"><b>The Dormouse's story</b></p>
In [10]: soup.p["class"] # attribute access; picks one, the first by default
Out[10]: ['title']
In [13]: soup.find_all('p') # all the p tags
Out[13]:
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,<p class="story">...</p>]
In [14]: soup.find(id='link3') # the tag with id=link3
Out[14]: <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
In [15]: soup = BeautifulSoup(html_doc,'lxml')
In [16]: p = soup.p
In [17]: type(p)
Out[17]: bs4.element.Tag
In [22]: a = soup.a
In [23]: a
Out[23]: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
In [24]: a['class']
Out[24]: ['sister']
In [25]: a['id']
Out[25]: 'link1'
In [26]: a['href']
Out[26]: 'http://example.com/elsie'
In [27]: a.get_text()
Out[27]: u'Elsie'
In [28]: body = soup.body
In [29]: body.get_text()
Out[29]: u"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
In [30]: type(a.get_text())
Out[30]: unicode
In [32]: p.string
Out[32]: u"The Dormouse's story"
In [33]: type(p.string)
Out[33]: bs4.element.NavigableString
In [34]: body.string
In [35]: type(body.string)
Out[35]: NoneType
In [37]: body.contents
Out[37]:
[u'\n',
<p class="title"><b>The Dormouse's story</b></p>,
u'\n',
<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,
u'\n',
<p class="story">...</p>,
u'\n']
In [38]: body.children
Out[38]: <listiterator at 0x7fac9d48f950>
In [39]: list(body.children) # direct children only
Out[39]:
[u'\n',
<p class="title"><b>The Dormouse's story</b></p>,
u'\n',
<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,
u'\n',
<p class="story">...</p>,
u'\n']
In [40]: body.descendants
Out[40]: <generator object descendants at 0x7fac9cbe3af0>
In [41]: len(body.contents) # direct children only
Out[41]: 7
In [42]: len(list(body.descendants))
Out[42]: 20
In [43]: p.string # only works when the tag contains exactly one string
Out[43]: u"The Dormouse's story"
In [44]: p
Out[44]: <p class="title"><b>The Dormouse's story</b></p>
In [45]: body.string
In [46]: body.strings
# handles tags containing several strings; it is a generator, recommended when there is a lot of text
Out[46]: <generator object _all_strings at 0x7faca40efdc0>
In [47]: list(body.strings)
Out[47]:
[u'\n',
u"The Dormouse's story",
u'\n',
u'Once upon a time there were three little sisters; and their names were\n',
u'Elsie',
u',\n',
u'Lacie',
u' and\n',
u'Tillie',
u';\nand they lived at the bottom of a well.',
u'\n',
u'...',
u'\n']
In [48]: len(list(body.stripped_strings)) # blank strings and surrounding whitespace stripped
Out[48]: 9
In [49]: len(list(body.strings))
Out[49]: 13
In [50]: p.parent
Out[50]: <body>\n<p class="title"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body>
In [51]: p.parents
Out[51]: <generator object parents at 0x7fac9cbf2410>
In [52]: list(p.parents)
Out[52]:
[<body>\n<p class="title"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body>,
<html><head><title>The Dormouse's story</title></head>\n<body>\n<p class="title"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body></html>,
<html><head><title>The Dormouse's story</title></head>\n<body>\n<p class="title"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body></html>]
In [53]: for element in p.parents:
...: print(element)
...:
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
In [55]: p
Out[55]: <p class="title"><b>The Dormouse's story</b></p>
In [56]: p.next_sibling
Out[56]: u'\n' # a blank line; newlines and text are nodes too
In [57]: p.next_sibling.next_sibling
Out[57]: <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>
In [59]: p.previous_sibling
Out[59]: u'\n'
In [60]: p.previous_sibling.previous_sibling # there is no sibling before that one, so nothing is shown
In [61]: p.previous_siblings
Out[61]: <generator object previous_siblings at 0x7faca40ef4b0>
In [62]: list(p.previous_sibling) # note: this iterates the '\n' string itself; list(p.previous_siblings) gives all preceding siblings
Out[62]: [u'\n']
In [63]: soup.find_all('p') # all the p tags (search by name)
Out[63]:
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,
<p class="story">...</p>]
In [64]: soup.find_all(['a','b']) # all the a tags and b tags
Out[64]:
[<b>The Dormouse's story</b>,
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [65]: soup.find_all(attrs={'class':'sister'}) # search by attribute: everything with class=sister
Out[65]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [66]: soup.find_all(text="Elsie") # all strings whose text is Elsie
Out[66]: [u'Elsie']
In [67]: soup.find_all('a',text='Elsie') # the a tags whose text is Elsie
Out[67]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [68]: soup.find_all('a',text='Elsie')[0].parent
Out[68]: <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>
In [69]: soup.find_all(text='Elsie')[0].parent
Out[69]: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
In [72]: soup.find_all('title')
Out[72]: [<title>The Dormouse's story</title>]
In [70]: soup.find_all('title',recursive=False) # recursive=False searches direct children only
Out[70]: []
In [73]: import re
In [74]: tags = soup.find_all(re.compile('^b')) # tags whose names start with b
In [75]: tags
Out[75]:
[<body>\n<p class="title"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body>,
<b>The Dormouse's story</b>]
In [76]: soup.select('title')
Out[76]: [<title>The Dormouse's story</title>]
In [77]: soup.select('p > a') # a tags directly under p tags
Out[77]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
xpath
- XPath is a language for finding information in XML documents; it traverses elements and attributes in an XML document. Compared with BeautifulSoup, XPath is usually more efficient at extracting data
- A very complete reference: http://www.w3school.com.cn/xpath/xpath_nodes.asp
Installation
- Many Python libraries provide XPath support, but the most popular is lxml, and it is also the most efficient
pip install lxml
- Import:
from lxml import etree # no autocompletion in PyCharm, because lxml is written in C
etree usage
- etree.HTML(str): converts a string into an Element tree (the same Element type that xpath queries return) and completes missing tags to form well-formed markup
- etree.tostring(): serializes an Element tree to bytes; decode() turns that into a string, and an encoding='utf8' parameter can be passed
- etree.parse(): takes a local file name and parses the local html file into the same Element tree as etree.HTML(); use etree.tostring() and decode to get a string (if it raises
lxml.etree.XMLSyntaxError: StartTag: invalid element name, line 4, column 2
the html file is not well-formed; pass etree.HTMLParser(), and if that still fails add encoding='utf8' to the HTMLParser arguments) - etree.parse() only reads local files and, unlike etree.HTML(), does not complete missing html tags
import requests
import random
from lxml import etree
headers = {
'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}
url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
s = requests.Session()
response = s.get(url,headers=headers)
str = response.content.decode('gbk') # get the page as a string
a = etree.HTML(str) # etree.HTML() builds the Element tree; its argument is a string
html = etree.tostring(a,encoding='utf8').decode('utf8') # etree.tostring() serializes to bytes; decode to get a string
# print(type(a))
str = """
<div class="header">
<ul id="languagelist">
<li class="tle"><span class="list">简</span></li><li><a href="http://big5.51job.com/gate/big5/www.51job.com/" rel="external nofollow">繁</a></li><li class="last"><a href="//www.51job.com/default-e.php" rel="external nofollow">EN</a></li> <script language="javascript">
if(location.hostname == "big5.51job.com")
{
$('#languagelist li span').html("繁");
$('#languagelist li:nth-child(2) a').html("简");
$('#languagelist li:nth-child(2) a').attr('href','javascript:void(0)');
$('#languagelist li:nth-child(2) a').click(function(){location.href=window.cfg.domain.www});
$('#languagelist li:nth-child(3) a').attr('href','javascript:void(0)');
$('#languagelist li:nth-child(3) a').click(function(){location.href=window.cfg.domain.www+"/default-e.php"});
}
</ul>
<span class="l"> </span>
<div class="app">
<ul>
<li><em class="e_icon"></em><a href="http://app.51job.com/index.html">APP下载</a></li>
<li>
<img width="80" height="80" src="//img02.51jobcdn.com/im/2016/code/web_top.png" alt="app download">
<p><a href="http://app.51job.com/index.html">APP下载</a></a></p>
</li>
</ul>
</div>
</div>
"""
a = etree.HTML(str)
html = etree.tostring(a,encoding='utf8').decode('utf8')
print(html)
#<html><body><div class="header">
<ul id="languagelist">
<li class="tle"><span class="list">简</span></li><li><a href="http://big5.51job.com/gate/big5/www.51job.com/" rel="external nofollow">繁</a></li><li class="last"><a href="//www.51job.com/default-e.php" rel="external nofollow">EN</a></li> <script language="javascript">
if(location.hostname == "big5.51job.com")
{
$('#languagelist li span').html("繁");
$('#languagelist li:nth-child(2) a').html("简");
$('#languagelist li:nth-child(2) a').attr('href','javascript:void(0)');
$('#languagelist li:nth-child(2) a').click(function(){location.href=window.cfg.domain.www});
$('#languagelist li:nth-child(3) a').attr('href','javascript:void(0)');
$('#languagelist li:nth-child(3) a').click(function(){location.href=window.cfg.domain.www+"/default-e.php"});
}
</ul>
<span class="l">&nbsp;</span>
<div class="app">
<ul>
<li><em class="e_icon"></em><a href="http://app.51job.com/index.html">APP下载</a></li>
<li>
<img width="80" height="80" src="//img02.51jobcdn.com/im/2016/code/web_top.png" alt="app download">
<p><a href="http://app.51job.com/index.html">APP下载</a></a></p>
</li>
</ul>
</div>
</div>
</script></ul></div></body></html>
- If etree.parse() raises
lxml.etree.XMLSyntaxError: StartTag: invalid element name, line 4, column 2
the html file is not well-formed; pass etree.HTMLParser() as the parser (the default parser is an XML parser), and if that still fails add encoding='utf8' to the HTMLParser arguments
a = etree.parse('dapeng.html',etree.HTMLParser())
#a = etree.parse('dapeng.html',etree.HTMLParser(encoding='utf8'))
html = etree.tostring(a).decode('utf8')
print(html)
Syntax
- XPath selects nodes in an XML/HTML document with path expressions; nodes are selected by following a path, or steps
- The most useful path expressions are:
- nodename: selects all children of the current node named nodename
- /: selects from the root node
- //: selects matching nodes anywhere in the document, regardless of their position (whatever precedes // in the expression is ignored; to search only the descendants of the current tag, use .//)
- .: selects the current node
- ..: selects the parent of the current node
- @: selects attributes
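A minimal sketch of these path expressions with lxml, on a made-up fragment:

```python
from lxml import etree

doc = etree.HTML("<div><p class='a'>one</p><p class='b'>two</p></div>")

# // finds the p elements wherever they sit in the tree
print(doc.xpath("//p"))
# @ selects attributes
print(doc.xpath("//p/@class"))  # ['a', 'b']
# an absolute path from the root (etree.HTML wraps the fragment in html/body)
print(doc.xpath("/html/body/div/p[1]/text()"))  # ['one']
```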
Predicates
- Predicates find specific nodes, or nodes containing a given value
- Predicates are written inside square brackets
- Path expressions:
- //bookstore/book[1]
- selects the first book child of bookstore
- //bookstore/book[last()]
- selects the last book child of bookstore
- //bookstore/book[last()-1]
- selects the second-to-last book child of bookstore
- //bookstore/book[position()<3]
- selects the first two book children of bookstore
- //title[@lang]
- selects all title elements that have a lang attribute
- //title[@lang='eng']
- selects all title elements whose lang attribute equals eng
- //bookstore/book[price>35.00]
- selects the book children of bookstore whose price element is greater than 35.00
- //bookstore/book[price>35.00]/title
- selects the title elements of those book elements whose price is greater than 35.00
- Selecting unknown nodes
- XPath wildcards (*, @*, node()) can be used to select unknown nodes
- Selecting multiple paths
- The "|" operator in a path expression selects several paths at once
- text() returns the text directly under a node
- string() returns all the text under a node
- etree.tostring() can also be used for the same purpose
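The difference between text() and string() in one sketch (made-up fragment):

```python
from lxml import etree

doc = etree.HTML("<div>start<b>middle</b>end</div>")

# text() returns only the text nodes directly under div
print(doc.xpath("//div/text()"))   # ['start', 'end']
# string() concatenates all the text under the node
print(doc.xpath("string(//div)"))  # 'startmiddleend'
```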
Usage
from lxml import etree # parse the document into a document tree
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
page = etree.HTML(html_doc) # returns the html root Element
print(page.xpath('head')) # head under html #[<Element head at 0x7fdbf7502248>]
print(page.xpath('.')) #[<Element html at 0x7fdbf75022c8>]
head = page.xpath('head')[0]
print(head.xpath('/')) #[ ]
print(head) #<Element head at 0x7f8df3466388>
print(head.xpath('/body')) #[ ]  / searches from the root node
print(head.xpath('/html/body')) #[<Element body at 0x7f0d170f2248>]
print(head.xpath('/html/body/p/a')[0]) #<Element a at 0x7fb1c31d8308>
print(page.xpath('//p')) #[<Element p at 0x7ff51a38ec88>, <Element p at 0x7ff51a38ed88>, <Element p at 0x7ff51a38edc8>]
print(page.xpath('//p/a[1]/@id')) #['link1'] indexing starts at 1, not 0
print(page.xpath('//p/a[last()]/@id')) #['link3'] the last one
print(page.xpath('//p/a[last()-1]/@id')) #['link2'] the second to last
print(page.xpath('//p/a[position()<3]')) #[<Element a at 0x7f07c8b2cc88>, <Element a at 0x7f07c8b2cd88>] position()<3 takes the first two
print(page.xpath('//p/a[position()=2]/@id')) #['link2']
print(page.xpath('//p/a[@class="sister"]/@id')) #['link1', 'link2', 'link3']
-------- a playful divider --------
html_a = """
<ul>
<li class="rec_topics">
<a href="https://www.douban.com/gallery/topic/49461/?from=hot_topic_anony_sns" class="rec_topics_name">你读过的最孤独的文学形象</a>
<span class="rec_topics_subtitle">597905</span>
</li>
<li class="rec_topics">
<a href="https://www.douban.com/gallery/topic/51958/?from=hot_topic_anony_sns" class="rec_topics_name">最喜欢的作家遗作</a>
<span class="rec_topics_subtitle">17459</span>
</li>
<li class="rec_topics">
<a href="https://www.douban.com/gallery/topic/52030/?from=hot_topic_anony_sns" class="rec_topics_name">影视作品中你最爱的驱魔桥段</a>
<span class="rec_topics_subtitle">12918</span>
</li>
<li class="rec_topics">
<a href="https://www.douban.com/gallery/topic/48823/?from=hot_topic_anony_sns" class="rec_topics_name">令你难忘的深夜长谈</a>
<span class="rec_topics_subtitle">1341477</span>
</li>
<li class="rec_topics">
<a href="https://www.douban.com/gallery/topic/48855/?from=hot_topic_anony_sns" class="rec_topics_name">宛如小说情节的家族故事</a>
<span class="rec_topics_subtitle">1996938</span>
</li>
<li class="rec_topics">
<a href="https://www.douban.com/gallery/topic/49643/?from=hot_topic_anony_sns" class="rec_topics_name">古寺记</a>
<span class="rec_topics_subtitle">286609</span>
</li>
</ul>
"""
page1 = etree.HTML(html_a)
print(page1.xpath('//li[span>10000]/@class'))
#['rec_topics', 'rec_topics', 'rec_topics', 'rec_topics', 'rec_topics', 'rec_topics']
print(page1.xpath('//li/a|//li/span')) # a tags or span tags under li
#[<Element a at 0x7f0c236ead88>, <Element span at 0x7f0c236eadc8>, <Element a at 0x7f0c236eae08>, <Element span at 0x7f0c236eae48>, <Element a at 0x7f0c236eae88>, <Element span at 0x7f0c236eaf08>, <Element a at 0x7f0c236eaf48>, <Element span at 0x7f0c236eaf88>, <Element a at 0x7f0c236eafc8>, <Element span at 0x7f0c236eaec8>, <Element a at 0x7f0c23493048>, <Element span at 0x7f0c23493088>]
print(page1.xpath('//li/span/text()'))
#['597905', '17459', '12918', '1341477', '1996938', '286609']
print(page1.xpath('string(//li)')) #你读过的最孤独的文学形象 597905
#all the text of the first li
An xpath example
- Crawl the now-showing movie listings from Douban
import requests
from lxml import etree
import json
url = 'https://movie.douban.com/cinema/nowplaying/haerbin/'
headers = {
'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
'Referer': 'https://movie.douban.com/',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'
}
session = requests.Session()
req = session.get(url,headers=headers)
response = req.content.decode('utf8')
html = etree.HTML(response)
lists = html.xpath('//div[@id="nowplaying"]/div[@class="mod-bd"]/ul[@class="lists"]/li[@class="list-item"]')
films = []
for li in lists:
name = li.xpath('./@data-title')[0]
score = li.xpath('./@data-score')[0]
star = li.xpath('./@data-star')[0]
year = li.xpath('./@data-release')[0]
duration = li.xpath('./@data-duration')[0]
country = li.xpath('./@data-region')[0]
director = li.xpath('./@data-director')[0]
actors = li.xpath('./@data-actors')[0]
category = li.xpath('./@data-category')[0]
href = li.xpath('.//a/@href')[0]
img = li.xpath('.//a/img/@src')[0]
film_js = {
'name': name,
'score': score,
'star': star,
'year': year,
'duration': duration,
'country': country,
'director': director,
'actors': actors,
'category': category,
'href': href,
'img': img
}
films.append(film_js)
print(films)
#[{"name": "利刃出鞘", "score": "8.4", "star": "45", "year": "2019", "duration": "130分钟", "country": "美国", "director": "莱恩·约翰逊", "actors": "丹尼尔·克雷格 / 安娜·德·阿玛斯 / 克里斯·埃文斯", "category": "nowplaying", "href": "https://movie.douban.com/subject/30318116/?from=playing_poster", "img": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2574172427.webp"}, .......{"name": "衣柜里的冒险王", "score": "7.2", "star": "40", "year": "2018", "duration": "96分钟(中国大陆)", "country": "法国 印度", "director": "肯·斯科特", "actors": "丹努什 / 贝热尼丝·贝乔 / 艾琳·莫里亚蒂", "category": "nowplaying", "href": "https://movie.douban.com/subject/26715965/?from=playing_poster", "img": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2573630982.webp"}]
A small case study
import requests
from lxml import etree
import json
URL_HEAD = 'https://www.ygdy8.net' # global constant names are uppercase
HEADERS = {
'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
}
def get_page_url(url):
"""
获取每一页里的所有 url
:return:
"""
# carry the cookie
session = requests.Session()
req = session.get(url,headers=HEADERS)
response = req.text
html = etree.HTML(response)
# the detail-page urls on this page
urls = html.xpath('//div[@class="co_content8"]/ul//table[@class="tbspan"]//a/@href')
# map(func, iterable)
# build the full urls
urls = list(map(lambda x: URL_HEAD+x,urls))
return urls
def detail(urls):
"""
每一个 url 的详细内容
:param url_yes:
:return:
"""
# for url in urls:
# session = requests.Session()
ls = []
for url in urls:
movie = {}
req = requests.get(url,headers=HEADERS)
response = req.content.decode('gbk') # req.text would pick the wrong encoding and later steps would fail, so decode manually here
html = etree.HTML(response)
ts = html.xpath('//div[@id="Zoom"]//text()')
# print(type(ts))
src = html.xpath('//div[@id="Zoom"]//img/@src')
movie['海报'] = src[0]
movie['剧照'] = src[1]
for index, t in enumerate(ts):
# print(type(ts[19]))
# strings like this have unusual internal spacing; copy them from the page rather than typing them, the spaces are easy to get wrong
if t.startswith('◎译 名'):
movie['translated_names'] = t.replace('◎译 名','').strip() # dict assignment: bind the value to the key
elif t.startswith('◎片 名'):
movie['film_name'] = t.replace('◎片 名','').strip()
elif t.startswith('◎年 代'):
movie['year'] = t.replace('◎年 代','').strip()
elif t.startswith('◎产 地'):
movie['address'] = t.replace('◎产 地','').strip()
elif t.startswith('◎类 别'):
movie['category'] = t.replace('◎类 别','').strip()
elif t.startswith('◎语 言'):
movie['language'] = t.replace('◎语 言','').strip()
elif t.startswith('◎豆瓣评分'):
movie['douban_score'] = t.replace('◎豆瓣评分','').strip()
elif t.startswith('◎编 剧'):
li = []
li.append(t.replace('◎编 剧', '').strip())
movie['screenwriter'] = li
for i in range(index + 1, len(ts)):
if ts[i].startswith('◎主 演'):
break
else:
li.append(ts[i].strip())
movie['screenwriter'] = li
elif t.startswith('◎主 演'):
li = []
li.append(t.replace('◎主 演', '').strip())
movie['actor'] = li
for i in range(index+1,len(ts)):
if ts[i].startswith('◎标 签'):
break
else:
li.append(ts[i].strip())
movie['actor'] = li
elif t.startswith('◎标 签'):
movie['label'] = t.replace('◎标 签','').strip()
elif t.startswith('◎简 介'):
movie['brief introduction'] = t.replace('◎简 介','').strip()
hf = html.xpath('//div[@id="Zoom"]//a/@href')[0]
movie['download_url'] = 'https://www.ygdy8.net/html/gndy/dyzz/20191127/' + hf # note: this date path is hard-coded and only matches pages under 20191127
ls.append(movie)
return ls
def paging():
"""
获取每一页的 url,将爬取的数据最终放到这里
:return:
"""
# 根据每一页的 url 发现,前面都一样,只有最后{}这里不一样,这里和页数保持一致
url = 'https://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
# scrape the first seven pages
for i in range(1,8):
# build this page's url; use a new name, because overwriting url would destroy the {} template after the first iteration
page_url = url.format(i)
# get_page_url() returns the detail-page url of every movie on the page
urls = get_page_url(page_url)
# detail() fetches and returns the details of each movie
movies = detail(urls)
with open('films.json','a',encoding='utf8') as f:
for movie in movies:
f.write(json.dumps(movie,ensure_ascii=False) + '\n')
# not executed when this module is imported
if __name__ == '__main__':
paging()
xpath: Douban movies Top 250
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
__title__ = ' text2'
__author__ = 'xiaoge'
__mtime__ = '2019/12/3 下午3:23'
# code is far away from bugs with the god animal protecting
I love animals. They taste delicious.
┏┓ ┏┓
┏┛┻━━━━━━┛┻┓
┃ ☃ ┃
┃ ┳┛ ┗┳ ┃
┃ ┻ ┃
┗━┓ ┏━┛
┃ ┗━━━┓
┃ 神兽保佑 ┣┓
┃永无BUG!┏┛
┗┓┓┏━┳┓┏┛
┃┫┫ ┃┫┫
┗┻┛ ┗┻┛
"""
import requests
from bs4 import BeautifulSoup
from lxml import etree
HEADERS = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 \
Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Referer': 'https://movie.douban.com/top250?start=225&filter='
}
def spider():
"""
爬电影数据
:return:
"""
# build the url of each page
for i in range(10):
url = 'https://movie.douban.com/top250?start={}&filter='.format(i*25)
#10 pages in total; one_page_movie() returns the 25 movie li tags on the page as a dict, assigned to dicts
dicts = one_page_movie(url)
# pass the dict to detail() to get each movie's details
movie = detail(dicts)
with open('movie.txt','a',encoding='utf-8') as f:
f.write(str(movie) + '\n')
def one_page_movie(one_page_url):
"""
获取每一页里所有的电影(25个)的 li 标签
:param one_page_url: spider() 返回的每一页的 url
:return: 返回每一页的 25 个 li(以字典形式) 给 detail()
"""
# use a Session: it carries the cookie, which helps avoid anti-scraping countermeasures that return nothing or bad data
session = requests.Session()
# fetch the page with request headers set, reducing the chance of an IP ban
req = session.get(one_page_url, headers=HEADERS)
# decode the page to a string for the parser
str = req.content.decode('utf8')
# build the Element tree from the string, ready for xpath
html = etree.HTML(str)
dicts = {}
for i in range(25):
#the li of each of the 25 movies on the page
li = html.xpath('//ol[@class="grid_view"]/li')[i]
# store the li tags in a dict: keys 0-24, values the matching li
dicts[i] = li
return dicts
def detail(lis):
"""
获取电影详细信息
:param lis: one_page_movie() 返回的 25 个 li 的字典
:return: 每一个电影的详细信息,列表形式
"""
movie_li = []
# iterate 0-24 to pull the matching li out of the dict (dict[key] => value)
for i in range(25):
# for li in lis:
# the movie name from the span tags
text = lis[i].xpath('.//div[@class="info"]/div[@class="hd"]/a/span/text()')
name = ''
for tx in text:
# the name fragments still need filtering and joining
tx = tx.replace('\xa0','')
name += tx
movie = {}
movie['name'] = name
actors = lis[i].xpath('.//div[@class="info"]/div[@class="bd"]/p[1]/text()')
movie['actor'] = actors[0].replace('\xa0','').strip()
movie['year'] = actors[1].replace('\xa0','').strip()
movie['star'] = lis[i].xpath('.//div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[2]/text()')[0]
movie_li.append(movie)
return movie_li
if __name__ == '__main__':
spider()