It's been a long time since I updated this blog. I'm planning to write a series on Python web scraping and data analysis. That said, I shouldn't set the flag too casually, or it may come back to embarrass me.
With Python scraping, fetching the content is the process, analyzing the data is the result, and drawing conclusions is the real goal. A crawler usually fetches content from web pages, so how do we extract the information we want from an HTML page? We need to parse it. The tools in common use are BeautifulSoup, PyQuery, XPath, and regular expressions. Regular expressions are error-prone (and have always been a weak point of mine), so I'll cover the other three, starting today with BeautifulSoup.
1. Introduction
BeautifulSoup literally means "beautiful soup". It is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it gives you idiomatic ways to navigate, search, and modify the parse tree.
2. Installation
pip install beautifulsoup4
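The examples later in this post pass features="lxml" as well as features="html.parser". The built-in html.parser needs nothing extra, but lxml is a separate package, so install it too if you want to run those examples:

pip install lxml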
3. Preparing a test document
Here is a passage based on Alice in Wonderland (referred to below simply as the "Alice" document):
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
We'll use this document for all the examples that follow.
4. Usage
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
# print(soup.prettify())

print(soup.title)              # <title>The Dormouse's story</title>
print(soup.title.name)         # title
print(soup.title.string)       # The Dormouse's story
print(soup.title.parent.name)  # head
print(soup.p)                  # <p class="title"><b>The Dormouse's story</b></p>
print(soup.p['class'])         # ['title']
print(soup.a)                  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all('a'))      # a list of all three <a> tags: Elsie, Lacie and Tillie
print(soup.find(id='link3'))   # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
In the listing above, each comment shows the output of the statement it accompanies.
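find_all() is not the only way to search. BeautifulSoup also understands CSS selectors through select(). A quick sketch against the same soup object built above (the selectors are my own illustration, not part of the original examples):

# select() takes a CSS selector and always returns a list of matches
print(soup.select("p.story a"))  # all three sister links
print(soup.select("#link2"))     # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
# find_all() can filter on attributes too; class_ avoids Python's reserved word
print(soup.find_all("a", class_="sister", limit=2))  # just the first two links

Note that select() returns a list even when an id can only match once, which is why #link2 comes back wrapped in brackets.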
5. BeautifulSoup accepts a string or a file handle
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', features="lxml")
tag = soup.b
print(tag)           # <b class="boldest">Extremely bold</b>

tag.name = "blockquote"
print(tag)           # <blockquote class="boldest">Extremely bold</blockquote>

print(tag['class'])  # ['boldest']
print(tag.attrs)     # {'class': ['boldest']}

tag['id'] = "stylebs"
print(tag)           # <blockquote class="boldest" id="stylebs">Extremely bold</blockquote>

del tag['id']
print(tag)           # <blockquote class="boldest">Extremely bold</blockquote>
Attributes such as class can hold more than one value, and BeautifulSoup represents these multi-valued attributes as lists; id, which is single-valued in HTML, stays a plain string:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', features="lxml")
print(css_soup.p['class'])  # ['body', 'strikeout']

id_soup = BeautifulSoup('<p id="my id"></p>', features="lxml")
print(id_soup.p['id'])      # my id

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', features="lxml")
print(rel_soup.a['rel'])    # ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)           # <p>Back to the <a rel="index contents">homepage</a></p>