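Learning to write web crawlers inevitably involves parsing and analyzing data. Python's BeautifulSoup module is an excellent HTML parser, and this post records the main functions of bs4.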
Install
install bs4
pip3 install beautifulsoup4
install lxml parser
pip3 install lxml
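Installing the lxml parser may fail with an xmlCheckVersion error. In that case, download the matching lxml .whl file from the web and install from the wheel instead.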
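If lxml still cannot be installed, one possible workaround is to fall back to Python's built-in parser. The sketch below is only an illustration: `pick_parser` is a hypothetical helper name, not part of bs4, and it only checks that the lxml package can be found, not that it imports cleanly. Both "lxml" and "html.parser" are parser names that BeautifulSoup accepts.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if the package can be found, else the stdlib "html.parser".

    Hypothetical helper for illustration only; it is not part of bs4.
    """
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
print(parser_name)
```

The returned name can be passed as the second argument to `BeautifulSoup(...)`. Note that the two parsers may build slightly different trees for malformed HTML.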
get html
First obtain an HTML page, either fetched with the requests library or read from a local static HTML file, and then parse it with bs4.
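from bs4 import BeautifulSoup
import requests

# html_doc: a string holding the HTML of a local static page
soup = BeautifulSoup(html_doc, "lxml")  # the parser name must be a string
# or
url = "http://www.xxx.com"  # requests needs a scheme such as http://
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")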
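To make the local-page case concrete, here is a minimal sketch that parses an inline HTML string without any network access. The sample markup and the helper name `parse_html` are assumptions made for illustration, and the built-in "html.parser" is used instead of lxml so the snippet runs even when lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a local static page.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
</body></html>
"""


def parse_html(markup, parser="html.parser"):
    """Build a BeautifulSoup tree from a markup string (illustrative helper)."""
    return BeautifulSoup(markup, parser)


soup = parse_html(html_doc)
title_text = soup.title.string      # text inside <title>
first_link = soup.find("a")["href"]  # href of the first <a> tag
print(title_text)
print(first_link)
```

Passing "lxml" as the second argument to `parse_html` should work the same way for this well-formed sample, provided lxml is installed.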
parsing function
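To get started quickly, it helps to look at the parsing functions bs4 offers. The most commonly used methods are listed here, using the following HTML as the example: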
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>