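Beautiful Soup is an open-source library for working with HTML and XML. It sits on top of a third-party XML/HTML parser and is responsible for navigating, searching, and modifying the resulting parse tree.
Available HTML/XML parser libraries include lxml and html5lib. Python's built-in html.parser can also be used when neither is installed.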
1. Installation
pip install beautifulsoup4
2. Usage
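import urllib.request
import bs4

# Fetch the page as raw bytes; from_encoding only applies to byte input.
html = urllib.request.urlopen("http://www.example.com/1.html").read()

# Name the parser explicitly (html5lib must be installed separately).
soup = bs4.BeautifulSoup(html, "html5lib", from_encoding="gbk")
# Or omit the parser: Beautiful Soup picks the best one installed and emits a warning.
soup = bs4.BeautifulSoup(html, from_encoding="gbk")
# Markup can also be passed in directly, here as a bytes literal.
soup = bs4.BeautifulSoup(b"<html>... ....</html>", from_encoding="gbk")

# Find all <div> elements that carry a given CSS class.
catlog = soup.find_all('div', class_="globalCrumbs")
title = soup.find_all('div', class_="articleTitle2011")
result = {}
for e in title:
    print(e)
    result["title"] = e.h1.text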
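The steps above can be tried without network access. The following is a minimal sketch, not a definitive implementation: the sample markup, the helper name parse_article, and the choice of the built-in html.parser are all assumptions made for illustration. Only the class names globalCrumbs and articleTitle2011 come from the notes above. It encodes a small page as GBK bytes, parses it with from_encoding="gbk", and guards against a title <div> that has no <h1>.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; only the two class names come from the notes.
SAMPLE_HTML = (
    '<html><body>'
    '<div class="globalCrumbs"><a href="/">首页</a><a href="/news">新闻</a></div>'
    '<div class="articleTitle2011"><h1>示例标题</h1></div>'
    '</body></html>'
)


def parse_article(raw, encoding="gbk"):
    """Parse raw bytes and return the title and breadcrumb link texts."""
    # html.parser ships with Python, so no extra parser needs to be installed.
    soup = BeautifulSoup(raw, "html.parser", from_encoding=encoding)
    result = {"title": None, "crumbs": []}

    # Collect the text of every link inside the breadcrumb container(s).
    for div in soup.find_all("div", class_="globalCrumbs"):
        result["crumbs"].extend(a.get_text(strip=True) for a in div.find_all("a"))

    # div.h1 is None when the <div> has no <h1>, so check before reading it.
    for div in soup.find_all("div", class_="articleTitle2011"):
        if div.h1 is not None:
            result["title"] = div.h1.get_text(strip=True)

    return result


if __name__ == "__main__":
    # Simulate a GBK-encoded page as it would arrive over the network.
    print(parse_article(SAMPLE_HTML.encode("gbk")))
```

Passing bytes rather than an already decoded string is what lets from_encoding take effect. With a str argument, Beautiful Soup ignores the parameter.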