What is BeautifulSoup4?
Like lxml, Beautiful Soup is an HTML/XML parser for Python that makes it easy to extract data from web pages.
lxml traverses documents lazily, while Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree,
so its time and memory overhead are much higher and its performance is lower than lxml's. Beautiful Soup is, however, simpler to use for parsing HTML:
its API is very user-friendly, it supports CSS selectors, it can use the HTML parser from the Python standard library, and it also supports lxml's XML parser.
Beautiful Soup 3 is no longer developed; new projects should use Beautiful Soup 4.
Install it with pip: `pip install beautifulsoup4`
Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
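As a quick sanity check after installing, a minimal session might look like this (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='title'>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no extra install needed

print(soup.p.text)      # Hello
print(soup.p["class"])  # ['title'] -- class is multi-valued, so it comes back as a list
```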
Parsers supported by Beautiful Soup
Parser | Usage | Advantages |
---|---|---|
Python standard library | `BeautifulSoup(markup, 'html.parser')` | Built into Python; moderate speed; tolerant of malformed documents |
lxml HTML parser | `BeautifulSoup(markup, 'lxml')` | Fast; tolerant of malformed documents |
lxml XML parser | `BeautifulSoup(markup, 'xml')` | Fast; the only parser that supports XML |
html5lib | `BeautifulSoup(markup, 'html5lib')` | Best error tolerance; parses documents the way a browser does and produces valid HTML5 |
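The parser is selected by the second constructor argument. A small sketch of choosing one (only `html.parser` is guaranteed to be available; the others must be installed separately):

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<div><a href='x.html'>link</a></div>"

# "html.parser" ships with the standard library and always works.
soup = BeautifulSoup(markup, "html.parser")
print(soup.a["href"])  # x.html

# Asking for a parser that is not installed raises FeatureNotFound.
try:
    BeautifulSoup(markup, "no-such-parser")
except FeatureNotFound:
    print("parser not available")
```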
Methods and arguments supported by Beautiful Soup
Name | Purpose |
---|---|
`find` | return the first matching node |
`find_all` | return all matching nodes |
`name` (argument) | the tag to match: a tag name, a regular expression, or a list of tag names such as `['a', 'img']` |
`attrs` (argument) | a dict of attribute/value filters matched against each tag's attributes |
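The argument forms above can be sketched against a small hand-written snippet (the markup is invented for illustration):

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="nav">
  <a href="/home">Home</a>
  <img src="logo.png">
  <b>bold</b>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name as a single tag name
print(soup.find("a").text)                             # Home
# name as a list of tag names
print([t.name for t in soup.find_all(["a", "img"])])   # ['a', 'img']
# name as a regular expression
print([t.name for t in soup.find_all(re.compile("^b$"))])  # ['b']
# attrs as a dict of attribute filters
print(soup.find(attrs={"class": "nav"}).name)          # div
```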
CSS selectors
`soup.select()` offers another way to search, similar in spirit to `find_all`. As in CSS, tag names are written bare, class names are prefixed with `.`, and ids are prefixed with `#`. We can filter elements the same way here with `soup.select()`, which returns a list.
Expression | Meaning |
---|---|
`*` | select all nodes |
`#container` | select the node with id `container` |
`.container` | select all nodes whose class contains `container` |
`li a` | select all `a` nodes under any `li` |
`div#container > ul` | select `ul` children of the `div` with id `container` |
`a[href="http://jobbole.com"]` | select all `a` elements whose href equals `http://jobbole.com` |
`a[href*="jobbole"]` | select all `a` elements whose href contains `jobbole` |
`a[href^="http"]` | select all `a` elements whose href starts with `http` |
`a[href$=".jpg"]` | select all `a` elements whose href ends with `.jpg` |
`div:not(#container)` | select all `div` elements whose id is not `container` |
`li:nth-child(3)` | select every `li` that is the third child of its parent |
`tr:nth-child(2n)` | select even-numbered `tr` rows |
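Several of the selectors above can be tried against a small invented snippet:

```python
from bs4 import BeautifulSoup

html = """
<div id="container">
  <ul class="list">
    <li><a href="http://jobbole.com/a.jpg">one</a></li>
    <li><a href="https://example.com">two</a></li>
    <li><a href="http://jobbole.com/post">three</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("#container")))                       # 1
print([a.text for a in soup.select("li a")])                # ['one', 'two', 'three']
print([a.text for a in soup.select('a[href$=".jpg"]')])     # ['one']
print([a.text for a in soup.select('a[href*="jobbole"]')])  # ['one', 'three']
print(soup.select("li:nth-child(2) a")[0].text)             # two
```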
Example
```python
# pip install lxml
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
# from lxml import etree  # only needed for the commented xpath alternative below


class CollegateRank(object):

    def get_page_data(self, url):
        response = self.send_request(url=url)
        if response:
            # The site serves GBK-encoded pages, so save with the same encoding.
            with open('page.html', 'w', encoding='gbk') as file:
                file.write(response)
            self.parse_page_data(response)

    def parse_page_data(self, response):
        # Parse the listing page with bs4.
        soup = BeautifulSoup(response, 'lxml')
        # soup.find() returns a single node, soup.find_all() returns all matches.
        # The same query written with attrs would be:
        #   soup.find_all(attrs={'class': 'scores_List'})[0].find_all('dl')
        ranks = soup.find_all(class_='scores_List')[0].find_all('dl')
        for dl in ranks:
            school_info = {}
            school_info['url'] = dl.select('dt a')[0].attrs['href']
            school_info['icon'] = dl.select('dt a')[0].select('img')[0].attrs['src']
            school_info['name'] = dl.select('dt > strong a')[0].text
            school_info['address'] = dl.select('dd > ul > li')[0].text
            school_info['features'] = '、'.join(
                span.text for span in dl.select('dd > ul > li')[1].select('span'))
            school_info['type'] = dl.select('dd > ul > li')[2].text
            school_info['belong'] = dl.select('dd > ul > li')[3].text
            school_info['level'] = dl.select('dd > ul > li')[4].text
            school_info['weburl'] = dl.select('dd > ul > li')[5].text
            self.parse_school_detail(school_info['url'], school_info)

    def parse_school_detail(self, url, school_info):
        response = self.send_request(url)
        if not response:
            return
        soup = BeautifulSoup(response, 'lxml')
        paragraphs = soup.find_all(class_='sm_nav bk')[0].find_all('p')
        pool = ThreadPoolExecutor(20)
        for p in paragraphs:
            detail_url = p.select('a')[0].attrs['href']
            result = pool.submit(self.send_request, detail_url)
            result.add_done_callback(self.parse_school_info)

    def parse_school_info(self, future):
        text = future.result()
        if text:
            print('parsed page, length:', len(text))

    # xpath alternative for parse_page_data:
    # etree_xpath = etree.HTML(response)
    # ranks = etree_xpath.xpath('//div[@class="scores_List"]/dl')
    # for dl in ranks:
    #     school_info = {}
    #     school_info['url'] = self.extract_first(dl.xpath('./dt/a[1]/@href'))
    #     school_info['icon'] = self.extract_first(dl.xpath('./dt/a[1]/img/@src'))
    #     school_info['name'] = self.extract_first(dl.xpath('./dt/strong/a/text()'))
    #     school_info['address'] = self.extract_first(dl.xpath('./dd/ul/li[1]/text()'))
    #     school_info['features'] = '、'.join(dl.xpath('./dd/ul/li[2]/span/text()'))
    #     school_info['type'] = self.extract_first(dl.xpath('./dd/ul/li[3]/text()'))
    #     school_info['belong'] = self.extract_first(dl.xpath('./dd/ul/li[4]/text()'))
    #     school_info['level'] = self.extract_first(dl.xpath('./dd/ul/li[5]/text()'))
    #     school_info['weburl'] = self.extract_first(dl.xpath('./dd/ul/li[6]/text()'))

    def extract_first(self, data=None, default=None):
        if data:
            return data[0]
        return default

    def send_request(self, url, headers=None):
        headers = headers or {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/76.0.3809.100 Safari/537.36'}
        response = requests.get(url=url, headers=headers)
        if response.status_code == 200:
            return response.text


if __name__ == '__main__':
    url = 'http://college.gaokao.com/schlist/'
    obj = CollegateRank()
    obj.get_page_data(url)
```