Installing Beautiful Soup 4
pip install beautifulsoup4
What is Beautiful Soup?
It is a Python library for parsing HTML and XML that makes it easy to extract data from web pages.
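As a minimal sketch of what extracting data looks like, the snippet below parses a small hand-written HTML string with the built-in `html.parser` and reads the page title and the link targets. The markup is a made-up example, not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
  <a href="/first">First</a>
  <a href="/second">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> element.
title = soup.title.string
# The href attribute of every <a> element, in document order.
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(links)
```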
Parsers supported by Beautiful Soup
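Parser | Usage | Advantages |
---|---|---|
Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python, moderate speed, tolerant of malformed documents |
lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast, tolerant of malformed documents |
lxml XML parser | BeautifulSoup(markup, 'xml') | Fast, the only XML parser supported |
html5lib | BeautifulSoup(markup, 'html5lib') | Most lenient; parses the document the way a browser does and produces valid HTML5 |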
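Since `lxml` and `html5lib` are separate packages that may not be installed, one possible approach is to try the preferred parser and fall back to the built-in one. The helper name `make_soup` below is hypothetical; `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser is not installed, so use the built-in one.
        return BeautifulSoup(markup, "html.parser")


# Malformed input: neither <p> nor <b> is closed.
soup = make_soup("<p>Unclosed paragraph<b>bold")
bold_text = soup.find("b").get_text()
print(bold_text)
```

Different parsers can build different trees from the same malformed input, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.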
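Usage
- Import: from bs4 import BeautifulSoup
- Initialize: soup = BeautifulSoup(html_text, 'lxml')
- Search methods and their parameters:
find(): returns the first matching node
find_all(): returns all matching nodes
name: a tag name, a regular expression, or a list of tag names such as ['a', 'img']
attrs: a dict of attribute names and values that a tag must match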
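The search methods and parameters listed above can be sketched as follows. The HTML fragment is invented for illustration, and `html.parser` is used only so the example has no extra dependencies.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for illustration.
html = """
<div id="gallery">
  <a class="link" href="/a">A</a>
  <img class="thumb" src="/a.png">
  <b>bold</b>
  <a class="link external" href="https://example.com">B</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None when nothing matches).
first_link = soup.find("a")

# find_all() returns every match as a list-like result.
all_links = soup.find_all("a")

# name as a list: match any of the given tag names.
links_and_images = soup.find_all(["a", "img"])

# name as a regular expression: tag names that start with "b".
b_tags = soup.find_all(re.compile("^b"))

# attrs as a dict: match on attribute values.
thumbs = soup.find_all(attrs={"class": "thumb"})

print(first_link["href"], len(all_links))
print([tag.name for tag in links_and_images])
print([tag.name for tag in b_tags])
print(thumbs[0]["src"])
```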
A short example
from bs4 import BeautifulSoup

def parse_page_data(self, response):
    # Parse the page source with bs4 (requires the lxml package)
    soup = BeautifulSoup(response, 'lxml')
    # Each <dl> inside the first element with class "scores_List" describes one school
    ranks = soup.find_all(attrs={'class': 'scores_List'})[0].find_all('dl')
    for dl in ranks:
        school_info = {}
        # Link and icon come from the first <a> inside <dt>
        school_info['url'] = dl.select('dt a')[0].attrs['href']
        school_info['icon'] = dl.select('dt a img')[0].attrs['src']
        school_info['name'] = dl.select('dt > strong a')[0].text
        # The three <li> items hold the address, the tags, and the school type
        school_info['address'] = dl.select('dd > ul > li')[0].text
        school_info['test'] = ','.join([span.text for span in dl.select('dd > ul > li')[1].select('span')])
        school_info['type'] = dl.select('dd > ul > li')[2].text
        print(school_info)
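The method above needs a real page and the `lxml` package to run. A self-contained sketch of the same extraction is shown below, under stated assumptions: the HTML fragment is a hypothetical reconstruction inferred from the selectors, the function name `parse_schools` is invented, and `html.parser` replaces `lxml` to avoid an extra dependency. It returns the records instead of printing them and uses `select_one()` for single matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to match the selectors; the real page may differ.
SAMPLE_HTML = """
<div class="scores_List">
  <dl>
    <dt>
      <a href="/school/1"><img src="/icons/1.png"></a>
      <strong><a href="/school/1">Example University</a></strong>
    </dt>
    <dd>
      <ul>
        <li>Example City</li>
        <li><span>Tag A</span><span>Tag B</span></li>
        <li>Comprehensive</li>
      </ul>
    </dd>
  </dl>
</div>
"""


def parse_schools(html):
    """Return one dict per <dl> found inside the scores_List container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(attrs={"class": "scores_List"})
    if container is None:
        # No list on the page: return an empty result instead of raising.
        return []

    schools = []
    for dl in container.find_all("dl"):
        items = dl.select("dd > ul > li")
        schools.append({
            "url": dl.select_one("dt a")["href"],
            "icon": dl.select_one("dt a img")["src"],
            "name": dl.select_one("dt > strong a").get_text(strip=True),
            "address": items[0].get_text(strip=True),
            # Join the text of every <span> in the second <li>.
            "test": ",".join(s.get_text(strip=True) for s in items[1].select("span")),
            "type": items[2].get_text(strip=True),
        })
    return schools


schools = parse_schools(SAMPLE_HTML)
print(schools)
```

Returning a list keeps the parsing step separate from whatever is done with the data afterwards, such as printing or saving it.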