bs4 (Python only)
How bs4 data parsing works:
- Instantiate a BeautifulSoup object and load the page source into it
- Call the attributes or methods of that BeautifulSoup object to locate tags and extract data
Environment setup
Install the lxml parser (for example, pip install lxml)
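Instantiating BeautifulSoup
1. from bs4 import BeautifulSoup
2. Instantiate the object:
1. Load the data from a local HTML document into the object
from bs4 import BeautifulSoup
with open('./sogou.html','r',encoding='utf-8') as fp:
    soup=BeautifulSoup(fp,'lxml')
    print(soup)
2. Load page source fetched from the internet into the object
# response is the HTTP response obtained earlier
page_text = response.text
soup = BeautifulSoup(page_text,'lxml')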
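The two loading paths above can be sketched side by side. This is a minimal illustration, not the notes' own code: the markup and the file name demo.html are made up, and the built-in 'html.parser' is used instead of 'lxml' only so that the sketch runs without an extra install.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical page source, standing in for response.text.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Case 1: parse a string, as you would with page source fetched over HTTP.
soup_from_text = BeautifulSoup(html, "html.parser")

# Case 2: parse an open file handle, as you would with a page saved locally.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.html"  # hypothetical file name
    path.write_text(html, encoding="utf-8")
    with open(path, "r", encoding="utf-8") as fp:
        # The file is read during construction, so the soup stays usable
        # after the file has been closed.
        soup_from_file = BeautifulSoup(fp, "html.parser")

print(soup_from_text.title.string)
print(soup_from_file.p.string)
```

Either way the result is the same kind of object, so everything that follows applies to both.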
Methods and attributes provided for data parsing
soup.tagName returns the first tagName tag that appears in the HTML
from bs4 import BeautifulSoup
with open('./江西理工大学.html','r',encoding='utf-8') as fp:
    soup=BeautifulSoup(fp,'lxml')
    print(soup.a)
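A small self-contained sketch of the same lookup, using made-up markup and the built-in 'html.parser' (an assumption, chosen so that no extra parser is required). It also shows what happens when the tag does not exist at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links.
html = '<div><a href="/first">first</a><a href="/second">second</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a   # the first <a> in document order
missing = soup.table  # there is no <table> here, so this is None

print(first_link)
print(missing)
```

Because a missing tag gives None rather than an error, it is worth checking the result before chaining further attribute access onto it.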
soup.find()
- soup.find('tagName') is equivalent to soup.tagName
from bs4 import BeautifulSoup
with open('./江西理工大学.html','r',encoding='utf-8') as fp:
    soup=BeautifulSoup(fp,'lxml')
    print(soup.find('div'))
Here soup.find('div') is equivalent to soup.div
- Attribute location: find the first tag whose attribute has the given value
from bs4 import BeautifulSoup
with open('./江西理工大学.html','r',encoding='utf-8') as fp:
    soup=BeautifulSoup(fp,'lxml')
    print(soup.find('div',class_='tab-item'))
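Both uses of find() can be tried on a small made-up document. The markup, the id values and the choice of 'html.parser' are assumptions for illustration only. Note that the keyword is spelled class_ because class is a reserved word in Python; other attributes can be passed through the attrs dictionary.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one banner div followed by two tab items.
html = """
<div class="banner">top</div>
<div class="tab-item" id="news">news</div>
<div class="tab-item" id="notice">notice</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("div")                            # same element as soup.div
by_class = soup.find("div", class_="tab-item")       # first div with that class
by_attrs = soup.find("div", attrs={"id": "notice"})  # match on any attribute

print(by_tag)
print(by_class)
print(by_attrs)
```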
soup.find_all()
Finds every tag that matches the condition and returns them as a list
from bs4 import BeautifulSoup
with open('./江西理工大学.html','r',encoding='utf-8') as fp:
    soup=BeautifulSoup(fp,'lxml')
    print(soup.find_all('a'))
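A short sketch of find_all() on made-up markup (again parsed with 'html.parser' as an assumption). It shows that the result can be iterated like a list, and that no match produces an empty result rather than None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links and one span.
html = '<p><a href="/a">A</a> <a href="/b">B</a> <span>C</span></p>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")            # every <a>, in document order
mixed = soup.find_all(["a", "span"])  # a list of names matches any of them
none_found = soup.find_all("img")     # no match: empty result, not None

hrefs = [a["href"] for a in links]
print(hrefs)
print(len(mixed), len(none_found))
```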
soup.select()
- select('some selector') returns a list.
from bs4 import BeautifulSoup
with open('./江西理工大学.html','r',encoding='utf-8') as fp:
    soup=BeautifulSoup(fp,'lxml')
    print(soup.select('.share-pop'))
- Hierarchy selectors
from bs4 import BeautifulSoup
with open('./江西理工大学.html','r',encoding='utf-8') as fp:
    soup=BeautifulSoup(fp,'lxml')
    print(soup.select('.share-pop > a')[0])
A single > steps down exactly one level; a space matches across any number of levels
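The difference between > and a space is easiest to see with one link directly under the container and another nested one level deeper. The markup below is invented for this purpose, and 'html.parser' is an assumption made to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <a> is a direct child, the other sits inside a <p>.
html = """
<div class="share-pop">
  <a href="/direct">direct child</a>
  <p><a href="/nested">nested deeper</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

children = soup.select(".share-pop > a")   # only <a> directly under the div
descendants = soup.select(".share-pop a")  # any <a> at any depth under the div

print([a["href"] for a in children])
print([a["href"] for a in descendants])
```

Since select() always returns a list, index into it (or loop over it) to reach an individual tag.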
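Getting the text between tags
soup.a.text / soup.a.string / soup.a.get_text()
Differences:
text/get_text(): returns all the text inside a tag, including the text of nested tags
string: returns only the text that sits directly inside the tag, and is None when the tag has more than one child node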
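The difference can be checked on two made-up paragraphs, one holding plain text and one that mixes text with a nested tag. This is only a sketch; the markup and the 'html.parser' choice are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <p> contains a nested <b>.
html = "<div><p>only text</p><p>outer <b>inner</b> tail</p></div>"
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.find_all("p")

simple_string = simple.string  # the tag's single piece of text
nested_string = nested.string  # None: this <p> has several child nodes
nested_text = nested.text      # text of the tag and everything nested in it
nested_get_text = nested.get_text()

print(repr(simple_string))
print(repr(nested_string))
print(repr(nested_text))
```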
Getting an attribute value from a tag
soup.a['attribute name']
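A brief sketch of attribute access on an invented link (parsed with 'html.parser' as an assumption). Square brackets behave like a dictionary lookup, so a missing attribute raises KeyError; the get() method is the gentler alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical link with an href and two class names.
html = '<a href="https://example.com" class="btn primary">link</a>'
soup = BeautifulSoup(html, "html.parser")

href = soup.a["href"]        # raises KeyError if the attribute is absent
classes = soup.a["class"]    # class can hold several values, so it is a list
title = soup.a.get("title")  # get() returns None instead of raising

print(href)
print(classes)
print(title)
```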