Installation
Data scraped from web pages is parsed here with beautifulsoup4. Install it with:
pip3 install beautifulsoup4
Quick test: fetch a page and pretty-print it
import requests
r = requests.get("http://python123.io/ws/demo.html")
print(r.text)
print("----------------------------------------------")
demo = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())
Basic BeautifulSoup usage
from bs4 import BeautifulSoup
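soup = BeautifulSoup("<html>data</html>", "html.parser")  # parse markup passed in directly as a string
with open("D:/demo.html", encoding="utf-8") as f:  # parse the contents of a file (UTF-8 is assumed here)
    soup2 = BeautifulSoup(f, "html.parser")  # the with block closes the file afterwards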
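Basic BeautifulSoup elements
Available parsers: besides the built-in html.parser, the lxml and xml parsers can be used (both require the lxml package to be installed)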
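The parser is chosen by the second argument to BeautifulSoup. The following is a minimal sketch of one way to prefer lxml and fall back to the standard-library parser when lxml is not installed. The helper name make_soup and its preferred parameter are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable,
        # so fall back to the parser that ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>demo</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from malformed HTML, so it is usually safer to name the parser explicitly, as the notes do with "html.parser".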
import requests
r = requests.get("http://python123.io/ws/demo.html")
print(r.text)
print("----------------------------------------------")
demo = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())
print("----------------------------------------------")
print("soup.a.parent.name=", soup.a.parent.name)  # soup.a is the first <a> tag in the demo page
print("soup.a.string=", soup.a.string)
print("type(soup.a.string)=", type(soup.a.string))
print("soup.a.parent.parent.name=", soup.a.parent.parent.name)
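The same attributes can be tried without network access. This is a small sketch on an inline HTML fragment; the fragment and its URL are made up for illustration and are not the demo page used above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the demo page.
html = '<p class="course">Intro: <a href="http://example.com/py" id="link1">Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                       # first <a> tag in the document
print(tag.name)                    # name of the tag
print(tag.attrs)                   # attributes as a dict
print(tag["href"])                 # a single attribute, looked up by key
print(tag.parent.name)             # name of the enclosing tag
print(type(tag.string).__name__)   # class of the text inside the tag
```

Note that multi-valued attributes such as class come back as a list, so tag.parent["class"] is a list of class names rather than a plain string.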
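The three traversal directions
Basic data parsing
Summary of the three traversals
Downward traversal
Upward traversal
Sibling (parallel) traversal takes place among nodes that share the same parent
Usage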
import requests
r = requests.get("http://python123.io/ws/demo.html")
print(r.text)
print("----------------------------------------------")
demo = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())
print("----------------------------------------------")
print("soup.head=", soup.head)  # the <head> tag of the demo page
print("soup.head.contents=", soup.head.contents)
print("soup.body.contents=", soup.body.contents)
print("soup.body.contents[1]=", soup.body.contents[1])
print("-------------- print nodes (upward traversal) --------------")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
print("-------------- print nodes (sibling traversal) --------------")
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)
print(soup.a.parent)
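print("-------------- iterate over siblings (sibling traversal) --------------")
# Iterate over all following siblings (plural next_siblings; next_sibling is a single node)
for sibling in soup.a.next_siblings:
    print("******", sibling)
# Iterate over all preceding siblings
for sibling in soup.a.previous_siblings:
    print("/", sibling)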
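Sibling and child iterators also yield the text between tags, not only tags. The sketch below, on a made-up inline fragment, walks in all three directions and uses isinstance checks to keep only Tag nodes where that is what is wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment, used so the example runs offline.
html = "<html><body><p>One <a id='x'>A</a> and <a id='y'>B</a>.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Downward: direct children of <body>, keeping only tags.
down = [child.name for child in soup.body.children if isinstance(child, Tag)]

# Upward: every ancestor of the first <a>, ending at the document object.
up = [parent.name for parent in first.parents]

# Sideways: later siblings under the same parent, keeping only tags.
following = [s for s in first.next_siblings if isinstance(s, Tag)]

print(down)
print(up)
print([t["id"] for t in following])
```

Without the isinstance filter, next_siblings for this fragment would also return the strings " and " and ".", which is why plain sibling loops often print text fragments between the tags.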
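Summary
Tag: an HTML tag
Name: the name of a tag
Attributes: the attributes of a tag
NavigableString: the text (non-attribute string) inside a tag
Comment: a comment inside a tag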
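Because .string can return either ordinary text or a comment, it can help to check the type before using the value. The following is a minimal sketch; the helper describe is hypothetical and the markup is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<b><!--hidden note--></b><p>visible text</p>", "html.parser")


def describe(node):
    """Classify a node returned by .string (hypothetical helper)."""
    # Comment is a subclass of NavigableString, so it must be tested first.
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    return "other"


print(describe(soup.b.string))
print(describe(soup.p.string))
```

Printing a Comment shows only its content without the <!-- --> markers, so the type check is the reliable way to tell the two apart.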