Prerequisites:
Install the libraries you need. Any of the following may come in handy:
requests:
pip install requests
selenium:
pip install selenium
BeautifulSoup4:
pip install BeautifulSoup4
lxml parser:
pip install lxml
Parsing HTML with BeautifulSoup
Beautiful Soup is a Python library for extracting data from HTML or XML files.
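# Import requests
import requests
# Import BeautifulSoup
from bs4 import BeautifulSoup
# Import lxml (optional: BeautifulSoup only needs lxml to be installed in order to use the "lxml" parser)
import lxml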
To make it less likely that the website identifies the request as coming from a crawler, set a User-Agent header:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
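To see what the headers dictionary does without sending anything over the network, a request can be prepared and inspected locally. This is only a minimal sketch: https://example.com/ is a placeholder URL that does not come from the original text.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
}

# Build the request object without sending it.
# "https://example.com/" is a placeholder URL used only for illustration.
request = requests.Request("GET", "https://example.com/", headers=headers)
prepared = request.prepare()

# The prepared request carries the browser-like User-Agent set above.
sent_user_agent = prepared.headers["User-Agent"]
print(sent_user_agent)
```

Without a custom header, requests identifies itself with a User-Agent that starts with python-requests, which is easy for a site to filter out.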
Fetch the page with requests.get() (url is the address of the page to crawl):
response = requests.get(url, headers=headers)
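In practice the fetch step is often wrapped in a small helper that also sets a timeout and checks the HTTP status. The following is one possible sketch, not the method of the original text: the function name fetch_html and the 10-second default timeout are assumptions chosen for illustration.

```python
import requests

# Same User-Agent string as in the headers dictionary above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
}


def fetch_html(url, timeout=10):
    """Return the HTML text of url (hypothetical helper, for illustration).

    Raises requests.exceptions.RequestException on network failures,
    timeouts, or HTTP error statuses (4xx and 5xx).
    """
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Turn HTTP error statuses into exceptions instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

A call such as fetch_html("https://example.com/") (placeholder URL) would return the page source as a string, ready to be handed to BeautifulSoup.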
Build the BeautifulSoup object:
soup = BeautifulSoup(response.text, "lxml")
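To try the parsing step without any network access, the same constructor can be given an HTML string directly. This is a minimal sketch: the HTML snippet below is made up for illustration and stands in for response.text.

```python
from bs4 import BeautifulSoup

# A small made-up HTML document standing in for response.text.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
"""

# Parse with the lxml parser, as in the line above.
soup = BeautifulSoup(html, "lxml")

# Text of the <title> tag.
title = soup.title.string
# The href attribute of every <a> tag, in document order.
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(links)
```

If lxml is not installed, passing "html.parser" instead selects the parser from Python's standard library.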
Once you have the so