1. The BeautifulSoup4 class
BeautifulSoup4: commonly abbreviated as bs4
Purpose: find and select the content we need in an HTML or XML document; bs4 is a module implemented in Python.
Creating the object (the resulting object is a bs4 BeautifulSoup object):
BeautifulSoup(arg1, arg2)
arg1: the page source as a string; arg2: the parser to use
RIGHT Example
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html, "lxml")
print(soup, type(soup))
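A side note on the parser argument: "lxml" relies on the third-party lxml package, while "html.parser" ships with the standard library. The fallback below is only a sketch of our own (bs4 does not fall back automatically); it assumes the html string defined above.

try:
    import lxml  # only used to check that the lxml package is installed
    soup = BeautifulSoup(html, "lxml")          # fast third-party parser
except ImportError:
    soup = BeautifulSoup(html, "html.parser")   # built-in parser, no extra install needed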
2. The select method
(1) select: finds content using a CSS selector; select collects every element in the page that matches the CSS selector and returns them in a list.
RIGHT Example
p_list = soup.select('body > p')
print(p_list)
print(type(p_list[-1]))
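select accepts any CSS selector, not just tag paths. A small sketch against the sample html above, selecting by class and by id (variable names are ours):

# select by class: every <a> tag whose class list contains "sister"
sisters = soup.select('a.sister')
print(len(sisters))        # 3

# select by id: the returned list holds at most one element
lacie = soup.select('#link2')
print(lacie[0].text)       # Lacie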
(2) select_one: finds content using a CSS selector; select_one returns the first element of the list that select would return.
RIGHT Example
p = soup.select_one('p')
print(p, type(p))
Tip: prettify formats a bs4 object as indented, readable HTML.
Note: every element of the list returned by select, and the result returned by select_one, is always a bs4 Tag object.
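A quick sketch of both points, continuing from the soup object above:

# prettify: re-serialize the parsed document with indentation for reading/debugging
print(soup.prettify())

# every select/select_one result is a bs4 Tag, so further selects can be chained on it
first_p = soup.select_one('p')
print(type(first_p))                  # <class 'bs4.element.Tag'>
print(first_p.select_one('b').text)   # The Dormouse's story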
3. text and attrs
(1) text: gets the text inside an HTML tag. For example: <b>abcde</b> --> 'abcde'
RIGHT Example
# b. get the content of the b tag inside the first p tag
b = soup.select_one('p.title > b').text
print(b, type(b))  # The Dormouse's story <class 'str'>
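Note that text concatenates the text of every descendant node, not just the tag's own text. A small sketch using the second p tag of the sample html:

# .text joins the text of all descendant nodes into one string
story_text = soup.select_one('p.story').text
print(story_text)   # the paragraph text together with the text of the nested <a> tags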
(2) attrs: gets an attribute value from an HTML tag. For example: <a href="http://www.baidu.com"></a> --> attrs['href'] is 'http://www.baidu.com'
Note: if the attribute accessed through attrs is class, the result is a list.
RIGHT Example
# c. get the href attribute of the third a tag inside the second p tag
a = soup.select_one('body > p:nth-child(2) > a:nth-child(3)').attrs['href']
print(a, type(a))  # http://example.com/tillie <class 'str'>
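To illustrate the note about class above, a short sketch using the first a tag of the sample html: multi-valued attributes such as class come back as a list, while ordinary attributes stay strings.

link = soup.select_one('#link1')
print(link.attrs['class'])   # ['sister'] -- class is multi-valued, so bs4 returns a list
print(link.attrs['id'])      # link1      -- single-valued attributes stay plain strings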
4. Hands-on case study
APPLICATION: using bs4 to scrape data quickly
import requests
import csv
from bs4 import BeautifulSoup
from tqdm import tqdm


def requests_get(href):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39'
    }
    resp = requests.get(url=href, headers=headers)
    if resp.status_code == 200:
        return resp
    else:
        print(resp.status_code)


if __name__ == '__main__':
    f = open('today_news.csv', 'a', encoding='utf-8', newline='')
    f_writer = csv.writer(f)
    # write the header row once, before looping over the pages
    f_writer.writerow(['news type', 'news title', 'news link', 'news time'])
    for page in tqdm(range(1, 11)):
        URL = f'https://www.chinanews.com.cn/scroll-news/news{page}.html'
        response = requests_get(URL)
        response.encoding = 'utf-8'
        # 1. create the bs4 object
        soup = BeautifulSoup(response.text, "lxml")
        # 2. find all li tags under the ul first
        origin_news_list = soup.select('body > div.w1280.mt20 > div.content-left > div.content_list > ul > li')
        # 3. loop over the list items and write each news entry
        for i in origin_news_list:
            if i.text:
                # a. get the news type
                news_type = i.select_one('li a').text
                # b. get the news title
                news_name = i.select_one('li > div.dd_bt > a').text
                # c. get the news link
                news_link = 'https://www.chinanews.com.cn' + i.select_one('li div.dd_bt a').attrs['href']
                # d. get the news time
                news_time = i.select_one('li div.dd_time').text
                this_news = [news_type, news_name, news_link, news_time]
                f_writer.writerow(this_news)
        f_writer.writerow([f'page {page} done', '', '', ''])
    f.close()
# If the selector copied from the browser dev tools returns nothing, write the selector by hand instead.