When I first used BeautifulSoup, something always felt off: either the page wouldn't download completely, or there was so much junk data that cleaning it up was a chore.
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0'}
htmlmoban = "https://xxxxxxx/whole.html"
requests_html = requests.get(htmlmoban, headers=headers)
requests_html.encoding = 'gbk'  # the page is GBK-encoded; tell requests before reading .text
#print(requests_html)
soup = BeautifulSoup(requests_html.text, "lxml")
html_list = soup.find_all("div", {"class": "novellist"})  # select every element with class="novellist"
#print(html_list)
html_list1 = str(html_list)
#print(html_list1)
with open('file/01.txt', 'w', encoding='utf-8') as fw1:
    fw1.write(html_list1)  # the with block closes the file automatically; no explicit close() needed
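Instead of dumping `str(html_list)` to disk and cleaning it later, it is usually tidier to iterate over the result set and keep only the text and links you want. A minimal sketch of that idea, using a hypothetical inline HTML snippet in place of the downloaded page so it runs without the network:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page
html = """
<div class="novellist"><ul>
  <li><a href="/book/1.html">Book One</a></li>
  <li><a href="/book/2.html">Book Two</a></li>
</ul></div>
<div class="other">junk to ignore</div>
"""

soup = BeautifulSoup(html, "html.parser")  # html.parser works if lxml is not installed
lines = []
for div in soup.find_all("div", {"class": "novellist"}):
    for a in div.find_all("a"):
        # keep "title<TAB>href" per line instead of the raw tag markup
        lines.append(f"{a.get_text(strip=True)}\t{a.get('href')}")

print(lines)
```

Writing `"\n".join(lines)` to the file then gives a clean, ready-to-use list rather than raw HTML that still needs scrubbing.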
After looking it up, I found that find_all() takes an attrs argument, which lets you filter by class and other attributes.
The key line is: soup.find_all("div", {"class": "novellist"})
With it, only the data inside <div class="novellist"></div> elements is returned.
Still a beginner, feeling my way through...
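For reference, the attrs dict is one of several equivalent ways BeautifulSoup offers to match on class; a small sketch comparing them (the sample HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="novellist">A</div><div class="misc">B</div>'
soup = BeautifulSoup(html, "html.parser")

# Three equivalent ways to match class="novellist":
by_attrs = soup.find_all("div", attrs={"class": "novellist"})  # explicit attrs dict
by_kw = soup.find_all("div", class_="novellist")               # class_ keyword (class is reserved in Python)
by_css = soup.select("div.novellist")                          # CSS selector

print([t.get_text() for t in by_attrs])  # only the "novellist" div matches
```

All three return the same tags, so which one to use is mostly a matter of taste; class_ is the shortest when you only filter by class.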