Learning goal:
Use the bs4 library to scrape a novel (from 笔趣阁, Biquge)
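Learning content:
bs4 library: (from bs4 import BeautifulSoup)
Convert the page source into an object: soup = BeautifulSoup(html, 'lxml')
Parse the object using the tags in the page:
soup.a  (the first <a> tag; read its text with soup.a.text or soup.a.string)
soup.find('div')  (the first matching tag)
soup.find_all('div')  (a list of all matching tags)
soup.select('.class>a>b')  # ">" matches one level at a time (direct children)
soup.select('.class>a c')  # a space matches across levels (any descendant)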
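The lookup methods above can be tried on a small page without touching the network. This is a minimal sketch: the HTML is made up for the illustration, and it uses Python's built-in "html.parser" rather than 'lxml' so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration.
html = """
<div class="box">
  <a href="/one"><b>first</b></a>
  <a href="/two"><span><b>second</b></span></a>
</div>
<div class="other">plain</div>
"""

# Assumption: "html.parser" stands in for 'lxml' to avoid the extra install.
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                 # first <a> tag in the document
first_div = soup.find("div")        # first <div> only
all_divs = soup.find_all("div")     # list of every <div>

direct = soup.select(".box>a>b")    # <b> must be a direct child of <a>
nested = soup.select(".box>a b")    # <b> may sit at any depth below <a>

print(first_link.text)              # first
print(len(all_divs))                # 2
print([b.text for b in direct])     # ['first']
print([b.text for b in nested])     # ['first', 'second']
```

The second <b> is wrapped in a <span>, so only the selector with a space reaches it.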
Attributes:
soup.a['href'] or soup.a['title']
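A short sketch of attribute access, together with the difference between .text and .string mentioned earlier. The HTML snippet and its href value are hypothetical, and "html.parser" is used here only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the href and title values are made up.
html = '<p>Read <a href="/0_790/1.html" title="Chapter 1">Chapter 1</a> now</p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a
href = link["href"]           # square brackets raise KeyError if the attribute is missing
title = link["title"]
missing = link.get("id")      # .get() returns None instead of raising

link_string = link.string     # the <a> tag has a single text child
para_string = soup.p.string   # None: <p> has several children
para_text = soup.p.text       # all nested text joined together

print(href, title, missing)
print(link_string, para_string, para_text)
```

When a tag may contain nested tags, .text is the safer choice, because .string gives None as soon as there is more than one child.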
Learning output:
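import os

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Create the output directory for the novel if it does not exist yet
    os.makedirs('./元尊', exist_ok=True)

    url = 'https://www.bqkan8.com/0_790/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }

    # Pass the raw bytes to BeautifulSoup so it detects the page encoding itself.
    # This matches response.text.encode('iso-8859-1') when requests falls back to
    # ISO-8859-1, and avoids a UnicodeEncodeError when it does not.
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')

    # Each <dd> under .listmain > dl holds one chapter link
    yzlist = soup.select('.listmain>dl dd')

    # Slice kept as in the original notes: entries 12 through 52 of the list
    for li in yzlist[12:53]:
        title = li.a.string
        detail_url = 'https://www.bqkan8.com' + li.a['href']

        # Fetch the chapter page and extract the text of <div class="showtxt">
        detail_page = requests.get(url=detail_url, headers=headers).content
        detail_soup = BeautifulSoup(detail_page, 'lxml')
        div_tag = detail_soup.find('div', class_='showtxt')
        content = div_tag.text

        # The with-block closes the file even if the write fails
        with open('./元尊/' + title + '.txt', 'w', encoding='utf-8') as fp:
            fp.write(title + ':' + '\n' + content + '\n')
        print(title, 'over')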
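
The chapter-list step can be checked offline before running the full download. The sketch below is one possible way to do that, not part of the original script: extract_chapters and safe_filename are hypothetical helpers, the sample HTML only imitates the .listmain structure, and "html.parser" replaces 'lxml' so nothing extra needs installing. urljoin from the standard library builds the absolute URL instead of string concatenation.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_chapters(index_html, base_url):
    """Return (title, absolute_url) pairs from a chapter index page (hypothetical helper)."""
    soup = BeautifulSoup(index_html, "html.parser")
    chapters = []
    for dd in soup.select(".listmain>dl dd"):
        link = dd.a
        if link is None or not link.get("href"):
            continue  # skip <dd> entries that carry no usable link
        title = link.get_text(strip=True)
        chapters.append((title, urljoin(base_url, link["href"])))
    return chapters


def safe_filename(title):
    """Replace characters that Windows forbids in file names (hypothetical helper)."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


# Made-up sample that imitates the index page layout assumed by the selector.
sample = """
<div class="listmain"><dl>
  <dt>Chapters</dt>
  <dd><a href="/0_790/1.html">Chapter 1: Start?</a></dd>
  <dd>no link here</dd>
  <dd><a href="/0_790/2.html">Chapter 2</a></dd>
</dl></div>
"""

chapters = extract_chapters(sample, "https://www.bqkan8.com/0_790/")
for title, link_url in chapters:
    print(safe_filename(title), link_url)
```

Cleaning the title first matters because a chapter name containing ':' or '?' would make open() fail on Windows.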