Scraping a Novel with Python's bs4 Library

wherehw

Article link: https://blog.csdn.net/wherehw/article/details/119056409

Learning goal:

Use the bs4 library to scrape a novel from the Biquge (笔趣阁) site.

What to learn:

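The bs4 library (from bs4 import BeautifulSoup)
It turns a page's HTML source into an object: soup = BeautifulSoup(html, 'lxml')
The object is then parsed by way of the tags in the page:
soup.a: the first <a> tag; soup.a.text or soup.a.string gives its text
soup.find('div'): the first <div>
soup.find_all('div'): a list of every <div>
soup.select('.class > a > b'): ">" matches direct children only, one level at a time
soup.select('.class > a c'): a space matches descendants at any depth, so it can skip levels
Attributes:
soup.a['href'] or soup.a['title']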
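
The calls listed above can be tried on a small inline document before pointing them at a real site. The sketch below is only an illustration: the HTML string is made up, and it uses Python's built-in 'html.parser' so that it runs without lxml installed (the 'lxml' parser used elsewhere in this post should behave the same way for markup this simple).

```python
from bs4 import BeautifulSoup

# Made-up HTML, loosely shaped like a chapter list page (an assumption for illustration).
html = """
<div class="listmain">
  <dl>
    <dd><a href="/0_790/1.html" title="Chapter 1">Chapter 1</a></dd>
    <dd><span><a href="/0_790/2.html" title="Chapter 2">Chapter 2</a></span></dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                # first <a> tag in the document
first_text = first_link.string     # text inside that tag
first_href = first_link["href"]    # attribute access works like a dict
first_div = soup.find("div")       # first <div>
all_dd = soup.find_all("dd")       # list of every <dd>

# ">" only follows direct children, so the <a> wrapped in <span> is not matched.
direct = soup.select(".listmain > dl > dd > a")
# A space follows descendants at any depth, so both links are matched.
nested = soup.select(".listmain > dl a")

print(first_text, first_href, len(all_dd), len(direct), len(nested))
```

The last two selectors differ only in the combinator, which is the difference between the "by level" and "across levels" forms noted above.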

Result:

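import os

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Create the output folder (元尊 is the novel's title) if it does not exist yet.
    os.makedirs('./元尊', exist_ok=True)
    base_url = 'https://www.bqkan8.com'
    url = base_url + '/0_790/'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }

    # Hand the raw bytes to BeautifulSoup so it detects the page encoding itself.
    # response.text can be decoded with the wrong charset and garble Chinese text.
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')

    # Each <dd> under .listmain > dl holds one chapter link.
    yzlist = soup.select('.listmain>dl dd')

    # Only a fixed range of entries from the list is downloaded.
    for li in yzlist[12:53]:
        title = li.a.string
        detail_url = base_url + li.a['href']

        # Use the raw bytes here as well, for the same encoding reason as above.
        detail_page = requests.get(url=detail_url, headers=headers).content
        detail_soup = BeautifulSoup(detail_page, 'lxml')

        # The chapter body sits in <div class="showtxt">.
        div_tag = detail_soup.find('div', class_='showtxt')
        content = div_tag.text

        # The with-block closes the file even if writing fails.
        with open('./元尊/' + title + '.txt', 'w', encoding='utf-8') as fp:
            fp.write(title + ':' + '\n' + content + '\n')
        print(title, 'over')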
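
Two spots in the script are fragile. The chapter title goes straight into a file name, and the chapter URL is built by string concatenation. The sketch below shows one possible way to harden both using only the standard library. The helper names safe_filename and chapter_url are hypothetical, not part of the original script, and the set of replaced characters is an assumption based on what Windows file names reject.

```python
import re
from urllib.parse import urljoin


def safe_filename(title, fallback="untitled"):
    """Hypothetical helper: make a chapter title usable as a file name."""
    # li.a.string can be None when the tag has nested children.
    if not title:
        return fallback
    # Replace characters that Windows (and some other systems) do not allow.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip()
    return cleaned or fallback


def chapter_url(index_url, href):
    """Hypothetical helper: resolve a chapter href against the index page URL."""
    # urljoin handles both absolute paths ("/0_790/1.html") and relative ones ("1.html").
    return urljoin(index_url, href)


print(safe_filename("Chapter 1: What?"))
print(chapter_url("https://www.bqkan8.com/0_790/", "/0_790/1.html"))
```

In the loop, these would stand in for the bare title in the open() call and for the 'https://www.bqkan8.com' + li.a['href'] concatenation.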
