Python Web Scraping from Beginner to Giving Up, Part 2 (Scraping a Novel Website)

Author: 零zero度

Categories: Basics, Web Scraping, Python

Article link: https://blog.csdn.net/qq_38636998/article/details/84838155


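In my spare time I used BeautifulSoup to scrape the chapter list of a novel website. The code below is commented in some detail so you can follow along. If anything is unclear, feel free to ask in the comments section below.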

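    import requests  # HTTP library used to fetch the page
    from bs4 import BeautifulSoup

    url = 'http://www.seputu.com/'  # the page to scrape
    # Fetch the page with requests. .content returns the raw bytes, so the
    # from_encoding argument passed to BeautifulSoup below actually takes effect
    # (it is ignored, with a warning, when the input is already a str).
    r = requests.get(url).content
    # print(r)  # uncomment to print the raw HTML
    '''
    Analysis of the HTML:
        Each title and its chapters sit under a <div class='mulu'> tag. The title is in the
        <h2> inside <div class='mulu-title'>, and the chapters are the <a> tags inside
        <div class='box'>.
    '''
    # Turn r into a BeautifulSoup object named soup, using the built-in html.parser
    soup = BeautifulSoup(r, 'html.parser', from_encoding='utf-8')
    # Use find_all to get every tag with class='mulu' and loop over them as mulu
    for mulu in soup.find_all(class_='mulu'):
        # Use find to get the <h2> that holds the title (None if this block has no title)
        h2 = mulu.find('h2')
        # Use find_all to get every <div class="box"> inside mulu and loop over them as b
        for b in mulu.find_all('div', {"class": "box"}):
            # Use find_all to get every <a> tag inside b and loop over them as i
            for i in b.find_all('a'):
                # The chapter URL, stored as href
                href = i.get('href')
                # The chapter name, stored as box_title
                box_title = i.get('title')
                # Uncomment to print the chapter URL and name
                # print(href, box_title)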
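
The loop above only walks the tags, so here is a minimal sketch of how the same traversal could collect its results into a list, which is easier to save or inspect afterwards. It runs offline against a small hand-written HTML sample, so it does not depend on the site being reachable. The sample markup, the example.com URLs, and the function name `extract_chapters` are all assumptions made for illustration; they only mimic the structure described in the analysis comment and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure described above (not the real page).
SAMPLE_HTML = """
<div class="mulu">
  <div class="mulu-title"><h2>Volume One</h2></div>
  <div class="box">
    <ul>
      <li><a href="http://example.com/1.html" title="Chapter 1">Chapter 1</a></li>
      <li><a href="http://example.com/2.html" title="Chapter 2">Chapter 2</a></li>
    </ul>
  </div>
</div>
<div class="mulu">
  <div class="box">
    <ul>
      <li><a href="http://example.com/3.html" title="Untitled block">Untitled block</a></li>
    </ul>
  </div>
</div>
"""


def extract_chapters(html):
    """Return one dict per 'mulu' block: its title (or None) and its chapters."""
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for mulu in soup.find_all(class_='mulu'):
        # Some blocks may have no <h2>, so guard against None before reading text
        h2 = mulu.find('h2')
        title = h2.get_text(strip=True) if h2 is not None else None
        chapters = []
        for box in mulu.find_all('div', {'class': 'box'}):
            for a in box.find_all('a'):
                chapters.append({'href': a.get('href'), 'name': a.get('title')})
        result.append({'title': title, 'chapters': chapters})
    return result


if __name__ == '__main__':
    for section in extract_chapters(SAMPLE_HTML):
        print(section['title'])
        for chapter in section['chapters']:
            print('   ', chapter['name'], chapter['href'])
```

To try it on the live site, the bytes returned by `requests.get(url).content` could be passed to `extract_chapters` in place of `SAMPLE_HTML`, assuming the page still has the layout described above.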