1. Why this idea:
I recently wanted to read a serialized novel, but there was no ready-made download, and reading it chapter by chapter on the web meant wading through ads. So the idea of scraping the chapters with Python and BeautifulSoup4 was born.
2. Environment:
Python 3.7
beautifulsoup4 4.7.1
requests 2.21.0
PyCharm 2018.3.2
3. Approach
First scrape a single chapter's title and body. Then scrape the links on the table-of-contents page, run the chapter scraper inside a loop over those links, and finally append everything to one txt file. The short sketch below shows the selector idea on its own before the full script.
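To make the selector step concrete, here is a minimal, self-contained sketch. The HTML snippet is a made-up stand-in for the site's chapter pages, which put the title inside a .bookname h1 and the body inside a #content div:

from bs4 import BeautifulSoup

# A made-up stand-in for one chapter page of the target site.
snippet = '''
<div class="bookname"><h1>Chapter 1 Title</h1></div>
<div id="content">Paragraph one. Paragraph two.</div>
'''
soup = BeautifulSoup(snippet, 'html.parser')
print(soup.select('.bookname h1')[0].text)      # -> Chapter 1 Title
print(soup.select('#content')[0].text.strip())  # -> Paragraph one. Paragraph two.

Because the snippet is inline, this runs without touching the network, which makes it easy to verify the selectors before pointing them at the live site.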
4. Code:
import requests
from bs4 import BeautifulSoup
import os
import time


def getcontent(url):
    # Fetch one chapter page, then append its title and body to the txt file.
    html = requests.get(url)
    html.encoding = 'utf-8'
    soup = BeautifulSoup(html.text, 'html.parser')
    title = soup.select('.bookname h1')
    re1 = title[0].text
    print(re1)
    content = soup.select('#content')
    result = content[0].text
    # The chapter body uses spaces as paragraph separators; turn them into newlines.
    re2 = str(result).strip().replace(' ', '\n')
    print(re2)
    with open(os.path.join(os.getcwd(), '凡人修仙传之仙界篇.txt'), 'a+', encoding='utf-8') as f:
        f.write(re1 + '\n' + re2 + '\r\n')


def getallurl():
    # Scrape the table-of-contents page and collect the href of every chapter link.
    result = []
    url = 'https://www.biquke.com/bq/0/990/'
    html = requests.get(url)
    html.encoding = 'utf-8'
    soup = BeautifulSoup(html.text, 'html.parser')
    links = soup.select('#list a')
    for i in links:
        result.append(i['href'])
    return result


if __name__ == '__main__':
    # Single-chapter tests used while developing:
    # getcontent('https://www.biquke.com/bq/0/990/4368042.html')
    # getcontent('https://www.biquke.com/bq/0/990/4374212.html')
    # getcontent('https://www.biquke.com/bq/0/990/4375800.html')
    allpageurl = getallurl()
    for i in allpageurl:
        # The directory page returns relative hrefs, so prepend the base URL.
        url = 'https://www.biquke.com/bq/0/990/' + i
        getcontent(url)
        time.sleep(1)  # pause between requests to go easy on the server
    print('=' * 50)
    print('Scraping finished')
    print('=' * 50)
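If the site ever starts rejecting bare requests, a common hardening step is to send a browser-like User-Agent, set a timeout, and retry on failure. Below is a minimal sketch of such a fetcher; fetch is a hypothetical helper, not part of the script above, and the header value and retry count are arbitrary choices rather than anything the site requires:

import time
import requests


def fetch(url, retries=3):
    # Hypothetical helper: browser-like User-Agent plus a timeout and a simple retry loop.
    headers = {'User-Agent': 'Mozilla/5.0'}
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.encoding = 'utf-8'
            return resp.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError('failed to fetch ' + url)

With a helper like this, getcontent and getallurl could call fetch(url) instead of requests.get directly, keeping the request policy in one place.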
5. Result screenshot:
6. Takeaways:
I am getting more comfortable with scraping. The fundamentals are what matter: work from a single page up to the whole site, keep the plan clear, print results to the console first, and only write them to a local file once the output looks right.