A Simple Web Novel Scraper in Python

发展稳定

Tags: python, web crawler

Article link: https://blog.csdn.net/weixin_41857305/article/details/120354096


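After finishing the language course at KCL, I finally had a large block of free time and wanted to write some code for practice, so I spent a little while on a small Python crawler. It is very simple; here is the code.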

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Sep  9 15:28:23 2021
Purpose: scrape the text of a web novel
@author: fanzhen
"""
import requests
from bs4 import BeautifulSoup
import time

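def get_html(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
    }
    req = requests.get(url, headers=headers)
    # Check the HTTP status code: 200 means success; the errors you will
    # most often see are 404 and 503
    if req.status_code == 200:
        # Use the encoding detected from the page content to avoid garbled text
        req.encoding = req.apparent_encoding
        return req.text
    return None  # the request failed, so there is no page text to return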
    
    
def get_texts(html):
    soup = BeautifulSoup(html, 'html.parser')  # parse the page source with BeautifulSoup
    title = soup.select("#main > h1")  # chapter title
    w = title[0].get_text().replace('\n', '').replace('\r', '') + '\n'
    paragraphs = soup.find_all('p')
    for p in paragraphs[:-1]:  # body text: every <p> except the last one
        w += p.get_text().replace('\n', '').replace('\r', '')
    w += '\n'
    print(w)
    return w
    

def next_page(html):
    soup = BeautifulSoup(html, 'html.parser')  # parse the page source with BeautifulSoup
    links = soup.select('a')
    return links[-1].get('href')  # the last <a> on the page is taken as the "next page" link

def main():
    time_start = time.time()
    base_url = 'https://www.tianyabooks.com/horror/qingnangshiyi/'
    with open('青囊尸衣.txt', 'w', encoding='utf-8') as f:
        html = get_html(base_url + '107556.html')
        while html:  # stop if a page could not be downloaded
            f.write(get_texts(html))  # write the scraped text to the txt file
            href = next_page(html)
            if href == './':  # './' (the index page) marks the end of the book
                break
            html = get_html(base_url + href)  # fetch the next page

    time_end = time.time()  # measure the total running time
    print('Scraping finished, elapsed:', time_end - time_start, 's')
    

if __name__=='__main__':
    main()
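
The script builds the next page's address by gluing the `href` onto a hard-coded directory string. As a minimal sketch of an alternative, the standard library's `urllib.parse.urljoin` can resolve the relative link against the current page's URL instead. The helper names `resolve_next` and `is_index` are hypothetical, and `107557.html` is only an illustrative file name, not one taken from the site.

```python
from urllib.parse import urljoin


# Hypothetical helpers for illustration; they are not part of the script above.
def resolve_next(current_url, href):
    """Resolve a relative "next page" href against the current page's URL."""
    return urljoin(current_url, href)


def is_index(current_url, next_url):
    """Return True if next_url is the directory that contains current_url."""
    return next_url == urljoin(current_url, './')


if __name__ == '__main__':
    page = 'https://www.tianyabooks.com/horror/qingnangshiyi/107556.html'
    print(resolve_next(page, '107557.html'))  # illustrative chapter file name
    print(resolve_next(page, './'))           # resolves to the book's directory
```

With this approach the loop could compare full URLs rather than the raw `'./'` string, and the directory prefix would not need to be repeated in the code.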

This little crawler is single-threaded, so it is slow: scraping one web novel takes about 5 minutes. It still needs improvement.
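
One possible direction for that improvement is a thread pool. The script above discovers each chapter only by following the previous page's "next" link, so downloads can run in parallel only if the chapter URLs are known in advance, for example collected from the book's index page; that is an assumption, not something the script does. The sketch below uses a hypothetical `fetch_all` helper and a stand-in `fake_fetch` function so that it runs without network access; in practice `get_html` would be passed in instead.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL in a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `urls`, not in completion order,
        # so the chapters stay in reading order
        return list(pool.map(fetch, urls))


if __name__ == '__main__':
    # Stand-in downloader so the sketch runs offline (hypothetical)
    def fake_fetch(url):
        return 'text of ' + url

    chapters = fetch_all(['a.html', 'b.html', 'c.html'], fake_fetch)
    print(chapters)
```

Because the results come back in input order, they can be written to the txt file one after another exactly as the single-threaded loop does. A modest `max_workers` value keeps the load on the site reasonable.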
