Scraping a Novel with a Web Crawler

Latest recommended article published 2024-07-17 17:44:43

xxm758258

Article tags: web crawler

Article link: https://blog.csdn.net/u014318939/article/details/78961048

A web crawler that downloads a novel chapter by chapter:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import os
import urllib.request
import re

# Fetch a URL and return its HTML, decoded as GBK (undecodable bytes are dropped)
def handlechapter(url):
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('gbk', 'ignore')
    return html
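# Output file for the downloaded text (root of drive D)
f = open('D:/zongcai.txt', 'w', encoding='utf-8')
# URL of the novel's table of contents
url = "http://www.246zw.com/html/18/18714/"
# Open the link
url_open = urllib.request.urlopen(url)
# Read the HTML source
url_html = url_open.read().decode('gbk', 'ignore')
# Parse it with BeautifulSoup
soup = BeautifulSoup(url_html, 'html.parser')
# Collect the chapter links from the element with id "list"
print("Collecting all links")
# "[^/html/]" is a character class: it keeps hrefs whose first character is
# not '/', 'h', 't', 'm' or 'l', which is meant to select relative chapter links
links = soup.find(id='list').find_all("a", href=re.compile("^[^/html/]"))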

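for link in links:  # iterate over the chapters
    chattitle = link.get_text()  # chapter title (get_text() never returns None)

    print('Downloading ' + chattitle)

    f.write('\n' + chattitle + '\n')  # write the chapter title to the file
    htmlurl = url + str(link['href'])  # build the chapter URL from the relative href

    html = handlechapter(htmlurl)  # fetch the chapter page
    chapterhtml = BeautifulSoup(html, 'html.parser')
    for each in chapterhtml.find(id='content').strings:
        # Strip non-breaking spaces; in text mode '\n' already becomes the
        # platform line ending, so os.linesep would write '\r\r\n' on Windows
        f.write('%s\n' % each.replace('\xa0', ''))
    print('Finished ' + chattitle)
f.close()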
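
Two details of the script are easy to misread: the href filter `^[^/html/]` is a character class rather than the literal prefix `/html/`, and chapter URLs are built by plain string concatenation. The sketch below is a minimal illustration of both, not part of the original script. The sample hrefs and the helper name `chapter_url` are hypothetical, and `urllib.parse.urljoin` is swapped in here as an alternative to `url + href`.

```python
import re
from urllib.parse import urljoin

# Table-of-contents URL used by the script
BASE_URL = "http://www.246zw.com/html/18/18714/"

# Same pattern as the script: rejects hrefs whose FIRST character is
# one of '/', 'h', 't', 'm', 'l' (a character class, not a prefix match)
href_filter = re.compile("^[^/html/]")


def chapter_url(base, href):
    """Resolve a chapter href against the table-of-contents URL (hypothetical helper)."""
    return urljoin(base, href)


# Hypothetical hrefs, chosen only to show how the filter behaves
samples = ["123456.html", "/html/18/18714/", "http://example.com/", "list.html"]
kept = [h for h in samples if href_filter.search(h)]

print(kept)
for href in kept:
    print(chapter_url(BASE_URL, href))
```

Note that `list.html` is rejected only because it starts with `l`, so the filter works for numerically named chapter files but would silently skip a relative link that begins with one of the excluded letters. `urljoin` also handles absolute paths and full URLs, which simple concatenation does not.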
