Trying out Python to scrape novels from Zhulang (逐浪小说)

Latest recommended article published on 2022-05-16 18:25:33

heybob

Category: python学习 (Python learning). Tags: python, web crawler

Article link: https://blog.csdn.net/heybob/article/details/49903379

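This article describes a self-taught Python learner's attempt to scrape novels from Zhulang. The page is parsed with BeautifulSoup to extract the novel's title, chapter names, and chapter text, which are saved to a local file. The code has some rough edges: the default encoding is set by hand, and the scraped chapter text comes out without line breaks. The author is looking for a fix and shares a link to a related Zhulang scraping example.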

(Summary generated automatically by CSDN.)

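I just wanted to try scraping web pages with Python. I started with Mihua (米花), but at some point it stopped loading for reasons I don't know, so I switched to Zhulang (逐浪).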

#coding:utf-8
# Python 2 script.

import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2 hack so str() on unicode text does not raise UnicodeEncodeError

import os, urllib, urllib2
from bs4 import BeautifulSoup

IMAGE_DIR = '/home/cloud/temp/'  # output directory
if not os.path.exists(IMAGE_DIR):
    os.mkdir(IMAGE_DIR)

# NOTE: request() is not defined in this listing; it is assumed to fetch a URL and return the page HTML.
def get_book_without_db(url):
    """Write to the file while crawling; no database is used."""
    soup = BeautifulSoup(request(url))
    title = (soup.find_all("title"))[0].string.split('_')[0]  # book title

    book_path = os.path.join(IMAGE_DIR, title)
    book = open(book_path, 'a+')
    i = 1
    for volume in soup.find_all('h2'):
        i += 1
        volume_name = volume.text
        print type(volume_name)  # debug output
        book.write(str(volume_name) + '\n\n\n')
        for chapter in soup.find_all('ul')[i].find_all("li"):
            chapter_name = chapter.find('a').text
            book.write(str(chapter_name) + '\n')

            chapter_url = chapter.find('a').get('href')
            content_soup = BeautifulSoup(request(chapter_url))
            # Only the first child node of the first <p> is kept here.
            content = content_soup.find_all("p")[0].contents[0]
            book.write(str(content) + '\n\n')
    book.close()
    print 'Book path: ', book_path
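
Two gaps in the listing above may be worth a sketch: `request()` is never defined, and the chapter text is written without line breaks. That second problem is possibly because only `contents[0]` of the first `<p>` is taken, so any `<br>` separators are dropped. The code below is a minimal Python 3 sketch under stated assumptions, not the author's code. The `request` body, the `User-Agent` header, and the `gb18030` fallback are guesses about what the site needs. `decode_html`, `FirstParagraphText`, and `first_paragraph_text` are hypothetical names, and the standard-library `html.parser` stands in for BeautifulSoup so that the block runs without third-party packages.

```python
# Python 3 sketch; the listing above targets Python 2.
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def decode_html(raw, charset=None):
    """Decode raw bytes: try the declared charset, then UTF-8, then GB18030."""
    for encoding in (charset, "utf-8", "gb18030"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep going and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def request(url, timeout=10):
    """Hypothetical stand-in for the undefined request(): return page text."""
    # The User-Agent value is an assumption; some sites reject the default one.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return decode_html(resp.read(), resp.headers.get_content_charset())


class FirstParagraphText(HTMLParser):
    """Collect the text of the first <p>, turning each <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while inside the first <p>
        self.done = False    # True once the first <p> has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "p" and not self.inside:
            self.inside = True
        elif tag == "br" and self.inside:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def first_paragraph_text(html):
    """Return the first <p> as text, one line per <br>-separated segment."""
    parser = FirstParagraphText()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)
```

With these helpers, the chapter text could be obtained as `first_paragraph_text(request(chapter_url))`. When staying with BeautifulSoup, calling `get_text("\n")` on the `<p>` tag is the closest equivalent, since it joins all text nodes with a newline instead of keeping only the first child.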