Python Web Scraping: Downloading Novels from Biquge (笔趣阁) to Local Files and Storing Them in a Database

Latest recommended article published on 2024-08-24 21:58:48

我开飞机发酒疯



Article link: https://blog.csdn.net/includestdio12/article/details/84984926


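This article presents a Python scraper that searches the Biquge website for a novel, downloads it to local files, and stores its content in a database. It analyzes the site's URL structure, uses GBK encoding to handle Chinese characters, parses pages with BeautifulSoup, and persists the novel through database operations. It also implements a book recommendation feature.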


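After learning Python I got into web scraping, and since I also enjoy reading novels, I wrote a small scraper that downloads novels from Biquge.

The program imports the following libraries: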

import requests
import mysql.connector
import os
import time
from bs4 import BeautifulSoup
import urllib.parse

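A site cannot list every book on its pages. After looking closely at how the site is organized, I concluded that looking up books through the site's search box is the best approach.

In the search URL shown in the address bar, appending the book title after searchkey+= leads to the details page of the matching book. This is where the rules for URLs come in. By the standard, a URL may contain only a subset of ASCII characters; other characters, such as Chinese characters, are not allowed, so they have to be encoded. Take searching for 三寸人间 as an example.

First, encode the novel title entered by the user as GBK, then call urllib.parse.quote() to turn the Chinese text into its percent-encoded URL form. Concatenating that with the search address gives url_search. This is the search URL we want, and scraping can start from there.

Here is the complete code first, followed by a brief explanation.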
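The encoding step can be sketched as follows. This is a minimal illustration, not the article's exact code: the function name build_search_url and the SEARCH_PREFIX address are hypothetical placeholders, since the site's real search path is not reproduced here.

```python
from urllib.parse import quote

# Hypothetical placeholder: substitute the site's real search address.
SEARCH_PREFIX = "http://www.example.com/search?searchkey="


def build_search_url(novel_name: str, prefix: str = SEARCH_PREFIX) -> str:
    """Percent-encode the GBK bytes of novel_name and append them to prefix."""
    gbk_bytes = novel_name.encode("gbk")  # the article encodes the title as GBK
    return prefix + quote(gbk_bytes)      # quote() accepts bytes as well as str


url_search = build_search_url("三寸人间")
print(url_search)

# For comparison: quote() on a str defaults to UTF-8, which produces a
# different query string for the same title.
print(quote("三寸人间"))
```

The charset matters because the percent-encoded bytes differ between GBK and UTF-8, so a server that expects one will not recognize a title encoded with the other.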

import requests
import mysql.connector
import os
from bs4 import BeautifulSoup
import urllib.parse
url_base='http://www.biquge.com.tw'
lastrowid=0
class Sql(object):
    # Open the database connection (shared by all instances):
    conn = mysql.connector.connect(host='localhost',port=3306,user='root',passwd='123456789',database='novel',charset='utf8')
    def addnovels(self,novelname):  # insert a novel into the novel table
        cur = self.conn.cursor()
        # Parameterized query: the driver escapes quotes inside novelname
        cur.execute("insert into novel(novelname) values(%s)", (novelname,))
        lastrowid = cur.lastrowid
        cur.close()
        self.conn.commit()
        return lastrowid
    def addchapters(self,novelid,chaptername,content):  # insert one chapter's content
        cur = self.conn.cursor()
        cur.execute("insert into chapter(novelid,chaptername,content) values(%s, %s, %s)", (novelid,chaptername,content))
        cur.close()
        self.conn.commit()

def getHtmltext(url):
    # Fetch a page and decode it as GBK; return "" if the request fails
    try:
        r = requests.get(url, timeout=60)
        r.raise_for_status()
        r.encoding = "gbk"
        return r.text
    except requests.RequestException:
        return ""

def Download(soup_search, path1):
    second=0
    datas=soup_search.find("div",{"id":"list"}).find("dl").find_all("dd")
    if not os.path.exists(path1):
        os.makedirs(path1)
    for i in datas:
        second+=1
        link = i.a.attrs.get("href")  # URL of one specific chapter
        numSection=i.a.string  # chapter title
        print(numSection)
        path2=path1+"\\