Hands-on Web Scraping: Crawling Hongxiu Tianxiang (hongxiu.com) and Storing the Results in a Database

Latest recommended article published on 2023-08-24 15:02:53

天下第一小白

Category column: Python开发日记 (Python Development Diary) | Tags: 爬虫-python, python

Article link: https://blog.csdn.net/sinat_36899414/article/details/77606711

After watching a lot of web-scraping video tutorials, I recently picked a novel site to practice on.

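####Goal: scrape the first 20 pages of novels on Hongxiu Tianxiang, collecting each novel's title, author, category, status, word count, and short description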

####The URL is here:

https://www.hongxiu.com/all?pageSize=10&gender=2&catId=30001&isFinish=-1&isVip=-1&size=-1&updT=-1&orderBy=0&pageNum=1
####This is roughly what the page looks like

[Image: screenshot of the novel listing page]

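Next I will scrape all the useful information layer by layer, using BeautifulSoup and regular expressions to extract the data.
Open the site in Chrome, open the developer tools, and click the small mouse-pointer icon in the top-left corner of the developer tools panel (in some browsers it is a magnifying glass).
Wherever the mouse then hovers on the page, the corresponding source code is highlighted, as shown here:

[Image: screenshot of the Chrome developer tools inspecting the page]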

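Using this method, we find that every novel sits under an li tag. Following the rule of grabbing the large pieces first and the small pieces second, we first collect the li entries and then break each one down. Note that what BeautifulSoup returns for each li is a Tag object, not an HTML string. Because the crawling function re-parses whatever it is given with BeautifulSoup(...), each entry has to be converted with str() first; otherwise you get errors of the NoneType variety. See the source code further down for details.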
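
The point about Tag objects versus strings can be sketched as follows. This is only an illustration written for Python 3: the HTML sample is hypothetical and merely mimics the class names used in this article, so the real hongxiu.com markup may differ. It also shows an alternative that the article does not use, namely searching inside the Tag directly instead of re-parsing str(tag).

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used in this article;
# the real hongxiu.com page may be structured differently.
SAMPLE_HTML = """
<div class="right-book-list"><ul>
  <li><div class="book-info">
    <h3><a href="/book/123">Sample Novel</a></h3>
    <a class="default" href="#">Sample Author</a>
    <p class="intro">A short description.</p>
  </div></li>
</ul></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
entries = soup.select(".right-book-list > ul > li .book-info")

first = entries[0]
# select() returns bs4 Tag objects, not strings.
print(type(first).__name__)  # Tag

# Option 1 (the article's approach): turn the Tag back into markup and re-parse it.
reparsed = BeautifulSoup(str(first), "html.parser")
author_a = reparsed.find_all("a", class_="default")[0].get_text()

# Option 2: search inside the Tag directly, with no str() round trip.
author_b = first.find_all("a", class_="default")[0].get_text()

print(author_a, author_b)
```

Both options find the same element here. Option 2 avoids parsing each entry a second time, but it would require a helper that accepts a Tag rather than a markup string.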

    # Inside the crawling function: collect the li entries on each page
    if type == 'li':
        result = soup.select(".right-book-list > ul > li .book-info")

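Next, we analyze each li entry. The novel's title is in an a tag inside the li: result = soup.find_all('a', href=re.compile(r'/book/\d+')). What distinguishes this a tag is that its href has the form /book/<digits>/, so the pattern can be written this way.
In the same way, the author, status, word count, and description are each located by their own distinguishing feature.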
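
To see what the href pattern accepts, here is a small standalone check using only the re module. The href values are made up for illustration. BeautifulSoup applies a compiled pattern with its search() method, so the pattern may match anywhere inside the attribute value, which is why the same call is used below.

```python
import re

# Raw string, so that \d reaches the regex engine unchanged.
book_link = re.compile(r"/book/\d+")

# Hypothetical href values, for illustration only.
hrefs = [
    "/book/8113944203889603",
    "/all?pageNum=2",
    "/book/",
    "//www.hongxiu.com/book/42/",
]

# search() matches anywhere in the string, not only at the start.
matches = [h for h in hrefs if book_link.search(h)]
print(matches)
```

The bare "/book/" link is rejected because \d+ needs at least one digit, while the protocol-relative link is accepted because the pattern is not anchored to the start of the string.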

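Next comes pagination. Each click on the next-page link only increases the pageNum value at the end of the URL by 1 and changes nothing else, so we can drive the crawl from pageNum: url = ('https://www.hongxiu.com / all?pageSize = 10 & gender = 2 & catId = 30001 & isFinish = -1 & isVip = -1 & size = -1 & updT = -1 & orderBy = 0 & pageNum = %s'%i).replace(' ', ''). Incrementing i by 1 on every loop iteration turns the page, and replace() strips the spaces that were typed into the URL string.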
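
As an alternative to typing the URL with spaces and stripping them afterwards, the query string can be assembled from its parameters. The sketch below assumes Python 3 (urllib.parse), whereas the script in this article targets Python 2; build_page_url is a hypothetical helper name, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.hongxiu.com/all"


def build_page_url(page_num):
    """Return the listing URL for one page (hypothetical helper)."""
    # A list of pairs keeps the parameters in the same order as the original URL.
    params = [
        ("pageSize", 10), ("gender", 2), ("catId", 30001),
        ("isFinish", -1), ("isVip", -1), ("size", -1),
        ("updT", -1), ("orderBy", 0), ("pageNum", page_num),
    ]
    return BASE_URL + "?" + urlencode(params)


# Pages 1 through 20, matching the goal stated at the top of the article.
urls = [build_page_url(n) for n in range(1, 21)]
print(urls[0])
```

Only pageNum changes between pages, so the rest of the parameter list stays fixed inside the helper.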

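Finally, we collect the titles, authors, and other fields in a dictionary of lists: msg = {'title': [], 'author': [], 'various': [], 'status': [], 'words': [], 'describe': []}
Then we pop() them one by one and insert them into the database. At the end, do not forget connect.commit(), followed by cur.close() and connect.close(). Without the commit nothing is persisted and the table stays empty.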
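
The dictionary-of-lists step and the commit/close sequence can be sketched in a self-contained way. To keep the example runnable without a MySQL server, it swaps in the standard-library sqlite3 module, which is not what the article uses; the sample data is made up. One difference to keep in mind: sqlite3 uses ? as its placeholder, while MySQLdb uses %s.

```python
import sqlite3

# Hypothetical sample data in the same shape as the article's msg dictionary.
msg = {
    "title": ["Novel A", "Novel B"],
    "author": ["Author A", "Author B"],
    "various": ["Romance", "Fantasy"],
    "status": ["Ongoing", "Finished"],
    "words": ["120,000 words", "450,000 words"],
    "describe": ['He said "hi".', "Second description."],
}

FIELDS = ["title", "author", "various", "status", "words", "describe"]
# zip() turns the six parallel lists into one tuple per novel, in page order.
rows = list(zip(*(msg[f] for f in FIELDS)))

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
try:
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE novel (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "title TEXT, author TEXT, various TEXT, status TEXT, "
        "words TEXT, description TEXT)"
    )
    # Placeholders let the driver handle quoting, so the double quotes in
    # the first description cannot break the statement.
    cur.executemany(
        "INSERT INTO novel (title, author, various, status, words, description) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()  # persist the inserts
    stored = cur.execute(
        "SELECT title, description FROM novel ORDER BY id"
    ).fetchall()
    cur.close()
finally:
    conn.close()  # runs even if an insert fails

print(stored)
```

Using zip() keeps the rows in the order they were scraped, whereas popping from the end of each list inserts them in reverse. Either way, the six lists must have the same length for the fields of one novel to line up.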

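Here is the content of my database (remember to set the database character set, otherwise storing Chinese text raises an error):

[Image: screenshot of the database table]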

Here is the source code:

# encoding: utf-8

import urllib
import re
from bs4 import BeautifulSoup
import MySQLdb
import sys


reload(sys)
sys.setdefaultencoding('utf8')


class spiders:
    def __init__(self):
        print 'crawling'

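    def getsourcr(self, url):
        # Python 2: re-encode the URL to the filesystem encoding, then download the page
        fs_encoding = sys.getfilesystemencoding()
        encoded_url = url.decode('utf-8').encode(fs_encoding)
        html = urllib.urlopen(encoded_url).read()
        return html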

    def crawling(self, html, type):
        # Parse html and return the tags that hold the requested field
        soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
        if type == 'li':
            result = soup.select(".right-book-list > ul > li .book-info")
        if type == 'title':
            result = soup.find_all('a', href=re.compile(r'/book/\d+'))
        if type == 'author':
            result = soup.find_all('a', class_='default')
        if type == 'various':
            result = soup.find_all('span', class_='org')
        if type == 'status':
            result = soup.find_all('span', class_='pink')
        if type == 'words':
            result = soup.find_all('span', class_='blue')
        if type == 'description':
            result = soup.find_all('p', class_='intro')
        return result


class DB:
    def mysql(self):
        connect = MySQLdb.connect(host='localhost', user='root', passwd='123456', charset='utf8')
        cur = connect.cursor()
        # cur.execute('create database if not EXISTS pydb CHARACTER SET utf8')
        connect.select_db('pydb')
        # cur.execute('create table novel(id INT PRIMARY KEY auto_increment,title VARCHAR(50),author VARCHAR(30),various VARCHAR(20),status VARCHAR(20),words VARCHAR(20),description VARCHAR(150))')
        return cur, connect

if __name__ == '__main__':

        db = DB()
        spider = spiders()
        i = 1
        count = 0
        msg = {'title': [], 'author': [], 'various': [], 'status': [], 'words': [], 'describe': []}
        cur, connect = db.mysql()
        while i <= 20:
            url = ('https://www.hongxiu.com / all?pageSize = 10 & gender = 2 & catId = 30001 & isFinish = -1 & isVip = -1 & size = -1 & updT = -1 & orderBy = 0 & pageNum = %s'%i).replace(' ', '')
            html = spider.getsourcr(url)
            result = spider.crawling(html, 'li')
            for li in result:
                titles = spider.crawling(str(li), 'title')
                for tl in titles:
                    msg['title'].append(tl.get_text())
                author = spider.crawling(str(li), 'author')
                for au in author:
                    msg['author'].append(au.get_text())
                var = spider.crawling(str(li), 'various')
                for va in var:
                    msg['various'].append(va.get_text())
                status = spider.crawling(str(li), 'status')
                for sta in status:
                    msg['status'].append(sta.get_text())
                words = spider.crawling(str(li), 'words')
                for wo in words:
                    msg['words'].append(wo.get_text())
                describe = spider.crawling(str(li), 'description')
                for des in describe:
                    msg['describe'].append(des.get_text())
            i = i+1
        try:
            while len(msg['title']):
                title = msg['title'].pop()
                print 'Title: '+title
                writer = msg['author'].pop()
                print 'Author: '+writer
                kind = msg['various'].pop()
                print 'Category: '+kind
                statuss = msg['status'].pop()
                print 'Status: '+statuss
                word = msg['words'].pop()
                print 'Word count: '+word
                describ = msg['describe'].pop()
                print 'Description: '+describ
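                # Pass the values as parameters so the driver quotes them; formatting them
                # into the SQL string breaks as soon as a description contains a double quote
                sql = 'insert into novel (title,author,various,status,words,description) VALUES (%s,%s,%s,%s,%s,%s)'
                cur.execute(sql, (title, writer, kind, statuss, word, describ))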
            connect.commit()
            # Close the cursor before the connection it belongs to
            cur.close()
            connect.close()
        except MySQLdb.Error as e:
            print "Mysql Error %d: %s" % (e.args[0], e.args[1])
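
The script above is Python 2: urllib.urlopen, reload(sys), and sys.setdefaultencoding are not available in Python 3. For readers on Python 3, the download step could look roughly like the sketch below. get_source is a hypothetical name, and decoding the response as UTF-8 is an assumption based on the from_encoding='utf-8' used in the article.

```python
import urllib.request


def get_source(url, timeout=10):
    """Download a page and return its body decoded as UTF-8 text.

    Hypothetical Python 3 counterpart of the article's getsourcr method.
    """
    # The with block closes the connection once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (needs network access, so it is left commented out):
# html = get_source("https://www.hongxiu.com/all?pageNum=1")
```

Because the function returns text rather than bytes, the result can be handed to BeautifulSoup(html, 'html.parser') without the from_encoding argument.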