Scraping the Novel Zhu Xian (诛仙) with Python

Article link: https://blog.csdn.net/miaomiao0313/article/details/53339244

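Well, a small problem came up the first time I wrote this post, so I had to write it all over again...

This was my first time scraping a novel, and I was a bit worried because I did not know which part of Python to use. After searching online, I found that bs (BeautifulSoup) was still the way to go, which quietly pleased me. Next I had to decide which novel to scrape. There are Sina novels and all sorts of web novels out there; after thinking it over, I settled on Zhu Xian (it fits my idea of what a novel is, and a ready-made URL was available). Right, back to the point: time to get to work.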

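1. Start with a clear plan. The end result should be a single file containing every chapter of the novel, with each chapter's title and text. So the code should define two empty lists up front: one for chapter titles and one for chapter links.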

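2. With the boilerplate code written, open the page source of the novel's index page. The chapter markup turns out to be very regular: every chapter is an <a> link inside <div id="list">.

So the lookup is easy to write (it is a BeautifulSoup query rather than a regex):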

contents = soup.find("div", id="list").find_all("a")
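To make clear what that query selects, here is a minimal sketch that does the same job with Python 3's standard-library html.parser instead of BeautifulSoup. This is a swapped-in technique for illustration only, not the method used in this post; the class name ChapterLinkParser, the sample HTML, and its hrefs are all made up.

```python
from html.parser import HTMLParser


class ChapterLinkParser(HTMLParser):
    """Collect [href, text] for every <a> inside <div id="list">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while we are inside <div id="list">
        self.in_link = False  # True while inside an <a> we are collecting
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            # Start counting at the target div, then track nested divs.
            if self.div_depth > 0 or attrs.get('id') == 'list':
                self.div_depth += 1
        elif tag == 'a' and self.div_depth > 0:
            self.in_link = True
            self.links.append([attrs.get('href'), ''])

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1][1] += data


# Hypothetical page fragment; the real index page is not reproduced here.
sample_html = '''
<div id="nav"><a href="/">Home</a></div>
<div id="list"><dl>
<dd><a href="/26_26491/1.html">第一章</a></dd>
<dd><a href="/26_26491/2.html">第二章</a></dd>
</dl></div>
'''

parser = ChapterLinkParser()
parser.feed(sample_html)
print(parser.links)
```

The link in the "nav" div is skipped because it sits outside <div id="list">, which is exactly the filtering that find("div", id="list").find_all("a") performs.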

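3. Matching the chapter titles was the hard part: I did not know how to write a regex for "第...章" (the "Chapter N" heading). After some research I found a really handy tool: a converter from Chinese characters to Unicode escapes.

Here is the link if you want to try it: http://www.bangnishouji.com/tools/chtounicode.html

So the regex this code needs is the following, where \u7b2c is 第 and \u7ae0 is 章: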

re.findall(re.compile(ur'\u7b2c.*\u7ae0'),item.text)
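As a quick check of what this pattern accepts, here is a small sketch in Python 3, where the ur'' prefix no longer exists and an ordinary string literal is used instead. The helper name is_chapter_title and the sample link texts are made up for illustration.

```python
import re

# Same pattern as above: U+7B2C is 第, U+7AE0 is 章.
CHAPTER_RE = re.compile('\u7b2c.*\u7ae0')


def is_chapter_title(text):
    """Return True if the link text contains a 第...章 heading."""
    return CHAPTER_RE.search(text) is not None


# Hypothetical link texts of the kind an index page might contain.
samples = ['第一章', '第十二章', '最新章节', '作品相关']
chapters = [s for s in samples if is_chapter_title(s)]
print(chapters)
```

Note that '最新章节' is rejected even though it contains 章, because no 第 comes before it. That is what keeps navigation links out of the chapter list.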

4. Next, match the chapter body:

content = soup.find("div", id="content")

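Then strip everything that is not needed with re.sub. Just copy the unwanted markup from the page source and paste it in as the pattern, even when it is long, for example: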

cont = re.sub(r'<script src="/js/chaptererror.js" type="text/javascript"></script>', '', cont)
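The whole cleanup step can be tried on a small string without touching the network. The sketch below is Python 3; the function name clean_content and the sample markup are assumptions made for illustration, while the four substitutions are the ones used in this post.

```python
import re


def clean_content(html):
    """Strip the wrapper markup around a chapter and turn <br/> into newlines."""
    html = re.sub(r'<script src="/js/chaptererror.js" type="text/javascript"></script>', '', html)
    html = re.sub(r'<div id="content">', '', html)
    html = re.sub(r'</div>', '', html)
    html = re.sub(r'<br/>', '\n', html)
    return html


# Hypothetical chapter markup, shaped like the output of str(content).
sample = ('<div id="content">line one<br/>line two'
          '<script src="/js/chaptererror.js" type="text/javascript"></script></div>')
cleaned = clean_content(sample)
print(cleaned)
```

Since every pattern here is a literal string, str.replace would do the same job; re.sub only becomes necessary once a pattern needs wildcards.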

5. Putting it all together, the complete code is:

# -*- coding:utf-8 -*-
# Note: this is Python 2 code (urllib2, print statements, ur'' literals).

from bs4 import BeautifulSoup
import urllib2
import re

url = 'http://www.biquge.tw/26_26491/'
user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:49.0) Gecko/20100101 Firefox/49.0"
headers = {'User-Agent': user_agent}
# Fetch the index page and collect every link in the chapter list.
request = urllib2.Request(url, headers=headers)
html = urllib2.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
contents = soup.find("div", id="list").find_all("a")

# Matches chapter titles of the form 第...章.
chapter_re = re.compile(ur'\u7b2c.*\u7ae0')

# Collect chapter titles and links up front, as planned in step 1.
# (Defining the lists inside the loop would reset them on every iteration.)
title = []
href = []
for item in contents:
    if chapter_re.search(item.text):
        title.append(item.text)
        href.append(item['href'])

# Download each chapter and append it to ZX.txt.
for chapter_title, chapter_href in zip(title, href):
    try:
        url2 = 'http://www.biquge.tw' + chapter_href
        request = urllib2.Request(url2, headers=headers)
        html2 = urllib2.urlopen(request).read()
        soup = BeautifulSoup(html2, 'html.parser', from_encoding='utf-8')
        content = soup.find("div", id="content")
        cont = str(content)  # UTF-8 byte string in Python 2
        # Strip the wrapper markup and turn <br/> into newlines.
        cont = re.sub(r'<script src="/js/chaptererror.js" type="text/javascript"></script>', '', cont)
        cont = re.sub(r'</div>', '', cont)
        cont = re.sub(r'<div id="content">', '', cont)
        cont = re.sub(r'<br/>', '\n', cont)
        with open('ZX.txt', 'a') as f:
            f.write(chapter_title.encode('utf-8') + '\n' + cont + '\n')
        print "OK"
    except Exception as e:
        # Report the failure and carry on with the next chapter.
        print "NO", e
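The script above is Python 2. For anyone on Python 3, the two pieces that change most are the request (urllib2 became urllib.request) and the file write (open takes an encoding argument, so the manual .encode call goes away). Here is a minimal sketch of just those two pieces; the helper names build_request and append_chapter are made up, the sample chapter text is a placeholder, and nothing here touches the network.

```python
import os
import tempfile
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:49.0) "
              "Gecko/20100101 Firefox/49.0")


def build_request(url):
    """Build a request carrying the same User-Agent header as the script."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def append_chapter(path, title, body):
    """Append one chapter (title line, then body) to a UTF-8 text file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(title + '\n' + body + '\n')


# Building the request does not open a connection;
# that only happens when urllib.request.urlopen(req) is called.
req = build_request('http://www.biquge.tw/26_26491/')

# Write two placeholder chapters to a temporary file and read them back.
out_path = os.path.join(tempfile.mkdtemp(), 'ZX.txt')
append_chapter(out_path, '第一章', 'line one\nline two')
append_chapter(out_path, '第二章', 'line three')
with open(out_path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

The parsing part would carry over largely unchanged, since the BeautifulSoup calls used in this post (find, find_all) are the same on Python 3.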