Scraping a Novel from Biquge (笔趣阁) with Python

anmei3400

Tags: crawler testing, python

Original link: http://www.cnblogs.com/weew12/p/10583041.html

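This afternoon I opened my phone and happened to notice the Qidian (起点) novel app sitting neglected in a corner. It reminded me that I hadn't read a novel in a long time. I had been reading 《伏天氏》, the new work by 净无痕, and had topped up Qidian coins to read roughly the first two hundred chapters. The book has now passed 800 chapters, and I'm reluctant to spend more on coins.

Then I remembered that when I was teaching myself web scraping, I had practiced by scraping novels from Biquge, so...

Let's scrape 《伏天氏》 one more time.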

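Structure analysis:

1. Table of contents page:

https://www.qu.la/book/2125/

The entire chapter list sits inside a box (div) whose id is list, so an XPath expression can select that part directly. The chapter titles and URLs are then saved for later use:

Key statements:

    # XPath selection (needs: from selenium.webdriver.common.by import By)
    results = driver.find_elements(By.XPATH, "//div[contains(@id,'list')]//dl//dd//a")
    for result in results:
        res_url = result.get_attribute('href')  # URL
        res_tit = result.text   # chapter title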
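
To make it easier to see what this XPath picks out, here is a minimal sketch that runs a similar query without a browser, using only the standard library. The markup in `SAMPLE_HTML` is a hypothetical, simplified stand-in for the real page, the second chapter URL is invented for illustration, and `extract_chapters` is not part of the original script. `xml.etree.ElementTree` supports only a subset of XPath (no `contains()`), so the sketch matches the id exactly.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified stand-in for the catalogue page markup.
SAMPLE_HTML = """
<html><body>
  <div id="list">
    <dl>
      <dt>Chapter list</dt>
      <dd><a href="/book/2125/10580853.html">Chapter 1</a></dd>
      <dd><a href="/book/2125/10580854.html">Chapter 2</a></dd>
    </dl>
  </div>
  <div id="footer"><a href="/about.html">About</a></div>
</body></html>
"""


def extract_chapters(markup, base_url):
    """Return (title, absolute_url) pairs for links inside div#list."""
    root = ET.fromstring(markup.strip())
    # Only <a> elements under the div with id="list" are selected;
    # the footer link is ignored.
    links = root.findall(".//div[@id='list']/dl/dd/a")
    return [(a.text.strip(), urljoin(base_url, a.get("href"))) for a in links]


if __name__ == "__main__":
    for title, url in extract_chapters(SAMPLE_HTML, "https://www.qu.la/book/2125/"):
        print(title, url)
```

Real-world HTML is rarely well-formed XML, so this only illustrates the selection logic; the Selenium call above is what runs against the live page.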

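2. Chapter page

Picking one at random:

https://www.qu.la/book/2125/10580853.html

OK, the chapter text again sits inside a box whose id is content, so XPath works here too.

Key statements:

# needs: from selenium.webdriver.common.by import By
driver.get(url)
content = driver.find_element(By.XPATH, "//div[contains(@id,'content')]").text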

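3. Drive Chrome in headless mode through its browser driver, visit each chapter page listed in the catalogue, and write the text to a file. Done.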
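
Since step 3 turns chapter titles into file names, a title containing a character such as `?` or `:` would fail to save on Windows. A minimal sketch of one way to guard against that follows. The helper name `chapter_path` and the four-digit index prefix (so the files sort in reading order) are my own assumptions, not something the original script does.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in file names, plus control whitespace.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def chapter_path(out_dir, index, title):
    """Build an output path like <out_dir>/0003_<sanitized title>.txt."""
    safe_title = _INVALID_CHARS.sub("_", title).strip(" .") or "untitled"
    return Path(out_dir) / f"{index:04d}_{safe_title}.txt"


if __name__ == "__main__":
    print(chapter_path("novel", 3, "Chapter 3: What?"))
```

Replacing invalid characters with an underscore keeps the title readable while avoiding a failed `open()` call halfway through a long crawl.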

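The implementation follows (the formatting may be a bit ugly; I haven't built good habits yet and am improving slowly, so don't copy my style):

(Last time, when I was scraping something on Maoyan (猫眼), I didn't set any interval between requests. They came too frequently and my IP got restricted. Proxies are a hassle and I don't really know how to use them, so this time I play it safe and request the next page only once per second. The goal is to avoid paying, not to be fast.)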
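
As an aside before the full script, the pause between requests can be isolated into a small helper. This is only a sketch: `polite_iter` is a hypothetical name, and the random jitter added to the one-second base delay is my own addition rather than something the script does. The `sleep` parameter exists so the helper can be tried out without actually waiting.

```python
import random
import time


def polite_iter(items, base_delay=1.0, jitter=0.5, sleep=time.sleep):
    """Yield items one by one, pausing between consecutive items.

    The pause is base_delay plus a random amount in [0, jitter] seconds.
    No pause happens before the first item.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(base_delay + random.uniform(0, jitter))
        yield item


if __name__ == "__main__":
    # Hypothetical chapter URLs, used only to show the pacing.
    for url in polite_iter(["page-1", "page-2", "page-3"], base_delay=0.1, jitter=0.1):
        print("would request", url)
```

A download loop could then iterate over `polite_iter(catalogue.items())` instead of calling `time.sleep(1)` by hand.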

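# Scrape the novel 《伏天氏》 from Biquge
# -*- coding: utf-8 -*-
import os
import re
import time

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# URL of the catalogue (table of contents) page
root_url = 'https://www.qu.la/book/2125/'

# Output directory; the raw string keeps the Windows backslashes literal
output_dir = r'G:\python 资源\python project\小说爬取(伏天氏)'


# Get the chapter catalogue: chapter title -> chapter page URL
def get_catalogue():
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    # 'options=' replaces the deprecated 'chrome_options=' keyword
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(root_url)

    # Catalogue dictionary
    catalogue = {}

    # XPath selection (find_elements_by_xpath was removed in Selenium 4)
    results = driver.find_elements(By.XPATH, "//div[contains(@id,'list')]//dl//dd//a")
    for result in results:
        res_url = result.get_attribute('href')  # URL
        res_tit = result.text   # chapter title

        # Skip entries whose title contains '月票' (monthly ticket) or '通知' (notice)
        flag = False
        if not re.search('月票', res_tit) and not re.search('通知', res_tit):
            flag = True

        # Store in the dictionary; the keys are titles, so duplicates are checked by title
        if res_tit not in catalogue and flag:
            catalogue[res_tit] = res_url

    driver.quit()
    # Print the total number of chapters
    # print(len(catalogue))
    return catalogue


# Download and store the chapter text
def download(catalogue):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    count = 0   # counter

    for key, value in catalogue.items():
        count = count + 1
        title = key    # chapter title
        url = value      # chapter page
        print('title = ' + title + ', url = ' + url)
        try:
            driver.get(url)
            content = driver.find_element(By.XPATH, "//div[contains(@id,'content')]").text
            with open(os.path.join(output_dir, str(title) + '.txt'), 'wt', encoding='utf-8') as file:
                file.write(content)
            print('Chapter ' + str(count) + ' written successfully')
        except (IOError, WebDriverException):
            # IOError: the file could not be written
            # WebDriverException: the page load or element lookup failed
            print('Chapter ' + str(count) + ' failed to write')

        # Sleep for 1 s between requests
        time.sleep(1)
    driver.quit()


if __name__ == '__main__':
    catalogues = get_catalogue()
    download(catalogues)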
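
The filtering done inside `get_catalogue` can also be pulled out into a pure function, which makes it easy to check without launching a browser. The sketch below assumes the links have already been collected as `(title, url)` pairs. `build_catalogue` and `SKIP_PATTERN` are hypothetical names, and deduplicating by URL (rather than by title) is my own choice here.

```python
import re

# Titles containing these keywords are announcements rather than chapters:
# '月票' (monthly ticket) and '通知' (notice).
SKIP_PATTERN = re.compile("月票|通知")


def build_catalogue(links):
    """Build an ordered {title: url} dict from (title, url) pairs.

    Announcement entries are dropped, and a URL that has already been
    seen is skipped so repeated links do not produce duplicates.
    """
    catalogue = {}
    seen_urls = set()
    for title, url in links:
        if SKIP_PATTERN.search(title):
            continue
        if url in seen_urls:
            continue
        seen_urls.add(url)
        catalogue[title] = url
    return catalogue


if __name__ == "__main__":
    # Placeholder data for illustration only.
    sample = [("第一章", "u1"), ("求月票", "u2"), ("第一章", "u1"), ("第二章", "u3")]
    print(build_catalogue(sample))
```

Because Python dictionaries preserve insertion order, the resulting catalogue keeps the chapters in the order they appear on the page.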