Targeted Crawler (1): My First Single-Threaded Crawler

Latest recommended article published on 2021-11-20 19:30:09

WhareSong


Category: Targeted Crawler. Tags: python, regular expressions, html

Article link: https://blog.csdn.net/qq_44037783/article/details/107919753


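After several days of studying web crawlers, things finally started to make sense, so I wrote my first single-threaded crawler. It has plenty of problems, but the basic functionality works.

Here is a brief record so I can look back on it later.
The target is 《龙族五》 (Dragon Raja V) on 努努书坊 (kanunu8). The original page is: https://www.kanunu8.com/book2/10943/
First, a rough outline of the structure:
1) Fetch the page source with requests
2) Extract the content with regular expressions
3) File operations: write the content to files and save them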
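Step 1 relies on decoding the raw response bytes as GBK, which is the encoding the code in this post uses. Below is a minimal offline sketch of that decoding step. The `decode_page` helper and the sample bytes are illustrative assumptions, not part of the original program; the bytes stand in for what `requests.get(url).content` would return.

```python
# Stand-in for `requests.get(url).content`: bytes as a GBK-encoded page would send them.
raw = '<dt>正文</dt>'.encode('gbk')


def decode_page(raw_bytes, encoding='gbk'):
    """Decode raw page bytes (hypothetical helper).

    Bytes that cannot be decoded are replaced instead of raising an exception.
    """
    return raw_bytes.decode(encoding, errors='replace')


html = decode_page(raw)

# The same bytes are not valid UTF-8, which is why the encoding must be stated explicitly.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(html, utf8_ok)
```

If the wrong encoding is used, the regular expressions in step 2 that contain Chinese text, such as 正文, would have nothing to match.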
Inspecting the page elements, it is easy to spot the distinctive part:

.........
<dl><dt>正文</dt><dd><a href="194884.html">第1章 全民公敌(1)</a></dd><dd><a 。。。第152章 但为君故(56)</a></dd><dd><a href="197941.html">第153章 但为君故(57)</a></dd>
<div class="clear"></div>
</dl>

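This block is the table of contents on the book's main page, and the link to each chapter can be extracted from it. The extracted links are relative addresses, so they have to be converted to absolute addresses before they are used.
For example, in a href="197940.html">第152章 但为君故(56), 197940.html is the address we want.
That is the idea behind the first function: extract the address of every chapter.
The code is as follows: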
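def get_toc(html):
    """
    Extract the chapter links from the table of contents.

    :param html: source of the book's main page
    :return: list of absolute URLs, one per chapter
    """

    toc_url_list = []
    # The table of contents sits between '正文</dt>' and '<div class="clear"'
    toc_block = re.findall('正文</dt>(.*?)<div class="clear"', html, re.S)[0]
    toc_l = re.findall('a href="(.*?)">第', toc_block, re.S)
    for url in toc_l:
        # html_url is the main page URL; appending the relative address gives the absolute one
        toc_url_list.append(html_url + url)

    return toc_url_list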
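The relative-to-absolute conversion can also be tried offline. The sketch below runs the same two regular expressions against a shortened copy of the table-of-contents markup shown earlier, and uses the standard library's `urllib.parse.urljoin` in place of string concatenation. The names `extract_chapter_urls`, `SAMPLE_TOC` and `BASE_URL` are illustrative assumptions, not part of the original program.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://www.kanunu8.com/book2/10943/'

# Shortened copy of the TOC markup quoted in this post (only two chapters kept).
SAMPLE_TOC = (
    '<dl><dt>正文</dt>'
    '<dd><a href="194884.html">第1章 全民公敌(1)</a></dd>'
    '<dd><a href="197941.html">第153章 但为君故(57)</a></dd>\n'
    '<div class="clear"></div>\n'
    '</dl>'
)


def extract_chapter_urls(html, base_url):
    """Return absolute chapter URLs, or an empty list if no TOC block is found."""
    block = re.search(r'正文</dt>(.*?)<div class="clear"', html, re.S)
    if block is None:
        return []
    hrefs = re.findall(r'a href="(.*?)">第', block.group(1))
    # urljoin resolves each relative href against the main page URL.
    return [urljoin(base_url, href) for href in hrefs]


urls = extract_chapter_urls(SAMPLE_TOC, BASE_URL)
for u in urls:
    print(u)
```

Compared with `html_url + url`, `urljoin` also copes with a base URL that lacks the trailing slash or an href that is already absolute, which may matter if the same approach is reused on another site.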

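Next comes the second part: visit each chapter page in turn, fetch its source, extract the chapter title, and then extract the chapter text.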

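def get_text(toc_list):
    """
    Request each chapter page in turn and collect its title and text.

    :param toc_list: absolute URL of each chapter
    :return: heard_list (chapter titles), text_list (chapter texts)
    """
    heard_list = []
    text_list = []
    for toc in toc_list:
        html_text = requests.get(toc).content.decode("gbk")
        # The chapter title sits between 'h1>' and the first '<br>'
        heard = re.search('h1>(.*?)<br>', html_text, re.S).group(1)
        heard_list.append(heard)
        # The chapter body is the content of the first <p>...</p>
        text = re.search('<p>(.*?)</p>', html_text, re.S).group(1)
        # text_temp_1 = text.replace('<p>', '')
        # text_temp_2 = text_temp_1.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '')
        text_temp1 = re.findall('(<p>.*?&nbsp;&nbsp;&nbsp;&nbsp;)', text, re.S)
        # Remove the second matched prefix; leave the text unchanged if there is no such match
        text_temp_2 = text.replace(text_temp1[1], '') if len(text_temp1) > 1 else text
        text_list.append(text_temp_2)
    return heard_list, text_list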
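The cleanup of the chapter text can be sketched separately from the network request. The example below is one possible approach, not the method used above: it turns `<br>` tags into newlines, drops any remaining tags, and converts `&nbsp;` entities with the standard library's `html.unescape`. The `clean_chapter` helper and the sample fragment are assumptions; the real chapter markup may differ.

```python
import re
from html import unescape


def clean_chapter(fragment):
    """Turn an HTML chapter fragment into plain text (hypothetical helper)."""
    text = re.sub(r'<br\s*/?>', '\n', fragment, flags=re.I)  # <br> or <br /> becomes a newline
    text = re.sub(r'<[^>]+>', '', text)                      # drop any remaining tags such as <p>
    text = unescape(text)                                    # &nbsp; becomes '\xa0'
    text = text.replace('\xa0', ' ')                         # normalise non-breaking spaces
    lines = [line.strip() for line in text.splitlines()]     # remove the indentation padding
    return '\n'.join(line for line in lines if line)         # drop empty lines


# Assumed sample: paragraphs indented with four &nbsp; and separated by <br />.
sample = (
    '<p>&nbsp;&nbsp;&nbsp;&nbsp;First paragraph.<br />\n'
    '<br />\n'
    '&nbsp;&nbsp;&nbsp;&nbsp;Second paragraph.</p>'
)
cleaned = clean_chapter(sample)
print(cleaned)
```

Because every `&nbsp;` is converted and stripped line by line, this avoids depending on the position of a particular match in the list returned by `re.findall`.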

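This part is also fairly easy to follow. The last part is the write operation; the os module is imported here to make it easier to build the output path when writing.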

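def write_text(heard, text_list):
    """
    Write each chapter to its own .txt file in the 龙族V directory.

    :param heard: list of chapter titles
    :param text_list: list of chapter texts
    :return: None
    """
    os.makedirs('龙族V', exist_ok=True)
    for i in range(len(heard)):
        chart = heard[i]
        text = text_list[i]
        # The chapter title is used as the file name
        with open(os.path.join('龙族V', chart + '.txt'), 'w', encoding='utf-8') as f:
            f.write(text)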
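Since the chapter title becomes the file name, a title containing characters such as `/` or `?` could make `open` fail on some systems. That is an assumption about possible titles, not something observed in this book. A minimal sketch of a safer write step follows; `safe_filename` and `write_chapters` are hypothetical names, and a temporary directory is used so the example leaves nothing behind.

```python
import os
import re
import tempfile


def safe_filename(title, replacement='_'):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip()
    return cleaned or 'untitled'


def write_chapters(titles, texts, out_dir):
    """Write each (title, text) pair to out_dir/<title>.txt and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for title, text in zip(titles, texts):
        path = os.path.join(out_dir, safe_filename(title) + '.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = write_chapters(['Chapter 1: What/Why?', 'Chapter 2'], ['one', 'two'], tmp)
    names = sorted(os.listdir(tmp))

print(names)
```

Using `zip` instead of indexing with `range(len(...))` also stops quietly at the shorter list if the titles and texts ever get out of step.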

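That is the whole program. One remaining problem is that the leading "&nbsp;" padding at the start of each chapter is not completely removed. Adding a GUI on top of this later should also be fairly easy.
The drawback is that a single thread runs slowly, using only about 30 KB/s of bandwidth. Tomorrow I will improve it with multithreading.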
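As a rough idea of what a multithreaded version could look like, here is a sketch based on the standard library's `concurrent.futures.ThreadPoolExecutor`. This is one possible approach and not necessarily the one the follow-up post uses. `fake_fetch` is a hypothetical stand-in for `requests.get(url).content.decode('gbk')` so that the example runs without network access, and `fetch_all` is an illustrative name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request; sleeps to simulate latency."""
    time.sleep(0.05)
    return 'page for ' + url


def fetch_all(urls, fetch, max_workers=8):
    """Fetch all URLs concurrently; results come back in the same order as urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Chapter file names taken from the examples in this post.
urls = [
    'https://www.kanunu8.com/book2/10943/%d.html' % n
    for n in (194884, 197940, 197941)
]
pages = fetch_all(urls, fake_fetch)
print(len(pages))
```

Threads suit this workload because most of the time is spent waiting on the network. `pool.map` keeps the input order, so chapter titles and texts would still line up when written to disk.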
The result is as follows:
[image]
The basic functionality works. The complete code is as follows:

import os
import re

import requests

html_url = 'https://www.kanunu8.com/book2/10943/'
html = requests.get(html_url).content.decode('GBK')


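def get_toc(html):
    """
    Extract the chapter links from the table of contents.

    :param html: source of the book's main page
    :return: list of absolute URLs, one per chapter
    """

    toc_url_list = []
    # The table of contents sits between '正文</dt>' and '<div class="clear"'
    toc_block = re.findall('正文</dt>(.*?)<div class="clear"', html, re.S)[0]
    toc_l = re.findall('a href="(.*?)">第', toc_block, re.S)
    for url in toc_l:
        # Appending the relative address to html_url gives the absolute address
        toc_url_list.append(html_url + url)

    return toc_url_list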


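def get_text(toc_list):
    """
    Request each chapter page in turn and collect its title and text.

    :param toc_list: absolute URL of each chapter
    :return: heard_list (chapter titles), text_list (chapter texts)
    """
    heard_list = []
    text_list = []
    for toc in toc_list:
        html_text = requests.get(toc).content.decode("gbk")
        # The chapter title sits between 'h1>' and the first '<br>'
        heard = re.search('h1>(.*?)<br>', html_text, re.S).group(1)
        heard_list.append(heard)
        # The chapter body is the content of the first <p>...</p>
        text = re.search('<p>(.*?)</p>', html_text, re.S).group(1)
        # text_temp_1 = text.replace('<p>', '')
        # text_temp_2 = text_temp_1.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '')
        text_temp1 = re.findall('(<p>.*?&nbsp;&nbsp;&nbsp;&nbsp;)', text, re.S)
        # Remove the second matched prefix; leave the text unchanged if there is no such match
        text_temp_2 = text.replace(text_temp1[1], '') if len(text_temp1) > 1 else text
        text_list.append(text_temp_2)
    return heard_list, text_list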


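def write_text(heard, text_list):
    """
    Write each chapter to its own .txt file in the 龙族V directory.

    :param heard: list of chapter titles
    :param text_list: list of chapter texts
    :return: None
    """
    os.makedirs('龙族V', exist_ok=True)
    for i in range(len(heard)):
        chart = heard[i]
        text = text_list[i]
        # The chapter title is used as the file name
        with open(os.path.join('龙族V', chart + '.txt'), 'w', encoding='utf-8') as f:
            f.write(text)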


if __name__ == '__main__':
    toc_list = get_toc(html)
    heard, text = get_text(toc_list)
    write_text(heard, text)
