Developing a Regular-Expression-Based Crawler in Python

YhMjQx

Published 2024-08-26 13:31:40

Category: Python安全开发基础 (Python Security Development Basics). Tags: python, regular expressions, crawler

Article link: https://blog.csdn.net/YhMjQx/article/details/141561796

A Regular-Expression-Based Crawler

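I. Introduction to Crawlers

1. Where crawlers are used: search engines such as Baidu and Google, internal enterprise knowledge bases, and project-specific or professional data collection.

2. The Internet consists of the surface web (content that can be viewed without authorization, which is the main target of search engines), the deep web (content that requires authorization to access), and the dark web (unofficial channels that cannot be reached by conventional means).

3. Crawlers collect publicly available information from the Internet, but they are still expected to follow one rule, the robots protocol (robots.txt). For example: https://www.baidu.com/robots.txt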
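As a rough illustration of honoring the robots protocol, the sketch below uses the standard-library `urllib.robotparser` to decide whether a URL may be fetched. The robots.txt content, the `my-crawler` user agent, and the `example.com` URLs are made up so that the example runs offline; they are assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def build_parser(robots_text):
    """Create a RobotFileParser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# 'my-crawler' is an assumed user-agent name for this sketch.
home_allowed = parser.can_fetch('my-crawler', 'http://example.com/index.html')
admin_allowed = parser.can_fetch('my-crawler', 'http://example.com/admin/users')
print(home_allowed, admin_allowed)
```

Against a live site, the same parser can load the file itself with `parser.set_url('https://www.baidu.com/robots.txt')` followed by `parser.read()`, and the crawler can then call `can_fetch` before each request.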

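II. Basic Principles

1. Every web page is HTML, and an HTML document is essentially one large string, so a response can be parsed with ordinary string processing (regular expressions being the most effective tool). HTML is also a markup language that shares its roots with XML, so its content can be processed through the DOM as well.

2. Every crawler is built on hyperlinks, which are what let it jump between pages and between sites. Give me one website and I can crawl the whole world.

3. To crawl an entire site, first collect all the URLs within the site and remove duplicates, then crawl the content and save it locally or in a database for later use.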
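The three points above can be sketched as a small breadth-first crawl: extract links with a regular expression, resolve them to absolute URLs, deduplicate with a set, and stay within one host. This is a minimal sketch under stated assumptions, not a complete crawler. The names `crawl_site`, `fetch`, and `FAKE_SITE` are hypothetical, and the "site" is an in-memory dictionary so the example runs without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Matches the href value of <a> tags, wherever href appears among the attributes.
LINK_PATTERN = re.compile(r'<a\s[^>]*?href="(.+?)"')


def crawl_site(start_url, fetch, max_pages=100):
    """Breadth-first crawl limited to the host of start_url.

    fetch is any callable that maps a URL to its HTML text, or None on failure.
    Returns a dict of {url: html} for the pages that were fetched.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}           # every URL ever queued, used for deduplication
    queue = deque([start_url])   # URLs waiting to be fetched
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for href in LINK_PATTERN.findall(html):
            # Resolve relative links and drop the #fragment part.
            link = urljoin(url, href).split('#')[0]
            if urlparse(link).netloc != host:
                continue         # skip external sites, mailto:, javascript:, etc.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_SITE = {
    'http://example.com/': '<a href="/a">A</a> <a href="/b">B</a> <a href="#top">Top</a>',
    'http://example.com/a': '<a href="/">Home</a> <a href="/b">B</a>',
    'http://example.com/b': '<a href="http://other.example.org/">External</a>',
}

pages = crawl_site('http://example.com/', FAKE_SITE.get)
print(sorted(pages))
```

For a real site, `fetch` could be a small wrapper around `requests.get(url, timeout=10)` that returns `resp.text` and returns `None` when the request fails. The `max_pages` limit is there so an unexpectedly large site does not keep the loop running indefinitely.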

III. Implementing a Crawler with Regular Expressions

D:\Programmingtools\Python\Program\network\regularexpressionspider.py

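import os
import re
import time

import requests

BASE_URL = 'http://woniunote.com'


def spide_page():
    # Fetch the home page and collect the href of every <a href="..."> tag.
    resp = requests.get(BASE_URL + '/', timeout=10)
    resp.encoding = 'utf-8'
    page_links = re.findall(r'<a href="(.+?)"', resp.text)
    os.makedirs('./woniunote/pages', exist_ok=True)
    for link in page_links:
        # Skip stylesheets, article links, and in-page anchors.
        if 'css' in link or 'articleid' in link or link.startswith('#'):
            continue
        # Turn site-relative paths into absolute URLs.
        if link.startswith('/'):
            link = BASE_URL + link
        print(link)
        try:
            # Download the linked page itself so each file holds that page's content.
            page = requests.get(link, timeout=10)
        except requests.RequestException as e:
            print(f'Failed to fetch {link}: {e}')
            continue
        page.encoding = 'utf-8'
        # Build a filesystem-safe name from the last path segment plus a timestamp.
        name = re.sub(r'[^\w.-]', '_', link.split('/')[-1]) or 'index'
        filename = name + time.strftime('%Y%m%d_%H%M%S') + '.html'
        with open(f'./woniunote/pages/{filename}', mode='w', encoding='utf-8') as file:
            file.write(page.text)


def spide_img():
    # Fetch the home page and collect the src of every <img src="..."> tag.
    resp = requests.get(BASE_URL + '/', timeout=10)
    resp.encoding = 'utf-8'
    img_links = re.findall(r'<img src="(.+?)"', resp.text)
    os.makedirs('./woniunote/images', exist_ok=True)
    for link in img_links:
        if link.startswith('/'):
            link = BASE_URL + link
        try:
            # Images are binary data, so use .content rather than .text.
            content = requests.get(link, timeout=10).content
        except requests.RequestException as e:
            print(f'Failed to fetch {link}: {e}')
            continue
        filename = re.sub(r'[^\w.-]', '_', link.split('/')[-1]) or 'image'
        with open(f'./woniunote/images/{filename}', mode='wb') as file:
            file.write(content)


if __name__ == '__main__':
    spide_page()
    # spide_img()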
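
The pattern `<a href="(.+?)"` only matches when `href` is the first attribute of the tag. Because HTML is a markup language, a parser that understands tags is one possible alternative. The sketch below compares the two on a made-up HTML snippet. It uses the standard-library `html.parser.HTMLParser`, which is an event-based parser rather than a full DOM, as a stand-in. `LinkCollector` and `SAMPLE_HTML` are hypothetical names introduced for illustration.

```python
import re
from html.parser import HTMLParser

# Made-up HTML for illustration; the second link has an attribute before href.
SAMPLE_HTML = '''
<html><body>
<a href="/about">About</a>
<a class="nav" href="http://example.com/news">News</a>
<img src="/static/logo.png">
</body></html>
'''

# Approach 1: treat the HTML as a plain string and apply the regular expression.
regex_links = re.findall(r'<a href="(.+?)"', SAMPLE_HTML)


# Approach 2: let a parser walk the tags and read the href attribute directly.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
parser_links = collector.links

print(regex_links)   # only the link whose tag starts with <a href=
print(parser_links)  # both links, regardless of attribute order
```

The regular expression is shorter and is enough when the target site's markup is known and regular. The parser-based version is more tolerant of attribute order and spacing, at the cost of a little more code.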