Web Crawler Study Notes

热爱代码的小方

Categories: python, computer networks. Tags: python, html, crawler

Article link: https://blog.csdn.net/RaynorFTD/article/details/116486450

Contents

What is a web crawler
Writing a simple crawler

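What is a web crawler

Simply put, it is a script that fetches information from the web.

A web crawler (also called a web spider or web robot, and more often a web chaser within the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.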

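Writing a simple crawler

As an example, we will fetch POC (proof-of-concept) information from Exploit Database (hereafter EDB).

EDB link: https://www.exploit-db.com/

First, we need to understand what the information we want to scrape looks like.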

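Model design

Each entry in EDB contains the following fields:

No.	Tag	Content
1	title	Title of the POC
2	EDB-ID	ID of the POC in EDB; this ID is EDB's primary key
3	CVE	CVE-ID of the POC
4	Author	Author of the entry
5	Type	Type of the POC
6	Platform	Platform the POC targets
7	Date	Publication date of the POC
8	EDB Verified	Probably EDB's own verification flag; its exact meaning is left for readers to explore
9	ExploitURL	Download link for the POC
10	Vulnerable app	The vulnerable application the POC targets
11	Code	Script code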

Based on the table above, create a POC class (it is simple, so the code is omitted here).
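Since the class itself is omitted, here is a minimal sketch of what such a record could look like, assuming a Python dataclass with one field per table row. The class name `Poc`, the field names, and the field types are all hypothetical choices, not taken from the original code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Poc:
    """One EDB entry; field names are illustrative and follow the table rows."""
    edb_id: int                           # primary key in EDB
    title: str = ""
    cve: Optional[str] = None             # None when no CVE is listed
    author: str = ""
    type: str = ""
    platform: str = ""
    date: str = ""                        # publication date, kept as raw text
    edb_verified: bool = False
    exploit_url: str = ""
    vulnerable_app: Optional[str] = None  # None when no application is linked
    code: str = ""


# Only the key is required; everything else can be filled in as it is scraped
poc = Poc(edb_id=6, title="example title", author="example author")
```

Giving every field except the key a default lets a record be created as soon as the EDB-ID is known and completed field by field during parsing.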

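Fetching the data

On this site, the page for each entry is BaseURL + EDB_ID. For example, the page for entry 6 is

https://www.exploit-db.com/exploits/6

So iterating over all EDB_ID values visits every POC entry. Here we define the URL base: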

# URL base for spider
urlBase = 'https://www.exploit-db.com/exploits/'
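The iteration over IDs could be sketched as a small generator. The helper name `exploit_urls` and the ID range used below are assumptions for illustration; only the base URL comes from the article.

```python
from typing import Iterator

URL_BASE = "https://www.exploit-db.com/exploits/"


def exploit_urls(first_id: int, last_id: int) -> Iterator[str]:
    """Yield the page URL for every EDB-ID from first_id to last_id inclusive."""
    for edb_id in range(first_id, last_id + 1):
        yield f"{URL_BASE}{edb_id}"


urls = list(exploit_urls(6, 8))
```

A generator keeps memory use flat even when the ID range is large, because each URL is built only when the crawler asks for it.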

Next we need to pretend to be a browser, which means sending a User-Agent header. For that, define a list of user agents:

# User agent list for headers to do spider
user_agent_list = [
    # FireFox user agent
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0",
    # Chrome user agent
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 "
    "Safari/537.36",
]

Next, fetch the page's HTML:

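import random
import socket
from urllib import error, request


# Fetch spider_url and return its HTML as a string,
# or a string starting with "error," if the request fails
def spider(spider_url: str):
    # Build the request with a randomly chosen User-Agent header
    user_agent = random.choice(user_agent_list)
    print("user agent chosen: ", user_agent)
    spider_request = request.Request(spider_url)
    spider_request.add_header('User-agent', user_agent)
    try:
        spider_response = request.urlopen(spider_request, timeout=50)
    except error.URLError:
        # Also covers HTTP errors such as 404 (HTTPError subclasses URLError)
        return "error, URLError"
    except socket.timeout:
        # A timeout while waiting for the response is not always wrapped in URLError
        return 'error, socket.timeout'

    # noinspection PyBroadException
    try:
        rcv_html = spider_response.read().decode('utf-8')
    except socket.timeout:
        return 'error, socket.timeout'
    except UnicodeDecodeError:
        return 'error, UnicodeDecodeError'
    except Exception as e:
        return 'error, Exception: %s' % e

    return rcv_html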
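Because spider() reports failures as strings beginning with "error,", a driver loop has to check for that prefix before treating the result as HTML. The following is a rough sketch of such a loop; the names `crawl` and `fake_fetch` and the `delay` parameter are hypothetical, and the fetcher is passed in as an argument so the sketch runs without network access.

```python
import time
from typing import Callable, Dict, Iterable


def crawl(urls: Iterable[str], fetch: Callable[[str], str], delay: float = 1.0) -> Dict[str, str]:
    """Fetch each URL and keep only the pages that did not come back as an error string."""
    pages = {}
    for url in urls:
        html = fetch(url)
        # The fetcher signals failure with a string that starts with "error,"
        if not html.startswith("error,"):
            pages[url] = html
        time.sleep(delay)  # pause between requests to avoid hammering the site
    return pages


# Stand-in fetcher for demonstration: pretends that entry 7 fails
def fake_fetch(url: str) -> str:
    return "error, URLError" if url.endswith("/7") else "<html><title>ok</title></html>"


pages = crawl(
    ["https://www.exploit-db.com/exploits/6", "https://www.exploit-db.com/exploits/7"],
    fake_fetch,
    delay=0,
)
```

In real use, the spider function would be passed as `fetch` and a non-zero `delay` kept in place.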

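Key point: the scraping process

Press F12 to open the browser's developer tools and view the page source. As you move the mouse over the source, the corresponding content block is highlighted in the page on the left.
As an example, let's find the CVE and Author entries in the HTML we just fetched.

First, locate the corresponding markup in the page.
The page has three cards, and everything we are looking for is in the first one.

Expanding it shows several elements with the class col-sm-12 col-md-6 col-lg-3 d-flex align-items-stretch, so we simply use a for loop to find all of them: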

for card1 in soup.find_all('div', class_="col-sm-12 col-md-6 col-lg-3 d-flex align-items-stretch"):

Expanding these shows that CVE and EDB-ID sit in two col-6 text-center blocks.

Finally, add a check on the label to extract the information we need:

from bs4 import BeautifulSoup


# Parse the HTML with Beautiful Soup and print the fields we need
def bs4html(input_html):
    soup = BeautifulSoup(input_html, 'html.parser')
    # Check for a 404 page. Sometimes the title is empty even though the page
    # has content; in that case, running it again is enough.
    if soup.title:
        for card1 in soup.find_all('div', class_="col-sm-12 col-md-6 col-lg-3 d-flex align-items-stretch"):
            for div2 in card1.find_all('div', class_='col-6 text-center'):
                # Skip blocks that lack a label or a value
                if div2.h4 is None or div2.h6 is None:
                    continue
                label = div2.h4.get_text().strip()
                if label in ('Author:', 'CVE:'):
                    print(div2.h6.get_text().strip())
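To act on the values rather than print them, it helps to collect them into a dictionary. The sketch below does that with the standard library's `html.parser` instead of Beautiful Soup, so it is a dependency-free alternative and not the approach used above. The class name `FieldParser` is hypothetical, and the sample markup is hand-written to imitate the label/value layout described in this section, not copied from the site.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Pair each <h4> label with the text of the <h6> that follows it."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._tag = None    # tag whose text is currently being collected
        self._label = None  # most recent <h4> label, e.g. "CVE"
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h4", "h6"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing tag of a nested or unrelated element
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if tag == "h4":
            self._label = text.rstrip(":")
        elif self._label:
            self.fields[self._label] = text
            self._label = None
        self._tag = None


# Hand-written sample markup with placeholder values
sample = """
<div class="col-6 text-center"><h4>CVE:</h4>
  <h6><a href="#">2006-1234</a></h6></div>
<div class="col-6 text-center"><h4>Author:</h4>
  <h6><a href="#">example-author</a></h6></div>
"""
parser = FieldParser()
parser.feed(sample)
fields = parser.fields
```

This version pairs every h4 with the next h6 regardless of the enclosing div, so it picks up all labeled fields in one pass instead of only Author and CVE.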

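Extracting the information really is that simple. From here you can do whatever you need with it, such as storing it in a database or downloading the exploit files.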
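As one possible way to store the results, here is a minimal sketch using the standard library's `sqlite3` module. The table name `poc`, its columns, and the helper `save_entry` are assumptions for illustration, and the inserted values are placeholders.

```python
import sqlite3


def save_entry(conn: sqlite3.Connection, edb_id: int, author: str, cve: str) -> None:
    """Insert or update one scraped entry, keyed by EDB-ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS poc ("
        "edb_id INTEGER PRIMARY KEY, author TEXT, cve TEXT)"
    )
    # Parameter placeholders keep scraped text from being interpreted as SQL
    conn.execute(
        "INSERT OR REPLACE INTO poc (edb_id, author, cve) VALUES (?, ?, ?)",
        (edb_id, author, cve),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
save_entry(conn, 6, "example-author", "N/A")
save_entry(conn, 6, "example-author", "2006-1234")  # same ID: the row is replaced
rows = conn.execute("SELECT edb_id, author, cve FROM poc").fetchall()
```

Using the EDB-ID as the primary key means that re-crawling an entry overwrites its row instead of creating a duplicate.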
