Scraping the Android Vulnerability Bulletin (Android Security Bulletins) with the Python Crawling Framework Scrapy: A Basic Approach

蛐蛐蛐

Article link: https://blog.csdn.net/qysh123/article/details/106655644


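I previously wrote a blog post about using Scrapy: https://blog.csdn.net/qysh123/article/details/79802250

This post involves a few more tricks than that one, so here is a brief summary. For a project, I needed to collect the fix commit hash of every CVE listed at https://source.android.com/security/bulletin. The requirement is clear and simple, but it still took me some time.

First, note that the link to each month's bulletin has this form: https://source.android.com/security/bulletin/2020-04-01. The first step is therefore to filter out all links of this kind, using the same method as in the earlier post. A regular expression is needed to match the year, month, and day at the end, so we first define a function that checks this part of the link: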

######################################################################
import re

# Site-relative link to a monthly bulletin, e.g. /security/bulletin/2020-04-01
BULLETIN_PATTERN = re.compile(r'/security/bulletin/(\d{4}-\d{1,2}-\d{1,2})$')

def is_bulletin(test_string):
    """Return True if the href points to a monthly bulletin page."""
    return BULLETIN_PATTERN.match(test_string) is not None
######################################################################
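
As a quick check of this helper, the sketch below runs it against a few sample hrefs. Only the 2020-04-01 link comes from the bulletin site as described here; the other strings are made up for illustration. Because `re.match` anchors only at the start of the string, an absolute URL is rejected and only site-relative links pass.

```python
import re

BULLETIN_RE = re.compile(r'/security/bulletin/(\d{4}-\d{1,2}-\d{1,2})$')

def is_bulletin(href):
    """Return True for site-relative links to a monthly bulletin."""
    return BULLETIN_RE.match(href) is not None

samples = [
    '/security/bulletin/2020-04-01',        # monthly bulletin link
    '/security/bulletin/pixel/2020-04-01',  # hypothetical: extra path segment
    '/security/bulletin/2020',              # hypothetical: year only
    'https://source.android.com/security/bulletin/2020-04-01',  # absolute URL
]

results = {href: is_bulletin(href) for href in samples}
for href, accepted in results.items():
    print(accepted, href)
```

Only the first sample is accepted, which is why the spider prepends the site prefix itself before issuing the request.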

Then use XPath to collect every link on the page and keep the ones that match:

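        # Walk every link on the page and follow only the monthly bulletin links
        for each in response.xpath('//a/@href'):
            suburl=each.extract()
            if(is_bulletin(suburl)):
                time.sleep(0.1)  # short pause between requests
                yield scrapy.Request('https://source.android.com'+suburl, self.parse)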
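
The request URL here is built by string concatenation, which works because every matched href is site-relative. As a hedged alternative, the standard library's `urljoin` resolves a link against the page URL and leaves absolute links untouched; inside a spider, Scrapy's `response.urljoin` does the same using the response's own URL. The second href below is a made-up example.

```python
from urllib.parse import urljoin

page_url = 'https://source.android.com/security/bulletin'

# A site-relative href is resolved against the host of the page URL.
relative = urljoin(page_url, '/security/bulletin/2020-04-01')

# An href that is already absolute is returned unchanged.
absolute = urljoin(page_url, 'https://android.googlesource.com/platform/')

print(relative)
print(absolute)
```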

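These steps were also covered in the earlier post. Once we are on a page such as https://source.android.com/security/bulletin/2020-04-01, we can see that the link for each CVE already contains the commit hash. The page HTML looks roughly like this: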

<tr>
<td>CVE-2020-0023</td>
<td><a href="https://android.googlesource.com/platform/packages/apps/Bluetooth/+/0d8307f408f166862fbd6efb593c4d65906a46ae">A-145130871</a></td>
<td>ID</td>
<td>Critical</td>
<td>10</td>
</tr>
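
Since the goal is the commit hash itself, it can be cut out of the href once the link has been found. The sketch below is a minimal illustration, assuming every fix link ends in `/+/` followed by a hexadecimal commit ID as in the row above; `extract_commit_hash` is a hypothetical helper name, not part of the original spider.

```python
import re

# Assumption: a fix link ends in "/+/<hex commit id>", as in the sample row.
COMMIT_RE = re.compile(r'/\+/([0-9a-f]{7,40})$')

def extract_commit_hash(git_url):
    """Return the trailing commit hash, or None if the URL has another shape."""
    match = COMMIT_RE.search(git_url)
    return match.group(1) if match else None

url = ('https://android.googlesource.com/platform/packages/apps/'
       'Bluetooth/+/0d8307f408f166862fbd6efb593c4d65906a46ae')
commit = extract_commit_hash(url)
print(commit)
```

Links that do not follow this shape return `None`, so they can be logged and inspected by hand rather than silently mis-parsed.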

We can locate this href with a method similar to the one in the earlier post:

for each in response.xpath('//tr/td/a[starts-with(@href,"https://android.googlesource.com/")]/@href'):

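This is mostly self-explanatory. As described at https://blog.csdn.net/winterto1990/article/details/47903653, /@xxxx extracts the value of an attribute of the element at the current path. But how do we get to the CVE ID in the preceding td? That is the point I want to summarize today. According to the page linked above, .. (two dots) selects the parent of the current node, but the author gives no example, so it took me a few tries. First, use: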

for each in response.xpath('//tr/td/a[starts-with(@href,"https://android.googlesource.com/")]'):

to locate the <a> element shown above. From there we go back up to the <tr>, and then select each td in that tr in turn:

for each_content in each.xpath('../../td/text()'):
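
To see the two `..` steps in isolation, the sketch below reproduces the idea with the standard library's `xml.etree.ElementTree` instead of Scrapy selectors. This is a stand-in under stated assumptions: ElementTree supports only a subset of XPath (it has `..` but no `starts-with()` and no `text()`), so the prefix check and the text extraction are done in plain Python, and the sample row is wrapped in a `<table>` element to give the document a single root.

```python
import xml.etree.ElementTree as ET

HTML = """
<table>
<tr>
<td>CVE-2020-0023</td>
<td><a href="https://android.googlesource.com/platform/packages/apps/Bluetooth/+/0d8307f408f166862fbd6efb593c4d65906a46ae">A-145130871</a></td>
<td>ID</td>
<td>Critical</td>
<td>10</td>
</tr>
</table>
"""

PREFIX = 'https://android.googlesource.com/'
root = ET.fromstring(HTML.strip())

pairs = []
# From each <a> inside a <td>: the first ".." reaches the <td>,
# the second ".." reaches the enclosing <tr>.
for row in root.findall('.//td/a/../..'):
    hrefs = [a.get('href') for a in row.findall('td/a')
             if a.get('href', '').startswith(PREFIX)]
    # Keep only cells with direct text; the cell holding the <a> has none.
    cells = [td.text for td in row.findall('td')
             if td.text and td.text.strip()]
    pairs.append((row.tag, cells, hrefs))

for tag, cells, hrefs in pairs:
    print(tag, cells, hrefs)
```

The cell that wraps the link contributes no text of its own, which mirrors how `td/text()` in the Scrapy version skips it and returns only the CVE ID and the remaining columns.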

Another function is needed here to decide whether a cell holds a CVE ID. Note that the last part of a CVE ID may have four digits or five:

######################################################################
import re

# Four-digit year, then a sequence number of four or more digits
CVE_PATTERN = re.compile(r'CVE-\d{4}-\d{4,}')

def is_CVE(test_string):
    """Return True if the cell text starts with a CVE ID."""
    return CVE_PATTERN.match(test_string) is not None
######################################################################
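
The digit count interacts with how `re.match` works: it anchors only at the start of the string, so a pattern that asks for exactly four trailing digits still accepts a five-digit ID as a prefix match. The sketch below compares that behaviour with `re.fullmatch`; `LOOSE` and `STRICT` are illustrative names, and the sample strings are made up.

```python
import re

LOOSE = re.compile(r'CVE-(\d{4}-\d{4})')   # exactly four trailing digits
STRICT = re.compile(r'CVE-\d{4}-\d{4,}')   # four or more trailing digits

samples = ['CVE-2020-0023', 'CVE-2222-22222', 'CVE-2020-002', 'see CVE-2020-0023']

table = {}
for text in samples:
    table[text] = (
        bool(LOOSE.match(text)),       # prefix match
        bool(LOOSE.fullmatch(text)),   # whole string, four digits only
        bool(STRICT.fullmatch(text)),  # whole string, four or more digits
    )
    print(text, table[text])
```

With `fullmatch`, the digit count has to be spelled out explicitly; with `match`, trailing text after the ID is tolerated, which may or may not be what a given table cell needs.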

With this we can extract each CVE and its commit link. That is all for this short summary; the complete code follows:

import re
import time

import scrapy

######################################################################
# Site-relative link to a monthly bulletin, e.g. /security/bulletin/2020-04-01
BULLETIN_PATTERN = re.compile(r'/security/bulletin/(\d{4}-\d{1,2}-\d{1,2})$')

def is_bulletin(test_string):
    return BULLETIN_PATTERN.match(test_string) is not None
######################################################################
# CVE ID: four-digit year, then a sequence number of four or more digits
CVE_PATTERN = re.compile(r'CVE-\d{4}-\d{4,}')

def is_CVE(test_string):
    return CVE_PATTERN.match(test_string) is not None
######################################################################
# Sanity check: a five-digit sequence number is accepted too
test_string = "CVE-2222-22222"
print(is_CVE(test_string))

class VulSpider(scrapy.Spider):
    name = "Vul"
    allowed_domains = ["source.android.com"]
    start_urls = [
        'https://source.android.com/security/bulletin',
    ]

    def parse(self, response):
        # The same callback handles the index page and each monthly page.
        previous_cve = ""

        # Select the <a> elements in table cells whose href points at android.googlesource.com
        for each in response.xpath('//tr/td/a[starts-with(@href,"https://android.googlesource.com/")]'):
            git_url = each.xpath('./@href')[0].extract().replace('%2F', '/')
            other_content = []
            found_cve = False

            # ../.. climbs from <a> to <td> to <tr>; then read the text of every <td> in that row
            for each_content in each.xpath('../../td/text()'):
                content = each_content.extract()
                if is_CVE(content):
                    found_cve = True
                    previous_cve = content
                    print(content)
                    print(git_url)
                else:
                    other_content.append(content)

        # Follow the links to the monthly bulletins
        for each in response.xpath('//a/@href'):
            suburl = each.extract()
            if is_bulletin(suburl):
                time.sleep(0.1)  # short pause between requests
                yield scrapy.Request('https://source.android.com' + suburl, self.parse)

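There is some additional handling here because of quirks in the page, which I will not go into.

While experimenting I also consulted the examples in the following blog posts: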

https://blog.csdn.net/winterto1990/article/details/47903653

https://blog.csdn.net/weixin_41558061/article/details/80077423

https://blog.csdn.net/qq_40134903/article/details/80728094

Many thanks to all of them!
