0x00 Background
While working through challenges on Shiyanbar (实验吧), I realized my skill level was still low and wanted to start with the easy ones. But Shiyanbar has no index of its challenges: the only way to see a challenge's difficulty and pass rate is to open each one in turn. Since I have been learning Python lately, I decided to write a Python crawler to scrape Shiyanbar's challenge list.
0x001 Environment
Python 3
Burp Suite
0x002 Key Points
Fetching pages in Python 3: mainly the requests package
Filtering page content: zero-width assertions (lookarounds) in regular expressions, via the re package (a quick demo follows this list)
Handling JSON data: the json package
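Since zero-width assertions do the heavy lifting later on, here is a minimal demo; the HTML snippet is made up, but it mirrors the structure scraped in 0x012. The lookbehind (?<=...) and lookahead (?=...) match positions rather than characters, so the delimiters stay out of the result:
import re

html = '<div class="de_mle_par"><li>Score: <span>10</span></li></div>'   # made-up sample input
inner = re.findall(r'(?<=<div class="de_mle_par">)(.*?)(?=</div>)', html, re.S)
print(inner)   # ['<li>Score: <span>10</span></li>'], the delimiters themselves are not captured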
0x01 Scraping the Site
0x011 Scraping the challenge index
First, take a look at how the Shiyanbar pages are structured.
Clicking a challenge category switches the challenge list while the URL in the address bar stays the same, which means the content is loaded dynamically. Capture the request with Burp Suite to see what is actually being sent.
The request goes to /ctf/exam-ctf-list, and the response is a chunk of JSON; after pretty-printing it and decoding the Unicode escapes, the content becomes readable.
Title and ExamCTFID are the fields we want: the challenge name and the challenge ID (the ID can be appended to a base URL to build each challenge's own URL).
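To reproduce that pretty-printing and Unicode decoding in Python rather than by hand, a minimal sketch (the raw string below is illustrative, not a real response):
import json

raw = '{"paginator": [{"Title": "\\u9898\\u76ee", "ExamCTFID": "1"}], "paginator_next": 2}'   # illustrative only
data = json.loads(raw)                                  # json.loads() decodes the \uXXXX escapes
print(json.dumps(data, ensure_ascii=False, indent=2))   # pretty-print with the Chinese readable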
Next, look at the data carried by the POST request: species_id=3&page=1
The exact meaning of species_id can be worked out from the page source; page, as the name suggests, is the page number.
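As a quick, purely illustrative check: a dict passed as data to requests gets form-encoded into exactly this body.
from urllib.parse import urlencode

print(urlencode({'species_id': '3', 'page': '1'}))   # prints: species_id=3&page=1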
Now, on to the code.
def getTis(url, headers, data, proxies, dictTis):
    response = requests.post(url=url, headers=headers, data=data, proxies=proxies)   # build and send the POST request
    response.encoding = "UTF-8"                  # set the response encoding explicitly
    result = json.loads(response.text)           # parse the JSON body once
    for paginator in result['paginator']:        # walk this page's challenge list
        dictTis[paginator['Title']] = paginator['ExamCTFID']   # store Title -> ExamCTFID
    nextPage = result['paginator_next']          # next page number; false-y if there is no next page
    print('Challenges on this page:', len(result['paginator']), 'next page:', nextPage)
    return nextPage
Then build the matching URL, headers, data, and proxies (a proxy is configured to make debugging in Burp Suite easier), plus the dict that will hold the results.
urlPost = 'http://www.shiyanbar.com/ctf/exam-ctf-list'
headersPost = {
    'Host': 'www.shiyanbar.com',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}   # no need to hard-code Content-Length; requests computes it
data = {'species_id': '3', 'page': '1'}
proxies = {'http': 'http://127.0.0.1:8080', 'https': 'https://127.0.0.1:8080'}
Write a loop to crawl every challenge category and every page:
for i in range(1, 8):                            # species_id 1..7: one pass per challenge category
    dictTis = {}
    data = {'species_id': i, 'page': '1'}
    nextPage = getTis(url=urlPost, headers=headersPost, data=data, proxies=proxies, dictTis=dictTis)
    while nextPage:                              # keep going until paginator_next is false-y
        data['page'] = nextPage
        nextPage = getTis(url=urlPost, headers=headersPost, data=data, proxies=proxies, dictTis=dictTis)
    print('Total challenges:', len(dictTis))
That gives us the name and ID of every challenge. The next step is to fetch each challenge's own page in turn.
0x012 Scraping the challenges
Viewing a challenge requires being logged in, so log in from a browser first and grab your own cookie with Burp Suite (you can also see it through the browser's F12 developer tools).
Carry the cookie in the headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Cookie': '**********************************************',
    'DNT': '1',
    'Host': 'www.shiyanbar.com'}
proxies = {'http': 'http://127.0.0.1:8080', 'https': 'https://127.0.0.1:8080'}
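As a side note, instead of pasting the raw Cookie header you can let requests manage the cookie through its cookies parameter. A sketch with placeholder values (the cookie names are taken from the captured cookie; the values are not):
import requests

# placeholder values; substitute the ones from your own session,
# and drop the 'Cookie' entry from headers if you go this route
cookies = {'gid': 'g_xxxxxxxx', 'lid': 'l_xxxxxxxx'}
response = requests.get('http://www.shiyanbar.com/ctf/', cookies=cookies, proxies=proxies)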
def getTi(url, headers, proxies):
    response = requests.get(url=url, headers=headers, proxies=proxies)
    response.encoding = "UTF-8"
    # cut out the <div class="de_mle_par"> block holding the challenge attributes;
    # the lookarounds keep the delimiters out of the match
    matchObj = re.findall(r'(?<=<div class="de_mle_par">)(.*?)(?=</div>)', response.text, re.MULTILINE | re.S)
    dictret = {}
    if matchObj:
        res = re.findall(r'<li>(.*?)</li>', matchObj[0])   # one <li> per "label: <span>value</span>" pair
        for s in res:
            dictret[re.findall(r'(.*):', s)[0]] = re.findall(r'<span>(.*?)</span>', s)[0]   # label -> value
    return dictret
The returned dict now holds the challenge's name and other attributes.
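A hypothetical call (the challenge ID here is made up; real IDs come from the dictTis built in 0x011):
attrs = getTi(url='http://www.shiyanbar.com/ctf/1234', headers=headers, proxies=proxies)   # 1234 is a placeholder ID
print(attrs)   # dict of label -> value pairs scraped from that challenge's page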
At this point all 234 challenges have been pulled down.
Tidy the results up in Excel.
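If you would rather skip the manual Excel step, a small sketch using the standard csv module writes a file Excel opens cleanly (the file name and column layout are my own choices, not from the original script):
import csv

with open('tis.csv', 'w', newline='', encoding='utf-8-sig') as f:   # utf-8-sig so Excel detects UTF-8
    writer = csv.writer(f)
    writer.writerow(['title', 'id', 'url'])
    for title, ctfid in dictTis.items():                             # dictTis from the crawl above
        writer.writerow([title, ctfid, 'http://www.shiyanbar.com/ctf/' + ctfid])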
Done.
0x02 Full Code
import requests
import re
import json
urlPost = 'http://www.shiyanbar.com/ctf/exam-ctf-list'
urlGet = 'http://www.shiyanbar.com/ctf/'
headersPost = {
    'Host': 'www.shiyanbar.com',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
headersGet = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Cookie': 'Hm_lvt_34d6f7353ab0915a4c582e4516dffbc3=1567389334,1567391092,1567473701,1567606454; gid=g_0GgcwIBnP8oWnaju; lid=l_DuEWoQrQ4uxq1c1I; Hm_cv_34d6f7353ab0915a4c582e4516dffbc3=1*visitor*151549%2CnickName%3Adragon3385; Hm_lpvt_34d6f7353ab0915a4c582e4516dffbc3=1567606482',
    'DNT': '1',
    'Host': 'www.shiyanbar.com'}
proxies = {'http': 'http://127.0.0.1:8080', 'https': 'https://127.0.0.1:8080'}   # route through Burp Suite for debugging
def getTi(url, headers, proxies):
    response = requests.get(url=url, headers=headers, proxies=proxies)
    response.encoding = "UTF-8"
    # cut out the <div class="de_mle_par"> block holding the challenge attributes;
    # the lookarounds keep the delimiters out of the match
    matchObj = re.findall(r'(?<=<div class="de_mle_par">)(.*?)(?=</div>)', response.text, re.MULTILINE | re.S)
    dictret = {}
    if matchObj:
        res = re.findall(r'<li>(.*?)</li>', matchObj[0])   # one <li> per "label: <span>value</span>" pair
        for s in res:
            dictret[re.findall(r'(.*):', s)[0]] = re.findall(r'<span>(.*?)</span>', s)[0]   # label -> value
    return dictret
def getTis(url, headers, data, proxies, dictTis):
    response = requests.post(url=url, headers=headers, data=data, proxies=proxies)   # build and send the POST request
    response.encoding = "UTF-8"                  # set the response encoding explicitly
    result = json.loads(response.text)           # parse the JSON body once
    for paginator in result['paginator']:        # walk this page's challenge list
        dictTis[paginator['Title']] = paginator['ExamCTFID']   # store Title -> ExamCTFID
    nextPage = result['paginator_next']          # next page number; false-y if there is no next page
    print('Challenges on this page:', len(result['paginator']), 'next page:', nextPage)
    return nextPage
listTiTypes = []
with open('tis.txt', 'w', encoding="UTF-8") as fileTis:
    for i in range(1, 8):                        # species_id 1..7: one pass per challenge category
        dictTis = {}
        data = {'species_id': i, 'page': '1'}
        nextPage = getTis(url=urlPost, headers=headersPost, data=data, proxies=proxies, dictTis=dictTis)
        while nextPage:                          # keep going until paginator_next is false-y
            data['page'] = nextPage
            nextPage = getTis(url=urlPost, headers=headersPost, data=data, proxies=proxies, dictTis=dictTis)
        print('Total challenges:', len(dictTis))
        for ti in dictTis:
            url = urlGet + dictTis[ti]           # build the challenge URL from its ID
            dictatts = getTi(url=url, headers=headersGet, proxies=proxies)
            strwrite = str(i) + ',' + ti + ','
            for att in dictatts:
                strwrite += dictatts[att] + ','
            strwrite += url + '\n'
            print(strwrite)
            fileTis.write(strwrite)
        listTiTypes.append(dictTis)