glided_sky (GlidedSky, 镀金的天空) crawler challenges 1-2 and 4-5: solution approach and code

Latest recommended article published on 2023-01-12 14:39:16

四叶草茶艺师

Article link: https://blog.csdn.net/weixin_46011275/article/details/121862793

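I recently came across a website for practicing web scraping. After trying a few of its challenges I found that they cover a wide range of techniques, so I am writing down my solutions to share them.
http://glidedsky.com/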

Levels 1 and 2

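The two levels are similar. The difference is that in one the data sits on a single page, while in the other it is spread across 1000 pages.
Note, however, that requesting the target URL directly with the requests library only gets you a login prompt, so you first have to log in and carry your login state along.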

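Open the login page and try logging in. The browser sends two requests: a POST request, followed by a redirect that triggers another GET request to the login page. Start by checking which parameters the POST request carries.
The POST request carries three parameters: _token, password and email. The last two are submitted exactly as typed, without encryption. Searching for the first shows that it is a randomly generated string embedded in the page source.

The _token value can be extracted with the re regular expression library.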
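
The regex step can be tried offline first. The following is a minimal sketch, assuming the hidden input looks like the one matched by the pattern in this article; `sample_html`, the token value and the helper name `extract_token` are made up for illustration.

```python
import re

# Hypothetical fragment of the login page source; a real token differs on every load
sample_html = '''
<form method="POST" action="/login">
    <input type="hidden" name="_token" value="AbC123xyz">
    <input type="email" name="email">
</form>
'''

def extract_token(html):
    """Return the value of the hidden _token input, or None if it is missing."""
    match = re.search(r'input type="hidden" name="_token" value="(.*?)"', html)
    return match.group(1) if match else None

token = extract_token(sample_html)
print(token)
```

Returning None instead of calling group(1) unconditionally makes a changed page layout easier to diagnose than an AttributeError would.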

import re
import requests

datas = {
    "email": "@qq.com",  # your account
    "password": "",  # your password
    '_token': '',  # filled in by login()
}
heads = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}

def login():
    url = 'http://glidedsky.com/login'
    session = requests.Session()
    # Fetch the login page first: the hidden _token field is in its source
    req = session.get(url).text
    result = re.search('input type="hidden" name="_token" value="(.*?)"', req)
    token = result.group(1)
    datas['_token'] = token
    # Submit the form; the session keeps the cookies set by the server
    req = session.post(url=url, data=datas, headers=heads)
    print("post status_code:", req.status_code)
    return session

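The session returned by login() carries our login state, so we can request the target pages without being asked to log in again.
For the sums in levels 1 and 2, the target numbers can be found directly in the page source,
so it is enough to extract them with XPath and add them up.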
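
The extract-and-sum step can be checked without the network. This is only a sketch: it swaps lxml for the standard library's xml.etree.ElementTree (whose limited XPath support is enough here) so that it runs without extra packages, and `sample_page` is a made-up, well-formed fragment rather than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup imitating the number grid; the real page has many more cells
sample_page = """
<div class="row">
    <div class="col-md-1"> 12 </div>
    <div class="col-md-1"> 305 </div>
    <div class="col-md-1"> 7 </div>
    <div class="other"> 999 </div>
</div>
"""

def sum_numbers(markup):
    """Add up the text of every div whose class is exactly col-md-1."""
    root = ET.fromstring(markup.strip())
    total = 0
    for cell in root.findall(".//div[@class='col-md-1']"):
        total += int("".join(cell.itertext()).strip())
    return total

page_sum = sum_numbers(sample_page)
print(page_sum)
```

ElementTree only parses well-formed XML, so for real HTML pages a tolerant parser such as lxml's etree.HTML remains the practical choice.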

import time
from lxml import etree

def crawler_basic_1():  # level 1
    session = login()
    start_time = time.time()
    req = session.get('http://glidedsky.com/level/web/crawler-basic-1', headers=heads)
    tree = etree.HTML(req.text)
    num_list = tree.xpath('//div[@class="col-md-1"]')
    total = 0  # running total of all numbers on the page
    for num in num_list:
        a = "".join(num.xpath('.//text()')).strip()
        total += int(a)
    end_time = time.time()
    print('Took %ss' % (end_time - start_time))
    print(total)

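import queue
import threading
import time
from lxml import etree

threadsnum = 10  # number of worker threads (value assumed here; pick your own)
total = 0  # running total for this level, shared by all threads
total_lock = threading.Lock()  # protects total against concurrent updates

def crawler_basic_2():  # level 2, using multiple threads to speed things up
    start_time = time.time()
    session = login()
    thread_list = []
    q = queue.Queue()  # the queue hands page numbers out to the threads
    for i in range(1, 1001):
        q.put(i)
    for i in range(threadsnum):
        # target takes the function, args takes its arguments
        t = threading.Thread(target=crawler_basic_2_sums, args=(session, q))
        thread_list.append(t)
    for t in thread_list:
        t.start()
    for t in thread_list:
        t.join()
    end_time = time.time()
    print('Took %ss' % (end_time - start_time))
    print(total)

def crawler_basic_2_sums(session, q):
    global total
    while True:
        try:
            i = q.get_nowait()  # never blocks; raises queue.Empty once the queue is drained
        except queue.Empty:
            break
        url = 'http://glidedsky.com/level/web/crawler-basic-2?page={}'.format(i)
        print(url)
        req = session.get(url, headers=heads)
        tree = etree.HTML(req.text)
        num_list = tree.xpath('//div[@class="col-md-1"]')
        for num in num_list:
            a = "".join(num.xpath('.//text()')).strip()
            with total_lock:
                total += int(a)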
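
The queue-plus-threads pattern used for level 2 can be exercised without any network access. The sketch below assumes a stand-in `fake_page_sum` in place of the real request (each "page" simply contributes its own page number); the function names and the worker count are illustrative.

```python
import queue
import threading

def fake_page_sum(page):
    """Stand-in for fetching and summing one page: a page 'contains' its own number."""
    return page

def run_workers(pages, worker_count=8):
    """Drain a queue of page numbers with several threads and return the grand total."""
    q = queue.Queue()
    for page in pages:
        q.put(page)
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        while True:
            try:
                page = q.get_nowait()  # raises queue.Empty when no work is left
            except queue.Empty:
                return
            value = fake_page_sum(page)
            with lock:  # only one thread updates the total at a time
                total += value

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

result = run_workers(range(1, 1001))
print(result)
```

Catching queue.Empty rather than BaseException keeps Ctrl+C and genuine errors visible, and the lock guards the read-modify-write on the shared total.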

Level 3:

I skipped this one because it requires an IP proxy pool.

Level 4

http://glidedsky.com/level/web/crawler-font-puzzle-1?page=1
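Level 4 obfuscates the digits with a WOFF font file; my other article on scraping Dianping (大众点评) also covers this technique.
The digits shown in the developer tools differ from the ones rendered on the page. Try a global search for the font property applied to those elements: font-family: glided_sky;

The search turns up a suspicious-looking field. I copied the string that follows base64 into a decoding tool and got what looked like garbage. Only after reading other bloggers' posts did I learn that the decoded bytes are exactly the WOFF file we need. After decoding, open the file with the online viewer Iconfont Preview.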
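
The decoded data looks like garbage because it is binary font data, not text. A minimal sketch of the extract-and-decode step follows; the `@font-face` rule in `sample_css` is a hypothetical stand-in built from fake bytes, and only the `base64,...)` shape matched by the regex is taken from this article. WOFF files begin with the four-byte signature wOFF, which gives a quick sanity check.

```python
import base64
import re

# Fake "font" payload: the WOFF signature followed by arbitrary bytes
fake_font_bytes = b"wOFF" + bytes(range(16))
encoded = base64.b64encode(fake_font_bytes).decode("ascii")

# Hypothetical style rule embedding the font as a base64 data URL
sample_css = (
    '@font-face { font-family: "glided_sky"; '
    'src: url(data:font/woff;base64,' + encoded + ') format("woff"); }'
)

def extract_font_bytes(page_source):
    """Return the decoded bytes of the first base64 data URL, or None if absent."""
    match = re.search(r'base64,(.*?)\)', page_source)
    if match is None:
        return None
    return base64.b64decode(match.group(1))

font_bytes = extract_font_bytes(sample_css)
print(font_bytes[:4])
```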

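Looking closely at the first number:
the page displays 384, while the source shows 259.
The WOFF file gives the correspondence: 3 maps to two, i.e. 2; 8 maps to five, i.e. 5; 4 maps to nine, i.e. 9.
In other words, the digits displayed on the page become the 259 seen in the source through the mapping stored in the WOFF file.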

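# Encryption mapping: displayed digit -> digit in the source
encrypt_dict = {
    "3": "2", "8": "5", "4": "9",  # ......
}

So we can build a dictionary from the WOFF file and use it to turn the digits found in the source back into the real ones.

# Decryption mapping: digit in the source -> displayed digit
decrypt_dict = {
    "2": "3", "5": "8", "9": "4",  # ......
}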

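The approach:
1. Match the base64 string with a regular expression and decode it into a WOFF file
2. Build the decryption dictionary from the WOFF file
3. Extract the digits shown in the source with XPath, then use the decryption dictionary to recover the digits displayed on the page.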
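
Steps 2 and 3 can be sketched without a real font file. The `glyph_order` list below is a hypothetical stand-in for what a font parser would return for one page, chosen so that it reproduces the 259 -> 384 example; the helper names are illustrative as well. The assumption, following this article's observation, is that a glyph's position in the order determines the digit that is actually drawn.

```python
# English glyph names and the digit characters they stand for
DIGIT_NAMES = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Hypothetical glyph order for one page; the first entry is a placeholder glyph
glyph_order = [".notdef", "seven", "one", "zero", "two", "nine",
               "three", "eight", "four", "five", "six"]

def build_decode_dict(order):
    """Map each source digit to the digit that is actually displayed."""
    decode = {}
    for position, glyph_name in enumerate(order[1:]):  # skip the placeholder
        decode[DIGIT_NAMES[glyph_name]] = str(position)
    return decode

def decode_number(source_digits, decode):
    """Translate a digit string from the page source into the displayed number."""
    return "".join(decode[d] for d in source_digits)

decode = build_decode_dict(glyph_order)
shown = decode_number("259", decode)
print(shown)
```

Here "two" sits at position 3, "five" at position 8 and "nine" at position 4, which is why the source digits 2, 5, 9 decode to 3, 8, 4.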

import base64
import queue
import re
import threading
import time
from fontTools.ttLib import TTFont
from lxml import etree

threadsnum = 10  # number of worker threads (value assumed here; pick your own)
total = 0  # running total for this level, shared by all threads
total_lock = threading.Lock()  # protects total against concurrent updates

nums_dict = {  # English glyph names -> digits
    "one": "1",
    "two": "2",
    "three": "3",
    "four": "4",
    "five": "5",
    "six": "6",
    "seven": "7",
    "eight": "8",
    "nine": "9",
    "zero": '0'
}

def crawler_font_puzzle_1_down_woff(req, filename):  # decode the font and save it as a .woff file
    gz = re.compile(r'base64,(.*?)\)')
    result = gz.search(req).group(1)
    r = base64.b64decode(result)
    with open('%s.woff' % filename, 'wb') as f:
        f.write(r)

def get_fonts_dict(fontpath):  # parse the .woff file with TTFont
    font = TTFont(fontpath + '.woff')  # open the file
    # Glyph names (English number words); the first entry is skipped
    code_list = font.getGlyphOrder()[1:]
    dc = {}  # output dictionary: source digit -> displayed digit
    word = [str(i) for i in range(0, 10)]
    for arra, wor in zip(code_list, word):
        arra = nums_dict[arra]  # turn the English word into its digit
        dc[arra] = wor
    return dc

def crawler_font_puzzle_1_sums(session, q, filename):  # worker: decode each page and add up
    global total
    while True:
        try:
            i = q.get_nowait()
        except queue.Empty:
            break
        url = 'http://glidedsky.com/level/web/crawler-font-puzzle-1?page={}'.format(i)
        print(url)
        req = session.get(url=url, headers=heads).text
        tree = etree.HTML(req)
        # One font file per page, so threads do not overwrite each other's file
        page_font = '%s_%d' % (filename, i)
        crawler_font_puzzle_1_down_woff(req, page_font)
        dc = get_fonts_dict(page_font)
        nums_list = tree.xpath('//div[@class="col-md-1"]/text()')
        for nums in nums_list:
            true_nums = "".join(dc[num] for num in nums.strip())
            print(int(true_nums))
            with total_lock:
                total += int(true_nums)
        time.sleep(5)

def crawler_font_puzzle_1():  # main function
    q = queue.Queue()
    for i in range(1, 1001):
        q.put(i)
    filename = 'f'
    session = login()
    start = time.time()
    thread_list = []
    for i in range(threadsnum):
        t = threading.Thread(target=crawler_font_puzzle_1_sums, args=(session, q, filename))
        thread_list.append(t)
    for t in thread_list:
        t.start()
    for t in thread_list:
        t.join()
    end = time.time()
    print('Took %ss' % (end - start))
    print(total)

The output is shown below; decryption succeeded.
[Screenshot: program output]

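Level 5

http://www.glidedsky.com/level/web/crawler-css-puzzle-1?page=1
This level uses CSS obfuscation. You have to look carefully and find the pattern before you can decode the data.
Again the content displayed on the page differs from the source, and the elements carry different class names. Copy a class name and search for it.

The search shows that the source contains property values for each class name, but one class can have several rules, so first match them with a regular expression and collect them into a dictionary that is easier to inspect.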

import re

def get_css_dict():
    url = 'http://www.glidedsky.com/level/web/crawler-css-puzzle-1?page=1'
    session = login()
    req = session.get(url=url, headers=heads).text
    # Group 1: class name, group 2: the rule body between the braces
    gz = re.compile(r'\.([A-Za-z0-9]*).*?{(.*?)}')
    css_list = re.findall(gz, req)
    css_dict = {}  # merge the extracted properties into one dictionary
    for i in css_list:
        a = i[1].strip()
        a = a.split(':')  # split the property name from its value
        if i[0] not in css_dict:
            dic = {}  # first rule for this class name: start a new dictionary
        else:
            dic = css_dict[i[0]]  # class name seen before: reuse its dictionary
        dic[a[0]] = a[1]
        css_dict[i[0]] = dic
    return css_dict
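
To see what kind of dictionary this produces, the parsing can be run on a small hand-written stylesheet. This is a sketch under assumptions: `sample_css` and its class names are invented, and the regex here is a slightly stricter variant than the one in the article (it also splits several declarations separated by semicolons).

```python
import re

# Hypothetical inline stylesheet; real class names are random strings
sample_css = (
    '.a1 { left:1em }\n'
    '.b2 { opacity:0 }\n'
    '.c3:before { content:"428" }\n'
    '.a1 { float:left }\n'
)

def parse_css_rules(css_text):
    """Collect the properties of every class into {class_name: {property: value}}."""
    rules = {}
    pattern = r'\.([A-Za-z0-9_-]+)[^{}]*\{([^}]*)\}'
    for name, body in re.findall(pattern, css_text):
        props = rules.setdefault(name, {})  # merge repeated rules for one class
        for declaration in body.split(';'):
            prop, sep, value = declaration.partition(':')
            if sep:  # ignore empty fragments
                props[prop.strip()] = value.strip()
    return rules

rules = parse_css_rules(sample_css)
print(rules)
```

Using setdefault merges the two `.a1` rules into one entry, which is the "one class, several rules" case described above.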

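The result is a dictionary of this kind. Comparing it against the page, the following properties turn out to matter:
['opacity']: if present, the corresponding digit is not one of the real digits we need
['content']: if present, its value is the real number we need; it appears in the source as ::before
['left']: if present, the corresponding digit is real, but its position in the source differs from its position on the page.

The first two can be handled on sight, by skipping the digit or by taking the value directly. For the left property, my guess was that the value is the digit's position offset.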
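Further analysis of the left property shows:

When no element with the ['opacity'] property is present, real position = position in the source + left value.
When elements with the ['opacity'] property are present, real position = position in the source + left value - number of ['opacity'] elements encountered before this digit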
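
The position rule can be worked through on a tiny example. The sketch below assumes three-digit numbers, as the article's code does; the class names, CSS values and the helper `rebuild_number` are hypothetical and only mirror the rule stated above.

```python
import re

def rebuild_number(classes, texts, css):
    """Recover the displayed number from the class names and text of the inner divs."""
    if len(texts) < 3:
        # Fewer than three text nodes: the number comes from a ::before content rule
        for cls in classes:
            content = css.get(cls, {}).get("content")
            if content is not None:
                return int(re.search(r"\d+", content).group(0))
        raise ValueError("no content rule found")
    digits = ["", "", ""]
    hidden = 0  # number of opacity (decoy) digits seen so far
    for index, (cls, text) in enumerate(zip(classes, texts)):
        props = css.get(cls, {})
        if "opacity" in props:
            hidden += 1  # decoy digit: skip it and shift later positions
            continue
        shift = 0
        if "left" in props:
            shift = int(re.search(r"-?\d+", props["left"]).group(0))
        digits[index + shift - hidden] = text
    return int("".join(digits))

# Hypothetical rules: x2 is a decoy, x1 and x3 swap places through left
css = {
    "x1": {"left": "1em"},
    "x2": {"opacity": "0"},
    "x3": {"left": "-1em"},
    "x4": {},
    "y1": {"content": '"428"'},
}
first = rebuild_number(["x1", "x2", "x3", "x4"], ["8", "7", "3", "4"], css)
second = rebuild_number(["y1"], [], css)
print(first, second)
```

In the first call, 8 moves from index 0 to 1, the decoy 7 is dropped, 3 lands at 2 - 1 - 1 = 0 and 4 at 3 - 1 = 2, giving 384.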

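The approach:
1. Extract the properties of each class with a regular expression and collect them into a dictionary
2. Get the class names from the source with XPath, then decode using which properties exist in the dictionary and what their values are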

import queue
import re
import threading
import time
from lxml import etree

threadsnum = 10  # number of worker threads (value assumed here; pick your own)
total = 0  # running total for this level, shared by all threads
total_lock = threading.Lock()  # protects total against concurrent updates

def crawler_css_puzzle_1():
    session = login()
    start = time.time()
    threads_list = []
    q = queue.Queue()
    for i in range(1, 1001):
        q.put(i)
    for i in range(threadsnum):
        t = threading.Thread(target=crawler_css_puzzle_1_sums, args=(session, q))
        threads_list.append(t)
    for t in threads_list:
        t.start()
    for t in threads_list:
        t.join()
    end = time.time()
    print('Took %ss' % (end - start))
    print(total)

def crawler_css_puzzle_1_sums(session, q):
    global total
    while True:
        try:
            page = q.get_nowait()
        except queue.Empty:
            break
        url = 'http://www.glidedsky.com/level/web/crawler-css-puzzle-1?page={}'.format(page)
        print(url)
        req = session.get(url=url, headers=heads).text
        gz = re.compile(r'\.([A-Za-z0-9]*).*?{(.*?)}')
        css_list = re.findall(gz, req)
        css_dict = {}  # merge the extracted properties into one dictionary
        for rule in css_list:
            a = rule[1].strip().split(':')
            if rule[0] not in css_dict:
                dic = {}
            else:
                dic = css_dict[rule[0]]
            dic[a[0]] = a[1]
            css_dict[rule[0]] = dic
        tree = etree.HTML(req)
        num_list = tree.xpath('//div[@class="col-md-1"]')
        for num in num_list:
            class_l = list(num.xpath('.//div/@class'))
            nums_l = list(num.xpath('.//div/text()'))
            numss = ['', '', '']  # holds the three digits of the final number

            if len(nums_l) < 3:  # fewer than 3 text nodes: a ::before rule holds the number, take it directly
                for c in class_l:
                    if 'content' in css_dict[c]:
                        gz = re.compile(r'\d{3}')  # match the three-digit number
                        numss[0] = re.findall(gz, css_dict[c]['content'])[0]
            else:
                xz = 0  # offset correction
                for c, n in zip(class_l, range(len(nums_l))):
                    key_list = css_dict[c].keys()
                    if 'opacity' in key_list:
                        xz += 1  # a decoy digit was skipped: increase the correction by one
                        continue
                    elif 'left' in key_list:
                        string = css_dict[c]['left']
                        number = int(re.search(r'-?\d+', string).group(0))
                        # real position = source position + left - skipped digits
                        sj_wz = n + number - xz
                        numss[sj_wz] = str(nums_l[n])
                    else:  # neither left nor opacity: source position minus the correction
                        numss[n - xz] = str(nums_l[n])

            value = int("".join(numss))
            with total_lock:
                total += value

The output is shown below
[Screenshot: program output]
