Crawling the Entire MSDN Site with Python (Using requests.post, with Packet-Capture Analysis)

Latest recommended article published on 2024-05-20 09:42:31

野猪被骑


Category columns: crawler, post, msdn. Article tags: crawler, post, msdn

Article link: https://blog.csdn.net/u014688958/article/details/83934316


This site requires packet-capture analysis: the resource information on MSDN is obtained by sending POST requests.
URL: https://msdn.itellyou.cn/
Libraries used: requests, re

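Open the site, press F12, and use the developer tools (Google Chrome) to capture and analyze the traffic.
[Figure: screenshot of the capture steps in the developer tools]
Clicking through the steps in the figure above, you can find four requests under XHR in the Network tab of the developer tools: Index, GetLang, GetList, and GetProduct. The header information shows that all four use the POST method, so the things to look at are the Form Data in their headers and their responses. The response of the GetProduct request contains exactly the resource information we want.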
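To make the Form Data entry in the developer tools more concrete, here is a minimal sketch of how a dict of fields becomes a form-encoded request body. It uses only the standard library, and `encode_form` is a hypothetical helper name. Passing a dict as `data=` to `requests.post` sends the fields in this same form-encoded style. The `LANG-ID` value is a placeholder, not a real id from the site.

```python
from urllib.parse import urlencode


def encode_form(fields):
    """Encode a dict as an application/x-www-form-urlencoded body (hypothetical helper)."""
    return urlencode(fields)


# The id is the sample value used in this article's script; LANG-ID is a placeholder.
index_body = encode_form({'id': 'aff8a80f-2dee-4bba-80ec-611ac56d3849'})
list_body = encode_form({'id': 'abc', 'lang': 'LANG-ID', 'filter': 'true'})

print(index_body)
print(list_body)
```

Comparing a string like this with the Form Data panel of a captured request is a quick way to check that the script is sending the same fields as the browser.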

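The analysis shows how the requests chain together: the Form Data of GetProduct comes from the response of GetList; the Form Data of GetList is the Form Data of GetLang plus the content of GetLang's response; the Form Data of GetLang comes from the response of Index; and the Form Data of Index can also be found (the code below takes it from the data-menuid attributes in the home page). See the figure below:
[Figure: screenshot showing how the Form Data and responses of the four requests relate]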
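The chain described above can be sketched without any network access. The response shapes and all ids below are assumptions: they are inferred from the regular expressions in this article's script, not taken from the site's documentation. The helper names `lang_form`, `list_form`, and `product_form` are hypothetical.

```python
import json


def lang_form(index_item):
    """Form Data for GetLang: the id of one entry from the Index response."""
    return {'id': index_item['id']}


def list_form(lang_form_data, lang_response):
    """Form Data for GetList: GetLang's Form Data plus a language id from its response."""
    languages = lang_response.get('result') or []
    if not languages:
        return None  # no language available, nothing to request
    form = dict(lang_form_data)
    form['lang'] = languages[0]['id']  # first language only, as in the article's script
    form['filter'] = 'true'
    return form


def product_form(list_response):
    """Form Data for GetProduct: the id of the first entry in the GetList response."""
    products = list_response.get('result') or []
    return {'id': products[0]['id']} if products else None


# Hypothetical sample responses with placeholder ids (assumed shapes).
index_response = json.loads('[{"id":"cat-1","name":"Example category"}]')
lang_response = json.loads('{"status":true,"result":[{"id":"lang-1","lang":"Example language"}]}')
list_response = json.loads('{"status":true,"result":[{"id":"prod-1","name":"Example product"}]}')

step1 = lang_form(index_response[0])
step2 = list_form(step1, lang_response)
step3 = product_form(list_response)
print(step1, step2, step3)
```

Keeping the payload construction in pure functions like these makes each step easy to check against what the developer tools show before any real request is sent.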

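With this analysis the crawling approach is clear: a few layers of POST requests are enough to crawl the resources of the whole site.
Without further ado, here is the code: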

import json
import re

import requests

url = 'https://msdn.itellyou.cn/'
index_url = 'https://msdn.itellyou.cn/Category/Index'
lang_url = 'https://msdn.itellyou.cn/Category/GetLang'
list_url = 'https://msdn.itellyou.cn/Category/GetList'
product_url = 'https://msdn.itellyou.cn/Category/GetProduct'

# Form data for the Index request; the id is overwritten for every menu entry
data = {'id': 'aff8a80f-2dee-4bba-80ec-611ac56d3849'}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
    'Referer': 'https://msdn.itellyou.cn/',
}

def get_ilt(url, data, headers):
    """POST the form data to url and return the response body as text."""
    r = requests.post(url, data=data, headers=headers)
    r.raise_for_status()
    return r.text

def dowload(index_res):
    """Follow GetLang -> GetList -> GetProduct for each entry of an Index response."""
    # Each match is the '"id":"..."' part of one entry. The original pattern ended
    # with '},' and therefore could not match the last entry of the list.
    index_id = re.findall(r'{(.*?),"name".*?}', index_res)
    for i in index_id:
        # json.loads replaces eval: text received from a server should never be executed
        data_ = json.loads('{' + i + '}')
        lang_res = get_ilt(lang_url, data_, headers)
        lang_id = re.findall(r'{"status":true,"result":\[{"id":"(.*?)","lang":.*?}]}', lang_res)
        if not lang_id:  # skip entries that have no language
            continue
        data_['lang'] = lang_id[0]  # only the first language is used
        data_['filter'] = 'true'
        list_res = get_ilt(list_url, data_, headers)
        # This pattern matches only the first item of the GetList result
        product_id = re.findall(r'{"status":true,"result":\[(.*?),"name":', list_res)
        if not product_id:  # skip empty lists
            continue
        product_data = json.loads(product_id[0] + '}')
        product_res = get_ilt(product_url, product_data, headers)
        file_name = re.findall(r'{"status":true,"result":{"FileName":"(.*?)","DownLoad":', product_res)
        if not file_name:
            continue
        # Drop the original extension and save the raw response as a .txt file
        path = file_name[0].rsplit('.', 1)[0] + '.txt'
        with open(path, 'w', encoding='utf-8') as f:
            f.write(product_res)
        print(path, ': download complete')

def dowload_all():
    """Read the menu ids from the home page and crawl each one."""
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    all_index = re.findall(r' data-loadmenu="true" data-menuid="(.*?)" data-target=', res.text)
    for menu_id in all_index:
        data['id'] = menu_id
        index_res = get_ilt(index_url, data, headers)
        dowload(index_res)

if __name__ == '__main__':
    dowload_all()
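The script builds the output path directly from the FileName field of the GetProduct response. As a hedged sketch, a small helper can make that step more robust by dropping the original extension and replacing characters that are not allowed in Windows file names. `result_path` is a hypothetical name, and the sample file names are made up for illustration.

```python
import re


def result_path(file_name, suffix='.txt'):
    """Turn a FileName value into a safe output path (hypothetical helper)."""
    # Drop the last extension, if there is one.
    stem = file_name.rsplit('.', 1)[0] if '.' in file_name else file_name
    # Replace characters that Windows does not allow in file names.
    stem = re.sub(r'[\\/:*?"<>|]', '_', stem).strip()
    return (stem or 'unnamed') + suffix


# Made-up file names, for illustration only.
print(result_path('cn_windows_10_x64.iso'))
print(result_path('a/b:c.iso'))
```

Inside `dowload`, the `path` variable could be computed with a helper like this instead of slicing the string by position.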
