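Note: this article is for study and reference only and involves no commercial use. If it infringes any rights, contact me and it will be removed.
Collection target: the default parameter values of each model component on the pages of a home-decoration design platform
- Page entry point
- Target data page, Confluence link: 4.132, Competitor Component Analysis (友商组件分析)
Site crawl analysis:
- Check how the target data page is requested: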
- GET request, URL: https://ihome-turbo.oss-cn-beijing.aliyuncs.com/online/2155/entity/1324.json?OSSAccessKeyId=STS.NUbd6KGuciYuKWZaXX7sGnRJc&Expires=1656729629&Signature=EBas7BjL%2FIWxr%2BM1tyAuMkZ6SwE%3D&security-token=CAISkQJ1q6Ft5B2yfSjIr5bXL4z%2FqqpC3pueSXHrhVgNO%2FxrgZfhgTz2IHxMfXZpAu4Ys%2FgznWtT5%2FgZlr9yS5hASAnYcNF66dFX9gaseJbQv8GvtRbsBhBWQTr9MQXy%2BeOPScebJYqvV5XAQlTAkTAJstmeXD6%2BXlujHISUgJp8FLo%2BVRW5ajw0b7U%2FZHEVyqkgOGDWKOymPzPzn2PUFzAIgAdnjn5l4qnNqa%2F1qDim1QGll7RI%2Ftuse8n9NJc0bK0SCYnlgLZEEYPayzNV5hRw86N7sbdJ4z%2BvvKvGXwEIvEzdbbSJroE2d1QoOPMgb6VOoOThj%2Fd%2FuvfXlo7twgxcI%2BBOUjjYXpqnxMbU%2BGoclg%2Fr0twagAE6ZE%2BZhP5K3N%2BR8enlR1uXDpU5lFL7YYl9clreIy1f4nQ5PCROp5%2Fg54ee0c%2FCheBUy2A%2F4kr176Tl9Bepz1D5Ae6X32l7AiERMKjPK6vfFl7MXo620iZN0%2BGKRk0%2Bssrc8xRtzp%2FUuoVy60lZuE5orSk7iUdhWDTHn6nsxTUO6w%3D%3D
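- Testing showed that only 2155 and 1324 in the URL are required, so the URL can be reduced to https://ihome-turbo.oss-cn-beijing.aliyuncs.com/online/2155/entity/1324.json
- 2155 is fixed and may be something like an appkey; 1324 is the component ID, and each component has its own ID
- Header analysis found no cookie-based or other anti-scraping measures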
    Host: ihome-turbo.oss-cn-beijing.aliyuncs.com
    Connection: keep-alive
    sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
    sec-ch-ua-mobile: ?0
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
    sec-ch-ua-platform: "Windows"
    Accept: */*
    Origin: https://3d.shejijia.com
    Sec-Fetch-Site: cross-site
    Sec-Fetch-Mode: cors
    Sec-Fetch-Dest: empty
    Referer: https://3d.shejijia.com/
    Accept-Encoding: gzip, deflate, br
    Accept-Language: zh-CN,zh;q=0.9
- So far, no anti-scraping measures have been found on the component detail page
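The URL simplification described above can be sketched in Python as follows. This is only a minimal illustration: the helper name `build_entity_url` and the constant names are hypothetical, the meaning of 2155 is a guess, and the script only builds the URL without sending any request.

```python
# Base of the simplified URL observed in the analysis above.
BASE_URL = "https://ihome-turbo.oss-cn-beijing.aliyuncs.com/online"

# Fixed path segment; its exact meaning is unknown (possibly similar to an appkey).
FIXED_SEGMENT = 2155


def build_entity_url(component_id: int) -> str:
    """Return the simplified detail URL for one component ID (hypothetical helper)."""
    return f"{BASE_URL}/{FIXED_SEGMENT}/entity/{component_id}.json"


if __name__ == "__main__":
    # Rebuilds the simplified URL for the component ID used in the example above.
    print(build_entity_url(1324))
```

Keeping the fixed segment in a single constant makes it easy to adjust if it turns out not to be constant after all.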
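- Obtaining component IDs:
Clicking a category in the left-hand menu triggers the API request
In that request, appKey is constant, t is a 13-digit timestamp, sign is a 32-character signature, and poolIds is the category ID. The category ID API is not covered here; the collected results are in the file tp_model_cat.txt
Headers: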
    Host: acs.m.shejijia.com
    Connection: keep-alive
    platform-env: sjj
    EagleEye-UserData: w_t_id=f448885d-426c-4e80-a081-e029ecebb710
    sec-ch-ua-mobile: ?0
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
    Content-type: application/x-www-form-urlencoded
    Accept: application/json
    env-domain: sjj
    sec-ch-ua-platform: "Windows"
    sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
    Origin: https://3d.shejijia.com
    Sec-Fetch-Site: same-site
    Sec-Fetch-Mode: cors
    Sec-Fetch-Dest: empty
    Referer: https://3d.shejijia.com/
    Accept-Encoding: gzip, deflate, br
    Accept-Language: zh-CN,zh;q=0.9
    Cookie: t=4063b01b2822a7d153699afda59f8b7f; cna=u94QGhEH5xICAdpeC1LEqgZT; gr_user_id=bda7244d-a3a3-4910-80a9-0f3c185a2853; xlly_s=1; user=%7B%22memberId%22%3A%222347860193285898240%22%2C%22memberType%22%3A%22designer%22%2C%22nickName%22%3A%22%E8%AE%BE%E8%AE%A1%E5%B8%882346%22%2C%22avatar%22%3A%22%22%2C%22umsId%22%3A%222e9c7ae8-a405-4af0-b87a-5205494571d5%22%2C%22accessToken%22%3A%22d26bad58-2d67-49bd-84b1-3cd1a2f6c139%22%2C%22site%22%3A46%2C%22domain%22%3Anull%2C%22enterpriseId%22%3Anull%2C%22employeeId%22%3Anull%7D; __user_location_modal_show__=true; _m_h5_tk=d91ea8a9b5d7a235cfc4dfa8c3cfdc8e_1656735670492; _m_h5_tk_enc=a1f39769d85b0e8ca4148e5a0bc7d490; cookie2=1b14c65d042e0f5a064fdbabe588f774; _tb_token_=e1ee73333e5ee; _samesite_flag_=true; csg=8b709201; isg=BD8_QNCwYckpn2Yq0BDFhDuzzhPJJJPGRsXkM9EIRe414FJipPSnFB6yJrAeuGs-
Headers after trimming:
    Host: acs.m.shejijia.com
    Connection: keep-alive
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
    Content-type: application/x-www-form-urlencoded
    Accept: application/json
    Origin: https://3d.shejijia.com
    Referer: https://3d.shejijia.com/
    Accept-Encoding: gzip, deflate, br
    Accept-Language: zh-CN,zh;q=0.9
    Cookie: _m_h5_tk=d91ea8a9b5d7a235cfc4dfa8c3cfdc8e_1656735670492; _m_h5_tk_enc=a1f39769d85b0e8ca4148e5a0bc7d490;
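Since the trimmed headers keep only two cookies, a small helper can pull them out of a Cookie header value copied from a packet capture. The sketch below is a minimal illustration using only plain string handling; the function names `parse_cookie_header` and `pick_required_cookies` are hypothetical and not part of the original script.

```python
# Cookie names kept in the trimmed headers above.
REQUIRED_COOKIES = ("_m_h5_tk", "_m_h5_tk_enc")


def parse_cookie_header(cookie_header: str) -> dict:
    """Split a raw Cookie header value into a name -> value dict."""
    cookies = {}
    for part in cookie_header.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty fragments such as a trailing ";"
        name, value = part.split("=", 1)  # values may themselves contain "="
        cookies[name.strip()] = value.strip()
    return cookies


def pick_required_cookies(cookie_header: str) -> dict:
    """Keep only the cookies listed in REQUIRED_COOKIES; fail loudly if one is missing."""
    cookies = parse_cookie_header(cookie_header)
    missing = [name for name in REQUIRED_COOKIES if name not in cookies]
    if missing:
        raise KeyError(f"missing cookies: {missing}")
    return {name: cookies[name] for name in REQUIRED_COOKIES}


if __name__ == "__main__":
    raw = (
        "xlly_s=1; _m_h5_tk=d91ea8a9b5d7a235cfc4dfa8c3cfdc8e_1656735670492; "
        "_m_h5_tk_enc=a1f39769d85b0e8ca4148e5a0bc7d490; csg=8b709201"
    )
    print(pick_required_cookies(raw))
```

Raising an error when a cookie is missing makes an incomplete capture obvious before any request is attempted.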
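Approach to analyzing the sign signature
- Because the sign value is 32 characters long, I guessed it was produced by MD5, so I searched globally for the keyword MD5 in Chrome DevTools and landed on a method whose name had been obfuscated. A breakpoint set there was never hit, so I inferred that this method does not generate sign. Given the _m_h5_tk and _m_h5_tk_enc cookies, it may instead be related to account encryption, so I stopped tracing it
- Analyze the JS tied to the front-end click event; the reasoning is that clicking a category on the front end sends a request to the back end
- Searching that JS for the keyword "sign" returned too many matches; changing the keyword to "sign " (with a trailing space) or "sign:" narrowed the range and finally located the suspected signing function
- A breakpoint confirmed the signing function (see sign.txt). Its input is
token_str + '&' + 13-digit timestamp + '&' + '12574478' + '&' + {"poolIds":"[464594]","limit":30,"offset":30,"sort":"desc","requestId":"bcf8af6f-c500-4025-9327-d87ffc29c849","tenant":"ezhome","traceId":"f448885d-426c-4e80-a081-e029ecebb710"} where token_str matches _m_h5_tk in the cookie. The test results are as follows
- With that, the sign value was successfully reproduced. The cookie in the headers can be obtained by packet capture; this crawler script does not need to run online continuously, so the cookie can simply be updated by hand each time data needs to be fetched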
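The signing input described above can be assembled as in the Python sketch below. It is only an illustration under stated assumptions: the helper names are hypothetical, the compact JSON serialization is assumed from the appearance of the sample payload, and the actual digest function (recorded in sign.txt) is not reproduced here, so the caller has to pass it in as `digest`.

```python
import json
import time
from typing import Callable

# Constant that appears in the signing input above.
APP_KEY = "12574478"


def current_timestamp_ms() -> int:
    """Return the current time as a millisecond timestamp (13 digits at present)."""
    return int(time.time() * 1000)


def build_sign_input(token_str: str, timestamp_ms: int, data: dict,
                     app_key: str = APP_KEY) -> str:
    """Join token, timestamp, app key and JSON payload with '&'.

    Assumption: the payload is serialized as compact JSON (no spaces), with
    keys in insertion order, as in the sample payload shown above.
    """
    data_json = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
    return "&".join([token_str, str(timestamp_ms), app_key, data_json])


def make_sign(token_str: str, timestamp_ms: int, data: dict,
              digest: Callable[[str], str]) -> str:
    """Apply a caller-supplied digest (a port of the function in sign.txt) to the input."""
    return digest(build_sign_input(token_str, timestamp_ms, data))


if __name__ == "__main__":
    payload = {"poolIds": "[464594]", "limit": 30, "offset": 30, "sort": "desc"}
    # "token_str_from_cookie" is a placeholder, not a real token.
    print(build_sign_input("token_str_from_cookie", current_timestamp_ms(), payload))
```

Whatever serialization is used, the payload sent in the request should be the same string that was signed, and the same timestamp should be used for both t and the signing input.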
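Key takeaways
- The site's JS code is obfuscated: function names are random digits or letters, and some key methods are invoked through callbacks
- The sign value is a 32-character string, which easily steers the analysis in the wrong direction by suggesting MD5
- When a keyword search returns many results, start from the front-end event listeners to narrow down which JS files the keyword needs to be searched in, reducing the amount of analysis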