爬取需要登录的网站的资源
背景:对于一些网站需要使用用户名和密码登录并且使用了https,我们如果不通过凭证将无法进行该网站的下载、爬虫!,而具体的凭证一般的是”cookies“形式的。
内容:本文主要介绍了如何爬取需要登录网站的内容(视频、图片、网页)的简易教程。
postman文档地址:https://learning.postman.com/docs/sending-requests/requests/
1. 获取登录的请求
首先需要使用用户名密码登录到网站,查看f12找到登录的请求,复制成Copy as CURL
登录请求uri一般是login或register等等,认真找一找
2. 用postman模拟登录请求
- 导入请求到postman
将复制的内容导入到postman接口工具中
- 发送请求,获取到wget代码片段
发送请求,检查是否模拟登录成功,如果请求发送成功,则按下图获取到postman的wget代码片段。
3. 用wget模拟登录请求并保存cookie
- 在从postman复制的代码片段后追加(如下)cookie配置。
意思就是把cookie保存在cookies.txt中,以及后续使用
--save-cookies=cookies.txt --keep-session-cookies
- 模拟登录请求并保存cookie
用命令行发送类似下面的wget命令。该命令就是postman复制的代码片段后追加--save-cookies=cookies.txt --keep-session-cookies
wget --no-check-certificate --quiet --method GET --timeout=0 --header 'authority: qvb111.xyz' --header 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' --header 'accept-language: zh-CN,zh;q=0.9' --header 'cache-control: max-age=0' --header 'cookie: md10=kdfjijf89485.online; _ga=GA1.1.1107869110.1654255726; _ga_6DLS4FBHC6=GS1.1.1654259056.2.1.1654260355.0; _nipple_session=DZmMES3vGmHhXLnp9TnULezhbUhy%2FIqFyLNWNYot0S%2FCq7n73iJ1P7ypivBy4u8IPPYe6smeiP7I%2FttFSLEHeb6jEafg50to7ceYCtDLQdAVwnBRdGenEKtc7dODRRQn9FaVOS9ietmoMO0IAbcJ6%2B%2BypZestlQ9IIoAYyYmTvmzQltULHnuA2cQEGUyxlmJqwCF1nfYrhMtBqEgpFP2UwrBKEcBBcqYFL96klIQBOOCSdm8UueNKLZ9O%2BUAlN%2FEIRQgV229ziwy5kUVxBDYzJ9tmLbxrVtSKzKxESuQ1W9n6JefP64fB%2FC7l7kWfL0Vys%2BlCi57UkpuhHfM0IJhj33FOSy4iMtXcVGETor4NG2%2FHcUL2U974YCfPBX6Rc%2BoQ%2Bm8%2Fkyzdutme9AQS%2FPk--RkCe6gHEAt3X3JgH--j5UScZwkeVHIukpKpt6TGQ%3D%3D; _nipple_session=GBgJoGvRuRJBkWfWwcoSDKiquxucPgj24AUTQQe%2FfPANRvWA6unhiGQFQ8SPqml271vlZwFtGra448GmgDKSnpX%2FCSUkwzEiqDr0ekV9oKw%2FKdrkk6ELO0Z3J8YqInUSiQKm04eVKJvHCRc5p0MH1jJ%2BZAcONVfvfh11Ai2TGpTzYOxZ%2BIi2uHqXn817GUFO7GkDB2VI%2FTIPMz%2B8J7Sxj2GJaEQU%2FKyROs5XN0BWCVhe9EF8CT8RKa1DP%2FrLzOosn33weZOCaPR%2Bbn7jwupxrxsCZ68Tg9oUl%2Ff4GrVTPoAyaWuoPlD0sKtteh9HKqg%2Fb%2BzJMS04US9OlztCm5rzJmV7xW6uoUX9%2BerYxZJB11haN%2Fquablym5VufyWURAZybjY7jEaCoSp94t4EBlPJ--SphXN3nrbR%2Fc3Yhu--G6JqS5oBVQSPdSCeXCf4lg%3D%3D' --header 'referer: https://qvb111.xyz/users/sign_in' --header 'sec-ch-ua: "-Not.A/Brand";v="8", "Chromium";v="102"' --header 'sec-ch-ua-mobile: ?0' --header 'sec-ch-ua-platform: "macOS"' --header 'sec-fetch-dest: document' --header 'sec-fetch-mode: navigate' --header 'sec-fetch-site: same-origin' --header 'sec-fetch-user: ?1' --header 'upgrade-insecure-requests: 1' --header 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36' 'https://qvb111.xyz/' --save-cookies=cookies.txt --keep-session-cookies
4. 开始爬取网站
配置从cookies.txt中加载cookies,并爬取网站https://qvb111.xyz/girl/show/2797
wget --load-cookies cookies.txt \
--keep-session-cookies \
https://qvb111.xyz/girl/show/2797
5. 查看爬取结果
作者爬取了某个带颜色的网站后,并用以下的命令查看爬取的内容
cd firefish
ls
cd show
ls
ls | wc -l
du -sh .
6. 网站爬虫简易教程
1、正常登录目标网站
2、找到登录请求、复制、导入postman处理
3、复制postman生成wget代码片段,并追加设置
--save-cookies cookies.txt --keep-session-cookies
4、模拟登录并保存凭证
wget --no-check-certificate --quiet --method GET --timeout=0 --header 'authority: qvb111.xyz' --header 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' --header 'accept-language: zh-CN,zh;q=0.9' --header 'cache-control: max-age=0' --header 'cookie: md10=kdfjijf89485.online; _ga=GA1.1.1107869110.1654255726; _ga_6DLS4FBHC6=GS1.1.1654259056.2.1.1654260355.0; _nipple_session=DZmMES3vGmHhXLnp9TnULezhbUhy%2FIqFyLNWNYot0S%2FCq7n73iJ1P7ypivBy4u8IPPYe6smeiP7I%2FttFSLEHeb6jEafg50to7ceYCtDLQdAVwnBRdGenEKtc7dODRRQn9FaVOS9ietmoMO0IAbcJ6%2B%2BypZestlQ9IIoAYyYmTvmzQltULHnuA2cQEGUyxlmJqwCF1nfYrhMtBqEgpFP2UwrBKEcBBcqYFL96klIQBOOCSdm8UueNKLZ9O%2BUAlN%2FEIRQgV229ziwy5kUVxBDYzJ9tmLbxrVtSKzKxESuQ1W9n6JefP64fB%2FC7l7kWfL0Vys%2BlCi57UkpuhHfM0IJhj33FOSy4iMtXcVGETor4NG2%2FHcUL2U974YCfPBX6Rc%2BoQ%2Bm8%2Fkyzdutme9AQS%2FPk--RkCe6gHEAt3X3JgH--j5UScZwkeVHIukpKpt6TGQ%3D%3D; _nipple_session=GBgJoGvRuRJBkWfWwcoSDKiquxucPgj24AUTQQe%2FfPANRvWA6unhiGQFQ8SPqml271vlZwFtGra448GmgDKSnpX%2FCSUkwzEiqDr0ekV9oKw%2FKdrkk6ELO0Z3J8YqInUSiQKm04eVKJvHCRc5p0MH1jJ%2BZAcONVfvfh11Ai2TGpTzYOxZ%2BIi2uHqXn817GUFO7GkDB2VI%2FTIPMz%2B8J7Sxj2GJaEQU%2FKyROs5XN0BWCVhe9EF8CT8RKa1DP%2FrLzOosn33weZOCaPR%2Bbn7jwupxrxsCZ68Tg9oUl%2Ff4GrVTPoAyaWuoPlD0sKtteh9HKqg%2Fb%2BzJMS04US9OlztCm5rzJmV7xW6uoUX9%2BerYxZJB11haN%2Fquablym5VufyWURAZybjY7jEaCoSp94t4EBlPJ--SphXN3nrbR%2Fc3Yhu--G6JqS5oBVQSPdSCeXCf4lg%3D%3D' --header 'referer: https://qvb111.xyz/users/sign_in' --header 'sec-ch-ua: "-Not.A/Brand";v="8", "Chromium";v="102"' --header 'sec-ch-ua-mobile: ?0' --header 'sec-ch-ua-platform: "macOS"' --header 'sec-fetch-dest: document' --header 'sec-fetch-mode: navigate' --header 'sec-fetch-site: same-origin' --header 'sec-fetch-user: ?1' --header 'upgrade-insecure-requests: 1' --header 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36' 'https://qvb111.xyz/' --save-cookies=cookies.txt --keep-session-cookies
5、开始爬虫
wget --load-cookies cookies.txt \
--keep-session-cookies \
https://qvb111.xyz/girl/show/2797
6、查看爬虫成果(见视频)
可以以个人网站测试或gitee个人仓库测试,🈲不合理使用