I figured I'd just find an e-textbook and be done with it. Full of enthusiasm, I opened Baidu, searched for "PEP e-textbook" (人教电子书), hit Enter full of hope, and then...
Browsing the site
I clicked through link after link and got nothing, until I finally landed on a lovely site, the New Oriental study portal, where I at last found the material I needed.
New Oriental's next-term e-textbooks for primary and secondary school
But the moment I got in, something was clearly off. And this calls itself New Oriental?
The dizzying wall of ads made it obvious this is a repost site.
Fine, never mind that; let's keep scrolling.
Great, apparently I have to keep clicking. Unbelievable.
Why am I being redirected to this awful URL again? Enough said: one look tells you it's a shady page on top of a shady cloud-drive download. And why scatter the files across separate cloud drives instead of keeping them in one? I doubt anyone would have the patience to do that by hand!
Writing the code
All right, enough talk; let's dig into this odd site.
First, let's take a look at the request:
Request URL: http://nc.xdf.cn/huodong/202002/058569030_2.html
Request Method: GET
Status Code: 200 OK
Remote Address: 116.199.3.88:80
Referrer Policy: no-referrer-when-downgrade
Age: 0
Ali-Swift-Global-Savetime: 1594725551
Cache-Control: max-age=1800
Connection: keep-alive
Content-Type: text/html; charset=utf-8
Date: Tue, 14 Jul 2020 11:19:11 GMT
EagleId: 74c7034515947255515325747e
Expires: Tue, 14 Jul 2020 11:49:11 GMT
Server: Tengine
Timing-Allow-Origin: *
Transfer-Encoding: chunked
Vary: Accept-Encoding
Via: cache21.l2st3-1[79,200-0,M], cache32.l2st3-1[81,0], cache4.cn585[152,200-0,M], cache5.cn585[154,0]
X-Cache: MISS TCP_MISS dirn:-2:-2
X-Frame-Options: SAMEORIGIN
X-Swift-CacheTime: 1800
X-Swift-SaveTime: Tue, 14 Jul 2020 11:19:11 GMT
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Cookie: soukecityid=35; city=35; gr_user_id=164cd416-c45d-450e-9cb6-b7ac5fd308a7; Fingerprint_xdf=1594722429047_0.1479318682825368; Hm_lvt_e010d1faf316a4dbfe8639481a2a3f90=1594722429; _ga=GA1.2.292417677.1594722430; _gid=GA1.2.953320500.1594722430; gr_session_id_a0c70ca07e901f77=1f14e2ad-9aaa-4792-a61d-0c783b1ea698; gr_session_id_a0c70ca07e901f77_1f14e2ad-9aaa-4792-a61d-0c783b1ea698=true; Hm_lpvt_e010d1faf316a4dbfe8639481a2a3f90=1594725384; __xsptplus342=342.2.1594725213.1594725383.2%234%7C%7C%7C%7C%7C%23%23orDIJJ_tFHYiKgjhshk4jtx3_1EOEEFL%23
Host: nc.xdf.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
As we can see above, the page is fetched with a GET request, so we'll use GET to mimic the browser as well.
Let's start with the following code, without setting any request headers, and see whether we can fetch the data:
#Name : GetWebBokk.py
#Date : 2020-07-14
import requests
message = requests.get("http://nc.xdf.cn/huodong/202002/058569030_2.html")
print(message)
This site is pretty lax: my User-Agent was the default python-requests one and it still wasn't filtered or blocked.
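Should the site ever start filtering the default python-requests User-Agent, a browser-style one can be attached to the request. Here's a minimal sketch using only the standard library's urllib (with requests it would be the `headers=` argument of `requests.get`); the request is only built here, never sent:

```python
from urllib.request import Request

# Build (but do not send) a request carrying a browser-style User-Agent.
req = Request(
    "http://nc.xdf.cn/huodong/202002/058569030_2.html",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
print(req.get_header("User-agent"))  # urllib normalizes the header name's capitalization
```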
OK, let's print what the request returned and see what it looks like.
It successfully printed the site's source code. Excellent.
Next, the following code uses BS4 (BeautifulSoup) to grab every `<a>` tag and print it:
#Name : GetWebBokk.py
#Date : 2020-07-14
#Use : Python 3.8
import requests
from bs4 import BeautifulSoup
message = requests.get("http://nc.xdf.cn/huodong/202002/058569030_2.html")
#print(message.text)
soup = BeautifulSoup(message.text, 'html5lib')  # parse the page with html5lib
for k in soup.find_all('a'):
    print(k)
Many of these URLs are not what we want, so further cleaning is needed.
#Name : GetWebBokk.py
#Date : 2020-07-14
#Use : Python 3.8
import requests
from bs4 import BeautifulSoup
message = requests.get("http://nc.xdf.cn/huodong/202002/058569030_2.html")
#print(message.text)
list1 = []
soup = BeautifulSoup(message.text, 'html5lib')  # parse the page with html5lib
for k in soup.find_all('a', target="_blank"):
    list1.append(k["href"])
print(list1)
We create a list and append every tag with `target="_blank"` to it; as you can see, the data is now clean and neatly stored in the list.
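As a side note, the same `target="_blank"` filtering can be done without BS4, using only the standard library's `html.parser`. A sketch on a tiny made-up snippet standing in for the real page:

```python
from html.parser import HTMLParser

class BlankTargetLinks(HTMLParser):
    """Collect href values of <a> tags whose target is "_blank"."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "a" and d.get("target") == "_blank":
            self.links.append(d.get("href"))

# Made-up HTML for illustration.
page = '<a href="/keep" target="_blank">yes</a><a href="/skip">no</a>'
parser = BlankTargetLinks()
parser.feed(page)
print(parser.links)  # ['/keep']
```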
Click any of the URLs below it, and you can see the Baidu Netdisk link looks like this.
Change the program to the following:
#Name : GetWebBokk.py
#Date : 2020-07-14
#Use : Python 3.8
import requests
from bs4 import BeautifulSoup
import re
message = requests.get("http://nc.xdf.cn/huodong/202002/058569030_2.html")
#print(message.text)
list1 = []
list2 = []
list3 = []
soup = BeautifulSoup(message.text, 'html5lib')  # parse the page with html5lib
for k in soup.find_all('a', target="_blank"):
    list1.append(k["href"])
for i in list1:
    try:
        msg = requests.get(i)
    except requests.RequestException:  # skip unreachable or malformed URLs
        pass
    else:
        soup1 = BeautifulSoup(msg.text, 'html5lib')
        for k in soup1.find_all('p', style="text-align:center;margin:0px auto;"):  # the centered <p> wrapping the link
            for j in k.find_all('a', style="text-decoration:none;"):  # search inside that <p>, not the whole page
                list2.append(j["href"])
for h in list2:
    if re.search(r"https://pan\.baidu\.com/s/", h):  # dots escaped so they match literally
        print(h)
        list3.append(h)
print(list3)
Now it can crawl these addresses for us automatically. Direct download links for Baidu Netdisk are notoriously hard to get, though, so that part is left to the reader.
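One caveat about the `re.search` pattern: in a regex, an unescaped dot matches *any* character, so a loose pattern can also match look-alike domains. A small self-contained comparison (both URLs are made up for illustration):

```python
import re

links = [
    "https://pan.baidu.com/s/1AbCdEf",    # real-looking share link
    "https://panXbaidu.com/s/lookalike",  # 'X' where a dot should be
]
# Unescaped dots match ANY character, so the look-alike slips through:
loose = [u for u in links if re.search("https://pan.baidu.com/s/", u)]
# Escaping the dots makes them match literal dots only:
strict = [u for u in links if re.search(r"https://pan\.baidu\.com/s/", u)]
print(loose)   # both URLs
print(strict)  # only the first
```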
Next, let's wrap the program into a function:
def GetUrl(url):
    message = requests.get(url)
    print("Fetched" + "\t" + url)
    #print(message.text)
    list1 = []
    list2 = []
    list3 = []
    soup = BeautifulSoup(message.text, 'html5lib')  # parse the page with html5lib
    for k in soup.find_all('a', target="_blank"):
        list1.append(k["href"])
    print(url + "\t" + "step 1 of 3 done")
    for i in list1:
        try:
            msg = requests.get(i)
        except requests.RequestException:  # skip unreachable or malformed URLs
            pass
        else:
            soup1 = BeautifulSoup(msg.text, 'html5lib')
            for k in soup1.find_all('p', style="text-align:center;margin:0px auto;"):  # the centered <p> wrapping the link
                for j in k.find_all('a', style="text-decoration:none;"):  # search inside that <p> only
                    list2.append(j["href"])
    print(url + "\t" + "step 2 of 3 done")
    for h in list2:
        if re.search(r"https://pan\.baidu\.com/s/", h):  # dots escaped so they match literally
            list3.append(h)
    print(url + "\t" + "step 3 of 3 done")
    print(url + "\t" + "printing links")
    for x in list3:
        print(x)
    return list3
The full code
#Name : GetWebBokk.py
#Date : 2020-07-14
#Use : Python 3.8
import requests
from bs4 import BeautifulSoup
import re
import threading
import time
def GetUrl(url):
    message = requests.get(url)
    print("Fetched" + "\t" + url)
    #print(message.text)
    list1 = []
    list2 = []
    list3 = []
    soup = BeautifulSoup(message.text, 'html5lib')  # parse the page with html5lib
    for k in soup.find_all('a', target="_blank"):
        list1.append(k["href"])
    print(url + "\t" + "step 1 of 3 done")
    for i in list1:
        try:
            msg = requests.get(i)
        except requests.RequestException:  # skip unreachable or malformed URLs
            pass
        else:
            soup1 = BeautifulSoup(msg.text, 'html5lib')
            for k in soup1.find_all('p', style="text-align:center;margin:0px auto;"):  # the centered <p> wrapping the link
                for j in k.find_all('a', style="text-decoration:none;"):  # search inside that <p> only
                    list2.append(j["href"])
    print(url + "\t" + "step 2 of 3 done")
    for h in list2:
        if re.search(r"https://pan\.baidu\.com/s/", h):  # dots escaped so they match literally
            print(h)
            list3.append(h)
    print(url + "\t" + "step 3 of 3 done")
    return list3

urls = [
    "http://nc.xdf.cn/huodong/202002/058569030.html",
    "http://nc.xdf.cn/huodong/202002/058569030_2.html",
    "http://nc.xdf.cn/huodong/202002/058569030_3.html",
    "http://nc.xdf.cn/huodong/202002/058569030_4.html",
    "http://nc.xdf.cn/huodong/202002/058569030_5.html",
]
downloadThreads = []
for u in urls:
    t = threading.Thread(target=GetUrl, args=(u,))
    downloadThreads.append(t)
    t.start()
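One thing the threaded version glosses over: `threading.Thread` discards the target's return value, and the main thread never waits for the workers to finish. A minimal sketch of both fixes, with a hypothetical dummy worker in place of `GetUrl` so no network is involved:

```python
import threading

results = []                  # shared list that collects what each worker finds
lock = threading.Lock()       # guard concurrent appends

def worker(page):             # stand-in for GetUrl; pretends to scrape one link
    found = [f"{page}-link"]
    with lock:
        results.extend(found)

threads = [threading.Thread(target=worker, args=(f"page{i}",)) for i in range(1, 6)]
for t in threads:
    t.start()
for t in threads:
    t.join()                  # block until every worker finishes
print(sorted(results))        # ['page1-link', 'page2-link', 'page3-link', 'page4-link', 'page5-link']
```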
Important note
If you repost this article, please include a link to the original.