What do I mean by a "semi-crawler"? It's my own name for this kind of small crawler.
For example, some sites are dynamically rendered, and you only need a small part of the main page's markup. Instead of going to the trouble of driving the page with Selenium, you can open the browser's DevTools ("Inspect"), copy the parent element of the region you need, save it to a local txt file (UTF-8 encoded), and then parse it with this crawler.
For instance, to download episodes 241-250 of Peppa Pig (http://tv.sohu.com/s2015/fhzxm/), I just copy the relevant markup into a local file and parse it with this crawler.
The partial HTML saved in the local file is:
<ul class="serielist tebbcon" style="display: block;"><li><em class="num">241</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc1Mi5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第241集:猪爷爷的池塘</a></li><li><em class="num">242</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc1NC5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第242集:在很久以前</a></li><li><em class="num">243</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc1Ni5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第243集:警察局</a></li><li><em class="num">244</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc1OC5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第244集:等我长大以后</a></li><li><em class="num">245</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc2MC5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第245集:救护车</a></li><li><em class="num">246</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc2Mi5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第246集:医生</a></li><li><em class="num">247</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc2NC5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第247集:土豆超人</a></li><li><em class="num">248</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc2Ni5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第248集:兔爷爷的气垫船</a></li><li><em class="num">249</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc2OC5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第249集:幼儿园之星</a></li><li><em class="num">250</em><a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc3MC5zaHRtbA==.html" target="_blank" class="fs14 s-tit">第250集:嘉年华</a></li></ul>
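Before running the full script, it helps to confirm that a regex actually matches these anchors. Here is a minimal sketch using a two-episode excerpt of the snippet above; the pattern keys on the `fs14 s-tit` class that every episode link carries:

```python
import re

# Two-entry excerpt of the saved snippet above (shortened for illustration)
snippet = (
    '<li><em class="num">241</em>'
    '<a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc1Mi5zaHRtbA==.html"'
    ' target="_blank" class="fs14 s-tit">第241集:猪爷爷的池塘</a></li>'
    '<li><em class="num">242</em>'
    '<a href="//tv.sohu.com/v/MjAyMDA3MDkvbjYwMDg3OTc1NC5zaHRtbA==.html"'
    ' target="_blank" class="fs14 s-tit">第242集:在很久以前</a></li>'
)

# Capture the protocol-relative href of each episode anchor
pat = r'<a href="(//tv\.sohu\.com/v/.*?\.html)" target="_blank" class="fs14 s-tit">'
urls = ['https:' + u for u in re.findall(pat, snippet)]
print(urls)
```

Running this against the excerpt yields the two episode URLs with the `https:` scheme restored, which is exactly what the download step needs.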
The full Python code:
import re
import os
import threadpool  # third-party: pip install threadpool

def download(url):
    try:
        print('Downloading in thread pool:', url)
        os.system('you-get ' + url)
    except Exception:
        print('error ' + url)

with open('html代码.txt', 'r', encoding='utf-8') as file:
    txt = file.read()
# print(txt)

# Match the episode anchors shown in the snippet above
# (the <a ... class="fs14 s-tit"> links).
pat1 = r'<a href="(//tv\.sohu\.com/v/.*?\.html)" target="_blank" class="fs14 s-tit">'
url_htmls = re.compile(pat1).findall(txt)
# print(url_htmls)
print(len(url_htmls))

# The hrefs are protocol-relative, so prepend the scheme
url_htmls = ['https:' + x for x in url_htmls]
print(url_htmls)

# Download with 3 worker threads
pool = threadpool.ThreadPool(3)
reqs = threadpool.makeRequests(download, url_htmls)
for req in reqs:
    pool.putRequest(req)
pool.wait()
print('Download finished')
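If you'd rather not maintain a handwritten regex, the standard library's `html.parser` can walk the same saved snippet and collect the episode links by class name. A sketch under the same assumptions (the `fs14 s-tit` class comes from the snippet above; the sample entry here is a hypothetical stand-in for one `<li>` of the saved file):

```python
from html.parser import HTMLParser

class EpisodeLinkParser(HTMLParser):
    """Collect hrefs of <a ... class="fs14 s-tit"> anchors."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == 'a' and d.get('class') == 'fs14 s-tit' and 'href' in d:
            # hrefs are protocol-relative, so prepend the scheme
            self.urls.append('https:' + d['href'])

# Hypothetical sample entry in the saved snippet's format
sample = ('<li><em class="num">241</em>'
          '<a href="//tv.sohu.com/v/abc.html" target="_blank" '
          'class="fs14 s-tit">第241集</a></li>')
parser = EpisodeLinkParser()
parser.feed(sample)
print(parser.urls)
```

To use it on the real file, feed it the contents of `html代码.txt` instead of `sample`; this is slightly more robust than a regex if the attribute order ever changes, since the parser matches attributes by name rather than by position.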