After recently learning web scraping, I tried crawling the novels on a certain website to consolidate what I had learned. Unexpectedly I ran into a lot of problems; as the saying goes, you never know until you try. Fortunately, after persistent effort and some late nights, I solved every issue with scraping the full set of novels, and the crawler ran for a whole day and night without a single error.
1. Issues with xpath (and related problems)
(1) When extracting the chapter body, the id attribute of the tag changes from chapter to chapter. Fortunately the attribute value can be found in the page source, so extracting it first with a regular expression and then building the xpath from it works. For example:
xpathsx = re.findall(r"document\.all\('(.*?)'\)\.style", html)
(2) Different chapters wrap the body in differently named tags. That is also easy to handle: I use string(.) to extract all the text regardless of the tag names, e.g. nr = et1[0].xpath('string(.)')
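Putting (1) and (2) together, here is a minimal sketch of the extraction step (the helper name extract_body is my own; it assumes the chapter body sits in a div whose id appears in the page's inline document.all('...').style script, and html is the already-downloaded page source):
import re
from lxml import etree

def extract_body(html):
    # The id of the content div changes per chapter, but it shows up in an
    # inline script as document.all('<id>').style, so pull it out with a regex first.
    div_id = re.findall(r"document\.all\('(.*?)'\)\.style", html)[0]
    et = etree.HTML(html)
    node = et.xpath('//div[@id="{}"]'.format(div_id))[0]
    # string(.) concatenates all descendant text, so the tag names used
    # inside the div no longer matter.
    return node.xpath('string(.)').strip()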
(3) Different chapters require different cookies in the request. Since reverse-engineering the cookie is troublesome, I used selenium, which solves it easily. However, because I browsed in headless mode (no browser window), and my code originally did not call driver.close() and driver.quit() to close the browser and exit chromedriver.exe, the related processes piled up on the CPU. As a result, after scraping for a while the error Message: chrome not reachable appeared and interrupted the crawl. The corrected code is as follows:
import time
import random
from selenium import webdriver

def hqcookies(url):
    global headers
    hqcount = 1
    while 1:
        print(f'Starting attempt {hqcount} to fetch the cookie')
        try:
            option = webdriver.ChromeOptions()
            option.add_argument('--headless')  # headless mode: no browser window
            driver = webdriver.Chrome(options=option)
            driver.get(url)
            cookies = driver.get_cookies()
            format_cookie = ''.join([f'{i["name"]}={i["value"]}; ' for i in cookies])[:-2]
            headers['cookie'] = format_cookie
            time.sleep(1)
            driver.close()  # close the browser window ...
            time.sleep(1)
            driver.quit()   # ... and exit chromedriver.exe, otherwise stale processes pile up
            zcbs = 1        # success flag
        except Exception as e:
            print(f'-----selenium attempt {hqcount} to fetch the cookie failed:\n-----{e}\n-----retrying after 5-9 seconds')
            wtime = random.randint(5, 9)
            zcbs = 0
            time.sleep(wtime)
        if zcbs == 1:
            print(f'Cookie fetched successfully on attempt {hqcount}!')
            break
        hqcount += 1
    return headers
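A quick usage sketch, with a placeholder chapter URL (the real ones come from the site's catalogue page); after the call, the next requests.get made with headers carries the fresh cookie:
headers = {'User-Agent': 'Mozilla/5.0'}  # module-level dict shared by all the functions
hqcookies('https://www.example.com/book/123/456.html')  # placeholder URL; refreshes headers['cookie'] in place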
(4) Requests hanging. Sometimes a request sits there indefinitely, neither raising an exception nor returning. The fix is to set a timeout and wrap the call in a try ... except statement, so that once the timeout is exceeded an exception is raised and the request is retried.
import requests

def requestillcl(url):
    global headers
    global jmfs   # page encoding used to decode the response
    global uas    # pool of User-Agent strings
    quescs = 1    # request attempt counter
    while 1:
        try:
            html = requests.get(url, headers=headers, timeout=30)
            html.encoding = jmfs
            html = html.text
            requestsuc = 1  # success flag: 1 means the request succeeded, 0 means it failed
        except Exception as e:
            print(e)
            requestsuc = 0
            wtime = quescs * 3 * random.randint(2, 9)
            print(f'Attempt {quescs} to request {url} failed; starting attempt {quescs + 1}, please wait {wtime} seconds')
            time.sleep(wtime)
            quescs = quescs + 1
        if requestsuc == 1:
            break
        if quescs >= 2:
            print(f'Attempt {quescs} to request {url} failed; resetting User-Agent and retrying')
            headers['User-Agent'] = random.choice(uas)
        if quescs == 10:
            print(f'Still not successful after 10 attempts; please check whether the URL is correct: {url}')
            html = f'Still not successful after 10 attempts; please check whether the URL is correct: {url}'
            break
    return html
(5) Chapter content split over several pages, or a chapter that has been deleted. When scraping a chapter, a loop fetches the paginated content page by page; if the chapter has been deleted, the correct chapter link has to be obtained again (or filled in manually) before scraping.
import re
from lxml import etree

def zjnrhq(url):
    global uas
    url1 = re.findall(r'(.*?)\.html', url)[0]  # chapter URL without the .html suffix, used to build page 2, 3, ...
    delbz = 0      # deleted-chapter flag: 1 means the chapter has been removed
    requcount = 1
    while 1:
        html = requestillcl(url)
        xpathsx = re.findall(r"document\.all\('(.*?)'\)\.style", html)
        if len(html) > 800 and len(xpathsx) != 0:
            xpathsx = xpathsx[0]
            print(f'Attempt {requcount}: request of {url} after resetting the cookie succeeded')
            break
        elif len(html) < 100:
            print(url)
            print('This chapter has been deleted, please fill it in manually')
            delbz = 1
            break
        else:
            if requcount >= 2:
                print(f'Attempt {requcount}: request of {url} after resetting the cookie failed; resetting User-Agent and starting attempt {requcount + 1}!')
                headers['User-Agent'] = random.choice(uas)
            hqcookies(url)
            requcount += 1
    if delbz == 0:
        et = etree.HTML(html)
        maxpage = et.xpath('//div[@id="PageSet"]/a/text()')
        pagesetbz = len(maxpage)  # pagination flag: 0 means the chapter is not split over several pages
        if pagesetbz != 0:
            maxpage = maxpage[-2]  # the second-to-last link text is the highest page number
        et1 = et.xpath('//div[@id="{}"]'.format(xpathsx))
        nr = et1[0].xpath('string(.)').strip()
        nr = ' ' + nr
        if pagesetbz != 0:
            for i in range(2, int(maxpage) + 1):
                url0 = '{}-{}.html'.format(url1, str(i))
                qupagecount = 1
                while 1:
                    print(f'Attempt {qupagecount}: requesting --{url0}--')
                    html0 = requestillcl(url0)
                    xpathsx = re.findall(r"document\.all\('(.*?)'\)\.style", html0)
                    if len(xpathsx) != 0:
                        print(f'Attempt {qupagecount}: request of --{url0}-- succeeded!')
                        xpathsx = xpathsx[0]
                        break
                    qupagecount += 1
                et = etree.HTML(html0)
                et1 = et.xpath('//div[@id="{}"]'.format(xpathsx))
                nr0 = et1[0].xpath('string(.)').strip()
                nr = nr + nr0
        nr = nr.strip()
        nr = nr.strip('"')
        nr = nr.strip('。')
        nr = nr.strip()
        nr = nr.replace(' ', '\n ')  # start a new line before each paragraph indent
        nr = ' ' + nr
    else:
        nr = 'This chapter has been deleted, please fill it in manually'
    return nr, delbz
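Finally, a minimal sketch of how the three functions could be wired together in a main loop. Everything site-specific below is a placeholder assumption: the catalogue URL, the xpath used to collect the chapter links, the gbk value assigned to jmfs, the User-Agent pool, and the output file name.
import random
from lxml import etree

if __name__ == '__main__':
    uas = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...']  # pool of User-Agent strings (fill in real ones)
    jmfs = 'gbk'                                             # assumed page encoding
    headers = {'User-Agent': random.choice(uas)}
    mlurl = 'https://www.example.com/book/123/'              # placeholder catalogue URL
    hqcookies(mlurl)                                         # get an initial cookie
    et = etree.HTML(requestillcl(mlurl))
    zjurls = et.xpath('//div[@id="list"]//a/@href')          # placeholder xpath; assumes absolute chapter URLs
    with open('novel.txt', 'w', encoding='utf-8') as f:
        for zjurl in zjurls:
            hqcookies(zjurl)                                 # each chapter needs its own cookie
            nr, delbz = zjnrhq(zjurl)
            if delbz == 1:
                print(f'Deleted chapter, fill in manually: {zjurl}')
            f.write(nr + '\n')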
Running result: