Notes
I have been practicing scraping page content with XPath recently, so here are some notes.
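The XPath expressions in the code below may stop matching if the site changes; adjust them to the actual tag structure of the page.
It is best not to crawl too many pages in one run, so that the site's normal operation is not affected.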
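One way to keep the load on the site low is to cap the number of pages and pause between requests. The following is a minimal sketch, not part of the original script: the cap `MAX_PAGES`, the pause `DELAY_SECONDS`, and the helper `crawl_politely` are all hypothetical names and values chosen for illustration, and the actual download is passed in as `fetch`.

```python
import time

MAX_PAGES = 5        # hypothetical cap, not a limit published by the site
DELAY_SECONDS = 1.0  # hypothetical pause between requests


def crawl_politely(urls, fetch, max_pages=MAX_PAGES, delay=DELAY_SECONDS,
                   sleep=time.sleep):
    """Fetch at most max_pages URLs, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls[:max_pages]):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

With the script in this post, `fetch` could be something like `lambda url: requests.get(url, headers=headers).text`.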
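Update 2020-10-29: this code is not asynchronous. When it requests several URLs, the total time is the sum of the time spent on every page. A separate post covers a high-performance asynchronous version, and the difference in speed is obvious; the two are worth reading side by side: https://blog.csdn.net/MKKKKAA/article/details/109347852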
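To illustrate why the synchronous run time adds up page by page, here is a small sketch that replaces the network request with a simulated wait. `fake_fetch` and its 0.05 s delay are hypothetical stand-ins, not real measurements, and the thread pool is used only as a simple contrast; it is not the asynchronous approach from the linked post.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for requests.get: it only waits `delay` seconds."""
    time.sleep(delay)
    return url


def timed(run):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


urls = ['page/%d' % i for i in range(1, 6)]

# Sequential: each request waits for the previous one to finish,
# so the elapsed time is roughly the sum of all delays.
seq_pages, seq_time = timed(lambda: [fake_fetch(u) for u in urls])

# Concurrent: the waits overlap, so the elapsed time is close to one delay.
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    con_pages, con_time = timed(lambda: list(pool.map(fake_fetch, urls)))

print('sequential: %.2fs, concurrent: %.2fs' % (seq_time, con_time))
```

The exact numbers vary from run to run, but the sequential figure should grow with the number of pages while the concurrent one stays near a single delay.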
Source code
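import time

import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}
url_head = 'https://www.qiushibaike.com/text/page/'

print('---- single thread, multiple tasks, synchronous ----')
page = input('Number of pages to crawl: ')
start = time.time()

# Build the page URLs: .../page/1, .../page/2, ...
urls = [url_head + str(i) for i in range(1, int(page) + 1)]

# The with-block closes the file even if a request or an XPath lookup fails.
with open('./qiubai_download.txt', 'w', encoding='utf-8') as fp:
    for url in urls:
        # Requests run one after another, so the total time is the sum over all pages.
        resp = requests.get(url, headers=headers).text
        tree = etree.HTML(resp)
        # Each child div of the main column is one post.
        div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
        for div in div_list:
            author = div.xpath('./div[@class="author clearfix"]/a[2]/h2/text()')
            # xpath() returns an empty list when the node is missing;
            # fall back to '' instead of raising IndexError.
            author = author[0] if author else ''
            detail_text = div.xpath('.//div[@class="content"]/span[1]//text()')
            detail_text = ''.join(detail_text)
            fp.write(author + detail_text + '\n\n\n\n')

print('Elapsed (s):', time.time() - start)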
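
Because the XPath expressions depend on the page markup, it can help to try them on a small HTML snippet before crawling the live site. The sketch below uses a made-up fragment that only mimics the class names from the source code; the markup and the `parse_posts` helper are assumptions for illustration, not the site's real HTML or part of the original script.

```python
from lxml import etree

# Made-up HTML that mimics the class names used above; the real page may differ.
SAMPLE_HTML = '''
<div class="col1 old-style-col1">
  <div>
    <div class="author clearfix"><a href="#">avatar</a><a href="#"><h2>alice</h2></a></div>
    <div class="content"><span>hello <br/>world</span></div>
  </div>
  <div>
    <div class="author clearfix"><span><h2>anonymous</h2></span></div>
    <div class="content"><span>no author link here</span></div>
  </div>
</div>
'''


def parse_posts(html):
    """Return (author, text) pairs; author is '' when the XPath matches nothing."""
    tree = etree.HTML(html)
    posts = []
    for div in tree.xpath('//div[@class="col1 old-style-col1"]/div'):
        author = div.xpath('./div[@class="author clearfix"]/a[2]/h2/text()')
        text = ''.join(div.xpath('.//div[@class="content"]/span[1]//text()'))
        posts.append((author[0] if author else '', text.strip()))
    return posts


posts = parse_posts(SAMPLE_HTML)
print(posts)
```

The second post in the sample has no `a[2]/h2` node, so the author XPath returns an empty list for it. If the site changes its tags, editing `SAMPLE_HTML` to match the new markup is a quick way to check an updated expression.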