I had some free time recently and wondered whether the novel I was reading could be pulled down with a web scraper, so I wrote one as my first real attempt at crawling. It didn't take long, and the result turned out reasonably well. There are plenty of Biquge (笔趣阁) mirror sites; the one I picked here is biquge365.net.
import requests
from bs4 import BeautifulSoup
import threading
def fetch_chapter_content(url, title, results):
    # Download a single chapter page and store its text under the chapter title.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    content = soup.find('div', class_='txt').text
    results[title] = content
def run_threads(thread_list):
    # Start the whole batch first, then join, so the threads in a batch
    # actually run concurrently instead of one after another.
    for thread in thread_list:
        thread.start()
    for thread in thread_list:
        thread.join()
def get_novel():
    url = 'https://www.biquge365.net/newbook/xxxx/'  # put the URL of the novel you want here
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # The chapter index sits inside the 'border' div; each <a> tag is one chapter.
    chapter_list = soup.find('div', class_='border').find_all('a')
    threads = []
    results = {}
    title_list = []
    for chapter in chapter_list:
        chapter_url = 'https://www.biquge365.net' + chapter['href']
        chapter_title = chapter['title']
        title_list.append(chapter_title)
        thread = threading.Thread(target=fetch_chapter_content, args=(chapter_url, chapter_title, results))
        threads.append(thread)
    # Run at most 100 threads per batch
    num_threads_per_iteration = 100
    num_threads = len(threads)
    for i in range(0, num_threads, num_threads_per_iteration):
        thread_subset = threads[i:i + num_threads_per_iteration]
        run_threads(thread_subset)
    # Stitch the chapters back together in their original order.
    novel_content = ''
    for real_title in title_list:
        novel_content += f"\n{real_title}\n\n{results[real_title]}\n\n"
    return novel_content
if __name__ == '__main__':
    novel = get_novel()
    with open('E:\\novel.txt', 'w', encoding='utf-8') as file:
        file.write(novel)
    print('The novel has been saved to novel.txt.')
Fetching the chapters one by one took far too long at first, so I switched to threads. You can't launch too many at once, though, or the server notices and drops the connection, which is why the script above runs the threads in batches. A thread-pool sketch of the same idea is included below.
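As a rough alternative to the manual batching, the standard-library concurrent.futures.ThreadPoolExecutor can cap how many requests are in flight at once via max_workers. This is only a sketch under my own assumptions: the fetch_chapter helper, the max_workers value of 20, and the (url, title) chapter list are illustrative choices, not part of the original script.

from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}

def fetch_chapter(url, title):
    # Same parsing as fetch_chapter_content above, but returns the result
    # instead of writing into a shared dict. (Hypothetical helper.)
    response = requests.get(url, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    return title, soup.find('div', class_='txt').text

def fetch_all(chapters, max_workers=20):
    # chapters: list of (url, title) pairs built from the chapter index.
    # max_workers limits concurrent requests, so the load on the server
    # stays steady without hand-rolled batches of threads.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_chapter, url, title) for url, title in chapters]
        for future in as_completed(futures):
            title, content = future.result()
            results[title] = content
    return results

# Example use with a chapter list like the one built in get_novel():
# results = fetch_all(list(zip(chapter_urls, chapter_titles)), max_workers=20)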
In the end it worked, so I no longer have to hunt around the web for txt copies.