Scraping an Entire Novel with BeautifulSoup + a Thread Pool
Target URL: https://www.bixuege.com/7_7120/
- Open the page, view its source, and check whether the content we want is there.
"""The source of this page is delightfully cooperative: it lays out exactly what we want."""
-
First, write a helper that fetches an HTML page:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
    "Referer": "https://www.bixuege.com/xuanhuan/"
}
res = None
try:
    res = requests.get(url, headers=headers)
    res.encoding = res.apparent_encoding  # guess the charset from the body
except Exception:
    return None
else:
    return res.text
finally:
    if res is not None:  # res is unset if requests.get() itself raised
        res.close()
-
Parse the HTML with BeautifulSoup and grab the nodes we need:
soup = BeautifulSoup(html, 'lxml')
div_list = soup.find('div', attrs={"id": "list"})
dd_list = div_list.find_all('dd')
-
Open a chapter page and check whether the chapter text is in the source as well.
It's practically a giveaway: the text sits right inside the div node with id "content".
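To double-check, a minimal probe works (the chapter URL below is a placeholder; substitute any href taken from the index page):

import requests
from bs4 import BeautifulSoup

chapter_url = "https://www.bixuege.com/7_7120/1234.html"  # hypothetical URL
res = requests.get(chapter_url, headers={"User-Agent": "Mozilla/5.0"})
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'lxml')
div = soup.find('div', attrs={"id": "content"})
print(div is not None)   # True if the chapter text is server-rendered
print(div.text[:100])    # preview the opening of the chapter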
-
Write a method that saves the content:
soup = BeautifulSoup(html, 'lxml')
content = soup.find('div', attrs={"id": "content"}).text
with open(path, 'w', encoding='utf-8') as f:
    f.write(content)
print('[%s] saved!' % title)
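One caveat: chapter titles can contain characters that are illegal in file names (e.g. ? or *), which would make open() fail. A small sanitizing helper, my own addition rather than part of the original script, sidesteps that:

import re

def safe_filename(title):
    # Replace characters that Windows (and some filesystems) forbid in file names.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# e.g. path = '笔趣阁小说/' + safe_filename(title) + '.txt'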
-
Add a thread pool to speed up the crawl:
with ThreadPoolExecutor(100) as t:
    for content in contents:
        url, title = content
        t.submit(save_contents, url, title)
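Note that t.submit discards exceptions unless the returned futures are inspected, so a failed chapter disappears silently. A sketch of how failures could be surfaced, assuming the same save_contents signature:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(100) as t:
    futures = {t.submit(save_contents, url, title): title
               for url, title in contents}
    for fut in as_completed(futures):
        try:
            fut.result()  # re-raises any exception from the worker thread
        except Exception as e:
            print('[%s] failed: %s' % (futures[fut], e))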
Run it, then take a look at the saved content.
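A quick spot check of the output, assuming the run has completed:

import os

files = os.listdir('./笔趣阁小说')
print('%d chapters saved' % len(files))
# Preview the first 200 characters of one chapter.
with open(os.path.join('笔趣阁小说', files[0]), encoding='utf-8') as f:
    print(f.read()[:200])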
Full source code:
import os
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor


def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
        "Referer": "https://www.bixuege.com/xuanhuan/"
    }
    res = None
    try:
        res = requests.get(url, headers=headers)
        res.encoding = res.apparent_encoding  # guess the charset from the body
    except Exception:
        return None
    else:
        return res.text
    finally:
        if res is not None:  # res is unset if requests.get() itself raised
            res.close()


def get_content_url(html):
    # Collect (chapter URL, chapter title) pairs from the index page.
    content_info = []
    soup = BeautifulSoup(html, 'lxml')
    div_list = soup.find('div', attrs={"id": "list"})
    dd_list = div_list.find_all('dd')
    # The first few <dd> entries are the "latest chapters" block; skip them.
    for dd in dd_list[9:]:
        href, title = dd.a['href'], dd.a.text
        content_info.append(("https://www.bixuege.com" + href, title))
    return content_info or None


def save_contents(url, title):
    # exist_ok avoids a race when 100 threads hit this check at once.
    os.makedirs('./笔趣阁小说', exist_ok=True)
    path = '笔趣阁小说/' + title + '.txt'
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')
    content = soup.find('div', attrs={"id": "content"}).text
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
    print('[%s] saved!' % title)


def run():
    url = "https://www.bixuege.com/7_7120/"
    html = get_html(url)
    contents = get_content_url(html)
    with ThreadPoolExecutor(100) as t:
        for content in contents:
            url, title = content
            t.submit(save_contents, url, title)


if __name__ == '__main__':
    run()
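One possible refinement, not in the original script: reusing TCP connections via requests.Session instead of opening a new connection per chapter. A single Session should not be shared across threads blindly, so a common pattern is one Session per worker thread through threading.local; get_html would then call get_session().get(url) instead of requests.get(url, headers=headers):

import threading
import requests

_local = threading.local()

def get_session():
    # Lazily create one Session per worker thread: connection pooling
    # without sharing a single Session object across threads.
    if not hasattr(_local, 'session'):
        _local.session = requests.Session()
        _local.session.headers.update({"User-Agent": "Mozilla/5.0"})
    return _local.session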