There are quite a few NLTK datasets, and clicking through them one by one gets painful fast. Below is a small script I wrote that you can use:
import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import os
import time

# The NLTK data page is http://www.nltk.org/nltk_data/
# Just save it locally ("Save As" in the browser); put the saved file's path here.
nltk_page = 'nltk.html'

# Downloading from mainland China is unreliable, so I go through a proxy here.
# SOCKS5 proxies require the extra dependency: pip install requests[socks]
# If you don't have a proxy, leave a comment and I will update the data promptly.
proxies = {'http': 'socks5://localhost:1080', 'https': 'socks5://localhost:1080'}
def save_data(data, path):
    if not os.path.exists(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))
    with open(path, 'wb') as fw:
        fw.write(data)
# Pretend to be a regular browser.
req_data = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0"}

# Read the saved copy of the nltk_data page.
with open(nltk_page, 'r') as fr:
    data = fr.read()
while True:
    total_num, suc_num, err_num = 0, 0, 0
    soup = BeautifulSoup(data, 'html.parser')
    for link in soup.find_all('a'):
        if link.get_text().strip().lower() == 'download':
            link_url = link.get('href')
            save_path = urlparse(link_url).path.strip('/')
            total_num += 1
            if os.path.exists(save_path):
                continue  # already downloaded on an earlier pass
            time.sleep(3)  # be polite to the server
            try:
                r = requests.get(link_url, headers=req_data, proxies=proxies, timeout=60)
                r.raise_for_status()  # count HTTP errors as failures too
                suc_num += 1
                print(total_num, '[%d] down:' % suc_num, save_path)
            except Exception:
                err_num += 1
                print(total_num, '[%d] error:' % err_num, link_url)
                continue
            save_data(r.content, save_path)
    print('total_num:', total_num, 'suc_num:', suc_num, 'err_num:', err_num)
    if err_num == 0:
        print('download finished!')
        break
    else:
        print('retrying failed downloads')
        print('-----------------------------------------------------')
        time.sleep(10)
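
After the script finishes, the zips sit in a local mirror of the download URLs' paths, and NLTK still has to find and unpack them. Here is a minimal sketch of that step; the src_root path and the ~/nltk_data target directory are assumptions based on the layout the script produces, so adjust them to your machine:

import os
import zipfile
import nltk

# Assumed locations: adjust both to match your setup.
src_root = 'nltk/nltk_data/gh-pages/packages'      # tree written by the script above
nltk_data_dir = os.path.expanduser('~/nltk_data')  # any directory on nltk.data.path works
nltk.data.path.append(nltk_data_dir)

# Unpack every zip into its category folder (corpora/, tokenizers/, taggers/, ...),
# which is the layout NLTK expects inside an nltk_data directory.
for category in os.listdir(src_root):
    cat_dir = os.path.join(src_root, category)
    if not os.path.isdir(cat_dir):
        continue
    for name in os.listdir(cat_dir):
        if name.endswith('.zip'):
            dst = os.path.join(nltk_data_dir, category)
            os.makedirs(dst, exist_ok=True)
            with zipfile.ZipFile(os.path.join(cat_dir, name)) as zf:
                zf.extractall(dst)

Note that NLTK can read many packages directly from the zip files, so simply copying the zips into the right category folders under nltk_data often works without extracting.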
Downloaded NLTK data, Baidu Netdisk link:
Link: https://pan.baidu.com/s/1r0qPlhXF3ScAG1I3U1_Qdg  Extraction code: v166
PS: This data is for learning purposes only. For commercial use, please contact the data providers and give them more support. If you have any other questions, leave a comment!!
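
For completeness: if you have an HTTP proxy (rather than SOCKS5), NLTK's built-in downloader can use it directly via nltk.set_proxy, which is documented in NLTK's installation guide. The proxy address below is just a placeholder:

import nltk

nltk.set_proxy('http://localhost:8118')  # placeholder; use your own HTTP proxy address
nltk.download('popular')                 # or nltk.download() for the interactive picker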