1. How do I download it?
Someone has already packaged this up:
https://github.com/nyu-mll/GLUE-baselines
After downloading the source, edit lines 44 and 45 of download_glue_data.py so that they read:
MRPC_TRAIN = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt'
MRPC_TEST = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt'
Then run:
python3 download_glue_data.py --data_dir glue_data
If all goes well, the data appears in a glue_data directory under the current directory. It totals about 2.9 GB, of which MRPC is only about 3 MB.
Reference: https://github.com/nyu-mll/GLUE-baselines/issues/11
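
The line numbers in the tracebacks in this note differ between runs (144 in one, 147 in the other), so lines 44 and 45 may not be the right place in every copy of the script. A minimal sketch of making the same edit by variable name instead of by line number follows. It assumes the script defines MRPC_TRAIN and MRPC_TEST as top-level assignments; the helper names patch_mrpc_urls and patch_file are hypothetical and not part of GLUE-baselines.

```python
import re
from pathlib import Path

# Replacement URLs, copied from the two lines quoted above.
NEW_URLS = {
    "MRPC_TRAIN": "https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt",
    "MRPC_TEST": "https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt",
}


def patch_mrpc_urls(source: str) -> str:
    """Return the script text with the MRPC_TRAIN / MRPC_TEST assignments rewritten."""
    for name, url in NEW_URLS.items():
        # Match a whole line that starts with "<name> =" (assumed top-level assignment).
        pattern = re.compile(rf"^{name}\s*=.*$", flags=re.MULTILINE)
        source = pattern.sub(lambda m, n=name, u=url: f"{n} = '{u}'", source)
    return source


def patch_file(path: str) -> None:
    """Hypothetical helper: rewrite the script at `path` in place."""
    script = Path(path)
    script.write_text(patch_mrpc_urls(script.read_text(encoding="utf-8")), encoding="utf-8")
```

Calling patch_file("download_glue_data.py") from the repository directory would apply the edit. Lines that do not start with one of the two names are left untouched.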
2. Possible problem 1:
Processing MRPC...
Traceback (most recent call last):
File "download_glue_data.py", line 144, in <module>
sys.exit(main(sys.argv[1:]))
File "download_glue_data.py", line 136, in main
format_mrpc(args.data_dir, args.path_to_mrpc)
File "download_glue_data.py", line 68, in format_mrpc
URLLIB.urlretrieve(MRPC_TRAIN, mrpc_train_file)
File "/persist/conda/lib/python3.6/urllib/request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/persist/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/persist/conda/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/persist/conda/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/persist/conda/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/persist/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/persist/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Solution: as described in point 1 above, edit lines 44 and 45 of download_glue_data.py.
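
To tell a dead URL (the 404 above) apart from other failures before rerunning the whole script, one option is to ask the server for the status code only. The following is a rough sketch, not part of the original script: url_status is a hypothetical helper, and it assumes the server answers HEAD requests, which has not been verified for these hosts.

```python
import urllib.error
import urllib.request


def url_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for `url`, including error codes such as 404."""
    # HEAD asks for headers only, so the file itself is not downloaded.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is available on the exception.
        return err.code
```

A result of 404 for the URL currently in the script, and 200 for the replacement, would suggest that the URL edit is the right fix. The opener parameter exists only so the function can be exercised without network access.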
3. Possible problem 2:
Traceback (most recent call last):
File "download_glue_data.py", line 147, in <module>
sys.exit(main(sys.argv[1:]))
File "download_glue_data.py", line 139, in main
format_mrpc(args.data_dir, args.path_to_mrpc)
File "download_glue_data.py", line 75, in format_mrpc
URLLIB.urlretrieve(TASK2PATH["MRPC"], os.path.join(mrpc_dir, "dev_ids.tsv"))
File "/home/howard/anaconda3/envs/py36/lib/python3.6/urllib/request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/home/howard/anaconda3/envs/py36/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/home/howard/anaconda3/envs/py36/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/home/howard/anaconda3/envs/py36/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/home/howard/anaconda3/envs/py36/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/home/howard/anaconda3/envs/py36/lib/python3.6/urllib/request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/home/howard/anaconda3/envs/py36/lib/python3.6/urllib/request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 104] Connection reset by peer>
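Cause: the connection to the server evidently cannot be established.
Solution: go through a proxy or VPN to get around the network block.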
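
If a proxy is available, Python's urllib can be pointed at it. The standard library reads the http_proxy / https_proxy environment variables by default, so exporting those before running the script may already be enough. Alternatively, a proxy can be installed explicitly, as in the sketch below. The proxy address in the usage note is a placeholder, and make_proxy_opener and install_proxy are hypothetical helper names; this code would have to be added near the top of the script.

```python
import urllib.request


def make_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends http and https requests through `proxy_url`."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def install_proxy(proxy_url: str) -> None:
    """Make later urllib.request.urlopen / urlretrieve calls use the proxy."""
    urllib.request.install_opener(make_proxy_opener(proxy_url))
```

For example, install_proxy("http://127.0.0.1:8080") would route the script's subsequent urlretrieve calls through a local proxy listening on that (assumed) port.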