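Background
WenetSpeech is an open-source Chinese ASR dataset. Repository: wenet-e2e/WenetSpeech: A 10000+ hours dataset for Chinese speech recognition (github.com)
After you fill out the application form and receive the password, the project offers two download methods: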
- from tencent meeting
- from modelscope
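My machine could not connect to the Tencent Meeting download address, so I had to download through ModelScope. Following the steps in the project README, however, runs into a number of problems. Beyond the changes the README already points out, the `download_wenetspeech.sh` script needs several more modifications:
1. Still downloading the User Agreement from http://wenet.meeting.tencent.com/?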
utils/download_wenetspeech.sh: Start to download WenetSpeech user agreement
--2024-08-27 16:19:00-- http://wenet.meeting.tencent.com/WenetSpeech/TERMS_OF_ACCESS
Resolving wenet.meeting.tencent.com (wenet.meeting.tencent.com)... failed: Name or service not known.
Fix: set `stage` in the script to 1 (or 2) so that the user agreement step, which only runs when `stage` is 0, is skipped.
stage=1 # set stage to 1 or 2 to skip the following step
# User agreement
if [ $stage -le 0 ]; then
  echo "$0: Start to download WenetSpeech user agreement"
  wget -c -P $download_dir \
    ${WENETSPEECH_RELEASE_URL}/TERMS_OF_ACCESS || exit 1;
  GREEN='\033[0;32m'
  NC='\033[0m' # No Color
  echo -e "${GREEN}"
  echo -e "BY PROCEEDING YOU AGREE TO THE FOLLOWING WENETSPEECH TERMS OF ACCESS:"
  echo -e ""
  echo -e "=============== WENETSPEECH DATASET TERMS OF ACCESS ==============="
  cat $download_dir/TERMS_OF_ACCESS
  echo -e "=================================================================="
  echo -e "$0: WenetSpeech downloading will start in 5 seconds"
  echo -e ""
  for t in $(seq 5 -1 1); do
    echo -e "$t"
    sleep 1
  done
  echo -e "${NC}"
fi
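To see why `stage=1` skips only the user agreement, it helps to look at the gating pattern in isolation. The sketch below is illustrative: `should_run` is a hypothetical helper that does not exist in `download_wenetspeech.sh`. It simply wraps the same `[ $stage -le N ]` test the script uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of download_wenetspeech.sh): succeeds when
# the step numbered $1 should run for the current value of $stage.
should_run() {
  [ "$stage" -le "$1" ]
}

stage=1
for step in 0 1 2 3; do
  if should_run "$step"; then
    echo "step $step: run"
  else
    echo "step $step: skipped"
  fi
done
```

With `stage=1`, step 0 is the only one skipped. Every step whose number is at least `stage` still runs.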
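2. Files are not downloaded to the specified <DOWNLOAD_DIR>?
During the download I noticed that the files were not going to the specified location but to the default cache path ~/.cache. The reason is as follows:
I am using Python 3.12.4 and modelscope 1.15.0.
(P.S. Installing modelscope as the README describes pulls the latest version by default, and the latest version requires Python 3.8 or above, so the Python 3.7 specified in the README does not work.)
In this modelscope version, the variable that sets the default cache path for dataset downloads is named `MS_DATASETS_CACHE`, not the `CACHE_HOME` used in the script. Change it as follows: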
# Download from list
if [ $stage -le 2 ]; then
  echo "$0: Start to download WenetSpeech files(*.aes.tgz)"
  if [ $modelscope == true ]; then
    MS_DATASETS_CACHE=$download_dir python utils/download_from_modelscope.py # change CACHE_HOME to MS_DATASETS_CACHE
  else
    grep -v '^#' metadata/v1.list | (while read line; do
      download_object_from_release $line || exit 1;
    done) || exit 1;
  fi
fi
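If you are unsure which variable your installed modelscope reads, one option is to set both for the download command. This is a minimal sketch, assuming that the variable a given version does not read is simply ignored, which has not been verified across versions. `run_with_cache_dir` is a hypothetical wrapper, not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command with both cache variables pointing at
# the same directory. CACHE_HOME is the name used by the original script,
# MS_DATASETS_CACHE the name observed with modelscope 1.15.0.
run_with_cache_dir() {
  local cache_dir=$1
  shift
  env CACHE_HOME="$cache_dir" MS_DATASETS_CACHE="$cache_dir" "$@"
}

# Inside download_wenetspeech.sh the call might look like:
#   run_with_cache_dir "$download_dir" python utils/download_from_modelscope.py
```

Using `env` keeps the assignments scoped to the child process, so the rest of the script's environment is unchanged.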
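3. TERMS_OF_ACCESS cannot be found?
Because the TERMS_OF_ACCESS download was already skipped in item 1, the following line has to be commented out so that it is not executed: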
# Process data
if [ $stage -le 3 ]; then
  echo "$0: Start to process the downloaded files(*.aes.tgz)"
  # HERE: commented out because TERMS_OF_ACCESS was never downloaded
  # cp $download_dir/TERMS_OF_ACCESS $untar_dir
  if [ $modelscope == true ]; then
    ms_download_dir=$download_dir/modelscope/hub/datasets/downloads/wenet/WenetSpeech/master
    for tgz in `ls $ms_download_dir | grep -v '\.'`; do
      process_downloaded_object ${ms_download_dir}/$tgz || exit 1;
    done || exit 1;
  else
    grep -v '^#' metadata/v1.list | (while read md5 tgz; do
      process_downloaded_object ${download_dir}/$tgz || exit 1;
    done) || exit 1;
  fi
fi
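An alternative to commenting the line out is to copy the file only when it exists, so the same script still works if the agreement was downloaded. This is a sketch, not part of the original script: `copy_terms_if_present` is a hypothetical helper, and it assumes the destination directory already exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy TERMS_OF_ACCESS from $1 to $2 if it was
# downloaded, otherwise print a note to stderr and carry on.
copy_terms_if_present() {
  local src_dir=$1 dst_dir=$2
  if [ -f "$src_dir/TERMS_OF_ACCESS" ]; then
    cp "$src_dir/TERMS_OF_ACCESS" "$dst_dir"/
  else
    echo "TERMS_OF_ACCESS not found in $src_dir, skipping copy" >&2
  fi
}

# In the script this would replace the cp line:
#   copy_terms_if_present "$download_dir" "$untar_dir"
```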
4. The directory of the downloaded dataset cannot be found?
Error:
ls: cannot access '<download_dir>/modelscope/hub/datasets/downloads/wenet/WenetSpeech/master': No such file or directory
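Cause: with the current modelscope version, the dataset is not stored under `<download_dir>/modelscope/hub/datasets/downloads/wenet/WenetSpeech/master` but under:
`<download_dir>/wenet/WenetSpeech/master/`. Changing only this path is still not enough, though, because:
5. The entry under the download path is a directory?
Error: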
utils/download_wenetspeech.sh: Processing <download_dir>/wenet/WenetSpeech/master/data_files
error reading input file
4097C60A257F0000:error:80000015:system library:file_read:Is a directory:crypto/bio/bss_file.c:148:calling fread()
4097C60A257F0000:error:10080002:BIO routines:file_read:system lib:crypto/bio/bss_file.c:150:
The files themselves are not stored in `master` but one level deeper, in `data_files`.
Combining items 4 and 5, the change is as follows:
# Process data
if [ $stage -le 3 ]; then
  echo "$0: Start to process the downloaded files(*.aes.tgz)"
  # cp $download_dir/TERMS_OF_ACCESS $untar_dir
  if [ $modelscope == true ]; then
    # here!! change ms_download_dir
    ms_download_dir=$download_dir/wenet/WenetSpeech/master/data_files
    for tgz in `ls $ms_download_dir | grep -v '\.'`; do
      process_downloaded_object ${ms_download_dir}/$tgz || exit 1;
    done || exit 1;
  else
    grep -v '^#' metadata/v1.list | (while read md5 tgz; do
      process_downloaded_object ${download_dir}/$tgz || exit 1;
    done) || exit 1;
  fi
fi
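Since the directory layout appears to depend on the modelscope version, the path could also be detected instead of hard-coded. The sketch below checks only the two layouts mentioned in this post. `find_ms_download_dir` is a hypothetical helper, and other modelscope versions may use layouts not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first existing download directory under $1,
# trying the layout seen with modelscope 1.15.0 first, then the layout the
# original script expects. Returns 1 if neither exists.
find_ms_download_dir() {
  local base=$1 candidate
  for candidate in \
    "$base/wenet/WenetSpeech/master/data_files" \
    "$base/modelscope/hub/datasets/downloads/wenet/WenetSpeech/master"; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# In the script this might be used as:
#   ms_download_dir=$(find_ms_download_dir "$download_dir") || exit 1
```

Failing early when neither directory exists gives a clearer error than the `ls` or `openssl` messages shown above.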
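With the changes above, the WenetSpeech download from ModelScope completes normally.
Issues 2, 4, and 5 could perhaps be avoided by installing an older modelscope version, but the project's README does not say which modelscope version the script was written against, so...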
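Because the behavior seems tied to the modelscope version, it is worth recording which version is installed before debugging the script. This is a small sketch, assuming `python3` is on the PATH (the script itself calls `python`) and Python 3.8 or later, where `importlib.metadata` is available. `modelscope_version` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the installed modelscope version, or "unknown"
# if it cannot be determined (package missing, no python3, or Python < 3.8).
modelscope_version() {
  python3 -c 'from importlib.metadata import version; print(version("modelscope"))' \
    2>/dev/null || echo "unknown"
}

echo "modelscope: $(modelscope_version)"
```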