WenetSpeech: Troubleshooting the ModelScope Download

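Background

WenetSpeech is an open-source Chinese ASR dataset. Project page: wenet-e2e/WenetSpeech: A 10000+ hours dataset for Chinese speech recognition (github.com)

After you fill in the application form and receive the password, the project offers two ways to download the data: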

  • from tencent meeting
  • from modelscope

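I could not connect to the Tencent Meeting download address from my machine, so I had to download through ModelScope. Following the steps in the project README, however, runs into the series of problems below. Beyond the change the README already points out, the `download_wenetspeech.sh` script needs several further modifications:

1. The User Agreement is still downloaded from http://wenet.meeting.tencent.com/?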

utils/download_wenetspeech.sh: Start to download WenetSpeech user agreement
--2024-08-27 16:19:00--  http://wenet.meeting.tencent.com/WenetSpeech/TERMS_OF_ACCESS
Resolving wenet.meeting.tencent.com (wenet.meeting.tencent.com)... failed: Name or service not known.

Fix: set `stage` in the script to 1 (or 2) so that this first step, the stage 0 block, is skipped.

stage=1  # set stage to 1 or 2 to skip the following step

# User agreement
if [ $stage -le 0 ]; then
  echo "$0: Start to download WenetSpeech user agreement"
  wget -c -P $download_dir \
    ${WENETSPEECH_RELEASE_URL}/TERMS_OF_ACCESS || exit 1;
  GREEN='\033[0;32m'
  NC='\033[0m'       # No Color
  echo -e "${GREEN}"
  echo -e "BY PROCEEDING YOU AGREE TO THE FOLLOWING WENETSPEECH TERMS OF ACCESS:"
  echo -e ""
  echo -e "=============== WENETSPEECH DATASET TERMS OF ACCESS ==============="
  cat $download_dir/TERMS_OF_ACCESS
  echo -e "=================================================================="
  echo -e "$0: WenetSpeech downloading will start in 5 seconds"
  echo -e ""

  for t in $(seq 5 -1 1); do
    echo -e "$t"
    sleep 1
  done
  echo -e "${NC}"
fi
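
For readers unfamiliar with this stage pattern, the sketch below shows why raising `stage` skips the earlier blocks. It is a standalone illustration and not part of `download_wenetspeech.sh`: the function name `run_from_stage` and the step labels are hypothetical, and only stages 0, 2 and 3 are taken from the excerpts in this post.

```shell
# Standalone illustration of stage gating (hypothetical helper, not from the project).
# A block guarded by `[ $stage -le N ]` runs only when stage <= N,
# so stage=1 skips the stage-0 block but still runs the later ones.
run_from_stage() {
  local stage=$1
  if [ "$stage" -le 0 ]; then echo "stage 0: user agreement"; fi
  if [ "$stage" -le 1 ]; then echo "stage 1: (not shown in this post)"; fi
  if [ "$stage" -le 2 ]; then echo "stage 2: download"; fi
  if [ "$stage" -le 3 ]; then echo "stage 3: process"; fi
}

run_from_stage 1
```

Calling `run_from_stage 2` would additionally skip the stage 1 line, which is why the fix above accepts either value.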

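2. The data is not downloaded to the specified <DOWNLOAD_DIR>?

During the download I found that the files did not go to the specified location but still went to the default cache location, ~/.cache. The reason is as follows.

I am using Python 3.12.4 and modelscope 1.15.0.

(P.S. Installing modelscope as the README describes pulls the latest release by default, and the latest release requires Python 3.8 or later, so the Python 3.7 specified in the README does not work.)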
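
Since the behaviour depends on versions, it can help to check the local interpreter before running the script. The following is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper written for this post, not something the WenetSpeech scripts provide.

```shell
# Sketch: compare dotted version strings with `sort -V`.
# version_ge is a hypothetical helper, not part of the WenetSpeech scripts.
version_ge() {
  # True when $1 >= $2: the smaller of the two under version sort must be $2.
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report the local interpreter version if python3 is on PATH.
if command -v python3 >/dev/null 2>&1; then
  py_ver=$(python3 -c 'import platform; print(platform.python_version())')
  if version_ge "$py_ver" 3.8; then
    echo "Python $py_ver meets the 3.8 minimum mentioned above"
  else
    echo "Python $py_ver is older than 3.8"
  fi
fi
```

The installed modelscope version can be read with `python3 -m pip show modelscope`.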

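In this modelscope version, the variable that sets the default cache path for dataset downloads is named `MS_DATASETS_CACHE`, not the `CACHE_HOME` used in the script. Change it as follows: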

# Download from list
if [ $stage -le 2 ]; then
  echo "$0: Start to download WenetSpeech files(*.aes.tgz)"
  if [ $modelscope == true ]; then
    MS_DATASETS_CACHE=$download_dir python utils/download_from_modelscope.py  # change CACHE_HOME to MS_DATASETS_CACHE
  else
    grep -v '^#' metadata/v1.list | (while read line; do
      download_object_from_release $line || exit 1;
    done) || exit 1;
  fi
fi
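
The `VAR=value command` form used in that line sets the variable for that one process only, which is why the name has to match what modelscope reads. A minimal sketch of the mechanism follows; `sh -c` stands in for the Python script, and the directory is a hypothetical example path.

```shell
# Sketch: a VAR=value prefix exports the variable to that one command only.
# `sh -c` stands in for `python utils/download_from_modelscope.py`;
# the directory below is a hypothetical example path.
download_dir=/data/wenetspeech_download

unset MS_DATASETS_CACHE

# The child process sees the variable...
MS_DATASETS_CACHE=$download_dir sh -c 'echo "child sees: $MS_DATASETS_CACHE"'

# ...but the calling shell is left untouched.
echo "parent sees: ${MS_DATASETS_CACHE:-<unset>}"
```

If several commands need the same setting, `export MS_DATASETS_CACHE=...` before them is the usual alternative.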

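3. TERMS_OF_ACCESS cannot be found?

Because the TERMS_OF_ACCESS download was already skipped in item 1, the following `cp` line must be commented out so that it is not executed: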

# Process data
if [ $stage -le 3 ]; then
  echo "$0: Start to process the downloaded files(*.aes.tgz)"
  # HERE!!!
  # cp $download_dir/TERMS_OF_ACCESS $untar_dir
  if [ $modelscope == true ]; then                 
    ms_download_dir=$download_dir/modelscope/hub/datasets/downloads/wenet/WenetSpeech/master
    for tgz in `ls $ms_download_dir | grep -v '\.'`; do
      process_downloaded_object ${ms_download_dir}/$tgz || exit 1;
    done || exit 1;
  else
    grep -v '^#' metadata/v1.list | (while read md5 tgz; do
      process_downloaded_object ${download_dir}/$tgz || exit 1;
    done) || exit 1;
  fi
fi
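
As an alternative to commenting the line out, the copy could be guarded so that it runs only when the file exists. This is a sketch of that idea, not what the project script does; `copy_terms_if_present` is a hypothetical helper whose two arguments play the roles of `$download_dir` and `$untar_dir`.

```shell
# Sketch of an alternative to commenting the line out: copy only if the file exists.
# copy_terms_if_present is a hypothetical helper; its arguments play the
# roles of $download_dir and $untar_dir in the script.
copy_terms_if_present() {
  src_dir=$1
  dst_dir=$2
  if [ -f "$src_dir/TERMS_OF_ACCESS" ]; then
    cp "$src_dir/TERMS_OF_ACCESS" "$dst_dir"
  else
    echo "TERMS_OF_ACCESS not found in $src_dir, skipping copy" >&2
  fi
}

# Demo with throwaway directories: the file is absent, so nothing is copied
# and the function still returns success.
tmp_src=$(mktemp -d)
tmp_dst=$(mktemp -d)
copy_terms_if_present "$tmp_src" "$tmp_dst"
```

The advantage of a guard is that the same script keeps working if the agreement file is downloaded later.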

4. The download directory cannot be found?

Error:

ls: cannot access '<download_dir>/modelscope/hub/datasets/downloads/wenet/WenetSpeech/master': No such file or directory

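Cause: with the current modelscope version, downloaded datasets are not stored under `<download_dir>/modelscope/hub/datasets/downloads/wenet/WenetSpeech/master` but under:

`<download_dir>/wenet/WenetSpeech/master/`

Changing the path to this alone is still not enough, because:

5. The entry under the download path is a directory?

Error: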

utils/download_wenetspeech.sh: Processing <download_dir>/wenet/WenetSpeech/master/data_files
error reading input file
4097C60A257F0000:error:80000015:system library:file_read:Is a directory:crypto/bio/bss_file.c:148:calling fread()
4097C60A257F0000:error:10080002:BIO routines:file_read:system lib:crypto/bio/bss_file.c:150:

The files themselves are not stored in `master` but one level further down, in `data_files`.
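
Because the layout may differ between modelscope versions, it is worth confirming it on your own machine rather than trusting a hard-coded path. A minimal sketch with `find` follows; the temporary tree only mimics the layout reported in this post, and on a real machine `find` would be pointed at the actual `<download_dir>`.

```shell
# Sketch: locate the directory that actually holds the downloaded files.
# The temporary tree below only mimics the layout reported in this post;
# on a real machine, point `find` at your own <download_dir> instead.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/wenet/WenetSpeech/master/data_files"

# Print every directory named data_files under the download root.
find "$demo_dir" -type d -name data_files
```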

Combining items 4 and 5, the change is as follows:

# Process data
if [ $stage -le 3 ]; then
  echo "$0: Start to process the downloaded files(*.aes.tgz)"
  # cp $download_dir/TERMS_OF_ACCESS $untar_dir
  if [ $modelscope == true ]; then
    # here!! change ms_download_dir
    ms_download_dir=$download_dir/wenet/WenetSpeech/master/data_files
    for tgz in `ls $ms_download_dir | grep -v '\.'`; do
      process_downloaded_object ${ms_download_dir}/$tgz || exit 1;
    done || exit 1;
  else
    grep -v '^#' metadata/v1.list | (while read md5 tgz; do
      process_downloaded_object ${download_dir}/$tgz || exit 1;
    done) || exit 1;
  fi
fi
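
The error in item 5 happened because the `ls | grep -v '\.'` loop passes along anything without a dot in its name, directories included. A more defensive loop could use a glob and skip non-regular files. This is only a sketch: `list_objects` is a hypothetical helper, the entry names in the demo are made up, and in the real script the `echo` would be a call to `process_downloaded_object`.

```shell
# Sketch: iterate with a glob instead of parsing `ls`, and skip anything that
# is not a regular file. list_objects is a hypothetical helper; in the real
# script the echo would be a call to process_downloaded_object.
list_objects() {
  dir=$1
  for entry in "$dir"/*; do
    [ -f "$entry" ] || continue         # skips directories and an unmatched glob
    case "${entry##*/}" in
      *.*) continue ;;                  # same filter as `grep -v '\.'`
    esac
    echo "$entry"
  done
}

# Demo on a throwaway directory with made-up entry names.
demo=$(mktemp -d)
touch "$demo/abc123" "$demo/notes.json"
mkdir "$demo/subdir"
list_objects "$demo"
```

In the demo only `abc123` is listed: `notes.json` contains a dot and `subdir` is a directory.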

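With the changes above, the WenetSpeech download from ModelScope completes normally.

Problems 2, 4 and 5 could perhaps be avoided by installing an older modelscope version, but the project README does not say which modelscope version was used when the script was written, so it is unclear which version to pin.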
