Crawling the GRID audio-visual corpus

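This post describes how to crawl the GRID corpus.
The corpus is hosted at http://spandh.dcs.shef.ac.uk/gridcorpus/

The page is simple, so a single XPath query finds the links we need.

Finding all the links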

# -*- coding: utf-8 -*-
import urllib.request
from lxml import etree

root_url = "http://spandh.dcs.shef.ac.uk/"
url = f"{root_url}gridcorpus/"


def main():
    html = urllib.request.urlopen(url).read()
    tree = etree.HTML(html)

    # Every download link sits inside a table cell.
    links = tree.xpath(".//td/a/@href")
    with open("dw.raw.list", "w", encoding="utf-8") as fp:
        for e in links:
            # Skip the example files.
            if "example" in e:
                continue
            # Skip the high-quality video.
            if "part" in e:
                continue
            print(e)
            fp.write(e + "\n")


if __name__ == '__main__':
    main()
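The next step reads full_path.list, which holds complete URLs, while the XPath query may return relative hrefs. A minimal sketch of completing the paths is shown here, assuming the hrefs are relative to the corpus page. The names `complete_links` and `write_full_paths` and the sample path in the usage note are illustrative and not part of the original script.

```python
from urllib.parse import urljoin

# Base page the hrefs are assumed to be relative to.
BASE_URL = "http://spandh.dcs.shef.ac.uk/gridcorpus/"


def complete_links(links, base_url=BASE_URL):
    """Return absolute URLs for the given hrefs, skipping blank lines."""
    full = []
    for link in links:
        link = link.strip()
        if link:
            # urljoin leaves already-absolute URLs unchanged.
            full.append(urljoin(base_url, link))
    return full


def write_full_paths(src="dw.raw.list", dst="full_path.list"):
    """Read hrefs from src and write one absolute URL per line to dst."""
    with open(src, encoding="utf-8") as fp:
        full = complete_links(fp)
    with open(dst, "w", encoding="utf-8") as fp:
        fp.writelines(u + "\n" for u in full)
```

For a hypothetical href such as `s1/align/s1.tar`, `complete_links` would return the same path appended to the corpus URL. Calling `write_full_paths()` converts the whole list in one go.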

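Building the wget commands

This step turns every link into a wget command. Set the output file name explicitly, so that the commands can later be run in parallel with parallel.

The following code is for reference.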

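f1 = "full_path.list"  # links from the previous step, completed to full URLs; this could also be done in this script
f2 = "last.list"

with open(f1) as fp:
    lines = fp.readlines()

with open(f2, "w") as fp:
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Name the file after the last two path components, joined with "-".
        ss = line.split("/")[-2:]
        ss = "-".join(ss)

        # Earlier attempt, apparently wrong: -O does not honor the -P prefix.
        # line = f'wget -c -P datasets/ {line} -O {ss}\n'
        line = f'wget -c -O datasets/{ss} {line}\n'
        fp.write(line)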
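To make the naming rule concrete, here is a small sketch that builds one command at a time. The helper name `wget_command` and the sample URLs are assumptions for illustration. The point is that archives of different kinds can share a file name, so prefixing the parent directory keeps them apart.

```python
def wget_command(url, out_dir="datasets"):
    """Build a wget command that saves url as <parent dir>-<file name>."""
    url = url.strip()
    # The last two path components, e.g. ["align", "s1.tar"].
    parts = url.split("/")[-2:]
    name = "-".join(parts)
    return f"wget -c -O {out_dir}/{name} {url}"


if __name__ == "__main__":
    base = "http://spandh.dcs.shef.ac.uk/gridcorpus/s1"  # hypothetical layout
    print(wget_command(f"{base}/align/s1.tar"))
    print(wget_command(f"{base}/audio/s1.tar"))
```

Under this assumed layout both archives are called s1.tar on the server, but the commands save them as datasets/align-s1.tar and datasets/audio-s1.tar. The datasets directory has to exist before the commands run.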

Parallel download

cat last.list | parallel

parallel is more convenient than xargs here; see
https://www.jianshu.com/p/cc54a72616a1
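If GNU parallel is not available, roughly the same effect can be had from Python. This is only a sketch under assumptions: the function names and the default of 4 workers are made up for illustration, and it treats each line of last.list as one complete shell command.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, workers=4):
    """Run shell commands concurrently and return their exit codes in order."""
    def run(cmd):
        # shell=True because every entry is a full shell command line.
        return subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


def run_list(path="last.list", workers=4):
    """Run every non-empty line of the command list file."""
    with open(path) as fp:
        commands = [line.strip() for line in fp if line.strip()]
    return run_commands(commands, workers)
```

A non-zero entry in the returned list marks a download that failed. Since the commands use `wget -c`, re-running them resumes partial files.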

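Extraction

The extracted files also have to be moved into per-speaker directories; otherwise archives that unpack to the same directory name overwrite each other.

Transcripts and audio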

for x in {1..34}
do
    {
        nm=s$x
        echo $nm
        mkdir -p ../db_uncompress/$nm/tmp
        mkdir -p ../db_uncompress/$nm/align
        mkdir -p ../db_uncompress/$nm/audio
        f1="../datasets/align-$nm.tar"
        tar -xf $f1 -C ../db_uncompress/$nm/tmp
        mv ../db_uncompress/$nm/tmp/align ../db_uncompress/$nm/

        f2="../datasets/audio-$nm.tar"
        tar -xf $f2 -C ../db_uncompress/$nm/tmp
        mv ../db_uncompress/$nm/tmp/$nm/* ../db_uncompress/$nm/audio

        #exit
    } &
done
wait  # block until all background jobs have finished
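The same unpack-then-move idea can be sketched in Python with the standard tarfile module. Instead of extracting to a tmp directory and moving the files, this version writes each file straight into the target directory. The function name `extract_flat` is hypothetical, and the sketch assumes that no two files inside one archive share a base name.

```python
import os
import shutil
import tarfile


def extract_flat(archive, dest_dir):
    """Extract every regular file of a tar archive directly into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    names = []
    with tarfile.open(archive) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and special entries
            # Drop the directory part stored inside the archive.
            name = os.path.basename(member.name)
            target = os.path.join(dest_dir, name)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            names.append(name)
    return names
```

With the paths used in the shell loop, a call might look like `extract_flat("../datasets/audio-s1.tar", "../db_uncompress/s1/audio")`.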

Video

for x in {1..34}
do
    {
        nm=s$x
        echo $nm
        mkdir -p ../db_uncompress/$nm/video
        f1="../datasets/video-$nm.mpg_vcd.zip"
        unzip -q $f1 -d ../db_uncompress/$nm/tmp
        mv ../db_uncompress/$nm/tmp/$nm/* ../db_uncompress/$nm/video

        #exit
    } &
done
wait  # block until all background jobs have finished

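Joining the align files into target text

Each line of an align file originally carries a range of sample indices at 25 kHz. To simplify later processing, that part is dropped first and only the words are kept.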

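cnt = 0
Dir["db_uncompress/s*/align/*.align"].each do |fnm|
    # Keep only the last field (the word) of every line and join with spaces.
    ss = File.readlines(fnm).map { |e| e.strip.split[-1] }.join(" ")

    # Write the result under a sibling "test" directory instead of "align".
    out = fnm.sub("align", "test")
    dir = File.dirname(out)
    Dir.mkdir(dir) unless File.exist?(dir)

    # The block form closes and flushes the file.
    File.open(out, "w") { |fp| fp.puts ss }

    cnt += 1
    puts cnt if cnt % 1000 == 0
end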
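For readers who prefer Python, the core of this step fits in a few lines. The function name `align_to_text` and the sample lines are illustrative. Each line is assumed to have the form "start end word", and blank lines are skipped.

```python
def align_to_text(lines):
    """Join the last field of each non-empty line into one sentence."""
    words = []
    for line in lines:
        fields = line.split()
        if fields:
            # The last field is the word; the first two are the sample range.
            words.append(fields[-1])
    return " ".join(words)


if __name__ == "__main__":
    sample = ["0 23750 sil\n", "23750 29500 bin\n", "29500 34000 blue\n"]
    print(align_to_text(sample))  # prints "sil bin blue"
```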

Further processing ----

Converting to phonemes

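mp = {}
File.foreach("cmudict-0.7b.txt") do |e|
    begin
        e = e.strip
        next if e.empty?
        # Split "WORD  PH1 PH2 ..." into the word and its phoneme string.
        arr = e.split(" ", 2)
    rescue => ex
        # Report and skip lines that cannot be parsed (for example, bad encoding).
        puts e
        puts ex.message
        next
    end
    mp[arr[0]] = arr[1]
end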
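The snippet above only builds the lookup table. A possible continuation, sketched in Python, looks words up in that table. Everything here is an assumption made for illustration: the function names, the `<unk>` placeholder for missing words, treating lines that start with ";;;" as comments, and upper-casing words before the lookup.

```python
def parse_cmudict(lines):
    """Map each dictionary word to its phoneme string."""
    mp = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(";;;"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            mp[parts[0]] = parts[1]
    return mp


def to_phonemes(text, mp, unknown="<unk>"):
    """Return one phoneme string per word; unknown words get a placeholder."""
    return [mp.get(word.upper(), unknown) for word in text.split()]


if __name__ == "__main__":
    # Sample entries in the dictionary's "WORD  PHONEMES" layout.
    table = parse_cmudict([";;; comment", "BIN  B IH1 N", "BLUE  B L UW1"])
    print(to_phonemes("bin blue", table))
```

To load the real file, something like `parse_cmudict(open("cmudict-0.7b.txt", encoding="latin-1"))` should work. The latin-1 encoding is a guess that avoids decode errors on non-ASCII entries.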