Unity 讯飞实时语音转写(二)—— 接收转写结果

目录

Unity 讯飞实时语音转写(一)—— 使用WebSocket连接讯飞语音服务器
Unity 讯飞实时语音转写(二)—— 接收转写结果
Unity 讯飞实时语音转写(三)—— 分析转写结果

正文

一、官网示例
官方文档中,有一段说了转写结果的json字符串大概长什么样子

{
    	    "action":"result",
    	    "code":"0",
    		"data":"{\"cn\":{\"st\":{\"bg\":\"820\",\"ed\":\"0\",\"rt\":[{\"ws\":[{\"cw\":[{\"w\":\"啊\",\"wp\":\"n\"}],\"wb\":0,\"we\":0},{\"cw\":[{\"w\":\"喂\",\"wp\":\"n\"}],\"wb\":0,\"we\":0},{\"cw\":[{\"w\":\"!\",\"wp\":\"p\"}],\"wb\":0,\"we\":0},{\"cw\":[{\"w\":\"你好\",\"wp\":\"n\"}],\"wb\":0,\"we\":0},{\"cw\":[{\"w\":\"!\",\"wp\":\"p\"}],\"wb\":0,\"we\":0},{\"cw\":[{\"w\":\"我\",\"wp\":\"n\"}],\"wb\":0,\"we\":0},{\"cw\":[{\"w\":\"是\",\"wp\":\"n\"}],\"wb\":0,\"we\":0},{\"cw\":[{\"w\":\"上\",\"wp\":\"n\"}],\"wb\":0,\"we\":0}]}],\"type\":\"1\"}},\"seg_id\":5}\n",
    		"desc":"success",
    		"sid":"rta0000000e@ch312c0e3f6bcc9f0900"
	}

经过测试,我自己转写的结果

接收消息:{
"action":"result",
"code":"0",
"data":"{
	\"seg_id\":7,
	\"cn\":{
		\"st\":{
			\"rt\":[
				{\"ws\":[
					{\"cw\":[{\"w\":\"我们\",\"wp\":\"n\"}],\"wb\":23,\"we\":70},
					{\"cw\":[{\"w\":\"生活\",\"wp\":\"n\"}],\"wb\":71,\"we\":118},
					{\"cw\":[{\"w\":\"的\",\"wp\":\"n\"}],\"wb\":119,\"we\":130},
					{\"cw\":[{\"w\":\"世界\",\"wp\":\"n\"}],\"wb\":131,\"we\":172},
					{\"cw\":[{\"w\":\"里\",\"wp\":\"n\"}],\"wb\":173,\"we\":201},
					{\"cw\":[{\"w\":\"有\",\"wp\":\"n\"}],\"wb\":202,\"we\":226},
					{\"cw\":[{\"w\":\"两\",\"wp\":\"n\"}],\"wb\":227,\"we\":249},
					{\"cw\":[{\"w\":\"个\",\"wp\":\"n\"}],\"wb\":250,\"we\":263},
					{\"cw\":[{\"w\":\"世界\",\"wp\":\"n\"}
				],
				\"wb\":264,
				\"we\":320}
			]}
		],
		\"bg\":\"5120\",
		\"type\":\"0\",
		\"ed\":\"8520\"
		}
	},
	\"ls\":false
}",
"desc":"success",
"sid":"xxxxxxxxxxxxxxxxxxxxxxxxxx"
}

发现实际转写结果有一些细微差别,例如:增加了一个字段“ls”,尚不知道是干嘛的。不过,既然接收到了该字段,就说明以后可能会用到,先在代码中添加上就好了。

二、关键数据结构

/// <summary>
/// 返回值结果格式为json
/// </summary>
[Serializable]
public struct JsonData
{
    /// <summary>
    /// 结果标识,started:握手,result:结果,error:异常
    /// </summary>
    public string action;

    /// <summary>
    /// 结果码(具体见错误码)
    /// </summary>
    public string code;

    /// <summary>
    /// 转写结果数据
    /// </summary>
    public Data data;

    /// <summary>
    /// 描述
    /// </summary>
    public string desc;

    /// <summary>
    /// 会话ID
    /// 主要用于DEBUG追查问题,如果出现问题,可以提供sid帮助确认问题。
    /// </summary>
    public string sid;
}

结果格式为json,字段说明如下:

参数类型说明
actionstring结果标识,started:握手,result:结果,error:异常
codestring结果码(具体见错误码)
datastring转写结果
descstring描述
sidstring会话ID
/// <summary>
/// 语音识别的结果
/// </summary>
[Serializable]
public struct Data
{
    /// <summary>
    /// 转写结果序号	从0开始
    /// </summary>
    public string seg_id;

    [Serializable]
    public struct CN
    {
        [Serializable]
        public struct ST
        {
            [Serializable]
            public struct RT
            {
                [Serializable]
                public class WS
                {
                    [Serializable]
                    public class CW
                    {
                        /// <summary>
                        /// 词识别结果
                        /// </summary>
                        public string w;

                        /// <summary>
                        /// 词标识  n-普通词;s-顺滑词(语气词);p-标点
                        /// </summary>
                        public string wp;
                    }

                    public CW[] cw;

                    /// <summary>
                    /// 词在本句中的开始时间,单位是帧,1帧=10ms  即词在整段语音中的开始时间为(bg+wb*10)ms
                    /// 中间结果的 wb 为 0
                    /// </summary>
                    public string wb;

                    /// <summary>
                    /// 词在本句中的结束时间,单位是帧,1帧=10ms  即词在整段语音中的结束时间为(bg+we*10)ms
                    /// 中间结果的 we 为 0
                    /// </summary>
                    public string we;
                }

                public WS[] ws;
            }

            public RT rt;

            /// <summary>
            /// 句子在整段语音中的开始时间,单位毫秒(ms)
            /// 中间结果的bg为准确值
            /// </summary>
            public string bg;

            /// <summary>
            /// 结果类型标识	0-最终结果;1-中间结果
            /// </summary>
            public string type;

            /// <summary>
            /// 句子在整段语音中的结束时间,单位毫秒(ms)
            /// 中间结果的ed为0
            /// </summary>
            public string ed;
        }

        public ST st;
    }

    public CN cn;

    /// <summary>
    /// 
    /// </summary>
    public string ls;
}

转写结果data字段说明如下:

字段含义描述
seg_id转写结果序号从0开始
w词识别结果
wp词标识n-普通词;s-顺滑词(语气词);p-标点
wb词在本句中的开始时间,单位是帧,1帧=10ms 即词在整段语音中的开始时间为(bg+wb*10)ms中间结果的 wb 为 0
we词在本句中的结束时间,单位是帧,1帧=10ms 即词在整段语音中的结束时间为(bg+we*10)ms中间结果的 we 为 0
bg句子在整段语音中的开始时间,单位毫秒(ms)中间结果的bg为准确值
type结果类型标识0-最终结果;1-中间结果
ed句子在整段语音中的结束时间,单位毫秒(ms)中间结果的ed为0
ls官网未说明官网未说明

三、该篇全部代码

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using UnityEngine;
using System.Net.WebSockets;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using Newtonsoft.Json;

/// <summary>
/// 返回值结果格式为json
/// </summary>
[Serializable]
public struct JsonData
{
    /// <summary>
    /// 结果标识,started:握手,result:结果,error:异常
    /// </summary>
    public string action;

    /// <summary>
    /// 结果码(具体见错误码)
    /// </summary>
    public string code;

    /// <summary>
    /// 转写结果数据
    /// </summary>
    public Data data;

    /// <summary>
    /// 描述
    /// </summary>
    public string desc;

    /// <summary>
    /// 会话ID
    /// 主要用于DEBUG追查问题,如果出现问题,可以提供sid帮助确认问题。
    /// </summary>
    public string sid;
}

/// <summary>
/// 语音识别的结果
/// </summary>
[Serializable]
public struct Data
{
    /// <summary>
    /// 转写结果序号	从0开始
    /// </summary>
    public string seg_id;

    [Serializable]
    public struct CN
    {
        [Serializable]
        public struct ST
        {
            [Serializable]
            public struct RT
            {
                [Serializable]
                public class WS
                {
                    [Serializable]
                    public class CW
                    {
                        /// <summary>
                        /// 词识别结果
                        /// </summary>
                        public string w;

                        /// <summary>
                        /// 词标识  n-普通词;s-顺滑词(语气词);p-标点
                        /// </summary>
                        public string wp;
                    }

                    public CW[] cw;

                    /// <summary>
                    /// 词在本句中的开始时间,单位是帧,1帧=10ms  即词在整段语音中的开始时间为(bg+wb*10)ms
                    /// 中间结果的 wb 为 0
                    /// </summary>
                    public string wb;

                    /// <summary>
                    /// 词在本句中的结束时间,单位是帧,1帧=10ms  即词在整段语音中的结束时间为(bg+we*10)ms
                    /// 中间结果的 we 为 0
                    /// </summary>
                    public string we;
                }

                public WS[] ws;
            }

            public RT rt;

            /// <summary>
            /// 句子在整段语音中的开始时间,单位毫秒(ms)
            /// 中间结果的bg为准确值
            /// </summary>
            public string bg;

            /// <summary>
            /// 结果类型标识	0-最终结果;1-中间结果
            /// </summary>
            public string type;

            /// <summary>
            /// 句子在整段语音中的结束时间,单位毫秒(ms)
            /// 中间结果的ed为0
            /// </summary>
            public string ed;
        }

        public ST st;
    }

    public CN cn;

    /// <summary>
    /// 
    /// </summary>
    public string ls;
}

public class VisualCommunication : MonoBehaviour
{
    private string appid = "xxxxx";
    private string appkey = "xxxxxxxxxxxxxxxxxxxxxxxxx";

    private string timeStamp;
    private string baseString;
    private string toMd5;

    private string signa;

    public AudioClip RecordedClip;
    ClientWebSocket ws;
    CancellationToken ct;

    private int MAX_RECORD_LENGTH = 3599;

    /// <summary>
    /// 语音识别回调事件
    /// </summary>
    public event Action<string> asrCallback;

    void Start()
    {
        asrCallback += Output;
        //Debug.Log("\nappid: " + appid);
        //Debug.Log("\nappkey: " + appkey);
        //Debug.Log("\ntimeStamp: " + timeStamp);
        //Debug.Log("\nbaseString: " + baseString);
        //Debug.Log("\nToMD5: " + toMd5);
        //Debug.Log("\nsigna: " + signa);
    }

    void Output(string str)
    {
        Debug.Log("语音识别结果:" + str);
    }

    public void StartASR()
    {
        if (ws != null && ws.State == WebSocketState.Open)
        {
            Debug.LogWarning("开始语音识别失败!,等待上次识别连接结束");
            return;
        }

        if (Microphone.devices.Length == 0)
        {
            Debug.LogError("未检测到可用的麦克风");
            return;
        }

        ConnectASR_Aysnc();
        RecordedClip = Microphone.Start(null, false, MAX_RECORD_LENGTH, 16000);
    }

    public void StopASR()
    {
        if (ws != null)
        {
            //关掉发送音频的协程
            StopCoroutine(SendAudioClip());

            //Debug.Log("发送结束标识" + ws.State);
            //音频数据上传完成后,客户端需发送一个 {"end": true} 到服务端作为结束标识
            ws.SendAsync(new ArraySegment<byte>(Encoding.UTF8.GetBytes("{\"end\": true}")), WebSocketMessageType.Binary,
                true, new CancellationToken());
            //ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "关闭WebSocket连接", new CancellationToken());
            Microphone.End(null);
            StartCoroutine(StopRecord());
        }
    }

    private IEnumerator StopRecord()
    {
        yield return new WaitUntil(() => ws.State != WebSocketState.Open);
        Debug.Log("识别结束,停止录音");
    }

    async void ConnectASR_Aysnc()
    {
        ws = new ClientWebSocket();
        ct = new CancellationToken();
        Uri url = GetUri();
        await ws.ConnectAsync(url, ct);

        StartCoroutine(SendAudioClip());
        StringBuilder stringBuilder = new StringBuilder();
        while (ws.State == WebSocketState.Open)
        {
            var result = new byte[4096];

            await ws.ReceiveAsync(new ArraySegment<byte>(result), ct); //接受数据
            List<byte> list = new List<byte>(result);
            while (list[list.Count - 1] == 0x00) list.RemoveAt(list.Count - 1); //去除空字节
            string str = Encoding.UTF8.GetString(list.ToArray());
            Debug.Log("接收消息:" + str);
            if (string.IsNullOrEmpty(str))
            {
                return;
            }

            JsonData jsonData = JsonUtility.FromJson<JsonData>(str);
            if (jsonData.action.Equals("started"))
            {
                Debug.Log("握手成功!");
            }
            else if (jsonData.action.Equals("result"))
            {
                stringBuilder.Append(AnalysisResult(jsonData));
            }
            else if (jsonData.action.Equals("error"))
            {
                Debug.Log("Error: " + jsonData.desc);
                ws.Abort();
            }
        }

        Debug.LogWarning("断开连接");
        string s = stringBuilder.ToString();
        if (!string.IsNullOrEmpty(s))
        {
            asrCallback?.Invoke(s);
            Debug.LogWarning("识别到声音:" + s);
        }
    }


    /// <summary>
    /// 发送音频片段
    /// </summary>
    /// <param name="socket"></param>
    /// <returns></returns>
    IEnumerator SendAudioClip()
    {
        yield return new WaitWhile(() => Microphone.GetPosition(null) <= 0);
        float t = 0;
        int position = Microphone.GetPosition(null);
        const float waitTime = 0.04f; //每隔40ms发送音频
        const int Maxlength = 1280; //最多发送1280字节
        int status = 0;
        int lastPosition = 0;
        while (position < RecordedClip.samples && ws.State == WebSocketState.Open)
        {
            t += waitTime;
            if (t >= MAX_RECORD_LENGTH)
            {
                Debug.Log("录音时长已用尽,结束语音识别!");
                break;
            }

            yield return new WaitForSecondsRealtime(waitTime);
            if (Microphone.IsRecording(null))
            {
                position = Microphone.GetPosition(null);
                //Debug.Log("录音时长:" + t + "position=" + position + ",lastPosition=" + lastPosition);
            }

            if (position <= lastPosition)
            {
                // 防止出现当前采样位置和上一帧采样位置一样,导致length为0
                // 那么在调用AudioClip.GetData(float[] data, int offsetSamples);时,将报错
                continue;
            }

            int length = position - lastPosition > Maxlength ? Maxlength : position - lastPosition;
            byte[] data = GetAudioClip(lastPosition, length, RecordedClip);

            ws.SendAsync(new ArraySegment<byte>(data), WebSocketMessageType.Binary, true,
                new CancellationToken()); //发送数据
            lastPosition = lastPosition + length;
            status = 1;
        }
    }

    private void OnApplicationQuit()
    {
        StopASR();
    }

    /// <summary>
    /// 获取识别并返回字符串
    /// </summary>
    /// <param name="data">所获取的识别的Json字符串</param>
    /// <returns>所识别的连贯的一句话</returns>
    string AnalysisResult(JsonData jsonData)
    {
        StringBuilder stringBuilder = new StringBuilder();
        var ws = jsonData.data.cn.st.rt.ws;
        if (ws != null)
        {
            foreach (var item in ws)
            {
                var cw = item.cw;
                foreach (var w in cw)
                {
                    stringBuilder.Append(w.w);
                }
            }

            return stringBuilder.ToString();
        }

        return "";
    }

    /// <summary>
    /// 获取音频流片段
    /// </summary>
    /// <param name="start">起始采样点</param>
    /// <param name="length">采样长度</param>
    /// <param name="recordedClip">音频</param>
    /// <returns></returns>
    public static byte[] GetAudioClip(int start, int length, AudioClip recordedClip)
    {
        float[] soundata = new float[length];
        recordedClip.GetData(soundata, start);
        int rescaleFactor = 32767;
        byte[] outData = new byte[soundata.Length * 2];
        for (int i = 0; i < soundata.Length; i++)
        {
            short temshort = (short) (soundata[i] * rescaleFactor);
            byte[] temdata = BitConverter.GetBytes(temshort);
            outData[i * 2] = temdata[0];
            outData[i * 2 + 1] = temdata[1];
        }

        return outData;
    }

    /// <summary>
    /// 获得请求URI
    /// </summary>
    /// <returns>请求的URI</returns>
    private Uri GetUri()
    {
        //精确到秒
        timeStamp = GetTimeStamp();

        //baseString由appid和当前时间戳ts拼接而成
        baseString = appid + timeStamp;

        //对baseString进行MD5
        toMd5 = ToMD5(baseString);

        //以apiKey为key对MD5之后的baseString进行HmacSHA1加密
        //然后再对加密后的字符串进行base64编码
        signa = ToHmacSHA1(toMd5, appkey);

        string requestUrl = string.Format("wss://rtasr.xfyun.cn/v1/ws?appid={0}&ts={1}&signa={2}&pd=tech", appid,
            timeStamp, UrlEncode(signa));
        Debug.Log("requestUrl: " + requestUrl);
        return new Uri(requestUrl);
    }

    /// <summary>
    /// 对字符串进行UrlEncode转码
    /// </summary>
    /// <param name="str">需要转码的字符串</param>
    /// <returns>经过UrlEncode转码的字符串</returns>
    public static string UrlEncode(string str)
    {
        StringBuilder sb = new StringBuilder();
        byte[] byStr = System.Text.Encoding.UTF8.GetBytes(str); //默认是System.Text.Encoding.Default.GetBytes(str)
        for (int i = 0; i < byStr.Length; i++)
        {
            sb.Append(@"%" + Convert.ToString(byStr[i], 16));
        }

        return (sb.ToString());
    }

    /// <summary>
    /// 获取时间戳
    /// </summary>
    /// <returns>时间戳,精确到秒</returns>
    public static string GetTimeStamp()
    {
        TimeSpan ts = DateTime.UtcNow - new DateTime(1970, 1, 1, 0, 0, 0, 0);
        return Convert.ToInt64(ts.TotalSeconds).ToString();
    }

    /// <summary>
    /// MD5字符串加密
    /// </summary>
    /// <param name="txt">需要加密的字符串</param>
    /// <returns>加密后字符串</returns>
    public static string ToMD5(string txt)
    {
        using (MD5 mi = MD5.Create())
        {
            byte[] buffer = Encoding.Default.GetBytes(txt);
            //开始加密
            byte[] newBuffer = mi.ComputeHash(buffer);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < newBuffer.Length; i++)
            {
                sb.Append(newBuffer[i].ToString("x2"));
            }

            return sb.ToString();
        }
    }

    /// <summary>
    /// HMACSHA1算法加密并返回ToBase64String
    /// </summary>
    /// <param name="text">要加密的原串</param>
    ///<param name="key">私钥</param>
    /// <returns>返回一个签名值(即哈希值)</returns>
    public static string ToHmacSHA1(string text, string key)

    {
        //HMACSHA1加密
        HMACSHA1 hmacsha1 = new HMACSHA1();
        hmacsha1.Key = System.Text.Encoding.UTF8.GetBytes(key);

        byte[] dataBuffer = System.Text.Encoding.UTF8.GetBytes(text);
        byte[] hashBytes = hmacsha1.ComputeHash(dataBuffer);

        return Convert.ToBase64String(hashBytes);
    }
}

四、如何测试

  1. 正确填写讯飞控制台上自己创建的应用所对应的appid、appkey
  2. 将该脚本VisualCommunication.cs挂在场景任意对象上,
  3. 确保你的麦克风可用

五、说明
在调用StopASR()方法后,223行会报一个错误“WebSocketException: The remote party closed the WebSocket connection without completing the close handshake.”,尚未解决

六、该篇参考资料
官方文档:https://www.xfyun.cn/doc/asr/rtasr/API.html
代码实现:https://blog.csdn.net/chunyu90225/article/details/106172895

转载注明出处!

  • 7
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 8
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 8
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

jack_dos

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值