Notes on Microsoft TTS

CoffeeAndIce

Last modified: 2024-05-06 10:12:28

Category: Integration Notes. Tags: tts, Microsoft TTS, text-to-speech

First published: 2020-09-02 17:15:00

Article link: https://blog.csdn.net/CoffeeAndIce/article/details/108366155

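0. Policy updates

1) Standard voices will no longer be supported after 2024, so I recommend switching to neural voices

2) southeastasia is no longer supported; it has changed to southeastasia (2024.4.28)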

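1. Rambling about TTS

In China, vendors such as Alibaba, iFlytek, and Tencent all have their own TTS. Put simply, TTS is text-to-speech, a form of speech synthesis.

None of those are my focus here, though. This post is about Microsoft TTS from overseas, and I simply write down whatever I run into.

First, the environment:

If you are on a Windows Server machine, you can skip this part.

Microsoft provides an official SDK, which by default targets Microsoft's own servers, so developing with the SDK on your own computer usually causes no problems. On Linux, however, you hit an UnsatisfiedLinkError, mainly inside SpeechConfig.fromSubscription(SubscKey, Location).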
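One way to keep a service usable on hosts where the native SDK library fails to load is to catch the linkage failure at the call site and switch to another path, such as the REST API. The following is only a sketch under that assumption: `SynthesisFallback` and `firstAvailable` are hypothetical names, and the two suppliers stand in for whatever SDK-based and REST-based implementations a project really has.

```java
import java.util.function.Supplier;

/** Hypothetical helper: try the SDK-backed path first, fall back if native linking fails. */
final class SynthesisFallback {

    private SynthesisFallback() {
    }

    /**
     * Runs the primary supplier and returns its value. If the JVM reports a
     * LinkageError, runs the fallback supplier instead. UnsatisfiedLinkError is a
     * LinkageError; catching the parent type also covers NoClassDefFoundError,
     * which the JVM raises when a class is touched again after its initializer failed.
     */
    static <T> T firstAvailable(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (LinkageError e) {
            // Native library missing or incompatible on this host.
            return fallback.get();
        }
    }
}
```

The primary supplier would wrap the SDK call (for example the SpeechConfig.fromSubscription path) and the fallback would wrap the REST call, so callers never see the linkage error.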

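Supported environments:

Almost every platform is supported: C++ / Windows, Linux, and macOS

On Linux: Ubuntu 16.04 / 18.04, Debian 9, Red Hat Enterprise Linux (RHEL) 7/8, and CentOS 7/8

On Linux I followed the official documentation and still could not get the SDK set up, so I went straight to the RESTful API mode. Even so, I wrote up part of the SDK material.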

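SDK mode:

The official documentation already covers SDK mode. On top of it I added my own wrapper, in the hope that I can still recall how it works when I come back to it later. Overall, the output is saved to a specific folder.

Add the dependency

<!-- Repository address -->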
<repositories>
    <repository>
        <id>maven-cognitiveservices-speech</id>
        <name>Microsoft Cognitive Services Speech Maven Repository</name>
        <url>https://csspeechstorage.blob.core.windows.net/maven/</url>
    </repository>
</repositories>
    ...
<dependency>
    <groupId>com.microsoft.cognitiveservices.speech</groupId>
    <artifactId>client-sdk</artifactId>
    <version>1.13.0</version>
</dependency>

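Shared registration

The configuration that every SDK call needs to register

SubscKey: applying for TTS gives you two keys; either one works

Location: applying for TTS gives you a region; fill it in here

// TVFSProgramConfig.class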
public static SpeechConfig speechConfig;

public TVFSProgramConfig() {
    speechConfig = SpeechConfig.fromSubscription(SubscKey, Location);
}

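① Text mode:

For plain text; the main pattern is to synthesize the speech directly and output it

Adjustable settings: voice locale, voice speaker

Note: choose the voice locale according to the region your account was created in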

// Core method
// absolutePath is an absolute path, e.g. /opt/voice/test.wav
AudioConfig audioConfig = AudioConfig.fromWavFileOutput(absolutePath.toString());
SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);

// transformText is the text to convert
SpeechSynthesisResult result = synthesizer.SpeakTextAsync(transformText).get();


// Inspect the outcome
while (handlerStatus) {
    // Success
    if (result.getReason() == ResultReason.SynthesizingAudioCompleted) {
        System.out.println("Speech synthesized to speaker for text [" + transformText + "]");
        handlerStatus = Boolean.FALSE;
        synthesizer.close();
        // Closing the shared speechConfig means it must be recreated before the next call
        speechConfig.close();
        audioConfig.close();
        result.close();
    // Canceled
    } else if (result.getReason() == ResultReason.Canceled) {
        SpeechSynthesisCancellationDetails cancellation = SpeechSynthesisCancellationDetails.fromResult(result);
        System.out.println("CANCELED: Reason=" + cancellation.getReason());
        // Reaching this branch means an error occurred
        if (cancellation.getReason() == CancellationReason.Error) {
            System.out.println("CANCELED: ErrorCode=" + cancellation.getErrorCode());
            System.out.println("CANCELED: ErrorDetails=" + cancellation.getErrorDetails());
            System.out.println("CANCELED: Did you update the subscription info?");
        }
        handlerStatus = Boolean.FALSE;
        synthesizer.close();
        speechConfig.close();
        audioConfig.close();
        result.close();
    }
}
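Since text mode writes its result to a WAV file at an absolute path, it is safest to make sure the target folder exists before handing the path to the synthesizer. A minimal sketch using only java.nio.file; `OutputPaths` and `prepare` are hypothetical names, not part of the SDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that prepares the output location for a synthesized file. */
final class OutputPaths {

    private OutputPaths() {
    }

    /**
     * Creates dir (and any missing parents) and returns the absolute path of
     * fileName inside it. The file itself is not created here.
     */
    static Path prepare(Path dir, String fileName) {
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException("cannot create output folder " + dir, e);
        }
        return dir.resolve(fileName).toAbsolutePath();
    }
}
```

The returned path can then be passed on, for example as `OutputPaths.prepare(Path.of("/opt/voice"), "test.wav").toString()`.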

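② SSML mode

Also for plain text; the main pattern is to adjust additional speech synthesis parameters

Adjustable settings: voice locale, voice speaker, volume, pitch, speaking rate

Mainly uses JAXB

1) My own definitions: a base interface plus three types

Base interface: extracted for reuse and turned into a base interface

Plain text class: converts plain text directly to speech

Rate adjustment class: converts to speech after editing the speaking rate, speaker, volume, and so on

Neural voice class: uses the neural voice mode to edit attributes specific to neural voices

Base interface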

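import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;

public interface VoiceBase<T> {

    /**
     * Sets the text to be spoken.
     *
     * @param text the text to synthesize
     * @return the implementing object, for chaining
     */
    T voiceText(String text);

    /**
     * Sets the language,
     * i.e. the name value in the XML.
     *
     * @param name the value to set
     */
    void voiceLang(String name);


    /**
     * Serializes this object directly to an XML String.
     *
     * @return the XML text
     */
    default String convertToXml() {
        Object obj = this;
        // Output buffer
        StringWriter sw = new StringWriter();
        try {
            // Uses JAXB, bundled with the JDK up to Java 8 (a separate dependency from Java 11 on)
            JAXBContext context = JAXBContext.newInstance(obj.getClass());
            Marshaller marshaller = context.createMarshaller();
            // Pretty-print the XML output
            marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
            // Marshal the object into the writer as XML
            marshaller.marshal(obj, sw);
        } catch (JAXBException e) {
            e.printStackTrace();
        }
        return sw.toString();
    }

}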

Plain text class

/**
 * Plain statement; no rate or other adjustments.
 */
@XmlRootElement(name = "speak")
public class VoiceXml implements VoiceBase<VoiceXml> {

    @XmlAttribute(name = "version")
    private String version = "1.0";

    @XmlAttribute(name = "xmlns")
    private String xmlns = "http://www.w3.org/2001/10/synthesis";

    @XmlAttribute(name = "xml:lang")
    private String lang = "zh-HK";

    // Instance field rather than static, so separate objects do not share one Voice
    @XmlElement(name = "voice")
    private Voice voice = new Voice();

    public VoiceXml() {
    }

    @Override
    public void voiceLang(String name) {
        voice.name = name;
    }

    public VoiceXml(String name) {
        if (null != name) {
            voice.name = name;
        }
    }


    @Override
    public VoiceXml voiceText(String text) {
        voice.html = text;
        // Return this so calls can be chained, matching VoiceBase<T>
        return this;
    }

    public static class Voice {

        @XmlAttribute(name = "name")
        private String name = "zh-HK-TracyRUS";

        /**
         * The text to be converted to speech
         */
        @XmlValue
        private String html = "This is awesome";
    }

    // Not referenced by the Voice element above
    public static class Prosody {

        @XmlAttribute(name = "prosody")
        private String style = "cheerful";

        /**
         * The text to be converted to speech
         */
        @XmlElement
        private String html = "This is awesome";

        public String getHtml() {
            return html;
        }

        public void setHtml(String html) {
            this.html = html;
        }
    }

}

Rate adjustment class

/**
 * Statement with rate and other adjustments.
 * StringUtils is assumed to come from Apache Commons Lang.
 */
@XmlRootElement(name = "speak")
public class VoiceRateXml implements VoiceBase<VoiceRateXml> {
    private static final String Default = "default";

    @XmlAttribute(name = "version")
    private String version = "1.0";

    @XmlAttribute(name = "xmlns")
    private String xmlns = "http://www.w3.org/2001/10/synthesis";

    @XmlAttribute(name = "xml:lang")
    private String lang = "zh-HK";

    // Instance field rather than static, so separate objects do not share one Voice
    @XmlElement(name = "voice")
    private Voice voice = new Voice();

    public VoiceRateXml() {
    }

    @Override
    public void voiceLang(String lang) {
        if (null != lang) {
            // Assign the field; "lang = lang" only reassigned the parameter
            this.lang = lang;
        }
    }

    public VoiceRateXml VoiceSpeakName(String name) {
        if (null != name) {
            voice.name = name;
        }
        return this;
    }

    public VoiceRateXml(String name) {
        if (null != name) {
            voice.name = name;
        }
    }


    @Override
    public VoiceRateXml voiceText(String text) {
        voice.prosody.html = text;
        return this;
    }

    /**
     * Rate: a value from 0 ~ 1.0, or one of
     * x-slow, slow, medium, fast, x-fast; falls back to "default"
     * {@link Prosody#rate}
     *
     * @param rate the speaking rate
     * @return this object, for chaining
     */
    public VoiceRateXml VoiceRate(String rate) {
        if (StringUtils.isNotEmpty(rate)) {
            voice.prosody.rate = rate;
        } else {
            voice.prosody.rate = Default;
        }
        return this;
    }

    /**
     * Volume: 0.0 ~ 100.0, or one of silent, x-soft, soft, medium, loud, x-loud;
     * falls back to "default"
     * {@link Prosody#volume}
     *
     * @param volume the volume
     * @return this object, for chaining
     */
    public VoiceRateXml VoiceVolume(String volume) {
        if (StringUtils.isNotEmpty(volume)) {
            voice.prosody.volume = volume;
        } else {
            voice.prosody.volume = Default;
        }
        return this;
    }

    /**
     * Pitch: 0.0 ~ 100.0, or one of x-low, low, medium, high, x-high;
     * falls back to "default"
     * {@link Prosody#pitch}
     *
     * @param pitch the pitch
     * @return this object, for chaining
     */
    public VoiceRateXml VoicePitch(String pitch) {
        if (StringUtils.isNotEmpty(pitch)) {
            voice.prosody.pitch = pitch;
        } else {
            voice.prosody.pitch = Default;
        }
        return this;
    }

    public static class Voice {

        @XmlAttribute(name = "name")
        private String name = "zh-HK-TracyRUS";

        @XmlElement(name = "prosody")
        private Prosody prosody = new Prosody();

    }

    public static class Prosody {

        @XmlAttribute(name = "rate")
        private String rate = "default";

        @XmlAttribute(name = "volume")
        private String volume = "default";

        @XmlAttribute(name = "pitch")
        private String pitch = "default";

        /**
         * The text to be converted to speech
         */
        @XmlValue
        private String html = "This is awesome";
    }
}
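For reference, the kind of SSML document the rate adjustment class is meant to serialize to can also be assembled by hand. This is a sketch rather than the approach used in this post: `ProsodySsml` is a hypothetical name, the attribute layout mirrors the annotated fields above, and the text is escaped manually because plain string concatenation, unlike JAXB, does not do it for you.

```java
/** Hypothetical hand-rolled builder for a speak/voice/prosody SSML document. */
final class ProsodySsml {

    private ProsodySsml() {
    }

    /** Escapes the five XML special characters so the input cannot break the markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Builds the SSML on a single line; every argument is escaped before insertion. */
    static String build(String lang, String voiceName,
                        String rate, String volume, String pitch, String text) {
        return "<speak version=\"1.0\""
                + " xmlns=\"http://www.w3.org/2001/10/synthesis\""
                + " xml:lang=\"" + escape(lang) + "\">"
                + "<voice name=\"" + escape(voiceName) + "\">"
                + "<prosody rate=\"" + escape(rate) + "\""
                + " volume=\"" + escape(volume) + "\""
                + " pitch=\"" + escape(pitch) + "\">"
                + escape(text)
                + "</prosody></voice></speak>";
    }
}
```

A call such as `ProsodySsml.build("zh-HK", "zh-HK-TracyRUS", "default", "default", "default", "This is awesome")` yields a string that can be passed wherever an SSML String is expected.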

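Neural voice class

More details can be found in the official documentation below as needed

https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp

I normally use the East Asia region, which does not meet the requirements for neural voices, so I kept this part simple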

/**
 * Neural voice statement, with neural-specific customization.
 * <p>
 * only supported in {southeastasia, eastus, eastus2, westeurope}
 */
@XmlRootElement(name = "speak")
public class VoiceNatureXml implements VoiceBase<VoiceNatureXml> {

    @XmlAttribute(name = "version")
    private String version = "1.0";

    @XmlAttribute(name = "xmlns")
    private String xmlns = "http://www.w3.org/2001/10/synthesis";

    @XmlAttribute(name = "xmlns:mstts")
    private String xmlns_mstts = "https://www.w3.org/2001/mstts";


    @XmlAttribute(name = "xml:lang")
    private String lang = "zh-HK";

    // Instance field rather than static, so separate objects do not share one Voice
    @XmlElement(name = "voice")
    private Voice voice = new Voice();

    public VoiceNatureXml() {
    }

    @Override
    public void voiceLang(String name) {
        voice.name = name;
    }

    public VoiceNatureXml(String name) {
        if (null != name) {
            voice.name = name;
        }
    }

    @Override
    public VoiceNatureXml voiceText(String text) {
        voice.mstts.html = text;
        // Return this so calls can be chained, matching VoiceBase<T>
        return this;
    }

    public static class Voice {

        @XmlAttribute(name = "name")
        private String name = "zh-HK-HiuGaaiNeural";

        @XmlElement(name = "mstts:express-as")
        private Mstts mstts = new Mstts();

    }

    public static class Mstts {

        @XmlAttribute(name = "style")
        private String style = "cheerful";

        /**
         * The text to be converted to speech
         */
        @XmlValue
        private String html = "This is awesome";

    }

}
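Because the class comment above restricts neural voices to a handful of regions, it can help to check the configured region before building a neural request. The sketch below hard-codes the four regions named in that comment; that list reflects what this post recorded and may well be out of date, and `NeuralRegions` is a hypothetical name.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical guard based on the region list quoted in the class comment above. */
final class NeuralRegions {

    /** Regions named in the post; treat as a snapshot, not an authoritative list. */
    static final Set<String> SUPPORTED =
            Set.of("southeastasia", "eastus", "eastus2", "westeurope");

    private NeuralRegions() {
    }

    /** Returns true when the region (case-insensitive) appears in the snapshot list. */
    static boolean supportsNeural(String region) {
        return region != null && SUPPORTED.contains(region.toLowerCase(Locale.ROOT));
    }
}
```

A caller could use it to decide between VoiceNatureXml and one of the other two classes.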

2) Core method

// The synthesizer is configured the same way as before; customize further as needed
// SpeakSsmlAsync takes the SSML document as a String and returns the synthesized audio
Future<SpeechSynthesisResult> synResult = synthesizer.SpeakSsmlAsync(ssml);
SpeechSynthesisResult result = synResult.get();
while (handlerStatus) {
    System.out.println(result.getReason());
    System.out.println(ResultReason.SynthesizingAudioCompleted);
    if (result.getReason() == ResultReason.SynthesizingAudioCompleted) {
        System.out.println("Speech synthesized to speaker for text [" + transformText + "]");
        handlerStatus = Boolean.FALSE;
        synthesizer.close();
        speechConfig.close();
        result.close();

    } else if (result.getReason() == ResultReason.Canceled) {
        SpeechSynthesisCancellationDetails cancellation = SpeechSynthesisCancellationDetails.fromResult(result);
        System.out.println("CANCELED: Reason=" + cancellation.getReason());

        if (cancellation.getReason() == CancellationReason.Error) {
            System.out.println("CANCELED: ErrorCode=" + cancellation.getErrorCode());
            System.out.println("CANCELED: ErrorDetails=" + cancellation.getErrorDetails());
            System.out.println("CANCELED: Did you update the subscription info?");
        }
        handlerStatus = Boolean.FALSE;
        synthesizer.close();
        speechConfig.close();
        result.close();
    }
}
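The snippet above blocks on `synResult.get()` with no time limit. Since SpeakSsmlAsync hands back a java.util.concurrent.Future, one option is to wait with a timeout instead. The helper below is a generic sketch over any Future, not SDK code: `BoundedWait` and `await` are hypothetical names, and whether a given Future implementation honours cancellation is something to verify for the SDK version in use.

```java
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: wait for an asynchronous result, but only for a bounded time. */
final class BoundedWait {

    private BoundedWait() {
    }

    /**
     * Returns the future's value, or an empty Optional if it did not finish within
     * timeoutMillis (the future is then asked to cancel) or the wait was interrupted.
     * A failure inside the task is rethrown as IllegalStateException.
     */
    static <T> Optional<T> await(Future<T> future, long timeoutMillis) {
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("synthesis task failed", e.getCause());
        }
    }
}
```

With this in place, the result would be obtained as `BoundedWait.await(synResult, 30_000)` and the empty case handled like a cancellation.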

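③ About the SpeechConfig configuration

Generally speaking, this is the default configuration and is what gets used outside SSML mode. The link below goes straight to the corresponding methods.

https://docs.microsoft.com/zh-cn/java/api/com.microsoft.cognitiveservices.speech.SpeechConfig?view=azure-java-stable

// This effectively wraps the configuration; audio-related settings exist as well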
    private static SpeechConfig _translang(SpeechConfig speechConfig, String lang, String sex) {
        if (null == sex) {
            sex = HUMAN;
        }
        if (HUMAN.equals(sex)) {
            if (EN.equals(lang)) {
                speechConfig.setSpeechSynthesisLanguage("en-US");
                speechConfig.setSpeechSynthesisVoiceName("en-US-ZiraRUS");
            } else if (SC.equals(lang)) {
                speechConfig.setSpeechSynthesisLanguage("zh-CN");
                speechConfig.setSpeechSynthesisVoiceName("zh-CN-HuihuiRUS");
            } else {
                speechConfig.setSpeechSynthesisLanguage("zh-HK");
                speechConfig.setSpeechSynthesisVoiceName("zh-HK-TracyRUS");
            }
        } else {
            if (EN.equals(lang)) {
                speechConfig.setSpeechSynthesisLanguage("en-US");
                speechConfig.setSpeechSynthesisVoiceName("en-US-BenjaminRUS");
            } else if (SC.equals(lang)) {
                speechConfig.setSpeechSynthesisLanguage("zh-CN");
                speechConfig.setSpeechSynthesisVoiceName("zh-CN-Kangkang-Apollo");
            } else {
                speechConfig.setSpeechSynthesisLanguage("zh-HK");
                speechConfig.setSpeechSynthesisVoiceName("zh-HK-Danny-Apollo");
            }
        }
        speechConfig.requestWordLevelTimestamps();
        return speechConfig;
    }
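The nested if/else above is essentially a lookup from (language, voice group) to a locale and a voice name, so it can also be written as a table. In this sketch the codes "EN", "SC", "TC", "F" and "M" are my own placeholder assumptions (the post does not show the values of its EN, SC and HUMAN constants), `VoiceTable` is a hypothetical name, and the voice names are the ones from the method above. One deliberate difference: any group other than "M" selects the first group, whereas the original sends every non-HUMAN value to the second.

```java
import java.util.Map;

/** Hypothetical table-driven version of the voice selection. */
final class VoiceTable {

    /** A locale plus the voice name to use with it. */
    static final class Choice {
        final String locale;
        final String voiceName;

        Choice(String locale, String voiceName) {
            this.locale = locale;
            this.voiceName = voiceName;
        }
    }

    // Keys are "<lang>/<group>"; the codes are placeholders, see the note above.
    private static final Map<String, Choice> TABLE = Map.of(
            "EN/F", new Choice("en-US", "en-US-ZiraRUS"),
            "SC/F", new Choice("zh-CN", "zh-CN-HuihuiRUS"),
            "TC/F", new Choice("zh-HK", "zh-HK-TracyRUS"),
            "EN/M", new Choice("en-US", "en-US-BenjaminRUS"),
            "SC/M", new Choice("zh-CN", "zh-CN-Kangkang-Apollo"),
            "TC/M", new Choice("zh-HK", "zh-HK-Danny-Apollo"));

    private VoiceTable() {
    }

    /** Unknown or null languages fall back to zh-HK; unknown or null groups to "F". */
    static Choice pick(String lang, String group) {
        String g = "M".equals(group) ? "M" : "F";
        String l = ("EN".equals(lang) || "SC".equals(lang)) ? lang : "TC";
        return TABLE.get(l + "/" + g);
    }
}
```

The two fields of the returned Choice would then feed setSpeechSynthesisLanguage and setSpeechSynthesisVoiceName respectively.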

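RESTful mode:

This is simpler still. Once the voice and region are decided, only the following two endpoints are needed

Speech synthesis: https://eastasia.tts.speech.microsoft.com/cognitiveservices/v1

Token issuance: https://eastasia.api.cognitive.microsoft.com/sts/v1.0/issueToken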
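Both URLs start with the region name, so they can be derived from the Location value instead of being hard-coded. A small sketch, assuming other regions follow the same host pattern as the eastasia URLs here; `SpeechEndpoints` is a hypothetical name.

```java
import java.net.URI;

/** Hypothetical helper that derives the two REST endpoints from a region name. */
final class SpeechEndpoints {

    private SpeechEndpoints() {
    }

    /** Endpoint that turns SSML into audio. */
    static URI synthesis(String region) {
        return URI.create("https://" + region + ".tts.speech.microsoft.com/cognitiveservices/v1");
    }

    /** Endpoint that exchanges the subscription key for a token. */
    static URI issueToken(String region) {
        return URI.create("https://" + region + ".api.cognitive.microsoft.com/sts/v1.0/issueToken");
    }
}
```

Calling `SpeechEndpoints.synthesis("eastasia")` reproduces the first URL listed above.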

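1) Get the token

This is a POST request; the headers must include Ocp-Apim-Subscription-Key

 // In short, set it as a request header; the response body is a plain string, which is the token itself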
 headMap.put("Ocp-Apim-Subscription-Key", yourSubscKey);
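For a version that does not depend on a particular HTTP helper, the token request can be built with the JDK's own java.net.http client (Java 11 or later). This is a sketch only: `TokenRequest` is a hypothetical name, and the URL pattern is taken from the eastasia endpoint above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical builder for the token request described above. */
final class TokenRequest {

    private TokenRequest() {
    }

    /** Builds a POST with an empty body and the subscription key as a header. */
    static HttpRequest build(String region, String subscriptionKey) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://" + region + ".api.cognitive.microsoft.com/sts/v1.0/issueToken"))
                .header("Ocp-Apim-Subscription-Key", subscriptionKey)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

Sending it with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` returns a response whose body is the token string.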

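2) Text to speech

We send the SSML content as the body of a POST request and expect an audio stream back

/** Request headers
* Token: the token obtained in step 1
* User-Agent: any name you like, used to identify yourself
**/
// Microsoft's REST documentation shows this value as "Bearer <token>"; Token is assumed to already carry that prefix
headMap.put("Authorization", Token);
headMap.put("Content-Type", "application/ssml+xml");
headMap.put("X-Microsoft-OutputFormat", "audio-16khz-64kbitrate-mono-mp3");
headMap.put("User-Agent", "TEST");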
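The same four headers can be attached with the JDK's java.net.http client (Java 11 or later). Again a sketch under assumptions: `SynthesisRequest` is a hypothetical name, the URL pattern comes from the eastasia endpoint above, and the "Bearer " prefix follows Microsoft's REST documentation rather than the snippet above, which passes Token as-is.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical builder for the text-to-speech request described above. */
final class SynthesisRequest {

    private SynthesisRequest() {
    }

    /** Builds a POST carrying the SSML as a UTF-8 body, plus the four headers. */
    static HttpRequest build(String region, String token, String ssml) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://" + region + ".tts.speech.microsoft.com/cognitiveservices/v1"))
                .header("Authorization", "Bearer " + token) // prefix is an assumption, see above
                .header("Content-Type", "application/ssml+xml")
                .header("X-Microsoft-OutputFormat", "audio-16khz-64kbitrate-mono-mp3")
                .header("User-Agent", "TEST")
                .POST(HttpRequest.BodyPublishers.ofString(ssml, StandardCharsets.UTF_8))
                .build();
    }
}
```

Sending it with `HttpResponse.BodyHandlers.ofFile(path)` would write the returned MP3 stream straight to disk.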

Corresponding Git repository:

https://github.com/CoffeeAndIce/spring_prevalent_assemble/tree/master/microsoft
