Analysis of Microsoft TTS Latency Factors

Microsoft speech synthesis latency is affected by several factors. The tests below measure how much each factor contributes and draw conclusions.

Request Methods

The Microsoft SDK exposes eight APIs for requesting speech synthesis, which vary along three axes: Text vs. SSML, synchronous vs. asynchronous, and streaming vs. non-streaming. We measure the synthesis latency of each combination. In addition, we compare different regions, pre-connecting/reusing the SpeechSynthesizer, and synthesizing repeated content.

  1. SpeakText

  2. SpeakTextAsync

  3. StartSpeakingText

  4. StartSpeakingTextAsync

  5. SpeakSsml

  6. SpeakSsmlAsync

  7. StartSpeakingSsml

  8. StartSpeakingSsmlAsync

Conclusion

The factors with the greatest impact on synthesis latency are the region and whether streaming is used.

Test Configuration

Voice: zh-CN-XiaochenMultilingualNeural

Language: en-US

Output format: SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3

Test content: 30 randomly generated Chinese characters
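Every test below follows the same measurement pattern: time a call with System.currentTimeMillis(), repeat it many times, and average the results. As a minimal, SDK-independent sketch of that loop (the class and method names here are illustrative, not from the Speech SDK):

```java
import java.util.ArrayList;
import java.util.List;

public class LatencyProbe {
    // Run the action `times` times and record each call's latency in milliseconds.
    public static List<Long> measure(Runnable action, int times) {
        List<Long> latencies = new ArrayList<>();
        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            action.run();
            latencies.add(System.currentTimeMillis() - start);
        }
        return latencies;
    }

    // Average latency in milliseconds; 0 for an empty list.
    public static double average(List<Long> latencies) {
        return latencies.stream().mapToLong(Long::longValue).average().orElse(0);
    }
}
```

In the real tests, the `action` would be a call such as `speechSynthesizerZH.SpeakText(text)`.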

Text or SSML

Text: speechSynthesizerZH.SpeakText(text);

SSML: speechSynthesizerZH.SpeakSsml(ssml);

Conclusion

The two methods differ very little: Text is roughly 3–5% faster than SSML.

Test Results

NO.   SpeakText     SpeakSsml     Difference
0     574           611           -37
1     773           697           76
2     580           670           -90
3     546           730           -184
4     808           684           124
5     686           761           -75
6     636           550           86
7     471           665           -194
8     553           665           -112
9     607           699           -92
10    609           580           29
11    531           985           -454
12    795           1030          -235
13    549           612           -63
14    622           638           -16
15    545           895           -350
16    563           621           -58
17    547           578           -31
18    580           652           -72
19    532           653           -121
20    1157          595           562
21    713           788           -75
22    609           606           3
23    595           776           -181
24    621           562           59
25    566           682           -116
26    564           669           -105
27    547           627           -80
28    592           727           -135
29    1158          597           561
30    542           578           -36
31    880           593           287
32    653           785           -132
33    598           591           7
34    697           593           104
35    701           653           48
36    565           880           -315
37    591           622           -31
38    637           635           2
39    676           562           114
40    548           742           -194
41    708           624           84
42    498           575           -77
43    610           524           86
44    634           563           71
45    687           618           69
46    684           698           -14
47    746           668           78
48    508           530           -22
49    636           610           26
50    742           655           87
avg   642.5490196   664.7843137   -22.23529412

Test Code

void textOrSsml() {
    SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
    config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);
    
    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    // ---- Init Link ---- //
    {
        var text = getRandomChinese(TEXT_LENGTH);
        long s = System.currentTimeMillis();
        speechSynthesizerZH.SpeakText(text);

        var ssmlText = getRandomChinese(TEXT_LENGTH);
        String ssml = buildSsml(ssmlText, "zh-CN-XiaochenMultilingualNeural", "en-US");
        speechSynthesizerZH.SpeakSsml(ssml);
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    var textTimes = new ArrayList<Long>();
    var ssmlTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            var text = getRandomChinese(TEXT_LENGTH);
            String ssml = buildSsml(text, "zh-CN-XiaochenMultilingualNeural", "en-US");
            long s = System.currentTimeMillis();
            SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.SpeakSsml(ssml);
            ssmlTimes.add(System.currentTimeMillis() - s);
        }
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.SpeakText(text);
            textTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tSpeakText\tSpeakSsml\tDifference\n");
    for (int i = 0; i < textTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(textTimes.get(i)).append("\t");
        report.append(ssmlTimes.get(i)).append("\t");
        report.append(textTimes.get(i) - ssmlTimes.get(i)).append("\n");
    }
    double textTimeAvg = textTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double ssmlTimeAvg = ssmlTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(textTimeAvg).append("\t");
    report.append(ssmlTimeAvg).append("\t");
    report.append(textTimeAvg - ssmlTimeAvg).append("\n");
    System.out.println(report);
}

Synchronous or Asynchronous

Synchronous: speechSynthesizerZH.SpeakText(text);

Asynchronous: speechSynthesizerZH.SpeakTextAsync(text).get();

Conclusion

The asynchronous method simply invokes SpeakText(text) on a thread pool, so for a single request the latency is essentially identical. The SSML variants show the same result.

Test Results

NO.   SpeakText     SpeakTextAsync  Difference
0     606           567             39
1     589           604             -15
2     717           689             28
3     544           519             25
4     665           652             13
5     623           596             27
6     636           574             62
7     655           639             16
8     591           578             13
9     636           759             -123
10    563           503             60
11    653           622             31
12    536           719             -183
13    633           621             12
14    516           532             -16
15    687           470             217
16    590           609             -19
17    582           610             -28
18    590           590             0
19    623           500             123
20    639           716             -77
21    522           581             -59
22    908           556             352
23    581           574             7
24    593           565             28
25    607           591             16
26    594           502             92
27    550           834             -284
28    624           717             -93
29    518           590             -72
30    666           488             178
31    659           605             54
32    470           548             -78
33    539           496             43
34    547           515             32
35    517           644             -127
36    575           513             62
37    559           562             -3
38    683           773             -90
39    537           716             -179
40    777           565             212
41    560           1126            -566
42    552           578             -26
43    1086          655             431
44    789           791             -2
45    641           730             -89
46    485           590             -105
47    867           576             291
48    578           1076            -498
49    657           579             78
50    681           537             144
avg   623.4509804   624.3529412     -0.901960784

Test Code

void syncOrAsync() throws ExecutionException, InterruptedException {
    SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
    config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);

    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    // ---- Init Link ---- //
    {
        var text1 = getRandomChinese(TEXT_LENGTH);
        var text2 = getRandomChinese(TEXT_LENGTH);
        long s = System.currentTimeMillis();
        speechSynthesizerZH.SpeakText(text1);
        speechSynthesizerZH.SpeakTextAsync(text2).get();
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    var textTimes = new ArrayList<Long>();
    var testAsyncTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizerZH.SpeakText(text);
            textTimes.add(System.currentTimeMillis() - s);
        }
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizerZH.SpeakTextAsync(text).get();
            testAsyncTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tSpeakText\tSpeakTextAsync\tDifference\n");
    for (int i = 0; i < textTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(textTimes.get(i)).append("\t");
        report.append(testAsyncTimes.get(i)).append("\t");
        report.append(textTimes.get(i) - testAsyncTimes.get(i)).append("\n");
    }
    double textTimeAvg = textTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double ssmlTimeAvg = testAsyncTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(textTimeAvg).append("\t");
    report.append(ssmlTimeAvg).append("\t");
    report.append(textTimeAvg - ssmlTimeAvg).append("\n");
    System.out.println(report);
}

Streaming or Non-Streaming

Streaming: speechSynthesizerZH.StartSpeakingText(text);

Non-streaming: speechSynthesizerZH.SpeakText(text);

Conclusion

The first audio chunk from the streaming API arrives about 30% sooner than the complete result from the non-streaming API (at 30 characters). At 10 characters the gap is about 20%; the longer the text, the greater the streaming advantage.

Test Results

NO.   NotStream     Stream        Difference
0     728           746           -18
1     701           483           218
2     744           452           292
3     728           505           223
4     628           481           147
5     657           502           155
6     701           437           264
7     713           489           224
8     657           469           188
9     731           608           123
10    656           472           184
11    681           502           179
12    727           467           260
13    796           531           265
14    689           439           250
15    711           474           237
16    790           410           380
17    719           698           21
18    671           423           248
19    791           595           196
20    736           410           326
21    775           488           287
22    644           465           179
23    700           460           240
24    718           441           277
25    819           530           289
26    671           426           245
27    595           438           157
28    746           516           230
29    853           504           349
30    700           485           215
31    787           466           321
32    734           470           264
33    680           426           254
34    928           458           470
35    729           434           295
36    689           699           -10
37    761           428           333
38    722           472           250
39    661           591           70
40    719           502           217
41    756           482           274
42    708           426           282
43    858           456           402
44    728           380           348
45    729           472           257
46    667           443           224
47    762           441           321
48    699           380           319
49    594           379           215
50    625           499           126
avg   719.8431373   483.3333333   236.5098039

Test Code

void streamOrNotStream() {
    SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
    config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);

    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    // ---- Init Link ---- //
    {
        var text1 = getRandomChinese(TEXT_LENGTH);
        var text2 = getRandomChinese(TEXT_LENGTH);
        long s = System.currentTimeMillis();
        speechSynthesizerZH.SpeakText(text1);
        SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.StartSpeakingText(text2);
        AudioDataStream audioDataStream = AudioDataStream.fromResult(speechSynthesisResult);
        byte[] buffer = new byte[10000];
        long filledSize = audioDataStream.readData(buffer);
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    var notStreamTimes = new ArrayList<Long>();
    var streamTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizerZH.SpeakText(text);
            notStreamTimes.add(System.currentTimeMillis() - s);
        }
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.StartSpeakingText(text);
            AudioDataStream audioDataStream = AudioDataStream.fromResult(speechSynthesisResult);
            byte[] buffer = new byte[8000];
            long filledSize = audioDataStream.readData(buffer);
            streamTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tNotStream\tStream\tDifference\n");
    for (int i = 0; i < notStreamTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(notStreamTimes.get(i)).append("\t");
        report.append(streamTimes.get(i)).append("\t");
        report.append(notStreamTimes.get(i) - streamTimes.get(i)).append("\n");
    }
    double notStreamTimeAvg = notStreamTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double streamTimeAvg = streamTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(notStreamTimeAvg).append("\t");
    report.append(streamTimeAvg).append("\t");
    report.append(notStreamTimeAvg - streamTimeAvg).append("\n");
    System.out.println(report);
}

Different Regions

Regions tested: eastasia (East Asia), southeastasia (Southeast Asia), japanwest (Japan West), japaneast (Japan East)

Request origin: Tokyo

Conclusion

japanwest (Japan West: 505 ms) < japaneast (Japan East: 542 ms) < eastasia (East Asia: 630 ms) < southeastasia (Southeast Asia: 733 ms)

Switching from eastasia to japanwest cuts latency by about 20%.

Test Results

NO.   japaneast   eastasia   japanwest   southeastasia
1     507         843        771         607
2     507         626        465         1072
3     500         609        445         693
4     594         597        488         815
5     627         679        482         663
6     563         626        440         642
7     724         828        585         714
8     502         599        469         848
9     573         534        599         675
10    506         627        420         703
11    519         597        522         611
12    579         600        448         917
13    599         653        608         704
14    540         624        439         871
15    606         625        517         647
16    557         530        468         730
17    580         601        468         792
18    590         575        414         633
19    521         520        447         690
20    643         617        448         687
21    542         772        448         688
22    486         521        414         664
23    421         662        727         678
24    607         559        439         728
25    453         593        1277        619
26    455         538        422         814
27    547         574        470         625
28    604         619        469         654
29    525         601        538         637
30    482         510        456         793
31    556         556        426         830
32    480         566        448         642
33    585         633        699         739
34    530         637        452         822
35    520         521        438         765
36    457         533        532         717
37    548         631        572         1102
38    508         841        427         655
39    508         583        463         802
40    623         793        427         693
41    453         1141       513         745
42    584         606        432         669
43    538         703        538         1022
44    537         715        437         668
45    549         526        513         714
46    534         569        407         707
47    582         585        444         683
48    492         629        452         681
49    511         572        603         764
50    573         705        463         648
avg   542.54      630.08     505.78      733.64

Test Code

static String diffRegion(int textLength, int loopTimes, Pair<String, String>... regions) {
    Map<String, SpeechSynthesizer> regionSpeechSynthesizer = new HashMap<>();
    Map<String, List<Long>> regionTimes = new HashMap<>();
    for (Pair<String, String> pair : regions) {
        val region = pair.getRight();
        SpeechConfig config = SpeechConfig.fromSubscription(pair.getLeft(), region);
        config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
        config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
        SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config, null);

        regionSpeechSynthesizer.put(region, speechSynthesizer);
        regionTimes.put(region, new ArrayList<>());
    }

    // ---- Init Link ---- //
    {
        long s = System.currentTimeMillis();
        for (Map.Entry<String, SpeechSynthesizer> entry : regionSpeechSynthesizer.entrySet()) {
            var text = getRandomChinese(textLength);
            SpeechSynthesisResult speechSynthesisResult = entry.getValue().SpeakText(text);
            if (speechSynthesisResult.getReason() == ResultReason.Canceled) {
                SpeechSynthesisCancellationDetails speechSynthesisCancellationDetails = SpeechSynthesisCancellationDetails.fromResult(speechSynthesisResult);
                System.out.println("failed: " + entry.getKey());
                System.out.println(speechSynthesisCancellationDetails.getErrorDetails());
            } else {
                System.out.println("ok:" + entry.getKey());
            }
        }
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    for (int i = 1; i <= loopTimes; i++) {
        System.out.println("loop NO." + i);
        for (Map.Entry<String, SpeechSynthesizer> entry : regionSpeechSynthesizer.entrySet()) {
            var text = getRandomChinese(textLength);
            long s = System.currentTimeMillis();
            entry.getValue().SpeakText(text);
            regionTimes.get(entry.getKey()).add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.");
    for (String region : regionTimes.keySet()) {
        report.append("\t" + region);
    }
    report.append("\n");

    for (int i = 0; i < loopTimes; i++) {
        report.append(i + 1).append("\t");
        for (List<Long> times : regionTimes.values()) {
            report.append(times.get(i)).append("\t");
        }
        report.append("\n");
    }
    // Compute average time per region
    report.append("avg").append("\t");
    for (List<Long> times : regionTimes.values()) {
        double avg = times.stream().mapToLong(Long::longValue).average().orElse(0);
        report.append(avg).append("\t");
    }
    report.append("\n");
    return report.toString();
}

Pre-connecting the SpeechSynthesizer

Pre-connect approach:

SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, (AudioConfig) null);
Connection connection = Connection.fromSpeechSynthesizer(synthesizer);
connection.openConnection(true);

Conclusion

No difference in synthesis latency.

Test Results

NO.   PreConnect    NotPreConnect  Difference
0     350           334            16
1     347           305            42
2     350           410            -60
3     350           350            0
4     335           350            -15
5     349           336            13
6     364           348            16
7     334           349            -15
8     347           350            -3
9     348           349            -1
10    347           365            -18
11    349           339            10
12    354           351            3
13    336           348            -12
14    350           368            -18
15    330           368            -38
16    337           349            -12
17    336           353            -17
18    367           350            17
19    348           346            2
20    348           348            0
21    350           335            15
22    351           353            -2
23    350           349            1
24    352           352            0
25    355           351            4
26    335           337            -2
27    349           336            13
28    349           352            -3
29    352           363            -11
30    349           353            -4
31    350           335            15
32    352           303            49
33    349           349            0
34    352           349            3
35    351           349            2
36    349           351            -2
37    351           349            2
38    352           347            5
39    348           351            -3
40    349           351            -2
41    321           349            -28
42    349           349            0
43    365           351            14
44    350           350            0
45    349           351            -2
46    349           336            13
47    335           335            0
48    350           337            13
49    350           348            2
50    351           349            2
avg   347.8431373   347.7647059    0.078431373

Test Code

void preConnect() throws InterruptedException {
    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    var reuseTimes = new ArrayList<Long>();
    var notReuseTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
            config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
            config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
            SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config1, null);
            Connection connection = Connection.fromSpeechSynthesizer(speechSynthesizer);
            connection.openConnection(true);

            Thread.sleep(200);

            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizer.SpeakText(text);
            reuseTimes.add(System.currentTimeMillis() - s);
        }

        {
            SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
            config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
            config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
            SpeechSynthesizer notPreConnectSpeechSynthesizer = new SpeechSynthesizer(config2, null);

            Thread.sleep(200);

            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            notPreConnectSpeechSynthesizer.SpeakText(text);
            notReuseTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tPreConnect\tNotPreConnect\tDifference\n");
    for (int i = 0; i < reuseTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(reuseTimes.get(i)).append("\t");
        report.append(notReuseTimes.get(i)).append("\t");
        report.append(reuseTimes.get(i) - notReuseTimes.get(i)).append("\n");
    }
    double avg1 = reuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double avg2 = notReuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(avg1).append("\t");
    report.append(avg2).append("\t");
    report.append(avg1 - avg2).append("\n");
    System.out.println(report);
}

Reusing the SpeechSynthesizer

Reuse: obtain a SpeechSynthesizer instance from a pool

No reuse: create a new SpeechSynthesizer for every request
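The pool mentioned above can be sketched with a blocking queue. This is a generic, illustrative sketch of our own (the class and method names are not from the Speech SDK):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// A tiny fixed-size object pool, usable for SpeechSynthesizer instances.
public class SimplePool<T> {
    private final BlockingQueue<T> idle;

    public SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        // Pre-create all instances up front so requests never pay creation cost.
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Blocks until an instance is available.
    public T borrow() throws InterruptedException {
        return idle.take();
    }

    // Return the instance to the pool after use.
    public void release(T obj) {
        idle.offer(obj);
    }
}
```

For example, `new SimplePool<>(4, () -> new SpeechSynthesizer(config, null))` would pre-create four synthesizers to borrow and release around each request.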

Conclusion

No difference in synthesis latency.

Test Results

NO.   Reuse         NotReuse      Difference
0     339           336           3
1     348           364           -16
2     335           353           -18
3     349           349           0
4     349           336           13
5     349           353           -4
6     363           349           14
7     363           353           10
8     348           351           -3
9     350           335           15
10    352           350           2
11    348           350           -2
12    351           349           2
13    355           349           6
14    352           352           0
15    350           349           1
16    368           350           18
17    350           350           0
18    349           348           1
19    352           364           -12
20    349           352           -3
21    349           351           -2
22    351           348           3
23    351           349           2
24    351           351           0
25    365           367           -2
26    350           381           -31
27    352           349           3
28    351           336           15
29    352           350           2
30    351           349           2
31    351           331           20
32    349           367           -18
33    351           352           -1
34    349           352           -3
35    349           350           -1
36    351           365           -14
37    353           346           7
38    351           351           0
39    348           350           -2
40    350           349           1
41    379           363           16
42    350           347           3
43    335           346           -11
44    352           347           5
45    350           352           -2
46    351           347           4
47    350           368           -18
48    353           347           6
49    350           335           15
50    366           350           16
avg   351.5686275   350.745098    0.823529412

Test Code

void reuseOrNot() {
    SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
    config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer reuseSpeechSynthesizer = new SpeechSynthesizer(config1, null);

    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    // ---- Init Link ---- //
    {
        var text1 = getRandomChinese(TEXT_LENGTH);
        long s = System.currentTimeMillis();
        reuseSpeechSynthesizer.SpeakText(text1);
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    var reuseTimes = new ArrayList<Long>();
    var notReuseTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            reuseSpeechSynthesizer.SpeakText(text);
            reuseTimes.add(System.currentTimeMillis() - s);
        }
        {
            long s = System.currentTimeMillis();

            SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
            config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
            config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
            SpeechSynthesizer notPreInitSpeechSynthesizer = new SpeechSynthesizer(config2, null);

            var text = getRandomChinese(TEXT_LENGTH);
            notPreInitSpeechSynthesizer.SpeakText(text);
            notReuseTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tReuse\tNotReuse\tDifference\n");
    for (int i = 0; i < reuseTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(reuseTimes.get(i)).append("\t");
        report.append(notReuseTimes.get(i)).append("\t");
        report.append(reuseTimes.get(i) - notReuseTimes.get(i)).append("\n");
    }
    double reuseTimeAvg = reuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double notReuseTimeAvg = notReuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(reuseTimeAvg).append("\t");
    report.append(notReuseTimeAvg).append("\t");
    report.append(reuseTimeAvg - notReuseTimeAvg).append("\n");
    System.out.println(report);
}

Synthesizing Repeated Content

The same text is synthesized repeatedly.

Conclusion

No difference in synthesis latency.

Test Results

NO.   SameText      NotSameText   Difference
0     363           365           -2
1     350           346           4
2     349           350           -1
3     333           349           -16
4     349           348           1
5     350           352           -2
6     335           350           -15
7     335           347           -12
8     350           365           -15
9     348           348           0
10    350           347           3
11    350           350           0
12    349           352           -3
13    350           350           0
14    334           348           -14
15    348           350           -2
16    352           349           3
17    335           351           -16
18    351           347           4
19    348           353           -5
20    349           347           2
21    335           347           -12
22    349           347           2
23    331           351           -20
24    350           352           -2
25    349           334           15
26    350           363           -13
27    349           349           0
28    350           347           3
29    350           352           -2
30    348           351           -3
31    351           350           1
32    350           320           30
33    348           349           -1
34    380           351           29
35    347           353           -6
36    347           340           7
37    349           347           2
38    349           351           -2
39    333           350           -17
40    366           353           13
41    363           347           16
42    353           348           5
43    350           353           -3
44    351           349           2
45    350           347           3
46    349           351           -2
47    348           348           0
48    337           350           -13
49    332           352           -20
50    352           349           3
avg   347.9215686   349.3137255   -1.392156863

Test Code

void sameText() throws InterruptedException {
    SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
    config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer sameSpeechSynthesizer = new SpeechSynthesizer(config1, null);

    SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
    config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config2, null);

    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    var sameTextTimes = new ArrayList<Long>();
    var notSameTextTimes = new ArrayList<Long>();

    var sameText = getRandomChinese(TEXT_LENGTH);
    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            long s = System.currentTimeMillis();
            sameSpeechSynthesizer.SpeakText(sameText);
            sameTextTimes.add(System.currentTimeMillis() - s);
        }

        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizer.SpeakText(text);
            notSameTextTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tSameText\tNotSameText\tDifference\n");
    for (int i = 0; i < sameTextTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(sameTextTimes.get(i)).append("\t");
        report.append(notSameTextTimes.get(i)).append("\t");
        report.append(sameTextTimes.get(i) - notSameTextTimes.get(i)).append("\n");
    }
    double avg1 = sameTextTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double avg2 = notSameTextTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(avg1).append("\t");
    report.append(avg2).append("\t");
    report.append(avg1 - avg2).append("\n");
    System.out.println(report);
}

Common Methods

public static String getRandomChinese(int length) {
    StringBuilder sb = new StringBuilder();
    Random random = new Random();

    for (int i = 0; i < length; i++) {
        int codePoint = 0x4e00 + random.nextInt(0x9fa5 - 0x4e00 + 1);
        sb.append((char) codePoint);
    }

    return sb.toString();
}
private final static String SSML_TEMPLATE = """
            <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{lang}">
              <voice name="{voiceName}">
                    {text}
              </voice>
            </speak>
            """;

public static String buildSsml(String text, String voiceName, String lang) {
    return SSML_TEMPLATE
            // If the language is set via a nested <lang xml:lang="{lang}"> tag instead,
            // the multilingual voice is not recognized
            .replaceAll("\\{lang\\}", lang)
            .replaceAll("\\{voiceName\\}", voiceName)
            .replaceAll("\\{text\\}", text);
}
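For reference, here is a minimal, self-contained usage of buildSsml with the template and method copied from above (using the standard http://www.w3.org/2001/10/synthesis namespace):

```java
public class SsmlDemo {
    private static final String SSML_TEMPLATE = """
            <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{lang}">
              <voice name="{voiceName}">
                    {text}
              </voice>
            </speak>
            """;

    public static String buildSsml(String text, String voiceName, String lang) {
        return SSML_TEMPLATE
                .replaceAll("\\{lang\\}", lang)
                .replaceAll("\\{voiceName\\}", voiceName)
                .replaceAll("\\{text\\}", text);
    }

    public static void main(String[] args) {
        // Prints the SSML document that the tests pass to SpeakSsml.
        System.out.println(buildSsml("你好", "zh-CN-XiaochenMultilingualNeural", "en-US"));
    }
}
```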
