Analysis of Microsoft TTS Latency Factors

Microsoft speech synthesis latency is affected by several factors. The tests below measure how much each factor contributes and draw conclusions.

Request Methods

The Microsoft SDK exposes eight APIs for requesting speech synthesis, which vary along three axes: Text vs. SSML, synchronous vs. asynchronous, and streaming vs. non-streaming. We measure the synthesis latency of each combination. In addition, we compare different regions, pre-connecting/reusing the SpeechSynthesizer, and synthesizing repeated content.

  1. SpeakText

  2. SpeakTextAsync

  3. StartSpeakingText

  4. StartSpeakingTextAsync

  5. SpeakSsml

  6. SpeakSsmlAsync

  7. StartSpeakingSsml

  8. StartSpeakingSsmlAsync

Conclusion

The factors with the greatest impact on synthesis latency are the region and whether streaming is used.

Test Configuration

Voice: zh-CN-XiaochenMultilingualNeural

Language: en-US

Output format: SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3

Test content: 30 randomly generated Chinese characters
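Every test below follows the same measurement pattern: time a call with System.currentTimeMillis(), repeat it many times, and average the results. As a minimal, SDK-independent sketch of that loop (the class and method names here are illustrative, not from the Speech SDK):

```java
import java.util.ArrayList;
import java.util.List;

public class LatencyProbe {
    // Run the action `times` times and record each call's latency in milliseconds.
    public static List<Long> measure(Runnable action, int times) {
        List<Long> latencies = new ArrayList<>();
        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            action.run();
            latencies.add(System.currentTimeMillis() - start);
        }
        return latencies;
    }

    // Average latency in milliseconds; 0 for an empty list.
    public static double average(List<Long> latencies) {
        return latencies.stream().mapToLong(Long::longValue).average().orElse(0);
    }
}
```

In the real tests, the `action` would be a call such as `speechSynthesizerZH.SpeakText(text)`.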

Text or SSML

Text: speechSynthesizerZH.SpeakText(text);

SSML: speechSynthesizerZH.SpeakSsml(ssml);

Conclusion

The two methods differ very little: Text is roughly 3–5% faster than SSML.

Test Results

NO.   SpeakText     SpeakSsml     Difference
0     574           611           -37
1     773           697           76
2     580           670           -90
3     546           730           -184
4     808           684           124
5     686           761           -75
6     636           550           86
7     471           665           -194
8     553           665           -112
9     607           699           -92
10    609           580           29
11    531           985           -454
12    795           1030          -235
13    549           612           -63
14    622           638           -16
15    545           895           -350
16    563           621           -58
17    547           578           -31
18    580           652           -72
19    532           653           -121
20    1157          595           562
21    713           788           -75
22    609           606           3
23    595           776           -181
24    621           562           59
25    566           682           -116
26    564           669           -105
27    547           627           -80
28    592           727           -135
29    1158          597           561
30    542           578           -36
31    880           593           287
32    653           785           -132
33    598           591           7
34    697           593           104
35    701           653           48
36    565           880           -315
37    591           622           -31
38    637           635           2
39    676           562           114
40    548           742           -194
41    708           624           84
42    498           575           -77
43    610           524           86
44    634           563           71
45    687           618           69
46    684           698           -14
47    746           668           78
48    508           530           -22
49    636           610           26
50    742           655           87
avg   642.5490196   664.7843137   -22.23529412

Test Code

void textOrSsml() {
    SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
    config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);
    
    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    // ---- Init Link ---- //
    {
        var text = getRandomChinese(TEXT_LENGTH);
        long s = System.currentTimeMillis();
        speechSynthesizerZH.SpeakText(text);

        var ssmlText = getRandomChinese(TEXT_LENGTH);
        String ssml = buildSsml(ssmlText, "zh-CN-XiaochenMultilingualNeural", "en-US");
        speechSynthesizerZH.SpeakSsml(ssml);
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    var textTimes = new ArrayList<Long>();
    var ssmlTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            var text = getRandomChinese(TEXT_LENGTH);
            String ssml = buildSsml(text, "zh-CN-XiaochenMultilingualNeural", "en-US");
            long s = System.currentTimeMillis();
            SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.SpeakSsml(ssml);
            ssmlTimes.add(System.currentTimeMillis() - s);
        }
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.SpeakText(text);
            textTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tSpeakText\tSpeakSsml\tDifference\n");
    for (int i = 0; i < textTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(textTimes.get(i)).append("\t");
        report.append(ssmlTimes.get(i)).append("\t");
        report.append(textTimes.get(i) - ssmlTimes.get(i)).append("\n");
    }
    double textTimeAvg = textTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double ssmlTimeAvg = ssmlTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(textTimeAvg).append("\t");
    report.append(ssmlTimeAvg).append("\t");
    report.append(textTimeAvg - ssmlTimeAvg).append("\n");
    System.out.println(report);
}

Synchronous or Asynchronous

Synchronous: speechSynthesizerZH.SpeakText(text);

Asynchronous: speechSynthesizerZH.SpeakTextAsync(text).get();

Conclusion

The asynchronous method simply invokes SpeakText(text) on a thread pool, so for a single request the latency is essentially identical. The SSML variants show the same result.

Test Results

NO.   SpeakText     SpeakTextAsync  Difference
0     606           567             39
1     589           604             -15
2     717           689             28
3     544           519             25
4     665           652             13
5     623           596             27
6     636           574             62
7     655           639             16
8     591           578             13
9     636           759             -123
10    563           503             60
11    653           622             31
12    536           719             -183
13    633           621             12
14    516           532             -16
15    687           470             217
16    590           609             -19
17    582           610             -28
18    590           590             0
19    623           500             123
20    639           716             -77
21    522           581             -59
22    908           556             352
23    581           574             7
24    593           565             28
25    607           591             16
26    594           502             92
27    550           834             -284
28    624           717             -93
29    518           590             -72
30    666           488             178
31    659           605             54
32    470           548             -78
33    539           496             43
34    547           515             32
35    517           644             -127
36    575           513             62
37    559           562             -3
38    683           773             -90
39    537           716             -179
40    777           565             212
41    560           1126            -566
42    552           578             -26
43    1086          655             431
44    789           791             -2
45    641           730             -89
46    485           590             -105
47    867           576             291
48    578           1076            -498
49    657           579             78
50    681           537             144
avg   623.4509804   624.3529412     -0.901960784

Test Code

void syncOrAsync() throws ExecutionException, InterruptedException {
    SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
    config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);

    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    // ---- Init Link ---- //
    {
        var text1 = getRandomChinese(TEXT_LENGTH);
        var text2 = getRandomChinese(TEXT_LENGTH);
        long s = System.currentTimeMillis();
        speechSynthesizerZH.SpeakText(text1);
        speechSynthesizerZH.SpeakTextAsync(text2).get();
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    var textTimes = new ArrayList<Long>();
    var testAsyncTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizerZH.SpeakText(text);
            textTimes.add(System.currentTimeMillis() - s);
        }
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizerZH.SpeakTextAsync(text).get();
            testAsyncTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tSpeakText\tSpeakTextAsync\tDifference\n");
    for (int i = 0; i < textTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(textTimes.get(i)).append("\t");
        report.append(testAsyncTimes.get(i)).append("\t");
        report.append(textTimes.get(i) - testAsyncTimes.get(i)).append("\n");
    }
    double textTimeAvg = textTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double ssmlTimeAvg = testAsyncTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(textTimeAvg).append("\t");
    report.append(ssmlTimeAvg).append("\t");
    report.append(textTimeAvg - ssmlTimeAvg).append("\n");
    System.out.println(report);
}

Streaming or Non-Streaming

Streaming: speechSynthesizerZH.StartSpeakingText(text);

Non-streaming: speechSynthesizerZH.SpeakText(text);

Conclusion

The first audio chunk from the streaming API arrives about 30% sooner than the complete result from the non-streaming API (at 30 characters). At 10 characters the gap is about 20%; the longer the text, the greater the streaming advantage.

Test Results

NO.   NotStream     Stream        Difference
0     728           746           -18
1     701           483           218
2     744           452           292
3     728           505           223
4     628           481           147
5     657           502           155
6     701           437           264
7     713           489           224
8     657           469           188
9     731           608           123
10    656           472           184
11    681           502           179
12    727           467           260
13    796           531           265
14    689           439           250
15    711           474           237
16    790           410           380
17    719           698           21
18    671           423           248
19    791           595           196
20    736           410           326
21    775           488           287
22    644           465           179
23    700           460           240
24    718           441           277
25    819           530           289
26    671           426           245
27    595           438           157
28    746           516           230
29    853           504           349
30    700           485           215
31    787           466           321
32    734           470           264
33    680           426           254
34    928           458           470
35    729           434           295
36    689           699           -10
37    761           428           333
38    722           472           250
39    661           591           70
40    719           502           217
41    756           482           274
42    708           426           282
43    858           456           402
44    728           380           348
45    729           472           257
46    667           443           224
47    762           441           321
48    699           380           319
49    594           379           215
50    625           499           126
avg   719.8431373   483.3333333   236.5098039

Test Code

void streamOrNotStream() {
    SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
    config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);

    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    // ---- Init Link ---- //
    {
        var text1 = getRandomChinese(TEXT_LENGTH);
        var text2 = getRandomChinese(TEXT_LENGTH);
        long s = System.currentTimeMillis();
        speechSynthesizerZH.SpeakText(text1);
        SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.StartSpeakingText(text2);
        AudioDataStream audioDataStream = AudioDataStream.fromResult(speechSynthesisResult);
        byte[] buffer = new byte[10000];
        long filledSize = audioDataStream.readData(buffer);
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    var notStreamTimes = new ArrayList<Long>();
    var streamTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizerZH.SpeakText(text);
            notStreamTimes.add(System.currentTimeMillis() - s);
        }
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.StartSpeakingText(text);
            AudioDataStream audioDataStream = AudioDataStream.fromResult(speechSynthesisResult);
            byte[] buffer = new byte[8000];
            long filledSize = audioDataStream.readData(buffer);
            streamTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tNotStream\tStream\tDifference\n");
    for (int i = 0; i < notStreamTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(notStreamTimes.get(i)).append("\t");
        report.append(streamTimes.get(i)).append("\t");
        report.append(notStreamTimes.get(i) - streamTimes.get(i)).append("\n");
    }
    double notStreamTimeAvg = notStreamTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double streamTimeAvg = streamTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(notStreamTimeAvg).append("\t");
    report.append(streamTimeAvg).append("\t");
    report.append(notStreamTimeAvg - streamTimeAvg).append("\n");
    System.out.println(report);
}

Different Regions

Regions tested: eastasia (East Asia), southeastasia (Southeast Asia), japanwest (Japan West), japaneast (Japan East)

Request origin: Tokyo

Conclusion

japanwest (Japan West: 505 ms) < japaneast (Japan East: 542 ms) < eastasia (East Asia: 630 ms) < southeastasia (Southeast Asia: 733 ms)

Switching from eastasia to japanwest cuts latency by about 20%.

Test Results

NO.   japaneast   eastasia   japanwest   southeastasia
1     507         843        771         607
2     507         626        465         1072
3     500         609        445         693
4     594         597        488         815
5     627         679        482         663
6     563         626        440         642
7     724         828        585         714
8     502         599        469         848
9     573         534        599         675
10    506         627        420         703
11    519         597        522         611
12    579         600        448         917
13    599         653        608         704
14    540         624        439         871
15    606         625        517         647
16    557         530        468         730
17    580         601        468         792
18    590         575        414         633
19    521         520        447         690
20    643         617        448         687
21    542         772        448         688
22    486         521        414         664
23    421         662        727         678
24    607         559        439         728
25    453         593        1277        619
26    455         538        422         814
27    547         574        470         625
28    604         619        469         654
29    525         601        538         637
30    482         510        456         793
31    556         556        426         830
32    480         566        448         642
33    585         633        699         739
34    530         637        452         822
35    520         521        438         765
36    457         533        532         717
37    548         631        572         1102
38    508         841        427         655
39    508         583        463         802
40    623         793        427         693
41    453         1141       513         745
42    584         606        432         669
43    538         703        538         1022
44    537         715        437         668
45    549         526        513         714
46    534         569        407         707
47    582         585        444         683
48    492         629        452         681
49    511         572        603         764
50    573         705        463         648
avg   542.54      630.08     505.78      733.64

Test Code

static String diffRegion(int textLength, int loopTimes, Pair<String, String>... regions) {
    Map<String, SpeechSynthesizer> regionSpeechSynthesizer = new HashMap<>();
    Map<String, List<Long>> regionTimes = new HashMap<>();
    for (Pair<String, String> pair : regions) {
        val region = pair.getRight();
        SpeechConfig config = SpeechConfig.fromSubscription(pair.getLeft(), region);
        config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
        config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
        SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config, null);

        regionSpeechSynthesizer.put(region, speechSynthesizer);
        regionTimes.put(region, new ArrayList<>());
    }

    // ---- Init Link ---- //
    {
        long s = System.currentTimeMillis();
        for (Map.Entry<String, SpeechSynthesizer> entry : regionSpeechSynthesizer.entrySet()) {
            var text = getRandomChinese(textLength);
            SpeechSynthesisResult speechSynthesisResult = entry.getValue().SpeakText(text);
            if (speechSynthesisResult.getReason() == ResultReason.Canceled) {
                SpeechSynthesisCancellationDetails speechSynthesisCancellationDetails = SpeechSynthesisCancellationDetails.fromResult(speechSynthesisResult);
                System.out.println("failed: " + entry.getKey());
                System.out.println(speechSynthesisCancellationDetails.getErrorDetails());
            } else {
                System.out.println("ok:" + entry.getKey());
            }
        }
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    for (int i = 1; i <= loopTimes; i++) {
        System.out.println("loop NO." + i);
        for (Map.Entry<String, SpeechSynthesizer> entry : regionSpeechSynthesizer.entrySet()) {
            var text = getRandomChinese(textLength);
            long s = System.currentTimeMillis();
            entry.getValue().SpeakText(text);
            regionTimes.get(entry.getKey()).add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.");
    for (String region : regionTimes.keySet()) {
        report.append("\t" + region);
    }
    report.append("\n");

    for (int i = 0; i < loopTimes; i++) {
        report.append(i + 1).append("\t");
        for (List<Long> times : regionTimes.values()) {
            report.append(times.get(i)).append("\t");
        }
        report.append("\n");
    }
    // Compute average time per region
    report.append("avg").append("\t");
    for (List<Long> times : regionTimes.values()) {
        double avg = times.stream().mapToLong(Long::longValue).average().orElse(0);
        report.append(avg).append("\t");
    }
    report.append("\n");
    return report.toString();
}

Pre-connecting the SpeechSynthesizer

Pre-connect approach:

SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, (AudioConfig) null);
Connection connection = Connection.fromSpeechSynthesizer(synthesizer);
connection.openConnection(true);

Conclusion

No difference in synthesis latency.

Test Results

NO.   PreConnect    NotPreConnect  Difference
0     350           334            16
1     347           305            42
2     350           410            -60
3     350           350            0
4     335           350            -15
5     349           336            13
6     364           348            16
7     334           349            -15
8     347           350            -3
9     348           349            -1
10    347           365            -18
11    349           339            10
12    354           351            3
13    336           348            -12
14    350           368            -18
15    330           368            -38
16    337           349            -12
17    336           353            -17
18    367           350            17
19    348           346            2
20    348           348            0
21    350           335            15
22    351           353            -2
23    350           349            1
24    352           352            0
25    355           351            4
26    335           337            -2
27    349           336            13
28    349           352            -3
29    352           363            -11
30    349           353            -4
31    350           335            15
32    352           303            49
33    349           349            0
34    352           349            3
35    351           349            2
36    349           351            -2
37    351           349            2
38    352           347            5
39    348           351            -3
40    349           351            -2
41    321           349            -28
42    349           349            0
43    365           351            14
44    350           350            0
45    349           351            -2
46    349           336            13
47    335           335            0
48    350           337            13
49    350           348            2
50    351           349            2
avg   347.8431373   347.7647059    0.078431373

Test Code

void preConnect() throws InterruptedException {
    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    var reuseTimes = new ArrayList<Long>();
    var notReuseTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
            config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
            config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
            SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config1, null);
            Connection connection = Connection.fromSpeechSynthesizer(speechSynthesizer);
            connection.openConnection(true);

            Thread.sleep(200);

            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizer.SpeakText(text);
            reuseTimes.add(System.currentTimeMillis() - s);
        }

        {
            SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
            config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
            config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
            SpeechSynthesizer notPreConnectSpeechSynthesizer = new SpeechSynthesizer(config2, null);

            Thread.sleep(200);

            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            notPreConnectSpeechSynthesizer.SpeakText(text);
            notReuseTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tPreConnect\tNotPreConnect\tDifference\n");
    for (int i = 0; i < reuseTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(reuseTimes.get(i)).append("\t");
        report.append(notReuseTimes.get(i)).append("\t");
        report.append(reuseTimes.get(i) - notReuseTimes.get(i)).append("\n");
    }
    double avg1 = reuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double avg2 = notReuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(avg1).append("\t");
    report.append(avg2).append("\t");
    report.append(avg1 - avg2).append("\n");
    System.out.println(report);
}

Reusing the SpeechSynthesizer

Reuse: obtain a SpeechSynthesizer instance from a pool

No reuse: create a new SpeechSynthesizer for every request
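The pool mentioned above can be sketched with a blocking queue. This is a generic, illustrative sketch of our own (the class and method names are not from the Speech SDK):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// A tiny fixed-size object pool, usable for SpeechSynthesizer instances.
public class SimplePool<T> {
    private final BlockingQueue<T> idle;

    public SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        // Pre-create all instances up front so requests never pay creation cost.
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Blocks until an instance is available.
    public T borrow() throws InterruptedException {
        return idle.take();
    }

    // Return the instance to the pool after use.
    public void release(T obj) {
        idle.offer(obj);
    }
}
```

For example, `new SimplePool<>(4, () -> new SpeechSynthesizer(config, null))` would pre-create four synthesizers to borrow and release around each request.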

Conclusion

No difference in synthesis latency.

Test Results

NO.   Reuse         NotReuse      Difference
0     339           336           3
1     348           364           -16
2     335           353           -18
3     349           349           0
4     349           336           13
5     349           353           -4
6     363           349           14
7     363           353           10
8     348           351           -3
9     350           335           15
10    352           350           2
11    348           350           -2
12    351           349           2
13    355           349           6
14    352           352           0
15    350           349           1
16    368           350           18
17    350           350           0
18    349           348           1
19    352           364           -12
20    349           352           -3
21    349           351           -2
22    351           348           3
23    351           349           2
24    351           351           0
25    365           367           -2
26    350           381           -31
27    352           349           3
28    351           336           15
29    352           350           2
30    351           349           2
31    351           331           20
32    349           367           -18
33    351           352           -1
34    349           352           -3
35    349           350           -1
36    351           365           -14
37    353           346           7
38    351           351           0
39    348           350           -2
40    350           349           1
41    379           363           16
42    350           347           3
43    335           346           -11
44    352           347           5
45    350           352           -2
46    351           347           4
47    350           368           -18
48    353           347           6
49    350           335           15
50    366           350           16
avg   351.5686275   350.745098    0.823529412

Test Code

void reuseOrNot() {
    SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
    config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer reuseSpeechSynthesizer = new SpeechSynthesizer(config1, null);

    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    // ---- Init Link ---- //
    {
        var text1 = getRandomChinese(TEXT_LENGTH);
        long s = System.currentTimeMillis();
        reuseSpeechSynthesizer.SpeakText(text1);
        System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
    }

    var reuseTimes = new ArrayList<Long>();
    var notReuseTimes = new ArrayList<Long>();

    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            reuseSpeechSynthesizer.SpeakText(text);
            reuseTimes.add(System.currentTimeMillis() - s);
        }
        {
            long s = System.currentTimeMillis();

            SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
            config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
            config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
            SpeechSynthesizer notPreInitSpeechSynthesizer = new SpeechSynthesizer(config2, null);

            var text = getRandomChinese(TEXT_LENGTH);
            notPreInitSpeechSynthesizer.SpeakText(text);
            notReuseTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tReuse\tNotReuse\tDifference\n");
    for (int i = 0; i < reuseTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(reuseTimes.get(i)).append("\t");
        report.append(notReuseTimes.get(i)).append("\t");
        report.append(reuseTimes.get(i) - notReuseTimes.get(i)).append("\n");
    }
    double reuseTimeAvg = reuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double notReuseTimeAvg = notReuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(reuseTimeAvg).append("\t");
    report.append(notReuseTimeAvg).append("\t");
    report.append(reuseTimeAvg - notReuseTimeAvg).append("\n");
    System.out.println(report);
}

Synthesizing Repeated Content

The same text is synthesized repeatedly.

Conclusion

No difference in synthesis latency.

Test Results

NO.   SameText      NotSameText   Difference
0     363           365           -2
1     350           346           4
2     349           350           -1
3     333           349           -16
4     349           348           1
5     350           352           -2
6     335           350           -15
7     335           347           -12
8     350           365           -15
9     348           348           0
10    350           347           3
11    350           350           0
12    349           352           -3
13    350           350           0
14    334           348           -14
15    348           350           -2
16    352           349           3
17    335           351           -16
18    351           347           4
19    348           353           -5
20    349           347           2
21    335           347           -12
22    349           347           2
23    331           351           -20
24    350           352           -2
25    349           334           15
26    350           363           -13
27    349           349           0
28    350           347           3
29    350           352           -2
30    348           351           -3
31    351           350           1
32    350           320           30
33    348           349           -1
34    380           351           29
35    347           353           -6
36    347           340           7
37    349           347           2
38    349           351           -2
39    333           350           -17
40    366           353           13
41    363           347           16
42    353           348           5
43    350           353           -3
44    351           349           2
45    350           347           3
46    349           351           -2
47    348           348           0
48    337           350           -13
49    332           352           -20
50    352           349           3
avg   347.9215686   349.3137255   -1.392156863

Test Code

void sameText() throws InterruptedException {
    SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
    config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer sameSpeechSynthesizer = new SpeechSynthesizer(config1, null);

    SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
    config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
    config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
    SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config2, null);

    final int TEXT_LENGTH = 30;
    final int LOOP_TIMES = 50;

    var sameTextTimes = new ArrayList<Long>();
    var notSameTextTimes = new ArrayList<Long>();

    var sameText = getRandomChinese(TEXT_LENGTH);
    for (int i = 0; i <= LOOP_TIMES; i++) {
        {
            long s = System.currentTimeMillis();
            sameSpeechSynthesizer.SpeakText(sameText);
            sameTextTimes.add(System.currentTimeMillis() - s);
        }

        {
            var text = getRandomChinese(TEXT_LENGTH);
            long s = System.currentTimeMillis();
            speechSynthesizer.SpeakText(text);
            notSameTextTimes.add(System.currentTimeMillis() - s);
        }
    }

    // ---- Generate report ---- //
    StringBuilder report = new StringBuilder();
    report.append("NO.\tSameText\tNotSameText\tDifference\n");
    for (int i = 0; i < sameTextTimes.size(); i++) {
        report.append(i).append("\t");
        report.append(sameTextTimes.get(i)).append("\t");
        report.append(notSameTextTimes.get(i)).append("\t");
        report.append(sameTextTimes.get(i) - notSameTextTimes.get(i)).append("\n");
    }
    double avg1 = sameTextTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    double avg2 = notSameTextTimes.stream().mapToLong(Long::longValue).average().orElse(0);
    report.append("avg").append("\t");
    report.append(avg1).append("\t");
    report.append(avg2).append("\t");
    report.append(avg1 - avg2).append("\n");
    System.out.println(report);
}

Common Methods

public static String getRandomChinese(int length) {
    StringBuilder sb = new StringBuilder();
    Random random = new Random();

    for (int i = 0; i < length; i++) {
        int codePoint = 0x4e00 + random.nextInt(0x9fa5 - 0x4e00 + 1);
        sb.append((char) codePoint);
    }

    return sb.toString();
}
private final static String SSML_TEMPLATE = """
            <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{lang}">
              <voice name="{voiceName}">
                    {text}
              </voice>
            </speak>
            """;

public static String buildSsml(String text, String voiceName, String lang) {
    return SSML_TEMPLATE
            // If the language is set via a nested <lang xml:lang="{lang}"> tag instead,
            // the multilingual voice is not recognized
            .replaceAll("\\{lang\\}", lang)
            .replaceAll("\\{voiceName\\}", voiceName)
            .replaceAll("\\{text\\}", text);
}
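For reference, here is a minimal, self-contained usage of buildSsml with the template and method copied from above (using the standard http://www.w3.org/2001/10/synthesis namespace):

```java
public class SsmlDemo {
    private static final String SSML_TEMPLATE = """
            <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{lang}">
              <voice name="{voiceName}">
                    {text}
              </voice>
            </speak>
            """;

    public static String buildSsml(String text, String voiceName, String lang) {
        return SSML_TEMPLATE
                .replaceAll("\\{lang\\}", lang)
                .replaceAll("\\{voiceName\\}", voiceName)
                .replaceAll("\\{text\\}", text);
    }

    public static void main(String[] args) {
        // Prints the SSML document that the tests pass to SpeakSsml.
        System.out.println(buildSsml("你好", "zh-CN-XiaochenMultilingualNeural", "en-US"));
    }
}
```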
