PhD Thesis Summary

Study of Unit Selection Text-To-Speech Synthesis Algorithms

Diagnosing the "Black Box" of speech synthesis: Improving search efficiency, prosodic control, and concatenation quality in corpus-based systems.

David Guennec
University of Rennes 1 / IRISA
2016

Summary

My PhD thesis focused on automatic speech synthesis, and more specifically on Corpus-Based Speech Synthesis (Unit Selection). It provides an in-depth analysis and diagnosis of the unit selection algorithm (a lattice search algorithm).

The importance of solution optimality is discussed, and a new unit selection implementation based on an A* algorithm is presented. Three cost function enhancements are also presented.

The first one is a new way, within the target cost, to minimize large spectral differences by selecting sequences of candidate units that minimize a mean cost instead of an absolute one. This cost is tested on a phonemic duration distance but can be applied to other distances.

Our second proposition is a target sub-cost addressing intonation, based on coefficients extracted through a generalized version of Fujisaki's command-response model. This model represents F0 with gamma functions called atoms.

Finally, our third contribution is a penalty system that aims to enhance the concatenation cost. It penalizes units according to classes that define the risk that a concatenation artifact occurs when concatenating on a phone of that class. This system differs from others in the literature in that it is tempered by a fuzzy function that softens penalties for units with low concatenation costs.

The Challenge: Optimizing the "Black Box"

A Unit Selection synthesizer acts like a massive puzzle solver. It converts text into a target sequence of phonemes and then searches a large graph of recorded audio units for the best-matching sequence.

  • Search Efficiency

    Is the industry-standard Viterbi algorithm actually the best way to navigate millions of potential speech units?

  • Prosodic Control

    How do we force the system to respect rhythm and intonation without degrading audio quality?

  • Concatenation Artifacts

    How do we prevent the system from joining audio segments that sound terrible together?

The Selection Lattice Problem

[Figure: selection lattice with one column of candidate units per target phoneme (Phoneme 1 to Phoneme 4). Caption: Finding the lowest-cost path through millions of units.]
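
The lattice search described above can be sketched as a small dynamic program. This is a minimal illustration, not the IRISA engine: the function names, the unit identifiers, and the toy costs are all hypothetical, and beam pruning is left out for brevity. It assumes each candidate has a target cost and each pair of consecutive candidates has a concatenation cost.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def viterbi_select(
    lattice: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    concat_cost: Callable[[str, str], float],
) -> Tuple[float, List[str]]:
    """Return (total cost, unit path) of the cheapest path through the lattice."""
    if not lattice or any(len(column) == 0 for column in lattice):
        raise ValueError("every target position needs at least one candidate")

    # best[u] = (cost of the cheapest partial path ending in unit u, that path)
    best: Dict[str, Tuple[float, List[str]]] = {
        u: (target_cost(0, u), [u]) for u in lattice[0]
    }
    for i in range(1, len(lattice)):
        new_best: Dict[str, Tuple[float, List[str]]] = {}
        for u in lattice[i]:
            # Pick the predecessor that minimizes cost-so-far plus join cost.
            prev, (cost, path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + concat_cost(kv[0], u),
            )
            total = cost + concat_cost(prev, u) + target_cost(i, u)
            new_best[u] = (total, path + [u])
        best = new_best
    return min(best.values(), key=lambda cost_path: cost_path[0])


# Hypothetical toy data: the trailing digit stands for the source recording.
TOY_LATTICE = [["a1", "a2"], ["b1", "b2"], ["c1"]]
TOY_TARGET = {"a1": 0.5, "a2": 0.1, "b1": 0.2, "b2": 0.3, "c1": 0.0}


def toy_target_cost(position: int, unit: str) -> float:
    return TOY_TARGET[unit]


def toy_concat_cost(left: str, right: str) -> float:
    # Units cut from the same recording join for free in this toy setup.
    return 0.0 if left[-1] == right[-1] else 1.0


best_cost, best_path = viterbi_select(TOY_LATTICE, toy_target_cost, toy_concat_cost)
print(best_cost, best_path)
```

In this toy lattice the cheapest path stays within recording 1 even though "a2" has the lowest target cost on its own, which is the trade-off between target and concatenation costs that the search has to resolve.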

Key Technical Contributions

1. The Search Engine: A* vs. Viterbi

Most TTS systems use the Viterbi algorithm with beam-search pruning. I implemented and evaluated a system based on the A* (A-Star) algorithm.

  • Result: Demonstrated that "good enough" solutions are often perceptually indistinguishable from optimal ones.
  • Benefit: A* proved to be a more flexible architecture for implementing complex heuristics.
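
As a rough sketch of how A* can replace the Viterbi pass, the example below runs a best-first search over the same kind of lattice. The heuristic used here (the sum of the smallest remaining target costs) is an assumption chosen because it never overestimates when concatenation costs are non-negative; it is not claimed to be the heuristic used in the thesis. All names and toy costs are hypothetical.

```python
import heapq
import itertools
from typing import Callable, List, Sequence, Tuple


def astar_select(
    lattice: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    concat_cost: Callable[[str, str], float],
) -> Tuple[float, List[str], int]:
    """Return (total cost, unit path, number of expanded nodes)."""
    n = len(lattice)
    if n == 0 or any(len(column) == 0 for column in lattice):
        raise ValueError("every target position needs at least one candidate")

    # h[i]: lower bound on the cost of positions i..n-1 (best target costs only).
    h = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        h[i] = h[i + 1] + min(target_cost(i, u) for u in lattice[i])

    tie = itertools.count()  # tie-breaker so the heap never compares paths
    frontier = []
    for u in lattice[0]:
        g = target_cost(0, u)
        heapq.heappush(frontier, (g + h[1], next(tie), g, 0, [u]))

    closed = set()
    expansions = 0
    while frontier:
        _, _, g, i, path = heapq.heappop(frontier)
        if i == n - 1:
            return g, path, expansions  # first complete path popped is optimal
        state = (i, path[-1])
        if state in closed:
            continue  # already expanded via a cheaper partial path
        closed.add(state)
        expansions += 1
        for v in lattice[i + 1]:
            g2 = g + concat_cost(path[-1], v) + target_cost(i + 1, v)
            heapq.heappush(frontier, (g2 + h[i + 2], next(tie), g2, i + 1, path + [v]))
    raise RuntimeError("no complete path found")


# Hypothetical toy data: the trailing digit stands for the source recording.
TOY_LATTICE = [["a1", "a2"], ["b1", "b2"], ["c1"]]
TOY_TARGET = {"a1": 0.5, "a2": 0.1, "b1": 0.2, "b2": 0.3, "c1": 0.0}


def toy_target_cost(position: int, unit: str) -> float:
    return TOY_TARGET[unit]


def toy_concat_cost(left: str, right: str) -> float:
    return 0.0 if left[-1] == right[-1] else 1.0


astar_cost, astar_path, astar_expansions = astar_select(
    TOY_LATTICE, toy_target_cost, toy_concat_cost
)
print(astar_cost, astar_path, astar_expansions)
```

One reason such a design is flexible is that the heuristic is an isolated component: swapping in a different lower bound, or deliberately inflating it to trade optimality for speed, does not change the rest of the search loop.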

2. Adaptive Duration Cost

Standard systems try to match duration unit-by-unit, leading to "jittery" rhythm. I developed a Global Adaptive Duration Cost.

[Figure: Standard (Jittery) vs. Adaptive (Smooth)]

  • Outcome: Prevents the selection of "outlier" units that disrupt rhythm, creating smoother speech flow.
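
To make the contrast concrete, here is one possible reading of a sequence-level duration cost, offered only as an illustration and not as the formula used in the thesis. It assumes durations in milliseconds and scores each unit by how far its duration error lies from the mean error of the whole path, so a path that is uniformly a little slow is preferred over one containing a single outlier. Both function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def absolute_duration_cost(durations: Sequence[float], targets: Sequence[float]) -> float:
    """Unit-by-unit cost: sum of absolute duration errors."""
    return sum(abs(d - t) for d, t in zip(durations, targets))


def mean_relative_duration_cost(durations: Sequence[float], targets: Sequence[float]) -> float:
    """Sequence-level cost: sum of deviations from the path's mean signed error."""
    errors: List[float] = [d - t for d, t in zip(durations, targets)]
    mean_error = mean(errors)
    return sum(abs(e - mean_error) for e in errors)


# Hypothetical target and candidate durations, in milliseconds.
targets = [80, 100, 90, 110]
uniformly_slow = [90, 110, 100, 120]  # every unit is 10 ms too long
one_outlier = [80, 100, 130, 110]     # three exact units, one 40 ms too long

print(absolute_duration_cost(uniformly_slow, targets),
      absolute_duration_cost(one_outlier, targets))
print(mean_relative_duration_cost(uniformly_slow, targets),
      mean_relative_duration_cost(one_outlier, targets))
```

The unit-by-unit cost cannot tell the two paths apart (both total 40 ms of error), while the sequence-level cost clearly favours the path with a consistent tempo.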

3. Atom-Based Pitch Control

Controlling intonation is notoriously difficult. I integrated an Atom-Based Decomposition method (a generalization of the Fujisaki model).

[Figure: Atom Decomposition]

  • Innovation: Decomposed pitch curves into "Atoms" (Gamma functions representing muscular impulses) to drive selection.
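
The idea of rebuilding a pitch curve from atoms can be sketched as a baseline plus a sum of shifted, scaled gamma-shaped kernels. This is a minimal sketch under assumptions: the kernel below is the standard gamma density with shape k >= 1 and scale theta, the contour is taken to be in the log-F0 domain, and the atom parameters are invented for the example. The actual decomposition procedure that extracts atoms from recorded speech is not shown.

```python
import math
from typing import List, Sequence, Tuple

# An atom is (onset in seconds, amplitude, shape k, scale theta); hypothetical layout.
Atom = Tuple[float, float, float, float]


def gamma_atom(t: float, k: float, theta: float) -> float:
    """Gamma-shaped kernel, zero before its onset (assumes k >= 1)."""
    if t < 0.0:
        return 0.0
    return t ** (k - 1) * math.exp(-t / theta) / (theta ** k * math.gamma(k))


def synthesize_log_f0(times: Sequence[float], baseline: float, atoms: Sequence[Atom]) -> List[float]:
    """Baseline plus the sum of all atoms, evaluated at each time point."""
    return [
        baseline + sum(amp * gamma_atom(t - onset, k, theta) for onset, amp, k, theta in atoms)
        for t in times
    ]


# One rising atom followed by a smaller falling atom (invented values).
times = [i * 0.01 for i in range(100)]
baseline = math.log(120.0)  # log of a 120 Hz baseline
atoms = [(0.2, 0.05, 2.0, 0.1), (0.6, -0.03, 2.0, 0.1)]

contour = synthesize_log_f0(times, baseline, atoms)
peak_index = max(range(len(contour)), key=contour.__getitem__)
print(times[peak_index], contour[peak_index])
```

With k = 2 each kernel peaks (k - 1) * theta = 0.1 s after its onset, so the first atom lifts the contour around 0.3 s and the second pulls it down around 0.7 s. A handful of (onset, amplitude) coefficients of this kind is what a target sub-cost can compare against candidate units.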

4. Fuzzy Vocalic Sandwiches

Concatenating inside a vowel often causes clicks. I implemented a Fuzzy Logic Penalty System based on "Vocalic Sandwiches" (Safe Consonant Zones).

[Figure: a vocalic sandwich: Consonant, Vowel (Risky), Consonant]

  • Outcome: If a vowel concatenation is acoustically perfect, the system allows it. If it is risky, the system avoids it ("Fuzzy" penalty).
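
The behaviour described above can be sketched with a simple ramp-shaped membership function. Everything specific here is an assumption made for illustration: the class names, the penalty values, the two thresholds, and the linear shape of the fuzzy function are hypothetical and are not taken from the thesis.

```python
# Hypothetical per-class penalties: higher means a join on this phone is riskier.
CLASS_PENALTY = {"vowel": 1.0, "semivowel": 0.6, "consonant": 0.0}


def fuzzy_membership(cost: float, low: float, high: float) -> float:
    """Degree (0..1) to which a join counts as risky, given its concatenation cost."""
    if cost <= low:
        return 0.0  # acoustically clean join: no penalty at all
    if cost >= high:
        return 1.0  # clearly bad join: full class penalty
    return (cost - low) / (high - low)  # linear ramp in between


def penalized_concat_cost(
    cost: float, phone_class: str, low: float = 0.2, high: float = 0.8
) -> float:
    """Concatenation cost plus the class penalty, tempered by the fuzzy membership."""
    return cost + fuzzy_membership(cost, low, high) * CLASS_PENALTY[phone_class]


for cost in (0.1, 0.5, 0.9):
    print(cost, penalized_concat_cost(cost, "vowel"), penalized_concat_cost(cost, "consonant"))
```

A join inside a vowel with a very low concatenation cost passes through unpenalized, a mediocre one receives part of the penalty, and a poor one receives all of it, while joins on consonants are left untouched in this toy configuration.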

Methodology & Tools

Development

Co-developed the IRISA TTS engine from scratch to ensure total control over the algorithmic stack.

Data

Used the ROOTS toolkit for corpus management, along with two high-quality corpora totaling about 17 hours of speech.