Diagnosing the "Black Box" of speech synthesis: Improving search efficiency, prosodic control, and concatenation quality in corpus-based systems.
My PhD thesis focused on automatic speech synthesis, and more specifically on Corpus-Based Speech Synthesis (Unit Selection). It provides a deep analysis and diagnosis of the unit selection algorithm (a lattice search).
The importance of solution optimality is discussed, and a new unit selection implementation based on an A* algorithm is presented, along with three cost function enhancements.
The first is a new target sub-cost that reduces large spectral differences by selecting sequences of candidate units that minimize a mean cost instead of an absolute one. It is tested on a phonemic duration distance but can be applied to other distances.
Our second proposition is a target sub-cost addressing intonation, based on coefficients extracted with a generalized version of Fujisaki's command-response model. The model represents F0 as a sum of gamma-shaped functions called atoms.
Finally, our third contribution is a penalty system that enhances the concatenation cost. It penalizes units according to phone classes that capture the risk of an audible artifact when concatenating on a phone of that class. Unlike comparable systems in the literature, it is tempered by a fuzzy function that softens penalties for units with low concatenation costs.
A Unit Selection synthesizer acts like a giant puzzle solver: it converts text into a target sequence of phonemes, then searches a large graph of recorded speech units for the sequence that best matches that target.
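In the classic formulation of unit selection (Hunt and Black, 1996), the chosen units u_1..u_n minimize the sum of target and concatenation costs over the lattice:

    C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i)

where C^t scores how well candidate u_i matches target t_i, and C^c scores how smoothly u_{i-1} joins to u_i. The contributions below refine both terms and the search that minimizes them.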
Is the industry-standard Viterbi algorithm actually the best way to navigate millions of potential speech units?
How do we force the system to respect rhythm and intonation without degrading audio quality?
How do we prevent the system from joining audio segments that sound terrible together?
Most TTS systems use the Viterbi algorithm with beam-search pruning, which trades away the guarantee of an optimal path for speed. I implemented and evaluated a system based on the A* (A-Star) algorithm, which preserves that guarantee as long as its heuristic never overestimates the remaining cost (see the sketch below).
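A minimal sketch of the idea, not the thesis implementation: A* expands lattice states best-first, ordered by g (cost accumulated so far) plus h (an admissible estimate of the cost still to come). All names and cost functions here are illustrative.

    import heapq, itertools

    def a_star_select(candidates, target_cost, concat_cost, heuristic):
        """Minimum-cost path through a unit-selection lattice.

        candidates[i] -- candidate units for target position i
        target_cost   -- target_cost(i, u): mismatch between target i and unit u
        concat_cost   -- concat_cost(u, v): cost of joining unit u to unit v
        heuristic     -- heuristic(i): admissible lower bound on the cost of
                         finishing a path that currently ends at position i
        """
        n = len(candidates)
        tie = itertools.count()  # tie-breaker so the heap never compares units
        heap = []
        for u in candidates[0]:
            g = target_cost(0, u)
            heapq.heappush(heap, (g + heuristic(0), next(tie), g, 0, u, (u,)))
        best = {}  # cheapest g seen per (position, unit)
        while heap:
            f, _, g, i, u, path = heapq.heappop(heap)
            if i == n - 1:
                return path, g  # with admissible h, the first goal popped is optimal
            for v in candidates[i + 1]:
                g2 = g + concat_cost(u, v) + target_cost(i + 1, v)
                if g2 < best.get((i + 1, v), float("inf")):
                    best[(i + 1, v)] = g2
                    heapq.heappush(heap, (g2 + heuristic(i + 1), next(tie),
                                          g2, i + 1, v, path + (v,)))
        return None, float("inf")

With heuristic = lambda i: 0.0 this degrades to uniform-cost search, still optimal but slower; a tighter bound (e.g., the sum of minimum target costs over the remaining positions) prunes more of the lattice, which is where A* can compete with a pruned Viterbi beam.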
Standard systems try to match duration unit-by-unit, leading to "jittery" rhythm. I developed a Global Adaptive Duration Cost.
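One plausible reading of that cost, as a toy sketch only (the thesis formulation differs in its details): score a candidate sequence by its mean signed duration deviation, so a slightly long unit can be offset by a slightly short one instead of both being penalized.

    def mean_duration_cost(cand_durations, target_durations):
        # Hypothetical illustration: deviations may cancel out across the
        # sequence, unlike a per-unit absolute cost.
        deviations = [c - t for c, t in zip(cand_durations, target_durations)]
        return abs(sum(deviations)) / len(deviations)

    target = [80, 80, 80]  # target phone durations (ms)
    cand = [90, 70, 80]    # +10 ms then -10 ms: rhythm still sounds even
    per_unit = sum(abs(c - t) for c, t in zip(cand, target))  # 20
    global_mean = mean_duration_cost(cand, target)            # 0.0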
Controlling intonation is notoriously difficult. I integrated an Atom-Based Decomposition method (a generalization of the Fujisaki command-response model).
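The atom model represents the F0 contour as a weighted sum of gamma-kernel impulse responses ("atoms"), and the extracted amplitudes serve as intonation features for the target sub-cost. A minimal sketch, with illustrative parameter values rather than the thesis settings:

    import math

    def gamma_atom(t, k=6.0, theta=0.01):
        # Gamma kernel generalizing the Fujisaki command response; t in seconds.
        # k (shape) and theta (scale) are illustrative values only.
        if t < 0:
            return 0.0
        return (t ** (k - 1)) * math.exp(-t / theta) / (math.gamma(k) * theta ** k)

    def f0_from_atoms(times, atoms):
        # Reconstruct an F0 curve as a weighted sum of shifted atoms.
        # atoms: list of (onset_time, amplitude, k, theta) tuples.
        return [sum(a * gamma_atom(t - t0, k, th) for t0, a, k, th in atoms)
                for t in times]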
Concatenating inside a vowel often causes clicks. I implemented a Fuzzy Logic Penalty System based on "Vocalic Sandwiches" (Safe Consonant Zones).
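A sketch of the tempering idea, with hypothetical parameter values: a sigmoid membership function scales the class penalty, so a join whose acoustic cost is already low keeps almost none of it, even inside a risky class like vowels.

    import math

    def fuzzy_penalty(join_cost, class_penalty, midpoint=1.0, slope=5.0):
        # Fuzzy tempering: membership is near 0 for clean joins (low acoustic
        # cost) and near 1 for suspect joins, which then receive the full
        # class penalty. midpoint and slope are illustrative values only.
        membership = 1.0 / (1.0 + math.exp(-slope * (join_cost - midpoint)))
        return join_cost + membership * class_penalty

The "Vocalic Sandwich" idea steers concatenation points toward consonant zones where artifacts are less audible; the fuzzy function keeps that preference from vetoing joins the acoustics already deem safe.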
Co-developed the IRISA TTS engine from scratch to ensure total control over the algorithmic stack.
Utilized the ROOTS toolkit for corpus management, together with two high-quality corpora totaling about 17 hours of speech.