Diagnosing the "Black Box" of speech synthesis: Improving search efficiency, prosodic control, and concatenation quality in corpus-based systems.
My PhD thesis focused on automatic speech synthesis, and more specifically on Corpus-Based Speech Synthesis (Unit Selection). It provides a deep analysis and diagnosis of the unit selection algorithm (a lattice search).
The importance of solution optimality is discussed, and a new unit selection implementation based on an A* algorithm is presented, along with three cost function enhancements.
The first is a new target sub-cost that reduces large spectral differences by selecting sequences of candidate units that minimize a mean cost instead of an absolute one. It is tested on a phonemic duration distance but can be applied to other distances.
Our second proposition is a target sub-cost addressing intonation, based on coefficients extracted with a generalized version of Fujisaki's command-response model. The model represents F0 as a sum of gamma-shaped functions called atoms.
Finally, our third contribution is a penalty system that enhances the concatenation cost. It penalizes units according to phone classes that capture the risk of an audible artifact when concatenating on a phone of that class. Unlike comparable systems in the literature, it is tempered by a fuzzy function that softens penalties for units with low concatenation costs.
A Unit Selection synthesizer acts like a giant puzzle solver: it converts text into a target sequence of phonemes, then searches a large graph of recorded speech units for the sequence that best matches that target.
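In the classic formulation of unit selection (Hunt and Black, 1996), the chosen units u_1..u_n minimize the sum of target and concatenation costs over the lattice:

    C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i)

where C^t scores how well candidate u_i matches target t_i, and C^c scores how smoothly u_{i-1} joins to u_i. The contributions below refine both terms and the search that minimizes them.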
Is the industry-standard Viterbi algorithm actually the best way to navigate millions of potential speech units?
How do we force the system to respect rhythm and intonation without degrading audio quality?
How do we prevent the system from joining audio segments that sound terrible together?
Most TTS systems use the Viterbi algorithm with beam-search pruning, which trades away the guarantee of an optimal path for speed. I implemented and evaluated a system based on the A* (A-Star) algorithm, which preserves that guarantee as long as its heuristic never overestimates the remaining cost (see the sketch below).
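A minimal sketch of the idea, not the thesis implementation: A* expands lattice states best-first, ordered by g (cost accumulated so far) plus h (an admissible estimate of the cost still to come). All names and cost functions here are illustrative.

    import heapq, itertools

    def a_star_select(candidates, target_cost, concat_cost, heuristic):
        """Minimum-cost path through a unit-selection lattice.

        candidates[i] -- candidate units for target position i
        target_cost   -- target_cost(i, u): mismatch between target i and unit u
        concat_cost   -- concat_cost(u, v): cost of joining unit u to unit v
        heuristic     -- heuristic(i): admissible lower bound on the cost of
                         finishing a path that currently ends at position i
        """
        n = len(candidates)
        tie = itertools.count()  # tie-breaker so the heap never compares units
        heap = []
        for u in candidates[0]:
            g = target_cost(0, u)
            heapq.heappush(heap, (g + heuristic(0), next(tie), g, 0, u, (u,)))
        best = {}  # cheapest g seen per (position, unit)
        while heap:
            f, _, g, i, u, path = heapq.heappop(heap)
            if i == n - 1:
                return path, g  # with admissible h, the first goal popped is optimal
            for v in candidates[i + 1]:
                g2 = g + concat_cost(u, v) + target_cost(i + 1, v)
                if g2 < best.get((i + 1, v), float("inf")):
                    best[(i + 1, v)] = g2
                    heapq.heappush(heap, (g2 + heuristic(i + 1), next(tie),
                                          g2, i + 1, v, path + (v,)))
        return None, float("inf")

With heuristic = lambda i: 0.0 this degrades to uniform-cost search, still optimal but slower; a tighter bound (e.g., the sum of minimum target costs over the remaining positions) prunes more of the lattice, which is where A* can compete with a pruned Viterbi beam.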
Standard systems try to match duration unit-by-unit, leading to "jittery" rhythm. I developed a Global Adaptive Duration Cost.
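One plausible reading of that cost, as a toy sketch only (the thesis formulation differs in its details): score a candidate sequence by its mean signed duration deviation, so a slightly long unit can be offset by a slightly short one instead of both being penalized.

    def mean_duration_cost(cand_durations, target_durations):
        # Hypothetical illustration: deviations may cancel out across the
        # sequence, unlike a per-unit absolute cost.
        deviations = [c - t for c, t in zip(cand_durations, target_durations)]
        return abs(sum(deviations)) / len(deviations)

    target = [80, 80, 80]  # target phone durations (ms)
    cand = [90, 70, 80]    # +10 ms then -10 ms: rhythm still sounds even
    per_unit = sum(abs(c - t) for c, t in zip(cand, target))  # 20
    global_mean = mean_duration_cost(cand, target)            # 0.0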
Controlling intonation is notoriously difficult. I integrated an Atom-Based Decomposition method (a generalization of the Fujisaki command-response model).
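The atom model represents the F0 contour as a weighted sum of gamma-kernel impulse responses ("atoms"), and the extracted amplitudes serve as intonation features for the target sub-cost. A minimal sketch, with illustrative parameter values rather than the thesis settings:

    import math

    def gamma_atom(t, k=6.0, theta=0.01):
        # Gamma kernel generalizing the Fujisaki command response; t in seconds.
        # k (shape) and theta (scale) are illustrative values only.
        if t < 0:
            return 0.0
        return (t ** (k - 1)) * math.exp(-t / theta) / (math.gamma(k) * theta ** k)

    def f0_from_atoms(times, atoms):
        # Reconstruct an F0 curve as a weighted sum of shifted atoms.
        # atoms: list of (onset_time, amplitude, k, theta) tuples.
        return [sum(a * gamma_atom(t - t0, k, th) for t0, a, k, th in atoms)
                for t in times]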
Concatenating inside a vowel often causes clicks. I implemented a Fuzzy Logic Penalty System based on "Vocalic Sandwiches" (Safe Consonant Zones).
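A sketch of the tempering idea, with hypothetical parameter values: a sigmoid membership function scales the class penalty, so a join whose acoustic cost is already low keeps almost none of it, even inside a risky class like vowels.

    import math

    def fuzzy_penalty(join_cost, class_penalty, midpoint=1.0, slope=5.0):
        # Fuzzy tempering: membership is near 0 for clean joins (low acoustic
        # cost) and near 1 for suspect joins, which then receive the full
        # class penalty. midpoint and slope are illustrative values only.
        membership = 1.0 / (1.0 + math.exp(-slope * (join_cost - midpoint)))
        return join_cost + membership * class_penalty

The "Vocalic Sandwich" idea steers concatenation points toward consonant zones where artifacts are less audible; the fuzzy function keeps that preference from vetoing joins the acoustics already deem safe.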
Co-developed the IRISA TTS engine from scratch to ensure total control over the algorithmic stack.
Utilized the ROOTS toolkit for corpus management, together with two high-quality corpora totaling about 17 hours of speech.