Deep learning speech synthesis

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks are trained using large amounts of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Formulation

Given an input text or some sequence of linguistic units , the target speech can be derived by

where is the set of model parameters.

Typically, the input text will first be passed to an acoustic feature generator, then the acoustic features are passed to the neural vocoder. For the acoustic feature generator, the loss function is typically L1 loss (Mean Absolute Error, MAE) or L2 loss (Mean Square Error, MSE). These loss functions impose a constraint that the output acoustic feature distributions must be Gaussian or Laplacian. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function will be designed to have more penalty on this range:

where is the loss from human voice band and is a scalar, typically around 0.5. The acoustic feature is typically a spectrogram or Mel scale. These features capture the time-frequency relation of the speech signal, and thus are sufficient to generate intelligent outputs. The Mel-frequency cepstrum feature used in the speech recognition task is not suitable for speech synthesis, as it reduces too much information.

History

A stack of dilated casual convolutional layers used in WaveNet[1]

In September 2016, DeepMind proposed WaveNet, a deep generative model of raw audio waveforms, demonstrating that deep learning-based models are capable of modeling raw waveforms and generating speech from acoustic features like spectrograms or mel-spectrograms. Although WaveNet was initially considered to be computationally expensive and slow to be used in consumer products at the time, a year after its release, DeepMind unveiled a modified version of WaveNet known as "Parallel WaveNet," a production model 1,000 faster than the original.[1]

In early 2017, Mila proposed char2wav, a model to produce raw waveform in an end-to-end method. In the same year, Google and Facebook proposed Tacotron and VoiceLoop, respectively, to generate acoustic features directly from the input text; months later, Google proposed Tacotron2, which combined the WaveNet vocoder with the revised Tacotron architecture to perform end-to-end speech synthesis. Tacotron2 can generate high-quality speech approaching the human voice.[citation needed]

Semi-supervised learning

Currently, self-supervised learning has gained much attention through better use of unlabelled data. Research has shown that, with the aid of self-supervised loss, the need for paired data decreases.[2][3]

Zero-shot speaker adaptation

Zero-shot speaker adaptation is promising because a single model can generate speech with various speaker styles and characteristic. In June 2018, Google proposed to use pre-trained speaker verification models as speaker encoders to extract speaker embeddings.[4] The speaker encoders then become part of the neural text-to-speech models, so that it can determine the style and characteristics of the output speech. This procedure has shown the community that it is possible to use only a single model to generate speech with multiple styles.

Neural vocoder

Speech synthesis example using the HiFi-GAN neural vocoder

In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent performance on speech quality. Wavenet factorised the joint probability of a waveform as a product of conditional probabilities as follows

where is the model parameter including many dilated convolution layers. Thus, each audio sample is conditioned on the samples at all previous timesteps. However, the auto-regressive nature of WaveNet makes the inference process dramatically slow. To solve this problem, Parallel WaveNet[5] was proposed. Parallel WaveNet is an inverse autoregressive flow-based model which is trained by knowledge distillation with a pre-trained teacher WaveNet model. Since such inverse autoregressive flow-based models are non-auto-regressive when performing inference, the inference speed is faster than real-time. Meanwhile, Nvidia proposed a flow-based WaveGlow[6] model, which can also generate speech faster than real-time. However, despite the high inference speed, parallel WaveNet has the limitation of needing a pre-trained WaveNet model, so that WaveGlow takes many weeks to converge with limited computing devices. This issue has been solved by Parallel WaveGAN,[7] which learns to produce speech through multi-resolution spectral loss and GAN learning strategies.

References

  1. ^ a b van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet". DeepMind. Retrieved 2022-06-05.
  2. ^ Chung, Yu-An (2018). "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis". arXiv:1808.10128 [cs.CL].
  3. ^ Ren, Yi (2019). "Almost Unsupervised Text to Speech and Automatic Speech Recognition". arXiv:1905.06791 [cs.CL].
  4. ^ Jia, Ye (2018). "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis". arXiv:1806.04558 [cs.CL].
  5. ^ van den Oord, Aaron (2018). "Parallel WaveNet: Fast High-Fidelity Speech Synthesis". arXiv:1711.10433 [cs.CL].
  6. ^ Prenger, Ryan (2018). "WaveGlow: A Flow-based Generative Network for Speech Synthesis". arXiv:1811.00002 [cs.SD].
  7. ^ Yamamoto, Ryuichi (2019). "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram". arXiv:1910.11480 [eess.AS].

Read other articles:

Sir David Davies (1792-1865) Sir David Davies (1792 - 1865) was a Welsh physician and physician to King William IV and Queen Adelaide. The son of Robert and Eleanor Davies of Llanddewi Brefi, Cardiganshire, he was christened at Llanddewi Brefi church, 5 September 1792. [1] Entering the medical profession whilst still quite a young man, he moved to London, and worked as an assistant to one of the physicians to Queen Adelaide. He was later appointed physician to King William IV and Ade...

 

 

Lawrence JoelLahir(1928-02-22)22 Februari 1928Winston-Salem, Carolina Utara, ASMeninggal4 Februari 1984(1984-02-04) (umur 55)Winston-Salem, Carolina UtaraDikebumikanArlington National CemeteryPengabdianAmerika SerikatDinas/cabangAngkatan Darat Amerika SerikatLama dinas1946–1973Pangkat Sersan kelas satuKesatuan1st Battalion (Abn), 503d Infantry 173rd Airborne BrigadePerang/pertempuranPerang KoreaPerang Vietnam (WIA) Operasi Hump Penghargaan Medal of Honor Silver Star Purple He...

 

 

Vaccine candidate against COVID-19 NDV-HXP-SPackaging for the Brazilian version of NDV-HXP-S, ButanVacVaccine descriptionTargetSARS-CoV-2Vaccine typeviral vector or inactivatedClinical dataTrade namesButanVac (Brazil) COVIVAC (Vietnam) HXP-GPOVac (Thailand) Patria (Mexico)Other namesADAPTCOVRoutes ofadministrationIntramuscular,[1] Intranasal Part of a series on theCOVID-19 pandemicScientifically accurate atomic model of the external structure of SARS-CoV-2. Each ball is an atom. COVID...

EuroLeague season EuroleagueThe Sinan Erdem Dome in Istanbul hosted the Final FourSeason2011–12Duration19 October 2011 – 13 May 2012Number of teams24Regular seasonSeason MVP Andrei KirilenkoFinalsChampions Olympiacos (2nd title)  Runners-up CSKA MoscowThird place FC Barcelona RegalFourth place PanathinaikosFinal Four MVP Vassilis SpanoulisStatistical leadersPoints Bo McCalebb 16.9Rebounds Andrei Kirilenko 7.5Assists Omar Cook 5.7Index Rating Andrei Kirilenko 24.2← 2010–11 20...

 

 

River in Germany SiegDrainage basin map of the SiegLocationCountryGermanyPhysical characteristicsSource  • locationSiegerland • elevation603 m (1,978 ft) Mouth  • locationRhine • coordinates50°46′7″N 7°4′32″E / 50.76861°N 7.07556°E / 50.76861; 7.07556Length155.1 km (96.4 mi) [1]Basin size2,857 km2 (1,103 sq mi) [1]Discharge ...

 

 

KorçëNegaraAlbaniaCountyCounty KorçëDistrikKorçë DistrikKetinggian850 m (2,800 ft)Populasi (2009)[1] • Total104.697Zona waktuUTC+1 (Central European Time) • Musim panas (DST)UTC+2 (CEST)Postal code7001-7004Kode area telepon082Car PlatesKOSitus webwww.bashkiakorce.gov.alSkënderbeu Korçë Korçë Korçë (bahasa Albania: Korçë atau Korça; bahasa Aromania: Curceaua; bahasa Bulgaria: Корча, Korcha atau Корче, Korche; bahasa Yunan...

American actress and film director (born 1962) For other people named Rebecca Miller, see Rebecca Miller (disambiguation). Rebecca MillerMiller at the 2023 BerlinaleBornRebecca Augusta Miller (1962-09-15) September 15, 1962 (age 61)Roxbury, Connecticut, U.S.OccupationScreenwriter, director, novelistAlma materYale UniversityYears active1988–presentSpouse Daniel Day-Lewis ​(m. 1996)​Children2ParentsArthur MillerInge MorathRelativesJoan Copeland (aun...

 

 

Северный морской котик Самец Научная классификация Домен:ЭукариотыЦарство:ЖивотныеПодцарство:ЭуметазоиБез ранга:Двусторонне-симметричныеБез ранга:ВторичноротыеТип:ХордовыеПодтип:ПозвоночныеИнфратип:ЧелюстноротыеНадкласс:ЧетвероногиеКлада:АмниотыКлада:Синапси...

 

 

FairphoneLogo Stato Paesi Bassi Forma societariaBesloten vennootschap Fondazione2013[1] a Amsterdam[1] Fondata daBas van AbelTessa Wernink Sede principaleAmsterdam [2] Persone chiaveEva Gouwens[2] SettoreTelecomunicazioni Prodotti smartphone componenti di ricambio accessori Sito webwww.fairphone.com/ Modifica dati su Wikidata · Manuale Fairphone è un'azienda nata nei Paesi Bassi nel gennaio 2013,[3] guidata da Bas van Abel, che con l'aiu...

Sungai Effra dilihat dari Jembatan VauxhallLokasiNegaraInggrisCiri-ciri fisikHulu sungaiUpper Norwood Recreation Ground, Crystal Palace, London - koordinat51°29′14″N 0°07′33″W / 51.4872°N 0.1257°W / 51.4872; -0.1257Koordinat: 51°29′14″N 0°07′33″W / 51.4872°N 0.1257°W / 51.4872; -0.1257 Muara sungaiRiver Thames Sungai Effra adalah sebuah sungai bawah tanah yang berlokasi di London Selatan, London, Inggris. Sungai...

 

 

Javanese ghost from folklore This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Pocong – news · newspapers · books · scholar · JSTOR (June 2017) (Learn how and when to remove this message) PocongOther name(s)hantu bungkus (Malay), bobongkong (Banten Sundanese)CountryIndonesiaRegion Origin Central and easte...

 

 

株式会社ひのき 種類 株式会社市場情報 非上場略称 CUEtv(キューテレビ)本社所在地 日本〒771-0205徳島県板野郡北島町江尻字妙蛇池27-8設立 1988年(昭和63年)9月30日業種 情報・通信業法人番号 6480001005341事業内容 有線テレビジョン放送事業電気通信事業代表者 檜悟(代表取締役)資本金 2000万円外部リンク https://www.cue.tv/特記事項:指定番号 SK0055-0053テンプレートを表示...

WILD ADAPTER ジャンル ハードボイルド 漫画:WILD ADAPTER 作者 峰倉かずや 出版社 徳間書店 掲載誌 CharaコミックZERO-SUM増刊WARDゼロサムオンライン レーベル キャラコミックスZERO-SUMコミックス 巻数 7巻(続刊中) OVA:WILD ADAPTER 原作 峰倉かずや 監督 葛谷直行 シリーズ構成 星野尚 キャラクターデザイン 原田峰文 アニメーション制作 Anpro(全巻)Creators in Pack(下巻のみ) 製�...

 

 

  هذه المقالة عن فريق الرجال. لفريق السيدات، طالع منتخب السودان لكرة القدم للسيدات. منتخب السودان لكرة القدم معلومات عامة اللقب صقور الجديان[1] بلد الرياضة  السودان الفئة كرة القدم للرجال  رمز الفيفا SDN الاتحاد الاتحاد السوداني لكرة القدم كونفدرالية كاف (أفري�...

 

 

Form of mixed climbing on bare rock Part of a series onClimbingClimber dry-tooling in the Alps Lists Climbers Piolet d'Or winners IFSC victories Equipment Knots Historical events Grade milestones Eight-thousanders Terminology Types of rock climbing Aid Big wall Multi-pitch Bouldering Highball Competition Speed Free Sport Traditional Solo Free solo Deep-water solo Rope solo Top roping Types of mountaineering Alpine Mixed Via ferrata Himalayan Alpine style Expedition style Ice Dry-tooling Scram...

Clive JamesInformación personalNacimiento 7 de octubre de 1939 Kogarah (Australia) Fallecimiento 24 de noviembre de 2019 (80 años)Cambridge (Reino Unido) Causa de muerte Leucemia Nacionalidad Australiana y británicaReligión Ateísmo FamiliaHijos 2 EducaciónEducación Bachelor of Arts (Honours) Educado en Universidad de SídneyPembroke CollegeSydney Technical High School Información profesionalOcupación Escritor, broadcaster, crítico literario, poeta y presentador de televisión Emplea...

 

 

African American abolitionist, Freemason, and Mormon elder Kwaku Walker Lewis Personal detailsBorn(1798-08-03)August 3, 1798Barre, Massachusetts, U.S.DiedOctober 26, 1856(1856-10-26) (aged 58)Lowell, Massachusetts, U.S.Resting placeLowell CemeterySpouse(s)Elizabeth LovejoyChildrenEnoch Lovejoy LewisLydia Elizabeth WalkerLucy Minor LewisWalker Lovejoy LewisParentsPeter P. LewisMinor Walker Biography portal   LDS movement portal Kwaku Walker Lewis[1] (August 3, 1798 �...

 

 

Pour les articles homonymes, voir 1er corps d'armée. 1er corps d'arméeQuân đoàn II Corps Drapeau du 1er corps d'armée Création 1957 Dissolution 1975 Pays République du Viêt Nam Type Corps d'armée Fait partie de Forces armées de la république du Viêt Nam Garnison Province de Quảng Trị Province de Thừa Thiên-Huế Province de Quảng Nam Province de Quảng Tín (en) Province de Quảng Ngãi Devise Bến Hải Hưng Binh, Tiên Phong Diệt Cộng (Soldats patriotes d...

1396 battle during the Ottoman wars in Europe For other uses, see Battle of Nicopolis (disambiguation). Battle of NicopolisPart of the Ottoman wars in Europeand the CrusadesMiniature by Jean Colombe (c. 1475)Date25 September 1396LocationNicopolis, Ottoman Empire43°42′21″N 24°53′45″E / 43.70583°N 24.89583°E / 43.70583; 24.89583Result Ottoman victoryTerritorialchanges Crusader failure to capture Nicopolis from the OttomansBelligerents Ottoman EmpireMoravian S...

 

 

Roman panel painting The Severan TondoYearEarly 3rd century ADDimensions30.5 cm diameter (12.0 in)LocationAntikensammlung, Altes Museum, Berlin The Severan Tondo or Berlin Tondo from c. 200 AD is one of the few preserved examples of panel painting from classical antiquity, depicting the first two generations of the imperial Severan dynasty, whose members ruled the Roman Empire in the late 2nd and early 3rd centuries. It depicts the Roman emperor Septimius Severus (r. 193...