Deep learning speech synthesis

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks are trained using large amounts of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Formulation

Given an input text or some sequence of linguistic units $Y$, the target speech $X$ can be derived by

$$X = \arg\max P(X \mid Y, \theta)$$

where $\theta$ is the set of model parameters.

Typically, the input text is first passed to an acoustic feature generator, and the resulting acoustic features are then passed to the neural vocoder. For the acoustic feature generator, the loss function is typically L1 loss (mean absolute error, MAE) or L2 loss (mean squared error, MSE). These loss functions implicitly assume that the output acoustic features follow a Laplacian or Gaussian distribution, respectively. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function is designed to penalise errors in this range more heavily:

$$\text{loss} = \alpha \, \text{loss}_{\text{human}} + (1 - \alpha) \, \text{loss}_{\text{other}}$$

where $\text{loss}_{\text{human}}$ is the loss from the human voice band, $\text{loss}_{\text{other}}$ is the loss from the remaining frequencies, and $\alpha$ is a scalar, typically around 0.5. The acoustic feature is typically a spectrogram or a spectrogram on the Mel scale (mel-spectrogram). These features capture the time-frequency relation of the speech signal and are therefore sufficient to generate intelligible outputs. The Mel-frequency cepstrum feature used in speech recognition is not suitable for speech synthesis, as it discards too much information.
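The band-weighted loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the combination takes the form alpha * loss_human + (1 - alpha) * loss_other, uses L1 error on a linear-frequency magnitude spectrogram, and the function name band_weighted_l1 and its parameters are hypothetical.

```python
import numpy as np

def band_weighted_l1(pred, target, bin_freqs_hz, alpha=0.5, band=(300.0, 4000.0)):
    """Weighted L1 loss over spectrogram bins (illustrative sketch).

    pred, target: arrays of shape (frames, bins), magnitude spectrograms.
    bin_freqs_hz: array of shape (bins,), centre frequency of each bin.
    alpha: weight on the voice-band term (assumed form of the loss).
    """
    abs_err = np.abs(pred - target)
    in_band = (bin_freqs_hz >= band[0]) & (bin_freqs_hz <= band[1])
    if not in_band.any() or in_band.all():
        raise ValueError("band must cover some, but not all, frequency bins")
    loss_human = abs_err[:, in_band].mean()   # mean error inside 300-4000 Hz
    loss_other = abs_err[:, ~in_band].mean()  # mean error everywhere else
    return alpha * loss_human + (1.0 - alpha) * loss_other

# Usage with made-up data: 16 kHz audio, 512-point FFT (both values are assumptions).
freqs = np.fft.rfftfreq(512, d=1.0 / 16000)
rng = np.random.default_rng(0)
target_spec = rng.random((10, freqs.size))
pred_spec = target_spec + 0.1 * rng.standard_normal((10, freqs.size))
print(band_weighted_l1(pred_spec, target_spec, freqs))
```

Because each term is a mean over its own set of bins, how much extra emphasis the voice band receives depends both on alpha and on how many bins fall inside the band.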

History

A stack of dilated causal convolutional layers used in WaveNet[1]

In September 2016, DeepMind proposed WaveNet, a deep generative model of raw audio waveforms, demonstrating that deep learning-based models are capable of modeling raw waveforms and generating speech from acoustic features like spectrograms or mel-spectrograms. Although WaveNet was initially considered too computationally expensive and slow to be used in consumer products, a year after its release DeepMind unveiled a modified version known as "Parallel WaveNet," a production model 1,000 times faster than the original.[1]

In early 2017, Mila proposed char2wav, a model that produces raw waveforms in an end-to-end manner. In the same year, Google and Facebook proposed Tacotron and VoiceLoop, respectively, to generate acoustic features directly from the input text; months later, Google proposed Tacotron2, which combined the WaveNet vocoder with the revised Tacotron architecture to perform end-to-end speech synthesis. Tacotron2 can generate high-quality speech approaching the human voice.[citation needed]

Semi-supervised learning

Self-supervised learning has gained much attention because it makes better use of unlabelled data. Research has shown that, with the aid of a self-supervised loss, the need for paired data decreases.[2][3]

Zero-shot speaker adaptation

Zero-shot speaker adaptation is promising because a single model can generate speech with various speaker styles and characteristics. In June 2018, Google proposed using pre-trained speaker verification models as speaker encoders to extract speaker embeddings.[4] The speaker encoder then becomes part of the neural text-to-speech model, so that it can determine the style and characteristics of the output speech. This procedure has shown the community that a single model can generate speech in multiple styles.
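How a speaker embedding steers the synthesiser can be illustrated with a small NumPy sketch. It assumes one common conditioning scheme, in which a fixed-length, L2-normalised speaker embedding is concatenated to every time step of the text encoder output; the function name condition_on_speaker and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

def condition_on_speaker(encoder_out, speaker_emb):
    """Concatenate one speaker embedding to every encoder time step (sketch).

    encoder_out: array of shape (T, D_text), text encoder outputs.
    speaker_emb: array of shape (D_spk,), from a speaker encoder.
    Returns an array of shape (T, D_text + D_spk).
    """
    norm = max(np.linalg.norm(speaker_emb), 1e-12)  # avoid division by zero
    emb = speaker_emb / norm                        # L2-normalise the embedding
    tiled = np.broadcast_to(emb, (encoder_out.shape[0], emb.shape[0]))
    return np.concatenate([encoder_out, tiled], axis=1)

# Usage with made-up sizes: 5 encoder steps, 8 text dims, 4 speaker dims.
rng = np.random.default_rng(0)
text_states = rng.standard_normal((5, 8))
speaker_a = rng.standard_normal(4)
conditioned = condition_on_speaker(text_states, speaker_a)
print(conditioned.shape)
```

Swapping speaker_a for an embedding extracted from a different speaker changes only the appended columns, which is what lets one set of synthesiser weights serve many voices.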

Neural vocoder

Speech synthesis example using the HiFi-GAN neural vocoder

In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent speech quality. WaveNet factorises the joint probability of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ as a product of conditional probabilities:

$$p_{\theta}(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

where $\theta$ denotes the model parameters, which include many dilated convolution layers. Thus, each audio sample $x_t$ is conditioned on the samples at all previous timesteps. However, the auto-regressive nature of WaveNet makes inference very slow. To solve this problem, Parallel WaveNet[5] was proposed. Parallel WaveNet is an inverse autoregressive flow-based model that is trained by knowledge distillation from a pre-trained teacher WaveNet model. Since such inverse autoregressive flow-based models are non-auto-regressive at inference time, they can generate speech faster than real-time. Meanwhile, Nvidia proposed the flow-based WaveGlow[6] model, which can also generate speech faster than real-time. However, despite its high inference speed, Parallel WaveNet has the limitation of needing a pre-trained WaveNet model, while WaveGlow takes many weeks to converge with limited computing devices. This issue has been addressed by Parallel WaveGAN,[7] which learns to produce speech through a multi-resolution spectral loss and GAN learning strategies.
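The product-of-conditionals factorisation, and the reason it makes sampling slow, can be made concrete with a toy example. The sketch below is not WaveNet: toy_log_conditional is a hypothetical stand-in for the network that simply centres a distribution on the mean of the last few samples. Only the structure is the point: the log-likelihood is a sum of conditional log-probabilities, and sampling must proceed one step at a time because each step needs the previously generated samples.

```python
import numpy as np

NUM_LEVELS = 256  # WaveNet quantises audio to 256 levels; everything else here is a toy

def toy_log_conditional(history, receptive_field=4):
    """Hypothetical stand-in for the network: log p(x_t | x_1..x_{t-1})."""
    context = history[-receptive_field:]
    centre = int(np.mean(context)) if len(context) else NUM_LEVELS // 2
    logits = -0.05 * (np.arange(NUM_LEVELS) - centre) ** 2
    # Numerically stable log-softmax over the quantisation levels.
    m = logits.max()
    return logits - (m + np.log(np.exp(logits - m).sum()))

def log_likelihood(x):
    """log p(x) = sum over t of log p(x_t | x_1..x_{t-1})."""
    return sum(toy_log_conditional(x[:t])[x[t]] for t in range(len(x)))

def sample(num_steps, rng):
    """Autoregressive sampling: each step waits for all previous samples."""
    x = []
    for _ in range(num_steps):
        probs = np.exp(toy_log_conditional(x))
        probs /= probs.sum()  # guard against rounding drift
        x.append(int(rng.choice(NUM_LEVELS, p=probs)))
    return x

rng = np.random.default_rng(0)
waveform = sample(20, rng)
print(waveform)
print(log_likelihood(waveform))
```

The loop in sample cannot be parallelised over time, so generating one second of 16 kHz audio requires 16,000 sequential model evaluations; non-autoregressive vocoders avoid exactly this loop.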

References

  1. ^ a b van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet". DeepMind. Retrieved 2022-06-05.
  2. ^ Chung, Yu-An (2018). "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis". arXiv:1808.10128 [cs.CL].
  3. ^ Ren, Yi (2019). "Almost Unsupervised Text to Speech and Automatic Speech Recognition". arXiv:1905.06791 [cs.CL].
  4. ^ Jia, Ye (2018). "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis". arXiv:1806.04558 [cs.CL].
  5. ^ van den Oord, Aaron (2018). "Parallel WaveNet: Fast High-Fidelity Speech Synthesis". arXiv:1711.10433 [cs.CL].
  6. ^ Prenger, Ryan (2018). "WaveGlow: A Flow-based Generative Network for Speech Synthesis". arXiv:1811.00002 [cs.SD].
  7. ^ Yamamoto, Ryuichi (2019). "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram". arXiv:1910.11480 [eess.AS].
