Text-to-image model

An image conditioned on the prompt "an astronaut riding a horse, by Hiroshige", generated by Stable Diffusion 3.5, part of a family of large-scale text-to-image models first released in 2022

A text-to-image model is a machine learning model that takes a natural language description as input and produces an image matching that description.

Text-to-image models began to be developed in the mid-2010s, at the start of the AI boom, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models (such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney) began to be regarded as approaching the quality of real photographs and human-drawn art.

Text-to-image models are generally latent diffusion models that combine a language model, which transforms the input text into a latent representation, with a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.[1]

History

Before the rise of deep learning, attempts to build text-to-image models were limited to assembling collages from existing component images, such as those in a database of clip art.[2][3]

The inverse task, image captioning, was more tractable, and a number of deep learning models for image captioning preceded the first text-to-image models.[4]

The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences.[4] Images generated by alignDRAW were low in resolution (32×32 pixels, attained from resizing) and were considered to be 'low in diversity'. The model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", suggesting that it was not merely "memorizing" data from the training set.[4][5]

Eight images generated from the text prompt "A stop sign is flying in blue skies." by alignDRAW (2015). Enlarged to show detail.[6]

In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task.[5][7] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.[5] Later systems include VQGAN-CLIP,[8] XMC-GAN, and GauGAN2.[9]

Images generated by DALL-E 2 (top, April 2022) and DALL-E 3 (bottom, September 2023) for the prompt "A stop sign is flying in blue skies"

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021.[10] A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022,[11] followed by Stable Diffusion, which was publicly released in August 2022.[12] Also in August 2022, text-to-image personalization was introduced: it allows a model to be taught a new concept from a small set of images of an object that was not included in the training set of the text-to-image foundation model. This is achieved by textual inversion, namely, finding a new text term that corresponds to these images.

Following text-to-image models, language-model-powered text-to-video platforms such as Runway, Make-A-Video,[13] Imagen Video,[14] Midjourney,[15] and Phenaki[16] can generate video from text prompts or from combined text and image prompts.[17]

Architecture and training

High-level architecture showing the state of AI art machine learning models, with notable models and applications

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks (GANs) have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images and then use one or more auxiliary deep learning models to upscale them, filling in finer details.
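The staged design described above (text encoder, low-resolution generator, upscaler) can be pictured as a chain of three functions. The sketch below is purely illustrative and shows only how data flows between the stages: the names `encode_text`, `generate_low_res`, `upscale`, and `text_to_image` are hypothetical, and each stage is a trivial stand-in (token hashing, seeded noise, nearest-neighbour upsampling) rather than a trained network.

```python
import hashlib
import random


def encode_text(prompt, dim=8):
    """Stand-in text encoder: hash each token into a vector and average them."""
    tokens = prompt.lower().split()
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        for i in range(dim):
            vec[i] += digest[i] / 255.0
    return [v / max(len(tokens), 1) for v in vec]


def generate_low_res(embedding, size=8, seed=0):
    """Stand-in generator: seeded noise mixed with the text embedding.

    A real system would use a trained GAN or diffusion model here.
    """
    rng = random.Random(seed)
    return [
        [0.5 * rng.random() + 0.5 * embedding[(r + c) % len(embedding)]
         for c in range(size)]
        for r in range(size)
    ]


def upscale(image, factor=4):
    """Stand-in super-resolution stage: nearest-neighbour upsampling."""
    out = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def text_to_image(prompt):
    """Chain the three stages: text -> embedding -> low-res image -> upscaled image."""
    embedding = encode_text(prompt)
    low_res = generate_low_res(embedding)
    return upscale(low_res)


image = text_to_image("a stop sign is flying in blue skies")
print(len(image), len(image[0]))
```

In a real cascade the upscaling stage is itself a learned model that adds detail, whereas the nearest-neighbour stand-in here only repeats pixels.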

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using, as the text encoder, a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the approach that had been standard until then.[18]

Datasets

Examples of images and captions from three public datasets which are commonly used to train text-to-image models

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is the COCO dataset. Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diverse range of objects, with five captions per image written by human annotators. Oxford-120 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets because of their narrow range of subject matter.[7]
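As a rough illustration of how a captioned dataset becomes training data, the sketch below flattens images that each carry several captions into individual (text, image) pairs. The `CaptionedImage` class, the `to_training_pairs` function, the file names, and the captions are all made up for this example; real datasets such as COCO ship their own annotation formats.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CaptionedImage:
    image_path: str  # hypothetical file name, not a real dataset path
    captions: List[str] = field(default_factory=list)


def to_training_pairs(records: List[CaptionedImage]) -> List[Tuple[str, str]]:
    """Flatten records into (text, image) pairs, one pair per caption."""
    return [(caption, rec.image_path) for rec in records for caption in rec.captions]


# Placeholder records with five captions per image.
records = [
    CaptionedImage("img_0001.jpg", [f"placeholder caption {i} for image 1" for i in range(5)]),
    CaptionedImage("img_0002.jpg", [f"placeholder caption {i} for image 2" for i in range(5)]),
]

pairs = to_training_pairs(records)
print(len(pairs))  # 10: two images times five captions each
```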

Quality evaluation

Evaluating and comparing text-to-image models requires assessing several desirable properties at once. A desideratum specific to text-to-image models is that generated images align semantically with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement.[7]

A common algorithmic metric for assessing image quality and diversity is the Inception Score (IS), which is based on the distribution of labels predicted by a pretrained Inception v3 image classification model when applied to a sample of images generated by the text-to-image model. The score is increased when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance (FID), which compares the distribution of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model.[7]
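The Inception Score is commonly defined as exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is the classifier's label distribution for one generated image and p(y) is the average of those distributions over the whole sample. A minimal sketch of that calculation follows. It assumes the per-image probabilities have already been obtained from a classifier, and the function name `inception_score` is chosen here for illustration only.

```python
import math


def inception_score(probs):
    """Compute exp(mean KL(p(y|x) || p(y))) from per-image label probabilities.

    probs: list of lists, one probability distribution over labels per image.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal label distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    total_kl = 0.0
    for p in probs:
        # Terms with p_j == 0 contribute nothing to the KL divergence.
        total_kl += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )
    return math.exp(total_kl / n)


# Confident predictions spread evenly over three labels.
confident = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Completely uncertain predictions.
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3

print(inception_score(confident))
print(inception_score(uncertain))
```

Under this definition the score is close to 1 when every prediction is uniform and approaches the number of labels when each image receives one confident label and the labels are spread evenly across the sample, which is how the metric rewards both distinctness and diversity.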

Impact and applications

AI has the potential to transform society. Possible effects include the expansion of noncommercial niche genres (such as cyberpunk derivatives like solarpunk) by amateurs, novel entertainment, fast prototyping,[19] increased accessibility of art-making,[19] and greater artistic output per unit of effort, expense, or time,[19] for example through the generation of drafts, draft refinements, and image components (inpainting). Generated images are sometimes used as sketches,[20] low-cost experiments,[21] inspiration, or illustrations of proof-of-concept-stage ideas. Further functionality or improvement may come from manual editing after generation (i.e., polishing), such as subsequent tweaking with an image editor.[21]

List of notable text-to-image models

Name | Release date | Developer | License
DALL-E | January 2021 | OpenAI | Proprietary
DALL-E 2 | April 2022 | OpenAI | Proprietary
DALL-E 3 | September 2023 | OpenAI | Proprietary
Ideogram 2.0 | August 2024 | Ideogram | Proprietary
Imagen | April 2023 | Google | Proprietary
Imagen 2 | December 2023[22] | Google | Proprietary
Imagen 3 | May 2024 | Google | Proprietary
Parti | Unreleased | Google | Proprietary
Firefly | March 2023 | Adobe Inc. | Proprietary
Midjourney | July 2022 | Midjourney, Inc. | Proprietary
Stable Diffusion | August 2022 | Stability AI | Stability AI Community License
FLUX.1 | August 2024 | Black Forest Labs | Apache License
RunwayML | 2018 | Runway AI, Inc. | Proprietary

References

  1. Vincent, James (May 24, 2022). "All these images were generated by Google's latest text-to-image AI". The Verge. Vox Media. Retrieved May 28, 2022.
  2. Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan (October 2019), A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis, arXiv:1910.09399
  3. Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley (2007). "A text-to-picture synthesis system for augmenting communication" (PDF). AAAI. 7: 1590–1595.
  4. Mansimov, Elman; Parisotto, Emilio; Lei Ba, Jimmy; Salakhutdinov, Ruslan (November 2015). "Generating Images from Captions with Attention". ICLR. arXiv:1511.02793.
  5. Reed, Scott; Akata, Zeynep; Logeswaran, Lajanugen; Schiele, Bernt; Lee, Honglak (June 2016). "Generative Adversarial Text to Image Synthesis" (PDF). International Conference on Machine Learning. arXiv:1605.05396.
  6. Mansimov, Elman; Parisotto, Emilio; Ba, Jimmy Lei; Salakhutdinov, Ruslan (February 29, 2016). "Generating Images from Captions with Attention". International Conference on Learning Representations. arXiv:1511.02793.
  7. Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. arXiv:2101.09983. doi:10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782.
  8. Rodriguez, Jesus (27 September 2022). "🌅 Edge#229: VQGAN + CLIP". thesequence.substack.com. Retrieved 2022-10-10.
  9. Rodriguez, Jesus (4 October 2022). "🎆🌆 Edge#231: Text-to-Image Synthesis with GANs". thesequence.substack.com. Retrieved 2022-10-10.
  10. Coldewey, Devin (5 January 2021). "OpenAI's DALL-E creates plausible images of literally anything you ask it to". TechCrunch.
  11. Coldewey, Devin (6 April 2022). "OpenAI's new DALL-E model draws anything — but bigger, better and faster than before". TechCrunch.
  12. "Stable Diffusion Public Release". Stability.Ai. Retrieved 2022-10-27.
  13. Kumar, Ashish (2022-10-03). "Meta AI Introduces 'Make-A-Video': An Artificial Intelligence System That Generates Videos From Text". MarkTechPost. Retrieved 2022-10-03.
  14. Edwards, Benj (2022-10-05). "Google's newest AI generator creates HD video from text prompts". Ars Technica. Retrieved 2022-10-25.
  15. Rodriguez, Jesus (25 October 2022). "🎨 Edge#237: What is Midjourney?". thesequence.substack.com. Retrieved 2022-10-26.
  16. "Phenaki". phenaki.video. Retrieved 2022-10-03.
  17. Edwards, Benj (9 September 2022). "Runway teases AI-powered text-to-video editing using written prompts". Ars Technica. Retrieved 12 September 2022.
  18. Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
  19. Elgan, Mike (1 November 2022). "How 'synthetic media' will transform business forever". Computerworld. Retrieved 9 November 2022.
  20. Roose, Kevin (21 October 2022). "A.I.-Generated Art Is Already Transforming Creative Work". The New York Times. Retrieved 16 November 2022.
  21. Leswing, Kif. "Why Silicon Valley is so excited about awkward drawings done by artificial intelligence". CNBC. Retrieved 16 November 2022.
  22. "Imagen 2 on Vertex AI is now generally available". Google Cloud Blog. Retrieved 2024-01-02.
