Weak supervision (also known as semi-supervised learning) is a paradigm in machine learning whose relevance and notability increased with the advent of large language models, due to the large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data (used exclusively in the more expensive and time-consuming supervised learning paradigm) with a large amount of unlabeled data (used exclusively in the unsupervised learning paradigm). In other words, the desired output values are provided only for a subset of the training data; the remaining data is unlabeled or imprecisely labeled. Intuitively, the learning problem can be seen as an exam and the labeled data as the sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam.
The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.
More formally, semi-supervised learning assumes that a set of $l$ independently and identically distributed examples $x_1,\dots,x_l \in X$ with corresponding labels $y_1,\dots,y_l \in Y$ and $u$ unlabeled examples $x_{l+1},\dots,x_{l+u} \in X$ are processed. Semi-supervised learning combines this information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.
Semi-supervised learning may refer to either transductive learning or inductive learning.[1] The goal of transductive learning is to infer the correct labels for the given unlabeled data $x_{l+1},\dots,x_{l+u}$ only. The goal of inductive learning is to infer the correct mapping from $X$ to $Y$.
It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
Assumptions
In order to make any use of unlabeled data, some relationship to the underlying distribution of data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions:[2]
Continuity / smoothness assumption
Points that are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that few points are close to each other but in different classes.[3]
Cluster assumption
The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data that shares a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.
Manifold assumption
The data lie approximately on a manifold of much lower dimension than the input space. In this case learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold.
The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom. For instance, human voice is controlled by a few vocal folds,[4] and images of various facial expressions are controlled by a few muscles. In these cases, it is better to consider distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images, respectively.
History
The heuristic approach of self-training (also known as self-learning or self-labeling) is historically the oldest approach to semi-supervised learning,[2] with examples of applications starting in the 1960s.[5]
The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s.[6] Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995.[7]
Methods
Generative models
Generative approaches to statistical learning first seek to estimate $p(x|y)$, the distribution of data points belonging to each class. The probability $p(y|x)$ that a given point $x$ has label $y$ is then proportional to $p(x|y)p(y)$ by Bayes' rule. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about $p(x)$) or as an extension of unsupervised learning (clustering plus some labels).
Generative models assume that the distributions take some particular form $p(x|y,\theta)$ parameterized by the vector $\theta$. If these assumptions are incorrect, the unlabeled data may actually decrease the accuracy of the solution relative to what would have been obtained from labeled data alone.[8]
However, if the assumptions are correct, then the unlabeled data necessarily improves performance.[7]
The unlabeled data are distributed according to a mixture of individual-class distributions. In order to learn the mixture distribution from the unlabeled data, it must be identifiable, that is, different parameters must yield different summed distributions. Gaussian mixture distributions are identifiable and commonly used for generative models.
The parameterized joint distribution can be written as $p(x,y|\theta)=p(y|\theta)p(x|y,\theta)$ by using the chain rule. Each parameter vector $\theta$ is associated with a decision function $f_\theta(x)=\underset{y}{\operatorname{argmax}}\ p(y|x,\theta)$.
The parameter $\theta$ is then chosen based on fit to both the labeled and unlabeled data, weighted by $\lambda$:

$\underset{\theta}{\operatorname{argmax}}\left(\log p(\{x_i,y_i\}_{i=1}^{l}\mid\theta)+\lambda\log p(\{x_i\}_{i=l+1}^{l+u}\mid\theta)\right)$
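As a concrete illustration of the generative approach, the sketch below fits a two-class, one-dimensional Gaussian mixture by expectation-maximization, treating the labels as observed for the labeled points and inferring soft responsibilities for the unlabeled points via Bayes' rule. The data, the initialization, the iteration count, and the choice of $\lambda = 1$ (unlabeled points weighted equally) are illustrative assumptions, not part of any standard implementation.

```python
import math

# Semi-supervised EM for a two-class 1-D Gaussian mixture: labeled points
# keep hard (observed) responsibilities, unlabeled points get soft
# responsibilities from Bayes' rule under the current parameters.

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_semi_supervised(labeled, unlabeled, n_iter=50):
    # labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x
    mu = [min(x for x, _ in labeled), max(x for x, _ in labeled)]  # crude init
    var = [1.0, 1.0]
    prior = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities r[k] = p(y = k | x, theta)
        resp = []
        for x, y in labeled:            # labels observed: hard assignment
            resp.append((x, [1.0 if k == y else 0.0 for k in (0, 1)]))
        for x in unlabeled:             # Bayes' rule on current parameters
            p = [prior[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
            z = sum(p)
            resp.append((x, [pk / z for pk in p]))
        # M-step: responsibility-weighted maximum-likelihood updates
        for k in (0, 1):
            w = sum(r[k] for _, r in resp)
            mu[k] = sum(x * r[k] for x, r in resp) / w
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for x, r in resp) / w, 1e-3)
            prior[k] = w / len(resp)
    return mu, var, prior

labeled = [(-2.1, 0), (-1.9, 0), (2.0, 1), (2.2, 1)]
unlabeled = [-2.4, -1.7, -2.0, 1.8, 2.3, 2.1, 1.9]
mu, var, prior = em_semi_supervised(labeled, unlabeled)
```

Because the mixture of two well-separated Gaussians is identifiable, the unlabeled points sharpen the estimates of both component means rather than degrading them.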
Low-density separation
Another major class of methods attempts to place boundaries in regions with few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas support vector machines for supervised learning seek a decision boundary with maximal margin over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard hinge loss $(1-yf(x))_+$ for labeled data, a loss function $(1-|f(x)|)_+$ is introduced over the unlabeled data by letting $y=\operatorname{sign} f(x)$. TSVM then selects $f^*(x)=h^*(x)+b$ from a reproducing kernel Hilbert space $\mathcal{H}$ by minimizing the regularized empirical risk:

$f^*=\underset{f}{\operatorname{argmin}}\left(\sum_{i=1}^{l}(1-y_i f(x_i))_+ + \lambda_1\|h\|_{\mathcal{H}}^2 + \lambda_2\sum_{i=l+1}^{l+u}(1-|f(x_i)|)_+\right)$
An exact solution is intractable due to the non-convex term $(1-|f(x)|)_+$, so research focuses on useful approximations.[9]
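To make the three terms of this objective concrete, the sketch below evaluates (rather than minimizes) the TSVM risk for a linear scorer $f(x) = w \cdot x + b$ on a tiny synthetic dataset. The data points and regularization weights are illustrative assumptions; the comparison shows why the unlabeled term favors boundaries in low-density regions.

```python
import numpy as np

# Evaluate the TSVM regularized empirical risk for a linear scorer.
def tsvm_objective(w, b, X_lab, y_lab, X_unl, lam1=0.1, lam2=0.1):
    f_lab = X_lab @ w + b
    f_unl = X_unl @ w + b
    hinge_lab = np.maximum(0.0, 1.0 - y_lab * f_lab).sum()  # standard hinge loss
    norm_term = lam1 * (w @ w)                               # RKHS norm for linear f
    hinge_unl = lam2 * np.maximum(0.0, 1.0 - np.abs(f_unl)).sum()  # pushes |f| >= 1
    return hinge_lab + norm_term + hinge_unl

X_lab = np.array([[-2.0], [2.0]])
y_lab = np.array([-1.0, 1.0])
X_unl = np.array([[-1.5], [1.7], [0.1]])  # the point near 0 sits in a low-density gap

# A boundary through the low-density region (b = 0) scores better than one
# shifted into the unlabeled mass (b = 1.5):
good = tsvm_objective(np.array([1.0]), 0.0, X_lab, y_lab, X_unl)
bad = tsvm_objective(np.array([1.0]), 1.5, X_lab, y_lab, X_unl)
```

The shifted boundary pays both a labeled hinge penalty and a larger unlabeled penalty, so the low-density placement wins.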
Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).
Laplacian regularization
Laplacian regularization has historically been approached through the graph Laplacian.
Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its $k$ nearest neighbors or to examples within some distance $\epsilon$. The weight $W_{ij}$ of an edge between $x_i$ and $x_j$ is then set to $e^{-\|x_i-x_j\|^2/\epsilon^2}$.
Within the framework of manifold regularization,[10][11] the graph serves as a proxy for the manifold. A term is added to the standard Tikhonov regularization problem to enforce smoothness of the solution relative to the manifold (in the intrinsic space of the problem) as well as relative to the ambient input space. The minimization problem becomes

$\underset{f\in\mathcal{H}}{\operatorname{argmin}}\left(\frac{1}{l}\sum_{i=1}^{l}V(f(x_i),y_i)+\lambda_A\|f\|_{\mathcal{H}}^2+\lambda_I\int_{\mathcal{M}}\|\nabla_{\mathcal{M}}f(x)\|^2\,dp(x)\right)$
where $\mathcal{H}$ is a reproducing kernel Hilbert space and $\mathcal{M}$ is the manifold on which the data lie. The regularization parameters $\lambda_A$ and $\lambda_I$ control smoothness in the ambient and intrinsic spaces respectively. The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian $L=D-W$, where $D_{ii}=\sum_{j=1}^{l+u}W_{ij}$, and letting $\mathbf{f}$ be the vector $[f(x_1),\dots,f(x_{l+u})]$, we have

$\mathbf{f}^{T}L\mathbf{f}=\frac{1}{2}\sum_{i,j=1}^{l+u}W_{ij}(f_i-f_j)^2$
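The graph approximation to the intrinsic regularizer can be sketched as follows: build a Gaussian-weighted similarity graph over all (labeled and unlabeled) points, form the Laplacian $L = D - W$, and compare the smoothness penalty $\mathbf{f}^T L \mathbf{f}$ for a labeling that is constant within clusters against one that flips inside them. The data and the bandwidth $\epsilon$ are illustrative assumptions.

```python
import numpy as np

def graph_laplacian(X, eps=1.0):
    # W_ij = exp(-||x_i - x_j||^2 / eps^2), zero on the diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps ** 2)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    return D - W

X = np.array([[0.0], [0.1], [3.0], [3.1]])   # two tight clusters
L = graph_laplacian(X)

f_smooth = np.array([1.0, 1.0, -1.0, -1.0])  # constant within each cluster
f_rough = np.array([1.0, -1.0, 1.0, -1.0])   # flips inside each cluster

# f^T L f = (1/2) * sum_ij W_ij (f_i - f_j)^2 penalizes label changes
# across strongly weighted (i.e., short) edges:
smooth_penalty = f_smooth @ L @ f_smooth
rough_penalty = f_rough @ L @ f_rough
```

The labeling that only changes across the low-density gap incurs a far smaller penalty, which is exactly the preference the intrinsic term encodes.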
The Laplacian can also be used to extend supervised learning algorithms such as regularized least squares and support vector machines (SVM) into semi-supervised versions, Laplacian regularized least squares and Laplacian SVM.
Heuristic approaches
Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labeled and unlabeled examples may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Then supervised learning proceeds from only the labeled examples. In this vein, some methods learn a low-dimensional representation using the supervised data and then apply either low-density separation or graph-based methods to the learned representation.[12][13] Iteratively refining the representation and then performing semi-supervised learning on said representation may further improve performance.
Self-training is a wrapper method for semi-supervised learning.[14] First a supervised learning algorithm is trained based on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for the supervised learning algorithm. Generally only the labels the classifier is most confident in are added at each step.[15] In natural language processing, a common self-training algorithm is the Yarowsky algorithm for problems like word sense disambiguation, accent restoration, and spelling correction.[16]
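The self-training loop described above can be sketched as follows. The base learner here is a deliberately simple nearest-centroid classifier with a distance-margin confidence score; the data, the confidence threshold, and the base learner are all illustrative assumptions (in practice any supervised classifier exposing a confidence score can be wrapped this way, e.g. via scikit-learn's SelfTrainingClassifier).

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # One centroid per class.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_with_confidence(centroids, X):
    # Two-class case: confidence is the margin between the two distances.
    classes = sorted(centroids)
    d = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    pred = np.array(classes)[d.argmin(axis=0)]
    conf = np.abs(d[0] - d[1])
    return pred, conf

def self_train(X_lab, y_lab, X_unl, threshold=1.0, max_rounds=10):
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(max_rounds):
        centroids = nearest_centroid_fit(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        pred, conf = predict_with_confidence(centroids, X_unl)
        keep = conf >= threshold          # adopt only confident pseudo-labels
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unl[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        X_unl = X_unl[~keep]
    return nearest_centroid_fit(X_lab, y_lab)

X_lab = np.array([[-2.0], [2.0]])
y_lab = np.array([0, 1])
X_unl = np.array([[-1.6], [-2.3], [1.8], [2.4], [0.05]])
centroids = self_train(X_lab, y_lab, X_unl)
```

Note that the ambiguous point near the boundary never clears the confidence threshold and so is never pseudo-labeled, which is the mechanism that keeps self-training from reinforcing its own mistakes.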
Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labeled examples for one another.[17]
In human cognition
Human responses to formal semi-supervised learning problems have yielded varying conclusions about the degree of influence of the unlabeled data.[18] More natural learning problems may also be viewed as instances of semi-supervised learning. Much of human concept learning involves a small amount of direct instruction (e.g. parental labeling of objects during childhood) combined with large amounts of unlabeled experience (e.g. observation of objects without naming or counting them, or at least without feedback).
Human infants are sensitive to the structure of unlabeled natural categories such as images of dogs and cats or male and female faces.[19] Infants and children take into account not only unlabeled examples, but the sampling process from which labeled examples arise.[20][21]
References
^ Scudder, H. (July 1965). "Probability of error of some adaptive pattern-recognition machines". IEEE Transactions on Information Theory. 11 (3): 363–371. doi:10.1109/TIT.1965.1053799. ISSN 1557-9654.
^ Vapnik, V.; Chervonenkis, A. (1974). Theory of Pattern Recognition (in Russian). Moscow: Nauka. Cited in Chapelle, Schölkopf & Zien 2006, p. 3.
^ Burkhart, Michael C.; Shan, Kyle (2020). "Deep Low-Density Separation for Semi-supervised Classification". International Conference on Computational Science (ICCS). Lecture Notes in Computer Science. Vol. 12139. pp. 297–311. arXiv:2205.11995. doi:10.1007/978-3-030-50420-5_22. ISBN 978-3-030-50419-9.
^ Triguero, Isaac; García, Salvador; Herrera, Francisco (2013-11-26). "Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study". Knowledge and Information Systems. 42 (2): 245–284. doi:10.1007/s10115-013-0706-y. ISSN 0219-1377. S2CID 1955810.
^ Didaci, Luca; Fumera, Giorgio; Roli, Fabio (2012-11-07). "Analysis of Co-training Algorithm with Very Small Training Sets". In Gimel'farb, Georgy; Hancock, Edwin; Imiya, Atsushi; Kuijper, Arjan; Kudo, Mineichi; Omachi, Shinichiro; Windeatt, Terry; Yamada, Keiji (eds.). Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 719–726. doi:10.1007/978-3-642-34166-3_79. ISBN 9783642341656. S2CID 46063225.
^ Zhu, Xiaojin; Goldberg, Andrew B. (2009). Introduction to Semi-Supervised Learning. San Rafael, Calif.: Morgan & Claypool Publishers. ISBN 978-1-59829-548-1. OCLC 428541480.
^ Younger, B. A.; Fearing, D. D. (1999). "Parsing Items into Separate Categories: Developmental Change in Infant Categorization". Child Development. 70 (2): 291–303. doi:10.1111/1467-8624.00022.
Chapelle, Olivier; Schölkopf, Bernhard; Zien, Alexander (2006). Semi-supervised learning. Cambridge, Mass.: MIT Press. ISBN978-0-262-03358-9.
External links
Manifold Regularization – A freely available MATLAB implementation of the graph-based semi-supervised algorithms Laplacian support vector machines and Laplacian regularized least squares.