Solving multiple machine learning tasks at the same time
Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately.[1][2][3]
Inherently, multi-task learning is a multi-objective optimization problem with trade-offs between different tasks.[4]
Early versions of MTL were called "hints".[5][6]
In a widely cited 1997 paper, Rich Caruana gave the following characterization:
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.[3]
In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam filter, which can be treated as a set of distinct but related classification tasks across different users. To make this concrete, consider that different people have different distributions of features that distinguish spam emails from legitimate ones. For example, an English speaker may find that all emails in Russian are spam, whereas a Russian speaker would not. Yet there is a definite commonality in this classification task across users; for example, one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance.[citation needed] Further examples of settings for MTL include multiclass classification and multi-label classification.[7]
Multi-task learning works because regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is when the tasks share significant commonalities and are each slightly undersampled.[8] However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.[8][9]
Methods
The key challenge in multi-task learning is how to combine learning signals from multiple tasks into a single model. This may depend strongly on how well the different tasks agree with, or contradict, each other. There are several ways to address this challenge:
Task grouping and overlap
Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases.[10] Task relatedness can be imposed a priori or learned from the data.[7][11] Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly.[8][12] For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains.[8]
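The idea that overlapping nonzero coefficients indicate commonality can be illustrated with a small sketch. The coefficient vectors, the task names, and the use of a Jaccard ratio as the overlap measure are all assumptions made for illustration; the text above does not prescribe a specific similarity score.

```python
def support(coeffs, tol=1e-12):
    """Return the set of basis indices whose coefficient is nonzero."""
    return {i for i, c in enumerate(coeffs) if abs(c) > tol}


def support_overlap(a, b):
    """Jaccard overlap between the supports of two coefficient vectors."""
    sa, sb = support(a), support(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical sparse coefficients of three tasks over a 5-element basis.
task_coeffs = {
    "task_a": [0.9, 0.0, 0.4, 0.0, 0.0],
    "task_b": [0.5, 0.0, 0.7, 0.0, 0.0],
    "task_c": [0.0, 0.3, 0.0, 0.0, 0.8],
}

# task_a and task_b use the same basis elements; task_c uses disjoint ones.
overlap_ab = support_overlap(task_coeffs["task_a"], task_coeffs["task_b"])
overlap_ac = support_overlap(task_coeffs["task_a"], task_coeffs["task_c"])
```

Under this toy measure, tasks with high overlap would fall into the same group, while tasks with no shared basis elements would be placed in different groups.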
Exploiting unrelated tasks
One can attempt to learn a group of principal tasks using a group of auxiliary tasks that are unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods that build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups that encourages the two representations to be orthogonal. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.[9]
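As a minimal sketch of such an orthogonality penalty, assume each task group's representation is given by a matrix whose columns span its low-dimensional subspace, and measure the coupling between two groups by the squared Frobenius norm of U^T V. This particular form, and the matrices below, are assumptions for illustration rather than the exact formulation of the cited work.

```python
def orthogonality_penalty(U, V):
    """Squared Frobenius norm of U^T V for column-basis matrices U and V.

    U and V are lists of rows with the same number of rows (input dimension).
    The value is zero exactly when every column of U is orthogonal to every
    column of V.
    """
    d = len(U)
    total = 0.0
    for p in range(len(U[0])):
        for q in range(len(V[0])):
            inner = sum(U[r][p] * V[r][q] for r in range(d))
            total += inner * inner
    return total


# Hypothetical representation of the principal task group (spans e1, e2).
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]

# Two candidate representations for the auxiliary (unrelated) group.
V_orthogonal = [[0.0], [0.0], [1.0]]   # spans e3 only
V_overlapping = [[1.0], [0.0], [1.0]]  # shares the e1 direction with U

penalty_orthogonal = orthogonality_penalty(U, V_orthogonal)
penalty_overlapping = orthogonality_penalty(U, V_overlapping)
```

Adding such a term to the training objective would push the auxiliary group's representation away from directions already used by the principal group.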
Transfer of knowledge
Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large-scale machine learning projects such as the deep convolutional neural network GoogLeNet,[13] an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Alternatively, the pre-trained model can be used to initialize a model with a similar architecture, which is then fine-tuned to learn a different classification task.[14]
Multiple non-stationary tasks
Traditionally, multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed group online adaptive learning (GOAL).[15] Sharing information could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from the previous experience of another learner to adapt quickly to its new environment. Such group-adaptive learning has numerous applications, from predicting financial time series, through content recommendation systems, to visual understanding for adaptive autonomous agents.
Multi-task optimization
In some cases, the simultaneous training of seemingly related tasks may hinder performance compared to single-task models.[16] Commonly, MTL models employ task-specific modules on top of a joint feature representation obtained using a shared module. Since this joint representation must capture useful features across all tasks, MTL may hinder individual task performance if the different tasks seek conflicting representations, i.e., the gradients of different tasks point in opposing directions or differ significantly in magnitude. This phenomenon is commonly referred to as negative transfer. To mitigate this issue, various MTL optimization methods have been proposed. Commonly, the per-task gradients are combined into a joint update direction through various aggregation algorithms or heuristics.
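The following sketch illustrates one possible aggregation heuristic. It detects a conflict through a negative inner product and, when one is found, removes from each task gradient its component along the other before summing. The text above does not name a specific method; this projection rule (similar in spirit to published "gradient surgery" approaches) and the gradient values are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; negative values indicate conflicting gradients."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def project_out_conflict(g, other):
    """If g conflicts with other, remove the component of g along other."""
    d = dot(g, other)
    if d >= 0:
        return list(g)  # no conflict: leave the gradient unchanged
    scale = d / dot(other, other)
    return [a - scale * b for a, b in zip(g, other)]


def aggregate(g1, g2):
    """Joint update direction: sum of the two de-conflicted gradients."""
    p1 = project_out_conflict(g1, g2)
    p2 = project_out_conflict(g2, g1)
    return [a + b for a, b in zip(p1, p2)]


# Hypothetical gradients of two task losses w.r.t. shared parameters.
g_task1 = [1.0, 0.0]
g_task2 = [-1.0, 1.0]

conflict = cosine(g_task1, g_task2)   # negative: the tasks disagree
joint = aggregate(g_task1, g_task2)   # direction that opposes neither task
```

In this example the joint direction has a non-negative inner product with both task gradients, so a small step along its negative does not increase either loss to first order.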
Mathematics
Reproducing kernel Hilbert space of vector-valued functions (RKHSvv)
The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015.[7]
RKHSvv concepts
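Suppose the training data set is $\mathcal{S}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, with $x_i^t \in \mathcal{X}$, $y_i^t \in \mathcal{Y}$, where $t$ indexes the task, and $t \in \{1, \dots, T\}$. Let $n = \sum_{t=1}^{T} n_t$. In this setting there is a consistent input and output space and the same loss function $\mathcal{L}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ for each task. This results in the regularized machine learning problem:
$$\min_{f \in \mathcal{H}} \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}(y_i^t, f_t(x_i^t)) + \lambda \|f\|_{\mathcal{H}}^2 \tag{1}$$
where $\mathcal{H}$ is a vector-valued reproducing kernel Hilbert space with functions $f: \mathcal{X} \to \mathcal{Y}^T$ having components $f_t: \mathcal{X} \to \mathcal{Y}$.
The reproducing kernel for the space $\mathcal{H}$ of functions $f: \mathcal{X} \to \mathbb{R}^T$ is a symmetric matrix-valued function $\Gamma: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T}$, such that $\Gamma(\cdot, x)c \in \mathcal{H}$ and the following reproducing property holds:
$$\langle f(x), c \rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)c \rangle_{\mathcal{H}} \tag{2}$$
The reproducing kernel gives rise to a representer theorem showing that any solution to equation 1 has the form:
$$f(x) = \sum_{t=1}^{T} \sum_{i=1}^{n_t} \Gamma(x, x_i^t)\, c_i^t \tag{3}$$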
Separable kernels
The form of the kernel Γ both induces the representation of the feature space and structures the output across tasks. A natural simplification is to choose a separable kernel, which factors into separate kernels on the input space $\mathcal{X}$ and on the tasks $\{1, \dots, T\}$. In this case the kernel relating scalar components $f_t$ and $f_s$ is given by $\gamma((x_i, t), (x_j, s)) = k(x_i, x_j)\, k_T(s, t) = k(x_i, x_j)\, A_{s,t}$. For vector-valued functions $f \in \mathcal{H}$ we can write $\Gamma(x_i, x_j) = k(x_i, x_j)\, A$, where k is a scalar reproducing kernel, and A is a symmetric positive semi-definite $T \times T$ matrix. Henceforth denote by $S_+^T$ the set of symmetric positive semi-definite matrices in $\mathbb{R}^{T \times T}$.
This factorization property, separability, implies the input feature space representation does not vary by task. That is, there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by A. Methods for non-separable kernels Γ are a current field of research.
For the separable case, the representation theorem is reduced to $f(x) = \sum_{i=1}^{n} k(x, x_i)\, A c_i$. The model output on the training data is then KCA, where K is the $n \times n$ empirical kernel matrix with entries $K_{i,j} = k(x_i, x_j)$, and C is the $n \times T$ matrix of rows $c_i$.
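A minimal numerical sketch of the separable case follows. It assumes a Gaussian scalar kernel k, a 2x2 task matrix A, and arbitrary coefficient rows c_i; all of these values are hypothetical. It evaluates f(x) as the sum over i of k(x, x_i) A c_i and checks that stacking the predictions on the training inputs reproduces the matrix product KCA.

```python
import math


def gaussian_kernel(x, z, bandwidth=1.0):
    """Scalar kernel k(x, z); the Gaussian form is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def predict(x, X_train, C, A, kernel):
    """f(x) = sum_i k(x, x_i) A c_i, returned as a list of T task outputs."""
    T = len(A)
    out = [0.0] * T
    for x_i, c_i in zip(X_train, C):
        k_val = kernel(x, x_i)
        for t in range(T):
            out[t] += k_val * sum(A[t][s] * c_i[s] for s in range(T))
    return out


X_train = [[0.0], [1.0], [2.0]]               # n = 3 inputs
C = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]     # n x T coefficient rows c_i
A = [[1.0, 0.5], [0.5, 1.0]]                  # symmetric PSD task matrix

K = [[gaussian_kernel(a, b) for b in X_train] for a in X_train]
train_outputs = matmul(matmul(K, C), A)       # the n x T matrix KCA
pointwise = [predict(x, X_train, C, A, gaussian_kernel) for x in X_train]
```

Row i of `train_outputs` holds the T task predictions at training input x_i, which is why the two computations agree when A is symmetric.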
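With the separable kernel, equation 1 can be rewritten as
$$\min_{C \in \mathbb{R}^{n \times T}} V(Y, KCA) + \lambda\, \operatorname{tr}(KCAC^\top) \tag{P}$$
where V is a (weighted) average of $\mathcal{L}$ applied entry-wise to Y and KCA. (The weight is zero if $Y_i^t$ is a missing observation.)
Note the second term in P can be derived as follows:
$$\begin{aligned}
\|f\|_{\mathcal{H}}^2 &= \Big\langle \sum_{i=1}^{n} k(\cdot, x_i) A c_i,\ \sum_{j=1}^{n} k(\cdot, x_j) A c_j \Big\rangle_{\mathcal{H}} \\
&= \sum_{i,j=1}^{n} \langle k(\cdot, x_i) A c_i,\ k(\cdot, x_j) A c_j \rangle_{\mathcal{H}} && \text{(bilinearity)} \\
&= \sum_{i,j=1}^{n} \langle k(x_i, x_j) A c_i,\ c_j \rangle_{\mathbb{R}^T} && \text{(reproducing property)} \\
&= \sum_{i,j=1}^{n} k(x_i, x_j)\, c_i^\top A c_j = \operatorname{tr}(KCAC^\top)
\end{aligned}$$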
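As a quick sanity check, the sketch below compares the trace form of the penalty, tr(KCAC^T), with the double sum over i and j of K_ij c_i^T A c_j. It uses small hand-picked matrices K, C and A, which are hypothetical values chosen only so that the arithmetic is easy to follow.

```python
def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(P):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*P)]


def trace(P):
    """Sum of the diagonal entries of a square matrix."""
    return sum(P[i][i] for i in range(len(P)))


K = [[2.0, 1.0], [1.0, 2.0]]    # symmetric kernel matrix (n = 2)
C = [[1.0, 2.0], [3.0, -1.0]]   # rows are the coefficient vectors c_i
A = [[1.0, 0.5], [0.5, 1.0]]    # symmetric task matrix (T = 2)

n, T = len(C), len(A)

# Trace form: tr(K C A C^T).
penalty_trace = trace(matmul(matmul(matmul(K, C), A), transpose(C)))

# Double-sum form: sum_{i,j} K_ij * c_i^T A c_j.
penalty_sum = sum(
    K[i][j] * sum(C[i][s] * A[s][t] * C[j][t]
                  for s in range(T) for t in range(T))
    for i in range(n) for j in range(n)
)
```

Both expressions evaluate the same quadratic form in C, so they agree up to floating-point rounding.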
Known task structure
Task structure representations
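There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, and through an output mapping.
Regularizer - With the separable kernel, it can be shown (below) that $\|f\|_{\mathcal{H}}^2 = \sum_{s,t=1}^{T} A^\dagger_{t,s} \langle f_s, f_t \rangle_{\mathcal{H}_k}$, where $A^\dagger_{t,s}$ is the $(t, s)$ element of the pseudoinverse of $A$, $\mathcal{H}_k$ is the RKHS based on the scalar kernel $k$, and $f_t(x) = \sum_{i=1}^{n} k(x, x_i)\, A_t^\top c_i$. This formulation shows that $A^\dagger_{t,s}$ controls the weight of the penalty associated with $\langle f_s, f_t \rangle_{\mathcal{H}_k}$. (Note that $\langle f_s, f_t \rangle_{\mathcal{H}_k}$ arises from $\|f_t\|_{\mathcal{H}_k}^2 = \langle f_t, f_t \rangle_{\mathcal{H}_k}$.)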
Proof
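A short derivation sketch of the regularizer identity is given here. It assumes the notation of the separable-kernel setting: $K$ is the empirical kernel matrix, $c_i$ are the rows of $C$, $A$ is symmetric positive semi-definite with pseudoinverse $A^\dagger$, $A_t$ denotes the $t$-th column of $A$, and $f_t = \sum_{i} k(\cdot, x_i)\, A_t^\top c_i$. The only facts used are the trace form of the norm, the pseudoinverse property $A = A A^\dagger A$, and the symmetry of $A$.

```latex
\begin{align*}
\|f\|_{\mathcal{H}}^2
  &= \operatorname{tr}(KCAC^\top)
   = \sum_{i,j=1}^{n} K_{i,j}\, c_i^\top A\, c_j \\
  &= \sum_{i,j=1}^{n} K_{i,j}\, c_i^\top A A^\dagger A\, c_j
   && (A = A A^\dagger A) \\
  &= \sum_{s,t=1}^{T} A^\dagger_{s,t}
     \sum_{i,j=1}^{n} K_{i,j}\, (A_s^\top c_i)(A_t^\top c_j)
   && (A \text{ symmetric}) \\
  &= \sum_{s,t=1}^{T} A^\dagger_{s,t}\, \langle f_s, f_t \rangle_{\mathcal{H}_k}
   && \Big( f_t = \sum_{i=1}^{n} k(\cdot, x_i)\, A_t^\top c_i \Big)
\end{align*}
```

The last step uses the scalar reproducing property, $\langle k(\cdot, x_i), k(\cdot, x_j) \rangle_{\mathcal{H}_k} = K_{i,j}$.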
Output metric - An alternative output metric on $\mathcal{Y}^T$ can be induced by the inner product $\langle y_1, y_2 \rangle_\Theta = \langle y_1, \Theta y_2 \rangle_{\mathbb{R}^T}$. With the squared loss there is an equivalence between the separable kernels $k(\cdot, \cdot) I_T$ under the alternative metric, and $k(\cdot, \cdot) \Theta$, under the canonical metric.
Output mapping - Outputs can be mapped as $L: \mathcal{Y}^T \to \tilde{\mathcal{Y}}$ to a higher dimensional space to encode complex structures such as trees, graphs and strings. For linear maps L, with appropriate choice of separable kernel, it can be shown that $A = L^\top L$.
Task structure examples
Via the regularizer formulation, one can represent a variety of task structures easily.
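Letting $A^\dagger$ be a weighted combination of $I_T$ and $\frac{1}{T}\mathbf{1}\mathbf{1}^\top$ (where $I_T$ is the TxT identity matrix, and $\mathbf{1}\mathbf{1}^\top$ is the TxT matrix of ones) is equivalent to letting a parameter $\gamma$ control the variance $\sum_t \|f_t - \bar{f}\|_{\mathcal{H}_k}^2$ of tasks from their mean $\bar{f} = \frac{1}{T} \sum_t f_t$. For example, blood levels of some biomarker may be taken on T patients at $n_t$ time points during the course of a day and interest may lie in regularizing the variance of the predictions across patients.
Letting $A^\dagger$ be a weighted combination of $I_T$ and $M$, where $M_{t,s} = \frac{1}{|G_r|} \mathbb{I}(t, s \in G_r)$, is equivalent to letting a parameter $\alpha$ control the variance measured with respect to a group mean: $\sum_r \sum_{t \in G_r} \|f_t - \frac{1}{|G_r|} \sum_{s \in G_r} f_s\|_{\mathcal{H}_k}^2$. (Here $|G_r|$ is the cardinality of group r, and $\mathbb{I}$ is the indicator function.) For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.
Letting $A^\dagger$ be a weighted combination of $I_T$ and $L$, where $L = D - M$ is the Laplacian for the graph with adjacency matrix M giving pairwise similarities of tasks and D is the diagonal matrix of its row sums. This is equivalent to giving a larger penalty to the distance separating tasks t and s when they are more similar (according to the weight $M_{t,s}$), i.e. a parameter $\delta$ regularizes $\sum_{t,s} \|f_t - f_s\|_{\mathcal{H}_k}^2 M_{t,s}$.
All of the above choices of A also induce the additional regularization term $\lambda \sum_t \|f_t\|_{\mathcal{H}_k}^2$, which penalizes complexity in f more broadly.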
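The graph-based choice can be made concrete with a small sketch. To keep it runnable, each task function is replaced by a finite parameter vector w_t with the ordinary dot product standing in for the RKHS inner product; this simplification, the similarity matrix M, and the vectors are assumptions for illustration. The check is the standard identity that the Laplacian quadratic form equals half the similarity-weighted sum of squared pairwise distances.

```python
def laplacian(M):
    """Graph Laplacian L = D - M for a symmetric similarity matrix M."""
    T = len(M)
    return [[(sum(M[t]) if t == s else 0.0) - M[t][s] for s in range(T)]
            for t in range(T)]


def dot(u, v):
    """Euclidean inner product, standing in for the RKHS inner product."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical pairwise task similarities (symmetric, zero diagonal).
M = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]

# Hypothetical parameter vectors, one per task.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 1.0]]

L = laplacian(M)
T = len(M)

# Quadratic form induced by the Laplacian: sum_{t,s} L_ts <w_t, w_s>.
laplacian_form = sum(L[t][s] * dot(W[t], W[s])
                     for t in range(T) for s in range(T))

# Pairwise form: (1/2) sum_{t,s} M_ts ||w_t - w_s||^2.
pairwise_form = 0.5 * sum(
    M[t][s] * sum((a - b) ** 2 for a, b in zip(W[t], W[s]))
    for t in range(T) for s in range(T)
)
```

Tasks joined by a large weight in M contribute more to the penalty when their parameters differ, which is the behavior described for the Laplacian case.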
Learning tasks together with their structure
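Learning problem P can be generalized to admit learning the task matrix A as follows:
$$\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} V(Y, KCA) + \lambda\, \operatorname{tr}(KCAC^\top) + F(A) \tag{Q}$$
The choice of $F: S_+^T \to \mathbb{R}_+$ must be designed to learn matrices A of a given type. See "Special cases" below.
Restricting to the case of convex losses and coercive penalties, Ciliberto et al. have shown that although Q is not convex jointly in C and A, a related problem is jointly convex.
Specifically, on the convex set $\mathcal{C} = \{(C, A) \in \mathbb{R}^{n \times T} \times S_+^T \mid \operatorname{Range}(C^\top K C) \subseteq \operatorname{Range}(A)\}$, the equivalent problem
$$\min_{(C, A) \in \mathcal{C}} V(Y, KC) + \lambda\, \operatorname{tr}(A^\dagger C^\top K C) + F(A) \tag{R}$$
is convex with the same minimum value. And if $(C_R, A_R)$ is a minimizer for R then $(C_R A_R^\dagger, A_R)$ is a minimizer for Q.
R may be solved by a barrier method on a closed set by introducing the following perturbation:
$$\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} V(Y, KC) + \lambda\, \operatorname{tr}\big(A^\dagger (C^\top K C + \delta^2 I_T)\big) + F(A) \tag{S}$$
The perturbation via the barrier $\delta^2 \operatorname{tr}(A^\dagger)$ forces the objective functions to be equal to $+\infty$ on the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.
S can be solved with a block coordinate descent method, alternating in C and A. This results in a sequence of minimizers $(C_m, A_m)$ in S that converges to the solution in R as $\delta_m \to 0$, and hence gives the solution to Q.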
Special cases
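Spectral penalties - Dinuzzo et al.[17] suggested setting F as the Frobenius norm $\sqrt{\operatorname{tr}(A^\top A)}$. They optimized Q directly using block coordinate descent, not accounting for difficulties at the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.
Clustered tasks learning - Jacob et al.[18] suggested learning A in the setting where T tasks are organized in R disjoint clusters. In this case let $E \in \{0, 1\}^{T \times R}$ be the matrix with $E_{t,r} = \mathbb{I}(\text{task } t \in \text{group } r)$. Setting $M = E (E^\top E)^{-1} E^\top$, and $U = \frac{1}{T} \mathbf{1}\mathbf{1}^\top$, the task matrix $A^\dagger$ can be parameterized as a function of $M$: $A^\dagger(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon_W (I - M)$, with terms that penalize the average, between-cluster variance and within-cluster variance, respectively, of the task predictions. The set of admissible M is not convex, but there is a convex relaxation $\mathcal{S}_c = \{M \in S_+^T : I - M \in S_+^T,\ \operatorname{tr}(M) = R\}$. In this formulation, $F(A)$ is the indicator of the constraint that $A^\dagger = A^\dagger(M)$ for some $M \in \mathcal{S}_c$.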
Generalizations
Non-convex penalties - Penalties can be constructed such that A is constrained to be a graph Laplacian, or that A has a low-rank factorization. However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.
Non-separable kernels - Separable kernels are limited; in particular, they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.
Software package
A Matlab package called Multi-Task Learning via StructurAl Regularization (MALSAR)[19] implements the following multi-task learning algorithms: Mean-Regularized Multi-Task Learning,[20][21] Multi-Task Learning with Joint Feature Selection,[22] Robust Multi-Task Feature Learning,[23] Trace-Norm Regularized Multi-Task Learning,[24] Alternating Structural Optimization,[25][26] Incoherent Low-Rank and Sparse Learning,[27] Robust Low-Rank Multi-Task Learning, Clustered Multi-Task Learning,[28][29] Multi-Task Learning with Graph Structures.
Literature
Waegeman, W., Dembczynski, K., & Huellermeier, E. Multi-Target Prediction: A Unifying View on Problems and Methods. https://arxiv.org/abs/1809.02352v1
Baxter, J. (2000). "A model of inductive bias learning". Journal of Artificial Intelligence Research 12: 149–198.
Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems 8, pp. 640–646. MIT Press.
Suddarth, S., & Kergosien, Y. (1990). Rule-injection hints as a means of improving network performance and learning time. EURASIP Workshop. Neural Networks, pp. 120–129. Lecture Notes in Computer Science. Springer.
Ciliberto, C. (2015). "Convex Learning of Multiple Tasks and their Structure". arXiv:1504.03101 [cs.LG].
Hajiramezanali, E., Dadaneh, S. Z., Karbalayghareh, A., Zhou, Z., & Qian, X. Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada. arXiv:1810.09433.
Jawanpuria, P., & Saketha Nath, J. (2012). A Convex Feature Learning Formulation for Latent Task Structure Discovery. http://icml.cc/2012/papers/90.pdf
Zweig, A., & Weinshall, D. Hierarchical Regularization Cascade for Joint Learning. Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta GA, June 2013. http://www.cs.huji.ac.il/~daphna/papers/Zweig_ICML2013.pdf
Szegedy, Christian; Liu, Wei; Jia, Yangqing; Sermanet, Pierre; Reed, Scott; Anguelov, Dragomir; Erhan, Dumitru; Vanhoucke, Vincent; Rabinovich, Andrew (2015). "Going deeper with convolutions". 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1–9. arXiv:1409.4842. doi:10.1109/CVPR.2015.7298594. ISBN 978-1-4673-6964-0. S2CID 206592484.
Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117).