The softmax function takes as input a vector z of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval $(0, 1)$ and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.
Formally, the standard (unit) softmax function $\sigma\colon \mathbb{R}^K \to (0,1)^K$, where $K > 1$, takes a vector $\mathbf{z} = (z_1, \dotsc, z_K) \in \mathbb{R}^K$ and computes each component of vector $\sigma(\mathbf{z}) \in (0,1)^K$ with

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.$$
In words, the softmax applies the standard exponential function to each element of the input vector (consisting of real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of $(1, 2, 8)$ is approximately $(0.001, 0.002, 0.997)$, which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).
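The definition above is short enough to state directly in code. The following is a minimal sketch (not a library implementation; the helper name `softmax` is ours), reproducing the example with maximal element 8:

```python
import math

def softmax(z):
    """Standard softmax: exponentiate each component, then normalize
    by the sum of the exponentials."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 8.0])
# Components are strictly positive, sum to 1, and almost all of the
# weight lands on the position of the maximal input (8).
```

Note how the exponential amplifies the gap: the inputs 1 and 8 differ by a factor of 8, but the corresponding outputs differ by roughly a factor of a thousand.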
In general, instead of e a different base b > 0 can be used. As above, if b > 1 then larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if 0 < b < 1 then smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Writing $b = e^{\beta}$ or $b = e^{-\beta}$[a] (for real β)[b] yields the expressions:[c]

$$\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^{K} e^{\beta z_j}} \quad \text{or} \quad \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^{K} e^{-\beta z_j}}.$$
A value proportional to the reciprocal of β is sometimes referred to as the temperature: $\beta = 1/(kT)$, where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.
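The effect of the temperature can be seen directly. The sketch below (the helper `softmax_t`, which takes the temperature T rather than β, is our own naming) contrasts a low and a high temperature on the same scores:

```python
import math

def softmax_t(z, T):
    """Softmax with temperature T, i.e. beta = 1/T."""
    exps = [math.exp(v / T) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]
sharp = softmax_t(scores, T=0.1)        # low T: one value dominates
uniformish = softmax_t(scores, T=100.0) # high T: nearly uniform
```

At T = 0.1 virtually all probability concentrates on the largest score, while at T = 100 the three probabilities are within a fraction of a percent of each other.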
In some fields, the base is fixed, corresponding to a fixed scale,[d] while in others the parameter β (or T) is varied.
The softmax function is a smooth approximation to the arg max function: the function whose value is the index of a vector's largest element. The name "softmax" may be misleading: softmax is not a smooth maximum (that is, a smooth approximation to the maximum function). The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning.[3][4] This section uses the term "softargmax" for clarity.
Formally, instead of considering the arg max as a function with categorical output (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg):

$$\operatorname{arg\,max}(z_1, \dotsc, z_K) = (y_1, \dotsc, y_K) = (0, \dotsc, 0, 1, 0, \dotsc, 0),$$

where the output coordinate $y_i = 1$ if and only if $i$ is the arg max of $(z_1, \dotsc, z_K)$, meaning $z_i$ is the unique maximum value of $(z_1, \dotsc, z_K)$. For example, in this encoding a vector whose third component is the strict maximum is mapped to $(0, 0, 1)$.
This can be generalized to multiple arg max values (multiple equal $z_i$ being the maximum) by dividing the 1 between all max args; formally 1/k where k is the number of arguments assuming the maximum. For example, a vector whose second and third arguments are tied for the maximum is mapped to $(0, 1/2, 1/2)$. In case all arguments are equal, this is simply $(1/K, \dotsc, 1/K)$. Points z with multiple arg max values are singular points (or singularities, and form the singular set) – these are the points where arg max is discontinuous (with a jump discontinuity) – while points with a single arg max are known as non-singular or regular points.
With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as $\beta \to \infty$, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input z, $\sigma_\beta(\mathbf{z}) \to \operatorname{arg\,max}(\mathbf{z})$ as $\beta \to \infty$. However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output. For example, when two coordinates are exactly equal, softargmax assigns each a weight of 1/2 for every β, while arbitrarily close inputs with a unique maximum approach the corresponding one-hot vector only in the limit: the closer the points are to the singular set, the slower they converge. However, softargmax does converge compactly on the non-singular set.
Conversely, as $\beta \to -\infty$, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".
It is also the case that, for any fixed β, if one input is much larger than the others relative to the temperature, $T = 1/\beta$, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

$$\sigma_{\beta=1}(0, 10) = \left(\frac{1}{1 + e^{10}}, \frac{e^{10}}{1 + e^{10}}\right) \approx (0.00005, 0.99995).$$
However, if the difference is small relative to the temperature, the value is not close to the arg max. For example, a difference of 10 is small relative to a temperature of 100:

$$\sigma_{\beta=1/100}(0, 10) = \left(\frac{1}{1 + e^{1/10}}, \frac{e^{1/10}}{1 + e^{1/10}}\right) \approx (0.475, 0.525).$$
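Both computations can be checked numerically. A minimal sketch (the helper `softmax_t`, taking the temperature directly, is our own naming):

```python
import math

def softmax_t(z, T):
    """Softmax with temperature T, i.e. beta = 1/T."""
    exps = [math.exp(v / T) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# A difference of 10 is large relative to temperature 1: nearly one-hot.
large_rel = softmax_t([0.0, 10.0], T=1.0)

# The same difference is small relative to temperature 100: nearly uniform.
small_rel = softmax_t([0.0, 10.0], T=100.0)
```

The same pair of inputs thus lands close to arg max or close to uniform depending only on the temperature scale.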
As $\beta \to \infty$, the temperature goes to zero, $T = 1/\beta \to 0$, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.
This can be seen as the composition of K linear functions $\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_k$ and the softmax function (where $\mathbf{x}^\mathsf{T}\mathbf{w}_k$ denotes the inner product of $\mathbf{x}$ and $\mathbf{w}_k$). The operation is equivalent to applying a linear operator defined by the $\mathbf{w}_k$ to vectors $\mathbf{x}$, thus transforming the original, possibly high-dimensional, input to vectors in a K-dimensional space $\mathbb{R}^K$.
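The composition described above can be sketched in a few lines. The weight vectors and input below are made-up example values, not a trained model:

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def linear_softmax_layer(x, W):
    """Compose K inner products x . w_k with the softmax function.
    W is a list of K weight vectors (hypothetical example values)."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in W]
    return softmax(scores)

x = [0.5, -1.0, 2.0]                   # input vector
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # K = 3 weight vectors (identity)
probs = linear_softmax_layer(x, W)
```

With the identity weights used here, the layer reduces to plain softmax of the input, which makes the composition easy to verify.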
Neural networks
The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
Since the function maps a vector and a specific index $i$ to a real value, the derivative needs to take the index into account:

$$\frac{\partial}{\partial z_j}\sigma(\mathbf{z})_i = \sigma(\mathbf{z})_i\,\bigl(\delta_{ij} - \sigma(\mathbf{z})_j\bigr).$$

This expression is symmetrical in the indexes $i, j$ and thus may also be expressed as

$$\frac{\partial}{\partial z_j}\sigma(\mathbf{z})_i = \sigma(\mathbf{z})_j\,\bigl(\delta_{ij} - \sigma(\mathbf{z})_i\bigr).$$

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
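The derivative formula can be validated against a finite-difference approximation. A minimal sketch (helper names are ours):

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = d softmax(z)_i / d z_j = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    n = len(z)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

z = [0.5, -1.0, 2.0]
J = softmax_jacobian(z)

# Central-difference check of one off-diagonal entry.
eps = 1e-6
z_plus = z[:]; z_plus[1] += eps
z_minus = z[:]; z_minus[1] -= eps
numeric = (softmax(z_plus)[0] - softmax(z_minus)[0]) / (2 * eps)
```

The off-diagonal entries are $-\sigma_i \sigma_j$, which also makes the symmetry in the two indexes immediate.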
To ensure stable numerical computations, subtracting the maximum value from the input vector is common. While not altering the output or the derivative theoretically, this approach enhances stability by directly controlling the largest exponent that is computed.
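The max-subtraction trick can be sketched as follows; naive exponentiation of inputs around 1000 would overflow a double, while the shifted version computes the same distribution:

```python
import math

def softmax_stable(z):
    """Subtract the maximum before exponentiating. By translation
    invariance the output is unchanged mathematically, but the largest
    exponent is now 0, so nothing overflows."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_stable([1000.0, 1001.0, 1002.0])
```

After the shift, the inputs become (-2, -1, 0), so the result agrees with the softmax of that small vector.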
If the function is scaled with the parameter $\beta$, then these expressions must be multiplied by $\beta$.
See multinomial logit for a probability model which uses the softmax activation function.
Reinforcement learning
In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[8]

$$P_t(a) = \frac{\exp\bigl(q_t(a)/\tau\bigr)}{\sum_{i=1}^{n} \exp\bigl(q_t(i)/\tau\bigr)},$$
where the action value $q_t(a)$ corresponds to the expected reward of following action a, and $\tau$ is called a temperature parameter (in allusion to statistical mechanics). For high temperatures ($\tau \to \infty$), all actions have nearly the same probability, and the lower the temperature, the more expected rewards affect the probability. For a low temperature ($\tau \to 0^{+}$), the probability of the action with the highest expected reward tends to 1.
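This action-selection rule (often called Boltzmann exploration) can be sketched as follows; the action values below are made-up examples:

```python
import math
import random

def softmax_policy(q_values, tau):
    """Convert action values into action probabilities; tau is the
    temperature parameter."""
    exps = [math.exp(q / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 2.0, 0.5]                     # hypothetical expected rewards
greedy = softmax_policy(q, tau=0.01)    # low tau: best action dominates
explore = softmax_policy(q, tau=100.0)  # high tau: near-uniform

# Sampling an action from the resulting distribution:
action = random.choices(range(len(q)), weights=softmax_policy(q, 1.0))[0]
```

Varying τ thus interpolates between greedy exploitation (low τ) and uniform exploration (high τ) without ever assigning any action exactly zero probability.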
Computational complexity and remedies
In neural network applications, the number K of possible outcomes is often large, e.g. in case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words.[9] This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the inputs to the softmax, followed by the application of the softmax function itself) computationally expensive.[9][10] What's more, the gradient descent backpropagation method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.[9][10]
Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax.[9] The hierarchical softmax (introduced by Morin and Bengio in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables.[10][11] The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.[10] Ideally, when the tree is balanced, this would reduce the computational complexity from $O(K)$ to $O(\log_2 K)$.[11] In practice, results depend on choosing a good strategy for clustering the outcomes into classes.[10][11] A Huffman tree was used for this in Google's word2vec models (introduced in 2013) to achieve scalability.[9]
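The path-product idea can be illustrated with a toy sketch. This is not Morin and Bengio's or word2vec's actual parameterization; the node scores below are made-up values, and each inner node simply makes a sigmoid-weighted binary decision:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hierarchical_prob(path_bits, node_scores):
    """Probability of one leaf as the product of binary decisions along
    its root-to-leaf path. path_bits[d] is the branch taken at depth d;
    node_scores[d] is a (hypothetical) score for that inner node."""
    p = 1.0
    for bit, s in zip(path_bits, node_scores):
        p *= sigmoid(s) if bit == 1 else 1.0 - sigmoid(s)
    return p

# Depth-2 balanced tree over K = 4 leaves: each leaf probability costs
# O(log2 K) sigmoid evaluations instead of an O(K) normalization sum.
scores = [0.3, -1.2, 0.7]  # root, left child, right child (made up)
leaves = [
    hierarchical_prob([0, 0], [scores[0], scores[1]]),
    hierarchical_prob([0, 1], [scores[0], scores[1]]),
    hierarchical_prob([1, 0], [scores[0], scores[2]]),
    hierarchical_prob([1, 1], [scores[0], scores[2]]),
]
```

Because each inner node's two branch probabilities sum to 1, the leaf probabilities form a valid distribution without ever computing a full normalization over all K outcomes.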
A second kind of remedy is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor.[9] These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).[9][10]
Mathematical properties
Geometrically the softmax function maps the vector space $\mathbb{R}^K$ to the interior of the standard $(K-1)$-simplex, cutting the dimension by one (the range is a $(K-1)$-dimensional simplex in $K$-dimensional space), due to the linear constraint that all outputs sum to 1, meaning the range lies on a hyperplane.
Along the main diagonal $(x, x, \dotsc, x)$, softmax is just the uniform distribution on outputs, $(1/K, \dotsc, 1/K)$: equal scores yield equal probabilities.
More generally, softmax is invariant under translation by the same value in each coordinate: adding $c$ to the inputs $\mathbf{z}$ yields $\sigma(\mathbf{z} + c\mathbf{1}) = \sigma(\mathbf{z})$, because it multiplies each exponent by the same factor, $e^c$ (because $e^{z_i + c} = e^{z_i} \cdot e^c$), so the ratios do not change:

$$\sigma(\mathbf{z} + c\mathbf{1})_i = \frac{e^{z_i + c}}{\sum_{j=1}^{K} e^{z_j + c}} = \frac{e^{z_i} \cdot e^c}{\sum_{j=1}^{K} e^{z_j} \cdot e^c} = \sigma(\mathbf{z})_i.$$
Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: $\mathbf{z} \mapsto \mathbf{z} - \bar{z}\mathbf{1}$ where $\bar{z} = \tfrac{1}{K}\sum_i z_i$), and then the softmax takes the hyperplane of points that sum to zero, $\sum_i z_i = 0$, to the open simplex of positive values that sum to 1, analogously to how the exponential takes 0 to 1 ($e^0 = 1$) and is positive.
By contrast, softmax is not invariant under scaling: multiplying all inputs by the same positive factor changes the output (it is equivalent to changing β). For instance, $\sigma(0, 1) = \left(\frac{1}{1+e}, \frac{e}{1+e}\right)$, but $\sigma(0, 2) = \left(\frac{1}{1+e^2}, \frac{e^2}{1+e^2}\right)$.
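Both properties, invariance under translation and sensitivity to scaling, can be checked directly; a small sketch (the `softmax` helper is our own minimal implementation):

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

base = softmax([1.0, 2.0, 3.0])
shifted = softmax([101.0, 102.0, 103.0])  # same offsets: same output
scaled = softmax([2.0, 4.0, 6.0])         # doubled inputs: sharper output
```

Shifting every input by 100 leaves the distribution unchanged, while doubling every input concentrates more probability on the maximal position.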
The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x, y) plane. One variable is fixed at 0 (say $z_2 = 0$), so $e^{z_2} = 1$, and the other variable can vary, denote it $z_1 = x$, so $\sigma(x, 0)_1 = \frac{e^x}{e^x + 1} = \frac{1}{1 + e^{-x}}$ is the standard logistic function, and $\sigma(x, 0)_2 = \frac{1}{1 + e^x}$ is its complement (meaning they add up to 1). The 1-dimensional input could alternatively be expressed as the line $(x/2, -x/2)$, with outputs $\frac{1}{1 + e^{-x}}$ and $\frac{1}{1 + e^{x}}$.
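The reduction to the logistic function is easy to confirm numerically; a minimal sketch:

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(x):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))

x = 1.7
pair = softmax([x, 0.0])  # two-class softmax with one input pinned to 0
```

The first component of `pair` equals `logistic(x)` and the second is its complement, so binary softmax classification and logistic regression coincide.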
In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, Bridle (1990a):[14]: 1 and Bridle (1990b):[3]
We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (e.g. pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (e.g. weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity.[15]: 227
For any input, the outputs must all be positive and they must sum to unity. ...
Given a set of unconstrained values, , we can ensure both conditions by using a Normalised Exponential transformation:
This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as softmax.[16]: 213
Example
With an input of (1, 2, 3, 4, 1, 2, 3), the softmax is approximately (0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175). The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: a change of temperature changes the output. When the temperature is multiplied by 10, the inputs are effectively (0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3) and the softmax is approximately (0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153). This shows that high temperatures de-emphasize the maximum value.
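The two distributions above can be reproduced with a few lines of Python (using a minimal `softmax` helper of our own, not a library routine):

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1, 2, 3, 4, 1, 2, 3]
out = softmax(z)
# Multiplying the temperature by 10 is equivalent to dividing the
# inputs by 10, which flattens the distribution.
out_hot = softmax([v / 10 for v in z])
```

The first output concentrates almost half the weight on the position of the 4, while the high-temperature output is nearly uniform.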
The softmax function generates probability predictions densely distributed over its support. Other functions like sparsemax or α-entmax can be used when sparse probability predictions are desired.[17] Also the Gumbel-softmax reparametrization trick can be used when sampling from a discrete distribution needs to be mimicked in a differentiable manner.
Exponential tilting – a generalization of Softmax to more general probability distributions
Notes
^Positive β corresponds to the maximum convention, and is usual in machine learning, corresponding to the highest score having highest probability. The negative −β corresponds to the minimum convention, and is conventional in thermodynamics, corresponding to the lowest energy state having the highest probability; this matches the convention in the Gibbs distribution, interpreting β as coldness.
^ Goodfellow, Bengio & Courville 2016, pp. 183–184: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is $\operatorname{softmax}(\mathbf{z})^\mathsf{T}\mathbf{z}$. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.
^Boltzmann, Ludwig (1868). "Studien über das Gleichgewicht der lebendigen Kraft zwischen bewegten materiellen Punkten" [Studies on the balance of living force between moving material points]. Wiener Berichte. 58: 517–560.
^ a b Gao, Bolin; Pavel, Lacra (2017). "On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning". arXiv:1704.00805 [math.OC].
^Bridle, John S. (1990a). Soulié F.F.; Hérault J. (eds.). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. Neurocomputing: Algorithms, Architectures and Applications (1989). NATO ASI Series (Series F: Computer and Systems Sciences). Vol. 68. Berlin, Heidelberg: Springer. pp. 227–236. doi:10.1007/978-3-642-76153-9_28.