T-distributed stochastic neighbor embedding

t-SNE visualisation of word embeddings generated using 19th-century literature
t-SNE embeddings of the MNIST dataset

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is based on Stochastic Neighbor Embedding, originally developed by Geoffrey Hinton and Sam Roweis;[1] Laurens van der Maaten and Hinton later proposed the t-distributed variant.[2] It is a nonlinear dimensionality reduction technique: it models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar objects are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm bases its similarity metric on the Euclidean distance between objects, other distance measures can be substituted as appropriate. A Riemannian variant is UMAP.

t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research,[3] natural language processing, music analysis,[4] cancer research,[5] bioinformatics,[6] geological domain interpretation,[7][8][9] and biomedical signal processing.[10]

For a data set with $n$ elements, t-SNE runs in $O(n^2)$ time and requires $O(n^2)$ space.[11]
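To give a feel for the quadratic space requirement, the following sketch estimates the memory needed to hold one dense n-by-n matrix of pairwise similarities. The function name `dense_matrix_gib` and the choice of 8-byte floating-point entries are assumptions made for illustration; they are not taken from the cited analysis.

```python
def dense_matrix_gib(n, bytes_per_entry=8):
    """Memory, in GiB, of a dense n-by-n matrix (8-byte floats assumed)."""
    return n * n * bytes_per_entry / 2**30

# Print the footprint for a few illustrative data set sizes.
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: about {dense_matrix_gib(n):10.3f} GiB")
```

Multiplying the number of points by 10 multiplies the footprint by 100, which is one reason approximate variants are used for large data sets.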

Details

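Given a set of $N$ high-dimensional objects $\mathbf{x}_1, \dots, \mathbf{x}_N$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $\mathbf{x}_i$ and $\mathbf{x}_j$, as follows.

For $i \neq j$, define

$$p_{j\mid i} = \frac{\exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_k \rVert^2 / 2\sigma_i^2\right)}$$

and set $p_{i\mid i} = 0$. Note that the denominator above ensures $\sum_j p_{j\mid i} = 1$ for all $i$.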
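The conditional probabilities can be sketched in NumPy as follows. This is a minimal illustration, assuming a Gaussian kernel with one bandwidth per point and the Euclidean distance; the function name `conditional_probabilities`, the toy data and the fixed bandwidths of 1.0 are hypothetical choices, and in t-SNE itself the bandwidths are tuned per point rather than fixed.

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Return the matrix P_cond with P_cond[i, j] = p_{j|i}."""
    # Squared Euclidean distances ||x_i - x_j||^2 for all pairs.
    diff = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    # Gaussian affinities; row i uses its own bandwidth sigma_i.
    affinities = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    # Enforce p_{i|i} = 0 before normalising.
    np.fill_diagonal(affinities, 0.0)
    # Divide each row by its sum so that sum_j p_{j|i} = 1.
    return affinities / affinities.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # 6 toy points in 4 dimensions
sigmas = np.full(6, 1.0)         # fixed bandwidths, chosen arbitrarily
P_cond = conditional_probabilities(X, sigmas)
print(P_cond.sum(axis=1))        # row sums of the conditional matrix
```

Note that the resulting matrix is generally not symmetric, because each row is normalised with its own bandwidth and its own denominator.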

As van der Maaten and Hinton explained: "The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j\mid i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$."[2]

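Now define

$$p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}$$

This is motivated as follows: the marginals $p_i$ and $p_j$ are each estimated from the $N$ samples as $1/N$, so the conditional probabilities can be written as $p_{i\mid j} = N p_{ij}$ and $p_{j\mid i} = N p_{ji}$. Since $p_{ij} = p_{ji}$, adding these two expressions and solving for $p_{ij}$ yields the formula above.

Also note that $p_{ii} = 0$ and $\sum_{i,j} p_{ij} = 1$.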
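The symmetrisation step is small enough to check by hand. The sketch below assumes the joint probability is the sum of the two conditional probabilities divided by 2N; the function name `joint_probabilities` and the hand-made 3-by-3 conditional matrix are illustrative assumptions.

```python
import numpy as np

def joint_probabilities(P_cond):
    """Symmetrise conditionals: p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

# Hand-made conditional matrix for N = 3: zero diagonal, each row sums to 1.
P_cond = np.array([
    [0.0, 0.9, 0.1],
    [0.5, 0.0, 0.5],
    [0.2, 0.8, 0.0],
])
P = joint_probabilities(P_cond)
print(P)
print(P.sum())   # total probability mass over all pairs
```

Because every row of the conditional matrix sums to 1, the N rows together carry mass N, the transpose adds another N, and dividing by 2N leaves a joint distribution with total mass 1.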

The bandwidth of the Gaussian kernels $\sigma_i$ is set, using the bisection method, in such a way that the entropy of the conditional distribution equals a predefined entropy. As a result, the bandwidth is adapted to the density of the data: smaller values of $\sigma_i$ are used in denser parts of the data space. The entropy increases with the perplexity of this distribution $P_i$; the relation is

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}$$

where $H(P_i)$ is the Shannon entropy $H(P_i) = -\sum_j p_{j\mid i} \log_2 p_{j\mid i}$.

The perplexity is a hand-chosen parameter of t-SNE; as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50."[2]
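The bandwidth search can be sketched as a plain bisection on the bandwidth of a single point. This is an illustration under stated assumptions, not a reference implementation: the function names `row_perplexity` and `find_sigma`, the search bracket of 1e-3 to 1e3, the tolerance, and the toy data are all hypothetical choices. It relies on the fact that the entropy, and hence the perplexity, grows as the Gaussian kernel gets wider.

```python
import numpy as np

def row_perplexity(sq_dists, sigma):
    """Perplexity 2**H(P_i) of one conditional distribution P_i.

    sq_dists: squared distances from point i to every other point.
    """
    # Shifting by the minimum leaves the normalised probabilities unchanged
    # but keeps at least one exponential equal to 1 for very small sigma.
    p = np.exp(-(sq_dists - sq_dists.min()) / (2.0 * sigma ** 2))
    p /= p.sum()
    nz = p > 0                                   # treat 0 * log(0) as 0
    entropy = -np.sum(p[nz] * np.log2(p[nz]))    # Shannon entropy in bits
    return 2.0 ** entropy

def find_sigma(sq_dists, target_perplexity, lo=1e-3, hi=1e3,
               tol=1e-6, max_iter=200):
    """Bisection on sigma until the perplexity matches the target."""
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        perp = row_perplexity(sq_dists, sigma)
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma      # kernel too wide: shrink the bracket from above
        else:
            lo = sigma      # kernel too narrow: shrink the bracket from below
    return sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                      # toy data
sq_dists = np.sum((X[1:] - X[0]) ** 2, axis=1)    # point 0 to all others
sigma_0 = find_sigma(sq_dists, target_perplexity=10.0)
print(sigma_0, row_perplexity(sq_dists, sigma_0))
```

Running the same search for every point gives one bandwidth per point, which is how the kernel width ends up adapted to the local density.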

Since the Gaussian kernel uses the Euclidean distance $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert$, it is affected by the curse of dimensionality: in high-dimensional data, where distances lose the ability to discriminate, the $p_{ij}$ become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.[12]

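t-SNE aims to learn a $d$-dimensional map $\mathbf{y}_1, \dots, \mathbf{y}_N$ (with $\mathbf{y}_i \in \mathbb{R}^d$ and $d$ typically chosen as 2 or 3) that reflects the similarities $p_{ij}$ as well as possible. To this end, it measures similarities $q_{ij}$ between two points in the map $\mathbf{y}_i$ and $\mathbf{y}_j$, using a very similar approach. Specifically, for $i \neq j$, define $q_{ij}$ as

$$q_{ij} = \frac{(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2)^{-1}}{\sum_k \sum_{l \neq k} (1 + \lVert \mathbf{y}_k - \mathbf{y}_l \rVert^2)^{-1}}$$

and set $q_{ii} = 0$. Here a heavy-tailed Student t-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.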
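The low-dimensional similarities can be sketched in a few lines of NumPy. The function name `low_dim_affinities` and the three example map points are assumptions for illustration. Unlike the high-dimensional conditional probabilities, there is a single normalisation over all pairs rather than one per row.

```python
import numpy as np

def low_dim_affinities(Y):
    """Return (Q, W): probabilities q_ij and unnormalised kernel values."""
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    W = 1.0 / (1.0 + sq_dists)   # Student-t kernel, one degree of freedom
    np.fill_diagonal(W, 0.0)     # q_ii = 0
    Q = W / W.sum()              # one normalisation over all pairs k != l
    return Q, W

# Three map points on a line: two close together, one far away.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
Q, W = low_dim_affinities(Y)
print(Q)
```

In this example the pair at distance 1 has kernel value 1/2, while the pairs at distances 9 and 10 have values 1/82 and 1/101: small, but far larger than a Gaussian would assign at those distances, which is the heavy-tail effect the text describes.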

The locations of the points $\mathbf{y}_i$ in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution $P$ from the distribution $Q$, that is:

$$\mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The minimization of the Kullback–Leibler divergence with respect to the points $\mathbf{y}_i$ is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
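The optimization can be sketched as plain gradient descent on the map coordinates. The gradient expression used here, four times the sum over j of (p_ij − q_ij)(y_i − y_j)(1 + ‖y_i − y_j‖²)⁻¹, is the one given in the van der Maaten and Hinton paper [2] and is not stated in this article. Everything else is an assumption for illustration: the function names, the random symmetric matrix standing in for the high-dimensional similarities, the learning rate, the number of steps and the initial scale. Practical implementations typically add refinements, such as momentum, that this sketch omits.

```python
import numpy as np

def student_t_affinities(Y):
    """Return (Q, W): normalised q_ij and unnormalised kernel values."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl_divergence(P, Y):
    """KL(P || Q) over all pairs with p_ij > 0 (the diagonal is excluded)."""
    Q, _ = student_t_affinities(Y)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def kl_gradient(P, Y):
    """Gradient w.r.t. each y_i: 4 * sum_j (p_ij - q_ij) w_ij (y_i - y_j)."""
    Q, W = student_t_affinities(Y)
    coeff = (P - Q) * W
    # sum_j c_ij (y_i - y_j) = (sum_j c_ij) y_i - sum_j c_ij y_j
    return 4.0 * (coeff.sum(axis=1)[:, None] * Y - coeff @ Y)

rng = np.random.default_rng(0)
N = 10
A = rng.random((N, N))
A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()                            # toy symmetric stand-in for p_ij
Y = rng.normal(scale=1e-2, size=(N, 2))    # small random initial map

learning_rate = 0.1                        # assumed, deliberately conservative
kl_start = kl_divergence(P, Y)
for _ in range(500):
    Y = Y - learning_rate * kl_gradient(P, Y)
kl_end = kl_divergence(P, Y)
print(kl_start, kl_end)
```

Because the objective is non-convex, different initial maps can lead to different final layouts; the sketch only shows that each small step moves the map points so as to reduce the divergence.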

Output

While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity), so a good understanding of the t-SNE parameters is needed. Such "clusters" can appear even in structured data with no clear clustering,[13] and so may be false findings. Similarly, neither the size of the clusters produced by t-SNE nor the distance between them is informative.[14] Interactive exploration may therefore be needed to choose parameters and validate results.[15][16] It has been shown that t-SNE can often recover well-separated clusters and, with special parameter choices, approximates a simple form of spectral clustering.[17]

Software

  • A C++ implementation of Barnes–Hut t-SNE is available on the GitHub account of one of the original authors.
  • The R package Rtsne implements t-SNE in R.
  • ELKI contains t-SNE, also with the Barnes–Hut approximation.
  • scikit-learn, a popular machine learning library in Python, implements t-SNE with both an exact solution and the Barnes–Hut approximation.
  • TensorBoard, the visualization kit associated with TensorFlow, also implements t-SNE (online version).
  • The Julia package TSne implements t-SNE.
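As an illustration of one of the implementations listed above, the following sketch runs scikit-learn's `TSNE` with both the exact and the Barnes–Hut method. The toy data (two Gaussian blobs in 10 dimensions), the perplexity of 20 and the random seed are assumptions chosen only to keep the example small; other settings are left at the library's defaults, which may differ between versions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in 10 dimensions (toy data, 60 points each).
X = np.vstack([
    rng.normal(loc=0.0, size=(60, 10)),
    rng.normal(loc=8.0, size=(60, 10)),
])

embeddings = {}
for method in ("exact", "barnes_hut"):
    tsne = TSNE(n_components=2, perplexity=20.0, method=method, random_state=0)
    embeddings[method] = tsne.fit_transform(X)   # array of shape (120, 2)
    # kl_divergence_ holds the final value of the objective after fitting.
    print(method, embeddings[method].shape, tsne.kl_divergence_)
```

The two methods optimize the same objective, so their final divergences are usually close, but the layouts need not match point for point; rerunning with several perplexity values is a simple way to see how sensitive a plot is to that parameter.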

References

  1. Hinton, Geoffrey; Roweis, Sam (January 2002). Stochastic neighbor embedding (PDF). Neural Information Processing Systems.
  2. van der Maaten, L.J.P.; Hinton, G.E. (Nov 2008). "Visualizing Data Using t-SNE" (PDF). Journal of Machine Learning Research. 9: 2579–2605.
  3. Gashi, I.; Stankovic, V.; Leita, C.; Thonnard, O. (2009). "An Experimental Study of Diversity with Off-the-shelf AntiVirus Engines". Proceedings of the IEEE International Symposium on Network Computing and Applications: 4–11.
  4. Hamel, P.; Eck, D. (2010). "Learning Features from Music Audio with Deep Belief Networks". Proceedings of the International Society for Music Information Retrieval Conference: 339–344.
  5. Jamieson, A.R.; Giger, M.L.; Drukker, K.; Lui, H.; Yuan, Y.; Bhooshan, N. (2010). "Exploring Nonlinear Feature Space Dimension Reduction and Data Representation in Breast CADx with Laplacian Eigenmaps and t-SNE". Medical Physics. 37 (1): 339–351. doi:10.1118/1.3267037. PMC 2807447. PMID 20175497.
  6. Wallach, I.; Liliean, R. (2009). "The Protein-Small-Molecule Database, A Non-Redundant Structural Resource for the Analysis of Protein-Ligand Binding". Bioinformatics. 25 (5): 615–620. doi:10.1093/bioinformatics/btp035. PMID 19153135.
  7. Balamurali, Mehala; Silversides, Katherine L.; Melkumyan, Arman (2019-04-01). "A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data". Computers & Geosciences. 125: 78–89. Bibcode:2019CG....125...78B. doi:10.1016/j.cageo.2019.01.011. ISSN 0098-3004. S2CID 67926902.
  8. Balamurali, Mehala; Melkumyan, Arman (2016). Hirose, Akira; Ozawa, Seiichi; Doya, Kenji; Ikeda, Kazushi; Lee, Minho; Liu, Derong (eds.). "t-SNE Based Visualisation and Clustering of Geological Domain". Neural Information Processing. Lecture Notes in Computer Science. 9950. Cham: Springer International Publishing: 565–572. doi:10.1007/978-3-319-46681-1_67. ISBN 978-3-319-46681-1.
  9. Leung, Raymond; Balamurali, Mehala; Melkumyan, Arman (2021-01-01). "Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering". Mathematical Geosciences. 53 (1): 105–130. Bibcode:2021MaGeo..53..105L. doi:10.1007/s11004-019-09839-z. ISSN 1874-8953. S2CID 208329378.
  10. Birjandtalab, J.; Pouyan, M. B.; Nourani, M. (2016-02-01). "Nonlinear dimension reduction for EEG-based epileptic seizure detection". 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). pp. 595–598. doi:10.1109/BHI.2016.7455968. ISBN 978-1-5090-2455-1. S2CID 8074617.
  11. Pezzotti, Nicola. "Approximated and User Steerable tSNE for Progressive Visual Analytics" (PDF). Retrieved 31 August 2023.
  12. Schubert, Erich; Gertz, Michael (2017-10-04). Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection. SISAP 2017 – 10th International Conference on Similarity Search and Applications. pp. 188–203. doi:10.1007/978-3-319-68474-1_13.
  13. "K-means clustering on the output of t-SNE". Cross Validated. Retrieved 2018-04-16.
  14. Wattenberg, Martin; Viégas, Fernanda; Johnson, Ian (2016-10-13). "How to Use t-SNE Effectively". Distill. 1 (10): e2. doi:10.23915/distill.00002. ISSN 2476-0757.
  15. Pezzotti, Nicola; Lelieveldt, Boudewijn P. F.; Maaten, Laurens van der; Hollt, Thomas; Eisemann, Elmar; Vilanova, Anna (2017-07-01). "Approximated and User Steerable tSNE for Progressive Visual Analytics". IEEE Transactions on Visualization and Computer Graphics. 23 (7): 1739–1752. arXiv:1512.01655. doi:10.1109/tvcg.2016.2570755. ISSN 1077-2626. PMID 28113434. S2CID 353336.
  16. Wattenberg, Martin; Viégas, Fernanda; Johnson, Ian (2016-10-13). "How to Use t-SNE Effectively". Distill. 1 (10). doi:10.23915/distill.00002. Retrieved 4 December 2017.
  17. Linderman, George C.; Steinerberger, Stefan (2017-06-08). "Clustering with t-SNE, provably". arXiv:1706.02582 [cs.LG].
