T-distributed stochastic neighbor embedding

T-SNE visualisation of word embeddings generated using 19th century literature
T-SNE embeddings of MNIST dataset

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is based on Stochastic Neighbor Embedding, originally developed by Geoffrey Hinton and Sam Roweis;[1] Laurens van der Maaten and Hinton later proposed the t-distributed variant.[2] It is a nonlinear dimensionality reduction technique that models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects are modeled by distant points.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar objects are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate. A Riemannian variant is UMAP.

t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research,[3] natural language processing, music analysis,[4] cancer research,[5] bioinformatics,[6] geological domain interpretation,[7][8][9] and biomedical signal processing.[10]

For a data set with $n$ elements, t-SNE runs in $O(n^2)$ time and requires $O(n^2)$ space.[11]

Details

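Given a set of $N$ high-dimensional objects $\mathbf{x}_1, \dots, \mathbf{x}_N$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $\mathbf{x}_i$ and $\mathbf{x}_j$, as follows.

For $i \neq j$, define

$$p_{j\mid i} = \frac{\exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_k \rVert^2 / 2\sigma_i^2\right)}$$

and set $p_{i\mid i} = 0$. Note the above denominator ensures $\sum_j p_{j\mid i} = 1$ for all $i$.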

As van der Maaten and Hinton explained: "The similarity of datapoint $\mathbf{x}_j$ to datapoint $\mathbf{x}_i$ is the conditional probability, $p_{j\mid i}$, that $\mathbf{x}_i$ would pick $\mathbf{x}_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $\mathbf{x}_i$."[2]

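Now define

$$p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}$$

This is motivated because $p_i$ and $p_j$ from the $N$ samples are estimated as $1/N$, so the conditional probabilities can be written as $p_{i\mid j} = N p_{ij}$ and $p_{j\mid i} = N p_{ji}$. Since $p_{ij} = p_{ji}$, averaging these two expressions gives the formula above.

Also note that $p_{ii} = 0$ and $\sum_{i,j} p_{ij} = 1$.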
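The two definitions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names, the toy data, and the fixed bandwidths are assumptions made for the example (in t-SNE the bandwidths are derived from the perplexity).

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def conditional_probabilities(points, sigmas):
    """Return cond[i][j] = p_{j|i}, using one Gaussian bandwidth sigma_i per point."""
    n = len(points)
    cond = []
    for i in range(n):
        # Unnormalized Gaussian weights; the self-similarity p_{i|i} is fixed to 0.
        weights = [
            0.0 if j == i
            else math.exp(-squared_distance(points[i], points[j]) / (2.0 * sigmas[i] ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])
    return cond


def joint_probabilities(cond):
    """Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]


# Toy data (assumed for illustration): three nearby points and one distant point.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
sigmas = [1.0, 1.0, 1.0, 1.0]  # fixed by hand here

cond = conditional_probabilities(points, sigmas)
P = joint_probabilities(cond)
```

Each row of `cond` sums to 1, while `P` is symmetric, has a zero diagonal, and sums to 1 over all pairs, matching the properties noted above.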

The bandwidth of the Gaussian kernels $\sigma_i$ is set in such a way that the entropy of the conditional distribution equals a predefined entropy using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of $\sigma_i$ are used in denser parts of the data space. The entropy increases with the perplexity of this distribution $P_i$; this relation is seen as

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}$$

where $H(P_i)$ is the Shannon entropy $H(P_i) = -\sum_j p_{j\mid i} \log_2 p_{j\mid i}$.

The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50."[2]
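A bisection search for one point's bandwidth might look like the following sketch. The search interval, tolerance, iteration cap, and function names are assumptions chosen for illustration. The shift by the smallest distance is a numerical safeguard that leaves the normalized probabilities unchanged.

```python
import math


def perplexity_of_row(sq_dists, sigma):
    """Perplexity 2**H(P_i), given the squared distances from x_i to every other point."""
    d_min = min(sq_dists)
    # Subtracting d_min avoids underflow of every weight at small sigma;
    # the common factor cancels when the weights are normalized.
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def find_sigma(sq_dists, target_perplexity, lo=1e-3, hi=1e3, tol=1e-6, max_iter=200):
    """Bisection on sigma: perplexity grows with sigma, so halve [lo, hi] around the target."""
    mid = (lo + hi) / 2.0
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        perp = perplexity_of_row(sq_dists, mid)
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = mid  # distribution too flat: shrink the bandwidth
        else:
            lo = mid  # distribution too peaked: widen the bandwidth
    return mid


# Squared distances from one point to its four neighbors (toy values).
sq_dists = [1.0, 4.0, 9.0, 50.0]
sigma = find_sigma(sq_dists, target_perplexity=2.0)
```

With these toy distances the perplexity can range from 1 (only the nearest neighbor matters) to 4 (all neighbors weighted equally), so a target of 2 corresponds to roughly two effective neighbors.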

Since the Gaussian kernel uses the Euclidean distance $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert$, it is affected by the curse of dimensionality, and in high-dimensional data, when distances lose the ability to discriminate, the $p_{ij}$ become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.[12]

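t-SNE aims to learn a $d$-dimensional map $\mathbf{y}_1, \dots, \mathbf{y}_N$ (with $\mathbf{y}_i \in \mathbb{R}^d$ and $d$ typically chosen as 2 or 3) that reflects the similarities $p_{ij}$ as well as possible. To this end, it measures similarities $q_{ij}$ between two points in the map $\mathbf{y}_i$ and $\mathbf{y}_j$, using a very similar approach. Specifically, for $i \neq j$, define $q_{ij}$ as

$$q_{ij} = \frac{\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}}{\sum_k \sum_{l \neq k} \left(1 + \lVert \mathbf{y}_k - \mathbf{y}_l \rVert^2\right)^{-1}}$$

and set $q_{ii} = 0$. Herein a heavy-tailed Student t-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.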
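As a rough sketch of the low-dimensional similarities, the following computes $q_{ij}$ for a small hand-made map. The function name and the example coordinates are assumptions for illustration only.

```python
def low_dim_affinities(Y):
    """Return Q with q_ij from the Student t kernel (one degree of freedom); q_ii = 0."""
    n = len(Y)
    kernel = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k != l:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[k], Y[l]))
                kernel[k][l] = 1.0 / (1.0 + d2)
    # A single normalizer over all ordered pairs (k, l) with k != l.
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


# Toy 2-D map: two close points and one far away.
Y = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
Q = low_dim_affinities(Y)
```

Because the kernel decays only polynomially, the distant pair still receives a small but non-negligible weight (proportional to 1/101 here), whereas a unit-variance Gaussian kernel, which decays as the exponential of the negative squared distance, would make it vanishingly small. This is the heavy-tail property the text refers to.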

The locations of the points $\mathbf{y}_i$ in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution $P$ from the distribution $Q$, that is:

$$\mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The minimization of the Kullback–Leibler divergence with respect to the points $\mathbf{y}_i$ is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
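The optimization can be illustrated with plain gradient descent on a toy problem. The sketch below uses the gradient given in the original paper,[2] $\partial C / \partial \mathbf{y}_i = 4 \sum_j (p_{ij} - q_{ij})(\mathbf{y}_i - \mathbf{y}_j)\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}$. The function names, the hand-written matrix `P`, the starting layout, the learning rate, and the step count are all assumptions for illustration. Refinements used in practice, such as momentum and early exaggeration, are deliberately left out.

```python
import math


def affinities(Y):
    """Return the unnormalized Student t kernel and the normalized q_ij."""
    n = len(Y)
    kernel = [
        [0.0 if k == l else 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(Y[k], Y[l])))
         for l in range(n)]
        for k in range(n)
    ]
    total = sum(map(sum, kernel))
    Q = [[v / total for v in row] for row in kernel]
    return kernel, Q


def kl_divergence(P, Q):
    """KL(P || Q) over all pairs with p_ij > 0 (diagonal entries are zero and skipped)."""
    return sum(
        p * math.log(p / q)
        for row_p, row_q in zip(P, Q)
        for p, q in zip(row_p, row_q)
        if p > 0.0
    )


def gradient_step(P, Y, learning_rate):
    """One plain gradient descent update of every map point."""
    kernel, Q = affinities(Y)
    n, d = len(Y), len(Y[0])
    new_Y = []
    for i in range(n):
        grad = [0.0] * d
        for j in range(n):
            coeff = 4.0 * (P[i][j] - Q[i][j]) * kernel[i][j]
            for c in range(d):
                grad[c] += coeff * (Y[i][c] - Y[j][c])
        new_Y.append([Y[i][c] - learning_rate * grad[c] for c in range(d)])
    return new_Y


# Hand-written joint distribution: points 0-1 and 2-3 form two similar pairs.
P = [
    [0.0, 0.2, 0.025, 0.025],
    [0.2, 0.0, 0.025, 0.025],
    [0.025, 0.025, 0.0, 0.2],
    [0.025, 0.025, 0.2, 0.0],
]
# Deliberately poor starting layout: the similar pairs start far apart.
Y = [[0.0, 0.0], [2.0, 0.1], [0.3, 1.0], [1.8, 1.2]]

kl_before = kl_divergence(P, affinities(Y)[1])
for _ in range(300):
    Y = gradient_step(P, Y, learning_rate=0.2)
kl_after = kl_divergence(P, affinities(Y)[1])
```

Comparing `kl_before` with `kl_after` shows whether the layout has moved toward one that better reflects `P`. A fixed learning rate is only adequate on a toy problem like this one.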

Output

While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity) and so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to even appear in structured data with no clear clustering,[13] and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters.[14] Thus, interactive exploration may be needed to choose parameters and validate results.[15][16] It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering.[17]

Software

  • A C++ implementation of Barnes-Hut t-SNE is available on the GitHub account of one of the original authors.
  • The R package Rtsne implements t-SNE in R.
  • ELKI contains tSNE, also with the Barnes-Hut approximation.
  • scikit-learn, a popular machine learning library in Python, implements t-SNE with both exact solutions and the Barnes-Hut approximation.
  • TensorBoard, the visualization kit associated with TensorFlow, also implements t-SNE (online version).
  • The Julia package TSne implements t-SNE.

References

  1. ^ Hinton, Geoffrey; Roweis, Sam (January 2002). Stochastic neighbor embedding (PDF). Neural Information Processing Systems.
  2. ^ a b c van der Maaten, L.J.P.; Hinton, G.E. (Nov 2008). "Visualizing Data Using t-SNE" (PDF). Journal of Machine Learning Research. 9: 2579–2605.
  3. ^ Gashi, I.; Stankovic, V.; Leita, C.; Thonnard, O. (2009). "An Experimental Study of Diversity with Off-the-shelf AntiVirus Engines". Proceedings of the IEEE International Symposium on Network Computing and Applications: 4–11.
  4. ^ Hamel, P.; Eck, D. (2010). "Learning Features from Music Audio with Deep Belief Networks". Proceedings of the International Society for Music Information Retrieval Conference: 339–344.
  5. ^ Jamieson, A.R.; Giger, M.L.; Drukker, K.; Lui, H.; Yuan, Y.; Bhooshan, N. (2010). "Exploring Nonlinear Feature Space Dimension Reduction and Data Representation in Breast CADx with Laplacian Eigenmaps and t-SNE". Medical Physics. 37 (1): 339–351. doi:10.1118/1.3267037. PMC 2807447. PMID 20175497.
  6. ^ Wallach, I.; Liliean, R. (2009). "The Protein-Small-Molecule Database, A Non-Redundant Structural Resource for the Analysis of Protein-Ligand Binding". Bioinformatics. 25 (5): 615–620. doi:10.1093/bioinformatics/btp035. PMID 19153135.
  7. ^ Balamurali, Mehala; Silversides, Katherine L.; Melkumyan, Arman (2019-04-01). "A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data". Computers & Geosciences. 125: 78–89. Bibcode:2019CG....125...78B. doi:10.1016/j.cageo.2019.01.011. ISSN 0098-3004. S2CID 67926902.
  8. ^ Balamurali, Mehala; Melkumyan, Arman (2016). "t-SNE Based Visualisation and Clustering of Geological Domain". In Hirose, Akira; Ozawa, Seiichi; Doya, Kenji; Ikeda, Kazushi; Lee, Minho; Liu, Derong (eds.). Neural Information Processing. Lecture Notes in Computer Science. Vol. 9950. Cham: Springer International Publishing. pp. 565–572. doi:10.1007/978-3-319-46681-1_67. ISBN 978-3-319-46681-1.
  9. ^ Leung, Raymond; Balamurali, Mehala; Melkumyan, Arman (2021-01-01). "Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering". Mathematical Geosciences. 53 (1): 105–130. Bibcode:2021MatGe..53..105L. doi:10.1007/s11004-019-09839-z. ISSN 1874-8953. S2CID 208329378.
  10. ^ Birjandtalab, J.; Pouyan, M. B.; Nourani, M. (2016-02-01). "Nonlinear dimension reduction for EEG-based epileptic seizure detection". 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). pp. 595–598. doi:10.1109/BHI.2016.7455968. ISBN 978-1-5090-2455-1. S2CID 8074617.
  11. ^ Pezzotti, Nicola (2015). "Approximated and User Steerable tSNE for Progressive Visual Analytics". arXiv:1512.01655 [cs.CV].
  12. ^ Schubert, Erich; Gertz, Michael (2017-10-04). Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection. SISAP 2017 – 10th International Conference on Similarity Search and Applications. pp. 188–203. doi:10.1007/978-3-319-68474-1_13.
  13. ^ "K-means clustering on the output of t-SNE". Cross Validated. Retrieved 2018-04-16.
  14. ^ Wattenberg, Martin; Viégas, Fernanda; Johnson, Ian (2016-10-13). "How to Use t-SNE Effectively". Distill. 1 (10): e2. doi:10.23915/distill.00002. ISSN 2476-0757.
  15. ^ Pezzotti, Nicola; Lelieveldt, Boudewijn P. F.; Maaten, Laurens van der; Hollt, Thomas; Eisemann, Elmar; Vilanova, Anna (2017-07-01). "Approximated and User Steerable tSNE for Progressive Visual Analytics". IEEE Transactions on Visualization and Computer Graphics. 23 (7): 1739–1752. arXiv:1512.01655. doi:10.1109/tvcg.2016.2570755. ISSN 1077-2626. PMID 28113434. S2CID 353336.
  16. ^ Wattenberg, Martin; Viégas, Fernanda; Johnson, Ian (2016-10-13). "How to Use t-SNE Effectively". Distill. 1 (10). doi:10.23915/distill.00002. Retrieved 4 December 2017.
  17. ^ Linderman, George C.; Steinerberger, Stefan (2017-06-08). "Clustering with t-SNE, provably". arXiv:1706.02582 [cs.LG].
