Sparse PCA

Sparse principal component analysis (SPCA or sparse PCA) is a technique used in statistical analysis and, in particular, in the analysis of multivariate data sets. It extends the classic method of principal component analysis (PCA) for the reduction of dimensionality of data by introducing sparsity structures to the input variables.

A particular disadvantage of ordinary PCA is that the principal components are usually linear combinations of all input variables. SPCA overcomes this disadvantage by finding components that are linear combinations of just a few input variables (SPCs). This means that some of the coefficients of the linear combinations defining the SPCs, called loadings,[note 1] are equal to zero. The number of nonzero loadings is called the cardinality of the SPC.

Mathematical formulation

Consider a data matrix, , where each of the columns represent an input variable, and each of the rows represents an independent sample from data population. One assumes each column of has mean zero, otherwise one can subtract column-wise mean from each element of . Let be the empirical covariance matrix of , which has dimension .

Given an integer with , the sparse PCA problem can be formulated as maximizing the variance along a direction represented by vector while constraining its cardinality:

Eq. 1

The first constraint specifies that v is a unit vector. In the second constraint, represents the pseudo-norm of v, which is defined as the number of its non-zero components. So the second constraint specifies that the number of non-zero components in v is less than or equal to k, which is typically an integer that is much smaller than dimension p. The optimal value of Eq. 1 is known as the k-sparse largest eigenvalue.

If one takes k=p, the problem reduces to the ordinary PCA, and the optimal value becomes the largest eigenvalue of covariance matrix Σ.

After finding the optimal solution v, one deflates Σ to obtain a new matrix

and iterate this process to obtain further principal components. However, unlike PCA, sparse PCA cannot guarantee that different principal components are orthogonal. In order to achieve orthogonality, additional constraints must be enforced.

The following equivalent definition is in matrix form. Let be a p×p symmetric matrix, one can rewrite the sparse PCA problem as

Eq. 2

Tr is the matrix trace, and represents the non-zero elements in matrix V. The last line specifies that V has matrix rank one and is positive semidefinite. The last line means that one has , so Eq. 2 is equivalent to Eq. 1.

Moreover, the rank constraint in this formulation is actually redundant, and therefore sparse PCA can be cast as the following mixed-integer semidefinite program[1]

Eq. 3

Because of the cardinality constraint, the maximization problem is hard to solve exactly, especially when dimension p is high. In fact, the sparse PCA problem in Eq. 1 is NP-hard in the strong sense.[2]

Computational considerations

As most sparse problems, variable selection in SPCA is a computationally intractable nonconvex NP-hard problem,[3] therefore greedy sub-optimal algorithms are often employed to find solutions.

Algorithms for SPCA

Several alternative approaches (of Eq. 1) have been proposed, including

  • a regression framework,[4]
  • a penalized matrix decomposition framework,[5]
  • a convex relaxation/semidefinite programming framework,[6]
  • a generalized power method framework[7]
  • an alternating maximization framework[8]
  • forward-backward greedy search and exact methods using branch-and-bound techniques,[9]
  • a certifiably optimal branch-and-bound approach[10]
  • Bayesian formulation framework.[11]
  • A certifiably optimal mixed-integer semidefinite branch-and-cut approach [1]

The methodological and theoretical developments of Sparse PCA as well as its applications in scientific studies are recently reviewed in a survey paper.[12]

Notes on Semidefinite Programming Relaxation

It has been proposed that sparse PCA can be approximated by semidefinite programming (SDP).[6] If one drops the rank constraint and relaxes the cardinality constraint by a 1-norm convex constraint, one gets a semidefinite programming relaxation, which can be solved efficiently in polynomial time:

Eq. 3

In the second constraint, is a p×1 vector of ones, and |V| is the matrix whose elements are the absolute values of the elements of V.

The optimal solution to the relaxed problem Eq. 3 is not guaranteed to have rank one. In that case, can be truncated to retain only the dominant eigenvector.

While the semidefinite program does not scale beyond n=300 covariates, it has been shown that a second-order cone relaxation of the semidefinite relaxation is almost as tight and successfully solves problems with n=1000s of covariates [13]

Applications

Financial Data Analysis

Suppose ordinary PCA is applied to a dataset where each input variable represents a different asset, it may generate principal components that are weighted combination of all the assets. In contrast, sparse PCA would produce principal components that are weighted combination of only a few input assets, so one can easily interpret its meaning. Furthermore, if one uses a trading strategy based on these principal components, fewer assets imply less transaction costs.

Biology

Consider a dataset where each input variable corresponds to a specific gene. Sparse PCA can produce a principal component that involves only a few genes, so researchers can focus on these specific genes for further analysis.

High-dimensional Hypothesis Testing

Contemporary datasets often have the number of input variables () comparable with or even much larger than the number of samples (). It has been shown that if does not converge to zero, the classical PCA is not consistent. In other words, if we let in Eq. 1, then the optimal value does not converge to the largest eigenvalue of data population when the sample size , and the optimal solution does not converge to the direction of maximum variance. But sparse PCA can retain consistency even if

The k-sparse largest eigenvalue (the optimal value of Eq. 1) can be used to discriminate an isometric model, where every direction has the same variance, from a spiked covariance model in high-dimensional setting.[14] Consider a hypothesis test where the null hypothesis specifies that data are generated from a multivariate normal distribution with mean 0 and covariance equal to an identity matrix, and the alternative hypothesis specifies that data is generated from a spiked model with signal strength :

where has only k non-zero coordinates. The largest k-sparse eigenvalue can discriminate the two hypotheses if and only if .

Since computing k-sparse eigenvalue is NP-hard, one can approximate it by the optimal value of semidefinite programming relaxation (Eq. 3). If that case, we can discriminate the two hypotheses if . The additional term cannot be improved by any other polynomial time algorithm if the planted clique conjecture holds.

Software/source code

  • amanpg - R package for Sparse PCA using the Alternating Manifold Proximal Gradient Method [15]
  • elasticnet – R package for Sparse Estimation and Sparse PCA using Elastic-Nets [16]
  • epca – R package for exploratory principal component analysis for large-scale dataset, including sparse principal component analysis and sparse matrix approximation. [17]
  • nsprcomp - R package for sparse and/or non-negative PCA based on thresholded power iterations[18]
  • scikit-learn – Python library for machine learning which contains Sparse PCA and other techniques in the decomposition module.[19]


Notes

  1. ^ The term loadings is inappropriately used for these vectors, which should be called coefficients, instead. The naming comes from the term loadings used in factor analysis to design the values generating the common covariance matrix. Since in standard PCA the loadings are equal to the coefficients, the term loadings has been used for the coefficients. This is quite unfortunate, because in SPCA the coefficients are not equal to the loadings.


References

  1. ^ a b Dimitris Bertsimas; Ryan Cory-Wright; Jean Pauphilet (2020). "Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality". arXiv:2005.05195 [math.OC].
  2. ^ Andreas M. Tillmann; Marc E. Pfetsch (2013). "The Computational Complexity of the Restricted Isometry Property, the Nullspace Property, and Related Concepts in Compressed Sensing". IEEE Transactions on Information Theory. 60 (2): 1248–1259. arXiv:1205.2081. CiteSeerX 10.1.1.760.2559. doi:10.1109/TIT.2013.2290112. S2CID 2788088.
  3. ^ Moghaddam, Baback; Weiss, Yair; Avidan, Shai (2007-09-07). Schölkopf, Bernhard; Platt, John; Hofmann, Thomas (eds.). Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference. The MIT Press. pp. 915–922. doi:10.7551/mitpress/7503.001.0001. ISBN 978-0-262-25691-9.
  4. ^ Hui Zou; Trevor Hastie; Robert Tibshirani (2006). "Sparse principal component analysis" (PDF). Journal of Computational and Graphical Statistics. 15 (2): 262–286. CiteSeerX 10.1.1.62.580. doi:10.1198/106186006x113430. S2CID 5730904.
  5. ^ Fan Chen; Karl Rohe (2021). "A New Basis for Sparse Principal Component Analysis". Journal of Computational and Graphical Statistics. 33 (2): 421–434. arXiv:2007.00596. doi:10.1080/10618600.2023.2256502.
  6. ^ a b Alexandre d’Aspremont; Laurent El Ghaoui; Michael I. Jordan; Gert R. G. Lanckriet (2007). "A Direct Formulation for Sparse PCA Using Semidefinite Programming" (PDF). SIAM Review. 49 (3): 434–448. arXiv:cs/0406021. doi:10.1137/050645506. S2CID 5490061.
  7. ^ Michel Journee; Yurii Nesterov; Peter Richtarik; Rodolphe Sepulchre (2010). "Generalized Power Method for Sparse Principal Component Analysis" (PDF). Journal of Machine Learning Research. 11: 517–553. arXiv:0811.4724. Bibcode:2008arXiv0811.4724J. CORE Discussion Paper 2008/70.
  8. ^ Peter Richtarik; Majid Jahani; S. Damla Ahipasaoglu; Martin Takac (2021). "Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes". Optimization and Engineering. 22 (3): 1493–1519. arXiv:1212.4137. doi:10.1007/s11081-020-09562-3. S2CID 2549610.
  9. ^ Baback Moghaddam; Yair Weiss; Shai Avidan (2005). "Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms" (PDF). Advances in Neural Information Processing Systems. Vol. 18. MIT Press.
  10. ^ Lauren Berk; Dimitris Bertsimas (2019). "Certifiably optimal sparse principal component analysis". Mathematical Programming Computation. 11 (3). Springer: 381–420. doi:10.1007/s12532-018-0153-6. hdl:1721.1/131566. S2CID 126998398.
  11. ^ Yue Guan; Jennifer Dy (2009). "Sparse Probabilistic Principal Component Analysis" (PDF). Journal of Machine Learning Research Workshop and Conference Proceedings. 5: 185.
  12. ^ Hui Zou; Lingzhou Xue (2018). "A Selective Overview of Sparse Principal Component Analysis". Proceedings of the IEEE. 106 (8): 1311–1320. doi:10.1109/jproc.2018.2846588.
  13. ^ Dimitris Bertsimas; Ryan Cory-Wright (2020). "On polyhedral and second-order cone decompositions of semidefinite optimization problems". Operations Research Letters. 48 (1). Elsevier: 78–85. arXiv:1910.03143. doi:10.1016/j.orl.2019.12.003.
  14. ^ Quentin Berthet; Philippe Rigollet (2013). "Optimal Detection of Sparse Principal Components in High Dimension". Annals of Statistics. 41 (1): 1780–1815. arXiv:1202.5070. doi:10.1214/13-aos1127. S2CID 7162068.
  15. ^ [1] https://cran.r-project.org/web/packages/amanpg/index.html
  16. ^ [2] https://cran.r-project.org/web/packages/elasticnet/index.html
  17. ^ [3] https://cran.r-project.org/web/packages/epca/index.html
  18. ^ [4] https://cran.r-project.org/web/packages/nsprcomp/index.html
  19. ^ [5] http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html

Read other articles:

Pindad MV2 Purwarupa MV2 Jenis SUV, Kendaraan utilitas militer ringan Negara asal Indonesia Sejarah produksi Perancang Pindad Produsen Pindad Diproduksi 2021 – sekarang Spesifikasi Berat 2.630 kilogram (5.800 pon) (kosong)3.200 kilogram (7.100 pon) (penuh) Panjang 5.220 mm (205,5 in) Lebar 2.010 mm (79,1 in) Tinggi 1.860 mm (73,2 in) Jenis Mesin 136 tenaga kuda (101 kW) Transmisi Manual (6 maju dan 1 mundur) Suspensi Independent double...

 

 

ComtéNegara asalPrancisWilayahJuraSumber susuSapiDipasteurisasiTidakTeksturKerasKadar lemak45%Waktu pematangan3 bulanSertifikasiAOC: 1958[1] Comté adalah keju dari daerah Jura di Prancis yang menggunakan susu sapi mentah.[1] Keju Comté ini merupakan keluarga dari keju Gruyère yang dibuat di Prancis.[2] Susu yang digunakan dalam pembuatan keju Comté hanyalah susu yang berasal dari sapi jenis Montbéliard dan Pie-Rouge.[1] Susu tersebut merupakan susu yang d...

 

 

Area Eagle's Nest adalah arena dalam ruangan yang terletak di kampus Universitas Negeri California, Los Angeles. Ini menjadi tuan rumah bagi tim basket dan bola voli untuk Golden Eagles, panjangnya 94 kaki (29 m) oleh Templat:Convert Lebar, dan dapat menangani dua lapangan basket dan tiga lapangan voli. Tempat duduk 3.200 dengan kapasitas penuh, menjadi tuan rumah kompetisi judo untuk Olimpiade Musim Panas 1984. Itu juga menjadi tuan rumah playoff JBA perdana untuk putaran hingga kejuara...

Aerotermodinamika bekerja pada pesawat Aerotermodinamika adalah ilmu yang mempelajari mengenai aliran gas yang di dalamnya berlangsung proses pertukaran energi panas.[1] Aktivitas ini dapat menimbulkan beberapa efek pada alirannya.[1] Ada beberapa tahap dalam mempelajari aerotermodinamika.[1] Pada tahap pertama dipelajari gas terutama udara dalam kondisi atmosfer standar di permukaan air laut. Lalu dipelajari aliran udara pada suhu dan kecepatan yang tinggi.[1]...

 

 

American baseball player (born 1957) Baseball player Don WelchelPitcherBorn: (1957-02-03) February 3, 1957 (age 67)Atlanta, TexasBatted: RightThrew: RightMLB debutSeptember 15, 1982, for the Baltimore OriolesLast MLB appearanceMay 31, 1983, for the Baltimore OriolesMLB statisticsWin–loss record1–2Earned run average5.81Strikeouts19 Teams Baltimore Orioles (1982–1983) Donald Ray Welchel (born February 3, 1957) is a former right-handed Major League Baseba...

 

 

Election in Colorado Main article: 1900 United States presidential election 1900 United States presidential election in Colorado ← 1896 November 6, 1900 1904 →   Nominee William Jennings Bryan William McKinley Party Democratic Republican Home state Nebraska Ohio Running mate Adlai Stevenson I Theodore Roosevelt Electoral vote 4 0 Popular vote 122,733 93,072 Percentage 55.43% 42.04% County Results Bryan   40-50%   50-60%  ...

Pour les articles homonymes, voir Saint-Michel. Saint-Michel-les-Portes Le village de Saint-Michel-les-Portes Administration Pays France Région Auvergne-Rhône-Alpes Département Isère Arrondissement Grenoble Intercommunalité Communauté de communes du Trièves Maire Mandat Joël Zoppé 2020-2026 Code postal 38650 Code commune 38429 Démographie Populationmunicipale 286 hab. (2021 ) Densité 14 hab./km2 Géographie Coordonnées 44° 52′ 10″ nord, 5° 35�...

 

 

Strada statale 452della ContessaLocalizzazioneStato Italia Regioni Umbria Marche Province Perugia Pesaro e Urbino DatiClassificazioneStrada statale InizioGubbio Fineex SS 3 presso Pontericcioli Lunghezza12,075[1] km Provvedimento di istituzioneD.M. 22/05/1964 - G.U. 187 del 31/07/1964[2] GestoreANAS: 1964-2001 Provincia di Perugia e Provincia di Pesaro e Urbino: 2001-2018 ANAS: dal 2018 Manuale La strada statale 452 della Contessa (SS 452), già strada...

 

 

جزء من سلسلة مقالات حولالإسلام حسب البلد الإسلام في إفريقيا أنغولا بنين بوتسوانا بوركينا فاسو بوروندي الكاميرون الرأس الأخضر أفريقيا الوسطى نشاد الجزائر جزر القمر الكونغو الديمقراطية الكونغو ساحل العاج جيبوتي مصر غينيا الاستوائية إريتريا إثيوبيا الغابون غامبيا غانا غي...

Financial services company SEI Investments CompanySEI headquarters in Oaks, PACompany typePublicTraded asNasdaq: SEICS&P 400 ComponentIndustryFinancial servicesFounded1968 FounderAlfred P. West, Jr.HeadquartersOaks, Pennsylvania, United StatesKey peopleRyan Hicke (CEO) Alfred P. West, Jr. (Executive Chairman)AUM$378.2 billion (Q3 2022)OwnerAlfred P. West, Jr. (14.2%)Number of employees4,700 (2022)Websitewww.seic.com SEI Investments Company, formerly Simulated Environments Inc, i...

 

 

Частина серії проФілософіяLeft to right: Plato, Kant, Nietzsche, Buddha, Confucius, AverroesПлатонКантНіцшеБуддаКонфуційАверроес Філософи Епістемологи Естетики Етики Логіки Метафізики Соціально-політичні філософи Традиції Аналітична Арістотелівська Африканська Близькосхідна іранська Буддій�...

 

 

Alessandro Schöpf Informasi pribadiTanggal lahir 7 Februari 1994 (umur 30)Tempat lahir Umhausen, AustriaTinggi 1,78 m (5 ft 10 in)Posisi bermain GelandangInformasi klubKlub saat ini Schalke 04Nomor 21Karier junior1999– SV Längenfeld0000–2009 AKA Tirol2009–2012 Bayern MünchenKarier senior*Tahun Tim Tampil (Gol)2012–2014 Bayern München II 63 (22)2014–2016 1. FC Nürnberg 51 (11)2016– Schalke 04 11 (3)Tim nasional‡2013– Austria U-21 9 (0)2016– Austria 3...

Any specific format for medications, specific to a dose and route Dosage forms (also called unit doses) are pharmaceutical drug products in the form in which they are marketed for use, with a specific mixture of active ingredients and inactive components (excipients), in a particular configuration (such as a capsule shell, for example), and apportioned into a particular dose. For example, two products may both be amoxicillin, but one is in 500 mg capsules and another is in 250 mg ch...

 

 

この記事は検証可能な参考文献や出典が全く示されていないか、不十分です。出典を追加して記事の信頼性向上にご協力ください。(このテンプレートの使い方)出典検索?: 全日本大学バスケットボール連盟 – ニュース · 書籍 · スカラー · CiNii · J-STAGE · NDL · dlib.jp · ジャパンサーチ · TWL(2014年10月) 一般財団法人全日本大�...

 

 

Maintaining one's gaze on a single location This article relies excessively on references to primary sources. Please improve this article by adding secondary or tertiary sources. Find sources: Fixation visual – news · newspapers · books · scholar · JSTOR (December 2018) (Learn how and when to remove this message) Microsaccades and Ocular Drifts Fixation or visual fixation is the maintaining of the gaze on a single location. An animal can exhibit v...

Piège de cristal Logo original du film Données clés Titre québécois Piège de crystal Titre original Die Hard Réalisation John McTiernan Scénario Steven E. de SouzaJeb Stuart Musique Michael Kamen Acteurs principaux Bruce WillisAlan RickmanAlexander GodunovBonnie BedeliaReginald VelJohnson Sociétés de production Gordon CompanySilver PicturesTwentieth Century Fox Pays de production États-Unis Genre Action Durée 132 minutes Sortie 1988 Série Die Hard 58 Minutes pour vivre(1990...

 

 

HacıqaibMunisipalitasHacıqaibKoordinat: 41°25′28″N 48°39′28″E / 41.42444°N 48.65778°E / 41.42444; 48.65778Negara AzerbaijanRayonQubaPopulasi[butuh rujukan] • Total1.310Zona waktuUTC+4 (AZT) • Musim panas (DST)UTC+5 (AZT) Hacıqaib (juga Hacıqayıb, Gadzhigaib, Gadzhigaibkyshlakh, Gadzhigaibkyshlakh, dan Gadzhygaib) adalah sebuah desa dan munisipalitas di Rayon Quba, Azerbaijan. Desa ini memiliki penduduk berjumlah 1,3...

 

 

Campo Calabro Nước Ý Vùng Calabria Tỉnh tỉnh Reggio Calabria (RC) Thị trưởng Độ cao m Diện tích 7,5 km² Dân số  - Tổng số (Tháng 12 năm 2004) 4193  - Mật độ 562/km² Múi giờ CET, UTC+1 Tọa độ 38°13′B 15°40′Đ / 38,217°B 15,667°Đ / 38.217; 15.667 Danh xưng Mã điện thoại 0965 Mã bưu điện 89052 Vị trí của Campo Calabro tại Ý Website: www.comune.campocalabro.rc.it/ Campo Calabro là một đ...

Исполнение мотета. Моте́т (фр. motet, лат. motetus, motellus, от старофранц. mot — слово) — вокальное многоголосное произведение полифонического склада, один из центральных жанров в музыке западноевропейского Средневековья и Возрождения. Содержание 1 Общая характеристика...

 

 

チノ・モレノChino Moreno 2009年基本情報出生名 Camillo Wong Moreno生誕 (1973-06-20) 1973年6月20日(51歳)出身地 アメリカ合衆国カリフォルニア州サクラメントジャンル オルタナティブ・メタルエクスペリメンタル・ロックニューメタル、シューゲイザートリップ・ホップ、ドリーム・ポップ職業 音楽家、作曲家担当楽器 ボーカル、ギター、キーボード活動期間 1988年 ~ 現在レー...