Leakage (machine learning)

In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.[1]

Leakage is often subtle and indirect, making it hard to detect and eliminate. Leakage can cause a statistician or modeler to select a suboptimal model, which could be outperformed by a leakage-free model.[1]

Leakage modes

Leakage can occur at many steps of the machine learning process. Its causes can be classified by source into two groups: features and training examples.[1]

Feature leakage

Feature or column-wise leakage is caused by the inclusion of columns which are one of the following: a duplicate label, a proxy for the label, or the label itself. These features, known as anachronisms, will not be available when the model is used for predictions, and result in leakage if included when the model is trained.[2]

For example, a "MonthlySalary" column leaks the label when predicting "YearlySalary", as does a "MinutesLate" column when predicting "IsLate".
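One rough way to screen for such columns is to measure how strongly each feature on its own tracks the label; a near-perfect association is a cue to check whether the column would really exist at prediction time. The following is a minimal sketch in plain Python. The function names, the 0.99 threshold, and the toy data are assumptions made for illustration and do not come from the cited sources.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def suspicious_features(columns, label, threshold=0.99):
    """Return names of columns whose |correlation| with the label is >= threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, label)) >= threshold]

# Toy data (hypothetical): MonthlySalary is an exact proxy for YearlySalary.
columns = {
    "MonthlySalary": [3000, 4000, 5000, 6500],
    "YearsExperience": [1, 7, 3, 4],
}
yearly_salary = [36000, 48000, 60000, 78000]

flagged = suspicious_features(columns, yearly_salary)
print(flagged)
```

A high correlation does not prove leakage, and a leaky feature can be related to the label non-linearly, so a check like this can only shortlist columns for manual review.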

Training example leakage

Row-wise leakage is caused by improper sharing of information between rows of data. Types of row-wise leakage include:

  • Premature featurization: fitting a transformation (e.g. MinMax scaling or n-gram extraction) on the full dataset before the cross-validation or train/test split. The transformation must be fitted on the training split only and then applied to the test set.
  • Duplicate rows across the train, validation and test sets (e.g. oversampling a dataset to pad its size before splitting; different rotations or augmentations of a single image; bootstrap sampling before splitting; or duplicating rows to upsample the minority class)
  • Non-i.i.d. data
    • Time leakage (e.g. splitting a time-series dataset randomly instead of keeping the newer data in the test set with a chronological train/test split or rolling-origin cross-validation)
    • Group leakage: omitting a grouping column from the split (e.g. Andrew Ng's group had 100k x-rays of 30k patients, meaning ~3 images per patient. The paper used random splitting instead of ensuring that all images of a patient were in the same split. Hence the model partially memorized the patients instead of learning to recognize pneumonia in chest x-rays.[3][4])
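Three of the remedies implied by the list above can be sketched in a few lines: fitting a scaler on the training split only, keeping each group in a single split, and splitting by time. The code below is an illustrative sketch in plain Python, not a reference implementation. All function names, the record layout, and the toy data are hypothetical.

```python
import random

def fit_min_max(train_values):
    """Learn scaling parameters from the training split only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Scale with parameters learned elsewhere; test values may fall outside [0, 1]."""
    span = (hi - lo) or 1.0  # avoid division by zero for a constant column
    return [(v - lo) / span for v in values]

def group_split(rows, group_of, test_fraction=0.25, seed=0):
    """Split rows so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(r) for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if group_of(r) not in test_groups]
    test = [r for r in rows if group_of(r) in test_groups]
    return train, test

def time_split(rows, time_of, test_fraction=0.25):
    """Put the most recent rows in the test set instead of splitting at random."""
    ordered = sorted(rows, key=time_of)
    cut = len(ordered) - max(1, round(len(ordered) * test_fraction))
    return ordered[:cut], ordered[cut:]

# Toy records (hypothetical): (patient_id, day, measurement)
rows = [("p1", 1, 10.0), ("p1", 2, 12.0), ("p2", 3, 30.0), ("p2", 4, 31.0),
        ("p3", 5, 50.0), ("p3", 6, 52.0), ("p4", 7, 70.0), ("p4", 8, 90.0)]

# Group-aware split: no patient appears on both sides.
train, test = group_split(rows, group_of=lambda r: r[0])

# Scaling parameters come from the training rows only, then are reused on test.
lo, hi = fit_min_max([r[2] for r in train])
test_scaled = apply_min_max([r[2] for r in test], lo, hi)

# Chronological split: every test row is newer than every training row.
past, future = time_split(rows, time_of=lambda r: r[1])
```

Libraries such as scikit-learn offer equivalents of these steps (for example `Pipeline`, `GroupKFold` and `TimeSeriesSplit`), which are usually preferable to hand-written helpers in practice.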

A 2023 review found data leakage to be "a widespread failure mode in machine-learning (ML)-based science", having affected at least 294 academic publications across 17 disciplines, and causing a potential reproducibility crisis.[5]

Detection

Data leakage in machine learning can be detected through various methods, focusing on performance analysis, feature examination, data auditing, and model behavior analysis. Performance-wise, unusually high accuracy or significant discrepancies between training and test results often indicate leakage.[6] Inconsistent cross-validation outcomes may also signal issues.

Feature examination involves scrutinizing feature importance rankings and ensuring temporal integrity in time series data. A thorough audit of the data pipeline is crucial, reviewing pre-processing steps, feature engineering, and data splitting processes.[7] Detecting duplicate entries across dataset splits is also important.
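As an illustration of the duplicate check, the sketch below looks for test rows that also occur in the training data after light normalization. It is a minimal example under assumptions: the helper names, the normalization rule (trimming whitespace and lowercasing), and the sample rows are all hypothetical.

```python
def row_key(row):
    """Canonical, hashable form of a row: a tuple of trimmed, lowercased strings."""
    return tuple(str(value).strip().lower() for value in row)

def cross_split_duplicates(train_rows, test_rows):
    """Return the test rows whose canonical form also appears in the training rows."""
    seen = {row_key(r) for r in train_rows}
    return [r for r in test_rows if row_key(r) in seen]

# Hypothetical splits: the first test row repeats a training row up to formatting.
train_rows = [("Alice", 34, "NY"), ("Bob", 51, "LA"), ("Cara", 29, "SF")]
test_rows = [("bob ", 51, "LA"), ("Dan", 40, "TX")]

overlap = cross_split_duplicates(train_rows, test_rows)
print(overlap)
```

Exact matching of this kind will not catch near-duplicates such as rotated or otherwise augmented copies of the same image; those need a similarity measure or a group identifier carried through the split.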

Analyzing model behavior can reveal leakage. Models relying heavily on counter-intuitive features or showing unexpected prediction patterns warrant investigation. Performance degradation over time when tested on new data may suggest earlier inflated metrics due to leakage.

Advanced techniques include backward feature elimination, where suspicious features are temporarily removed to observe performance changes. Using a separate hold-out dataset for final validation before deployment is advisable.[7]
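The idea of temporarily removing a suspicious feature can be sketched as a single-feature ablation: train once with all columns, then once with each column left out, and compare held-out accuracy. The example below is a simplified illustration, not the procedure from the cited sources. The nearest-centroid classifier is a stand-in for whatever model is actually in use, and the function names, column names, and toy data are hypothetical.

```python
def centroid_accuracy(train_X, train_y, test_X, test_y, keep):
    """Fit a nearest-centroid classifier on the columns in `keep`; return test accuracy."""
    def project(row):
        return [row[i] for i in keep]

    # One centroid per class, computed from training rows only.
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        vec = project(row)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        vec = project(row)
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])))

    hits = sum(predict(row) == label for row, label in zip(test_X, test_y))
    return hits / len(test_y)

def ablation_drops(train_X, train_y, test_X, test_y, names):
    """Map each feature name to the accuracy lost when that feature alone is removed."""
    all_cols = list(range(len(names)))
    baseline = centroid_accuracy(train_X, train_y, test_X, test_y, all_cols)
    drops = {}
    for i, name in enumerate(names):
        keep = [c for c in all_cols if c != i]
        drops[name] = baseline - centroid_accuracy(train_X, train_y, test_X, test_y, keep)
    return drops

# Toy data (hypothetical): columns are (DistanceKm, MinutesLate), label is IsLate.
names = ["DistanceKm", "MinutesLate"]
train_X = [(5, 0), (20, 0), (12, 0), (8, 15), (18, 30), (13, 20)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(15, 0), (6, 25), (11, 0), (14, 18)]
test_y = [0, 1, 0, 1]

drops = ablation_drops(train_X, train_y, test_X, test_y, names)
print(drops)
```

On this toy data, removing "MinutesLate" costs far more accuracy than removing "DistanceKm". A large drop only shows that the model leans on the feature; whether that reliance is leakage still depends on whether the feature is available at prediction time.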

References

  1. ^ a b c Shachar Kaufman; Saharon Rosset; Claudia Perlich (January 2011). "Leakage in data mining: Formulation, detection, and avoidance". Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. Vol. 6. pp. 556–563. doi:10.1145/2020408.2020496. ISBN 9781450308137. S2CID 9168804. Retrieved 13 January 2020.
  2. ^ Soumen Chakrabarti (2008). "9". Data Mining: Know it All. Morgan Kaufmann Publishers. p. 383. ISBN 978-0-12-374629-0. Anachronistic variables are a pernicious mining problem. However, they aren't any problem at all at deployment time—unless someone expects the model to work! Anachronistic variables are out of place in time. Specifically, at data modeling time, they carry information back from the future to the past.
  3. ^ Guts, Yuriy (30 October 2018). Target Leakage in Machine Learning (Talk). AI Ukraine Conference. Ukraine – via YouTube.
  4. ^ Nick, Roberts (16 November 2017). "Replying to @AndrewYNg @pranavrajpurkar and 2 others". Brooklyn, NY, USA: Twitter. Archived from the original on 10 June 2018. Retrieved 13 January 2020. Replying to @AndrewYNg @pranavrajpurkar and 2 others ... Were you concerned that the network could memorize patient anatomy since patients cross train and validation? "ChestX-ray14 dataset contains 112,120 frontal-view X-ray images of 30,805 unique patients. We randomly split the entire dataset into 80% training, and 20% validation."
  5. ^ Kapoor, Sayash; Narayanan, Arvind (August 2023). "Leakage and the reproducibility crisis in machine-learning-based science". Patterns. 4 (9): 100804. doi:10.1016/j.patter.2023.100804. ISSN 2666-3899. PMC 10499856. PMID 37720327.
  6. ^ Batutin, Andrew (2024-06-20). "Data Leakage in Machine Learning Models". Shelf. Retrieved 2024-10-18.
  7. ^ a b "What is Data Leakage in Machine Learning? | IBM". www.ibm.com. 2024-09-30. Retrieved 2024-10-18.

