Imputation (statistics)

In statistics, imputation is the process of replacing missing data with substituted values. Substituting for an entire data point is known as "unit imputation"; substituting for a component of a data point is known as "item imputation". Missing data cause three main problems: they can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and reduce efficiency.[1] Imputation is therefore seen as a way to avoid the pitfalls of listwise deletion of cases that have missing values. Most statistical packages default to discarding any case with one or more missing values, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can be analyzed using standard techniques for complete data.[2]

Many approaches to handling missing data have been proposed, but the majority of them introduce bias. Well-known approaches include hot-deck and cold-deck imputation; listwise and pairwise deletion; mean imputation; non-negative matrix factorization; regression imputation; last observation carried forward; stochastic imputation; and multiple imputation.

Listwise (complete case) deletion

By far the most common means of dealing with missing data is listwise deletion (also known as complete-case analysis), in which all cases with a missing value are deleted. If the data are missing completely at random, listwise deletion does not add any bias, but it does decrease the power of the analysis by decreasing the effective sample size. For example, if 1000 cases are collected but 80 have missing values, the effective sample size after listwise deletion is 920. If the data are not missing completely at random, listwise deletion will introduce bias because the sub-sample of complete cases is not representative of the original sample (and if the original sample was itself a representative sample of a population, the complete cases are not representative of that population either).[3] In practice, data are rarely missing completely at random.[4]
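The drop in effective sample size can be illustrated with a minimal sketch. The records, field names, and the `listwise_delete` helper below are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
    {"age": 51, "income": 75000},
]


def listwise_delete(rows):
    """Keep only the rows in which every field is observed."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete_cases = listwise_delete(records)
effective_n = len(complete_cases)
print(f"{effective_n} of {len(records)} cases retained")  # 3 of 5 cases retained
```

Two of the five cases are discarded even though each is missing only one field, which is the loss of information that imputation tries to avoid.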

Pairwise deletion (or "available case analysis") involves deleting a case when it is missing a variable required for a particular analysis, but including that case in analyses for which all required variables are present. When pairwise deletion is used, the total N for analysis will not be consistent across parameter estimations. Because different parameters are estimated from different subsets of cases, pairwise deletion can produce mathematically impossible results, such as correlations that are over 100%.[5]
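The inconsistent N can be seen in a small sketch. The data and the `pairwise_n` helper are hypothetical; the helper only counts, for each pair of variables, the cases that a pairwise analysis would use.

```python
from itertools import combinations

# Hypothetical cases; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 4.0, "z": None},
    {"x": 4.0, "y": 5.0, "z": 2.0},
    {"x": 5.0, "y": 7.0, "z": None},
]


def pairwise_n(rows, variables):
    """Count, for each pair of variables, the cases where both are observed."""
    counts = {}
    for a, b in combinations(variables, 2):
        counts[(a, b)] = sum(
            1 for r in rows if r[a] is not None and r[b] is not None
        )
    return counts


n_by_pair = pairwise_n(rows, ["x", "y", "z"])
for pair, n in n_by_pair.items():
    print(pair, n)
```

Here the three pairs are based on 4, 3, and 2 cases respectively, whereas listwise deletion would use the same 2 complete cases throughout.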

The one advantage of complete-case deletion over other methods is that it is straightforward and easy to implement. This is a major reason why complete-case analysis remains the most popular method of handling missing data despite its many disadvantages.

Single imputation

Hot-deck

A once-common method of imputation was hot-deck imputation where a missing value was imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients. The stack of cards was "hot" because it was currently being processed.
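A minimal sketch of random hot-deck imputation follows. It assumes that "similar" means sharing the value of a single matching variable; the `hot_deck_impute` helper, the field names, and the data are hypothetical.

```python
import random


def hot_deck_impute(rows, target, match_on, rng):
    """Fill missing `target` values from a random donor sharing `match_on`."""
    imputed = []
    for row in rows:
        if row[target] is None:
            # Donors come from the same dataset as the recipient.
            donors = [
                d[target]
                for d in rows
                if d[target] is not None and d[match_on] == row[match_on]
            ]
            # Leave the value missing if no similar donor exists.
            value = rng.choice(donors) if donors else None
            row = {**row, target: value}
        imputed.append(row)
    return imputed


survey = [
    {"region": "north", "income": 40},
    {"region": "north", "income": 44},
    {"region": "north", "income": None},
    {"region": "south", "income": 70},
    {"region": "south", "income": None},
]
completed = hot_deck_impute(survey, "income", "region", random.Random(0))
```

Passing an explicit `random.Random` instance keeps the donor selection reproducible; the input list is left unmodified.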

One form of hot-deck imputation is called "last observation carried forward" (or LOCF for short), which involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset. The technique then finds the first missing value and imputes it with the cell value immediately preceding it. The process is repeated for the next cell with a missing value until all missing values have been imputed. In the common scenario in which the cases are repeated measurements of a variable for a person or other entity, this represents the belief that if a measurement is missing, the best guess is that it has not changed since the last time it was measured. This method is known to increase the risk of bias and of potentially false conclusions. For this reason, LOCF is not recommended.[6]
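The carry-forward procedure can be sketched in a few lines. The `locf` helper and the measurements are hypothetical, and the input is assumed to be already sorted.

```python
def locf(values):
    """Replace each None with the most recent observed value before it."""
    filled = []
    last_seen = None
    for v in values:
        if v is not None:
            last_seen = v
        # A leading gap stays None because there is nothing to carry forward.
        filled.append(last_seen)
    return filled


visits = [7.1, None, None, 6.4, None]
print(locf(visits))  # [7.1, 7.1, 7.1, 6.4, 6.4]
```

The output makes the implicit assumption visible: every gap is filled with a flat line, regardless of any trend in the observed measurements.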

Cold-deck

Cold-deck imputation, by contrast, selects donors from another dataset: missing values are replaced with the response values of similar items in past surveys. It is available in surveys that measure time intervals. Due to advances in computer power, more sophisticated methods of imputation have generally superseded the original random and sorted hot-deck imputation techniques.

Mean substitution

Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the variable(s) that are imputed. This is because, among the imputed cases, there is guaranteed to be no relationship between the imputed variable and any other measured variable. Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis.
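Both properties can be checked on a small hypothetical example. The data and the `pearson` helper below are illustrative only; the helper computes the ordinary Pearson correlation.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, None, 4.2, None, 6.1]  # None marks a missing value

observed = [v for v in y if v is not None]
y_mean = mean(observed)
y_imputed = [y_mean if v is None else v for v in y]

# Correlation on the cases where y was observed, then on the imputed data.
r_observed = pearson([a for a, b in zip(x, y) if b is not None], observed)
r_imputed = pearson(x, y_imputed)
print(round(r_observed, 3), round(r_imputed, 3))
```

In this example the mean of `y_imputed` equals the mean of the observed values, while the correlation with `x` falls from roughly 0.999 to roughly 0.917, because the imputed points contribute nothing to the covariance.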

Mean imputation can be carried out within classes (i.e. categories such as gender), and can be expressed as $\hat{y}_i = \bar{y}_h$, where $\hat{y}_i$ is the imputed value for record $i$ and $\bar{y}_h$ is the sample mean of respondent data within some class $h$. This is a special case of generalized regression imputation:

$$\hat{y}_{mi} = b_{r0} + \sum_j b_{rj} z_{mij} + \hat{e}_{mi}$$

Here the values $b_{r0}, b_{rj}$ are estimated from regressing $y$ on $x$ in non-imputed data, $z$ is a dummy variable for class membership, and data are split into respondent ($r$) and missing ($m$).[7][8]
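Within-class mean imputation, as described above, can be sketched as follows. The `class_mean_impute` helper, the field names, and the data are hypothetical; the sketch assumes every class contains at least one respondent.

```python
from collections import defaultdict
from statistics import mean


def class_mean_impute(rows, target, class_key):
    """Replace each missing `target` with the respondent mean of its class."""
    by_class = defaultdict(list)
    for r in rows:
        if r[target] is not None:
            by_class[r[class_key]].append(r[target])
    # Raises KeyError below if a class has no respondents at all.
    class_means = {h: mean(vals) for h, vals in by_class.items()}
    return [
        {**r, target: class_means[r[class_key]]} if r[target] is None else dict(r)
        for r in rows
    ]


people = [
    {"group": "a", "score": 10.0},
    {"group": "a", "score": 14.0},
    {"group": "a", "score": None},
    {"group": "b", "score": 30.0},
    {"group": "b", "score": None},
    {"group": "b", "score": 34.0},
]
filled = class_mean_impute(people, "score", "group")
```

Each missing score receives the mean of the observed scores in its own group (12.0 for group "a" and 32.0 for group "b") rather than the overall mean of 22.0.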

Non-negative matrix factorization

Non-negative matrix factorization (NMF) can accommodate missing data by leaving them out of its cost function during minimization, rather than treating them as zeros, which could introduce bias.[9] This makes it a mathematically proven method for data imputation. Because the missing data are ignored in the cost function, their impact can be as small as a second-order effect.
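The idea of excluding missing cells from the cost function can be sketched with a rank-1 factorization. This is an illustrative simplification, not the method of the cited paper: it assumes a squared-error cost, multiplicative updates restricted to observed cells, and a matrix in which every row and column has at least one positive observed entry. The `masked_rank1_nmf` helper is hypothetical.

```python
def masked_rank1_nmf(V, iterations=500):
    """Fit V ~ w h^T over observed cells only; None marks a missing cell."""
    n_rows, n_cols = len(V), len(V[0])
    w = [1.0] * n_rows
    h = [1.0] * n_cols
    for _ in range(iterations):
        # Multiplicative update for w; missing cells are skipped in both sums.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if V[i][j] is not None]
            num = sum(V[i][j] * h[j] for j in obs)
            den = sum(w[i] * h[j] * h[j] for j in obs)
            w[i] *= num / den
        # Same update for h, summing over the observed cells of each column.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if V[i][j] is not None]
            num = sum(V[i][j] * w[i] for i in obs)
            den = sum(w[i] * h[j] * w[i] for i in obs)
            h[j] *= num / den
    return w, h


# Rank-1 matrix (outer product of [1, 2, 3] and [2, 1, 4]) with one cell removed.
V = [
    [2.0, 1.0, 4.0],
    [4.0, 2.0, None],
    [6.0, 3.0, 12.0],
]
w, h = masked_rank1_nmf(V)
imputed = w[1] * h[2]
print(round(imputed, 4))
```

Because the missing cell never enters the cost, the factors are fitted to the observed cells alone and the product `w[1] * h[2]` serves as the imputed value, which for this exactly rank-1 example approaches the removed value of 8.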

Regression

Regression imputation has the opposite problem of mean imputation. A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where the value of that variable is missing. In other words, available information for complete and incomplete cases is used to predict the value of a specific variable, and the fitted values from the regression model are used to impute the missing values. The problem is that the imputed data do not have an error term included in their estimation, so the imputed values fall exactly on the regression line without any residual variance. This causes relationships to be overstated and suggests greater precision in the imputed values than is warranted. The regression model predicts the most likely value of missing data but does not supply uncertainty about that value.

Stochastic regression was a fairly successful attempt to correct the lack of an error term in regression imputation by adding the average regression variance to the regression imputations to introduce error. Stochastic regression shows much less bias than the above-mentioned techniques, but it still misses one thing: if data are imputed, then intuitively one would think that more noise should be introduced to the problem than simple residual variance.[5]
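The difference between deterministic and stochastic regression imputation can be sketched with a single predictor. The helpers and data below are hypothetical, and the sketch assumes that the added error is drawn from a normal distribution with the residual standard deviation of the fitted line, which is one common way to introduce the error term.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def regression_impute(x, y, rng=None):
    """Impute missing y from x; add Gaussian residual noise when rng is given."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    xs, ys = zip(*pairs)
    intercept, slope = fit_line(xs, ys)
    residuals = [b - (intercept + slope * a) for a, b in pairs]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = (sum(e * e for e in residuals) / (len(pairs) - 2)) ** 0.5
    filled = []
    for a, b in zip(x, y):
        if b is None:
            b = intercept + slope * a          # fitted value on the line
            if rng is not None:
                b += rng.gauss(0.0, sigma)     # stochastic variant
        filled.append(b)
    return filled


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]

deterministic = regression_impute(x, y)
stochastic = regression_impute(x, y, rng=random.Random(1))
```

The deterministic imputations lie exactly on the fitted line, while the stochastic ones scatter around it with roughly the same spread as the observed residuals.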

Multiple imputation

To deal with the problem of increased noise due to imputation, Rubin (1987)[10] developed a method for averaging the outcomes across multiple imputed data sets. All multiple imputation methods follow three steps.[3]

  1. Imputation – Similar to single imputation, missing values are imputed. However, the imputed values are drawn m times from a distribution rather than just once. At the end of this step, there should be m completed datasets.
  2. Analysis – Each of the m datasets is analyzed. At the end of this step there should be m analyses.
  3. Pooling – The m results are consolidated into one result by calculating the mean, variance, and confidence interval of the variable of concern[11][12] or by combining simulations from each separate model.[13]
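The pooling step can be sketched with the combining rules usually attributed to Rubin: the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The `pool_rubin` helper and the numbers are hypothetical, and the confidence interval, which requires a t reference distribution, is omitted.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(within_variances)     # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (m - 1 divisor)
    total = u_bar + (1 + 1 / m) * b    # total variance of the pooled estimate
    return q_bar, total


# Hypothetical results from analysing m = 3 completed datasets.
estimates = [2.0, 2.4, 2.2]
within_variances = [0.10, 0.12, 0.11]
pooled_estimate, total_variance = pool_rubin(estimates, within_variances)
print(pooled_estimate, total_variance)
```

The between-imputation term is what distinguishes this from a single imputation: disagreement among the m analyses inflates the pooled variance, so the uncertainty due to the missing data is carried into the final result.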

Multiple imputation can be used in cases where the data are missing completely at random, missing at random, and missing not at random, though it can be biased in the last case.[14] One approach is multiple imputation by chained equations (MICE), also known as "fully conditional specification" and "sequential regression multiple imputation."[15] MICE is designed for missing at random data, though there is simulation evidence to suggest that with a sufficient number of auxiliary variables it can also work on data that are missing not at random. However, MICE can suffer from performance problems when the number of observations is large and the data have complex features, such as nonlinearities and high dimensionality.

More recent approaches to multiple imputation use machine learning techniques to improve its performance. MIDAS (Multiple Imputation with Denoising Autoencoders), for instance, uses denoising autoencoders, a type of unsupervised neural network, to learn fine-grained latent representations of the observed data.[16] MIDAS has been shown to provide accuracy and efficiency advantages over traditional multiple imputation strategies.

As alluded to in the previous section, single imputation does not take into account the uncertainty in the imputations. After single imputation, the imputed data are treated as if they were the actual values. Neglecting the uncertainty in the imputation can lead to overly precise results and errors in any conclusions drawn.[17] By imputing multiple times, multiple imputation accounts for the uncertainty and the range of values that the true value could have taken. As expected, the combination of both uncertainty estimation and deep learning for imputation is among the best strategies and has been used to model heterogeneous drug discovery data.[18][19]

Additionally, while single imputation and complete-case analysis are easier to implement, multiple imputation is not very difficult to implement. A wide range of packages for different statistical software readily perform multiple imputation. For example, the mice package allows users in R to perform multiple imputation using the MICE method.[20] MIDAS can be implemented in R with the rMIDAS package and in Python with the MIDASpy package.[16]

References

  1. Barnard, J.; Meng, X. L. (1999-03-01). "Applications of multiple imputation in medical studies: from AIDS to NHANES". Statistical Methods in Medical Research. 8 (1): 17–36. doi:10.1177/096228029900800103. ISSN 0962-2802. PMID 10347858. S2CID 11453137.
  2. Gelman, Andrew; Hill, Jennifer (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. Ch. 25.
  3. Lall, Ranjit (2016). "How Multiple Imputation Makes a Difference". Political Analysis. 24 (4): 414–433. doi:10.1093/pan/mpw020.
  4. Kenward, Michael G (2013-02-26). "The handling of missing data in clinical trials". Clinical Investigation. 3 (3): 241–250. doi:10.4155/cli.13.7. ISSN 2041-6792.
  5. Enders, C. K. (2010). Applied Missing Data Analysis. New York: Guilford Press. ISBN 978-1-60623-639-0.
  6. Molnar, Frank J.; Hutton, Brian; Fergusson, Dean (2008-10-07). "Does analysis using "last observation carried forward" introduce bias in dementia research?". Canadian Medical Association Journal. 179 (8): 751–753. doi:10.1503/cmaj.080820. ISSN 0820-3946. PMC 2553855. PMID 18838445.
  7. Kalton, Graham (1986). "The treatment of missing survey data". Survey Methodology. 12: 1–16.
  8. Kalton, Graham; Kasprzyk, Daniel (1982). "Imputing for missing survey responses" (PDF). Proceedings of the Section on Survey Research Methods. 22. American Statistical Association. S2CID 195855359. Archived from the original (PDF) on 2020-02-12.
  9. Ren, Bin; Pueyo, Laurent; Chen, Christine; Choquet, Elodie; Debes, John H; Duchene, Gaspard; Menard, Francois; Perrin, Marshall D. (2020). "Using Data Imputation for Signal Separation in High Contrast Imaging". The Astrophysical Journal. 892 (2): 74. arXiv:2001.00563. Bibcode:2020ApJ...892...74R. doi:10.3847/1538-4357/ab7024. S2CID 209531731.
  10. Rubin, Donald (9 June 1987). Multiple imputation for nonresponse in surveys. Wiley Series in Probability and Statistics. Wiley. doi:10.1002/9780470316696. ISBN 9780471087052.
  11. Yuan, Yang C. (2010). "Multiple imputation for missing data: Concepts and new development" (PDF). SAS Institute Inc., Rockville, MD. 49: 1–11. Archived from the original (PDF) on 2018-11-03. Retrieved 2018-01-17.
  12. Van Buuren, Stef (2012-03-29). "2. Multiple Imputation". Flexible Imputation of Missing Data. Chapman & Hall/CRC Interdisciplinary Statistics Series. Vol. 20125245. Chapman and Hall/CRC. doi:10.1201/b11826. ISBN 9781439868249. S2CID 60316970.
  13. King, Gary; Honaker, James; Joseph, Anne; Scheve, Kenneth (March 2001). "Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation". American Political Science Review. 95 (1): 49–69. doi:10.1017/S0003055401000235. ISSN 1537-5943. S2CID 15484116.
  14. Pepinsky, Thomas B. (2018-08-03). "A Note on Listwise Deletion versus Multiple Imputation". Political Analysis. 26 (4). Cambridge University Press (CUP): 480–488. doi:10.1017/pan.2018.18. ISSN 1047-1987.
  15. Azur, Melissa J.; Stuart, Elizabeth A.; Frangakis, Constantine; Leaf, Philip J. (2011-03-01). "Multiple imputation by chained equations: what is it and how does it work?". International Journal of Methods in Psychiatric Research. 20 (1): 40–49. doi:10.1002/mpr.329. ISSN 1557-0657. PMC 3074241. PMID 21499542.
  16. Lall, Ranjit; Robinson, Thomas (2021). "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning". Political Analysis. 30 (2): 179–196. doi:10.1017/pan.2020.49.
  17. Graham, John W. (2009-01-01). "Missing data analysis: making it work in the real world". Annual Review of Psychology. 60: 549–576. doi:10.1146/annurev.psych.58.110405.085530. ISSN 0066-4308. PMID 18652544.
  18. Irwin, Benedict (2020-06-01). "Practical Applications of Deep Learning to Impute Heterogeneous Drug Discovery Data". Journal of Chemical Information and Modeling. 60 (6): 2848–2857. doi:10.1021/acs.jcim.0c00443. PMID 32478517. S2CID 219171721.
  19. Whitehead, Thomas (2019-02-12). "Imputation of Assay Bioactivity Data Using Deep Learning". Journal of Chemical Information and Modeling. 59 (3): 1197–1204. doi:10.1021/acs.jcim.8b00768. PMID 30753070. S2CID 73429643.
  20. Horton, Nicholas J.; Kleinman, Ken P. (2007-02-01). "Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models". The American Statistician. 61 (1): 79–90. doi:10.1198/000313007X172556. ISSN 0003-1305. PMC 1839993. PMID 17401454.
