Statistical disclosure control (SDC), also known as statistical disclosure limitation (SDL) or disclosure avoidance, is a technique used in data-driven research to ensure no person or organization is identifiable from the results of an analysis of survey or administrative data, or in the release of microdata. The purpose of SDC is to protect the confidentiality of the respondents and subjects of the research.[1]
SDC usually refers to 'output SDC': ensuring that, for example, a published table or graph does not disclose confidential information about respondents. SDC can also describe protection methods applied to the data: for example, removing names and addresses, limiting extreme values, or swapping problematic observations. This is sometimes referred to as 'input SDC', but is more commonly called anonymization, de-identification, or microdata protection.
Textbooks (e.g. [2]) typically cover input SDC and tabular data protection, but not other parts of output SDC. This is because these two problems are of direct interest to the statistical agencies that supported the development of the field.[3] For analytical environments, output rules developed for statistical agencies were generally used until data managers began arguing for specific output SDC for research.[4]
This page focuses on output SDC.
Necessity
Many kinds of social, economic and health research use potentially sensitive data as a basis for their research, such as survey or Census data, tax records, health records, educational information, etc. Such information is usually given in confidence, and, in the case of administrative data, not always for the purpose of research.
Researchers are not usually interested in information about one single person or business; they are looking for trends among larger groups of people.[5] However, the data they use is, in the first place, linked to individual people and businesses, and SDC ensures that these cannot be identified from published data, no matter how detailed or broad.[6]
It is possible that at the end of data analysis, the researcher somehow singles out one person or business through their research. For example, a researcher may identify the exceptionally good or bad service in a geriatric department within a hospital in a remote area, where only one hospital provides such care. In that case, the data analysis 'discloses' the identity of the hospital, even if the dataset used for analysis was properly anonymised or de-identified.
Statistical disclosure control will identify this disclosure risk and ensure the results of the analysis are altered to protect confidentiality.[7] It requires a balance between protecting confidentiality and ensuring the results of the data analysis are still useful for statistical research.[8]
Output SDC: statistical models
Output SDC relies upon having a set of rules that can be followed by an output checker; for example, that a frequency table must have a minimum number of observations, or that survival tables should be right-censored for extreme values. The value and drawbacks of rules for frequency and magnitude tables have been discussed extensively since the late 20th century. However, as the need for rules covering other types of analysis has become more widely recognised, a more structured approach is needed.
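As a minimal sketch of how a minimum-count rule of this kind might be applied, the following Python function flags the cells of a frequency table that fall below a threshold. The function name, the example table and the threshold of 5 are illustrative assumptions, not values prescribed by any particular agency.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (row label, column label)


def cells_below_threshold(table: Dict[Cell, int], threshold: int = 5) -> List[Cell]:
    """Return, in sorted order, the cells whose count is below the threshold."""
    # Assumption: every cell below the threshold is flagged, including zeros.
    # Real rule sets differ in how they treat empty cells.
    return sorted(cell for cell, n in table.items() if n < threshold)


# Hypothetical frequency table: area by employment status
table = {
    ("urban", "employed"): 120,
    ("urban", "unemployed"): 14,
    ("rural", "employed"): 37,
    ("rural", "unemployed"): 3,
}

flagged = cells_below_threshold(table)
print(flagged)
```

An output checker would then decide how to handle the flagged cells, for example by suppressing them or merging categories.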
'Safe' and 'unsafe' statistics
Some statistical outputs, such as frequency tables, have a high level of inherent risk: differencing, low numbers, class disclosure. They therefore need to be checked before release, ideally by someone with some understanding of the data, to ensure that there is no meaningful risk on release. These are referred to as 'unsafe statistics'. However, there are some statistics, such as the coefficients from modelling, that have no meaningful risk and therefore can be released with no further checks. These are called 'safe statistics'. By separating statistics into 'safe' and 'unsafe', output checks can be concentrated on the latter, improving both security and efficiency.[4]
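The differencing risk can be illustrated with a small sketch. In this hypothetical example, two frequency tables each pass a minimum-count check on their own, but subtracting one from the other reveals very small counts for the group that appears only in the first table. All names, counts and the threshold of 5 are assumptions made for illustration.

```python
from typing import Dict


def difference_tables(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Cell-by-cell difference a - b for two tables with the same categories."""
    return {key: a[key] - b[key] for key in a}


# Hypothetical counts by employment status for one organisation
all_staff = {"full-time": 48, "part-time": 12}       # every member of staff
staff_under_60 = {"full-time": 47, "part-time": 10}  # staff aged under 60 only

threshold = 5

# Each table, taken alone, satisfies the minimum-count rule.
each_table_passes = all(
    n >= threshold for t in (all_staff, staff_under_60) for n in t.values()
)

# The difference gives the counts for staff aged 60 or over.
implied = difference_tables(all_staff, staff_under_60)
unsafe = {k: n for k, n in implied.items() if n < threshold}

print(each_table_passes, implied, unsafe)
```

This is why checking a frequency table in isolation is not always sufficient: the checker also needs to consider what has already been released.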
This distinction matters less for official statistics, where 'unsafe' statistics such as counts, means, medians and simple indexes dominate the outputs. It matters more for research, where a great deal of the output (particularly estimates and test statistics) is inherently 'safe'.
Statistical barns or statbarns
The safe/unsafe model is useful but limited, as it has only two simple categories; within those categories, guidelines for SDC largely consist of long lists of statistics and how to handle them. In 2023, the SACRO project (https://dareuk.org.uk/driver-project-sacro/) undertook to review the whole field and see whether a more useful classification scheme could be introduced. The result is the 'statistical barn' (or 'statbarn') concept.[9]
A statbarn is a classification of statistics for disclosure control purposes, where all of the statistics in that class share the same characteristics as far as disclosure control is concerned:
their mathematical form is similar
they share the same risks
they share the same responses to those risks
output checking rules are applicable to all
As of March 2024, 14 statbarns have been identified, with 12 described for output checkers:[10]
frequencies
statistical hypothesis tests
coefficients of association
position (median, IQR etc)
extreme values (max, min)
shape
linear aggregates
mode
non-linear concentration ratios
odds and risk ratios
survival tables
Gini coefficients
These cover almost all statistics. They also cover most graph forms, where the graph can be converted into the appropriate statbarn (for example, a pie chart is another form of frequency table). The SACRO manual provides guidance on what to look out for, and the rules to be followed for checking.
Output SDC: operating models
There are two main approaches to output SDC: principles-based and rules-based.[11] In principles-based systems, disclosure control attempts to uphold a specific set of fundamental principles—for example, "no person should be identifiable in released microdata".[12] Rules-based systems, in contrast, are evidenced by a specific set of rules that a person performing disclosure control follows (for example, "any frequency must be based on at least five observations"), after which the data are presumed to be safe to release. In general, official statistics are rules-based; research environments are more likely to be principles-based.
In research environments, the choice of output-checking regime can have significant operational implications.[13]
Rules-based SDC
In rules-based SDC, a rigid set of rules is used to determine whether or not the results of data analysis can be released. The rules are applied consistently, which makes it obvious what kinds of output are acceptable. Rules-based systems are good for ensuring consistency across time, across data sources, and across production teams, which makes them appealing for statistical agencies.[13] Rules-based systems also work well for remote job servers such as microdata.no or Lissy.
However, because the rules are inflexible, either disclosive information may still slip through, or the rules are over-restrictive and only allow the publication of results that are too broad for useful analysis.[11] In practice, research environments running rules-based systems may have to introduce flexibility through 'ad hoc' arrangements.[13]
Principles-based SDC
In principles-based SDC, both the researcher and the output checker are trained in SDC. They receive a set of rules, which are rules of thumb rather than hard rules as in rules-based SDC. This means that, in principle, any output may be approved or refused. The rules of thumb are a starting point for the researcher. A researcher may request outputs which breach the rules of thumb as long as (1) they are non-disclosive, (2) they are important, and (3) this is an exceptional request.[15] It is up to the researcher to prove that any 'unsafe' outputs are non-disclosive, but the checker has the final say. Since there are no hard rules, this requires knowledge of disclosure risks and judgment from both the researcher and the checker. It requires training and an understanding of statistics and data analysis,[11] although it has been argued[13] that this can be used to make the process more efficient than a rules-based model.
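The difference between the two operating models can be sketched in code. The following is a simplified illustration under stated assumptions, not a description of how any real service implements its checks: the class, the function names and the threshold of 5 are hypothetical, and a single minimum count stands in for the full set of rules of thumb.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExceptionRequest:
    """Researcher's case for releasing an output that breaches a rule of thumb."""
    non_disclosive: bool
    important: bool
    exceptional: bool


def rules_based_decision(min_count: int, threshold: int = 5) -> str:
    """Hard rule: the threshold alone decides the outcome."""
    return "release" if min_count >= threshold else "refuse"


def principles_based_decision(
    min_count: int,
    threshold: int = 5,
    request: Optional[ExceptionRequest] = None,
    checker_approves: Optional[bool] = None,
) -> str:
    """Rule of thumb gives a recommendation; the checker has the final say.

    checker_approves=None means the checker follows the recommendation.
    """
    meets_rule_of_thumb = min_count >= threshold
    case_made = (
        request is not None
        and request.non_disclosive
        and request.important
        and request.exceptional
    )
    recommendation = meets_rule_of_thumb or case_made
    final = recommendation if checker_approves is None else checker_approves
    return "release" if final else "refuse"


good_case = ExceptionRequest(non_disclosive=True, important=True, exceptional=True)
print(rules_based_decision(3))                          # hard rule breached
print(principles_based_decision(3, request=good_case))  # exception accepted
print(principles_based_decision(10, checker_approves=False))  # checker refuses anyway
```

In this sketch the same output with a minimum count of 3 is always refused under the hard rule, but may be released under the principles-based model if the researcher makes the case and the checker agrees.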
In the UK all major secure research environments in social science and public health, with the exception of Northern Ireland, are principles-based. This includes the UK Data Service's Secure Data Service,[16] the Office for National Statistics' Secure Research Service, the Scottish Safe Havens, Secure Anonymized Information Linkage (SAIL) and OpenSAFELY.
Critiques
Many contemporary statistical disclosure control techniques, such as generalization and cell suppression, have been shown to be vulnerable to attack by a hypothetical data intruder. For example, Cox showed in 2009 that complementary cell suppression typically leads to "over-protected" solutions because of the need to suppress both primary and complementary cells, and even then can lead to the compromise of sensitive data when exact intervals are reported.[17]
Many of the rules are arbitrary and reflect data owners' unwillingness to be different, rather than solid evidence. For example, Ritchie[18] demonstrated that the choice of a minimum threshold is more about an organisation's wish to be in line with others than any statistical rationale.
A more substantive criticism is that the theoretical models used to explore control measures are not appropriate as guides for practical action.[19] Hafner et al. provide a practical example of how a change in perspective can generate substantially different results.[3]
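The need for complementary suppression can be shown with a small sketch. If only one cell in a row is suppressed and the row total is published, the suppressed value follows by subtraction; suppressing a second cell hides the individual values but still reveals their sum, which bounds each of them. The example covers a single row only, and the function names and figures are hypothetical; it is not a reproduction of Cox's analysis.

```python
from typing import List, Optional


def recover_single_suppressed(row: List[Optional[int]], total: int) -> Optional[int]:
    """Recover the suppressed cell (None) if exactly one cell in the row is
    suppressed and the row total is published; otherwise return None."""
    missing = [i for i, v in enumerate(row) if v is None]
    if len(missing) != 1:
        return None
    return total - sum(v for v in row if v is not None)


def suppressed_sum(row: List[Optional[int]], total: int) -> int:
    """Sum of all suppressed cells, always recoverable from the total.
    For non-negative counts it is also an upper bound on each suppressed cell."""
    return total - sum(v for v in row if v is not None)


row_total = 37
primary_only = [20, None, 15]          # only the sensitive cell is suppressed
with_complementary = [None, None, 15]  # a second cell is suppressed as well

print(recover_single_suppressed(primary_only, row_total))        # exact value leaks
print(recover_single_suppressed(with_complementary, row_total))  # no exact recovery
print(suppressed_sum(with_complementary, row_total))             # each cell lies in [0, 22]
```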
Output SDC and AI models
Artificial intelligence and machine learning models present different risks for output checking.[20] The GRAIMATTER project (https://dareuk.org.uk/sprint-exemplar-project-graimatter/) provided some initial guidance and automatic tools. These were extended and simplified as part of the SACRO project (see below), and more guidelines for data services staff were added. This is still a quickly evolving area. The SDC-REBOOT community network (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=SDC-REBOOT) is currently co-ordinating the ongoing development of the tools and guidance.
Automated tools
Output checking is generally labour-intensive, as it requires analysts who can understand what they are looking at and make a judgement about whether to release an output. There is therefore considerable interest in automated checking. A Eurostat-commissioned report[21] explored the options for automated output checking, which largely come down to two approaches:
end-of-process review (EoPR): training a computer to look at the output and understand what it shows. This has the advantage of requiring no additional training for the researcher. However, it can be difficult to explain to an automated system what it is looking at, and doing so can be more time-consuming than checking the output manually. tau-Argus and sdcTable are EoPR tools.
within-process review (WPR): the output checking tool is called at the same time the output is being generated, and has access to the source data; therefore, there is no need to explain how the output has been created. The disadvantage of this approach is that it can slow down processing times, and requires the analysis to include the necessary commands to run the output checking tool. However, the major advantage is that it does not need to be taught about the data.
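The within-process idea can be sketched as a function that produces a statistic and its disclosure check in the same call, while the source records are still available. This is a hypothetical illustration only: the function name, the threshold and the status labels are assumptions, and it does not reproduce the interface of any existing tool.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def checked_frequency(
    values: Iterable[str], threshold: int = 5
) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Compute a frequency table and, in the same step, a per-cell check status.

    Because the check runs on the source records, no metadata describing the
    output has to be supplied afterwards.
    """
    counts = dict(Counter(values))
    status = {k: ("ok" if n >= threshold else "review") for k, n in counts.items()}
    return counts, status


# Hypothetical source records: one category label per respondent
records = ["A"] * 12 + ["B"] * 7 + ["C"] * 2

counts, status = checked_frequency(records)
print(counts)
print(status)
```

The researcher calls the checking function in place of the usual tabulation command, which is the extra step in the analysis code that the WPR approach requires.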
tau-Argus and sdcTable
tau-Argus and sdcTable are fully-automated open-source EoPR tools for tabular data protection (frequency and magnitude tables). They are designed to work with multiple tables. Metadata needs to be set up describing the output(s), and the control parameters. They provide the output checkers with extensive information on potential problems, including secondary disclosure across tables. They can also carry out correction measures, from suppression and simple rounding to secondary suppression and controlled tabular rounding. They do not deal with non-tabular outputs.
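Of the correction measures listed above, simple rounding is the easiest to illustrate. The sketch below rounds each count independently to the nearest multiple of a base; the base of 5, the function name and the counts are assumptions for illustration. It also shows why rounded cells need not add up to the rounded total, which is the problem that controlled rounding is designed to address.

```python
from typing import Dict


def round_to_base(n: int, base: int = 5) -> int:
    """Round a non-negative count to the nearest multiple of base (halves round up)."""
    return base * ((n + base // 2) // base)


# Hypothetical one-way frequency table
table: Dict[str, int] = {"A": 13, "B": 8, "C": 3}

rounded = {k: round_to_base(n) for k, n in table.items()}
rounded_total = round_to_base(sum(table.values()))

print(rounded)                # each cell rounded independently
print(sum(rounded.values()))  # sum of the rounded cells
print(rounded_total)          # rounded version of the true total
```

Here the rounded cells sum to 30 while the true total of 24 rounds to 25, so the published table is no longer additive.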
Because of the need to rewrite the metadata for each table, these tools are poorly suited for research use. However, in official statistics, where the same tables are being repeatedly generated and where secondary differencing is considered a significant problem, the investment in setting up the tools can be very cost-effective.
SACRO
SACRO (Semi-Automated Checking of Research Outputs) is a WPR tool. Its predecessor, ACRO, was commissioned by Eurostat in 2020 as a proof of concept to show that a general-purpose output checking tool could be developed.[22] In 2023 the UK Medical Research Council commissioned a generalised version (SACRO) which would work with multiple languages (as of 2024: Stata, R and Python) and provide a more user-friendly interface.[23] SACRO directly implements the statbarns model and is principles-based; hence, it is 'semi-automated', as it allows users to request exceptions and output checkers to override the automated recommendations. All UK social science secure facilities, and most UK public health secure facilities, are planning to adopt it.
The software is available on GitHub at https://github.com/AI-SDC, which also contains links to the original ACRO and tools for assessing AI models.
Lawrence H. Cox, "Vulnerability of Complementary Cell Suppression to Intruder Attack", Journal of Privacy and Confidentiality 1(2), 2009, pp. 235–251. http://repository.cmu.edu/jpc/vol1/iss2/8/