Pointwise mutual information

In statistics, probability theory and information theory, pointwise mutual information (PMI),[1] or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be if the events were independent.[2]

PMI (especially in its positive pointwise mutual information variant) has been described as "one of the most important concepts in NLP", where it "draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in [a] corpus than we would have expected them to appear by chance."[2]

The concept was introduced in 1961 by Robert Fano under the name of "mutual information", but today that term is instead used for a related measure of dependence between random variables:[2] The mutual information (MI) of two discrete random variables refers to the average PMI of all possible events.

Definition

The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and the probability of their coincidence given only their individual distributions, assuming independence. Mathematically:[2]

$$\operatorname{pmi}(x;y) \equiv \log_2\frac{p(x,y)}{p(x)\,p(y)} = \log_2\frac{p(x\mid y)}{p(x)} = \log_2\frac{p(y\mid x)}{p(y)}$$

(with the latter two expressions being equal to the first by Bayes' theorem). The mutual information (MI) of the random variables X and Y is the expected value of the PMI (over all possible outcomes).

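The measure is symmetric ($\operatorname{pmi}(x;y) = \operatorname{pmi}(y;x)$). It can take positive or negative values, but is zero if X and Y are independent. Note that even though PMI may be negative or positive, its expected value over all joint events (MI) is non-negative. PMI is maximal when X and Y are perfectly associated (i.e. $p(x\mid y) = 1$ or $p(y\mid x) = 1$), yielding the following bounds:

$$-\infty \le \operatorname{pmi}(x;y) \le \min\left[-\log_2 p(x),\, -\log_2 p(y)\right]$$

Finally, $\operatorname{pmi}(x;y)$ will increase if $p(x\mid y)$ is fixed but $p(x)$ decreases.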

Here is an example to illustrate:

x   y   p(x, y)
0   0   0.1
0   1   0.7
1   0   0.15
1   1   0.05

Using this table we can marginalize to get the following additional table for the individual distributions:

value   p(x)   p(y)
0       0.8    0.25
1       0.2    0.75

With this example, we can compute four values for $\operatorname{pmi}(x;y)$. Using base-2 logarithms:

pmi(x=0;y=0) = -1
pmi(x=0;y=1) = 0.222392
pmi(x=1;y=0) = 1.584963
pmi(x=1;y=1) = -1.584963

(For reference, the mutual information would then be 0.2141709.)
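
The worked example can be reproduced in a few lines of Python. This is a minimal sketch rather than a reference implementation: the dictionary `joint` and the helper names `marginals`, `pmi` and `mutual_information` are introduced here for illustration only and do not come from any library.

```python
import math

# Joint distribution p(x, y) from the example table.
joint = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.15, (1, 1): 0.05}


def marginals(joint):
    """Sum the joint distribution over y and over x to obtain p(x) and p(y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def pmi(joint, x, y):
    """Base-2 PMI of the outcome pair (x, y); assumes p(x, y) > 0."""
    px, py = marginals(joint)
    return math.log2(joint[(x, y)] / (px[x] * py[y]))


def mutual_information(joint):
    """Expected PMI over all outcome pairs with non-zero probability."""
    return sum(p * pmi(joint, x, y) for (x, y), p in joint.items() if p > 0)


for (x, y) in sorted(joint):
    print(f"pmi(x={x};y={y}) = {pmi(joint, x, y):.6f}")
print(f"MI = {mutual_information(joint):.7f}")
```

Pairs with zero joint probability are skipped in the MI sum, following the usual convention that they contribute nothing to the expectation.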

Similarities to mutual information

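Pointwise mutual information has many of the same relationships as the mutual information. In particular,

$$\begin{aligned}
\operatorname{pmi}(x;y) &= h(x) + h(y) - h(x,y) \\
&= h(x) - h(x\mid y) \\
&= h(y) - h(y\mid x)
\end{aligned}$$

where $h(x)$ is the self-information, or $-\log_2 p(x)$.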
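
As a numerical sanity check, the self-information identities pmi(x;y) = h(x) + h(y) - h(x,y) = h(x) - h(x|y) = h(y) - h(y|x) can be evaluated on the example distribution from the Definition section. The sketch below is illustrative; the names `h` and `pmi_forms` are assumptions made for this example.

```python
import math

# Joint and marginal distributions from the worked example in the Definition section.
joint = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.15, (1, 1): 0.05}
p_x = {0: 0.8, 1: 0.2}
p_y = {0: 0.25, 1: 0.75}


def h(p):
    """Self-information (in bits) of an outcome with probability p."""
    return -math.log2(p)


def pmi_forms(x, y):
    """Return PMI computed directly and via three self-information identities."""
    p_xy = joint[(x, y)]
    direct = math.log2(p_xy / (p_x[x] * p_y[y]))
    via_joint = h(p_x[x]) + h(p_y[y]) - h(p_xy)
    via_x_given_y = h(p_x[x]) - h(p_xy / p_y[y])  # h(x) - h(x|y)
    via_y_given_x = h(p_y[y]) - h(p_xy / p_x[x])  # h(y) - h(y|x)
    return direct, via_joint, via_x_given_y, via_y_given_x


for pair in sorted(joint):
    print(pair, [round(v, 6) for v in pmi_forms(*pair)])
```

All four values in each printed row should agree up to floating-point rounding.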

Variants

Several variations of PMI have been proposed, in particular to address what has been described as its "two main limitations":[3]

  1. PMI can take both positive and negative values and has no fixed bounds, which makes it harder to interpret.[3]
  2. PMI has "a well-known tendency to give higher scores to low-frequency events", but in applications such as measuring word similarity, it is preferable to have "a higher score for pairs of words whose relatedness is supported by more evidence."[3]

Positive PMI

The positive pointwise mutual information (PPMI) measure is defined by setting negative values of PMI to zero:[2]

$$\operatorname{ppmi}(x;y) \equiv \max\left(\log_2\frac{p(x,y)}{p(x)\,p(y)},\, 0\right)$$

This definition is motivated by the observation that "negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous" and also by a concern that "it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgment".[2] It also avoids having to deal with $-\infty$ values for events that never occur together ($p(x,y) = 0$), by setting PPMI for these to 0.[2]
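
A minimal sketch of PPMI, assuming the probabilities have already been estimated. The function name `ppmi` and its argument names are illustrative choices, not an established API. Pairs that never co-occur are mapped to 0 explicitly, since the logarithm of zero is undefined.

```python
import math


def ppmi(p_xy, p_x, p_y):
    """Positive PMI (base 2): negative PMI values are clipped to zero."""
    if p_xy == 0.0:
        # PMI would be minus infinity here; PPMI defines the score as 0.
        return 0.0
    return max(math.log2(p_xy / (p_x * p_y)), 0.0)


print(ppmi(0.15, 0.2, 0.25))  # positively associated pair: PMI is kept
print(ppmi(0.1, 0.8, 0.25))   # negatively associated pair: clipped to 0
print(ppmi(0.0, 0.5, 0.5))    # pair that never co-occurs: 0 by definition
```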

Normalized pointwise mutual information (npmi)

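Pointwise mutual information can be normalized to the range [-1, +1], resulting in -1 (in the limit) for events that never occur together, 0 for independence, and +1 for complete co-occurrence.[4]

$$\operatorname{npmi}(x;y) = \frac{\operatorname{pmi}(x;y)}{h(x,y)}$$

where $h(x,y)$ is the joint self-information $-\log_2 p(x,y)$.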
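
The normalization can be sketched as follows, taking NPMI to be PMI divided by the joint self-information -log2 p(x,y). The function name `npmi` is illustrative. Returning -1 for pairs that never co-occur and raising an error when p(x,y) = 1 are design choices made for this sketch, not part of the cited definition.

```python
import math


def npmi(p_xy, p_x, p_y):
    """Normalized PMI in [-1, 1]; assumes 0 <= p_xy < 1 and p_x, p_y > 0."""
    if p_xy == 0.0:
        # Limit value for pairs that never occur together.
        return -1.0
    if p_xy >= 1.0:
        raise ValueError("h(x, y) is zero when p(x, y) = 1, so NPMI is undefined")
    pmi = math.log2(p_xy / (p_x * p_y))
    joint_self_information = -math.log2(p_xy)
    return pmi / joint_self_information


print(npmi(0.3, 0.3, 0.3))   # complete co-occurrence: p(x,y) = p(x) = p(y)
print(npmi(0.06, 0.2, 0.3))  # independence: p(x,y) = p(x) p(y)
print(npmi(0.0, 0.2, 0.3))   # the pair never occurs together
```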

PMI^k family

The PMI^k measure (for k = 2, 3, etc.), which was introduced by Béatrice Daille around 1994 and as of 2011 was described as being "among the most widely used variants", is defined as[5][3]

$$\operatorname{pmi}^k(x;y) \equiv \log_2\frac{p(x,y)^k}{p(x)\,p(y)}$$

In particular, $\operatorname{pmi}^1(x;y) = \operatorname{pmi}(x;y)$. The additional factors of $p(x,y)$ inside the logarithm are intended to correct the bias of PMI towards low-frequency events, by boosting the scores of frequent pairs.[3] A 2011 case study demonstrated the success of PMI^3 in correcting this bias on a corpus drawn from English Wikipedia. Taking x to be the word "football", its most strongly associated words y according to the PMI measure (i.e. those maximizing $\operatorname{pmi}(x;y)$) were domain-specific ("midfielder", "cornerbacks", "goalkeepers") whereas the terms ranked most highly by PMI^3 were much more general ("league", "clubs", "england").[3]
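
The effect of the exponent k can be illustrated with two made-up word pairs, one rare and one frequent. The probabilities below are invented for this sketch and are not taken from the cited study; `pmi_k` is a hypothetical helper name. It assumes PMI^k is the base-2 logarithm of p(x,y)^k / (p(x) p(y)), computed as a sum of logarithms to avoid underflow for small probabilities.

```python
import math


def pmi_k(p_xy, p_x, p_y, k=1):
    """PMI^k in bits: log2(p_xy**k / (p_x * p_y)); k=1 gives ordinary PMI."""
    return k * math.log2(p_xy) - math.log2(p_x) - math.log2(p_y)


# Hypothetical probabilities: a rare pair that always co-occurs,
# and a frequent pair that co-occurs ten times more often than chance.
rare = (1e-6, 1e-6, 1e-6)
frequent = (1e-3, 1e-2, 1e-2)

for k in (1, 2, 3):
    print(k, round(pmi_k(*rare, k=k), 3), round(pmi_k(*frequent, k=k), 3))
```

With k = 1 the rare pair scores higher, while with k = 3 the frequent pair does, which is the reordering the extra factors of p(x,y) are meant to produce.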

Specific Correlation

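Total correlation is an extension of mutual information to more than two variables. Analogously to the definition of total correlation, the extension of PMI to more than two variables is "specific correlation."[6] The specific correlation (SI) of the outcomes $x_1, x_2, \ldots, x_n$ of n random variables is expressed as the following:

$$\operatorname{SI}(x_1, x_2, \ldots, x_n) = \log\frac{p(x_1, x_2, \ldots, x_n)}{\prod_{i=1}^{n} p(x_i)}$$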
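
As an illustration, assume the specific correlation of a tuple of outcomes is the logarithm of their joint probability divided by the product of their marginal probabilities. Under that assumption it reduces to PMI for two variables. The helper name `specific_correlation` and the use of base-2 logarithms are choices made for this sketch.

```python
import math


def specific_correlation(p_joint, marginal_probs):
    """log2 of the joint probability over the product of the marginals."""
    return math.log2(p_joint / math.prod(marginal_probs))


# Two variables: same value as pmi(x=1;y=0) in the Definition example.
print(specific_correlation(0.15, [0.2, 0.25]))

# Three mutually independent fair coins: the joint equals the product.
print(specific_correlation(0.125, [0.5, 0.5, 0.5]))
```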

Chain-rule

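Like mutual information,[7] pointwise mutual information follows the chain rule, that is,

$$\operatorname{pmi}(x;yz) = \operatorname{pmi}(x;y) + \operatorname{pmi}(x;z\mid y)$$

This is proven through application of Bayes' theorem:

$$\begin{aligned}
\operatorname{pmi}(x;y) + \operatorname{pmi}(x;z\mid y) &= \log_2\frac{p(x,y)}{p(x)\,p(y)} + \log_2\frac{p(x,z\mid y)}{p(x\mid y)\,p(z\mid y)} \\
&= \log_2\left[\frac{p(x,y)}{p(x)\,p(y)} \cdot \frac{p(x,z\mid y)}{p(x\mid y)\,p(z\mid y)}\right] \\
&= \log_2\frac{p(x\mid y)\,p(y)\,p(x,z\mid y)}{p(x)\,p(y)\,p(x\mid y)\,p(z\mid y)} \\
&= \log_2\frac{p(x,yz)}{p(x)\,p(yz)} \\
&= \operatorname{pmi}(x;yz)
\end{aligned}$$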
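
The identity pmi(x;yz) = pmi(x;y) + pmi(x;z|y) can also be checked numerically. The sketch below builds an arbitrary joint distribution over three binary variables (the random weights are purely illustrative) and compares both sides for every outcome. The helper names `prob` and `chain_rule_sides` are assumptions made for this example.

```python
import itertools
import math
import random

random.seed(0)

# Arbitrary strictly positive joint distribution p(x, y, z) over binary variables.
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.uniform(0.1, 1.0) for _ in outcomes]
total = sum(weights)
p = {o: w / total for o, w in zip(outcomes, weights)}


def prob(x=None, y=None, z=None):
    """Marginal probability of the fixed coordinates; None means summed out."""
    return sum(
        v for (a, b, c), v in p.items()
        if (x is None or a == x) and (y is None or b == y) and (z is None or c == z)
    )


def chain_rule_sides(x, y, z):
    """Return (pmi(x;yz), pmi(x;y) + pmi(x;z|y)) in bits."""
    lhs = math.log2(prob(x, y, z) / (prob(x=x) * prob(y=y, z=z)))
    pmi_xy = math.log2(prob(x=x, y=y) / (prob(x=x) * prob(y=y)))
    p_xz_given_y = prob(x, y, z) / prob(y=y)
    p_x_given_y = prob(x=x, y=y) / prob(y=y)
    p_z_given_y = prob(y=y, z=z) / prob(y=y)
    pmi_xz_given_y = math.log2(p_xz_given_y / (p_x_given_y * p_z_given_y))
    return lhs, pmi_xy + pmi_xz_given_y


for o in outcomes:
    lhs, rhs = chain_rule_sides(*o)
    print(o, round(lhs, 6), round(rhs, 6))
```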

Applications

PMI can be used in various disciplines, e.g. in information theory, linguistics or chemistry (in profiling and analysis of chemical compounds).[8] In computational linguistics, PMI has been used for finding collocations and associations between words. For instance, counts of occurrences and co-occurrences of words in a text corpus can be used to approximate the probabilities $p(x)$ and $p(x,y)$ respectively. The following table shows counts for the pairs of words with the highest and the lowest PMI scores in the first 50 million words of Wikipedia (dump of October 2015),[citation needed] restricted to pairs with 1,000 or more co-occurrences. The frequency corresponding to each count can be obtained by dividing its value by 50,000,952. (Note: the natural logarithm is used to calculate the PMI values in this example, instead of log base 2.)

word 1   word 2     count word 1   count word 2   count of co-occurrences   PMI
puerto   rico       1938           1311           1159                      10.0349081703
hong     kong       2438           2694           2205                      9.72831972408
los      angeles    3501           2808           2791                      9.56067615065
carbon   dioxide    4265           1353           1032                      9.09852946116
prize    laureate   5131           1676           1210                      8.85870710982
san      francisco  5237           2477           1779                      8.83305176711
nobel    prize      4098           5131           2498                      8.68948811416
ice      hockey     5607           3002           1933                      8.6555759741
star     trek       8264           1594           1489                      8.63974676575
car      driver     5578           2749           1384                      8.41470768304
it       the        283891         3293296        3347                      -1.72037278119
are      of         234458         1761436        1019                      -2.09254205335
this     the        199882         3293296        1211                      -2.38612756961
is       of         565679         1761436        1562                      -2.54614706831
and      of         1375396        1761436        2949                      -2.79911817902
a        and        984442         1375396        1457                      -2.92239510038
in       and        1187652        1375396        1537                      -3.05660070757
to       and        1025659        1375396        1286                      -3.08825363041
to       in         1025659        1187652        1066                      -3.12911348956
of       and        1761436        1375396        1190                      -3.70663100173

Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.
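
The table entries can be recomputed directly from the counts. The sketch below assumes that each probability is estimated as a count divided by the corpus size of 50,000,952 words and that the natural logarithm is used, as stated above. The function name `pmi_from_counts` is illustrative.

```python
import math

TOTAL_WORDS = 50_000_952  # corpus size given in the text


def pmi_from_counts(count_x, count_y, count_xy, total=TOTAL_WORDS):
    """Natural-log PMI with probabilities estimated as relative frequencies."""
    p_x = count_x / total
    p_y = count_y / total
    p_xy = count_xy / total
    return math.log(p_xy / (p_x * p_y))


# Highest- and lowest-scoring rows of the table.
print("puerto rico:", pmi_from_counts(1938, 1311, 1159))
print("of and:     ", pmi_from_counts(1761436, 1375396, 1190))
```

The printed values should agree with the PMI column of the table up to rounding.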

References

  1. ^ Kenneth Ward Church and Patrick Hanks (March 1990). "Word association norms, mutual information, and lexicography". Comput. Linguist. 16 (1): 22–29.
  2. ^ a b c d e f g Dan Jurafsky and James H. Martin: Speech and Language Processing (3rd ed. draft), December 29, 2021, chapter 6
  3. ^ a b c d e f Francois Role, Mohamed Nadif. Handling the Impact of Low Frequency Events on Co-occurrence-based Measures of Word Similarity: A Case Study of Pointwise Mutual Information. Proceedings of KDIR 2011: KDIR - International Conference on Knowledge Discovery and Information Retrieval, Paris, October 26–29, 2011
  4. ^ Bouma, Gerlof (2009). "Normalized (Pointwise) Mutual Information in Collocation Extraction" (PDF). Proceedings of the Biennial GSCL Conference.
  5. ^ B. Daille. Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. Thèse de Doctorat en Informatique Fondamentale. Université Paris 7. 1994. p.139
  6. ^ Tim Van de Cruys. 2011. Two Multivariate Generalizations of Pointwise Mutual Information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 16–20, Portland, Oregon, USA. Association for Computational Linguistics.
  7. ^ Paul L. Williams. INFORMATION DYNAMICS: ITS THEORY AND APPLICATION TO EMBODIED COGNITIVE SYSTEMS.
  8. ^ Čmelo, I.; Voršilák, M.; Svozil, D. (2021-01-10). "Profiling and analysis of chemical compounds using pointwise mutual information". Journal of Cheminformatics. 13 (1): 3. doi:10.1186/s13321-020-00483-y. ISSN 1758-2946. PMC 7798221. PMID 33423694.
