Information gain ratio

In decision tree learning, information gain ratio is the ratio of information gain to the intrinsic information. It was proposed by Ross Quinlan[1] to reduce a bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute.[2]

Information gain is also known as mutual information.[3]

The image shows the information gain of a variable called "year" for year values 1 through 12. Information gain would favor this variable, because every branch ends in a definitely positive or definitely negative result while also creating many leaf nodes. The problem is that none of these years will occur again: the next input would be year 13, for which the tree has no branch. Information gain ratio addresses this problem by normalizing the information gain by the entropy of the variable itself, which removes the bias toward variables with many distinct values relative to variables with fewer. This makes a tree like the one in the image much less likely to be built.

Information gain calculation

Information gain is the reduction in entropy produced from partitioning a set with attributes and finding the optimal candidate that produces the highest value:

$$ IG(T, a) = H(T) - H(T \mid a) $$

where $T$ is a random variable and $H(T \mid a)$ is the entropy of $T$ given the value of attribute $a$.

The information gain for an attribute equals the total entropy if each of the attribute's values determines a unique classification of the result attribute. In this case the conditional entropies subtracted from the total entropy are all 0.
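
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `entropy` and `information_gain` and the toy data are assumptions introduced here, and entropy is measured in bits (base-2 logarithm).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Total entropy of the labels minus the entropy that remains
    after grouping the labels by the attribute's values."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: the attribute separates the classes perfectly,
# so the gain equals the total entropy of the labels.
values = ["a", "a", "b", "b"]
labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                   # 1.0
print(information_gain(values, labels))  # 1.0
```

The toy case matches the special case described above: every attribute value maps to a single class, so nothing is subtracted from the total entropy.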

Split information calculation

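The split information value for a test is defined as follows:

$$ \mathrm{SplitInformation}(X) = -\sum_{i=1}^{n} \frac{N(x_i)}{N(x)} \log_2 \frac{N(x_i)}{N(x)} $$

where $X$ is a discrete random variable with possible values $x_1, \ldots, x_n$ and $N(x_i)/N(x)$ is the number of times that $x_i$ occurs divided by the total count of events $N(x)$, where $x$ is the set of events.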

The split information value is a positive number that describes the potential worth of splitting a branch from a node. This in turn is the intrinsic value that the random variable possesses and will be used to remove the bias in the information gain ratio calculation.
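
As a rough sketch, split information is the entropy of the attribute's own value counts. The function name `split_information` and the toy inputs below are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def split_information(values):
    """Entropy, in bits, of how the events are distributed over the
    distinct values of one attribute."""
    total = len(values)
    return -sum(n / total * log2(n / total) for n in Counter(values).values())

print(split_information(["a", "a", "b", "b"]))  # 1.0 (two equal branches)
print(split_information(["a", "b", "c", "d"]))  # 2.0 (four equal branches)
```

More branches of comparable size give a larger value, which is what lets the ratio penalize attributes that fragment the data.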

Information gain ratio calculation

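The information gain ratio is the ratio between the information gain and the split information value:

$$ IGR(T, a) = \frac{IG(T, a)}{\mathrm{SplitInformation}(a)} $$

where $T$ is the target variable and the split information is computed over the values of the attribute $a$.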
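
In code, the ratio can be sketched as below. The names `entropy` and `gain_ratio` and the toy data are assumptions for illustration; returning 0 when the split information is 0 (an attribute with a single value) is a convention chosen here, not something the text specifies.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of labels given values, divided by split information."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split = entropy(values)  # split information of the attribute
    # Assumption: a single-valued attribute (split information 0) scores 0.
    return gain / split if split > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
binary = ["a", "a", "b", "b"]  # two branches, perfectly predictive
unique = ["w", "x", "y", "z"]  # one branch per row, also "perfect"
print(gain_ratio(binary, labels))  # 1.0
print(gain_ratio(unique, labels))  # 0.5
```

Both attributes have the same information gain (1 bit), but the one that creates a branch per row is scored lower because its split information is twice as large.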

Example

The table below was created from weather data published by Fordham University:[4]

WEKA weather data

| Outlook | Temperature | Humidity | Wind | Play |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rainy | Mild | High | False | Yes |
| Rainy | Cool | Normal | False | Yes |
| Rainy | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rainy | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rainy | Mild | High | True | No |

Using the table above, one can find the entropy, information gain, split information, and information gain ratio for each variable (outlook, temperature, humidity, and wind). These calculations are shown in the tables below:

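Outlook table

| Outlook | Yes | No | Count of each group | Entropy |
| --- | --- | --- | --- | --- |
| sunny | 2 | 3 | 5 | 0.971 |
| overcast | 4 | 0 | 4 | 0.000 |
| rainy | 3 | 2 | 5 | 0.971 |

| Results | Values |
| --- | --- |
| Information | 0.694 |
| Overall entropy | 0.940 |
| Information gain | 0.247 |
| Split information | 1.577 |
| Gain ratio | 0.156 |
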
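Temperature table

| Temperature | Yes | No | Count of each group | Entropy |
| --- | --- | --- | --- | --- |
| hot | 2 | 2 | 4 | 1.000 |
| mild | 4 | 2 | 6 | 0.918 |
| cool | 3 | 1 | 4 | 0.811 |

| Results | Values |
| --- | --- |
| Information | 0.911 |
| Overall entropy | 0.940 |
| Information gain | 0.029 |
| Split information | 1.557 |
| Gain ratio | 0.019 |
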
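Wind table

| Wind | Yes | No | Count of each group | Entropy |
| --- | --- | --- | --- | --- |
| False | 6 | 2 | 8 | 0.811 |
| True | 3 | 3 | 6 | 1.000 |

| Results | Values |
| --- | --- |
| Information | 0.892 |
| Overall entropy | 0.940 |
| Information gain | 0.048 |
| Split information | 0.985 |
| Gain ratio | 0.049 |
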
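Humidity table

| Humidity | Yes | No | Count of each group | Entropy |
| --- | --- | --- | --- | --- |
| High | 3 | 4 | 7 | 0.985 |
| Normal | 6 | 1 | 7 | 0.592 |

| Results | Values |
| --- | --- |
| Information | 0.788 |
| Overall entropy | 0.940 |
| Information gain | 0.152 |
| Split information | 1.000 |
| Gain ratio | 0.152 |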
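
The figures in these tables can be checked with a short script. This is a sketch under stated assumptions: the rows are transcribed from the weather data, with Wind = True for the Sunny/Mild/Normal row (which is what the Wind table's counts of 8 False and 6 True imply), and the names `entropy` and `attribute_stats` are introduced here for illustration.

```python
from collections import Counter
from math import log2

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
RAW = """
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
"""
rows = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]

def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())

def attribute_stats(rows, attr, target="Play"):
    """Return the 'Results' figures for one attribute."""
    labels = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    information = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy(labels) - information
    split = entropy([r[attr] for r in rows])  # split information
    return {"information": information, "gain": gain,
            "split": split, "ratio": gain / split}

stats = {attr: attribute_stats(rows, attr) for attr in COLUMNS[:-1]}
for attr, s in stats.items():
    print(attr, {k: round(v, 3) for k, v in s.items()})
best = max(stats, key=lambda a: stats[a]["ratio"])
print("highest gain ratio:", best)
```

Rounded to three decimals, the printed values should agree with the "Results" rows of the four tables.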

Using the above tables, one can deduce that Outlook has the highest information gain ratio. Next, one must find the statistics for the sub-groups of the Outlook variable (sunny, overcast, and rainy). For this example, only the sunny branch is built (as shown in the table below):

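Outlook table (sunny rows)

| Outlook | Temperature | Humidity | Wind | Play |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |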

One can find the following statistics for the other variables (temperature, humidity, and wind) to see which has the greatest effect within the sunny subset of the outlook variable:

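Temperature table

| Temperature | Yes | No | Count of each group | Entropy |
| --- | --- | --- | --- | --- |
| Hot | 0 | 2 | 2 | 0.000 |
| Mild | 1 | 1 | 2 | 1.000 |
| Cool | 1 | 0 | 1 | 0.000 |

| Results | Values |
| --- | --- |
| Information | 0.400 |
| Overall entropy | 0.971 |
| Gain | 0.571 |
| Split information | 1.522 |
| Gain ratio | 0.375 |
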
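Wind table

| Wind | Yes | No | Count of each group | Entropy |
| --- | --- | --- | --- | --- |
| False | 1 | 2 | 3 | 0.918 |
| True | 1 | 1 | 2 | 1.000 |

| Results | Values |
| --- | --- |
| Information | 0.951 |
| Overall entropy | 0.971 |
| Gain | 0.020 |
| Split information | 0.971 |
| Gain ratio | 0.021 |
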
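Humidity table

| Humidity | Yes | No | Count of each group | Entropy |
| --- | --- | --- | --- | --- |
| High | 0 | 3 | 3 | 0.000 |
| Normal | 2 | 0 | 2 | 0.000 |

| Results | Values |
| --- | --- |
| Information | 0.000 |
| Overall entropy | 0.971 |
| Gain | 0.971 |
| Split information | 0.971 |
| Gain ratio | 1.000 |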
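
The same calculation can be repeated on the five sunny rows alone. The sketch below hard-codes those rows as they appear in the sunny table; the names `entropy` and `gain_ratio` are assumptions made for illustration.

```python
from collections import Counter
from math import log2

COLUMNS = ["Temperature", "Humidity", "Wind", "Play"]
SUNNY = [
    ("Hot", "High", "False", "No"),
    ("Hot", "High", "True", "No"),
    ("Mild", "High", "False", "No"),
    ("Cool", "Normal", "False", "Yes"),
    ("Mild", "Normal", "True", "Yes"),
]
rows = [dict(zip(COLUMNS, r)) for r in SUNNY]

def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())

def gain_ratio(rows, attr, target="Play"):
    """Information gain of the target given attr, divided by split information."""
    labels = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split = entropy([r[attr] for r in rows])
    return (entropy(labels) - remainder) / split

ratios = {attr: gain_ratio(rows, attr) for attr in COLUMNS[:-1]}
for attr, value in ratios.items():
    print(attr, round(value, 3))
```

Humidity should come out on top with a ratio of 1, because both of its groups are pure within the sunny subset.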

Humidity was found to have the highest information gain ratio. One will repeat the same steps as before and find the statistics for the events of the Humidity variable (high and normal):

Humidity-high table

| Humidity | Wind | Play |
| --- | --- | --- |
| High | False | No |
| High | True | No |
| High | False | No |

Humidity-normal table

| Humidity | Wind | Play |
| --- | --- | --- |
| Normal | False | Yes |
| Normal | True | Yes |

Since the Play values in each group are either all "No" or all "Yes", both groups are pure, which is why the information gain ratio of Humidity equals 1 on the sunny subset. No further split is needed along this path, so one can now build an entire root-to-leaf branch of the decision tree.

Once this leaf node has been reached, one would follow the same procedure for the remaining elements that have yet to be split in the decision tree. This data set is relatively small; with a larger set, the advantages of using the information gain ratio as the splitting criterion would be more apparent.
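
The whole procedure can be sketched as a small recursive function. This is a simplified illustration, not a full decision tree learner: it handles only nominal attributes, does no pruning, and stops when a subset is pure or no attributes remain. The names `build_tree` and `gain_ratio`, the nested-dictionary tree format, and the zero-split convention are assumptions made here; the rows use Wind = True for the Sunny/Mild/Normal row, consistent with the Wind counts in the tables.

```python
from collections import Counter
from math import log2

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
RAW = """
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
"""
rows = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]

def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())

def gain_ratio(rows, attr, target):
    labels = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split = entropy([r[attr] for r in rows])
    # Assumption: an attribute with one value in this subset scores 0.
    return (entropy(labels) - remainder) / split if split > 0 else 0.0

def build_tree(rows, attrs, target="Play"):
    """Return a label (leaf) or {'attribute': ..., 'branches': {...}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # majority label
    best = max(attrs, key=lambda a: gain_ratio(rows, a, target))
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, rest, target)
    return {"attribute": best, "branches": branches}

tree = build_tree(rows, COLUMNS[:-1])
print(tree)
```

On this data the sketch splits on Outlook at the root and on Humidity under the sunny branch, matching the worked example.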

Advantages

Information gain ratio biases the decision tree against considering attributes with a large number of distinct values.

For example, suppose that we are building a decision tree for some data describing a business's customers. Information gain ratio is used to decide which of the attributes are the most relevant. These will be tested near the root of the tree. One of the input attributes might be the customer's telephone number. This attribute has a high information gain, because it uniquely identifies each customer. However, because it has so many distinct values, its split information is also high, which lowers its gain ratio and makes it unlikely to be chosen for testing near the root.
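
The effect can be illustrated with made-up customer data. Everything in this sketch is hypothetical: the 16 customers, the `phone`, `plan`, and `churn` columns, and the function names. Whether the identifier-like attribute actually loses depends on the data; here it does.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())

def information_gain(values, labels):
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    return information_gain(values, labels) / entropy(values)

# Hypothetical data: 16 customers, each with a unique phone number,
# and a two-valued "plan" attribute that predicts churn fairly well.
phone = [f"555-{i:04d}" for i in range(16)]
plan = ["basic"] * 8 + ["premium"] * 8
churn = ["no"] * 7 + ["yes"] + ["yes"] * 7 + ["no"]

for name, column in [("phone", phone), ("plan", plan)]:
    print(name,
          round(information_gain(column, churn), 3),
          round(gain_ratio(column, churn), 3))
# phone 1.0 0.25
# plan 0.456 0.456
```

Information gain ranks the phone number first, while the gain ratio ranks the plan first, because the phone number's split information is 4 bits compared with 1 bit for the plan.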

Disadvantages

Although information gain ratio solves the key problem of information gain, it creates another problem. Attributes with a high number of distinct values are penalized, so they tend to be ranked below attributes with fewer distinct values, even when they are genuinely informative.

Difference from information gain

  • Information gain's shortcoming is that it does not distinguish numerically between attributes with many distinct values and attributes with fewer.
    • Example: Suppose that we are building a decision tree for some data describing a business's customers. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high information gain, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before.
  • Information gain ratio's strength is that it is biased towards attributes with a lower number of distinct values.
  • Below is a table describing the differences of information gain and information gain ratio when put in certain scenarios.
Situational differences between information gain and information gain ratio

| Information gain | Information gain ratio |
| --- | --- |
| Does not take the number of distinct values into account when ranking attributes | Favors attributes that have a lower number of distinct values |
| When applied to attributes that can take on a large number of distinct values, this technique might learn the training set too well | May undervalue attributes that genuinely require a high number of distinct values |

References

  1. Quinlan, J. R. (1986). "Induction of decision trees". Machine Learning. 1: 81–106. doi:10.1007/BF00116251.
  2. http://www.ke.tu-darmstadt.de/lehre/archiv/ws0809/mldm/dt.pdf Archived 2014-12-28 at the Wayback Machine.
  3. "Information gain, mutual information and related measures".
  4. https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff
