Temporal difference learning

Temporal difference (TD) learning refers to a class of reinforcement learning methods that learn by bootstrapping from the current estimate of the value function: these methods sample from the environment, like Monte Carlo methods, but update the state value function based on current estimates, as in dynamic programming.

Unlike Monte Carlo methods, which adjust their estimates only once the final outcome is known, temporal difference methods adjust their predictions as they go, so that they match later, more accurate predictions about the future before the final outcome is available[1]. This is a form of bootstrapping, as the following example illustrates[1]:

"Suppose you wish to predict the weather for Saturday, and you have a model that predicts Saturday's weather given the weather of each day of the week. Normally, you would wait until Saturday and only then adjust all your models. However, when it is, for example, Friday, you should already have a good idea of what the weather will be on Saturday, and so be able to change, say, Saturday's model before Saturday arrives."

Temporal difference methods are closely related to the temporal difference model of animal learning[2][3][4][5][6].

Mathematical formulation

Tabular TD(0), one of the simplest TD methods, estimates the state value function of a finite-state Markov decision process (MDP) under a policy $\pi$. Let $V^\pi$ denote the state value function of an MDP with states $(s_t)_{t \in \mathbb{N}}$, rewards $(r_t)_{t \in \mathbb{N}}$ and discount factor $\gamma$ under the policy $\pi$:

$$V^\pi(s) = E_\pi\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right].$$

$V^\pi$ satisfies the Bellman equation:

$$V^\pi(s) = E_\pi\left[ r_0 + \gamma V^\pi(s_1) \,\middle|\, s_0 = s \right],$$

so $r_0 + \gamma V^\pi(s_1)$ is an unbiased estimator of $V^\pi(s)$. This observation motivates the following algorithm for estimating $V^\pi$. The algorithm initializes a table $V(s)$ with arbitrary values, one for each state of the MDP, and fixes a positive learning rate $\alpha$. The policy $\pi$ is then evaluated repeatedly and, once the reward $r$ has been obtained, the value function for the old state is updated using the following rule[7]:

$$V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right)$$

where $s$ and $s'$ denote the old and the new state, respectively.
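The tabular update can be illustrated with a minimal Python sketch. The environment is an assumption chosen for illustration and does not come from the text above: a five-state random walk in which the policy moves left or right with equal probability, leaving on the right yields reward 1 and leaving on the left yields reward 0. The function name `td0_random_walk` and its default parameters are likewise hypothetical.

```python
import random


def td0_random_walk(num_episodes=10000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate state values of a 5-state random walk with tabular TD(0).

    Assumed toy environment: states 0..4, every episode starts in the
    middle state, each step moves left or right with probability 1/2.
    Leaving on the right gives reward 1, leaving on the left gives 0.
    """
    rng = random.Random(seed)
    n_states = 5
    V = [0.5] * n_states  # arbitrary initial table, one value per state

    for _ in range(num_episodes):
        s = n_states // 2
        while True:
            s_next = s + rng.choice((-1, 1))  # follow the fixed policy
            if s_next < 0:                    # terminated on the left
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:          # terminated on the right
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False

            # TD(0) rule: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
            V[s] += alpha * (r + gamma * v_next - V[s])

            if done:
                break
            s = s_next
    return V


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

With `gamma = 1` the exact values in this toy problem are 1/6, 2/6, ..., 5/6, so the returned table can be compared against them. Because the learning rate is constant, the estimates keep fluctuating slightly around those values rather than converging exactly.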

TD-Lambda

TD-Lambda is a learning algorithm created by Richard S. Sutton, building on earlier work on temporal difference learning by Arthur Samuel[8]. The algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play backgammon at the level of expert human players[9].

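The parameter $\lambda$ takes values between 0 and 1. Increasing $\lambda$ gives more weight to rewards obtained in states farther from the current one: with $\lambda = 0$ only the one-step TD target is used, while $\lambda = 1$ makes the updates resemble those of Monte Carlo methods.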
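One common way to realize the role of the lambda parameter is with eligibility traces. The sketch below is an assumption rather than a description of Sutton's or Tesauro's implementations: it uses accumulating traces on a hypothetical five-state random walk (moves left or right with equal probability, reward 1 only when leaving on the right), and the name `td_lambda_random_walk` and its default parameters are chosen for illustration.

```python
import random


def td_lambda_random_walk(lam=0.8, num_episodes=20000, alpha=0.005,
                          gamma=1.0, seed=0):
    """Tabular TD(lambda) with accumulating eligibility traces.

    Assumed toy environment: states 0..4, start in the middle,
    reward 1 when leaving on the right, 0 otherwise.
    """
    rng = random.Random(seed)
    n_states = 5
    V = [0.5] * n_states  # arbitrary initial values

    for _ in range(num_episodes):
        e = [0.0] * n_states  # eligibility trace per state, reset each episode
        s = n_states // 2
        done = False
        while not done:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            v_next = 0.0 if done else V[s_next]

            delta = r + gamma * v_next - V[s]  # one-step TD error
            e[s] += 1.0                        # mark the visited state

            # Every state is updated in proportion to its trace, then the
            # traces decay by gamma * lambda. A larger lambda lets the TD
            # error reach states visited further in the past.
            for i in range(n_states):
                V[i] += alpha * delta * e[i]
                e[i] *= gamma * lam

            s = s_next
    return V


if __name__ == "__main__":
    for lam in (0.0, 0.8, 1.0):
        print(lam, [round(v, 2) for v in td_lambda_random_walk(lam=lam)])
```

With `lam=0.0` the traces vanish after each step and the procedure reduces to the one-step TD(0) update. With `lam=1.0` and `gamma=1.0` each visited state keeps receiving credit until the end of the episode, which is the Monte Carlo-like extreme.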

Notes

  1. ^ a b Richard Sutton, Learning to predict by the methods of temporal differences, in Machine Learning, vol. 3, no. 1, 1988, pp. 9–44, DOI:10.1007/BF00115009. (A revised version is available on Richard Sutton's publication page, archived 30 March 2017 at the Internet Archive.)
  2. ^ W. Schultz, P. Dayan and P. R. Montague, A neural substrate of prediction and reward, in Science, vol. 275, no. 5306, 1997, pp. 1593–1599, DOI:10.1126/science.275.5306.1593, PMID 9054347.
  3. ^ P. R. Montague, P. Dayan and T. J. Sejnowski, A framework for mesencephalic dopamine systems based on predictive Hebbian learning (PDF), in The Journal of Neuroscience, vol. 16, no. 5, 1 March 1996, pp. 1936–1947, DOI:10.1523/JNEUROSCI.16-05-01936.1996, PMID 8774460.
  4. ^ P. R. Montague, P. Dayan and S. J. Nowlan, Using aperiodic reinforcement for directed self-organization (PDF), in Advances in Neural Information Processing Systems, vol. 5, 1993, pp. 969–976.
  5. ^ P. R. Montague and T. J. Sejnowski, The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms, in Learning & Memory, vol. 1, no. 1, 1994, pp. 1–33, PMID 10467583.
  6. ^ T. J. Sejnowski, P. Dayan and P. R. Montague, Predictive hebbian learning, in Proceedings of Eighth ACM Conference on Computational Learning Theory, 1995, pp. 15–18, DOI:10.1145/230000/225300/p15-sejnowski.
  7. ^ Reinforcement learning: An introduction (PDF), p. 130. Retrieved 10 September 2019 (archived from the original on 12 July 2017).
  8. ^ Richard Sutton and Andrew Barto, Reinforcement Learning, MIT Press, 1998, ISBN 978-0-585-02445-5. Retrieved 10 September 2019 (archived from the original on 30 March 2017).
  9. ^ Gerald Tesauro, Temporal Difference Learning and TD-Gammon, in Communications of the ACM, vol. 38, no. 3, March 1995, pp. 58–68, DOI:10.1145/203330.203343. Retrieved 8 February 2010.
