A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which the system dynamics are assumed to be determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must rely on a sensor model (the probability distribution of different observations given the underlying state) together with the underlying MDP. Unlike an MDP policy, which maps underlying states to actions, a POMDP policy maps the history of observations (or belief states) to actions.
The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The general framework of Markov decision processes with imperfect information was described by Karl Johan Åström in 1965[1] in the case of a discrete state space, and it was further studied in the operations research community where the acronym POMDP was coined. It was later adapted for problems in artificial intelligence and automated planning by Leslie P. Kaelbling and Michael L. Littman.[2]
An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes the expected reward (or minimizes the cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.
Definition
Formal definition
A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple $(S, A, T, R, \Omega, O, \gamma)$, where
$S$ is a set of states,
$A$ is a set of actions,
$T$ is a set of conditional transition probabilities between states,
$R: S \times A \to \mathbb{R}$ is the reward function,
$\Omega$ is a set of observations,
$O$ is a set of conditional observation probabilities, and
$\gamma \in [0, 1)$ is the discount factor.
At each time period, the environment is in some state $s \in S$. The agent takes an action $a \in A$, which causes the environment to transition to state $s'$ with probability $T(s' \mid s, a)$. At the same time, the agent receives an observation $o \in \Omega$ which depends on the new state of the environment, $s'$, and on the just taken action, $a$, with probability $O(o \mid s', a)$ (or sometimes $O(o \mid s')$, depending on the sensor model). Finally, the agent receives a reward $r$ equal to $R(s, a)$. Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward: $E\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$, where $r_{t}$ is the reward earned at time $t$. The discount factor $\gamma$ determines how much immediate rewards are favored over more distant rewards. When $\gamma = 0$ the agent only cares about which action will yield the largest expected immediate reward; as $\gamma$ approaches 1 the agent cares about maximizing the expected sum of future rewards.
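The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the two-state model, its probabilities and rewards, and the names `step`, `sample` and `discounted_return` are all hypothetical and chosen only to make the tuple concrete.

```python
import random

# Hypothetical two-state POMDP; every number below is an illustrative assumption.
S = ["s0", "s1"]
A = ["stay", "switch"]
OMEGA = ["o0", "o1"]
GAMMA = 0.9

# T[s][a][s2] = T(s2 | s, a)
T = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "switch": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "switch": {"s0": 0.8, "s1": 0.2}},
}
# O[s2][a][o] = O(o | s2, a): the observation depends on the new state and the action.
O = {
    "s0": {"stay": {"o0": 0.85, "o1": 0.15}, "switch": {"o0": 0.85, "o1": 0.15}},
    "s1": {"stay": {"o0": 0.15, "o1": 0.85}, "switch": {"o0": 0.15, "o1": 0.85}},
}
# R[s][a] = R(s, a)
R = {"s0": {"stay": 0.0, "switch": -1.0}, "s1": {"stay": 1.0, "switch": -1.0}}


def sample(dist, rng):
    """Draw a key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[k] for k in outcomes])[0]


def step(s, a, rng):
    """One time period: returns (next state, observation, reward R(s, a))."""
    s_next = sample(T[s][a], rng)
    o = sample(O[s_next][a], rng)
    return s_next, o, R[s][a]


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

Note that the agent would only see the observation and the reward returned by `step`; the state is kept by the simulator. With `gamma = 0` only the first reward counts, which matches the remark about immediate rewards.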
Discussion
Because the agent does not directly observe the environment's state, it must make decisions under uncertainty about the true environment state. However, by interacting with the environment and receiving observations, the agent may update its belief in the true state by updating the probability distribution of the current state. A consequence of this property is that the optimal behavior may often include information-gathering actions that are taken purely because they improve the agent's estimate of the current state, thereby allowing it to make better decisions in the future.
It is instructive to compare the above definition with the definition of a Markov decision process. An MDP does not include the observation set, because the agent always knows with certainty the environment's current state. Alternatively, an MDP can be reformulated as a POMDP by setting the observation set to be equal to the set of states and defining the observation conditional probabilities to deterministically select the observation that corresponds to the true state.
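The reformulation of an MDP as a POMDP can be written down directly. The sketch below assumes observation probabilities are stored as nested dictionaries indexed by new state, action and observation; the function name and that layout are hypothetical conventions, not part of the definition.

```python
def identity_observation_model(states, actions):
    """Observation model with the observation set equal to the state set.

    The observation matching the true new state has probability 1,
    every other observation has probability 0.
    """
    return {
        s_next: {
            a: {o: (1.0 if o == s_next else 0.0) for o in states}
            for a in actions
        }
        for s_next in states
    }
```

Under such a model every observation reveals the state, so the agent's uncertainty about the current state disappears after each step.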
Belief update
After having taken the action $a$ and observing $o$, an agent needs to update its belief in the state the environment may (or may not) be in. Since the state is Markovian (by assumption), maintaining a belief over the states solely requires knowledge of the previous belief state, the action taken, and the current observation. The operation is denoted $b' = \tau(b, a, o)$. Below we describe how this belief update is computed.
After reaching $s'$, the agent observes $o \in \Omega$ with probability $O(o \mid s', a)$. Let $b$ be a probability distribution over the state space $S$. $b(s)$ denotes the probability that the environment is in state $s$. Given $b(s)$, then after taking action $a$ and observing $o$,
$$b'(s') = \eta \, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a) \, b(s)$$
where $\eta = 1 / \Pr(o \mid b, a)$ is a normalizing constant with $\Pr(o \mid b, a) = \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a) \, b(s)$.
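The belief update is a prediction step through the transition probabilities followed by a correction with the observation probabilities and a normalization. A minimal sketch, assuming beliefs and model tables are plain dictionaries (the function name, the table layout and the two-state listening model are illustrative assumptions):

```python
def belief_update(b, a, o, T, O):
    """Return (b', Pr(o | b, a)) for a belief b given as {state: probability}.

    T[s][a][s2] is the transition probability T(s2 | s, a) and
    O[s2][a][o] is the observation probability O(o | s2, a).
    """
    states = list(b)
    unnormalized = {}
    for s_next in states:
        # Prediction: probability of landing in s_next before seeing o.
        predicted = sum(T[s][a][s_next] * b[s] for s in states)
        # Correction: weight by the likelihood of the received observation.
        unnormalized[s_next] = O[s_next][a][o] * predicted
    pr_o = sum(unnormalized.values())
    if pr_o == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return {s_next: p / pr_o for s_next, p in unnormalized.items()}, pr_o


# Hypothetical model: "listen" leaves the state unchanged and reports it
# correctly with probability 0.85.
T = {
    "left": {"listen": {"left": 1.0, "right": 0.0}},
    "right": {"listen": {"left": 0.0, "right": 1.0}},
}
O = {
    "left": {"listen": {"hear_left": 0.85, "hear_right": 0.15}},
    "right": {"listen": {"hear_left": 0.15, "hear_right": 0.85}},
}
b0 = {"left": 0.5, "right": 0.5}
b1, pr_o = belief_update(b0, "listen", "hear_left", T, O)
```

Starting from the uniform belief, hearing `hear_left` once moves the belief in `left` to about 0.85, and the observation itself had probability of about 0.5.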
Belief MDP
A Markovian belief state allows a POMDP to be formulated as a Markov decision process where every belief is a state. The resulting belief MDP will thus be defined on a continuous state space, even if the "originating" POMDP has a finite number of states: there are infinitely many belief states in $B$ because there are infinitely many probability distributions over the states of $S$.[2]
Formally, the belief MDP is defined as a tuple $(B, A, \tau, r, \gamma)$ where
$B$ is the set of belief states over the POMDP states,
$A$ is the same finite set of actions as for the original POMDP,
$\tau$ is the belief state transition function,
$r: B \times A \to \mathbb{R}$ is the reward function on belief states,
$\gamma$ is the discount factor, equal to the $\gamma$ in the original POMDP.
Of these, $\tau$ and $r$ need to be derived from the original POMDP. $\tau$ is
$$\tau(b, a, b') = \sum_{o \in \Omega} \Pr(b' \mid b, a, o) \Pr(o \mid a, b),$$
where $\Pr(o \mid a, b)$ is the value derived in the previous section and
$$\Pr(b' \mid b, a, o) = \begin{cases} 1 & \text{if the belief update with arguments } b, a, o \text{ returns } b' \\ 0 & \text{otherwise.} \end{cases}$$
The belief MDP reward function ($r$) is the expected reward from the POMDP reward function over the belief state distribution:
$$r(b, a) = \sum_{s \in S} b(s) R(s, a).$$
The belief MDP is not partially observable anymore, since at any given time the agent knows its belief, and by extension the state of the belief MDP.
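The two derived quantities of the belief MDP can be computed directly from the POMDP tables. The sketch below is an illustration under assumptions: the dictionary layout, the function names and the small machine model are hypothetical. It returns the belief reward and, for a given belief and action, the finite list of successor beliefs together with their probabilities (one per observation that can occur).

```python
def expected_reward(b, a, R):
    """r(b, a): expectation of R(s, a) under the belief b."""
    return sum(b[s] * R[s][a] for s in b)


def successor_beliefs(b, a, T, O, observations):
    """List of (Pr(o | b, a), b') pairs for every observation with positive probability."""
    result = []
    for o in observations:
        unnorm = {
            s2: O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in b)
            for s2 in b
        }
        pr_o = sum(unnorm.values())
        if pr_o > 0.0:
            result.append((pr_o, {s2: p / pr_o for s2, p in unnorm.items()}))
    return result


# Hypothetical machine that may silently degrade from "good" to "bad".
T = {"good": {"wait": {"good": 0.9, "bad": 0.1}},
     "bad": {"wait": {"good": 0.0, "bad": 1.0}}}
O = {"good": {"wait": {"ok": 0.8, "alarm": 0.2}},
     "bad": {"wait": {"ok": 0.3, "alarm": 0.7}}}
R = {"good": {"wait": 1.0}, "bad": {"wait": -1.0}}

b = {"good": 0.5, "bad": 0.5}
reward = expected_reward(b, "wait", R)
successors = successor_beliefs(b, "wait", T, O, ["ok", "alarm"])
```

Although the belief space is continuous, each belief and action lead to at most as many successor beliefs as there are observations, which is what makes backups over beliefs computable.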
Policy and value function
Unlike the "originating" POMDP, in which an action may be available only from particular states, in the corresponding belief MDP all belief states allow all actions, since the agent (almost) always assigns some probability to being in any given originating state. As such, $\pi$ specifies an action $a = \pi(b)$ for any belief $b$.
Here it is assumed the objective is to maximize the expected total discounted reward over an infinite horizon. When $R$ defines a cost, the objective becomes the minimization of the expected cost.
The expected reward for policy $\pi$ starting from belief $b_0$ is defined as
$$V^{\pi}(b_0) = \sum_{t=0}^{\infty} \gamma^{t} E\left[R(s_t, a_t) \mid b_0, \pi\right]$$
where $\gamma < 1$ is the discount factor. The optimal policy $\pi^{*}$ is obtained by optimizing the long-term reward:
$$\pi^{*} = \underset{\pi}{\operatorname{argmax}} \; V^{\pi}(b_0)$$
where $b_0$ is the initial belief.
The optimal policy, denoted by $\pi^{*}$, yields the highest expected reward value for each belief state, compactly represented by the optimal value function $V^{*}$. This value function is the solution to the Bellman optimality equation:
$$V^{*}(b) = \max_{a \in A} \left[ r(b, a) + \gamma \sum_{o \in \Omega} \Pr(o \mid b, a) \, V^{*}(\tau(b, a, o)) \right]$$
For finite-horizon POMDPs, the optimal value function is piecewise-linear and convex.[3] It can be represented as a finite set of vectors. In the infinite-horizon formulation, a finite vector set can approximate $V^{*}$ arbitrarily closely while remaining convex. Value iteration applies a dynamic programming update to gradually improve the value until convergence to an $\epsilon$-optimal value function, and preserves its piecewise linearity and convexity.[4] By improving the value, the policy is implicitly improved. Another dynamic programming technique called policy iteration explicitly represents and improves the policy instead.[5][6]
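A piecewise-linear convex value function represented by a finite set of vectors can be evaluated as a maximum of dot products with the belief. The sketch below assumes each vector is tagged with the action it recommends; the three vectors and the action names are made up for illustration and are not the result of solving any particular model.

```python
def dot(u, v):
    """Dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical vectors over a two-state belief (b[0], b[1]), each tagged with an action.
ALPHA_VECTORS = [
    ("act_on_state_0", [1.0, -1.0]),
    ("act_on_state_1", [-1.0, 1.0]),
    ("gather_information", [0.2, 0.2]),
]


def value(b, alpha_vectors=ALPHA_VECTORS):
    """V(b) = max over vectors of alpha . b (piecewise linear and convex in b)."""
    return max(dot(alpha, b) for _, alpha in alpha_vectors)


def best_action(b, alpha_vectors=ALPHA_VECTORS):
    """Action attached to the vector that attains the maximum at b."""
    action, _ = max(alpha_vectors, key=lambda pair: dot(pair[1], b))
    return action
```

With these vectors the uncertain belief `[0.5, 0.5]` selects `gather_information`, while the confident belief `[0.9, 0.1]` selects `act_on_state_0`. Because the value is a maximum of linear functions, its value at a mixture of two beliefs never exceeds the same mixture of their values.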
Approximate POMDP solutions
In practice, POMDPs are often computationally intractable to solve exactly. This intractability is often due to the curse of dimensionality or the curse of history (the fact that optimal policies may depend on the entire history of actions and observations). To address these issues, computer scientists have developed methods that approximate solutions for POMDPs. These solutions typically attempt to approximate the problem or solution with a limited number of parameters,[7] plan only over a small part of the belief space online, or summarize the action-observation history compactly.
Grid-based algorithms[8] comprise one approximate solution technique. In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action to take for other belief states that are encountered which are not in the set of grid points. More recent work makes use of sampling techniques, generalization techniques and exploitation of problem structure, and has extended POMDP solving into large domains with millions of states.[9][10] For example, adaptive grids and point-based methods sample random reachable belief points to constrain the planning to relevant areas in the belief space.[11][12]
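The grid idea can be illustrated on a two-state problem, where a belief is a single number `p` (the probability of state 0). The sketch below is a simplified fixed regular grid with linear interpolation, not the specific algorithm of the cited work; the machine model, its numbers and all function names are assumptions made for the example.

```python
GAMMA = 0.9
STATES = (0, 1)            # 0 = good, 1 = bad (hypothetical machine)
ACTIONS = ("wait", "reset")
OBS = ("ok", "alarm")
# T[a][s][s2], O[a][s2][o], R[a][s]; all numbers are illustrative.
T = {"wait": [[0.9, 0.1], [0.0, 1.0]], "reset": [[1.0, 0.0], [1.0, 0.0]]}
O = {"wait": [[0.8, 0.2], [0.3, 0.7]], "reset": [[0.8, 0.2], [0.3, 0.7]]}
R = {"wait": [1.0, -1.0], "reset": [-0.5, -0.5]}


def update(p, a, o):
    """Belief update for the belief (p, 1 - p); returns (Pr(o | b, a), new p)."""
    b = (p, 1.0 - p)
    unnorm = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in STATES) for s2 in STATES]
    pr_o = sum(unnorm)
    return pr_o, (unnorm[0] / pr_o if pr_o > 0.0 else p)


def interpolate(values, p):
    """Linear interpolation of the grid values at belief p in [0, 1]."""
    n = len(values) - 1
    x = min(max(p, 0.0), 1.0) * n
    i = min(int(x), n - 1)
    w = x - i
    return (1.0 - w) * values[i] + w * values[i + 1]


def q_value(values, p, a):
    """One-step lookahead value of action a at belief p, using interpolated values."""
    b = (p, 1.0 - p)
    q = sum(b[s] * R[a][s] for s in STATES)
    for o in range(len(OBS)):
        pr_o, p_next = update(p, a, o)
        q += GAMMA * pr_o * interpolate(values, p_next)
    return q


def grid_value_iteration(n_points=11, sweeps=200):
    """Repeated Bellman backups at the grid points only."""
    grid = [i / (n_points - 1) for i in range(n_points)]
    values = [0.0] * n_points
    for _ in range(sweeps):
        values = [max(q_value(values, p, a) for a in ACTIONS) for p in grid]
    return grid, values


def greedy_action(values, p):
    """Action for an arbitrary belief p, including beliefs that are not grid points."""
    return max(ACTIONS, key=lambda a: q_value(values, p, a))
```

After `grid, values = grid_value_iteration()`, `greedy_action(values, p)` can be queried for any `p`. In this toy model it chooses `wait` when the machine is certainly good and `reset` when it is certainly bad.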
Dimensionality reduction using PCA has also been explored.[13]
Online planning algorithms approach large POMDPs by constructing a new policy for the current belief each time a new observation is received. Such a policy only needs to consider future beliefs reachable from the current belief, which is often only a very small part of the full belief space. This family includes variants of Monte Carlo tree search[14] and heuristic search.[15] Similar to MDPs, it is possible to construct online algorithms that find arbitrarily near-optimal policies and have no direct computational complexity dependence on the size of the state and observation spaces.[16]
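The idea of planning only from the current belief can be shown with a plain depth-limited expectimax search. This is a deliberately simple stand-in, not Monte Carlo tree search or any of the cited algorithms; the two-state machine model, its numbers and the function names are illustrative assumptions.

```python
GAMMA = 0.9
STATES = (0, 1)            # 0 = good, 1 = bad (hypothetical machine)
ACTIONS = ("wait", "reset")
N_OBS = 2                  # 0 = ok, 1 = alarm
# T[a][s][s2], O[a][s2][o], R[a][s]; all numbers are illustrative.
T = {"wait": [[0.9, 0.1], [0.0, 1.0]], "reset": [[1.0, 0.0], [1.0, 0.0]]}
O = {"wait": [[0.8, 0.2], [0.3, 0.7]], "reset": [[0.8, 0.2], [0.3, 0.7]]}
R = {"wait": [1.0, -1.0], "reset": [-0.5, -0.5]}


def update(p, a, o):
    """Belief update for the belief (p, 1 - p); returns (Pr(o | b, a), new p)."""
    b = (p, 1.0 - p)
    unnorm = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in STATES) for s2 in STATES]
    pr_o = sum(unnorm)
    return pr_o, (unnorm[0] / pr_o if pr_o > 0.0 else p)


def lookahead(p, depth):
    """Return (value, best action) of a depth-limited search rooted at belief p.

    Only beliefs reachable from p within `depth` steps are ever visited.
    """
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in ACTIONS:
        q = p * R[a][0] + (1.0 - p) * R[a][1]
        for o in range(N_OBS):
            pr_o, p_next = update(p, a, o)
            if pr_o > 0.0:
                q += GAMMA * pr_o * lookahead(p_next, depth - 1)[0]
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action
```

An online agent would call `lookahead` on its current belief, execute the returned action, update the belief with the received observation and search again. The cost grows exponentially with `depth`, which is why practical planners sample or prune the tree.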
Another line of approximate solution techniques for solving POMDPs relies on using (a subset of) the history of previous observations, actions and rewards up to the current time step as a pseudo-state. Usual techniques for solving MDPs based on these pseudo-states can then be used (e.g. Q-learning). Ideally the pseudo-states should contain the most important information from the whole history (to reduce bias) while being as compressed as possible (to reduce overfitting).[17]
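A minimal sketch of the pseudo-state idea, assuming the pseudo-state is the window of the last `k` action-observation pairs and the learner is tabular Q-learning. The environment `machine_step`, the hyperparameters and all names are hypothetical; rewards are left out of the window for brevity.

```python
import random
from collections import defaultdict, deque


def machine_step(s, a, rng):
    """Hypothetical hidden-state machine: state 0 = good, 1 = bad."""
    reward = -0.5 if a == "reset" else (1.0 if s == 0 else -1.0)
    if a == "reset":
        s_next = 0
    else:
        s_next = 1 if (s == 1 or rng.random() < 0.1) else 0
    p_alarm = 0.2 if s_next == 0 else 0.7
    o = "alarm" if rng.random() < p_alarm else "ok"
    return s_next, o, reward


def q_learning_with_history(env_step, initial_state, actions, k=2, steps=5000,
                            alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning whose 'state' is the last k (action, observation) pairs."""
    rng = random.Random(seed)
    q = defaultdict(float)          # keyed by (pseudo_state, action)
    history = deque(maxlen=k)       # sliding window over the recent history
    s = initial_state               # hidden from the learner; used only by the simulator
    for _ in range(steps):
        h = tuple(history)
        if rng.random() < epsilon:
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda x: q[(h, x)])
        s, o, r = env_step(s, a, rng)
        history.append((a, o))
        h_next = tuple(history)
        target = r + gamma * max(q[(h_next, x)] for x in actions)
        q[(h, a)] += alpha * (target - q[(h, a)])
    return q
```

A larger `k` keeps more of the history and so loses less information, at the price of a table that grows exponentially in `k`, which reflects the trade-off between bias and overfitting mentioned above.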
POMDP theory
Planning in POMDPs is undecidable in general. However, some settings have been identified as decidable (see Table 2 in [18]). Different objectives have been considered. Büchi objectives are defined by Büchi automata. Reachability is an example of a Büchi condition (for instance, reaching a good state in which all robots are home). coBüchi objectives correspond to traces that do not satisfy a given Büchi condition (for instance, not reaching a bad state in which some robot died). Parity objectives are defined via parity games; they make it possible to define complex objectives, such as reaching a good state every 10 timesteps. The objective can be satisfied:
almost surely, that is, the probability of satisfying the objective is 1;
positively, that is, the probability of satisfying the objective is strictly greater than 0;
quantitatively, that is, the probability of satisfying the objective is greater than a given threshold.
The finite-memory case, in which the agent is a finite-state machine, is also distinguished from the general case, in which the agent has infinite memory.
Applications
POMDPs can be used to model many kinds of real-world problems. Notable applications include the use of a POMDP in management of patients with ischemic heart disease,[19] assistive technology for persons with dementia,[9][10] the conservation of the critically endangered and difficult to detect Sumatran tigers[20] and aircraft collision avoidance.[21]
One application is a teaching case, a crying baby problem, where a parent needs to sequentially decide whether to feed the baby based on the observation of whether the baby is crying or not, which is an imperfect representation of the actual baby's state of hunger.[22][23]
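A sketch of how the parent's belief could be tracked in such a problem is given below. The source gives no numbers, so every probability here is an assumption chosen for illustration, as are the action names `feed` and `ignore` and the function names.

```python
STATES = ("hungry", "sated")

# T[action][s][s2]: feeding always sates the baby; an ignored sated baby
# becomes hungry with probability 0.1 (assumed values).
T = {
    "feed": {"hungry": {"hungry": 0.0, "sated": 1.0},
             "sated": {"hungry": 0.0, "sated": 1.0}},
    "ignore": {"hungry": {"hungry": 1.0, "sated": 0.0},
               "sated": {"hungry": 0.1, "sated": 0.9}},
}
# Probability that the baby cries in each (new) state (assumed values).
P_CRY = {"hungry": 0.8, "sated": 0.1}


def update(b_hungry, action, crying):
    """New probability that the baby is hungry after one action and one observation."""
    b = {"hungry": b_hungry, "sated": 1.0 - b_hungry}
    unnorm = {}
    for s2 in STATES:
        predicted = sum(T[action][s][s2] * b[s] for s in STATES)
        likelihood = P_CRY[s2] if crying else 1.0 - P_CRY[s2]
        unnorm[s2] = likelihood * predicted
    return unnorm["hungry"] / sum(unnorm.values())


def track(b_hungry, episode):
    """Fold the update over a sequence of (action, crying) pairs."""
    for action, crying in episode:
        b_hungry = update(b_hungry, action, crying)
    return b_hungry
```

Under these assumed numbers, ignoring the baby and hearing it cry raises the belief in hunger from 0.5 to roughly 0.91, while feeding drives it to 0 regardless of what is heard, since crying is only an imperfect signal of the hidden state.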
References
Smallwood, R.D., Sondik, E.J. (1973). "The optimal control of partially observable Markov decision processes over a finite horizon". Operations Research. 21 (5): 1071–88. doi:10.1287/opre.21.5.1071.
Sondik, E.J. (1978). "The optimal control of partially observable Markov processes over the infinite horizon: discounted cost". Operations Research. 26 (2): 282–304. doi:10.1287/opre.26.2.282.
Hansen, E. (1998). "Solving POMDPs by searching in policy space". Proceedings of the Fourteenth International Conference on Uncertainty In Artificial Intelligence (UAI-98). arXiv:1301.7380.
Lovejoy, W. (1991). "Computationally feasible bounds for partially observed Markov decision processes". Operations Research. 39: 162–175. doi:10.1287/opre.39.1.162.
Jesse Hoey; Axel von Bertoldi; Pascal Poupart; Alex Mihailidis (2007). "Assisting Persons with Dementia during Handwashing Using a Partially Observable Markov Decision Process". Proceedings of the International Conference on Computer Vision Systems. doi:10.2390/biecoll-icvs2007-89.
Jesse Hoey; Pascal Poupart; Axel von Bertoldi; Tammy Craig; Craig Boutilier; Alex Mihailidis (2010). "Automated Handwashing Assistance For Persons With Dementia Using Video and a Partially Observable Markov Decision Process". Computer Vision and Image Understanding. 114 (5): 503–519. CiteSeerX 10.1.1.160.8351. doi:10.1016/j.cviu.2009.06.008.
Hauskrecht, M. (1997). "Incremental methods for computing bounds in partially observable Markov decision processes". Proceedings of the 14th National Conference on Artificial Intelligence (AAAI). Providence, RI. pp. 734–739. CiteSeerX 10.1.1.85.8303.