The name "integral probability metric" was given by German statistician Alfred Müller;[1] the distances had also previously been called "metrics with a ζ-structure."[2]
Definition
Integral probability metrics (IPMs) are distances on the space of distributions over a set $\mathcal{X}$, defined by a class $\mathcal{F}$ of real-valued functions on $\mathcal{X}$ as

$$D_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right| = \sup_{f \in \mathcal{F}} \left| P f - Q f \right|;$$

here the notation $Pf$ refers to the expectation of $f$ under the distribution $P$. The absolute value in the definition is unnecessary, and often omitted, for the usual case where for every $f \in \mathcal{F}$ its negation $-f$ is also in $\mathcal{F}$.
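As a toy illustration of the definition (not drawn from the cited sources, and using a hypothetical, finite function class), the supremum becomes a simple maximum when the distributions are discrete and $\mathcal{F}$ is enumerated explicitly:

import numpy as np

# Two distributions P and Q on the finite support {0, 1, 2}.
support = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# A toy function class F, each f represented by its values on the support.
# (Purely illustrative; real IPMs use classes such as Lipschitz or RKHS balls.)
F = [support, support ** 2, np.sin(support)]

# D_F(P, Q) = sup_{f in F} |P f - Q f|, where P f = sum_x P(x) f(x).
d = max(abs(p @ f - q @ f) for f in F)
print(d)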
The functions $f$ being optimized over are sometimes called "critic" functions;[3] if a particular $f^*$ achieves the supremum, it is often termed a "witness function"[4] (it "witnesses" the difference in the distributions). These functions try to have large values for samples from $P$ and small (likely negative) values for samples from $Q$; this can be thought of as a weaker version of classifiers, and indeed IPMs can be interpreted as the optimal risk of a particular classifier.[5]: sec. 4
The choice of $\mathcal{F}$ determines the particular distance; more than one $\mathcal{F}$ can generate the same distance.[1]
For any choice of $\mathcal{F}$, $D_{\mathcal{F}}$ satisfies all the definitions of a metric except that we may have $D_{\mathcal{F}}(P, Q) = 0$ for some $P \ne Q$; this is variously termed a "pseudometric" or a "semimetric" depending on the community. For instance, using the class which only contains the zero function, $D_{\mathcal{F}}$ is identically zero. $D_{\mathcal{F}}$ is a metric if and only if $\mathcal{F}$ separates points on the space of probability distributions, i.e. for any $P \ne Q$ there is some $f \in \mathcal{F}$ such that $Pf \ne Qf$;[1] most, but not all, common particular cases satisfy this property.
Examples
All of these examples are metrics except when noted otherwise.
The energy distance, as a special case of the maximum mean discrepancy,[7] is generated by the unit ball in a particular reproducing kernel Hilbert space.
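For orientation, a standard identity from the kernel literature: when $\mathcal{F}$ is the unit ball of a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k$, the supremum is attained by the (normalized) difference of kernel mean embeddings, and the maximum mean discrepancy has the closed form

$$\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}, \qquad \mathrm{MMD}^2(P, Q) = \mathbb{E}\,k(X, X') - 2\,\mathbb{E}\,k(X, Y) + \mathbb{E}\,k(Y, Y'),$$

where $\mu_P = \mathbb{E}_{X \sim P}\, k(\cdot, X)$ and $X, X' \sim P$, $Y, Y' \sim Q$ are independent.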
The f-divergences are probably the best-known way to measure dissimilarity of probability distributions. It has been shown[5]: sec. 2 that the only functions which are both IPMs and f-divergences are of the form $c \, \mathrm{TV}(P, Q)$, where $c \in [0, \infty]$ and $\mathrm{TV}$ is the total variation distance between distributions.
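To fix conventions (a standard definition, not spelled out above): for a convex $f$ with $f(1) = 0$, the f-divergence and the total variation distance can be written as

$$D_f(P \,\|\, Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ, \qquad \mathrm{TV}(P, Q) = \sup_{A} |P(A) - Q(A)| = \tfrac{1}{2}\int |dP - dQ|,$$

and TV itself sits in the intersection: it is an IPM generated by $\{f : \|f\|_\infty \le \tfrac{1}{2}\}$ and an f-divergence with $f(t) = \tfrac{1}{2}|t - 1|$ (scaling conventions for TV vary by a factor of 2).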
One major difference between f-divergences and most IPMs is that when $P$ and $Q$ have disjoint support, all f-divergences take on a constant value;[17] by contrast, IPMs where functions in $\mathcal{F}$ are "smooth" can give "partial credit." For instance, consider the sequence $\delta_{1/n}$ of Dirac measures at $1/n$; this sequence converges in distribution to $\delta_0$, and many IPMs satisfy $D_{\mathcal{F}}(\delta_{1/n}, \delta_0) \to 0$, but no nonzero f-divergence can satisfy this. That is, many IPMs are continuous in weaker topologies than f-divergences. This property is sometimes of substantial importance,[18] although other options also exist, such as considering f-divergences between distributions convolved with continuous noise.[18][19]
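To make the Dirac example concrete, take $\mathcal{F}$ to be the 1-Lipschitz functions (the class generating the Wasserstein-1, or Kantorovich, distance):

$$\sup_{\|f\|_{\mathrm{Lip}} \le 1} \bigl| f(\tfrac{1}{n}) - f(0) \bigr| = \frac{1}{n} \to 0, \qquad \text{whereas} \qquad \mathrm{TV}(\delta_{1/n}, \delta_0) = 1 \ \text{for every } n.$$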
Estimation from samples
Because IPM values between discrete distributions are often sensible, it is often reasonable to estimate $D_{\mathcal{F}}(P, Q)$ using a simple "plug-in" estimator, $D_{\mathcal{F}}(\hat{P}, \hat{Q})$, where $\hat{P}$ and $\hat{Q}$ are empirical measures of sample sets. These empirical distances can be computed exactly for some classes $\mathcal{F}$;[5] estimation quality varies depending on the distance, but can be minimax-optimal in certain settings.[14][20][21]
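A minimal sketch of such a plug-in estimate, in the exactly computable case of the maximum mean discrepancy with a Gaussian kernel (the kernel choice and bandwidth here are illustrative assumptions, not prescribed by the sources):

import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Gaussian (RBF) kernel matrix between two batches of 1-D samples.
    sq_dists = (x[:, None] - y[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_plugin(x, y, bandwidth=1.0):
    # Plug-in (biased) estimate of MMD(P, Q) using empirical measures:
    # MMD^2 = E k(X, X') - 2 E k(X, Y) + E k(Y, Y'), with each expectation
    # replaced by an average over sample pairs.
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return np.sqrt(max(k_xx + k_yy - 2.0 * k_xy, 0.0))

rng = np.random.default_rng(0)
print(mmd_plugin(rng.normal(0.0, 1.0, 500), rng.normal(0.5, 1.0, 500)))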
When exact maximization is not available or too expensive, another commonly used scheme is to divide the samples into "training" sets (with empirical measures $\hat{P}_{\mathrm{train}}$ and $\hat{Q}_{\mathrm{train}}$) and "test" sets ($\hat{P}_{\mathrm{test}}$ and $\hat{Q}_{\mathrm{test}}$), find $\hat{f}$ approximately maximizing $|\hat{P}_{\mathrm{train}} f - \hat{Q}_{\mathrm{train}} f|$, then use $|\hat{P}_{\mathrm{test}} \hat{f} - \hat{Q}_{\mathrm{test}} \hat{f}|$ as an estimate.[22][12][23][24] This estimator can possibly be consistent, but has a negative bias.[22]: thm. 2 In fact, no unbiased estimator can exist for any IPM,[22]: thm. 3 although there is for instance an unbiased estimator of the squared maximum mean discrepancy.[4]
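A sketch of this data-splitting scheme, assuming (purely for illustration) the class of linear critics $f_w(x) = \langle w, x \rangle$ with $\|w\| \le 1$, for which the training-set maximization has a closed form:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(1000, 5))   # samples from P
y = rng.normal(0.3, 1.0, size=(1000, 5))   # samples from Q

# Split each sample set into "training" and "test" halves.
x_train, x_test = x[:500], x[500:]
y_train, y_test = y[:500], y[500:]

# Over linear critics with ||w|| <= 1, the empirical difference
# P_train f - Q_train f is maximized by w pointing along the gap
# between the training sample means.
gap = x_train.mean(axis=0) - y_train.mean(axis=0)
w = gap / np.linalg.norm(gap)

# Evaluate the chosen critic on the held-out halves; this gives the
# (negatively biased) estimate of the IPM discussed above.
estimate = abs(x_test.mean(axis=0) @ w - y_test.mean(axis=0) @ w)
print(estimate)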
References
Müller, Alfred (June 1997). "Integral Probability Metrics and Their Generating Classes of Functions". Advances in Applied Probability. 29 (2): 429–443. doi:10.2307/1428011. JSTOR 1428011. S2CID 124648603.
Zolotarev, V. M. (January 1984). "Probability Metrics". Theory of Probability & Its Applications. 28 (2): 278–302. doi:10.1137/1128025.
Gretton, Arthur; Borgwardt, Karsten M.; Rasch, Malte J.; Schölkopf, Bernhard; Smola, Alexander (2012). "A Kernel Two-Sample Test". Journal of Machine Learning Research. 13: 723–773.
Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv:0901.2698 [cs.IT].
Stanczuk, Jan; Etmann, Christian; Kreusser, Lisa Maria; Schönlieb, Carola-Bibiane (2021). "Wasserstein GANs Work Because They Fail (To Approximate the Wasserstein Distance)". arXiv:2103.01678 [stat.ML].
Mallasto, Anton; Montúfar, Guido; Gerolin, Augusto (2019). "How Well do WGANs Estimate the Wasserstein Metric?". arXiv:1910.03875 [cs.LG].