
ID: 548498.0, MPI für biologische Kybernetik / Biologische Kybernetik
A Fast, Consistent Kernel Two-Sample Test
Authors:Gretton, A.; Fukumizu, K.; Harchaoui, Z.; Sriperumbudur, B.K.
Editors:Bengio, Y.; Schuurmans, D.; Lafferty, J.; Williams, C.; Culotta, A.
Date of Publication (YYYY-MM-DD):2010-04
Title of Proceedings:Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009
Start Page:673
End Page:681
Physical Description:9
Audience:Not Specified
Intended Educational Use:No
Abstract / Description:A kernel embedding of probability distributions into reproducing kernel Hilbert
spaces (RKHS) has recently been proposed, which allows the comparison of two
probability measures P and Q based on the distance between their respective embeddings:
for a sufficiently rich RKHS, this distance is zero if and only if P and
Q coincide. In using this distance as a statistic for a test of whether two samples
are from different distributions, a major difficulty arises in computing the significance
threshold, since the empirical statistic has as its null distribution (where
P = Q) an infinite weighted sum of χ² random variables. Prior finite sample
approximations to the null distribution include using bootstrap resampling, which
yields a consistent estimate but is computationally costly; and fitting a parametric
model with the low order moments of the test statistic, which can work well in
practice but has no consistency or accuracy guarantees. The main result of the
present work is a novel estimate of the null distribution, computed from the eigenspectrum
of the Gram matrix on the aggregate sample from P and Q, and having
lower computational cost than the bootstrap. A proof of consistency of this estimate
is provided. The performance of the null distribution estimate is compared
with the bootstrap and parametric approaches on an artificial example, high dimensional
multivariate data, and text.
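
The spectral estimate described in the abstract can be illustrated with a minimal sketch. The record itself gives no formulas, so the details below are assumptions: equal sample sizes m, a Gaussian kernel with a hand-picked bandwidth, the biased statistic m·MMD², eigenvalues of the centred aggregate Gram matrix scaled by 1/(2m), and N(0, 2) weights in the null sum. The function names (`gaussian_gram`, `mmd_spectral_test`) are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def mmd_spectral_test(x, y, bandwidth=1.0, num_null_samples=1000, alpha=0.05, seed=0):
    """Two-sample test using a Gram-spectrum estimate of the null distribution.

    Assumes x and y have the same number of rows m (an assumption of this sketch).
    Returns (statistic, threshold, p_value).
    """
    m = x.shape[0]
    if y.shape[0] != m:
        raise ValueError("this sketch assumes equal sample sizes")

    # Gram matrix on the aggregate sample from P and Q.
    z = np.vstack([x, y])
    k = gaussian_gram(z, z, bandwidth)

    # Biased empirical MMD^2, scaled by m.
    kxx, kyy, kxy = k[:m, :m], k[m:, m:], k[:m, m:]
    stat = m * (kxx.mean() + kyy.mean() - 2.0 * kxy.mean())

    # Eigenvalues of the centred Gram matrix, scaled by 1/(2m) (assumed normalisation).
    h = np.eye(2 * m) - 1.0 / (2 * m)
    eigs = np.linalg.eigvalsh(h @ k @ h)
    lam = np.clip(eigs, 0.0, None) / (2 * m)

    # Null draws: sum_l lam_l * z_l^2 with z_l ~ N(0, 2), i.e. 2 * lam_l * chi^2_1.
    rng = np.random.default_rng(seed)
    null = 2.0 * (rng.standard_normal((num_null_samples, lam.size)) ** 2) @ lam

    threshold = np.quantile(null, 1.0 - alpha)
    p_value = float(np.mean(null >= stat))
    return float(stat), float(threshold), p_value
```

Calling `mmd_spectral_test(x, y)` on two arrays of shape `(m, d)` returns the scaled statistic, the estimated level-`alpha` threshold, and an estimated p-value; the null hypothesis P = Q would be rejected when the statistic exceeds the threshold. This sketch computes the full eigendecomposition for simplicity.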
External Publication Status:published
Document Type:Conference-Paper
Communicated by:Holger Fischer
Affiliations:MPI für biologische Kybernetik/Empirical Inference (Dept. Schölkopf)