Introduction
Motivation

In most unsupervised learning tasks, a notion of distance or similarity between data points is crucial, yet it is usually not directly available as an input. For example, the performance of tasks such as clustering and dimension reduction depends on the distance chosen. Furthermore, in many applications the data lie on an unknown lower-dimensional manifold.

The Fermat distance takes into account both the underlying structure (manifold) of the data and the density from which the points are sampled.


Proposal: Fermat distance
Fermat distance follows the manifold and takes into account the density.

Consider:

  • $\mathcal{M} \subseteq \mathbb{R}^D$ a $d$-dimensional manifold, with $d \leq D$.
  • $f: \mathcal M \to \mathbb R_+$ a density function.
  • $\mathbb{X}_n \subseteq \mathcal{M}$ a sample of $n$ independent points with common density $f$.
  • $\ell (\cdot,\cdot)$ a distance on $\mathcal M$ (for instance, the Euclidean distance).

Definition (Fermat distance)

Given two points $x, y \in \mathcal M$ and a parameter $\beta > 0$, we define the Fermat distance as

$$ \mathscr D (x, y) = \inf_{\Gamma} \int_\Gamma \frac{1}{f^\beta} d \ell , $$

where the infimum is taken over all possible paths $\Gamma$ contained in $\mathcal M$ that connect $x$ and $y$.
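
As a quick sanity check (a worked special case, not stated explicitly above): when the density is constant, say $f \equiv c$ on $\mathcal M$, the Fermat distance reduces to a rescaled intrinsic (geodesic) distance, writing $d_{\mathcal M}$ for the path-length distance induced by $\ell$ on $\mathcal M$:

$$ \mathscr D (x, y) = \inf_{\Gamma} \int_\Gamma \frac{1}{c^\beta} \, d\ell = \frac{1}{c^\beta} \, \inf_{\Gamma} \int_\Gamma d\ell = \frac{1}{c^\beta} \, d_{\mathcal M}(x, y). $$

When $f$ is not constant, paths through high-density regions are cheaper, so the Fermat distance bends geodesics toward where the data concentrate.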

Definition (Fermat distance estimator)

For $\alpha > 1$ and given two points $x, y \in \mathbb{X}_n$, we define the Fermat distance estimator as:

$$ \mathscr{D}_{\mathbb{X}_n} (x, y) = \min_{\substack{(q_1, \ldots, q_K) \in \mathbb{X}_n^K \\ q_1 = x, \; q_K = y, \; K \geq 2}} \, \sum_{i=1}^{K-1} \, \ell ( q_{i+1}, q_i)^\alpha. $$
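
In words, the estimator is an all-pairs shortest-path distance on the complete graph over the sample, with edge costs $\ell(q_i, q_{i+1})^\alpha$. The following minimal sketch (our own; the function name `fermat_distance_estimator` and the choice of Floyd-Warshall are illustrative, not taken from the papers below) computes it with NumPy:

```python
import numpy as np

def fermat_distance_estimator(X, alpha=2.0):
    """All-pairs sample Fermat distances for a sample X of shape (n, D).

    Builds the complete graph on the sample with edge costs
    ||x_i - x_j||^alpha and returns its shortest-path matrix, which is
    exactly the minimum over chains (q_1, ..., q_K) in the definition.
    """
    # Pairwise Euclidean distances, raised to the power alpha.
    diffs = X[:, None, :] - X[None, :, :]
    costs = np.linalg.norm(diffs, axis=-1) ** alpha

    # Floyd-Warshall: iteratively allow paths to pass through point k.
    D = costs.copy()
    for k in range(len(X)):
        D = np.minimum(D, D[:, k : k + 1] + D[k : k + 1, :])
    return D
```

Because $\alpha > 1$, long edges are penalized, so optimal chains take many short hops through dense regions of the sample. For larger $n$, a common speed-up is to restrict the graph to $k$-nearest neighbors and run Dijkstra instead of Floyd-Warshall.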

Remarks

  • The Fermat distance $\mathscr D (\cdot, \cdot)$ takes into account both the manifold $\mathcal M$ and the density $f$.
  • The Fermat distance $\mathscr D (\cdot, \cdot)$ can be estimated from a finite sample through $\mathscr D_{\mathbb X_n}(\cdot , \cdot)$, since (up to a constant factor depending only on $\alpha$ and $d$) $$ \lim_{n \to \infty} n^\beta \mathscr D_{\mathbb X_n}(x, y) = \mathscr D (x, y) $$ with $\beta = (\alpha - 1)/ d$; see the illustration below.
  • $\mathscr{D}_{\mathbb{X}_n} (x, y)$ can be computed efficiently, since it is a shortest-path computation on a graph over the sample (as in the sketch above).
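
As a small usage illustration (our own example, reusing the `fermat_distance_estimator` sketch above): for points sampled uniformly on the unit square we have $d = 2$ and constant $f$, so the rescaled estimator $n^\beta \mathscr D_{\mathbb X_n}(x, y)$ should track the Euclidean distance between $x$ and $y$ up to a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, alpha = 500, 2, 2.0
beta = (alpha - 1) / d

# Uniform sample on the unit square: the density is constant, so the
# Fermat distance is proportional to the Euclidean distance.
X = rng.uniform(size=(n, d))

D_hat = fermat_distance_estimator(X, alpha=alpha)

# Rescaled estimator vs. Euclidean distance for the first two sample
# points (they should agree up to a constant factor).
print(n ** beta * D_hat[0, 1], np.linalg.norm(X[0] - X[1]))
```
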
Papers

  • [2018] Weighted Geodesic Distance Following Fermat's Principle: https://openreview.net/pdf?id=BJfaMIJwG
  • [2018] Nonhomogeneous Euclidean first-passage percolation and distance learning: https://arxiv.org/abs/1810.09398