How to calculate pairwise distance in a slow and fast way

Yun Yan

2018/01/17

Question

How to calculate the pairwise euclidean distance in R.

dist(mat)

But do you know how slow it is to calculate the pairwise euclidean distance in R? And how fast could it be in Python’s scikit-learn? Short answer: 300 faster than basic R.

Data

n_row <- 500
n_col <- 3000
mat <- matrix(rpois(n_row * n_col, lambda = 100), nrow = n_row)
tmat <- t(mat)

Methods

Matrix

pdist <- function(mat){
  # @param mat A non-negative matrix with samples by features
  # @reference http://r.789695.n4.nabble.com/dist-function-in-R-is-very-slow-td4738317.html
  mtm <- Matrix::tcrossprod(mat)
  sq <- rowSums(mat^2)
  out0 <- outer(sq, sq, "+") - 2 * mtm
  out0[out0 < 0] <- 0

  sqrt(out0)
}

Rcpp with/without OpenMP supported

sourceCpp('./distance.cpp')

Source file: https://gist.github.com/Puriney/31ea85c6f08efeb197336e915d8fd072

Result

Whether correct?

mat0 <- mat[1:5, 1:10]
tmat0 <- t(mat0)
all.equal(c(as.matrix(dist(mat0))), c(pdist(mat0)))
## [1] TRUE
all.equal(c(as.matrix(dist(mat0))), c(colEuclid(tmat0, tmat0)))
## [1] TRUE
all.equal(c(as.matrix(dist(mat0))), c(dist_euclidean(mat0, mat0)))
## [1] TRUE

Performance

tm <- microbenchmark(
  pdist = pdist(mat),
  Rcpp = dist_euclidean(mat, mat),
  openmp = colEuclid(tmat, tmat, 4),
  times=5
)
ggplotly(viz_microbenchmark(tm))