Question
How to calculate the pairwise euclidean distance in R.
dist(mat)
But do you know how slow it is to calculate the pairwise euclidean distance in R? And how fast could it be in Python’s scikit-learn? Short answer: 300 faster than basic R.
- Basic R:
dist()
. Rcpp
Matrix
: matrix multiplicationvelocity.R
’scolEuclid
function
Data
n_row <- 500
n_col <- 3000
mat <- matrix(rpois(n_row * n_col, lambda = 100), nrow = n_row)
tmat <- t(mat)
Methods
Matrix
pdist <- function(mat){
# @param mat A non-negative matrix with samples by features
# @reference http://r.789695.n4.nabble.com/dist-function-in-R-is-very-slow-td4738317.html
mtm <- Matrix::tcrossprod(mat)
sq <- rowSums(mat^2)
out0 <- outer(sq, sq, "+") - 2 * mtm
out0[out0 < 0] <- 0
sqrt(out0)
}
Rcpp
with/without OpenMP supported
sourceCpp('./distance.cpp')
Source file: https://gist.github.com/Puriney/31ea85c6f08efeb197336e915d8fd072
Result
Whether correct?
mat0 <- mat[1:5, 1:10]
tmat0 <- t(mat0)
all.equal(c(as.matrix(dist(mat0))), c(pdist(mat0)))
## [1] TRUE
all.equal(c(as.matrix(dist(mat0))), c(colEuclid(tmat0, tmat0)))
## [1] TRUE
all.equal(c(as.matrix(dist(mat0))), c(dist_euclidean(mat0, mat0)))
## [1] TRUE
Performance
tm <- microbenchmark(
pdist = pdist(mat),
Rcpp = dist_euclidean(mat, mat),
openmp = colEuclid(tmat, tmat, 4),
times=5
)
ggplotly(viz_microbenchmark(tm))