What do you mean "similarity" (1)

Yun Yan

2018/01/14

Question

Mathematically speaking, \(||OA|| = ||OB||\) holds true in Fig-1, so there is a common inference that A and B are same. But the distance itself actually means both A and B have same amount of numeric changes relative to O, without indicating the direction of the change. This is in particular a problem what if \(x_1\) and \(x_2\) are two genes driving different biological processes, or two factors suggesting distinct customer preferences.

There is a gap between the numeric difference (or similarity) and identity difference (or similarity).

Are A and B same?

Figure 1: Are A and B same?

Further Motivation

Goal

Arbitrarily make up cell-type “O”, “A”, “B” such that “A” and “B” generally have same shifts from “O”. And to see how distance-based method decides whether “A” and “B” are the same types.

Create “O”

Suppose there are \(G=450\) genes in total with three categories of the low (1-10), medium (100-200) and high (800-1000) expressed genes. Each cagegory has 150 genes. The general expression profile is nothing but a vector \(\mathbb E \left(e_1, e_2, e_3, \dots, e_G \right)\).

num_genes <- 450
expect_apple <- round(c(
  # low
  runif(n = num_genes/3, min = 1, max=10),
  # medium
  runif(n = num_genes/3, min = 100, max=500),
  # high
  runif(n = num_genes/3, min = 800, max=1000)
  ))
idx_low <- 1:num_genes/3
idx_med <- (num_genes/3 + 1) : (num_genes/3*2)
idx_hig <- (num_genes/3*2 + 1) : num_genes

Observed cells (\(N=300\)) are sampled from the profile. Each gene follows its Poisson distribution where \(\lambda_G = e_G\). This is consistent to the UMI-filtered single-cell RNA-seq data when the pure RNA spike-ins are the loading input.

num_cells <- 500
# a matrix with cells x genes
mat_apple <- round(sapply(expect_apple,
                          function(e) rpois(n = num_cells, lambda = e)))
Observed expression matrix with cells on rows and genes on columns

(#fig:viz_expr_mat)Observed expression matrix with cells on rows and genes on columns

Create “A” and “B”

In order to create new cell type “A” and “B”, each of them has ~40 genes being upregulated. There situations were investigated:

num_up_genes <- round((num_genes/3) / 4)
  1. low-expressed genes: \(\lambda\) (\([1, 10]\)) adding \([4, 10]\)
  2. medium-expressed genes: \(\lambda\) (\([100,500]\)) adding \([50, 250]\)
  3. high-expressed genes: \(\lambda\) (\([800,1000]\)) adding \([400, 500]\)

All the add-ups roughly take half of the expression.

Low

idx_up_1 <- head(idx_low, num_up_genes)
idx_up_2 <- tail(idx_low, num_up_genes)
expect_banana <- expect_apple
expect_banana[idx_up_1] <- expect_banana[idx_up_1] + runif(n=num_up_genes,
                                                           min=4, max=10)
mat_banana <- round(sapply(expect_banana,
                           function(e) rpois(n = num_cells, lambda = e)))

expect_cherry <- expect_apple
expect_cherry[idx_up_2] <- expect_cherry[idx_up_2] + runif(n=num_up_genes,
                                                           min=4, max=10)
mat_cherry <- round(sapply(expect_cherry,
                             function(e) rpois(n = num_cells, lambda = e)))

Bonus: Create “AB”

expect_guava <- expect_apple
idx_down <- idx_up_2
num_down_genes <- num_up_genes
# upregulated genes
expect_guava[idx_up_1] <- expect_guava[idx_up_1] +
  runif(n=num_up_genes, min=4* sqrt(2)/2, max=10*sqrt(2)/2)
# downregulated genes
num_down_genes <- num_up_genes
expect_guava[idx_down] <- expect_guava[idx_down] -
  runif(n=num_down_genes, min=4*sqrt(2)/2, max=10*sqrt(2)/2)
expect_guava[expect_guava < 0] <- 0
mat_guava <- round(sapply(expect_guava,
                            function(e) rpois(n = num_cells, lambda = e)))

Visualize the expression profile and the sampled cells

The absolute values of upregulation for low-expressed genes is not obvious

(#fig:image_matrix_ChangeLow)The absolute values of upregulation for low-expressed genes is not obvious

From results of clustering and projection down to 2D PCA space, the different populatons are not considered as different.

Populations are not separated at all

(#fig:pheatmap_allcells_ChangeLow)Populations are not separated at all

Med

Similar procedure is performed again to upregulate med-expressed genes and do sampling cells from expression profile.

Visualize expression matrix of the sample cells

The absolute value of upregulation of med-expressed genes could be observed.

(#fig:image_matrix_ChangeMed)The absolute value of upregulation of med-expressed genes could be observed.

Populations are detected.

High

Not surprisingly, populations are found.

Summary

I did not make a good example to hack distance, but re-learned the following messages:

Possible to follow-up:

  1. Gene-wise normalize
  2. Focus on low-expressed genes