Mathematically speaking, \(||OA|| = ||OB||\) holds true in Fig-1, so there is a common inference that A and B are same. But the distance itself actually means both A and B have same amount of numeric changes relative to O, without indicating the direction of the change. This is in particular a problem what if \(x_1\) and \(x_2\) are two genes driving different biological processes, or two factors suggesting distinct customer preferences.
There is a gap between the numeric difference (or similarity) and identity difference (or similarity).
Arbitrarily make up cell-type “O”, “A”, “B” such that “A” and “B” generally have same shifts from “O”. And to see how distance-based method decides whether “A” and “B” are the same types.
Suppose there are \(G=450\) genes in total with three categories of the low (1-10), medium (100-200) and high (800-1000) expressed genes. Each cagegory has 150 genes. The general expression profile is nothing but a vector \(\mathbb E \left(e_1, e_2, e_3, \dots, e_G \right)\).
num_genes <- 450 expect_apple <- round(c( # low runif(n = num_genes/3, min = 1, max=10), # medium runif(n = num_genes/3, min = 100, max=500), # high runif(n = num_genes/3, min = 800, max=1000) )) idx_low <- 1:num_genes/3 idx_med <- (num_genes/3 + 1) : (num_genes/3*2) idx_hig <- (num_genes/3*2 + 1) : num_genes
Observed cells (\(N=300\)) are sampled from the profile. Each gene follows its Poisson distribution where \(\lambda_G = e_G\). This is consistent to the UMI-filtered single-cell RNA-seq data when the pure RNA spike-ins are the loading input.
num_cells <- 500 # a matrix with cells x genes mat_apple <- round(sapply(expect_apple, function(e) rpois(n = num_cells, lambda = e)))
Create “A” and “B”
In order to create new cell type “A” and “B”, each of them has ~40 genes being upregulated. There situations were investigated:
num_up_genes <- round((num_genes/3) / 4)
- low-expressed genes: \(\lambda\) (\([1, 10]\)) adding \([4, 10]\)
- medium-expressed genes: \(\lambda\) (\([100,500]\)) adding \([50, 250]\)
- high-expressed genes: \(\lambda\) (\([800,1000]\)) adding \([400, 500]\)
All the add-ups roughly take half of the expression.
idx_up_1 <- head(idx_low, num_up_genes) idx_up_2 <- tail(idx_low, num_up_genes)
expect_banana <- expect_apple expect_banana[idx_up_1] <- expect_banana[idx_up_1] + runif(n=num_up_genes, min=4, max=10) mat_banana <- round(sapply(expect_banana, function(e) rpois(n = num_cells, lambda = e))) expect_cherry <- expect_apple expect_cherry[idx_up_2] <- expect_cherry[idx_up_2] + runif(n=num_up_genes, min=4, max=10) mat_cherry <- round(sapply(expect_cherry, function(e) rpois(n = num_cells, lambda = e)))
Bonus: Create “AB”
expect_guava <- expect_apple idx_down <- idx_up_2 num_down_genes <- num_up_genes # upregulated genes expect_guava[idx_up_1] <- expect_guava[idx_up_1] + runif(n=num_up_genes, min=4* sqrt(2)/2, max=10*sqrt(2)/2) # downregulated genes num_down_genes <- num_up_genes expect_guava[idx_down] <- expect_guava[idx_down] - runif(n=num_down_genes, min=4*sqrt(2)/2, max=10*sqrt(2)/2) expect_guava[expect_guava < 0] <- 0 mat_guava <- round(sapply(expect_guava, function(e) rpois(n = num_cells, lambda = e)))
Visualize the expression profile and the sampled cells
From results of clustering and projection down to 2D PCA space, the different populatons are not considered as different.
Similar procedure is performed again to upregulate med-expressed genes and do sampling cells from expression profile.
Visualize expression matrix of the sample cells
Populations are detected.
Not surprisingly, populations are found.
I did not make a good example to hack distance, but re-learned the following messages:
- Highly expressed genes (with high variance) have more weights when counting the sample distance, which implied the need of variance stablization.
Possible to follow-up:
- Gene-wise normalize
- Focus on low-expressed genes