# Question

Mathematically speaking, \(||OA|| = ||OB||\) holds true in Fig-1, so there is a common inference that A and B are same. But the distance itself actually means both A and B have same amount of *numeric* changes relative to O, without indicating the *direction* of the change. This is in particular a problem what if \(x_1\) and \(x_2\) are two genes driving different biological processes, or two factors suggesting distinct customer preferences.

There is a gap between the numeric difference (or similarity) and identity difference (or similarity).

# Further Motivation

# Goal

Arbitrarily make up cell-type “O”, “A”, “B” such that “A” and “B” generally have same shifts from “O”. And to see how distance-based method decides whether “A” and “B” are the same types.

# Create “O”

Suppose there are \(G=450\) genes in total with three categories of the low (1-10), medium (100-200) and high (800-1000) expressed genes. Each cagegory has 150 genes. The general expression profile is nothing but a vector \(\mathbb E \left(e_1, e_2, e_3, \dots, e_G \right)\).

```
num_genes <- 450
expect_apple <- round(c(
# low
runif(n = num_genes/3, min = 1, max=10),
# medium
runif(n = num_genes/3, min = 100, max=500),
# high
runif(n = num_genes/3, min = 800, max=1000)
))
idx_low <- 1:num_genes/3
idx_med <- (num_genes/3 + 1) : (num_genes/3*2)
idx_hig <- (num_genes/3*2 + 1) : num_genes
```

Observed cells (\(N=300\)) are sampled from the profile. Each gene follows its Poisson distribution where \(\lambda_G = e_G\). This is consistent to the UMI-filtered single-cell RNA-seq data when the pure RNA spike-ins are the loading input.

```
num_cells <- 500
# a matrix with cells x genes
mat_apple <- round(sapply(expect_apple,
function(e) rpois(n = num_cells, lambda = e)))
```

# Create “A” and “B”

In order to create new cell type “A” and “B”, each of them has ~40 genes being upregulated. There situations were investigated:

`num_up_genes <- round((num_genes/3) / 4)`

- low-expressed genes: \(\lambda\) (\([1, 10]\)) adding \([4, 10]\)
- medium-expressed genes: \(\lambda\) (\([100,500]\)) adding \([50, 250]\)
- high-expressed genes: \(\lambda\) (\([800,1000]\)) adding \([400, 500]\)

All the add-ups roughly take half of the expression.

## Low

```
idx_up_1 <- head(idx_low, num_up_genes)
idx_up_2 <- tail(idx_low, num_up_genes)
```

```
expect_banana <- expect_apple
expect_banana[idx_up_1] <- expect_banana[idx_up_1] + runif(n=num_up_genes,
min=4, max=10)
mat_banana <- round(sapply(expect_banana,
function(e) rpois(n = num_cells, lambda = e)))
expect_cherry <- expect_apple
expect_cherry[idx_up_2] <- expect_cherry[idx_up_2] + runif(n=num_up_genes,
min=4, max=10)
mat_cherry <- round(sapply(expect_cherry,
function(e) rpois(n = num_cells, lambda = e)))
```

Bonus: Create “AB”

```
expect_guava <- expect_apple
idx_down <- idx_up_2
num_down_genes <- num_up_genes
# upregulated genes
expect_guava[idx_up_1] <- expect_guava[idx_up_1] +
runif(n=num_up_genes, min=4* sqrt(2)/2, max=10*sqrt(2)/2)
# downregulated genes
num_down_genes <- num_up_genes
expect_guava[idx_down] <- expect_guava[idx_down] -
runif(n=num_down_genes, min=4*sqrt(2)/2, max=10*sqrt(2)/2)
expect_guava[expect_guava < 0] <- 0
mat_guava <- round(sapply(expect_guava,
function(e) rpois(n = num_cells, lambda = e)))
```

### Visualize the expression profile and the sampled cells

From results of clustering and projection down to 2D PCA space, the different populatons are *not* considered as different.

## Med

Similar procedure is performed again to upregulate med-expressed genes and do sampling cells from expression profile.

### Visualize expression matrix of the sample cells

Populations are detected.

## High

Not surprisingly, populations are found.

# Summary

I did not make a good example to hack distance, but re-learned the following messages:

- Highly expressed genes (with high variance) have more weights when counting the sample distance, which implied the need of variance stablization.

Possible to follow-up:

- Gene-wise normalize
- Focus on low-expressed genes