When trying to run the cor() function on a sparse matrix (of either type dgCMatrix or dgTMatrix), I get the following error:

Error in cor(x) : supply both  x  and  y  or a matrix-like  x 

Converting my matrix to a dense one would be very inefficient. Is there an easy way to calculate this correlation (without an all-pairs loop)?


  • Ron

This is what I ended up using. Thanks!

Assume the matrix x has dimensions n * p, with n=200k and p=10k.
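
To see why densifying is a problem at this size, a rough back-of-the-envelope calculation helps. The sketch below assumes 8-byte doubles and ignores R's per-object overhead; the variable names are only for illustration.

```r
# Rough memory estimate, assuming 8 bytes per double and no object overhead.
n <- 2e5   # rows (200k)
p <- 1e4   # columns (10k)

dense_x_gb  <- n * p * 8 / 1e9   # a dense copy of x, in GB
dense_pp_gb <- p * p * 8 / 1e9   # a dense p x p result matrix, in GB

dense_x_gb    # 16
dense_pp_gb   # 0.8
```

So the dense n * p matrix is out of reach on many machines, while each dense p * p intermediate is large but manageable.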


Version 1 is more straightforward, but less efficient in time and memory, as the outer product operation is expensive:

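sparse.cor2 <- function(x){
    n <- nrow(x)

    # Sample covariance: (X'X - 2 * m s' + n * m m') / (n-1),
    # with m = column means and s = column sums
    covmat <- (crossprod(x) - 2*(colMeans(x) %o% colSums(x))
               + n*colMeans(x) %o% colMeans(x))/(n-1)

    sdvec <- sqrt(diag(covmat)) # standard deviations of columns
    covmat/crossprod(t(sdvec)) # correlation matrix
}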

Version 2 is more efficient in both time (fewer operations) and memory. It still requires a lot of memory for a matrix with p=10k:

sparse.cor3 <- function(x){
    n <- nrow(x)

    cMeans <- colMeans(x)
    cSums <- colSums(x)

    # Calculate the (unnormalized) covariance matrix.
    # There's no need to divide by (n-1), as the std. dev. is calculated the same way
    # and the factor cancels in the correlation.
    # The code is optimized to minimize use of memory and expensive operations.
    covmat <- tcrossprod(cMeans, (-2*cSums+n*cMeans))
    crossp <- as.matrix(crossprod(x))
    covmat <- covmat+crossp

    sdvec <- sqrt(diag(covmat)) # standard deviations of columns (same scaling as covmat)
    covmat/crossprod(t(sdvec)) # correlation matrix
}
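
The remark about skipping the division by (n-1) can be checked on a small dense example. This is only a sketch on made-up data; to_cor is a hypothetical helper that repeats the covmat / outer-product-of-sdvec step from the functions above.

```r
# Toy data; all names here are illustrative only.
set.seed(1)
X <- matrix(rpois(200, 0.3), ncol = 4)
n <- nrow(X)

S_sample <- cov(X)               # divided by n - 1
S_raw    <- S_sample * (n - 1)   # centered cross-product, no division

# Same step as in the functions above: covariance / outer product of sd's.
to_cor <- function(S) {
  sdvec <- sqrt(diag(S))
  S / tcrossprod(sdvec)
}

# The (n - 1) factor appears in numerator and denominator, so it cancels.
all.equal(to_cor(S_sample), to_cor(S_raw))
all.equal(to_cor(S_raw), cor(X))
```

Both comparisons should agree up to floating-point tolerance, since any positive scaling of the covariance matrix leaves the correlation unchanged.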


> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> object.size(x)
11999472 bytes
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
   0.50    0.06    0.56 
> system.time(corx2 <- sparse.cor2(x))
   user  system elapsed 
   0.17    0.00    0.17 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
   0.13    0.00    0.12 
> system.time(correg <-cor(as.matrix(x)))
   user  system elapsed 
   0.25    0.03    0.29 
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx2)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE


> X <- sample(0:10,1e8,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> object.size(x)
120005688 bytes
> system.time(corx2 <- sparse.cor2(x))
   user  system elapsed 
   1.47    0.07    1.53 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
   1.18    0.09    1.29 
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
   5.43    1.26    6.71

EDITED ANSWER - optimized for memory use and speed.

Your error is logical: the cor function does not recognize the sparse matrix as a matrix, and the Matrix package has no method for correlation.


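sparse.cor <- function(x){
  n <- nrow(x)
  m <- ncol(x)
  ii <- unique(x@i)+1 # rows with a non-zero element

  Ex <- colMeans(x)
  nozero <- as.vector(x[ii,]) - rep(Ex,each=length(ii))        # non-zero rows, centered on the column means

  covmat <- ( crossprod(matrix(nozero,ncol=m)) +
              crossprod(t(Ex))*(n-length(ii))
            )/(n-1)
  sdvec <- sqrt(diag(covmat))
  covmat/crossprod(t(sdvec))
}

covmat is your variance-covariance matrix, so that one can be calculated as well. The calculation works by selecting the rows in which at least one element is non-zero. To the cross product of those (centered) rows you add the outer product of the column means, multiplied by the number of all-zero rows. This is equivalent to

(X - E[X])' (X - E[X])

where ' denotes the transpose. Divide by n-1 and you have your variance-covariance matrix. The rest is easy.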
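
The split into non-zero rows and all-zero rows can be verified on a small example. The following is a sketch on made-up data, assuming the Matrix package is installed; part_nonzero and part_zero are illustrative names, not part of the answer's code.

```r
library(Matrix)

set.seed(42)
# Toy sparse matrix: about 90% zeros, so many rows are entirely zero.
X <- matrix(as.numeric(sample(0:3, 400, replace = TRUE,
                              prob = c(0.9, 0.04, 0.03, 0.03))), ncol = 2)
x <- Matrix(X, sparse = TRUE)

n  <- nrow(x)
ii <- unique(x@i) + 1          # rows with at least one non-zero entry
Ex <- colMeans(x)

# Centered cross-product computed from the non-zero rows only ...
nz_centered  <- sweep(as.matrix(x[ii, , drop = FALSE]), 2, Ex)
part_nonzero <- crossprod(nz_centered)
# ... plus one outer product of the column means per all-zero row,
# because an all-zero row centered on Ex is just -Ex.
part_zero <- (n - length(ii)) * tcrossprod(Ex)

covmat <- (part_nonzero + part_zero) / (n - 1)
all.equal(covmat, cov(X), check.attributes = FALSE)
```

Only the length(ii) non-zero rows are ever densified here, which is where the memory saving comes from when most rows are empty.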

A test case:

X <- sample(0:10,1e8,replace=T,p=c(0.99,rep(0.001,10)))
xx <- Matrix(X,ncol=5)

> system.time(out1 <- sparse.cor(xx))
   user  system elapsed 
   0.50    0.09    0.59 
> system.time(out2 <- cor(as.matrix(xx)))
   user  system elapsed 
   1.75    0.28    2.05 
> all.equal(out1,out2)
[1] TRUE


sparse.cor4 <- function(x){
    n <- nrow(x)
    cMeans <- colMeans(x)
    covmat <- (as.matrix(crossprod(x)) - n*tcrossprod(cMeans))/(n-1)
    sdvec <- sqrt(diag(covmat)) 
    cormat <- covmat/tcrossprod(sdvec)
    list(cov=covmat,cor=cormat) # return both the covariance and the correlation matrix
}

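The simplification follows from this. Let X be an n x p matrix and M the n x p matrix in which every row holds the column means of X:

cov(X) = E[(X-M)'(X-M)] = E[X'X - M'X - X'M + M'M]

M'X = X'M = M'M, which have (i,j) elements = sum(column i) * sum(column j) / n

= n * mean(column i) * mean(column j)

or written with a row vector m of the column means,

= n * m'm

Then cov(X) = E[X'X - n m'm]

Here ' denotes the transpose, and E[.] stands for dividing by n-1, as in the code above.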
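
The identity can be checked numerically on a small dense matrix. This is a sketch with made-up data; lhs and rhs are illustrative names for the two sides of the derivation.

```r
set.seed(7)
X <- matrix(rpois(300, 0.5), ncol = 3)   # toy n x p data
n <- nrow(X)
m <- colMeans(X)                          # column means

# M repeats the column means in every row.
M <- matrix(m, nrow = n, ncol = ncol(X), byrow = TRUE)

lhs <- crossprod(X - M)                   # (X-M)'(X-M)
rhs <- crossprod(X) - n * tcrossprod(m)   # X'X - n m'm

all.equal(lhs, rhs)
all.equal(rhs / (n - 1), cov(X))
```

The right-hand side never forms X - M, so with a sparse X the only dense objects are p x p.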


> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
  1.139   0.196   1.334 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
  0.194   0.007   0.201 
> system.time(corx4 <- sparse.cor4(x))
   user  system elapsed 
  0.187   0.007   0.194 
> system.time(correg <-cor(as.matrix(x)))
   user  system elapsed 
  0.341   0.067   0.407 
> system.time(covreg <- cov(as.matrix(x)))
   user  system elapsed 
  0.314   0.016   0.330 
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx4$cor)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx4$cov)),c(as.matrix(covreg)))
[1] TRUE

Using WGCNA::cor(sparseMat) worked for me.

