R - 'ddply' causes a fatal error in RStudio when running correlation on a large data set: ways to optimize?
I need to calculate correlations on a large dataset (> 1 million lines) split by several columns. I tried combining the ddply and cor() functions:
func <- function(xx) {
  return(data.frame(corb = cor(xx$ysales, xx$bas.sales),
                    cora = cor(xx$ysales, xx$tysales)))
}
output <- ddply(input, .(ibd, cell, cat), func)
This code works pretty well on relatively small data sets (data frames of 1,000 or 10,000 lines), but causes a 'fatal error' when the input file has 100,000 lines or more. It looks like there is not enough memory on my computer to process such a big file with these functions.
Are there any opportunities to optimize such code somehow? Maybe there are alternatives to ddply that work more effectively, or would it help to use loops, or to split the one function into several consecutive ones?
I do not have any problems with ddply on my machine with 1e7 rows and the data given below. In total, it uses approx. 1.7 GB on my machine. Here is my code:
options(stringsAsFactors=FALSE)

#this makes the code reproducible
set.seed(1234)
n_rows=1e7
input=data.frame(ibd=sample(letters[1:5],n_rows,TRUE),
                 cell=sample(letters[1:5],n_rows,TRUE),
                 cat=sample(letters[1:5],n_rows,TRUE),
                 ysales=rnorm(n_rows),
                 tysales=rnorm(n_rows),
                 bas.sales=rnorm(n_rows))

#your solution
library(plyr)
func <- function(xx) {
  return(data.frame(corb = cor(xx$ysales, xx$bas.sales),
                    cora = cor(xx$ysales, xx$tysales)))
}
output <- ddply(input, .(ibd,cell,cat), func)
However, in case your problem is more complex than my sample data, you could try the data.table package. Here is some code (please note that I am not a heavy user of data.table, so the code below might be inefficient):
library(data.table)
input_dt=data.table(input)
output_dt=unique(input_dt[,`:=`(corb=cor(.SD$ysales,.SD$bas.sales),
                                cora=cor(.SD$ysales,.SD$tysales)),
                          by=c('ibd','cell','cat')]
                 [,c('ibd','cell','cat','corb','cora'),with=FALSE])
output_dt=output_dt[order(output_dt$ibd,output_dt$cell,output_dt$cat)]
It gives the same result:
all.equal(data.table(output),output_dt)
#[1] TRUE

head(output_dt,3)
#   ibd cell cat          corb          cora
#1:   a    a   a -6.656740e-03 -0.0050483282
#2:   a    a   b  4.758460e-03  0.0051115833
#3:   a    a   c  1.751167e-03  0.0036150088
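Since the data.table code above is flagged as possibly inefficient, here is a minimal sketch of a shorter variant. It assumes the same column names as the sample data and computes the correlations directly in `j`, so it returns one row per group instead of adding columns with `:=` and deduplicating afterwards. The name `output_dt2` and the reduced row count (1e5, only to keep the sketch quick) are illustrative choices, not part of the original answer, and I have not benchmarked it against the version above.

```r
library(data.table)

# Small synthetic input with the same column names as the sample data
# (1e5 rows here is an assumption made only to keep the sketch fast).
set.seed(1234)
n_rows <- 1e5
input_dt <- data.table(
  ibd       = sample(letters[1:5], n_rows, TRUE),
  cell      = sample(letters[1:5], n_rows, TRUE),
  cat       = sample(letters[1:5], n_rows, TRUE),
  ysales    = rnorm(n_rows),
  tysales   = rnorm(n_rows),
  bas.sales = rnorm(n_rows)
)

# Aggregate directly in j: one row per (ibd, cell, cat) group.
# keyby (rather than by) also sorts the result by the grouping columns,
# so no separate order() step is needed.
output_dt2 <- input_dt[, .(corb = cor(ysales, bas.sales),
                           cora = cor(ysales, tysales)),
                       keyby = .(ibd, cell, cat)]

head(output_dt2, 3)
```

Because nothing is written back into `input_dt`, this form avoids holding two extra full-length columns in memory, which may matter when the input is already close to the memory limit.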