R - 'ddply' causes a fatal error in RStudio when running correlation on a large data set: ways to optimize?


I need to calculate correlations on a large dataset (> 1 million lines) split by several columns. I tried combining the ddply and cor() functions:

    func <- function(xx) {
      return(data.frame(corb = cor(xx$ysales, xx$bas.sales),
                        cora = cor(xx$ysales, xx$tysales)))
    }

    output <- ddply(input, .(ibd, cell, cat), func)

This code works pretty well on relatively small data sets (data frames of 1,000 or 10,000 lines), but causes a 'fatal error' when the input file has 100,000 lines or more. It looks like there is not enough memory on my computer to process such a big file with these functions.

Are there any opportunities to optimize such code somehow? Maybe there are alternatives to ddply that work more efficiently, or would it help to use loops or to split the one function into several consecutive ones?

I do not have any problems with ddply on my machine with 1e7 rows and the data given below. In total, it uses approx. 1.7 GB on my machine. Here is my code:

    options(stringsAsFactors = FALSE)

    # this makes the code reproducible
    set.seed(1234)
    n_rows <- 1e7
    input <- data.frame(ibd = sample(letters[1:5], n_rows, TRUE),
                        cell = sample(letters[1:5], n_rows, TRUE),
                        cat = sample(letters[1:5], n_rows, TRUE),
                        ysales = rnorm(n_rows),
                        tysales = rnorm(n_rows),
                        bas.sales = rnorm(n_rows))

    # your solution
    library(plyr)

    func <- function(xx) {
      return(data.frame(corb = cor(xx$ysales, xx$bas.sales),
                        cora = cor(xx$ysales, xx$tysales)))
    }

    output <- ddply(input, .(ibd, cell, cat), func)
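If memory is the suspected cause, it may help to check how much the data actually occupies before blaming ddply. The following is only a rough sketch using base R's object.size() and gc(); the name input_small and the reduced row count are assumptions made here so the example runs quickly, and the figures will differ from machine to machine.

```r
# Hypothetical small stand-in for `input` (fewer rows, fewer columns)
set.seed(1234)
n_rows <- 1e5
input_small <- data.frame(
  ibd = sample(letters[1:5], n_rows, TRUE),
  ysales = rnorm(n_rows),
  stringsAsFactors = FALSE
)

# Approximate size of the data frame itself
size_bytes <- object.size(input_small)
print(size_bytes, units = "MB")

# gc() runs a garbage collection and reports the memory R currently uses
mem <- gc()
print(mem)
```

Comparing these numbers before and after the ddply call gives an indication of how much extra memory the split-apply step needs.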

However, in case your problem is more complex than my sample data, you could try the data.table package. Here is my code (please note that I am not a heavy user of data.table, so the code below might be inefficient):

    library(data.table)

    input_dt <- data.table(input)

    output_dt <- unique(input_dt[, `:=`(corb = cor(.SD$ysales, .SD$bas.sales),
                                        cora = cor(.SD$ysales, .SD$tysales)),
                                 by = c('ibd', 'cell', 'cat')]
                        [, c('ibd', 'cell', 'cat', 'corb', 'cora'), with = FALSE])

    output_dt <- output_dt[order(output_dt$ibd, output_dt$cell, output_dt$cat)]
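As a possible simplification, data.table can compute one summary row per group directly in `j`, which avoids adding columns to every row and then de-duplicating. This is a minimal sketch, not a benchmarked replacement: the name input_small and the reduced row count are assumptions for illustration, and `keyby` is used so that the result comes back sorted by the grouping columns.

```r
library(data.table)

# Hypothetical smaller version of the sample data so the example runs quickly
set.seed(1234)
n_rows <- 1e5
input_small <- data.table(
  ibd = sample(letters[1:5], n_rows, TRUE),
  cell = sample(letters[1:5], n_rows, TRUE),
  cat = sample(letters[1:5], n_rows, TRUE),
  ysales = rnorm(n_rows),
  tysales = rnorm(n_rows),
  bas.sales = rnorm(n_rows)
)

# One row per (ibd, cell, cat) group; keyby also sorts by these columns
output_small <- input_small[
  , .(corb = cor(ysales, bas.sales),
      cora = cor(ysales, tysales)),
  keyby = c("ibd", "cell", "cat")
]

head(output_small, 3)
```

Because the columns are referenced directly rather than through .SD, data.table does not have to build a per-group subset of all columns, which may reduce memory use on large inputs.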

It gives the same result:

    all.equal(data.table(output), output_dt)
    #[1] TRUE

    head(output_dt, 3)

    #   ibd cell cat          corb          cora
    #1:   a    a   a -6.656740e-03 -0.0050483282
    #2:   a    a   b  4.758460e-03  0.0051115833
    #3:   a    a   c  1.751167e-03  0.0036150088
