scala - Spark RDD lifecycle: will an RDD be reclaimed when it goes out of scope?
In a method, I create a new RDD and cache it. Will Spark unpersist the RDD automatically once it goes out of scope?
I think it will, but what actually happens?
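For example, roughly the situation I have in mind (the path, class, and method names here are made up for illustration):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    class Scenario {
        // The RDD is created and cached inside the method; the local reference
        // goes out of scope when the method returns.
        static long process(JavaSparkContext sc) {
            JavaRDD<String> rdd = sc.textFile("hdfs:///some/input");  // made-up path
            rdd.cache();                                              // mark for caching
            long total = rdd.count();                                 // first action materializes and caches it
            long nonEmpty = rdd.filter(s -> !s.isEmpty()).count();    // second action is served from the cache
            return total + nonEmpty;
            // `rdd` goes out of scope here -- will Spark unpersist the cached data by itself?
        }
    }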
No, it won't be unpersisted automatically.
Why? Because even though it may look like the RDD is not needed anymore, Spark's model is to not materialize an RDD until it is actually needed by a transformation or action, so it is very hard to tell "I won't need this RDD anymore". Even for you it can be tricky, because of situations like the following:
    JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // create an empty RDD for merging
    for (int i = 0; i < 10; i++) {
        JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
        rdd.cache(); // Since it will be used twice, cache it.
        rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]); // Transform and save; rdd materializes here
        rddUnion = rddUnion.union(rdd.map(...).filter(...)); // Transform to T and merge by union
        rdd.unpersist(); // Now it seems not needed anymore. (But it is actually still needed.)
    }
    // Here rddUnion materializes, and it needs all 10 rdds that were already unpersisted.
    // So, rebuilding all 10 rdds will occur.
    rddUnion.saveAsTextFile(mergedFileName);
Credit for the code sample goes to the spark-user ML.
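The fix is simply to delay unpersist() until after rddUnion has materialized. A minimal runnable sketch of that ordering (not part of the original post; concrete String RDDs and trivial map/filter calls stand in for the elided transforms, and the file-name parameters are placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    class MergeJob {
        static void run(JavaSparkContext sc, String[] inputFileNames,
                        String[] outputFileNames, String mergedFileName) {
            List<JavaRDD<String>> cached = new ArrayList<>();
            JavaRDD<String> rddUnion = sc.parallelize(new ArrayList<String>()); // empty seed for merging

            for (int i = 0; i < inputFileNames.length; i++) {
                JavaRDD<String> rdd = sc.textFile(inputFileNames[i]);
                rdd.cache();                                     // used twice below
                rdd.map(String::trim).filter(s -> !s.isEmpty())
                   .saveAsTextFile(outputFileNames[i]);          // first use: materializes and caches rdd
                rddUnion = rddUnion.union(
                        rdd.map(String::toUpperCase).filter(s -> !s.isEmpty()));
                cached.add(rdd);                                 // remember it; do NOT unpersist yet
            }

            rddUnion.saveAsTextFile(mergedFileName);             // second use: reads from the cached rdds

            for (JavaRDD<String> rdd : cached) {
                rdd.unpersist();                                 // now it is genuinely no longer needed
            }
        }
    }

Keeping the references in a list and unpersisting only after the final action means each input is read and cached once, instead of being rebuilt when rddUnion is finally computed.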